trigraphs in X3J11

Sun May 29 01:09:37 AEST 1988

In article <3655 at pasteur.Berkeley.Edu> faustus at ic.Berkeley.EDU (Wayne A. Christopher) writes:
>Nobody has said what the existing practice is with regard to European
>character sets.
I posted an article the other day, but maybe it didn't get past mcvax.
I shall include it here.

>I think trigraphs are a trick of American terminal manufacturers who
>want to fool Europeans into thinking they can use their terminals for
>writing programs.
Think again: If we use American ASCII-only terminals on an operating system
and compiler designed for ASCII, as most of them are, there's no problem
in writing C code, only in getting our national characters in the output.
I think a similar confusion may be part of the reason why trigraphs are so
badly conceived.

My prior article follows; I apologize if it's been seen before, but I
haven't seen any signs that it has.

As one who regularly uses a non-ASCII terminal setup, I'd better explain
a little. In Danish (my native language) we have three `extra' letters
which we much prefer to use when writing Danish text - it is possible to
get by with two-letter replacements, but it's not very readable. By the
way, these are not `accented letters;' they are separate letters of the
alphabet, with their own place at the end of the sorting sequence. Much
the same applies to German, Swedish, Norwegian, and many other European
languages.
  That's not usually a problem as most modern terminals have provisions
for various national character sets, which are defined in an ISO standard.
This standard allows the glyphs at some eight or ten positions to vary,
including @, $, [, \, ], {, | and }. The latter six are used for the non-
ASCII letters in Danish, as they follow the other letters nicely.
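  To make the effect concrete, here is a minimal sketch in C that shows ASCII
text the way a terminal set to the Danish character set would display it.
It assumes the usual Danish assignment of the six positions (capital AE, O
with stroke and A with ring at [ \ ], the small letters at { | }) and it
writes the letters as UTF-8 bytes, which is purely a convenience of the
sketch. The name as_seen_in_danish is made up for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* ASCII characters whose positions carry letters in the Danish set, and
   the letters shown instead, spelled as UTF-8 byte sequences (assumed
   output encoding). The two arrays are index-aligned. */
static const char ascii_chars[] = "[\\]{|}";
static const char *const danish[] = {
    "\xC3\x86",  /* capital AE */
    "\xC3\x98",  /* capital O with stroke */
    "\xC3\x85",  /* capital A with ring */
    "\xC3\xA6",  /* small ae */
    "\xC3\xB8",  /* small o with stroke */
    "\xC3\xA5",  /* small a with ring */
};

/* Render ASCII text as a Danish-set terminal would show it.
   dst must have room for 2 * strlen(src) + 1 bytes. */
void as_seen_in_danish(const char *src, char *dst)
{
    assert(src != NULL && dst != NULL);
    for (; *src != '\0'; src++) {
        const char *hit = strchr(ascii_chars, *src);
        if (hit != NULL) {
            /* One of the six variable positions: emit the letter. */
            const char *glyph = danish[hit - ascii_chars];
            size_t n = strlen(glyph);
            memcpy(dst, glyph, n);
            dst += n;
        } else {
            *dst++ = *src;
        }
    }
    *dst = '\0';
}
```

Fed the C expression a[i], this yields the letter a, a capital AE, the
letter i and a capital A with ring: readable with practice, but no longer
visibly a pair of brackets.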

  So, the X3J11 people think, the poor Europeans can't use ASCII: we'll
have to invent some kludge to bring C to their benighted shores. The only
excuse for inventing something so horrible is that it only breaks a very
few programs, and that it won't be used anyway.
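  For anyone who hasn't met them: as far as I know the draft defines nine
trigraphs, each being two question marks followed by a third character,
namely ??= ??( ??/ ??) ??' ??< ??! ??> ??- standing for # [ \ ] ^ { | } ~
respectively. The following is only a sketch of the replacement step in C,
not the way any particular compiler does it; the names decode_trigraphs,
tri_third and tri_means are made up for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Third character of each trigraph, and the character it stands for.
   The two strings are index-aligned. */
static const char tri_third[] = "=(/)'<!>-";
static const char tri_means[] = "#[\\]^{|}~";

/* Copy src to dst, replacing every trigraph (two question marks plus one
   of the characters in tri_third) by the character it stands for.
   dst must have room for strlen(src) + 1 bytes; the text never grows. */
void decode_trigraphs(const char *src, char *dst)
{
    assert(src != NULL && dst != NULL);
    while (*src != '\0') {
        const char *hit = NULL;
        /* The checks run left to right, so we never read past the
           terminating null character. */
        if (src[0] == '?' && src[1] == '?' && src[2] != '\0')
            hit = strchr(tri_third, src[2]);
        if (hit != NULL) {
            *dst++ = tri_means[hit - tri_third];
            src += 3;
        } else {
            *dst++ = *src++;
        }
    }
    *dst = '\0';
}
```

With this, a??(i??) comes out as a[i]. It also shows how an existing
program can break: the replacement is purely textual, so a string literal
that happens to contain what??! turns into what| without any warning.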
  You see, over here we get by just fine without trigraphs. The less
fortunate are stuck with a national character set, and have to put up
with seeing the various punctuation marks as letters - these are not as visually
distinctive (and the brackets and braces don't pair naturally), but with
a little attention to layout one gets by quite well. And it's _much_ better
than trigraphs.
  The lucky ones have terminals which can switch between ASCII and national
character sets. If not for the warped minds of the terminal manufacturers,
this would be the perfect solution. But we (at this institute) have yet to
see a terminal with an escape sequence to switch character sets, or (and
this is worse) one whose keyboard layout did _not_ change with the character
set shown on the screen. (And none of them had LCD keytops.) So we have to
pay the importer to hack new PROMs to enable us to switch without moving the
keys around. But I digress.
  By the way, I find that it's easier to read Danish with ASCII characters
than it is to parse convoluted C code in Danish characters, so I hardly
ever bother to switch any more.

To make it pleasant to use C and national letters in the same file, there
would have to be _convenient_ replacements for the ASCII characters in
question, and the national letters would have to be allowed in
identifiers (trigraphs don't allow that). This cannot be done as an extension of the
ASCII C input format because the national letters are punctuation in ASCII.
  Now we're talking about an alternate input format for C - we'll have to
tell the compiler if a given source file is in the `old' or the `new' format.
On the other hand this frees us to use extra keywords etc. The new format
shouldn't use any characters that may be replaced in national character sets.
  The tokens [ ] { } | || (and in some compilers |=) must be replaced; one
off-the-cuff possibility is (. .) beg end or cor (or=). We need a new
pre-processor escape and a new string escape, which can't very well be
keywords. // might be a possibility for both, as it's rare in C, but does
it look too much like JCL?
  This new format could probably be implemented by a little lex pre-pre-
processor; national characters in identifiers would have to be encoded
somehow (e.g. using Q as an escape), increasing the identifier length.
This would cause problems with symbolic debuggers and short-name compilers,
but could easily be retrofitted on old compilers (write your own cc ...).
Oh well, it wouldn't be portable anyway. Hey, anybody from GNU reading this?
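  To give an idea of the size of such a pre-pre-processor, here is a rough
sketch in plain C rather than lex. It handles only the off-the-cuff
spellings listed above: (. and .) for the brackets, beg and end for the
braces, or and cor for | and ||. Everything else is an assumption of the
sketch: the name to_ascii_c is made up, string literals and comments are
not recognized, and neither the Q escape for national letters nor the new
pre-processor and string escapes are attempted.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical keyword spellings and their ASCII C equivalents. */
static const struct { const char *word; const char *ascii; } words[] = {
    { "beg", "{" }, { "end", "}" }, { "or", "|" }, { "cor", "||" },
};

static int is_ident_char(int c) { return isalnum(c) || c == '_'; }

/* Translate the alternate spelling in src to ASCII C in dst.
   dst must have room for strlen(src) + 1 bytes: no replacement is
   longer than the text it replaces. */
void to_ascii_c(const char *src, char *dst)
{
    assert(src != NULL && dst != NULL);
    while (*src != '\0') {
        if (src[0] == '(' && src[1] == '.') {
            *dst++ = '[';
            src += 2;
        } else if (src[0] == '.' && src[1] == ')') {
            *dst++ = ']';
            src += 2;
        } else if (is_ident_char((unsigned char)*src)) {
            size_t len = 0, i;
            const char *rep = NULL;
            /* Take the whole run of letters, digits and underscores. */
            while (is_ident_char((unsigned char)src[len]))
                len++;
            /* Only a run that starts with a non-digit can be a keyword. */
            if (!isdigit((unsigned char)src[0]))
                for (i = 0; i < sizeof words / sizeof words[0]; i++)
                    if (strlen(words[i].word) == len &&
                        strncmp(src, words[i].word, len) == 0)
                        rep = words[i].ascii;
            if (rep != NULL) {
                strcpy(dst, rep);
                dst += strlen(rep);
            } else {
                memcpy(dst, src, len);
                dst += len;
            }
            src += len;
        } else {
            *dst++ = *src++;
        }
    }
    *dst = '\0';
}
```

Given "if (a(.i.) or b) beg x or= 1; end" this produces
"if (a[i] | b) { x |= 1; }"; the or= spelling needs no rule of its own,
since or becomes | and the = is copied through. The sketch is deliberately
naive: a call such as f(.5) is mangled because (. is taken for a bracket,
and the word or inside a string literal is replaced as well.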

By the way, Standard Pascal is designed so that it can be written without
certain ASCII characters: It allows (. .) for [ ] (indexing), and (* *)
for { } (comments). Since e.g. .5 is a legal constant, this may cause
unexpected parse errors for programmers who are unaware of the feature.
--
Lars Mathiesen, DIKU, U of Copenhagen, Denmark      [uunet!]mcvax!diku!thorinn
Institute of Datalogy -- we're scientists, not engineers.