Programming and international character sets.

Richard A. O'Keefe ok@quintus.uucp
Sun Nov 13 17:17:36 AEST 1988


In article <774@wsccs.UUCP> terry@wsccs.UUCP (Every system needs one) writes:
>Second, vi in the US strips the 8th bit out, and is therefore not
>usable for programming international (8-bit) characters using either model.

AT&T announced clearly in the SVID that they were going to stop doing
that kind of thing, _and_they_have_.

>Problems with 16 bit characters:
>
>O	The Xerox model is 16-bit and only valid for bitmapped displays,
>	like Mac, and we all know how slowly that scrolls.
>
The Xerox model (XSIS 058404) has nothing to do with bitmapped displays.

>O	All of the current software would break without extensive rewrite

It's going to break _anyway_.  If you do one-character-equals-one-byte
operations on Kanji, the results just aren't going to make sense.  With
a 16-bit model, far less has to change (and the Xerox model already has
provision for 24-bit characters, though the implementation I was
familiar with didn't provide them yet).  In fact, when XNS support was
added to Interlisp, most programs didn't even need to be recompiled,
and those that needed other changes mostly _could_ have been written to
be independent of character set, using facilities already in the
language.
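
To make the first point concrete, here is a C sketch (my own
illustration, not code from any system under discussion; the 16-bit
values are made up).  strlen() counts 8-bit bytes and stops at the
first zero byte it meets, so on text stored as 16-bit characters its
answer is nonsense:

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Three 16-bit characters plus a 16-bit terminator.  0x0041
       ('A' widened to 16 bits) contains a zero byte, as every
       Latin letter would in a 16-bit code. */
    unsigned short text[] = { 0x0041, 0x65E5, 0x8A9E, 0x0000 };
    size_t n = 0;

    /* Byte-oriented count: stops at the first zero byte, which here
       falls inside the very first character whichever byte order the
       machine uses, so the answer (0 or 1) is wrong either way. */
    printf("strlen says %lu\n", (unsigned long) strlen((char *) text));

    /* Character-oriented count: must step in 16-bit units. */
    while (text[n] != 0)
        n++;
    printf("there are really %lu characters\n", (unsigned long) n);
    return 0;
}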

>O	The internal overhead in a non-message passing operating system
>	(most of them) is so high that it's ridiculous.
>O	Think of pipes and all file I/O going half as fast.
>O	Think of your hard disks shrinking to half their size... source
>	files, after all, are text.

These are essentially the same point, and are equally mistaken.  There is
no reason why a _single_ character and a _sequence_ of characters need
to use the same coding.  There are three representations used for
character sequences in Interlisp-D:  thin strings (vectors of 8-bit
characters from "character set 0"), fat strings (vectors of 16-bit
characters), and files (where runs of characters drawn from the same
256-character block are stored as 8-bit codes, with "font change"
codes inserted as needed).  Since a file is presumed to
start in character set 0, files of 8-bit characters DIDN'T CHANGE AT ALL.
If you want to position randomly in a sequence, you have to know
what the "font" is at that point; alternatively, a font change code
can be inserted at the start of every block.  It is only when a
program picks up a single
character and looks at it on its own that it materialises as 16 bits.
[This coding wins if you tend to mix languages with small character sets,
e.g. if you have whole sentences in English, Russian, Hebrew, Greek, &c,
because then you can stay in the same "font" for at least a word at a time.
It does not pay off for Kanji, but with a certain amount of cunning you
can make it no worse than the ISO 2022 method.]
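
To show what reading such a font-compressed stream involves, here is
a C sketch.  The escape value FONT_ESCAPE and the name getwch() are
my inventions for illustration; I am not reproducing the actual
Interlisp-D or XNS file format codes.  The caller sees full 16-bit
characters, the file stays 8 bits wide for any run within a single
256-character block, and because the stream is presumed to start in
block 0, a plain 8-bit file reads exactly as before:

#include <stdio.h>

#define FONT_ESCAPE 0xFF        /* hypothetical "font change" code; a
                                   real format has to reserve or double
                                   this byte so that it cannot occur as
                                   an ordinary character code */

static int current_block = 0;   /* "character set 0" at start of file */

/* Return the next 16-bit character code from fp, or EOF. */
int getwch(FILE *fp)
{
    int c;

    while ((c = getc(fp)) != EOF) {
        if (c == FONT_ESCAPE) {
            int b = getc(fp);   /* the next byte names the new block */
            if (b == EOF)
                return EOF;
            current_block = b;
            continue;           /* the escape is not itself a character */
        }
        return (current_block << 8) | c;
    }
    return EOF;
}

As the bracketed note says, whole words or sentences in one "font"
then cost one byte per character, and only text that keeps hopping
between blocks, Kanji above all, pays for the escapes.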

Now, code-set independence comes that easily only in a high-level
language, and font-compressed files really require all the utilities
in the system to be internationalised at once, so the ANSI committee
didn't really have the option of adopting a solution like this.


