Character sets (was Re: wchar_t values)

Thu Apr 4 16:31:41 AEST 1991

In the recent exchanges between keld at login.dkuug.dk (Keld Jørn Simonsen)
and gwyn at smoke.brl.mil (Doug Gwyn), there is evidence of misunderstanding
on both sides.  Mr. Simonsen missed a subtle English ambiguity in an ANSI
response that Mr. Gwyn quotes, and went off on a tangent.  Mr. Gwyn missed
a more important technical issue, and appears to have gotten all defensive
about Mr. Simonsen's complaint instead of recognizing the technical point.

dg>THE MAPPING BETWEEN INTERNAL CODE SET VALUES AND THE ENVIRONMENT CAN,
dg>AND SHOULD, BE DEFINED BY THE C IMPLEMENTATION.

ks>This I read to have the consequences that all characters in the
ks>source C program and every input and output file SHOULD have
ks>conversions applied to all widechar strings. This could be done in the
ks>mbstowcs() and other mb/wc functions.

Even ordinary characters (non-widechars) might need conversions.

ks>Thus the internal widechar representation of 'c' and the external
ks>multibyte representation SHOULD not be the same for character sets
ks>like ISO 10646, JIS X 0208, KS C 5601 and GB 2312.
ks>At least this should hold for characters in the C character set.

It seems to me that this is exactly the reason for the mb/wc functions.
However, something is missing for ordinary characters.
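
As a rough sketch of that boundary (not anything taken from the standard
or from the quoted postings), the conversion can be wrapped in one place.
The helper name external_to_internal is made up for illustration; only
mbstowcs() is a real library function.  The point is that the caller never
assumes an internal wchar_t value equals the external byte value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: convert an external multibyte string to the
 * internal wide-character form, using whatever mapping the current
 * locale defines.  Returns the number of wide characters stored (not
 * counting the terminator), or (size_t)-1 if the input holds an
 * invalid multibyte sequence or the result does not fit in dst. */
size_t external_to_internal(wchar_t *dst, size_t dstlen, const char *src)
{
    size_t n;

    assert(dst != NULL && src != NULL && dstlen > 0);

    n = mbstowcs(dst, src, dstlen);
    if (n == (size_t)-1)
        return n;              /* invalid multibyte sequence */
    if (n == dstlen)
        return (size_t)-1;     /* no room left for the terminator */
    return n;
}
```

Nothing comparable exists for ordinary char strings, which is the gap
noted above.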

ks>ANSI and DS (Danish Standards, the Danish equivalent to ANSI in ISO)
ks>have disagreed on the necessity of a *readable* and *writeable* alternative
ks>to a representation of C source in invariant ISO 646.

Exactly.

dg>Indeed, X3J11 explained this to you in the third public review response
dg>document.

It did not.

dg>	In response to Letter #177, Doc. No. X3J11/88-134:
dg>	Summary of issue:
dg>		Proposal for more readable supplement to trigraphs.

Confusion of issue.

dg>	X3J11 response:
[...]
dg>	Trigraphs were intended to provide a universally portable format
dg>	for the *transmission* of C programs; they were never intended
dg>	to be used for day-to-day reading and writing of programs.

Exactly.  However, a readable and writable alternative is ALSO necessary.
By coincidence (well, not entirely, but it can be treated that way), it
would also make trigraphs unnecessary.  A readable and writable alternative
is necessary for programming, regardless of whether transmission is done.

[...]
dg>		Translation phase 1 actually consists of two parts, first
dg>	the mapping (about which we say very little) from the external
dg>	host character set to the C source character set, then the
dg>	replacement of C source trigraph sequences with single C source
dg>	characters.  (Note that the C source characters represented in
dg>	our documents in Courier font need not appear graphically the
dg>	same in the host environment, although a reasonable
dg>	implementation will make them as nearly so as possible.)
dg>	The kind of mapping you propose can in fact be done in the first
dg>	part of translation phase 1, and several such "convenience"
dg>	mappings are already common practice.  However, attempting to
dg>	standardize this mapping is outside the scope of the C Standard,
dg>	since what is appropriate may depend on the capabilities of the
dg>	specific hardware, availability of fonts, and so forth.
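
For readers who have not met trigraphs, here is a minimal sketch of the
second part of translation phase 1 as the response describes it.  The
helper name replace_trigraphs is made up, and the first part (host
character set to C source character set) is simply assumed to be the
identity, since the response says very little about it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Third characters of the nine trigraphs and, at the same index, the
 * single source character each one stands for.  Every trigraph begins
 * with two question marks. */
static const char trigraph_third[] = "=(/)'<!>-";
static const char trigraph_char[]  = "#[\\]^{|}~";

/* Hypothetical helper: second part of translation phase 1.  Rewrites
 * text in place, replacing each trigraph with its single character.
 * The first part (host character set to source character set) is
 * implementation-defined and is treated here as the identity. */
void replace_trigraphs(char *text)
{
    char *in = text;
    char *out = text;

    assert(text != NULL);

    while (*in != '\0') {
        const char *hit = NULL;

        /* Short-circuit evaluation keeps us from reading past the end. */
        if (in[0] == '?' && in[1] == '?' && in[2] != '\0')
            hit = strchr(trigraph_third, in[2]);

        if (hit != NULL) {
            *out++ = trigraph_char[hit - trigraph_third];
            in += 3;
        } else {
            *out++ = *in++;
        }
    }
    *out = '\0';
}
```

A more readable supplement of the kind proposed would have to be handled
somewhere other than this step, either in the first part of phase 1 or
later in translation.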

ks>This I read as just a story with irrelevant facts about fonts
ks>(Courier etc.).

No.  The clause about fonts was not intended to give importance to fonts
etc.  It was intended only to help identify which characters in the standard
are C source characters, as opposed to those which are English or BNF.
The main thrust of the statement is that C source characters do not have
to look like Roman letters and punctuation marks in the host environment,
though a "reasonable" implementation would do "as nearly so as possible."

(I find this use of the word "reasonable" offensive, an innuendo bordering
on racism.  Many programmers would not like programming in C on a machine
that does not have Roman letters, but that does not make the machine, or
an implementation of C thereon, unreasonable.)

ks>This X3J11 judgement is not at all "on technical issues" - all technical
ks>issues are admitted to be solveable! The decision is a political one.

If the misunderstanding had been cleared up before the decision, then the
decision was political.  However, it is not clear from the postings in this
group whether the misunderstandings were cleared up in time.

ks>The reason why the Japanese have not seen the problem before with
ks>JIS X 0208, but first with 10646, is beyond my understanding.
ks>Maybe some Japanese could enlighten us (me!) on this?

Maybe.  This Japanese resident can make a stab in the dark which sounds
plausible:  In JIS 2.6, there was no problem with ASCII characters.
A byte which had its high bit 0 was not part of a JIS 2.6 character.
It is possible that 0208 didn't have much of a following until recently.
(I don't know 0208 at all, so I have to take the word of others about the
problems that arise.)
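
To make the high-bit point concrete, here is a small sketch under exactly
the assumption stated above: a byte with its high bit 0 is an ASCII
character and never part of a multibyte character.  The function name is
made up, and nothing here is specific to any particular JIS document.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: counts the bytes in buf whose high bit is 0.
 * Under the assumption stated above, exactly these bytes are ASCII
 * characters; bytes with the high bit set belong to multibyte
 * characters. */
size_t count_ascii_bytes(const unsigned char *buf, size_t len)
{
    size_t i;
    size_t count = 0;

    assert(buf != NULL || len == 0);

    for (i = 0; i < len; i++) {
        if ((buf[i] & 0x80) == 0)
            count++;
    }
    return count;
}
```

With an encoding like that, a scanner looking for a C source character
can never mistake half of a multibyte character for one.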
--
Norman Diamond       diamond at tkov50.enet.dec.com
If this were the company's opinion, I wouldn't be allowed to post it.