wchar_t values

Keld Jørn Simonsen keld at login.dkuug.dk
Thu Apr 4 08:59:44 AEST 1991


gwyn at smoke.brl.mil (Doug Gwyn) writes as a reply to an article of mine:

>>ANSI C does not handle DIS 10646, JIS X 0208, GB 2312 and KSC 5601 
>>correctly. So ANSI C multibyte specifications *cannot* be used on any
>>multibyte de jure character set.

>I think you have mixed multibyte character sequences with wchar_t.
>They are NOT the same thing!  That is why there are interconversion
>functions specified in the C standard.

I wrote "multibyte" to cover multibyte character sets and the
support in ISO (ANSI) C for these character sets; this support
consists of the multibyte support functions and the widechar
support functions.

>The advice X3J11 received during development of this aspect of the
>standard, from such organizations as NTSCJ who have a major stake in
>so-called multibyte character encodings, was that the mechanisms in
>the C standard were adequate for this purpose.  Unless you can explain
>WHAT it is that you think is wrong, I suggest that your comments be
>ignored.  I haven't seen a significant technical argument against
>the "wide character" mechanisms in the C standard; what I have seen
>are misunderstandings.  Perhaps you should refer to P.J. Plauger's
>model standard C library implementation in his new book, to see what
>is actually involved in exploiting and implementing these facilities.

OK, some hard facts:

The character 'c' has the following encodings in these basic 16-bit
East Asian de jure character standards (octet values in decimal):

GB 2312-80 (basic Chinese 16-bit standard)   /035/099
JIS X 0208 (basic Japanese 16-bit standard)  /035/099
KS C 5601  (basic Korean 16-bit standard)    /035/099

ISO DIS 10646 has the following value for 'c':  /032/032/032/099

None of these values has the nice property that ASCII 'c' (/099)
extends into it when loaded as a 16-bit or 32-bit int.
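To make the arithmetic concrete, here is a minimal C sketch. The
helper names pack_octets and extends_from_ascii are made up for this
illustration, and loading the octets most significant first is an
assumption about how an implementation might form the wide value.

```c
/* Pack octets, most significant first, into one integer, the way a
   naive implementation might load a multibyte code into a 16-bit or
   32-bit quantity.  (Hypothetical helper, for illustration only.) */
static unsigned long pack_octets(const unsigned char *octets, int count)
{
    unsigned long value = 0;
    int i;

    for (i = 0; i < count; i++)
        value = (value << 8) | octets[i];
    return value;
}

/* Return 1 if the packed value equals the zero-extended single-octet
   code, i.e. if the "nice property" holds for this encoding. */
static int extends_from_ascii(const unsigned char *octets, int count,
                              unsigned char ascii_code)
{
    return pack_octets(octets, count) == (unsigned long)ascii_code;
}
```

With the octets /035/099 the packed value is 9059 (hex 2363), and with
/032/032/032/099 it is 538976355 (hex 20202063); neither equals 99,
the value of ASCII 'c'.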

Either this is a problem for all of the above character sets, or
it is not a problem at all. I hope the latter is true, for then there
is no problem to fix. Unfortunately quite a few knowledgeable people,
such as Erik v.d. Poel, the SC2/WG2 people and WG14, think there is a
problem, and they have not yet been able to solve it.

>>Also the character standards should be the base standards and
>>programming language standards build on these and provide appropriate
>>functionality to cover the standard character sets. 

>WHICH "character standard"?  There are so many to choose from, all
>of them botched in one way or another.  That is why programming
>language standards should be INDEPENDENT of any particular choice of
>character code set, rather than based on one choice that may not be
>appropriate for many of the potential users of the language.

Something we agree on, Doug!

> In the
>case of C, the only requirements on the basic source and execution
>character sets are that there be at least 96 distinct values, that
>the values assigned by the C implementation to represent digit glyphs
>be a contiguous ascending sequence, that there be three additional
>distinct values in the execution set, and that all the previously
>mentioned internal values be distinct from zero.  THE MAPPING BETWEEN
>INTERNAL CODE SET VALUES AND THE ENVIRONMENT CAN, AND SHOULD, BE
>DEFINED BY THE C IMPLEMENTATION.  Thus, a straight 6-bit external
>code set could not have a one-to-one correspondence between external
>"characters" and C source OR execution "character" values, and in
>such a system environment there would have to be at least one added
>convention for representing the full set of internal C characters,
>with support tools to facilitate working with such special text files.

I read this to have the consequence that conversions SHOULD be
applied to all widechar strings, for every character in the C source
program and in every input and output file. This could be done in
mbstowcs() and the other mb/wc functions.

Thus the internal widechar representation of 'c' and the external
multibyte representation SHOULD not be the same for character sets
like ISO 10646, JIS X 0208, KS C 5601 and GB 2312.
At least this should hold for characters in the C character set.

This interpretation of the C standard sounds OK to me,
and solves the problem mentioned by Erik v.d. Poel.
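A minimal sketch of such a conversion, in C. The function names
external_to_internal and convert_pairs are hypothetical and are not
part of the C library; a real implementation would do this work
inside mbstowcs() and friends. The sketch assumes that the external
code is a pair of octets, and that the letters and digits in row /035
sit at the same second-octet positions as in ASCII, as 'c' does in
the table above.

```c
#include <stddef.h>

/* Map one external two-octet code to an internal wide value.
   Assumption of this sketch: letters and digits in row 35 are given
   the same internal value as the single-octet character, so that the
   internal value of 'c' is 99 and not 35*256+99. */
static wchar_t external_to_internal(unsigned char first,
                                    unsigned char second)
{
    int is_digit = second >= 48 && second <= 57;
    int is_upper = second >= 65 && second <= 90;
    int is_lower = second >= 97 && second <= 122;

    if (first == 35 && (is_digit || is_upper || is_lower))
        return (wchar_t)second;

    /* Everything else keeps a value built from both octets. */
    return (wchar_t)(((unsigned int)first << 8) | second);
}

/* Convert 'pairs' two-octet codes from 'in' into wide values in 'out'.
   The caller must provide room for 'pairs' elements in 'out'. */
static size_t convert_pairs(const unsigned char *in, size_t pairs,
                            wchar_t *out)
{
    size_t i;

    for (i = 0; i < pairs; i++)
        out[i] = external_to_internal(in[2 * i], in[2 * i + 1]);
    return pairs;
}
```

With this mapping the external /035/099 becomes the internal value 99,
while a code outside row /035, for example /048/033, keeps a distinct
two-octet value.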

Doug goes on to make an ad hominem attack on me (well, this is not the
first time he has done so). Some comments:

ANSI and DS (Danish Standards, the Danish equivalent to ANSI in ISO)
have disagreed on the necessity of a *readable* and *writable* alternative
to a representation of C source in invariant ISO 646. Invariant ISO 646
is the same as ASCII with 12 positions left undefined - to be decided by
national ISO bodies. ANSI decided on the ASCII character set, while a lot
of other ISO member bodies - especially in Europe - decided to use these
positions for national letters and the like. Invariant ISO 646 is thus
the greatest common denominator for these character sets, all derived
from the international standard ISO 646.
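As an illustration of the 12 national-use positions, here is a minimal
C sketch. The function names are made up for this example, and the code
works on numeric code positions so that it does not depend on the host
character set. The positions listed are the ones ISO 646 leaves to
national versions: 35, 36, 64, 91-94, 96 and 123-126 (in ASCII these
are # $ @ [ \ ] ^ ` { | } ~).

```c
/* Return 1 if the 7-bit code position is the same in every national
   version of ISO 646, 0 if it is one of the 12 national-use positions
   or lies outside the 7-bit range.  (Hypothetical helper.) */
static int is_invariant_646(int code)
{
    switch (code) {
    case 35: case 36: case 64:              /* ASCII: # $ @   */
    case 91: case 92: case 93: case 94:     /* ASCII: [ \ ] ^ */
    case 96:                                /* ASCII: `       */
    case 123: case 124: case 125: case 126: /* ASCII: { | } ~ */
        return 0;
    default:
        return code >= 0 && code <= 127;
    }
}

/* Count the positions in the 7-bit range that are NOT invariant. */
static int count_national_positions(void)
{
    int code;
    int count = 0;

    for (code = 0; code <= 127; code++)
        if (!is_invariant_646(code))
            count++;
    return count;
}
```

Nine of these positions (all except 36, 64 and 96) carry characters
that C source needs, which is why the trigraphs exist.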

WG14 has supported DS in this opinion at several meetings, and
SC22 has also passed resolutions to require this support in the whole
area of programming languages. WG14 has passed resolutions of support
several times, most recently at the WG14 meeting in Copenhagen in
November 1990.

>Indeed, X3J11 explained this to you in the third public review response
>document.  Judging from your continued pursuit of more obtrusive
>solutions for your own particular limited character set problem,
>it would appear that you either did not understand the X3J11 response
>or that, for reasons of your own, you wish to ignore it.  For purposes
>of documentation for those who have not seen the response X3J11 gave
>long ago, here it is:

>	In response to Letter #177, Doc. No. X3J11/88-134:

>	Summary of issue:
>		Proposal for more readable supplement to trigraphs.

>	X3J11 response:
>		The Committee discussed this proposal but decided
>	against it.

>		We cannot support this proposal for a number of reasons.
>	Trigraphs were intended to provide a universally portable format
>	for the *transmission* of C programs; they were never intended
>	to be used for day-to-day reading and writing of programs.

Here X3J11 is in disagreement with SC22 and SC22/WG14, which have both
passed resolutions on the desirability of an alternate representation
of C source for reading and writing.

>	Should it be necessary to do so, however, the preprocessor can
>	already be used to improve their readability (exact macro names
>	and definitions are not provided as the Committee prefers to
>	avoid stylistic issues).  As larger character sets become more
>	and more popular, the chances of having to deal with a
>	"deficient" character set become smaller and smaller.

True to a great extent. But this is not a technical argument for
why an alternate representation is impossible; it is rather a means
of implementing one.

>		Conversion between the current trigraph representations
>	and "normal" representations can be done simply in a context-
>	free manner, but this is not possible with the proposed notation.

Well, it was not meant to be a context-free substitution...
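For comparison, the context-free nature of the existing trigraph
replacement can be sketched in a few lines of C. The function name
replace_trigraphs is made up for this example; it is a sketch of the
replacement done in translation phase 1, not code from any compiler.
Each trigraph is two question marks followed by one of nine
characters, and it is replaced wherever it occurs, with no regard to
the surrounding text.

```c
#include <stddef.h>
#include <string.h>

/* Copy 'in' to 'out', replacing each trigraph by the single character
   it stands for.  'out' must be at least as large as 'in'.  Returns
   the length of the result.  No context is examined: the same three
   characters are replaced everywhere, also inside strings. */
static size_t replace_trigraphs(const char *in, char *out)
{
    /* Third character of each trigraph and its replacement, in the
       same order. */
    static const char third[]  = "=(/)'<!>-";
    static const char single[] = "#[\\]^{|}~";
    size_t n = 0;

    while (*in != '\0') {
        const char *hit;

        if (in[0] == '?' && in[1] == '?' && in[2] != '\0'
            && (hit = strchr(third, in[2])) != NULL) {
            out[n++] = single[hit - third];
            in += 3;
        } else {
            out[n++] = *in++;
        }
    }
    out[n] = '\0';
    return n;
}
```

A proposal in which the meaning of a replacement depends on where it
appears in the program cannot be handled by a table lookup of this
kind, which is the point X3J11 makes.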

>	Also, there are a number of difficulties with the infix subscript
>	operator where empty brackets would have been used.  Either the
>	operator must be allowed as a postfix unary operator as well as
>	a binary operator, or the grammar must be extended to allow empty
>	parentheses to appear in those contexts where empty brackets can.
>	Although these problems are by no means insurmountable, we feel
>	that the current trigraphs are adequate for their intended use
>	and that no further enhancements are necessary.

Here X3J11 admits that the technical problems are solvable ("by no
means insurmountable").

Instead the proposal is turned down for political reasons.

>		Translation phase 1 actually consists of two parts, first
>	the mapping (about which we say very little) from the external
>	host character set to the C source character set, then the
>	replacement of C source trigraph sequences with single C source
>	characters.  (Note that the C source characters represented in
>	our documents in Courier font need not appear graphically the
>	same in the host environment, although a reasonable
>	implementation will make them as nearly so as possible.)
>	The kind of mapping you propose can in fact be done in the first
>	part of translation phase 1, and several such "convenience"
>	mappings are already common practice.  However, attempting to
>	standardize this mapping is outside the scope of the C Standard,
>	since what is appropriate may depend on the capabilities of the
>	specific hardware, availability of fonts, and so forth.

This I read as just a story with irrelevant facts about fonts
(Courier etc.). What was proposed was support for a very central 
character set, namely invariant ISO 646.

>		Although the Committee regrets any "no" votes on either
>	the national or international proposed standards, we feel we
>	must represent our best judgement on technical issues.  We hope
>	you will reconsider your objection to the current specification.

This X3J11 judgement is not at all "on technical issues" - all technical
issues are admitted to be solvable! The decision is a political one.

>Note that your "trigraph alternative" proposals had been discussed many
>times in the standards committees, and still were resoundingly defeated
>during a joint X3J11/WG14 meeting. 

And other WG14 meetings have supported it, and asked ANSI X3J11 to 
implement it. X3J11 just ignored these ISO WG14 resolutions.

> The only reason this issue is still
>"on the table" for WG14 is that there was some political maneuvering at
>the SC22 level in the absence of anybody who could represent the actual
>issues and history, and SC22 mistakenly thought, on the basis of your
>argumentation, that there was a problem that needed to be solved, and
>thus directed that work toward a normative addendum to the ISO C
>standard begin to address this "problem".

I did not personally argue this in SC22, so people other than me must
have been convinced. Actually X3J11 has been the only body not to
be convinced. And so the story about X3J11 goes on quite a few other
issues (such as the British comments).

>Later, the Japanese in
>particular thought that it would be appropriate to add more support for
>multibyte character sequences to the ISO C standard as part of this
>normative addendum.  Your original hobby horse had nothing to do with
>multibyte character sequences, and so far as I can determine, the
>Japanese have not found any problem with them other than the desire for
>more standard library functions to make their use more convenient.

OK, so my hobby horse has nothing to do with multibyte support.
So I cannot address other issues than my "hobby horse"?
I do happen to have a fair amount of knowledge of character sets
and their usage in programming languages and communications.

Why the Japanese did not see the problem earlier with JIS X 0208,
but only now with 10646, is beyond my understanding.
Maybe some Japanese could enlighten us (me!) on this?

>It is also worth noting that there is continuing discussion of this
>issue on the X3J11 and WG14 electronic mailing lists.

Both you and I participate in these discussions. I actually think
it would have been more appropriate for you to share your valuable
technical insight at an earlier stage in this discussion,
e.g. in November when SC2/WG2 wrote their first letter about 10646.
Oh well, better late than never. I am, as always, grateful for your
contributions (although I would prefer your tone to be more gentle :-)

>>I hope that this problem will be a historical one with the appearance
>>of 10646.

>Surely you should be able to see the possible problems with 10646?
>The very idea of using 32 bits to represent a character is bound to
>meet stiff opposition, particularly from users of small systems,
>who already have more efficient solutions to the "problem" of a
>diversity of alphabets.  It seems to me that 10646 is one of the
>technically worst character-set standards yet to be adopted.  No
>wonder there has been renewed interest in other standards such as
>"Unicode" (about which I know little at present other than that it
>has a broad base of industry support).

Now you are talking about things that you know very little of, Doug!

>One does not solve "people problems" by simply adopting a technical
>standard.  History provides much evidence of that.

>DISCLAIMER:  None of the above should be construed as an official X3J11
>position, not even the attempt to cite from an X3J11 document.  However,
>I believe that I have correctly represented the situation as I
>understand it.

Keld Simonsen


