wchar_t values

Doug Gwyn gwyn at smoke.brl.mil
Mon Apr 1 16:45:30 AEST 1991


In article <keld.670436534 at dkuugin> keld at login.dkuug.dk (Keld Jørn Simonsen) writes:
>erik at srava.sra.co.jp (Erik M. van der Poel) writes:
>>	C	ISO DIS 10646/4		wchar_t
>>	L'c'	032/032/032/099		000/000/000/099
>>	L'\t'	009/128/128/128		000/000/000/009
>Erik writes: ANSI C does not handle 10646 properly -> let's change 10646!

No, he didn't say that, and his suggestion seemed reasonable to me.

>ANSI C does not handle DIS 10646, JIS X 0208, GB 2312 and KSC 5601 
>correctly. So ANSI C multibyte specifications *cannot* be used on any
>multibyte de jure character set.

I think you have mixed multibyte character sequences with wchar_t.
They are NOT the same thing!  That is why there are interconversion
functions specified in the C standard.
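
To make the distinction concrete, here is a minimal sketch of that
interconversion using the standard library functions mbstowcs and
wcstombs.  The wrapper names widen and narrow are hypothetical, chosen
only for this illustration, and the sketch assumes the default "C"
locale, in which each multibyte character occupies a single byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: convert a multibyte string to wide characters.
   Returns the number of wide characters stored (not counting the
   terminator), or (size_t)-1 if the input is invalid or does not fit. */
size_t widen(const char *mb, wchar_t *out, size_t cap)
{
    size_t n;

    assert(mb != NULL && out != NULL);
    n = mbstowcs(out, mb, cap);
    if (n == (size_t)-1 || n >= cap)
        return (size_t)-1;  /* bad sequence, or no room for L'\0' */
    return n;
}

/* Hypothetical helper: the reverse conversion, wide to multibyte.
   Returns the number of bytes stored (not counting the terminator),
   or (size_t)-1 on failure. */
size_t narrow(const wchar_t *wide, char *out, size_t cap)
{
    size_t n;

    assert(wide != NULL && out != NULL);
    n = wcstombs(out, wide, cap);
    if (n == (size_t)-1 || n >= cap)
        return (size_t)-1;  /* unrepresentable, or no room for '\0' */
    return n;
}
```

Calling widen("c\t", buf, 8) should store L'c' and L'\t' in buf, one
wchar_t element per character, whatever multibyte encoding the
external text happens to use.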

The advice X3J11 received during development of this aspect of the
standard, from such organizations as NTSCJ who have a major stake in
so-called multibyte character encodings, was that the mechanisms in
the C standard were adequate for this purpose.  Unless you can explain
WHAT it is that you think is wrong, I suggest that your comments be
ignored.  I haven't seen a significant technical argument against
the "wide character" mechanisms in the C standard; what I have seen
are misunderstandings.  Perhaps you should refer to P.J. Plauger's
model standard C library implementation in his new book, to see what
is actually involved in exploiting and implementing these facilities.

>Also the character standards should be the base standards and
>programming language standards build on these and provide appropriate
>functionality to cover the standard character sets. 

WHICH "character standard"?  There are so many to choose from, all
of them botched in one way or another.  That is why programming
language standards should be INDEPENDENT of any particular choice of
character code set, rather than based on one choice that may not be
appropriate for many of the potential users of the language.  In the
case of C, the only requirements on the basic source and execution
character sets are that there be at least 96 distinct values, that
the values assigned by the C implementation to represent digit glyphs
be a contiguous ascending sequence, that there be three additional
distinct values in the execution set, and that all the previously
mentioned internal values be distinct from zero.  THE MAPPING BETWEEN
INTERNAL CODE SET VALUES AND THE ENVIRONMENT CAN, AND SHOULD, BE
DEFINED BY THE C IMPLEMENTATION.  Thus, a straight 6-bit external
code set could not have a one-to-one correspondence between external
"characters" and C source OR execution "character" values, and in
such a system environment there would have to be at least one added
convention for representing the full set of internal C characters,
with support tools to facilitate working with such special text files.
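
As an illustration of how little a program needs to assume, the
following sketch converts decimal digits using only the guarantee that
the digit values are contiguous and ascending.  The function names
digit_value and parse_decimal are hypothetical, introduced just for
this example.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Numeric value of a decimal digit character, or -1 if c is not one.
   Uses only the guarantee that '0' through '9' are contiguous. */
int digit_value(int c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    return -1;
}

/* Hypothetical helper: parse an unsigned decimal string.
   Returns -1 for empty input, a non-digit, or overflow. */
long parse_decimal(const char *s)
{
    long value = 0;

    assert(s != NULL);
    if (*s == '\0')
        return -1;
    for (; *s != '\0'; s++) {
        int d = digit_value((unsigned char)*s);
        if (d < 0)
            return -1;
        if (value > (LONG_MAX - d) / 10)
            return -1;  /* value * 10 + d would overflow */
        value = value * 10 + d;
    }
    return value;
}
```

No comparable guarantee exists for letters, so the analogous
expression c - 'a' is not portable; EBCDIC, for example, has gaps
within the alphabet.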

Mapping is an extremely important mathematical concept, with particular
relevance to applications involving multiple alphabets.  (As a former
cryptanalyst I am especially sensitive to this.)  That is why the VERY
FIRST STEP in translating a C program, as spelled out in the C standard
("Translation Phases", section 2.1.1.2 in X3.159-1989), is the
application of a MAPPING from physical (i.e. external) source file
characters to (internal) C source characters.  Systems with record-
oriented text files can exploit this mapping subphase to introduce
line delimiter internal characters ("new-line" characters in the C
source character set), and systems that lack standard representations
for some of the required C source characters can take advantage of
this mapping subphase to interpret, for example, digraph
representations for the characters not normally considered to be
represented in the "native" code set.  This is a simple and clean
approach to satisfying the C source character set requirements.

Indeed, X3J11 explained this to you in the third public review response
document.  Judging from your continued pursuit of more obtrusive
solutions for your own particular limited character set problem,
it would appear that you either did not understand the X3J11 response
or that, for reasons of your own, you wish to ignore it.  For purposes
of documentation for those who have not seen the response X3J11 gave
long ago, here it is:

	In response to Letter #177, Doc. No. X3J11/88-134:

	Summary of issue:
		Proposal for more readable supplement to trigraphs.

	X3J11 response:
		The Committee discussed this proposal but decided
	against it.

		We cannot support this proposal for a number of reasons.
	Trigraphs were intended to provide a universally portable format
	for the *transmission* of C programs; they were never intended
	to be used for day-to-day reading and writing of programs.
	Should it be necessary to do so, however, the preprocessor can
	already be used to improve their readability (exact macro names
	and definitions are not provided as the Committee prefers to
	avoid stylistic issues).  As larger character sets become more
	and more popular, the chances of having to deal with a
	"deficient" character set become smaller and smaller.

		Conversion between the current trigraph representations
	and "normal" representations can be done simply in a context-
	free manner, but this is not possible with the proposed notation.
	Also, there are a number of difficulties with the infix subscript
	operator where empty brackets would have been used.  Either the
	operator must be allowed as a postfix unary operator as well as
	a binary operator, or the grammar must be extended to allow empty
	parentheses to appear in those contexts where empty brackets can.
	Although these problems are by no means insurmountable, we feel
	that the current trigraphs are adequate for their intended use
	and that no further enhancements are necessary.

		Translation phase 1 actually consists of two parts, first
	the mapping (about which we say very little) from the external
	host character set to the C source character set, then the
	replacement of C source trigraph sequences with single C source
	characters.  (Note that the C source characters represented in
	our documents in Courier font need not appear graphically the
	same in the host environment, although a reasonable
	implementation will make them as nearly so as possible.)
	The kind of mapping you propose can in fact be done in the first
	part of translation phase 1, and several such "convenience"
	mappings are already common practice.  However, attempting to
	standardize this mapping is outside the scope of the C Standard,
	since what is appropriate may depend on the capabilities of the
	specific hardware, availability of fonts, and so forth.

		Although the Committee regrets any "no" votes on either
	the national or international proposed standards, we feel we
	must represent our best judgement on technical issues.  We hope
	you will reconsider your objection to the current specification.
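
The context-free conversion mentioned in the response can be sketched
in a few lines.  This is only an illustration of the second part of
translation phase 1, not any particular compiler's implementation; the
name replace_trigraphs and the two lookup tables are hypothetical.
The nine trigraphs are ??= ??( ??/ ??) ??' ??< ??! ??> ??-, standing
for # [ \ ] ^ { | } ~ respectively.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Third character of each trigraph, and the character it stands for,
   in matching positions. */
static const char trigraph_key[]   = "=(/)'<!>-";
static const char trigraph_value[] = "#[\\]^{|}~";

/* Hypothetical helper: copy src to dst, replacing each trigraph with
   its single-character equivalent.  dst needs room for
   strlen(src) + 1 characters, since the result is never longer than
   the input.  Returns the length of the result. */
size_t replace_trigraphs(const char *src, char *dst)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    while (*src != '\0') {
        const char *hit = NULL;

        /* Two question marks followed by a key character. */
        if (src[0] == '?' && src[1] == '?' && src[2] != '\0')
            hit = strchr(trigraph_key, src[2]);

        if (hit != NULL) {
            dst[n++] = trigraph_value[hit - trigraph_key];
            src += 3;
        } else {
            dst[n++] = *src++;
        }
    }
    dst[n] = '\0';
    return n;
}
```

Each replacement depends only on the three characters at the current
position, never on the surrounding tokens, which is what makes the
conversion context-free.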

Note that your "trigraph alternative" proposals had been discussed many
times in the standards committees, and still were resoundingly defeated
during a joint X3J11/WG14 meeting.  The only reason this issue is still
"on the table" for WG14 is that there was some political maneuvering at
the SC22 level in the absence of anybody who could represent the actual
issues and history, and SC22 mistakenly thought, on the basis of your
argumentation, that there was a problem that needed to be solved, and
thus directed that work toward a normative addendum to the ISO C
standard begin to address this "problem".  Later, the Japanese in
particular thought that it would be appropriate to add more support for
multibyte character sequences to the ISO C standard as part of this
normative addendum.  Your original hobby horse had nothing to do with
multibyte character sequences, and so far as I can determine, the
Japanese have not found any problem with them other than the desire for
more standard library functions to make their use more convenient.

It is also worth noting that there is continuing discussion of this
issue on the X3J11 and WG14 electronic mailing lists.

>I hope that this problem will be a historical one with the appearance
>of 10646.

Surely you should be able to see the possible problems with 10646?
The very idea of using 32 bits to represent a character is bound to
meet stiff opposition, particularly from users of small systems,
who already have more efficient solutions to the "problem" of a
diversity of alphabets.  It seems to me that 10646 is one of the
technically worst character-set standards yet to be adopted.  No
wonder there has been renewed interest in other standards such as
"Unicode" (about which I know little at present other than that it
has a broad base of industry support).

One does not solve "people problems" by simply adopting a technical
standard.  History provides much evidence of that.

DISCLAIMER:  None of the above should be construed as an official X3J11
position, not even the attempt to cite from an X3J11 document.  However,
I believe that I have correctly represented the situation as I
understand it.
