wchar_t values

Fri Apr 5 03:16:57 AEST 1991

[ The quoting here is getting confusing, so I'm putting Keld's remarks with
 the greater than symbol and Doug's remarks with the percent symbol. :-) rja ]

In article <keld.670719584 at dkuugin> keld at login.dkuug.dk (Keld Jørn Simonsen) writes:
gwyn at smoke.brl.mil (Doug Gwyn) writes as a reply to an article of Keld's:

>>>ANSI C does not handle DIS 10646, JIS X 0208, GB 2312 and KSC 5601 
>>>correctly. So ANSI C multibyte specifications *cannot* be used on any
>>>multibyte de jure character set.

% I think you have mixed multibyte character sequences with wchar_t.
% They are NOT the same thing!  That is why there are interconversion
% functions specified in the C standard.

>I wrote "multibyte" to cover multibyte character sets,  and the
>support in ISO (ANSI) C for these character sets; this support
>consists of the multibyte support functions and the widechar
>support functions.

I see no reason why the support provided in ANSI X3.159 is not
adequate.  Keld has presented no technical details to the contrary,
and it seems to me incumbent on him to explain in purely technical
terms why he feels it won't work.

I've worked on multi-lingual applications for several years and
during the mid 1980s I wrote a thesis on the whole area of 
Chinese/Japanese language support in computer systems, so this
is not an area that I am unfamiliar with.

% The advice X3J11 received during development of this aspect of the
% standard, from such organizations as NTSCJ who have a major stake in
% so-called multibyte character encodings, was that the mechanisms in
% the C standard were adequate for this purpose.  Unless you can explain
% WHAT it is that you think is wrong, I suggest that your comments be
% ignored.  I haven't seen a significant technical argument against
% the "wide character" mechanisms in the C standard; what I have seen
% are misunderstandings. 

There are a lot of misunderstandings and I suspect that they are at the
heart of the problem.
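
To make the distinction Doug draws concrete, here is a minimal sketch
in C that uses only the X3.159 conversion functions mbtowc() and
wctomb().  The helper names mb_char_count and wide_round_trips are
mine, not from the standard, and the results depend on the current
locale; in the default "C" locale each character is a single byte.

```c
#include <assert.h>
#include <limits.h>   /* MB_LEN_MAX */
#include <stddef.h>   /* wchar_t, size_t, NULL */
#include <stdlib.h>   /* mbtowc, wctomb */
#include <string.h>   /* strlen */

/* Count the characters (not the bytes) in a multibyte string under
   the current locale.  Returns (size_t)-1 on an invalid sequence. */
size_t mb_char_count(const char *s)
{
    size_t count = 0;
    size_t left;

    assert(s != NULL);
    left = strlen(s);
    (void)mbtowc(NULL, NULL, 0);        /* reset to the initial shift state */
    while (left > 0) {
        wchar_t wc;                     /* the wide form of one character */
        int len = mbtowc(&wc, s, left); /* bytes used by that character */
        if (len < 0)
            return (size_t)-1;          /* not a valid multibyte sequence */
        if (len == 0)
            break;                      /* reached a null character */
        s += len;
        left -= (size_t)len;
        count++;
    }
    return count;
}

/* Encode a nonzero wide character as a multibyte sequence and decode
   it again.  Returns nonzero if the original value comes back. */
int wide_round_trips(wchar_t wc)
{
    char buf[MB_LEN_MAX];
    wchar_t back;
    int n;

    (void)wctomb(NULL, 0);              /* reset to the initial shift state */
    n = wctomb(buf, wc);
    if (n <= 0)
        return 0;                       /* no multibyte form in this locale */
    (void)mbtowc(NULL, NULL, 0);
    return mbtowc(&back, buf, (size_t)n) == n && back == wc;
}
```

In the "C" locale mb_char_count("wchar_t") yields 7.  Under a locale
with a multibyte encoding the same loop counts characters rather than
bytes, which is the job the interconversion functions were given.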

>OK, some hard facts:

>The character 'c' has the following encodings in these basic 16-bit
>East-asian de jure character standards:

>GB 2312-80 (basic Chinese 16-bit  standard)  /035/099
>JIS X 0208 (basic Japanese 16-bit standard)  /035/099
>KS C 5601  (basic Korean 16-bit standard)    /035/099

>ISO DIS 10646 has the following value for 'c':  /032/032/032/099

>None of these values have the nice property of having ASCII 'c'
>extend into these values when loading as a 16-bit or 32-bit int.

>Either this is a problem for all of the above character sets, or
>it is not a problem at all. I hope the latter is true, then there
>is no problem to fix. Unfortunately quite some knowledgeable people
>like Erik v.d.Poel and SC2/WG2 people plus WG14 think there is a problem
>and they have not yet been able to solve it.

Alleging that someone else said they thought there was a problem isn't
nearly the same as stating an argument on technical grounds in "hard
facts".  I don't think it is a problem for any of the above (or in
support of the JIS C6220 and C6226 standards for Kanji and Kana for
that matter).
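
A minimal sketch of why the byte values Keld lists need not equal the
wide value: the external encoding and the internal wchar_t value are
related only through the conversion routine.  Everything here is made
up for illustration.  toy_wide and toy_from_dis10646 are hypothetical
names, and the rule "three leading octets of 32 select the basic set"
is my assumption based on the one example quoted above, not something
taken from DIS 10646 or X3.159.

```c
#include <assert.h>
#include <stddef.h>   /* size_t, NULL */

typedef unsigned long toy_wide;   /* hypothetical stand-in for wchar_t */

/* Decode one character from the four-octet form quoted above.
   Returns 4 (octets consumed) on success and -1 if the input is too
   short or is a form this toy does not handle. */
int toy_from_dis10646(const unsigned char *s, size_t n, toy_wide *out)
{
    assert(s != NULL && out != NULL);
    if (n < 4)
        return -1;                /* not a whole four-octet character */
    if (s[0] != 32 || s[1] != 32 || s[2] != 32)
        return -1;                /* outside the assumed basic-set form */
    *out = (toy_wide)s[3];        /* assumed rule: 32/32/32/x maps to x */
    return 4;
}
```

Given the octets 32, 32, 32, 99 this toy routine produces the wide
value 99, the same as plain 'c' on an ASCII system, even though the
four octets loaded as a 32-bit int would be a different number.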

[ text deleted here for brevity ]

>This interpretation of the C standard sounds OK to me,
>and solves the problem mentioned by Erik v.d. Poel.

Here Keld appears to acknowledge that there is NO technical problem
with the standard.  But watch how he ignores this down below:

>ANSI and DS (Danish Standards, the Danish equivalent to ANSI in ISO)
>have disagreed on the necessity of a *readable* and *writeable* alternative
>to a representation of C source in invariant ISO 646. Invariant ISO 646
>is the same as ASCII with 12 positions left undefined - to be decided by
>national ISO bodies. ANSI decided on the ASCII character set, a lot of
>other ISO member bodies - especially in Europe - decided to use these
>positions for national letters and the like. Then invariant ISO 646 is
>the greatest common denominator for these character sets, all derived
>from the international standard ISO 646. 

However, all of western Europe is moving rapidly to the ISO 8859/1 
standard which has none of the ISO 646 problems at the source level
and moreover the trigraphs address the ISO 646 technical problem 
(which I emphasise is a temporary problem already starting to fade).
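
For readers who have not met them, the trigraph mechanism replaces
nine three-character sequences beginning with ?? (for example ??= for
#, ??( for [ and ??< for {) before any other processing of the source.
A minimal sketch of that replacement step as an ordinary C function
follows.  replace_trigraphs is a hypothetical helper written for
illustration, not anything a compiler exposes, and the tables are
split in two so that the sketch itself contains no trigraphs.

```c
#include <assert.h>
#include <stddef.h>   /* size_t, NULL */
#include <string.h>   /* strchr */

/* Third characters of the nine trigraphs and the single characters
   that replace them, in matching order. */
static const char tri_third[] = "=(/)'<!>-";
static const char tri_repl[]  = "#[\\]^{|}~";

/* Copy src to dst, replacing each trigraph with its single
   character.  dst must hold at least strlen(src) + 1 bytes.
   Returns the length of the result. */
size_t replace_trigraphs(const char *src, char *dst)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    while (*src != '\0') {
        /* Two question marks followed by one of the nine third
           characters form a trigraph; anything else is copied. */
        if (src[0] == '?' && src[1] == '?' && src[2] != '\0') {
            const char *hit = strchr(tri_third, src[2]);
            if (hit != NULL) {
                dst[n++] = tri_repl[hit - tri_third];
                src += 3;
                continue;
            }
        }
        dst[n++] = *src++;
    }
    dst[n] = '\0';
    return n;
}
```

Fed the text ??=define, the function produces #define, so a source
file can be written using only the invariant ISO 646 positions.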

The POLITICAL issue is coming from the Danes, who feel that their
local character set standard based on ISO 646 should be the focus
of the whole world and that long-standing C practice should be
broken to make them feel better.

WG14 passed resolutions both ways, depending on who had spoken more
recently to the group, on whether the history of the question at the
X3J11 level and the history of the C language had been presented,
and also on some political grounds.  The Danes ignored the WG14
decisions against them and then turned around and accused X3J11 of
being indifferent because it insisted on sticking to technical merit
rather than political issues.

% Indeed, X3J11 explained this to you in the third public review response
% document.  Judging from your continued pursuit of more obtrusive
% solutions for your own particular limited character set problem,
% it would appear that you either did not understand the X3J11 response
% or that, for reasons of your own, you wish to ignore it.  

The continuing lack of a technical representation of a problem from the
Danes is amazing.  X3J11 addressed the technical issues adequately.

>This I read as just a story with irrelevant facts about fonts
>(Courier etc.). What was proposed was support for a very central 
>character set, namely invariant ISO 646.

ISO 646 is a character set whose use is rapidly diminishing and whose
use is supported adequately by the trigraph feature present in the
standard.  Keld keeps reiterating that there is a problem (except
above, which see) without specifying the problem technically.

% Note that your "trigraph alternative" proposals had been discussed many
% times in the standards committees, and still were resoundingly defeated
% during a joint X3J11/WG14 meeting. 

>And other WG14 meetings have supported it, and asked ANSI X3J11 to 
>implement it. X3J11 just ignored these ISO WG14 resolutions.

The Danes just ignored the X3J11 and the ISO WG14 resolutions that
were in favor of the existing solution.  There has NOT been a clear,
definitive position taken at the ISO level.  There has been a lot of
waffling for the reasons noted above.

% The only reason this issue is still "on the table" for WG14 is that
% there was some political maneuvering at the SC22 level in the absence
% of anybody who could represent the actual issues and history, and SC22
% mistakenly thought, on the basis of your argumentation, that there was
% a problem that needed to be solved, and thus directed that work toward
% a normative addendum to the ISO C standard begin to address this
% "problem".

Doug's comments here are essentially correct; there wasn't adequate
explanation at the ISO level of the issues and history of the
alleged (but actually unstated in Keld's message) problem.

>I did not personally argue this in SC22, so others than me must
>have been convinced. Actually X3J11 has been the only body to
>not be convinced. And so the story about X3J11 says on quite some other 
>issues (such as the British comments).

No.  See the WG14 decision against the Danish proposal above.

>OK, so my hobby horse has nothing to do with multibyte support.

[ Note that his original posting really alleged a multibyte character 
  support problem that in fact he conceded isn't a problem at all ]

% It is also worth noting that there is continuing discussion of this
% issue on the X3J11 and WG14 electronic mailing lists.

>Both you and I participate in these discussions. I actually think
>it would have been more appropriate for you to share your valuable
>technical insight at an earlier stage in this discussion,
>e.g. in November when SC2/WG2 made their first letter about 10646.

>>>I hope that this problem will be a historical one with the appearance
>>>of 10646.

% Surely you should be able to see the possible problems with 10646?
% The very idea of using 32 bits to represent a character is bound to
% meet stiff opposition, particularly from users of small systems,
% who already have more efficient solutions to the "problem" of a
% diversity of alphabets.  It seems to me that 10646 is one of the
% technically worst character-set standards yet to be adopted.  No
% wonder there has been renewed interest in other standards such as
% "Unicode" (about which I know little at present other than that it
% has a broad base of industry support).

As it happens, there is some discussion of UNICODE and ISO DIS 10646
merging together, though I personally doubt it will come to pass.
Neither is UNICODE the solution yet because it doesn't even fully 
support languages with Romanised character sets (let alone all of
the non-Roman languages).  I think that the ISO 8859 family of 
compatible 8-bit standards will come into wide use long before any
of the 16-bit or 32-bit standards do for a number of technical
reasons and some political ones (but this isn't really germane
to the pseudo-problems raised about X3.159 :-).

I am not on any of the groups working on the C standard (neither
ISO nor ANSI nor any other group).  I have followed the multibyte
support in some detail because of my work in multilingual applications
development.  I don't see any problems and Doug is free to quote me
as one developer who thinks that the multilingual support is adequate
and doesn't want to see the Danish proposal accepted because it will
harm the standard.  It is annoying that Keld keeps posting these
vague allegations without clearly stating a technical (not political)
problem that is unresolved by the C standard.  If he would post
such a technical problem statement, it could then be discussed on
this list on its technical merits, if any.

Randall Atkinson
rja at edison.cho.ge.com

Comments are the author's and are not necessarily the opinions of
GE, Fanuc, or GE-Fanuc.