Programming and international character sets.

Mark H. Colburn colburn at src.honeywell.COM
Tue Nov 8 05:53:01 AEST 1988


In article <427 at sdrc.UUCP> scjones at sdrc.UUCP (Larry Jones) writes:
>You seem to have missed a key point in the internationalization
>stuff - you don't use multi-byte characters directly, you convert
>them into wchar_t's using the functions in sections 4.10.7 and
>4.10.8.  wchar_t is an integral type (probably short or int) that
>is large enough to hold ANY character value.

This is not always true, although it would make things much easier if it
were.  You see, there is no way to take a converted string given back to
you by strxform() back to its native form.

What that means is that there is no way to make modifications to multi-byte
strings.  This would be a serious deficiency (and the one which I was
attempting to address in my last article).  Strxform is only good for
reading strings, not writing them.  For example, how would you do a
regular expression replacement if you do not know where the next character
is?  What if you need to parse a string and need to know what the data in
the string is?  Strxform translates characters into an
implementation-defined format.  That means that there is no way to portably
do anything with the generated string, other than compare it to another
string...
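
To make the "compare only" point concrete, here is a minimal sketch in C.
It assumes that the function called strxform() above is the one the C
library declares as strxfrm() in <string.h>.  The helper name
compare_transformed() is hypothetical and not part of any standard.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: order two strings by their transformed forms.
 * Returns -1, 0, or 1.  The transformed buffers are opaque: nothing
 * here inspects or edits their contents, it only compares them. */
int compare_transformed(const char *a, const char *b)
{
    size_t na, nb;
    char *ta, *tb;
    int r;

    assert(a != NULL && b != NULL);

    /* With a size of 0 the destination may be a null pointer; the
     * return value is the length needed, excluding the final '\0'. */
    na = strxfrm(NULL, a, 0) + 1;
    nb = strxfrm(NULL, b, 0) + 1;

    ta = malloc(na);
    tb = malloc(nb);
    if (ta != NULL && tb != NULL) {
        strxfrm(ta, a, na);
        strxfrm(tb, b, nb);
        /* The one portable operation on the results. */
        r = strcmp(ta, tb);
    } else {
        /* Allocation failed: strcoll() is specified to give the same
         * ordering as comparing the transformed strings. */
        r = strcoll(a, b);
    }
    free(ta);
    free(tb);
    return (r > 0) - (r < 0);
}
```

A call such as compare_transformed("apple", "banana") yields an ordering
for the current locale, but ta and tb never give back the original
characters, so there is nothing in them to search, parse, or patch.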

[ description of wchar_t types...]

>You can also pass them to the is*() and to*() functions provided
>you've setlocale() to a locale that supports additional
>characters.  If you look at sections 4.3 and 4.4, you will see
>that they are all locale dependent.

You can NOT pass a wchar_t type to is*() functions, at least not portably. 
The is*() functions and to*() functions are defined as:

	int	toupper(int c);

There is no guarantee that the width of a wchar_t is less than or equal to
that of an integer, or that it is able to be represented as an integer.  As a
matter of fact, in the (draft) C standard and the POSIX standards and drafts,
there are hints that it may be at least 4 characters wide.
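
As a rough illustration of the width problem, the sketch below only
hands a wchar_t to toupper() when its value is known to lie in the range
that function accepts, that is, a value representable as an unsigned
char.  The helper names fits_ctype_domain() and upcase_if_possible() are
hypothetical; only toupper() and UCHAR_MAX come from the library.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: nonzero if wc lies in 0..UCHAR_MAX, the only
 * range (besides EOF) in which toupper(int) has defined behaviour. */
int fits_ctype_domain(wchar_t wc)
{
    /* Both bounds are tested because wchar_t may be signed or
     * unsigned, and may be wider than int. */
    return wc >= 0 && wc <= UCHAR_MAX;
}

/* Hypothetical helper: upper-case wc if toupper() can be asked about
 * it at all; otherwise return it unchanged. */
wchar_t upcase_if_possible(wchar_t wc)
{
    if (fits_ctype_domain(wc)) {
        int up = toupper((int)wc);
        /* toupper() stays within the unsigned char range. */
        assert(up >= 0 && up <= UCHAR_MAX);
        return (wchar_t)up;
    }
    return wc;  /* outside the domain: nothing portable can be done */
}
```

Every wide character outside that small range falls through untouched,
no matter what the current locale says about it.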

One of the bugs which I pointed out was that the draft C standard does
indeed say that the is*() and to*() functions are locale dependent, but I
see no way that they can be truly locale dependent when they are defined as
they are.



More information about the Comp.lang.c mailing list