Programming and international character sets.
Mark H. Colburn
mark at jhereg.Jhereg.MN.ORG
Wed Nov 2 02:13:39 AEST 1988
In article <8804 at smoke.BRL.MIL> gwyn at brl.arpa (Doug Gwyn (VLD/VMB) <gwyn>) writes:
>In article <532 at krafla.rhi.hi.is> kjartan at rhi.hi.is (Kjartan R. Gudmundsson) writes:
>>How difficult is it convert american/english programs so that they can
>>be used to handle foreign text? [etc.]
>
>Where have you been the last few years? This subject area is known as
>"internationalization" and has been the featured topic of special issues
>of several journals, including UNIX Review and UNIX/World. The draft
>proposed ANSI/ISO C standard specifically addresses this issue (it is
>one of the reasons production of the final standard was delayed).
Unfortunately, the C standard is still lacking in this area. It is true
that the attempt was made; however, X3J11 will have to go through another
round if the standard is to be truly internationalized.
One problem is that, although the standard supports the multi-byte characters
which are required for a number of languages around the world, especially
those in Asia, no support is provided to pass those characters to any of
the is...() or to...() functions. All of the is...() and to...()
functions take a single integer parameter whose value must fit in an
unsigned char (or be EOF), so they cannot evaluate a multi-byte character.
Another problem is that an application has no way of portably
determining where the current character in a string ends and the next
begins; you can't just use ch++ to advance to the next character anymore.
It is even harder to move backwards through a string.
There are some other problems with collation as well. Some languages
have several lowercase characters corresponding to a single uppercase
character, or vice versa. This presents problems when using toupper()
and tolower() to convert a character to its opposite case. In addition, in
some languages and/or collation sequences there are characters which
do not have a corresponding opposite case (e.g. a code set may contain an
uppercase character with no corresponding lowercase character).
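The case-mapping limitation can be seen in a few lines of C. This is only a sketch: has_opposite_case() is a hypothetical helper, not part of any standard. It relies on the fact that toupper() and tolower() hand back their argument unchanged when no mapping exists. Because each returns a single int, neither can express a one-to-many mapping.

```c
#include <ctype.h>

/* Hypothetical helper (not a standard function): returns nonzero if c
 * has a distinct opposite case in the current locale.  As with the
 * is...() functions, c must fit in an unsigned char or be EOF. */
int has_opposite_case(int c)
{
    /* toupper()/tolower() return c unchanged when there is no mapping. */
    return toupper(c) != c || tolower(c) != c;
}
```

A caller that gets zero back knows only that no single-character mapping exists; a mapping to several characters would need a string-returning interface instead.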
To be fair, we did not uncover these deficiencies until just recently (just
after we sent our ballot in for the third public review), so these may not
have been issues specifically addressed by the committee.
There are some solutions to these problems, which would allow for
internationalization without breaking any existing programs. Here are some
suggestions:
1. Develop some functions which provide the same functionality as the
is...() functions but which take a character pointer as an argument.
For example:
int wcislower(char *string)
2. Develop some functions which provide the same functionality as the
to...() functions but which return a character pointer. Unfortunately,
these functions may need to allocate space in order for the
transformation to work, or they may need to pass back a pointer to a
static string which would then need to be copied. The latter is
probably the way most implementations would do it since it is
essentially a table lookup. For example:
char *wctolower(char *string)
3. Provide some functions to allow traversing a character string. These
functions would return a pointer to the next character in the string
as determined by the current locale. For example:
char *nextchar(char *string)
char *prevchar(char *string, char *backup)
These last two functions were presented at the latest IEEE POSIX meeting
by one of the committee members to cope with this problem. The backup
argument to prevchar() provides a pointer to a known character boundary
that the function can use to scan forward in the string in order to
determine where the actual character boundary of the previous character
is.
4. Some of the string functions would need to be revised as well,
specifically strlen().
int wcstrlen(char *string)
This function would return the length, in characters, of the string
according to the current locale setting. Therefore the string "abss"
would give a length of 4 in the C locale, but may return 3 in a
German locale. This functionality could be put in the current
strlen(); however, programs still need to get the number of
bytes in a string, as well as the number of characters, so the old
strlen() should not be replaced.
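As a rough illustration of suggestions 3 and 4, the traversal and length functions could be sketched on top of the standard library's mblen(). The names nextchar_sketch(), prevchar_sketch() and wcstrlen_sketch() are hypothetical, and the sketch assumes a stateless multi-byte encoding. It counts encoded characters only; treating a pair such as "ss" as one character, as in the German example, would need locale collation tables that are not shown here. The length function returns size_t rather than int.

```c
#include <stdlib.h>   /* mblen */
#include <string.h>   /* strlen */

/* Hypothetical sketch of nextchar(): step over one character of the
 * current locale's multi-byte encoding.  If mblen() reports an invalid
 * sequence, advance one byte so that a scan always makes progress. */
char *nextchar_sketch(char *string)
{
    int n;

    if (*string == '\0')
        return string;                  /* already at the terminator */
    n = mblen(string, strlen(string));  /* never look past the NUL */
    return string + (n > 0 ? n : 1);
}

/* Hypothetical sketch of prevchar(): backup must point at a known
 * character boundary at or before string.  Scan forward from backup
 * and remember the last boundary seen before string is reached. */
char *prevchar_sketch(char *string, char *backup)
{
    char *prev = backup;
    char *p = backup;

    while (p < string) {
        prev = p;
        p = nextchar_sketch(p);
        if (p == prev)                  /* hit the terminator early */
            break;
    }
    return prev;
}

/* Hypothetical sketch of wcstrlen(): count characters, not bytes. */
size_t wcstrlen_sketch(char *string)
{
    size_t count = 0;

    while (*string != '\0') {
        string = nextchar_sketch(string);
        count++;
    }
    return count;
}
```

In the default C locale with plain ASCII text, the character count matches strlen(). After a call such as setlocale(LC_CTYPE, "") in a multi-byte locale, the two can differ, which is why both functions are needed.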
Internationalization is a tricky and involved problem. Unfortunately, it is
not possible for an existing program to be recompiled under an ANSI compiler
and become internationalized. A number of changes to the application are
required in order to provide for maximally portable code. However, it is
possible to provide the internationalization without breaking any existing
code.
What has been discussed so far is character-level internationalization,
which is only one side of the fence. The other side is language translation
of strings. This is known as "messaging" in the circles which talk about
internationalization (let's overload yet another computer science term...).
Messaging can be accomplished by developing messaging libraries which
contain the strings required by the application, translated into every
language which your application needs to support. When you wish to display a
string, such as "press spacebar to continue", you call the messaging library
with a unique identifier which is associated with your string, and the
messaging library returns a string, based on the current locale, which conveys
the same idea as "press spacebar to continue".
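A messaging library of this kind can be as simple as a table lookup. The following is a minimal sketch, not any existing library's interface: the names msg_id, catalog and get_message() are hypothetical, the locale is passed in as a plain string for simplicity, and the German text is an unvetted placeholder translation.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical message identifiers shared by every catalog. */
enum msg_id { MSG_PRESS_SPACE, MSG_COUNT };

/* One catalog per supported language, indexed by message identifier. */
struct catalog {
    const char *locale;
    const char *text[MSG_COUNT];
};

static const struct catalog catalogs[] = {
    { "C",  { "press spacebar to continue" } },
    { "de", { "Weiter mit der Leertaste" } },   /* placeholder translation */
};

/* Hypothetical lookup: return the text for id in the named locale,
 * falling back to the first ("C") catalog for unknown locales.
 * Returns NULL if id is out of range. */
const char *get_message(const char *locale, int id)
{
    size_t i;

    if (id < 0 || id >= MSG_COUNT)
        return NULL;
    for (i = 0; i < sizeof catalogs / sizeof catalogs[0]; i++) {
        if (strcmp(catalogs[i].locale, locale) == 0)
            return catalogs[i].text[id];
    }
    return catalogs[0].text[id];
}
```

The application then refers only to MSG_PRESS_SPACE and never embeds the English text itself, so adding a language means adding a catalog rather than editing the program logic.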
This also requires some fancy footwork on the part of applications, since
displaying these messages is bound to be very difficult: some
languages read left-to-right, some read right-to-left, and some, such as
Mongolian, do both and even go diagonally. Add string attributes such as
centering and justification, and character attributes such as inverse, normal
and blinking, and messaging becomes very interesting indeed.
Internationalization is a relatively new field, and a number of things
still need to be ironed out, but I think that we are making progress, and
that progress should continue.
--
Mark H. Colburn "They didn't understand a different kind of
NAPS International smack was needed, than the back of a hand,
mark at jhereg.mn.org something else was always needed."
More information about the Comp.lang.c
mailing list