Programming and international character sets.

Mark H. Colburn mark at jhereg.Jhereg.MN.ORG
Wed Nov 2 02:13:39 AEST 1988


In article <8804 at smoke.BRL.MIL> gwyn at brl.arpa (Doug Gwyn (VLD/VMB) <gwyn>) writes:
>In article <532 at krafla.rhi.hi.is> kjartan at rhi.hi.is (Kjartan R. Gudmundsson) writes:
>>How difficult is it convert american/english programs so that they can 
>>be used to handle foreign text? [etc.]
>
>Where have you been the last few years?  This subject area is known as
>"internationalization" and has been the featured topic of special issues
>of several journals, including UNIX Review and UNIX/World.  The draft
>proposed ANSI/ISO C standard specifically addresses this issue (it is
>one of the reasons production of the final standard was delayed).

Unfortunately, the C standard is still lacking in this area.  It is true
that the attempt was made, but X3J11 will have to go through another
round if the standard is to be truly internationalized.

One problem is that, although the standard supports multi-byte characters,
which are required for a number of languages around the world, especially
those in Asia, no support is provided to pass those characters to any of
the is...() or to...() functions.  Since all the is...() and to...()
functions take a single integer parameter, there is no way to hand them
a multi-byte character for evaluation.
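
To make the gap concrete, here is a minimal sketch of classifying a
multi-byte character by decoding it first.  It assumes a facility that is
not in the draft discussed here: the wide-character classification
function iswlower() from <wctype.h>, which was added to C by a later
amendment.  The helper name mb_islower() is hypothetical.

```c
#include <stdlib.h>   /* mbtowc, MB_CUR_MAX */
#include <wchar.h>    /* wchar_t, wint_t */
#include <wctype.h>   /* iswlower (later addition to C, assumed here) */

/*
 * Hypothetical helper: classify the multi-byte character that starts
 * at s according to the current locale.
 * Returns 1 if it is lowercase, 0 if it is not, and -1 if s does not
 * start a valid multi-byte character.
 */
int mb_islower(const char *s)
{
    wchar_t wc;
    int len;

    mbtowc(NULL, NULL, 0);              /* reset any shift state */
    len = mbtowc(&wc, s, MB_CUR_MAX);   /* decode one character */
    if (len < 0)
        return -1;                      /* invalid byte sequence */
    return iswlower((wint_t)wc) != 0;
}
```

The point of the sketch is only that the argument has to be a pointer into
the string, since a single int cannot carry an arbitrary multi-byte
sequence.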

Another problem is that an application has no way of portably
determining where the current character in a string ends and the next
begins; you can't just use ch++ to advance to the next character anymore.
And it is even harder to move backwards through a string.

There are some other problems with case mapping and collation as well.
Some languages have several lowercase characters corresponding to a single
uppercase character, or vice versa.  This presents problems when using
toupper() and tolower() to convert a character to its opposite case.  In
addition, in some languages and/or collation sequences there are characters
which do not have a corresponding opposite case (for example, a code set
may contain an uppercase character with no corresponding lowercase
character).

To be fair, we did not uncover these deficiencies until just recently (just
after we sent our ballot in for the third public review), so these may not
have been issues specifically addressed by the committee.

There are some solutions to these problems, which would allow for 
internationalization without breaking any existing programs.  Here are some
suggestions:

  1.  Develop some functions which provide the same functionality as the
      is...() functions but which take a character pointer as an argument.
      For example:

	int	wcislower(char *string)


  2.  Develop some functions which provide the same functionality as the
      to...() functions but which return a character pointer.  Unfortunately,
      these functions may need to allocate space in order for the
      transformation to work, or they may need to pass back a pointer to a
      static string which would then need to be copied.  The latter is
      probably the way most implementations would do it since it is
      essentially a table lookup.  For example:

	char   *wctolower(char *string)


  3.  Provide some functions to allow traversing a character string.  These
      functions would return a pointer to the next (or previous) character
      in the string as determined by the current locale.  For example:

	char   *nextchar(char *string)
	char   *prevchar(char *string, char *backup)

      These last two functions were presented at the latest IEEE POSIX meeting
      by one of the committee members to cope with this problem.  The backup
      string in prevchar() provides a pointer to a known character boundary
      that the function can use to scan forward in the string in order to
      determine where the previous character actually begins.


  4.  Some of the string functions would need to be revised as well, 
      specifically strlen().

	int	wcstrlen(char *string)

      This function would return the length of the string in characters
      according to the current locale setting.  Therefore the string "abss"
      would give a length of 4 in the C locale, but may return 3 in a
      German locale.  This functionality could be put in the current
      strlen, but there is still a need to get the number of bytes in a
      string as well as the number of characters, so the old strlen
      should not be replaced.
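
As an illustration only, the traversal and length functions suggested in
items 3 and 4 could be sketched on top of the draft standard's mblen().
The function names are the ones proposed above and are not part of any
standard library; the NULL and -1 error returns are my own assumptions,
as is the restriction to encodings without shift states.

```c
#include <stddef.h>   /* NULL */
#include <stdlib.h>   /* mblen, MB_CUR_MAX */

/*
 * Return a pointer to the character following the one at s, or NULL if
 * s is at the terminating null or does not start a valid character.
 * Assumes a stateless multi-byte encoding.
 */
char *nextchar(char *s)
{
    int len = mblen(s, MB_CUR_MAX);

    return (len > 0) ? s + len : NULL;
}

/*
 * Return a pointer to the character preceding s.  backup must point to
 * a known character boundary at or before s; the function scans forward
 * from there.  Returns NULL if s == backup (nothing precedes it) or if
 * the scan does not land exactly on s.
 */
char *prevchar(char *s, char *backup)
{
    char *prev = NULL;
    char *p = backup;

    while (p != NULL && p < s) {
        prev = p;            /* remember the last boundary before s */
        p = nextchar(p);
    }
    return (p == s) ? prev : NULL;
}

/*
 * Return the number of characters (not bytes) in s according to the
 * current locale, or -1 if an invalid sequence is found.
 */
int wcstrlen(char *s)
{
    int count = 0;

    while (*s != '\0') {
        s = nextchar(s);
        if (s == NULL)
            return -1;
        count++;
    }
    return count;
}
```

Note that prevchar() costs a scan from the backup pointer on every call,
which is why moving backwards is so much more expensive than moving
forwards.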

Internationalization is a tricky and involved problem.  Unfortunately it is
not possible for an existing program to simply be recompiled under an ANSI
compiler and become internationalized.  A number of changes to the
application are required in order to provide for maximally portable code.
However, it is possible to provide the internationalization without breaking
any existing code.

What has been discussed so far is character level internationalization,
which is only one side of the fence.  The other side is language translation
of strings.  This is known as "messaging" in the circles which talk about
internationalization (let's overload yet another computer science term...).
Messaging can be accomplished by developing messaging libraries which
contain the strings required by the application, translated into every
language which your application needs to support.  When you wish to display a
string, such as "press space bar to continue", you call the messaging library
with a unique identifier which is associated with your string, and the
messaging library returns a string, based on the current locale, which
conveys the same idea as "press space bar to continue".
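
A toy version of such a lookup might look like the following.  Everything
in it is hypothetical: the msg_id enumeration, the get_message() function,
the locale names and the sample German text are made up for illustration,
and a real library would load its catalogs from files rather than compile
them in.

```c
#include <stddef.h>   /* size_t, NULL */
#include <string.h>   /* strcmp */

/* Hypothetical message identifiers used by the application. */
enum msg_id { MSG_PRESS_SPACE, MSG_COUNT };

/* One catalog per supported locale; text[] is indexed by msg_id. */
struct catalog {
    const char *locale;
    const char *text[MSG_COUNT];
};

static const struct catalog catalogs[] = {
    { "C",  { "press space bar to continue" } },
    { "de", { "Leertaste druecken, um fortzufahren" } }  /* sample text */
};

/*
 * Return the text for id in the named locale.  Falls back to the first
 * catalog (the "C" locale) when the locale is unknown, and returns NULL
 * when the identifier is out of range.
 */
const char *get_message(const char *locale, enum msg_id id)
{
    size_t i;

    if ((int)id < 0 || (int)id >= MSG_COUNT)
        return NULL;
    for (i = 0; i < sizeof catalogs / sizeof catalogs[0]; i++) {
        if (strcmp(catalogs[i].locale, locale) == 0)
            return catalogs[i].text[id];
    }
    return catalogs[0].text[id];
}
```

The application then only ever refers to MSG_PRESS_SPACE, never to the
English text itself.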

This also requires some fancy footwork on the part of applications, since
displaying these messages is bound to be very difficult: some languages
read left-to-right, some read right-to-left, and some, such as
Mongolian, do both and even go diagonally.  Add string attributes such as
centering and justification and character attributes such as inverse, normal
and blinking and messaging becomes very interesting indeed.

Internationalization is a relatively new field, and a number of things
still need to be ironed out, but I think that we are making progress, and
that progress should continue.  
-- 
Mark H. Colburn                  "They didn't understand a different kind of 
NAPS International                smack was needed, than the back of a hand, 
mark at jhereg.mn.org                something else was always needed."


