Programming and international character sets.

Kjartan R. Gudmundsson kjartan at rhi.hi.is
Fri Oct 28 10:27:38 AEST 1988


How difficult is it to convert American/English programs so that they can
be used to handle foreign text? The answer of course depends on the language
one has in mind. Most European nations use the Latin alphabet, and the English
are among them. Unfortunately English uses very few characters compared to
other European languages, so the code set that is widely used by Americans
and the English, the ASCII character set, defines only 128 characters.
It is a 7-bit character set. In European countries other than England
the ASCII character set is also widely used, but with extensions.
The extended character set is 8-bit, thus allowing 256 characters.
The problem, however, is that the extension is not standardized.
We have one possibility in the IBM-PC character set, another from HP called
Roman-8, DEC gives us the DEC Multinational character set, and the Macintosh
has yet another. So a program that, for example, converts lowercase
letters to uppercase has to be coded differently for each character
set.
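
To make the point concrete, here is a minimal sketch of a table-driven
upper-case conversion. Nothing in it comes from MicroEMACS: the names
upper_table, init_upper_table, table_toupper and the two pair lists are
hypothetical. The code values for e-acute and u-umlaut in the IBM-PC and DEC
Multinational sets are quoted from memory and should be checked against the
vendors' tables before use.

```c
#include <assert.h>

/* Hypothetical per-character-set upper-case table (illustration only). */
static unsigned char upper_table[256];

/* Fill the table: identity first, then ASCII a-z, then the extra
 * {lower, upper} pairs that belong to one particular 8-bit set. */
void init_upper_table(const unsigned char (*pairs)[2], int npairs)
{
    int i;

    assert(npairs >= 0);
    for (i = 0; i < 256; i++)
        upper_table[i] = (unsigned char)i;
    for (i = 'a'; i <= 'z'; i++)          /* assumes an ASCII-based set */
        upper_table[i] = (unsigned char)(i - 'a' + 'A');
    for (i = 0; i < npairs; i++)
        upper_table[pairs[i][0]] = pairs[i][1];
}

/* Look a character up; the mask keeps all eight bits. */
int table_toupper(int c)
{
    return upper_table[c & 0xFF];
}

/* Assumed code values: e-acute and u-umlaut, {lower, upper}. */
const unsigned char ibm_pc_pairs[][2]  = { {0x82, 0x90}, {0x81, 0x9A} };
const unsigned char dec_mcs_pairs[][2] = { {0xE9, 0xC9}, {0xFC, 0xDC} };
```

Calling init_upper_table(ibm_pc_pairs, 2) and then table_toupper(0x82)
gives 0x90, while the same byte is left alone after
init_upper_table(dec_mcs_pairs, 2). The conversion logic stays the same and
only the table changes from one character set to the next.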

Let's look at some code from MicroEMACS:

input.org:        if (c>=0x00 && c<=0x1F)
input.org:        if (c>=0x00 && c<=0x1F)                 /* C0 control -> C-     */
main.org:				case 'a':	/* process error file */
main.org:        if ((c>=0x20 && c<=0xFF)) {	/* Self inserting.      */
random.org:		if (*scan >= 'a' && *scan <= 'z')
random.org:                else if (c<0x20 || c==0x7F)
random.org:                else if (c<0x20 || c==0x7F)
region.org:                                lputc(linep, loffs, c+'a'-'A');
region.org:                                lputc(linep, loffs, c-'a'+'A');
region.org:                        if (c>='a' && c<='z')
search.org:		else if (c < 0x20 || c == 0x7f)	/* control character */
word.org:					c += 'a'-'A';
word.org:				c += 'a'-'A';
word.org:				c -= 'a'-'A';
word.org:				c -= 'a'-'A';
word.org:			if (c>='a' && c<='z') {
word.org:			if (c>='a' && c<='z') {
word.org:		wordflag = ((ch >= 'a' && ch <= 'z') ||
word.org:	if (c>='a' && c<='z')
 

Ugly, isn't it?

Another way of doing this is to use the "is.." functions that are
declared in ctype.h, an include file that comes with (almost) all C compilers.
Some of the above lines would then look like this:

basic.c:                else if (iscntrl(c))
display.c:			if (iscntrl(c))
display.c:	} else if (iscntrl(c)) {
eval.c:			*sp = tolower(*sp);
eval.c:			*sp = toupper(*sp);
eval.c:		if (islower(*sp) )
fileio.c:	  if (iscntrl( fn[tel++] ) )
input.c:				if (iscntrl(buf[--cpos]) ) {
input.c:				if (iscntrl(buf[--cpos])) {
input.c:			c = toupper(c);
input.c:			c = toupper(c);			/*Force to upper */
input.c:		if ( islower(c) && ( SPEC != (SPEC & c) ))
input.c:	        if (iscntrl(c) )		/* control key */
input.c:	        if (iscntrl(c) )		/* control key */
input.c:	        if (iscntrl(c) )		/* control key? */

This code is better: it has more style to it and is easier to port to a
different character set. (Most of the is.. things are macros that typically
look the argument up in a table, mask the entry, and return a value that is
either zero or nonzero.)
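
As a small sketch of the ctype.h style, the two helpers below upper-case a
string and count control characters without mentioning a single code value.
The names upcase_in_place and count_controls are hypothetical and not taken
from MicroEMACS. The cast to unsigned char is an assumption about the
compiler: where plain char is signed, an 8-bit character would otherwise
reach the is.. routines as a negative number. Whether letters above 0x7F are
actually recognised depends on the C library and its locale settings.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: upper-case a string in place using whatever
 * the C library considers a lowercase letter. */
void upcase_in_place(char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        /* Go through unsigned char so 8-bit codes stay non-negative. */
        unsigned char c = (unsigned char)*s;
        if (islower(c))
            *s = (char)toupper(c);
    }
}

/* Hypothetical helper: count the control characters in a buffer. */
size_t count_controls(const char *buf, size_t len)
{
    size_t i, n = 0;

    assert(buf != NULL);
    for (i = 0; i < len; i++)
        if (iscntrl((unsigned char)buf[i]))
            n++;
    return n;
}
```

Porting this to another character set means changing the library's tables,
not hunting down every 'a' and 'z' in the source.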

Another bad habit of American programmers is this:
character_value = (character_value & 0x7F);
Don't do this!! If you must mask, you can use 0xFF instead:
character_value = (character_value & 0xFF);
(Unless of course your machine breaks to pieces if it gets
an 8-bit character in its I/O channels.)
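
A short sketch of what the 0x7F mask does to an 8-bit character. The helper
names strip_to_7bit and byte_value are hypothetical, and the byte 0xE9 is
just an example of a code above 0x7F.

```c
#include <assert.h>

/* Hypothetical helper: the habit criticised above. It clears the top
 * bit, so every code above 0x7F collides with some 7-bit ASCII code. */
int strip_to_7bit(int c)
{
    return c & 0x7F;
}

/* Hypothetical helper: turn a (possibly signed) char into its byte
 * value 0..255 without losing the eighth bit. */
int byte_value(char ch)
{
    int v = ch & 0xFF;

    assert(v >= 0 && v <= 0xFF);
    return v;
}
```

With these, strip_to_7bit(0xE9) yields 0x69, which is the ASCII code for 'i',
so the original character is silently turned into a different one, while
byte_value keeps the full 0xE9.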

###############################################################################
#                                     #
#	Kjartan R. Gudmundsson        #     
#	Raudalaek 12                  #     
#	105 Reykjavik                 #     Internet:  kjartan at rhi.hi.is      #
#                                     #     uucp:  ...mcvax!hafro!rhi!kjartan #
#                                     #                                       #
###############################################################################



More information about the Comp.lang.c mailing list