sizeof(char)

Guy Harris guy at sun.uucp
Wed Nov 12 21:38:51 AEST 1986


> If these data types were to have different sizes, then a few things
> would indeed break, as follows:
>	...

	declarations of pointers to fundamental storage units as
	"char *", "unsigned char *", etc. rather than as "storage_unit *".

Yes, they *can* be changed.  Programs that use "char" *can* also be changed
to use "long char".  The question is "which is more work"?  I am still not
convinced that changing those declarations is less work than changing code
that handles characters, especially since the latter code will have to be
changed anyway in many cases to make it work with non-ASCII character sets.

> With my proposal, VERY LITTLE need be changed in such code,
> since text handling is already being done with the idea that (char)
> represents a single character (see my NOTE above!);

I'm not talking about code that processes characters; I'm talking about code
that processes storage units.  Maybe I'm biased, since I've spent a fair bit
of time recently working with streams module code, where you do a *lot* of
stuffing of data structures into and extracting data structures from arrays
of storage units, but I'd rather not have to worry about that code, since it
is not the code I'd be changing to internationalize a system.
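
As a rough illustration of the kind of code I mean, here is a minimal sketch of stuffing a structure into, and extracting it from, an array of storage units.  The structure "msg_hdr" and the helpers "put_hdr" and "get_hdr" are made up for this example and are not taken from any real streams module; the point is only that the buffer is spelled "unsigned char *" and the arithmetic assumes sizeof(char) is 1.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical message header; not from any real streams module. */
struct msg_hdr {
    unsigned short type;
    unsigned short flags;
    unsigned long  length;
};

/*
 * Copy a header into a buffer of storage units at offset "off".
 * The buffer is declared "unsigned char *" because that is how C code
 * conventionally spells "pointer to storage unit"; the offset
 * arithmetic relies on sizeof(char) == 1.
 * Returns the offset just past the stored header.
 */
size_t put_hdr(unsigned char *buf, size_t off, const struct msg_hdr *h)
{
    assert(buf != NULL && h != NULL);
    memcpy(buf + off, h, sizeof *h);
    return off + sizeof *h;
}

/* Extract a header from the buffer; returns the offset just past it. */
size_t get_hdr(const unsigned char *buf, size_t off, struct msg_hdr *h)
{
    assert(buf != NULL && h != NULL);
    memcpy(h, buf + off, sizeof *h);
    return off + sizeof *h;
}
```

If "char" became two storage units wide, every "buf + off" and every buffer size computed this way would have to be revisited, even though none of this code has anything to do with text.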

> with (long char) approaches, a SUBSTANTIAL amount of rework would be
> needed.  To be fair, the amount of rework for (long char) can be reduced
> if one artificially constrains (long char)s so that neither byte is
> allowed to be zero except for the "null character" string terminator.

How much rework is needed to change "strcpy" to "lstrcpy"?  Note that, with
proper ANSI C declarations in <string.h>, changing the string types from
types derived from "char" to types derived from "long char" will cause the
compiler to flag many of these anyway.
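
A sketch of what that conversion might look like, under some loud assumptions: "long char" is not something a compiler accepts today, so the typedef "lchar" below stands in for it (the name and the choice of "unsigned short" are mine), and "lstrcpy" and "lstrlen" are hypothetical routines, not part of any existing library.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the proposed "long char"; name and width are assumptions. */
typedef unsigned short lchar;

/*
 * Hypothetical "long char" counterpart of strcpy(): copies characters
 * up to and including the terminating zero, and returns dst.
 */
lchar *lstrcpy(lchar *dst, const lchar *src)
{
    lchar *d = dst;

    assert(dst != NULL && src != NULL);
    while ((*d++ = *src++) != 0)
        ;
    return dst;
}

/* Hypothetical counterpart of strlen(): counts characters, not storage units. */
size_t lstrlen(const lchar *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (s[n] != 0)
        n++;
    return n;
}

/*
 * With the prototype above in scope, a call that was not converted
 * draws a diagnostic for the mismatched pointer types:
 *
 *     char old[16];
 *     lstrcpy(old, src);    -- "char *" passed where "lchar *" is expected
 */
```

The mechanical part of the change is the renaming; the prototypes are what let the compiler find the places that were missed.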

> I finally should remark that Guy Harris shows every sign of having made
> his mind up on the issue in advance of knowing what was proposed.

Oh, good grief.  The only thing I've "made up my mind on" is that I am not
convinced by the claim that there isn't much work involved in making all C
code work correctly if "char" is not the fundamental unit of storage.

> The fact that he labeled my comments about implications of the strcoll()
> approach "bullshit" and proceeded to explain setlocale() to me indicate
> that he isn't LISTENING to what I'm saying; after all, I'm one of the
> people who decided how those facilities would be specified.

If it is indeed the case that there is more than one way of sorting text in,
say, Oriental languages, then either 1) "setlocale" is a poor name, because
it takes into account more than just the locale, or 2) it is a poor routine,
because it doesn't take into account more than just the locale.

I notice in my copy of "Inside Macintosh" that they *do* support more than
one collating sequence for their extended character set for the benefit of
German (the vowels equipped with diaereses sort in the same place as the
unadorned vowels in the primary ordering sequence for non-German languages,
but sort in the same place as the ligature composed of that vowel and the
letter "e" in the primary ordering sequence for German).  Now I am not
willing to rule out the possibility that a site might want to have documents
in both French and German.  As such, code that would sort lists of
names in these documents would have to set the "locale" based on an
indication of the language the document is in (not from, say, an environment
variable).
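
A minimal sketch of what such code might look like, assuming the draft's "setlocale" and "strcoll" behave as described.  The routine "sort_names" and its "doc_locale" argument are hypothetical; how a document's language gets mapped to a locale name is outside the sketch, and which locale names exist is up to the implementation.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* qsort() comparison routine: orders two strings by the current LC_COLLATE. */
static int coll_cmp(const void *a, const void *b)
{
    return strcoll(*(char *const *)a, *(char *const *)b);
}

/*
 * Sort "names" using the collating sequence for the document's language.
 * "doc_locale" is the locale name chosen from the document itself,
 * not from an environment variable.
 * Returns 0 on success, -1 if the locale cannot be set or saved.
 */
int sort_names(char **names, size_t n, const char *doc_locale)
{
    char *cur, *saved;

    assert(names != NULL && doc_locale != NULL);

    /*
     * Remember the current collation locale.  The returned string may be
     * overwritten by a later setlocale() call, so keep a private copy.
     */
    cur = setlocale(LC_COLLATE, NULL);
    if (cur == NULL)
        return -1;
    saved = malloc(strlen(cur) + 1);
    if (saved == NULL)
        return -1;
    strcpy(saved, cur);

    if (setlocale(LC_COLLATE, doc_locale) == NULL) {
        free(saved);
        return -1;
    }
    qsort(names, n, sizeof *names, coll_cmp);

    /* Put the previous collation locale back for the rest of the program. */
    setlocale(LC_COLLATE, saved);
    free(saved);
    return 0;
}
```

Note that the locale is process-wide state here, so a program handling French and German documents has to switch it back and forth around each sort.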

The claim you made, that "strcoll() amounts to a declaration that there
IS a natural multibyte collating sequence for any single environment", is a
little hard to parse.  I assume you mean that "by specifying that there
is such a routine, the proposers of strcoll() are declaring that there IS a
natural multibyte collating sequence for any single environment."  Given
that "setlocale" exists, I fail to see how it declares this, unless
"environment" is defined so that an environment always specifies a single
collating sequence.  In the latter case, the claim is true, but trivially so.

> I'm fully prepared to admit that there are pros and cons to any alternative
> solution to the multi-byte character issue (or to bitmap programming
> issues, if that's more your concern), and that one might rationally
> disagree with my proposal because of different value weighting of the
> trade-offs.

Fine.  Are you prepared to admit that there *is* a non-trivial trade-off
involved in the "short char" proposal (i.e., that it is not a given that
few, if any, lines of *existing* code need change so that it can work
equally well in a one-storage-unit "char" and a two-storage-unit "char"
environment), and that some people might rationally disagree with your value
weighting of the changes needed to existing code to make it work in a
two-storage-unit "char" environment and to make it work in a "long char"
environment?
-- 
	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy at sun.com (or guy at sun.arpa)


