sizeof(char)

Doug Gwyn gwyn at brl-smoke.ARPA
Tue Nov 11 22:32:03 AEST 1986


In article <1305 at ttrdc.UUCP> levy at ttrdc.UUCP (Daniel R. Levy) writes:
>This seems too simple.  So, what have I missed?

There are a couple of factors.  First, if I know that accessing a bit
(whether by macro or by language-supported data type) is going to actually
load up a whole word, perform separate masking operations, then store
the word back in memory, as opposed to a direct hardware access of the
bit, I am likely to design my algorithms quite differently and explicitly
handle words as well as bits in my bitmap code.
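
A minimal sketch of the word-masking approach described above, for a machine with no direct bit addressing. All names here (bm_word, bm_test, bm_set, bm_clear, bm_count) are hypothetical and do not come from the article or from any proposal. The last routine shows the kind of explicit word handling one might add once the cost of per-bit access is known.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical names; the word type is an assumption for illustration. */
typedef unsigned int bm_word;
#define BM_WORD_BITS (sizeof(bm_word) * CHAR_BIT)

/* Read one bit: load the containing word, shift, then mask. */
int bm_test(const bm_word *map, size_t i)
{
    assert(map != NULL);
    return (int)((map[i / BM_WORD_BITS] >> (i % BM_WORD_BITS)) & 1u);
}

/* Set one bit: load the word, OR in the mask, store the word back. */
void bm_set(bm_word *map, size_t i)
{
    assert(map != NULL);
    map[i / BM_WORD_BITS] |= (bm_word)1 << (i % BM_WORD_BITS);
}

/* Clear one bit: load the word, AND with the inverted mask, store it back. */
void bm_clear(bm_word *map, size_t i)
{
    assert(map != NULL);
    map[i / BM_WORD_BITS] &= ~((bm_word)1 << (i % BM_WORD_BITS));
}

/* Word-aware counting: visit each word once instead of testing every bit. */
size_t bm_count(const bm_word *map, size_t nwords)
{
    size_t n = 0, w;

    assert(map != NULL);
    for (w = 0; w < nwords; w++) {
        bm_word x = map[w];
        while (x != 0) {
            x &= x - 1;   /* clears the lowest set bit */
            n++;
        }
    }
    return n;
}
```

Every call to bm_set or bm_clear costs a load, a mask operation, and a store, which is why an algorithm written for this environment tends to work on whole words where it can.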

Second, in order to support straightforward programming techniques, such
as looping through arrays, incrementing pointers, etc., the data type has
to be officially blessed as a basic or derived type by the compiler.

I should perhaps remind everyone that I am discussing explicitly NON-
portable bitmap programming, since I have NOT proposed that ALL C
implementations directly support bit-sized data objects.  For PORTABLE
bitmap programming (assuming you are concerned about it), one would
indeed have to assume the worst and be prepared to handle word-masking.
In case people have forgotten, C is not only a language for portable
application programming, but it is also (even foremost) a system
implementation language.  Nitty-gritty system-level programming often
has to deal with specifics of the hardware architecture.  Software
portability is important (most of you should be aware by now that I have
strong feelings about that), but concern for it should not be allowed to
limit the options of people who have an actual requirement for using C
in intrinsically "dirty" ways.

----------

Allow me to repeat:  proposal X3J11/86-136 is actually intended to help
solve the MULTI-BYTE CHARACTER PROBLEM (which DOES exist).  Its
possible ramifications for system-specific bitmap programming are really
a side issue, although such considerations can help clarify exactly what
implementation possibilities are opened up by the formal proposal.

Note that I am careful to distinguish between a character, by which I
mean an individually manipulable unit that represents a natural piece of
text, and a (char), which is a basic C data object.  I refer to
individually addressable storage units as bytes, no matter how many bits
they consist of.  If you don't keep these distinctions in mind, you will
NOT understand my proposal or explanations!

If I were developing programs to run on an Imagen print station, I would
very much prefer my compiler to support "Galactic ASCII" (16-bit data)
as a basic data type, in fact as a (char).  If I am developing DMD code,
I very much prefer my compiler/hardware to directly support the individual
bit as a basic data type/byte, but I also need characters separately; in
fact GASCII would be ideal for the DMD.  If I were developing a generic
operating system for world-wide distribution, I would very much prefer my
compiler to support individual text elements (characters; note that
"letters" is too limited a term for this) as basic data types.  All that
my proposal does is to allow compiler implementers the FREEDOM to choose
these trade-offs appropriately for the intended major application; it
doesn't force any particular choice for character or byte basic data
object sizes.  (However, if one uses an inappropriate choice for the
application, or if one doesn't have control over the compiler that will be
used, then one HAS to resort to "lowest common denominator" assumptions in
one's coding; this is also the current state of affairs.  I really don't
think insisting that a (char) must necessarily be an 8-bit byte, which is
ALREADY FALSE for K&R and X3J11 C, will help this situation.)
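
As an illustration of code that does not assume an 8-bit (char), here is a small sketch using CHAR_BIT from <limits.h>, as found in standard C. It reflects C as later standardized, where sizeof(char) is 1 by definition, and not the (short char) scheme of the proposal; the function names are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Number of bits in an object occupying nchars (char)-sized units. */
size_t bits_in(size_t nchars)
{
    return nchars * CHAR_BIT;
}

/* Number of (char)s needed to hold nbits bits, rounding up.
   Nothing here assumes that CHAR_BIT is 8. */
size_t chars_for_bits(size_t nbits)
{
    /* Guard against wraparound in the rounding addition below. */
    assert(nbits <= (size_t)-1 - (CHAR_BIT - 1));
    return (nbits + CHAR_BIT - 1) / CHAR_BIT;
}
```

Code written this way gives the same answers in terms of bits whether the implementation's (char) is 8, 9, or 16 bits wide.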

If you're worried about the possible impact of the proposal on your own
code, perhaps I should reassure you:  Of the approximately HALF-MILLION
lines of C code that I maintain (mostly written by others, practically
none of whom worried about these matters), not a single line is affected
by my proposal so long as the compiler implementer continues to choose to
make (char) and (short char) have the same size.  If these data types
were to have different sizes, then a few things would indeed break, as
follows:
	use of sizeof"string_constant" instead of
	strlen("string_constant")+1 : occurs in about 10 places (I had to
	find all these once, since an older Gould compiler insisted that
	sizeof"string_constant"==sizeof(char *) .)

	coercing of other pointer types to (char *), doing address
	arithmetic, then coercing the pointer back: this is atrocious
	practice in the first place, and seldom occurs; I estimate at most
	20 to 50 places would need to be fixed, by using (short char *)
	instead of (char *) (or better, by redesign of the code).

	specifically byte I/O routines, such as are required to meet
	predefined or machine-independent protocols: these occur in
	nearly 100 places, and most of them are written so that they
	make rather severe assumptions about the run-time environment,
	usually that getc/putc necessarily input/output precisely 8 bits
	at a time.  It is simple to adapt these to the multi-byte (char)
	environment, such as by using getsc/putsc, but such pieces of
	code are necessarily implementation-dependent anyway and should
	always be checked when porting to significantly different
	environments.
Actually, most of this code was developed for a 7-or-8 bit character,
one character per (char), environment and POTENTIALLY needs a fair amount
of rework for a more general character environment no matter WHAT approach
is used.  With my proposal, VERY LITTLE need be changed in such code,
since text handling is already being done with the idea that (char)
represents a single character (see my NOTE above!); with (long char)
approaches, a SUBSTANTIAL amount of rework would be needed.  To be fair,
the amount of rework for (long char) can be reduced if one artificially
constrains (long char)s so that neither byte is allowed to be zero except
for the "null character" string terminator.  Such a constraint is not at
all necessary with my approach, for which a "null character" is precisely
one that has 0 numeric value (without worrying about subfields), as in
current K&R and X3J11 C.  Note also that an artificial constraint is
known by the pejorative name of "kludge"; some of us have an aversion,
not necessarily irrational, to kludges.
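
The first and third items in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not code from the half-million lines mentioned: the function names are hypothetical, and the byte order chosen for the 16-bit value (most significant first) is an arbitrary example of a predefined protocol.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Storage needed for a string including its terminator, counted in
   (char)s.  Unlike sizeof "literal", this does not depend on what
   unit sizeof counts in. */
size_t string_storage_chars(const char *s)
{
    assert(s != NULL);
    return strlen(s) + 1;
}

/* Split a 16-bit value into two 8-bit quantities, most significant
   first.  Each element holds a value in 0..255 no matter how wide the
   implementation's (unsigned char) really is. */
void put_u16_octets(unsigned int value, unsigned char out[2])
{
    assert(out != NULL);
    out[0] = (unsigned char)((value >> 8) & 0xFFu);
    out[1] = (unsigned char)(value & 0xFFu);
}

/* Reassemble the value, masking each element so that any bits beyond
   the low 8 are ignored. */
unsigned int get_u16_octets(const unsigned char in[2])
{
    assert(in != NULL);
    return ((unsigned int)(in[0] & 0xFFu) << 8) |
           (unsigned int)(in[1] & 0xFFu);
}
```

The explicit masks are the point: the protocol is defined in 8-bit quantities, so the code states that width itself instead of relying on getc/putc happening to move exactly 8 bits.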

I finally should remark that Guy Harris shows every sign of having made
his mind up on the issue in advance of knowing what was proposed.  The
fact that he labeled my comments about implications of the strcoll()
approach "bullshit" and proceeded to explain setlocale() to me indicates
that he isn't LISTENING to what I'm saying; after all, I'm one of the
people who decided how those facilities would be specified.  Who does he
think he is?  The implication is that I must be terribly stupid since I
don't understand stuff I helped design.  If instead one were to assume
the more likely theory that I DO understand the significance of those
facilities, then it would appear that Guy doesn't appreciate the point I
was making.  My guess is that he is so accustomed to responding to
ignorant amateurs in this newsgroup that he automatically assumes when
he doesn't immediately agree with someone they too must be "morons" and
their remarks are consequently not worth the effort or courtesy of
understanding before responding.  Because I have taken a lot of trouble
in choosing my exact wording, I also resent very much his apparent
assumption that my words represent sloppy approximate concepts; just
because many people write like that is no reason to assume that I do!

Rather than be misled by other people's misconceptions, if you seriously
want to evaluate my proposed solution to the multi-byte character problem
and don't have access to X3J11/86-136, then refer to the latter part of
my article <5310 at brl-smoke.ARPA> (pretty much skipping the discussion of
bitmap programming until after you understand the logical meaning of the
formal proposal), rather than relying on the hash made of the proposal in
some people's responses.  Try assuming that I have NOT made some trivial
blunder, then figure out what my point of view is that allows me to make
the claims that I have been making.  Once you understand precisely WHAT I
have in mind, only THEN go back and examine counter-responses.  (This is
the approach that you should be taking to intellectual issues anyway.)

I'm asking that you figure out this proposal from what I have presented,
rather than spending lots of net time arguing over misconceptions.  I'm
fully prepared to admit that there are pros and cons to any alternative
solution to the multi-byte character issue (or to bitmap programming
issues, if that's more your concern), and that one might rationally
disagree with my proposal because of different value weighting of the
trade-offs.  However, rational discussion first requires accurate
communication and understanding of the ideas in question.  I've done the
best I can to explain them; now it's your turn to do the best you can to
understand them.  Otherwise, let's end the discussion now.



More information about the Comp.lang.c mailing list