sizeof(char)

Guy Harris guy at sun.uucp
Mon Nov 10 10:03:11 AEST 1986


> Guy is still missing my point about bitmap display programming;
> I have NOT been arguing for a GUARANTEED PORTABLE way to handle
> individual bits, but rather for the ability to do so directly
> in real C on specific machines/implementations WITH THE FACILITY:

Why the hell does it matter whether "real C" is used or not?  You DON'T NEED
"standard" pointers in C just to use bit-addressing hardware on machines
that have it.  Since you state below that "portable graphics programming
SUCKS", the fact that vanilla ANSI C has no constructs to support bit
addressing is a non-issue.

> Now, MC68000 and WE32000 architectures do not support this (except for
> (short char)s that are multi-bit pixels).  But I definitely want the
> next generation of desktop processors to support bit addressing.

If you're going to convince Motorola, Intel, National Semiconductor, DEC,
MIPS, etc., etc.  to put bit-addressing into their next generation of chips,
you're going to have to give them a justification of why it makes life
better for graphic applications.  Some of them are building graphic chips
instead; why should they stuff that sort of thing into the CPU?

> I am fully aware that programming at this level of detail is non-portable,
> but portable graphics programming SUCKS, particularly at the interactive
> human interface level.  Programmers who try that are doing their users
> a disservice.

Would you please justify that statement?  The only way in which portability
affects the user interface is if it makes the application run more slowly.
There could very well be hardware where using your latest CPU chip's
bit-pointer instructions makes things run *more slowly* than using some
other piece of hardware in the system.  Designing the graphics code so that
most of it deals with the hardware portably at a high level, and shoving the
device dependencies down to a lower level that the bulk of the graphics code
uses, may make the application run more efficiently.
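
Something along these lines is what I mean; the structure and the names in
it are invented purely for illustration:

    /*
     * The bulk of the graphics code calls these operations; only the
     * table of function pointers behind them knows whether the frame
     * buffer is addressed by bits, by bytes, or by a graphics chip.
     */
    struct raster_ops {
        void    (*set_pixel)(int x, int y, int value);
        void    (*copy_rect)(int sx, int sy, int dx, int dy, int w, int h);
    };

    extern struct raster_ops *dev;      /* chosen for the device at startup */

    void draw_hline(int x, int y, int len, int value)
    {
        int i;

        for (i = 0; i < len; i++)
            dev->set_pixel(x + i, y, value);
    }

A device whose CPU has bit-pointer instructions can supply operations that
use them; a device with a graphics coprocessor can hand the work to that
instead, and the code above never knows the difference.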

> I say this from the perspective of one who is considered
> almost obsessively concerned with software portability and who has been
> the chief designer of spiffy commercial graphic systems (and who
> currently programs DMDs and REAL frame buffers, not Suns).

What is a "real frame buffer"?  Why is a DMD different from a Sun in this
respect?  Neither of their processors have bit addressing.

> 	ESSENTIAL:
> 		(1) New type: (short char), signedness as for (char).
> 		(2) sizeof(short char) == 1.
> 		(3) sizeof(char) >= sizeof(short char).
> 		(4) Clean up wording slightly to improve the
> 		    byte (storage cell) vs. character distinction.

None of these are "essential".  They may merely make certain kinds of
programming more convenient - maybe.  You still have provided no evidence
for your claim that bit addressing in the instruction set is some kind of
necessity; nor have you provided any justification whatsoever for your claim
that you have to make "char *" be a bit pointer in order to make bit
pointers usable in a high-level language.  Since you're not writing portable
code, who *gives* a damn if the facility you're using is part of the
"standard" language or not?

> I've previously pointed out that this has very little impact on most
> existing code, although I do know of exceptions.  (Actually, until the
> code is ported to a sizeof(short char) != sizeof(char) environment,
> it wouldn't break in this regard.  That port is likely to be a painful
> one in any case, since it would probably be to a multi-byte character
> environment, and SOMEthing would have to be done anyway.  The changes
> necessary to accommodate this are generally fewer and simpler under my
> proposal than under a (long char)/lstrcpy() approach.)

One of those "changes" would be to track down all occurrences of "char" that
really mean "storage_unit" in *existing* code, and changing it.  The code
that uses "char" as "storage_unit" is code that does NOT, in general, know
about the text environment.  The code that would have to be changed for a
"long char" environment would be the code that *does* deal with text; since
this code has to change anyway, I see changing *this* code as simpler than
rewhacking all the code that's interested in storage units.
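
To make the distinction concrete (the typedef name here is hypothetical;
the fact that there is no standard one is exactly the problem):

    typedef unsigned char storage_unit; /* hypothetical name */

    /* raw-storage code: knows nothing about the text environment */
    void copy_units(storage_unit *dst, storage_unit *src, unsigned n)
    {
        while (n-- > 0)
            *dst++ = *src++;
    }

    /* text code: the part that has to be examined for multi-byte
       characters under either proposal */
    int count_spaces(char *s)
    {
        int n = 0;

        while (*s != '\0')
            if (*s++ == ' ')
                n++;
        return n;
    }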

> I won't bother responding in detail on other points, such as use of
> reasonable default "DP shop" collating sequences analogous to ASCII
> without having to pack/unpack multi-byte strings.  (Yes, it's true
> that machine collating sequence isn't always appropriate -- but does
> that mean that one never encounters computer output that IS ordered by
> internal collating sequence?  Also note that strcoll() amounts to a
> declaration that there IS a natural multibyte collating sequence for
> any single environment.)

Bullshit.  The proposal *I* see in front of me, in the latest P1003 mailing,
says that "the appropriate ordering is determined by the program's current
locale"; "strcoll" does NOT always perform the same transformation - its
action is determined by what the last locale the process set the current
locale to was.  (Also note that "strcoll" does not have to map single
characters to single characters - in fact, it probably won't for many
languages.)
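
In other words, the same call can order the same pair of strings differently
depending on the current locale.  Roughly, as I read the proposal (the
locale names are whatever an implementation chooses to provide):

    #include <locale.h>
    #include <string.h>

    /* the result depends on whichever locale was set most recently,
       not on any fixed machine collating sequence */
    int collate_in(char *locale_name, char *a, char *b)
    {
        (void) setlocale(LC_COLLATE, locale_name);
        return strcoll(a, b);
    }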

You also don't have to "pack/unpack multi-byte strings" a lot; you may have
to do that when reading from or writing to a file, but that's life.  If you
store 16-bit characters in files, even for plain ASCII text, you're going to
have to unpack ASCII files, at least, that come from other machines.
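
And the packing in question is confined to the file boundary.  A sketch,
assuming 16-bit characters in core and 8-bit bytes in the file, with
"long_char" standing in for whatever type the implementation actually
provides:

    typedef unsigned short long_char;   /* stand-in for a 16-bit char type */

    /* expand 8-bit bytes from a plain ASCII file into 16-bit characters */
    void unpack_ascii(long_char *dst, unsigned char *src, unsigned n)
    {
        while (n-- > 0)
            *dst++ = *src++;             /* zero-extends each byte */
    }

    /* and the reverse, for writing such a file back out */
    void pack_ascii(unsigned char *dst, long_char *src, unsigned n)
    {
        while (n-- > 0)
            *dst++ = (unsigned char) *src++;
    }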

If something is ordered by internal collating sequence, there is no
guarantee that this ordering is necessarily meaningful to a human.  As such,
all the string comparison routine does is impose some total order on
character strings, so you can use any comparison routine you want, such as
"lstrcmp" or "strcmp".

> and discovered that much of it was due to the unquestioned assumption
> that "16-bit" text had to be considered as made of individual 8-bit
> (char)s.

I have yet to see this presented as an "unquestioned assumption".  The
impression *I* had of the AT&T proposal was that 16-bit text is considered
to be made of individual 16-bit "long char"s, and would be manipulated as
such.  You may have to pack and unpack strings made of 8-bit bytes when
reading from and writing to text files, but you don't have any choice unless
you want your whizzy 16-bit-"char" machine to be unable to export even plain
ASCII files to or import them from 8-bit-"char" machines.  You may make this
pack/unpack function a command, and require users to run this command on
files imported from other machines before they use those files; I don't
guarantee that requiring plain ASCII files to take 16 bits per character on
these machines will be considered acceptable by the users of these machines.

> If one starts to write out a BNF grammar for what text IS, it becomes
> obvious very quickly that that is an unnatural constraint.

Since I have yet to see any less-formal notion of "what text is" that can be
used as a jumping-off point for such a formal definition, if one starts to
write out a BNF grammar for "what text is" one should be reminded that
one had better know "what text is" before starting the project.  How does
one represent font or point size?  Does one consider text to be made up of
characters or glyphs?  If the latter, does one jump to 32-bit or 64-bit
characters?  (If this is done, a naive "strcmp" will consider the boldface
and italic forms of "foo" to be different, as well as a 10-point and a
12-point "foo", or a Times Roman and Helvetica "foo".)

> Before glibly dismissing this as not well thought out, give it a genuine
> try and see what it is like for actual programming; then try ANY
> alternative approach and see how IT works in practice.

Before glibly dismissing the effort involved in converting programs that use
"char" to represent the smallest unit of storage C lets you address (having
had no better choice) to use "storage_unit" or something like that, give it
a genuine try (this means porting UNIX to such an environment, as well as
several common commercially-available applications); then try writing code
for a 16-bit character environment with a "long char" C implementation.
Then see whether the effort involved in doing the former is greater than the
extra effort involved in using a "char"/"long char" environment rather than
a "storage_unit"/"char" environment.

> If you prefer, don't consider my proposal as a panacea for such issues,
> but rather as a simple extension

It's not a "simple extension", given the fact that "char" has historically
been used to represent the "storage_unit" data type.  If there had been a
standard "typedef" that everybody used for this, I might be more inclined to
consider allowing "char" to be larger than "storage_unit" to be simple.

> What I DON'T want to see is a klutzy solution FORCED on all implementers,
> which is what standardizing a bunch of simultaneous (long char) and (char)
> string routines (lstrcpy(), etc.) would amount to.  If vendors think it
> is necessary to take the (long char) approach, the door is still open
> for them to do so under my proposal (without X3J11's blessing), but
> vendors who really don't care about 16-bit chars (yes, there are vendors
> like that!) are not forced to provide that extra baggage in their
> libraries and documentation.

Why not make support for "long char" optional, then?  Vendors who don't care
about them need not provide them; *customers* who do can either pressure
this vendor to implement them or buy other machines.

> The fact that more future CPU architectures may support tiny data types
> directly in standard C than at present is an extra benefit from my
> approach to the "multi-byte character" problem; it wasn't my original
> motivation, but I'm happy that it turned out that way.  (You can bet
> that (short char) would be heavily used for Boolean arrays, for example,
> if my proposal makes it into the standard;

It would only be so used on C implementations that support bit addressing.
(Note that there is NOT necessarily a 100% correlation between support of
bit addressing in hardware and support of bit addressing in a C
implementation for that hardware.  Even if the notion of RISCs turns out not
to have been the right idea, one benefit it will leave behind is the end of
the naive notion that there *must* be a one-to-one correspondence between
hardware features and programming-language features.)

You can bet that "bit" (not "short char", PLEASE; as a name it says nothing
that conveys the notion of "smallest addressable unit") would NOT be used at
all for Boolean arrays on implementations that do not support it.  As such,
either every implementation would have to support it or it would not be used
in portable code.  If the latter is the case, then you might as well just
throw it in as a special extension; if lots of implementors want to add this
extension, they should come up with a common scheme, work with it, beat the
bugs out of it, and then see if they can get the notion of this as a
"standard extension" into X3J11.  If you can even implement this reasonably
well on machines without bit pointers (where "reasonably well" means "code
written naively for machines with bit pointers compiles into code about as
good as the code that fairly standard idioms for this sort of thing compile
into"), it should perhaps become part of the core of C.
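
For reference, the "fairly standard idioms" I mean are the usual
shift-and-mask macros; a bit type would have to compile into code at least
this good on machines without bit pointers:

    #define BITS_PER_CHAR   8   /* assumed, for the sketch */

    /* a Boolean array packed into chars, addressed by shifting and masking */
    #define SETBIT(a, i) ((a)[(i) / BITS_PER_CHAR] |=  (1 << ((i) % BITS_PER_CHAR)))
    #define CLRBIT(a, i) ((a)[(i) / BITS_PER_CHAR] &= ~(1 << ((i) % BITS_PER_CHAR)))
    #define TSTBIT(a, i) (((a)[(i) / BITS_PER_CHAR] >> ((i) % BITS_PER_CHAR)) & 1)

    /* a 1000-element Boolean array in 125 chars */
    unsigned char flags[(1000 + BITS_PER_CHAR - 1) / BITS_PER_CHAR];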

> However, since I've shown that a clean conceptual model for such text
> IS workable, there's no excuse for continued claims that explicit
> byte-packing and unpacking is the only way to go.

You *think* you've shown this.  I disagree, as do others whose judgement I
respect.  Furthermore, the only "explicit byte-packing and unpacking" that a
reasonable "long char" proposal would require would be for converting things
like ASCII text imported from other machines.

In some problem domains, formal elegance is one of the most important
criteria for a "good" solution.  Unfortunately, this problem is an
engineering problem; in this, as in other real-world problem domains, formal
elegance is nice but sometimes you just have to put other things first.
-- 
	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy at sun.com (or guy at sun.arpa)


