ambiguous ?

Richard O'Keefe ok at cs.mu.oz.au
Wed Oct 25 15:50:21 AEST 1989


In article <14115 at lanl.gov>, jlg at lanl.gov (Jim Giles) writes:
> From article <11398 at smoke.BRL.MIL>, by gwyn at smoke.BRL.MIL (Doug Gwyn):
> > It looks to me like Bill Wells was merely stating the facts that are
> > apparent to him.  Your inane comments about C's character processing
> > being less efficient (than in other programming languages) back him
> > up in his assessment.  

> I don't see how.  C character strings are null terminated rather
> than keeping the length of the string explicitly.  The result of this
> is that hardware with specialized instructions for character processing
> cannot be used as efficiently in C as with other languages.  The C
> strings always have to be prescanned to determine their length before
> the operation you are _really_ interested in can be performed.

You are both right.

    C is an exceptionally good language for CHARACTER processing.
    C is a rather bad language for STRING processing.

I have an anecdote:  a friend of mine spent a couple of months writing a
fairly "batch" editor in PL/I to run on an IBM mainframe.  He made extensive
use of PL/I's CHARACTER(LENGTH) VARYING; that's what the type is for, right?

(BEGIN DIGRESSION:

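    /* Counted string variable: room for N characters, initially empty. */
    #define DclCharVar(Vbl, N) struct { \
        int curlen; \
        char curtxt[N]; \
    } Vbl = {0}
    /* Counted string constant: sized to hold S without its trailing NUL. */
    #define DclCharCon(Vbl, S) struct { \
        int curlen; \
        char curtxt[-1 + sizeof S]; \
    } Vbl = {-1 + sizeof S, S}
    DclCharVar(a, 20);
    DclCharCon(b, "A literal value");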

    gives you roughly the same effect as

    DCL A CHAR(20) VARYING, B CHAR(15) INITIAL("A literal value");

    in PL/I, except that you're still missing the library of built-in
    functions and the compiler optimisations.
END DIGRESSION)

It was extremely painful for him to do this; none of the built-in functions
was quite what he wanted, and any string functions he wrote himself ran an
order of magnitude slower than the built-in ones.  (User-written functions
did not exploit special hardware and the compiler didn't know how to
optimise them.)

As an exercise in learning C, I implemented the same editor on a PDP-11/60
running V6+ UNIX.  It took me 3 days to write and 2 more to debug, and
was about 12 pages long compared with my friend's 60.  It was also faster
on the 11/60 than my friend's program on an IBM 4331 (I think it was a
4331; might have been bigger).

What happened?  All things considered, I think my friend was a better
programmer than I was.  The point was that he was starting from a STRINGS
language where the built-in functions were fast but anything else was hard,
whereas I was starting from a CHARACTERS language where I was able to
synthesise precisely the operations that I needed (3 pages to define the
``string'' functions I wanted).

If you insist on seeing array-of-byte-with-NUL-terminator as *THE*
equivalent in C of strings, you are going to be in big trouble.  The
C library actually supports THREE different representations:
	unbounded-array-of-byte-with-NUL-terminator  (str* functions)
	at-most-N-bytes-with-NUL-terminator-if-short (strn* functions)
	array-of-exactly-N-bytes                     (mem* functions)
VMS C programmers use a fourth representation, "descriptor", which has
extensive support in the VMS runtime library.  C provides direct
syntax for literals of only one type, but as I showed above, it isn't
hard to come up with macros to declare named constants of the other
types.  (VMS C already has such a macro for "descriptors".)
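
To make the difference between the three conventions concrete, here is a
minimal sketch using only standard <string.h> calls (strncpy, memchr,
memcmp).  The names field_set, field_len, block_equal and FIELD_WIDTH are
made up for illustration; they are not part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Width of a fixed-size text field (hypothetical; pick what you need). */
enum { FIELD_WIDTH = 8 };

/* strn* convention: store a NUL-terminated string into a fixed-width
   field.  strncpy pads a short string with NULs and leaves NO terminator
   when the string fills (or overflows) the field. */
void field_set(char field[FIELD_WIDTH], const char *s)
{
    assert(field != NULL && s != NULL);
    strncpy(field, s, FIELD_WIDTH);
}

/* Length of the text held in such a field: the position of the first
   NUL, or FIELD_WIDTH if the field is completely full. */
size_t field_len(const char field[FIELD_WIDTH])
{
    const char *nul = memchr(field, '\0', FIELD_WIDTH);
    return nul ? (size_t)(nul - field) : (size_t)FIELD_WIDTH;
}

/* mem* convention: exactly n bytes are compared, and a NUL is just
   another byte of data. */
int block_equal(const void *a, const void *b, size_t n)
{
    return memcmp(a, b, n) == 0;
}
```

With these, field_set(f, "abc") leaves a field of length 3 padded with
NULs, while field_set(f, "abcdefghij") leaves a full 8-byte field with no
terminator at all, so passing f to strlen() would be a bug.  Likewise
strcmp("a\0b", "a\0c") reports equality because it stops at the first NUL,
whereas block_equal on the same 3 bytes does not.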

If you insist on doing text processing with strings, you are making a
pretty big mistake no matter what language you are using.  For example,
in Lisp and Prolog I have been able to reduce program costs from O(N**2)
to O(N) by switching from "string" representation of character sequences
to linked lists.  In general, it is wise to use "implicit" representations
for character sequences where you can.  Instead of constructing strings
and writing them out, construct trees of some sort and have a tree-walker
that sends out the characters without putting them in a string.
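
One way the tree-walker idea might look in C is sketched below.  This is
an illustration under my own assumptions, not the code from the Lisp or
Prolog programs mentioned above: struct seq, seq_leaf, seq_cat and
seq_write are hypothetical names.  The point is that concatenation builds
a node in constant time instead of copying characters, so repeated
appending no longer costs O(N**2), and the characters are only touched
once, when they are written out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* A character sequence is either a leaf (a run of bytes) or the
   concatenation of two sub-sequences. */
struct seq {
    const char *text;                /* leaf: the bytes; NULL in a concat node */
    size_t len;                      /* leaf: number of bytes in text */
    const struct seq *left, *right;  /* concat node: the two halves */
};

/* Make a leaf that refers to (does not copy) a NUL-terminated string. */
struct seq seq_leaf(const char *text)
{
    struct seq s = { NULL, 0, NULL, NULL };
    assert(text != NULL);
    s.text = text;
    s.len = strlen(text);
    return s;
}

/* Concatenate two sequences in O(1): no characters are copied. */
struct seq seq_cat(const struct seq *left, const struct seq *right)
{
    struct seq s = { NULL, 0, NULL, NULL };
    assert(left != NULL && right != NULL);
    s.left = left;
    s.right = right;
    return s;
}

/* Tree-walker: send the characters to out, left to right, without ever
   assembling them into one string.  Returns the number of bytes written. */
size_t seq_write(const struct seq *s, FILE *out)
{
    size_t n;
    if (s->text != NULL)
        return fwrite(s->text, 1, s->len, out);
    n = seq_write(s->left, out);     /* left half first, explicitly */
    return n + seq_write(s->right, out);
}
```

For example, building leaves for "Hello", ", " and "world", joining them
with two seq_cat calls, and passing the result to seq_write(&abc, stdout)
emits the twelve characters in order.  The recursion depth follows the
shape of the tree, so a long left-leaning chain would want an iterative
walker instead.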


