C not LALR(1) & compiler bugs

richw at ada-uts.UUCP richw at ada-uts.UUCP
Sat Jan 18 07:20:00 AEST 1986


C's grammar is CONTEXT SENSITIVE !?    Can it be ?!

The following is quoted from page 121 of "C: A Reference Manual" by
Harbison & Steele (which, by the way, beats the pants off of
Kernighan & Ritchie as a reference manual).  After the quote,
I've included a small program which just may reveal a minor bug
in your C compiler (it did for mine).


    Allowing ordinary identifiers, as opposed to reserved words only,
    as type specifiers makes the C grammar context sensitive, and
    hence not LALR(1).  To see this, consider this program line

                       A ( *B );

    If A has been defined as a typedef name, then the line is a
    declaration of a variable B to be of type "pointer to A."
    (The parentheses surrounding "*B" are ignored.)  If A is not
    a type name, then this line is a call of the function A with
    the single parameter *B.  This ambiguity cannot be resolved
    grammatically.
        C compilers based on UNIX' YACC parser-generator -- such
    as the Portable C Compiler -- handle this problem by feeding
    information acquired during semantic analysis back to the
    lexer.  In fact, most C compilers do some typedef analysis
    during lexical analysis.
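
To make the "feeding information back to the lexer" trick concrete, here is a rough sketch of how it might look.  The names declare_typedef and classify_identifier, the token kinds, and the table size are all made up for illustration; this is not code from the Portable C Compiler or any other real compiler.  The parser would call declare_typedef from the semantic action for a typedef, and the lexer would call classify_identifier on every identifier it scans.

```c
#include <string.h>

/* Hypothetical table size, chosen only for this sketch. */
#define MAX_TYPEDEFS 64

/* Hypothetical token kinds the lexer hands to the parser. */
enum token_kind { TOK_IDENTIFIER, TOK_TYPE_NAME };

/* Names are stored by pointer; they are assumed to outlive the table. */
static const char *typedef_names[MAX_TYPEDEFS];
static int typedef_count = 0;

/* Called by the parser when it has finished a typedef declaration.
   Returns 0 on success, -1 if the table is full. */
int declare_typedef(const char *name)
{
    if (typedef_count >= MAX_TYPEDEFS)
        return -1;
    typedef_names[typedef_count++] = name;
    return 0;
}

/* Called by the lexer for each identifier: the same spelling yields a
   different token depending on what has been declared so far. */
enum token_kind classify_identifier(const char *name)
{
    int i;

    for (i = 0; i < typedef_count; i++)
        if (strcmp(typedef_names[i], name) == 0)
            return TOK_TYPE_NAME;
    return TOK_IDENTIFIER;
}
```

With this in place, "A ( *B );" reaches the parser as TOK_TYPE_NAME ( * TOK_IDENTIFIER ) ; if A was typedef'd earlier, and as TOK_IDENTIFIER ( * TOK_IDENTIFIER ) ; otherwise, so the grammar itself never has to choose.  A real compiler would also have to handle scoping, which this sketch ignores.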


All I have to say, concerning the design of C's syntax, is "Oops".

I also realized that this, together with that real spiffy feature
of C that identifiers are the same if their first 8 characters are
the same, could really confuse C compilers.  I tried the following
program on the compiler I use:

    typedef int long_type_name;
    
    f(a)
    int *a;
    {
        long_type_of_function_name (*a);

        printf("Bye");
    }

According to H&S, a correct C compiler should say that this is a
redeclaration of "a" (since "long_type_of_function_name" and
"long_type_name" are, uh, the same identifier).  However, the
compiler I use simply eats it up, thinking that the line in
question is a call to some external function (which, since it
wasn't explicitly declared, C graciously assumes returns an
int -- isn't C just so helpful !).  My guess is that when the
lexer checks to see if the function name is really a typedef'd
name, it checks ALL of the characters in both names (i.e. strcmp)
instead of checking just the first 8 (i.e. strncmp).
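
Here is a small sketch of the two comparisons I have in mind.  This is only my guess at what goes on inside the compiler, and the function names and the SIGNIFICANT_CHARS macro are invented for the example:

```c
#include <string.h>

/* Assumption: 8 significant characters, as described above. */
#define SIGNIFICANT_CHARS 8

/* What the rule asks for: two names are the same identifier if their
   first 8 characters match. */
int same_identifier(const char *a, const char *b)
{
    return strncmp(a, b, SIGNIFICANT_CHARS) == 0;
}

/* What I suspect my compiler does: compare every character. */
int same_spelling(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

For "long_type_name" and "long_type_of_function_name", both of which start with "long_typ", same_identifier returns 1 while same_spelling returns 0.  That difference is exactly what decides whether the line is a declaration or a function call.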

Of course, since the identifiers really ARE different, it SEEMS
as if the compiler's thinking it's a function call IS correct.
Technically, it's a buggy compiler, though.

Isn't it strange that it seems better for the compiler to be wrong?

Doesn't that make you wonder if something is SERIOUSLY wrong with C?

Personally, I think that the real fault for my "buggy" compiler
lies not with the compiler writer, but in the shoddy language design
that haunts the deep-dark corners of C.  I mean, is there any excuse
for the grammar being context sensitive?  Or, for that matter, for
identifiers having only 8 significant characters?

-- Rich  "Picky-Picky-Picky"  Wagner


P.S.  Forgive me if this piece of C trivia has already been discussed
      (or flamed, as in this case) in net.lang.c -- I just found out
      about it and was amazed.


