yacc & lex - cupla questions

Mon Jul 30 23:13:41 AEST 1990

In article <2481 at onion.reading.ac.uk> ac1 at rosemary.cs.reading.ac.uk (Andrew Cunningham) writes:
[Q+A for some problems with lex and yacc; refer to
 previous articles in this thread for more details]

[reading from other source than `yyin']

>[You can] #define yyinput to be
>something which returns the character from your file.  Then, when
>lex.yy.c is compiled, instead of calling the yyinput function your
>#define is called instead.  E.g.
>
>#define yyinput my_yyinput
>int my_yyinput()
>  {
>    /* get the character you want and return it */
>  }
>  
>You'll also have to redefine yyunput(c)  if you want to do this.

From this and one more article in this thread I conclude that there's a
widespread misconception about how things work together. Maybe the above
works with some versions of lex out there, but from looking at the details
of the generated lex source (lex.yy.c) of several systems (XENIX derived
from SysIII; AT&T UNIX SysV; ISC 386/ix derived from SysV), I see that
the above CAN NOT WORK as desired.

Here are some details on how the individual routines and functions call
each other:

        main (from lex-library or own)
         |
         V
        yylex --------------------------+-+
         | |                            | |
  +---+  | |  +-----+                   | |
  |   V  V V  V     |           ......  V V .........
  |  input unput    |           :  yyless yyreject  : in the
  |                 |           :.... | ... | . | ..: lex-library
  |                 |                 V     V   V
  |                 |                 yyunput   yyinput
  |                 |                    |         |
  |                 +--------------------+         |
  +------------------------------------------------+

What we should note first is:

	When the next character is needed in yylex, input
	(NOT yyinput!) is called. Normally, input is #defined
	as a macro, but you can re-#define it, or #undef it
	and make a function with this name visible when you
	compile and link lex.yy.c.

	There is another macro, unput, that must properly undo
	the actions of input, though unput is only called if your
	regular expressions require look-ahead. (If you are not
	*very* experienced with regular expressions, assume that
	there will *always* be look-ahead.)

So if we want to change things here, we must find the right place
for our re-definition; that is, we must write it somewhere in the
".l" file (with the lex source), so that it appears *after* the #define
that is automatically generated by lex, but *before* the first use
of input/unput. As the order in which the parts of your ".l" file
appear in lex.yy.c changed with the evolution of lex, you should
check for the right place when you try this for the first time!

The "safest" (i.e. most portable) place I've found is right at the
beginning of the second part of the ".l" file, immediately before
the first regular expression.

file.l ------------------------------------------------
	first part
%%
%{
#undef	input /* ANSI-C requires that, though */
#undef	unput /* other compilers may do without */
#define	input .... whatever ....
#define unput ... as you like ..
%}
first-regex {
	... action ...
}
second-regex {
	.... action ...
}
	... etc. ......
%%
	third part
------------------------------------------------------------
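
To make the placeholders above a bit more concrete, here is a minimal
sketch of what the replacement routines could look like when the scanner
should read from a string in memory instead of yyin. Everything prefixed
my_ (my_src, my_input, my_unput, my_set_source, MY_PUSHBACK_MAX) is
invented for this example and is not part of lex. The sketch assumes the
traditional convention that input returns 0 at end of input, and it keeps
its own small push-back stack so that unput works for any character. It
does not maintain yylineno, which the generated macros normally do.

```c
#include <assert.h>
#include <stddef.h>

#define MY_PUSHBACK_MAX 200              /* arbitrary limit for this sketch */

static const char *my_src = "";          /* hypothetical input: a C string */
static size_t my_pos = 0;                /* next unread position in my_src */
static int my_pushback[MY_PUSHBACK_MAX]; /* characters handed back by unput */
static int my_npushed = 0;

/* Point the scanner at a new string and forget any pushed-back input. */
static void my_set_source(const char *s)
{
    my_src = s;
    my_pos = 0;
    my_npushed = 0;
}

/* Next character: pushed-back ones first, then the string; 0 at the end. */
static int my_input(void)
{
    if (my_npushed > 0)
        return my_pushback[--my_npushed];
    if (my_src[my_pos] == '\0')
        return 0;                        /* stay on the terminator */
    return (unsigned char)my_src[my_pos++];
}

/* Undo an input(): the character is delivered again by the next input(). */
static void my_unput(int c)
{
    assert(my_npushed < MY_PUSHBACK_MAX);
    my_pushback[my_npushed++] = c;
}

/* Only these four lines would go between %{ and %} in the ".l" file. */
#undef  input
#undef  unput
#define input()  my_input()
#define unput(c) my_unput(c)
```

In a real ".l" file the functions themselves could live in the third part
or in a separate source file, as long as their declarations are visible
before the macros are first used.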

Now for the tricky part: As you see from the above, there are
some routines in the lex library which sometimes need to input
or unput characters. These routines *must* use exactly the
redefined versions of input and unput. How can these routines "link"
to something that is defined as a macro?

The solution can again be seen if we carefully study lex.yy.c, the
source generated by lex. At the end of this source we find the
functions yyinput and yyunput (note the yy-prefix now!), which do no
more and no less than call input and unput. As the two functions are
compiled where our macro definitions are visible, they are the "stubs"
through which the functions in the lex library access our macros.
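
The stubs have roughly the following shape. This is a paraphrase in
modern C, not a verbatim copy of any particular lex.yy.c (the generated
code is old-style C and differs in detail between systems). The macro
bodies are stand-ins invented for this example: demo_slot, demo_take and
demo_put form a one-character "stream" whose only purpose is to show that
the stubs pick up whichever macros are visible when they are compiled.

```c
#include <assert.h>

/* Stand-in "stream" for this sketch: holds at most one character, 0 = empty. */
static int demo_slot = 'x';

static int demo_take(void)
{
    int c = demo_slot;
    demo_slot = 0;
    return c;
}

static void demo_put(int c)
{
    assert(demo_slot == 0);   /* the slot only has room for one character */
    demo_slot = c;
}

/* Our re-definitions, as they would appear in the ".l" file. */
#define input()   demo_take()
#define unput(c)  demo_put(c)

/*
 * Stubs in the style of those at the end of lex.yy.c.  Because they are
 * compiled after the #defines above, they expand to calls of demo_take
 * and demo_put, and library routines that call yyinput/yyunput end up
 * using our versions.
 */
int yyinput(void)
{
    return input();
}

void yyunput(int c)
{
    unput(c);
}
```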

Again: Look at the above scheme showing the calling hierarchy and
try to understand the dependencies. If necessary, study lex.yy.c a bit
further. THEN you might consider writing your own input/unput macros!

>  
>>---------------------------------------------------------------------
[managing yacc projects with make]

>You'll need to specify an explicit rule to do this.  Or, at the
>expense of some processor time you might want to run:
>
>y.tab.h: yacc.y
>	yacc yacc.y
	    ^-- insert "-d"-switch
>	rm y.tab.c
>
>(This shouldn't take too long, yacc is *fast* compared with the cc stage)

First, the above is good advice insofar as it generally doesn't
hurt to run yacc only for the purpose of generating y.tab.h. For
that, the "-d" switch must be specified, but IMHO its omission is
simply a typo here.

What I have to criticize is that y.tab.h (as well as y.tab.c) is
more of a "workfile" (IMHO at least) and should be renamed
to something else. So we get:

yacc.h: yacc.y
	yacc -d yacc.y
	mv y.tab.h yacc.h
	rm -f y.tab.c
	   ^^----------- add this for portability, as on some older
	                 systems the exit status of rm is not set
	                 cleanly otherwise (`make' may complain).

(BTW: I'm not quite happy with file names like yacc.y, yacc.h etc. in
the presence of a command called yacc in the same lines here, but I
didn't want to change the original example too much.)

A fine point I already mentioned in an earlier posting follows:
Generally the situation is that typical changes in yacc.y will change
y.tab.c, but not y.tab.h (or yacc.h in the above example). The latter
will only change if new tokens or new types for the value stack are
introduced, which happens far less frequently than changes in the
actions of the grammar rules. So it is advisable to extend the
above further to:

yacc.h: yacc.y
	yacc -d yacc.y
	test -f yacc.h && cmp -s y.tab.h yacc.h || mv y.tab.h yacc.h
	rm -f y.tab.c

Here the mv is only done if yacc.h doesn't exist or is different from
y.tab.h.

>>---------------------------------------------------------------------
[making yytext available in grammar actions]

>
>line3: 
>	A {atext=strdup(yytext);}
>	B {btext=strdup(yytext);}
>	C {ctext=strdup(yytext);}
>	
>Note: if your grammar is more complex than this you can lead to
>all sorts of conflicts in the compiler - when the parser executes an
>action it is `committed' to that branch of the parse tree and cannot
>backtrack to resolve any ambiguity that might occur (the classic
>problem here is if ... then ... else in programming languages).

Again the poster says something very true here ... but forgets to
mention something *much* more important:

Never, again NEVER, again ***NEVER*** depend on the contents of yytext
being unchanged in the actions of yyparse(%): In yyparse the calls to yylex,
which in turn change the contents of yytext, are slightly "asynchronous", i.e.
there might be a read-ahead of one token, and then yytext doesn't contain what
you think! (Note: There's not ALWAYS a read-ahead; it just depends on whether
yyparse needs one to decide what to do next!) The only place where
yytext is reliably valid is in the action block following the regular
expression in the lex source.

%: Small note to Chris Torek, who some time ago gave a similar
recommendation in one of his postings: You and a few others who
understand the LALR(1) parsing algorithm used by yyparse, and hence
can decide under which circumstances read-ahead will occur, are
explicitly exempt from the above "never"-rule :-)
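
For everybody else, the read-ahead effect can be reproduced without lex
or yacc at all. What follows is a toy model in plain C, not generated
code: toy_text plays the role of yytext, toy_lval the role of yylval,
toy_lex the role of yylex, and toy_parse_first imitates a parser that
needs one token of read-ahead before it runs the action for the first
token. All toy_ names are made up for this sketch. The point it tries to
show is that a copy made in the scanner survives the read-ahead, while
the shared text buffer does not.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static char toy_text[64];        /* like yytext: reused for every token */
static char *toy_lval;           /* like yylval: copy made in the scanner */
static const char *toy_src;      /* remaining input */

/* Heap copy of a string; the caller frees it. */
static char *toy_copy(const char *s)
{
    char *p = malloc(strlen(s) + 1);
    if (p != NULL)
        strcpy(p, s);
    return p;
}

/* Scan one space-separated word; return 1 for a word, 0 at end of input. */
static int toy_lex(void)
{
    size_t n = 0;

    while (*toy_src == ' ')
        toy_src++;
    if (*toy_src == '\0')
        return 0;
    while (*toy_src != '\0' && *toy_src != ' ' && n < sizeof toy_text - 1)
        toy_text[n++] = *toy_src++;
    toy_text[n] = '\0';
    toy_lval = toy_copy(toy_text);   /* the safe place to save the text */
    return 1;
}

/*
 * Read token 1, then token 2 as read-ahead, and only then "run the
 * action" for token 1.  Reports what that action finds in toy_text and
 * in the value saved with token 1.  The caller frees both results.
 */
static void toy_parse_first(const char *input, char **in_text, char **in_lval)
{
    char *first_val;

    assert(input != NULL);
    toy_src = input;
    toy_text[0] = '\0';
    toy_lval = NULL;

    toy_lex();                       /* token 1 */
    first_val = toy_lval;            /* value travels with token 1 */
    if (toy_lex())                   /* token 2: the read-ahead */
        free(toy_lval);              /* its value is not needed here */

    *in_text = toy_copy(toy_text);   /* what the action sees in "yytext" */
    *in_lval = first_val;            /* what the action sees in "$1" */
}
```

With the input "alpha beta", the action for the first token finds "beta"
in toy_text but "alpha" in the saved value.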

>
>Hope this information helps.
>
>AndyC

Hope these corrections avoid frustration.

P.S. to AndyC: I didn't intend to make your recommendations look bad.
Topics like lex and yacc are really not well covered by the docs, or
at least you have to look very hard to get to the information you need.
Stay tuned ...
-- 
Martin Weitzel, email: martin at mwtech.UUCP, voice: 49-(0)6151-6 56 83