yacc & lex - cupla questions

Fri Jul 27 20:05:47 AEST 1990

In article <1990Jul26.175831.1216 at uicbert.eecs.uic.edu> woodward at uicbert.eecs.uic.edu writes:
>
>i have been trying to parse a straightforward stream of bytes using the
>c-preprocessors lex & yacc.  being a new user of these utilities, i have
>a couple of problems for which i'd like to solicit your suggestions:

Since the "standard docs" for lex + yacc are very terse (not to say
incomplete in many places), I am posting this as a followup rather
than answering by email. Now, let's see where the problems are ...

>---------------------------------------------------------------------
>1.)  how does one redefine the i/o in a yacc/lex piece of code?  i.e.
>the code which is generated defaults to stdin and stdout for input and
>output, respectively.  i'd like to redefine these defaults w/o having 
>to hack on the intermediate c-code, since this is a live production 
>project; i'd like to be able to update and modify the program simply by 
>saying "make". 

The call tree of a lex+yacc application, as far as reading input is
concerned and as long as you change nothing, is normally:

   (main or whatever)
	---> yyparse
		---> yylex
			---> input[Macro]
				---> getc(yyin) [yyin defaults to stdin]

If you want to read from some source other than stdin, there are several
points where you can change something. (In the very simplest case you
could even change nothing and use the input redirection of UNIX.)

.sidenote on
Though there are often good reasons, I sometimes wonder why a program
cares about file arguments at all instead of using stdin only. I once
found it annoying that there were programs like "tr" which don't handle
file arguments in the unix style ... until I learned that it's so easy
to put such programs in a shell wrapper like "cat $* | tr .....".
.sidenote off

If your lex-generated program has to read from a source other than stdin,
just fopen the file and assign the returned FILE-pointer to yyin. The
latter is a global symbol in the object file which results from compiling
what lex generated. If you prefer separate compilation you must declare
it as "extern FILE *yyin;" in the module where you assign to it. The
right place for this will probably be the one where you call yyparse
in the above example. Note that the standard main program from the
yacc-library is not linked if you supply your own. So you can play
any games you like before calling yyparse.
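
A minimal sketch of this, assuming a helper named parse_file (my name,
not something lex or yacc provides). So that the fragment compiles on
its own, it carries stand-in definitions of yyin and yyparse; the
switch HAVE_LEX_YACC is likewise made up for this sketch and selects
the extern declarations you would use when linking against the real
generated code.

```c
#include <stdio.h>

#ifdef HAVE_LEX_YACC            /* hypothetical switch for this sketch */
extern FILE *yyin;              /* defined in the lex-generated code   */
extern int yyparse(void);       /* defined in the yacc-generated code  */
#else
/* Stand-ins so the sketch compiles without any generated code. */
FILE *yyin;
int yyparse(void)
{
    int c;

    while ((c = getc(yyin)) != EOF)
        ;                       /* a real parser would call yylex() */
    return 0;
}
#endif

/* Hypothetical helper: parse the named file instead of stdin.
   Returns the result of yyparse(), or -1 if the file cannot be opened. */
int parse_file(const char *path)
{
    FILE *fp = fopen(path, "r");
    int status;

    if (fp == NULL) {
        perror(path);
        return -1;
    }
    yyin = fp;                  /* the scanner now reads from this file */
    status = yyparse();
    fclose(fp);
    yyin = stdin;               /* do not leave a closed stream behind */
    return status;
}
```

Your own main would then call parse_file(argv[1]) when a file argument
is given and plain yyparse() otherwise.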

The next step in complexity is to change the input-macro of yylex. This
is sometimes useful, but I would not recommend doing so until you
have gathered a bit of experience with lex and understand the
implications (but I'm willing to answer questions on this by email).

Finally, you can consider avoiding lex altogether and rolling your own
version of yylex. If you have only the "ancient" lex which is supplied
with most unix systems (as opposed to the rewrite "flex", which I
believe is in the public domain?), this could possibly be an advantage,
since lex-generated programs are known to be less efficient than
hand-written scanners. (I have no exact metrics for that, and the
comparisons made are often based on trivial scanners, which are
easily written by hand. In any case I would recommend using lex
during development as a prototyping tool.)
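
To give an idea of what "rolling your own" means, here is a small
sketch of a hand-written yylex. The token codes NUMBER and WORD are
made up for the example (with yacc they would come from y.tab.h), and
the fixed-size yytext is a simplification; a real scanner needs more
care with long tokens.

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical token codes; with yacc these come from y.tab.h. */
enum { NUMBER = 258, WORD = 259 };

FILE *yyin;             /* input stream, stdin if left unset     */
char yytext[128];       /* text of the most recent token         */
int yyleng;             /* number of characters in yytext        */

/* Returns 0 at end of input, NUMBER or WORD for runs of digits or
   letters, and the character itself for anything else. */
int yylex(void)
{
    int c;

    if (yyin == NULL)
        yyin = stdin;

    do {                                /* skip white space */
        c = getc(yyin);
    } while (c == ' ' || c == '\t' || c == '\n');

    if (c == EOF)
        return 0;

    yyleng = 0;
    yytext[yyleng++] = (char)c;

    if (isdigit(c) || isalpha(c)) {
        int digits = isdigit(c) != 0;

        /* collect further characters of the same class */
        while (yyleng < (int)sizeof yytext - 1 &&
               (c = getc(yyin)) != EOF &&
               (digits ? isdigit(c) : isalpha(c)))
            yytext[yyleng++] = (char)c;

        /* push back the one character read too far, if any */
        if (yyleng < (int)sizeof yytext - 1 && c != EOF)
            ungetc(c, yyin);

        yytext[yyleng] = '\0';
        return digits ? NUMBER : WORD;
    }

    yytext[yyleng] = '\0';
    return c;
}
```

Returning the character itself for single-character tokens matches the
way yacc treats literals like '+' in a grammar.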

For the redirection of output I see no problem at all, since this
is fully under control of the C-program fragments you write in your
actions of the lex+yacc source.

>---------------------------------------------------------------------
>2.)  how can one get the automagically-defined #defines, which can
>normally be created from yacc with the -d flag, to come out when you
>use a makefile?  i.e. suppose i have lex.l and yacc.y lex and yacc
>source files, respectively, and i have object files defined in my makefile 
>called lex.o and yacc.o such that "make" follows default rules to create 
>these from the aforementioned source files.   

If you use lex + yacc with the Unix tool make, you can add your
own explicit dependencies or change the default rules and add your
own commands there. There is no "catch all" method for this (several
variations exist, each with its specific advantages and drawbacks),
but if you know make or are willing to learn about it, you can
determine the dependencies between the files generated by lex + yacc in
the same way as for normal sources (BTW: I found the book in the
O'Reilly Nutshell Series, "Managing Projects with Make", excellent for
learning about make, though for the basics "The Unix Programming
Environment" by Kernighan + Pike [K+P] is sufficient. The latter is
also worth recommending for its treatment of lex + yacc.)

One thing should be mentioned here (you also find this in K+P): During
development it's much more probable that the actions in your grammar
will change than that you add new tokens or change the type of
the value stack. Hence when running yacc, the contents of y.tab.c
will often change, but y.tab.h will stay the same. Since both
are generated in one run (yacc -d), and some other targets may
depend on y.tab.h, you will often have unnecessary compiles caused
by this scheme (BTW: This is a mistake in the design of yacc. A
better choice would have been to let yacc -d create *only* y.tab.h.
If GNU's replacement for yacc, bison, hasn't already done this, it
should add an option switch for that purpose. This would ease "clean"
integration into make-managed projects.)

K+P has a solution for this. Mine is basically the same, just in another
package: Write two shell-wrappers (or one with an option) for yacc which
generate the y.tab.c and y.tab.h separately. For any grammar in a file
"grammar.y", these wrappers should generate the corresponding "grammar.c"
and "grammar.h" files. Since yacc writes its output into y.tab.c and
y.tab.h, the wrappers must rename these files. Before doing so for
y.tab.h, the wrapper should compare it (e.g. with cmp(1) or diff(1)) to
grammar.h (if one already exists). Leaving the existing one untouched if
nothing has changed avoids the unnecessary re-compiles of other modules.
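
In a shell wrapper the comparison is one call to cmp(1). Just to make
the compare-then-rename step explicit, here is the same logic sketched
in C. The function names are made up for this sketch, and error
handling is kept to a minimum.

```c
#include <stdio.h>

/* Returns 1 if both files can be opened and have identical contents,
   0 otherwise. */
static int same_contents(const char *a, const char *b)
{
    FILE *fa = fopen(a, "rb");
    FILE *fb = fopen(b, "rb");
    int ca, cb, same = 0;

    if (fa != NULL && fb != NULL) {
        do {
            ca = getc(fa);
            cb = getc(fb);
        } while (ca == cb && ca != EOF);
        same = (ca == cb);      /* both reached EOF together */
    }
    if (fa != NULL)
        fclose(fa);
    if (fb != NULL)
        fclose(fb);
    return same;
}

/* Installs fresh (e.g. "y.tab.h") as target (e.g. "grammar.h"), but
   only if the contents differ. Returns 1 if target was replaced, 0 if
   it was left alone (so its timestamp is unchanged), -1 on error. */
int install_if_changed(const char *fresh, const char *target)
{
    FILE *probe = fopen(fresh, "rb");

    if (probe == NULL)
        return -1;              /* nothing to install */
    fclose(probe);

    if (same_contents(fresh, target)) {
        remove(fresh);          /* keep the old file, drop the new one */
        return 0;
    }
    remove(target);             /* may fail if target does not exist yet */
    return rename(fresh, target) == 0 ? 1 : -1;
}
```

Because grammar.h keeps its old modification time whenever nothing
changed, make sees no reason to recompile the modules that include it.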

>
>---------------------------------------------------------------------
>3.)  if i have a yacc construct such as:
> 
>line3	: 	A B C
>		{  yacc action sequence }
>
>
>which indicates that the construct line3 is composed of the 3 tokens
>A B and C, in that order ...
> 
>how can i now assign the values of A, B, and C into local vars of my
>choice?  the problem lies in the fact that each of A B and C represent
>three calls to lex, and if i pass back a pointer to yytext[] from lex, 
>i only retain the value of the last token in the sequence, in this case C, 
>when i get to the action sequence in my yacc code.  what if i want to 
>be able to select the EXACT ascii tokens for each of A B and C above in 
>my yacc code.  how do i do that?

Yes, that's a frequently asked one.

Transferring strings from yylex to yyparse (or rather, to the action
which has the relevant tokens on the RHS of its grammar rule) must be
done with care: Using pointers to yytext is not feasible here - you must
copy the contents to a safe place. For that purpose you could malloc some
space in the action of yylex (not yyparse!!) which recognizes the token
(see example following below). Your C-standard-library may also contain
strdup, which does malloc and strcpy all in one, but it's not difficult
to do without. Of course you must be careful here:
   -	malloc may return a NULL-pointer because of memory limits
   -	you must not forget to allocate space for the terminating
	NUL-byte; malloc(yyleng + 1) is the right thing!
   -	you must carefully plan for de-allocation, if your
	program should not run out of memory when it analyzes
	some large input
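
The first two points can be packed into a small helper, so that every
lex action does the copy in the same way. This is only a sketch:
save_token is a name I made up, and whether a failed malloc should
really end the program is your decision.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a malloc-ed, NUL-terminated copy of
   the first len bytes of text. The caller must free() the result. */
char *save_token(const char *text, size_t len)
{
    char *copy = malloc(len + 1);       /* +1 for the NUL-byte */

    if (copy == NULL) {
        fprintf(stderr, "out of memory\n");
        exit(EXIT_FAILURE);
    }
    memcpy(copy, text, len);
    copy[len] = '\0';
    return copy;
}
```

In a lex action this would be used as
"yylval.str = save_token(yytext, yyleng);", assuming the value stack
has a "char *str" member.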

If you transfer pointers to the malloc-ed space via the value stack,
the last chance for free-ing is before the stack is cleared. So, if you
don't copy the pointers which correspond to A, B, and C in the above
example, your last chance is in the grammar action. A short code
excerpt should help to understand what is required:

lex-source -------------------------------------------------------
........
%%
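regex-for-token-A	{
		/* yyleng is the length of yytext; +1 for the NUL-byte */
		yylval.str = malloc(yyleng + 1);
		if (yylval.str == (char *)0) {
			/* scream and shout and die horrible death */
			fprintf(stderr, "out of memory\n");
			exit(1);
		}
		strcpy(yylval.str, yytext);
		return(A);
	}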
......... etc, same for token B and C
-------------------------------------------------------------------

yacc-source ------------------------------------------------------
......
%union {
	.....
	char *str;
	.....
}
......
%token <str> A B C
......
%%
......
line3 : A B C {
		/*
		 * $1, $2, and $3 are pointers to "safe" copies of the
		 * original tokens now, but if you don't copy these
		 * pointers to variables that will SURVIVE THIS BLOCK,
		 * you must clean up before this action ends:
		 */
		free($1);
		free($2);
		free($3);
		/*
		 * Be especially careful if you create multiple references
		 * to the malloc-ed space or if you transfer one of these
		 * further out, say: $$ = $1. In this case you must of
		 * course *not* free $1 here; instead the action(s) of the
		 * rule(s) where the non-terminal "line3" appears on the
		 * RHS are now responsible for doing so.
		 */
	}
......
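
If the three strings have to survive the action, one way to keep the
bookkeeping simple is to hand them over to a record which then owns
them. The following is only a sketch; struct line3, line3_new and
line3_free are names invented for this example.

```c
#include <stdlib.h>

/* Hypothetical record that outlives the grammar action. */
struct line3 {
    char *a, *b, *c;            /* owned, released by line3_free() */
};

/* Takes ownership of three malloc-ed strings. If the record itself
   cannot be allocated, the strings are freed and NULL is returned. */
struct line3 *line3_new(char *a, char *b, char *c)
{
    struct line3 *p = malloc(sizeof *p);

    if (p == NULL) {
        free(a);
        free(b);
        free(c);
        return NULL;
    }
    p->a = a;
    p->b = b;
    p->c = c;
    return p;
}

/* Releases the record together with the strings it owns. */
void line3_free(struct line3 *p)
{
    if (p == NULL)
        return;
    free(p->a);
    free(p->b);
    free(p->c);
    free(p);
}
```

The action would then read "$$ = line3_new($1, $2, $3);" with no free
calls at all, assuming the union has a "struct line3 *" member and
line3 is given that type with %type. Whoever finally consumes the
record calls line3_free exactly once.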

>
>any comments or suggestions would be most heartily appreciated.

Enough?

Good, lex+yacc lesson ends for today :-).
-- 
Martin Weitzel, email: martin at mwtech.UUCP, voice: 49-(0)6151-6 56 83