yacc and lex bugs

R. Curtis Jackson rcj at burl.UUCP
Fri Apr 20 04:15:36 AEST 1984


FIRST OFF -- AN APOLOGY:  I have been informed that the Unix Hotline
folks processed my MR on yacc(1) promptly, and that after
sitting in Murray Hill for a year now it is considered "Under
Investigation" and the status is "We'll postpone judgement until
a later date".  The Hotline people did their job admirably, and I
am sorry I blasted them without having the MR checked first.

1) yacc
	a) Problem (history):
	In the 'good old days' (V6), yacc's debug output would not tell
	you that it had found 'token ADDOP'; it would tell you that it
	had found 'token 426', and it was up to you to find out (by
	using the -d option and looking at y.tab.h) what token 426
	really was.  So it was beneficial to define your own token
	numbers rather than letting yacc default them; that way they
	were in your source file for easy access.
	Even today, if you have one lexical analyzer feeding two or
	more parsers with the same tokens, you want to make sure
	that the token numbers are the same in both parsers, so this
	feature of yacc (being able to define your own token numbers)
	is still quite valid and useful.

	b) Problem:
	yacc uses tables of ints to transition from state to state.
	The entries are negative numbers, based either on the negative
	of the token number or on ( -(the_next_desirable_state) - 1000 ).
	In other words, if you are to transition to state 53, the number
	in the table will be -1053.  [ I am about 90% sure this is
	accurate -- regardless, I do know the problem is related to
	this ].  If you use token numbers > 1000, then yacc will run
	perfectly and generate a proper y.output if you use the -v
	option, but when y.tab.c is compiled and executed, the results
	are totally unpredictable.  The generated parser will transition
	to wildly inappropriate states and start generating
	'Syntax error's at a phenomenal rate.
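
	The suspected collision can be sketched in a few lines of C.  This
	is only an illustration of the encoding as described above, which
	is itself a guess; the names encode_token, encode_state, collides
	and STATE_OFFSET are made up for the example and are not taken
	from yacc's source.

```c
/* Hypothetical offset, taken from the "- 1000" in the description above;
   not verified against yacc's source. */
#define STATE_OFFSET 1000

/* Assumed table entry meaning "the lookahead is token number tok". */
static int encode_token(int tok)
{
    return -tok;
}

/* Assumed table entry meaning "transition to state st". */
static int encode_state(int st)
{
    return -st - STATE_OFFSET;
}

/* Returns 1 when the entry for token tok is the same int as the entry
   for state st, so a table reader could not tell the two apart. */
static int collides(int tok, int st)
{
    return encode_token(tok) == encode_state(st);
}
```

	Under these assumptions, token 1053 and state 53 both encode to
	-1053, while a token such as 426 cannot be confused with state 53.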

	c) Cure:
	Let yacc default its token numbers unless you absolutely cannot
	get around it.  If you really need that feature, don't use token
	numbers over 1000.  NOTE: remember to start your token numbers
	above the ASCII character codes, or yacc will think that your
	ADDOP, to which you have assigned a token number of 040, is a
	space, and vice-versa.  If you have to use token numbers *AND*
	you have so many tokens that you are running over 1000, then wade
	through the yacc code, find the define for that number, and
	increase it.  (An extremely improbable situation.)
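
	If you do assign your own numbers, a small range check can catch
	both mistakes before they reach the parser.  This is a minimal
	sketch: ASCII_MAX, TOKEN_MAX and token_number_ok are hypothetical
	names, and the bounds assume 7-bit ASCII and the limit of 1000
	given above.

```c
/* Assumed bounds: 0177 is the highest 7-bit ASCII code, and 1000 is the
   limit suggested above.  Neither value is read from yacc itself. */
#define ASCII_MAX 0177
#define TOKEN_MAX 1000

/* Returns 1 if n is above the ASCII codes and not over the limit. */
static int token_number_ok(int n)
{
    return n > ASCII_MAX && n <= TOKEN_MAX;
}
```

	With these bounds, 040 is rejected because it is the code for a
	space, 426 is accepted, and 1053 is rejected.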

2) lex
	a) Problem:
	lex has an input character buffer called yysbuf that is
	dimensioned to YYLMAX, defined to be 200.  Unfortunately, the
	routine that reads the input file [ yylook() ] does not, as
	far as I can tell, check to make sure that it has not gathered
	into yysbuf (or yytext, which is also dimensioned to YYLMAX)
	more than YYLMAX characters.  If the text it is matching runs to
	more than YYLMAX characters, it writes them right past the
	end of yysbuf and on into 'The Memory Zone', usually producing
	Memory Faults or Bus Errors somewhere down the line.
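
	For comparison, this is roughly what the missing check would look
	like.  It is a sketch of a bounded append, not lex's code: the
	names token_buf, buf_put and BUF_MAX are invented for the example,
	and BUF_MAX merely stands in for the default YYLMAX of 200.

```c
#include <stddef.h>

#define BUF_MAX 200   /* stands in for the default YYLMAX */

/* Hypothetical stand-in for a fixed-size buffer such as yytext. */
struct token_buf {
    char   text[BUF_MAX];
    size_t len;           /* characters stored, not counting the NUL */
};

/* Append c, keeping one byte free for the terminating NUL.
   Returns 0 on success, or -1 if the buffer is already full. */
static int buf_put(struct token_buf *b, int c)
{
    if (b->len + 1 >= BUF_MAX)
        return -1;        /* refuse to write past the end */
    b->text[b->len++] = (char)c;
    b->text[b->len] = '\0';
    return 0;
}
```

	A caller that sees -1 can report an overlong token instead of
	silently overwriting whatever follows the buffer in memory.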

	b) Cure:
	If you get a Memory Fault or Bus Error, and cannot seem to
	locate it, put the following lines into the declarations
	section of your lex program:

	%{
	/* ... your other declarations ... */
	#undef YYLMAX
	#define YYLMAX 5000 /* or some other ridiculously large number */
	/* ... more of your declarations ... */
	%}

	This will override lex's YYLMAX define (see the lex(1)
	documentation concerning overriding lex's input() macro and also
	look at the first 15 lines of any lex.yy.c for details).
	If your Memory Fault/Bus Error goes away, then either:

	1) Your pattern specs for lex are out of line -- you are not
	matching what you think you are matching -- check for rules
	containing things like [^x], where x is some character.  Remember
	that rules like these match ANY character but x, including
	newlines.

	2) Your pattern specs are OK, but you are simply trying to match
	more than 200 characters.  Use the above method to define YYLMAX
	to a reasonable number for your application and go on.
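
	The first case above is easy to underestimate.  The following C
	sketch mimics only what a pattern like [^x]* consumes; lex's real
	matcher is table-driven, and span_not is a hypothetical helper
	written for this illustration.

```c
#include <stddef.h>

/* Length of the longest prefix of s containing no occurrence of x,
   i.e. what a pattern like [^x]* would swallow.  Newlines get no
   special treatment: they are simply characters other than x. */
static size_t span_not(const char *s, char x)
{
    size_t n = 0;
    while (s[n] != '\0' && s[n] != x)
        n++;
    return n;
}
```

	Given two full lines of input with no x in them, the span covers
	both lines, newlines included, so a missing terminator in the
	input can easily push a single match past 200 characters.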

Hope this helps some people.  Please direct any questions/comments to
me at the address below.
-- 

The MAD Programmer -- 919-228-3313 (Cornet 291)
alias: Curtis Jackson	...![ ihnp4 ulysses cbosgd clyde ]!burl!rcj



More information about the Net.bugs.usg mailing list