lex(1) output statistics (Summary)

Douglas W. Martin martin at noscvax.UUCP
Tue Jun 11 06:35:11 AEST 1985


    Several weeks ago I requested info explaining lex output
statistics; e.g. the meanings of the "%" parameters,
what sorts of rules caused which tables to increase in size,
etc.  I didn't get many responses, but those I received
were most helpful.
A summary of responses follows.
Thanks much.
Doug Martin
arpanet: martin at nosc
------------------------------------------------------
From: (Vern Paxson [rtsg]) vern at lbl-csam
I can tell you what some of the mysterious %{e,p,n,...}'s in the Lex output are:

    %e is the number of NFA states needed to recognize the input patterns.  It
       is a linear function of the number of characters and operators in
       the patterns.

    %n is the number of DFA states needed for the recognizer.  It is
       potentially an exponential function of the number of NFA states,
       but in practice tends to be no greater and usually less than the
       %e figure.

    %t is the number of state transitions in the NFA.  It is a function of
       the complexity of your patterns (especially the "*" and "+" operators)
       and, in particular, of the occurrence of character classes inside
       closures.  A single character class inside a closure is enough to
       drive the figure up.  For example, "keyword recognizer" inputs like:

	   begin	return( TOK_BEGIN );
	   end		return( TOK_END );
	   procedure	return( TOK_PROCEDURE );
	   .
	   .
	   .
	   [a-z]+	return( TOK_IDENTIFIER );

       will result in a very high %t figure.  It's Lex's job to compress
       all of these transitions into something much smaller, but it doesn't
       do an especially good job.  We have written our own Lex program in
       Ratfor, and our experience shows that for complicated sets of rules
       the tables can be half the size the standard Lex produces.

    %o is basically the compressed version of %t.  The total amount of
       memory the tables generated by Lex will take is

	   3*(%n) + 2*(%o) + X  integer words

       where X is roughly 2*(%n).  Most of this stuff could be declared
       as short but isn't.  In addition, there's a constant overhead of
       two arrays of 128 characters each.
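
       With hypothetical figures of, say, %n = 300 and %o = 2000, that
       works out to roughly

	   3*300 + 2*2000 + 2*300 = 5500  integer words

       plus the constant overhead of the two 128-character arrays.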

I could make guesses as to what %k, %p, and %a are, but all they'd be
is guesses.

If your concern is staying within Lex's default maximums for these
values, the way to do it is to eliminate ambiguities and similarities between
rules.  The above example of rules that recognize keywords and identifiers
will use a great deal less of Lex's resources (and have much smaller output
tables) if you get rid of either the keyword rules or the identifier rules.
The complexity in the set of rules arises from the ambiguity of each
keyword also matching the identifier rule.
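
One way to drop the per-keyword rules without losing keyword recognition
is to let the single identifier rule match everything and screen the
matched text against a small table in C.  The following is only a rough
sketch (the token codes, the kwtab table, and the screen() routine are
made up for illustration; they are not part of any distributed Lex):

	%{
	/* Sketch only: the token codes are assumed here; normally
	 * they would come from a yacc-generated header.
	 */
	#define	TOK_IDENTIFIER	257
	#define	TOK_BEGIN	258
	#define	TOK_END		259

	static struct kw { char *name; int tok; } kwtab[] = {
		{ "begin",	TOK_BEGIN },
		{ "end",	TOK_END },
		{ 0,		0 }
	};
	int screen();
	%}
	%%
	[a-z]+		return( screen(yytext) );
	%%
	/* Look the matched text up in kwtab; default to TOK_IDENTIFIER. */
	int
	screen(s)
	char *s;
	{
		register struct kw *p;

		for (p = kwtab; p->name != 0; p++)
			if (strcmp(s, p->name) == 0)
				return( p->tok );
		return( TOK_IDENTIFIER );
	}

With the keyword rules gone, Lex only has to build the machine for the
[a-z]+ pattern, so the %t and %o figures should drop considerably.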

But if you want to keep your input rules intact, you can increase Lex's
maximum for a given value by putting, e.g.,

    %a 3000

in the declarations section of the input.  In picking how much to raise
a value, I just increase it by 500 or 1000 and see if that's enough.  Judging
from the operation of our Lex, I don't think any of these maximums have
a geometric effect on the amount of memory Lex uses.  They may very easily
have a geometric effect on Lex's run-time, though.
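
The same trick works for the other limits; the Lex manual documents %p
(positions), %n (states), %e (parse tree nodes), %a (transitions),
%k (packed character classes), and %o (output array size) as adjustable
this way.  A declarations section that raises several of them at once
might look like this (the figures are arbitrary, for illustration only):

	%e 2000
	%p 5000
	%n 1000
	%a 4000
	%k 1000
	%o 4000

All of these go before the first %%, alongside any other declarations.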

Hope this helps,

Vern Paxson
Real Time Systems Group
Lawrence Berkeley Laboratory

vern at lbl-csam.ARPA
------------------------------------------------------
From: <sdcsvax!sdcrdcf!hplabs!hao!seismo!maryland.ARPA!aplvax!ded>
I was confronted with this problem about one year ago.  I had to
increase the table parameters (I forget the name of the file, but
it was a .h file) and recompile lex.  I called my new lex "biglex" and
it worked just fine.  

How do you determine the correct size for the table parameters?  I didn't 
have the time or inclination to analyze the operation of lex, so I just 
used trial and error.  What can I say?  It worked!


-- 

					Don Davis
					JHU/APL
				...decvax!harpo!seismo!umcp-cs!aplvax!ded
				...rlgvax!cvl!umcp-cs!aplvax!ded
--------------------------------------------------


