LEX behaviour when given "large" automata

Chris Torek chris at mimsy.UUCP
Sun Mar 27 06:51:42 AEST 1988


-In some article whose referent was deleted, I wrote:
->[rewriting keywords to use table lookup on generic identifiers]
->may (probably does) help in lex, but it is clearly the wrong way
->around in almost all cases.  The lexer has to traverse a DFA to decide
->that something is an ID.  While it is at it it could easily tell
->whether it is a keyword instead.  Obviously this makes the DFA table
->larger, but the result should run faster.

In article <10700014 at snail> carroll at snail.CS.UIUC.EDU writes:
-	I disagree. For a compiler class, we had to write a lex file, and
-the official way was to use rules of the form
-
-IF	 {return T_IF }
-
-and so on. It was big and slow. I changed mine to use a single rule for
-identifiers, looking each one up in a table, and I got a speedup of
-roughly 4x in lex/compile time and a 40% reduction in compiled object
-size. I think the point is that a DFA that merely decides whether
-something is an identifier is so much simpler than one that also checks
-for every keyword that removing the keyword cases is a net gain.
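
[Ed.: for concreteness, the single-rule scheme described above might look
roughly like this in a lex specification; the identifier pattern and the
screen() helper are illustrative, not from the original posting:]

```
%{
/* screen() decides keyword vs. plain identifier; defined in user C code. */
int screen(const char *);
%}
%%
[a-zA-Z_][a-zA-Z0-9_]*	{ return screen(yytext); }
```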

Remember, I said `may (probably does) help in lex' and `in almost all
cases'.  The fact that the DFA table is larger may slow you down, or
may not, depending on the machine on which the program runs (does it
have caches?  How big are they?  etc.).  The fact that Lex generates
horrid DFA code is far more important in this case.

-In addition, I would think that for most cases, you have to find
-non-builtin "keywords" (e.g., variable names) as you go, and
-so have to check for something that is not a keyword but is still
-an identifier anyway.

Possibly.  This depends on the ratio of keywords to identifiers in the
`average' input.  Languages with many keywords (e.g., Pascal: begin,
end, do, then, var) may have a high ratio, making fast keyword-to-token
translation important; languages with few keywords (C) may have a low
ratio, making it unimportant.  Conversely, many keywords may make the
DFA table explode.  `Almost all cases' may have been an overstatement;
I have no hard data.  Try rewriting your lexer using the LBL Fast Lex
and see what happens: that will provide one data point.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris at mimsy.umd.edu	Path:	uunet!mimsy!chris



More information about the Comp.lang.c mailing list