Comment recognition in Lex, again

Ian F. Darwin ian at utcsstat.UUCP
Sat May 5 13:24:49 AEST 1984


	From: anderson at uwvax.ARPA
	
	I have received several replies to my request for a lex expression
	to recognize /* ... */ comments.  The only one that works (sent in
	by Jim Hogue) is
	
	"/*"([^*]*"*"*"*"[^/*])*[^*]*"*"*"*/"
	
	which I can't claim to fully understand.  Nor do I understand why my
	original,  "/*"([^*]|("*"/[^/]))*"*/", doesn't work.  The idea is that
	each character in the string between /* and */ can either be something
	other than *, or * followed by something other than /.
	
	Can anyone come up with an expression simpler than Hogue's that works?
	By "works", I mean put it in a real "lex" program, as in:
	
	(your expr)	printf("recognized (%s)\n",yytext);
	
	and try it on inputs such as /***/, /*/*/, etc.
	
	-- David Anderson (uwvax!anderson)

It's not clear what the goal of this exercise actually is.
On the assumption that you are trying to build part of a compiler,
here is a simple, *readable* code fragment which inputs a C program
and eats all the comments, which is what you do in a real
compiler. I have no wish to strain my eyes on Jim Hogue's excellent
APL program (nor the original, for that matter), so I wrote this.
It's actually more complex than it needs to be; this is more a
comment on the simple-mindedness of lex than on my coding style
(and no, I have nothing better to offer than lex; flames to /dev/null
given that I do acknowledge lex's authors' accomplishments).

Our approach is to use the lex ``start condition''. There can
be many start conditions defined, but only one of them can
be active at any time, and you can neither turn off a start
condition (you can only say BEGIN 0, which returns to ``the
normal state''), nor use rule 0 in the <...> prefix.

With that in mind, here is lexcom.l:
	%START INCOM  NOTIN
	%%
	<INCOM>"*/"	{BEGIN NOTIN;}
	<INCOM>"/*"	unput('*');
	<INCOM>.	;
	<NOTIN>"/*"	{BEGIN INCOM;}
	"/*"		BEGIN INCOM;
	<NOTIN>.	ECHO;
	%%

In order to understand this, read carefully section 10 of the Lex manual.
If you think you have a better (shorter) way, **try it before you post it**
because the most obvious optimisations do not work! Geoff Collyer
and I spent some time interpreting the manual and making the
minimum test case that would work *correctly*.
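
For anyone without the Lex manual handy, the effect of the two start
conditions can be sketched in plain C. This is an illustration under
assumptions, not what lex generates: the names strip_comments and
enum state are invented here, and the function works on a string in
memory (rather than on the input stream) only so that it is easy to
try. Like lexcom.l, it keeps newlines that fall inside a comment and
discards the rest of the comment. The scanner itself never looks at
more than two characters at a time.

```c
#include <stddef.h>

/* The two states, named after the start conditions in lexcom.l. */
enum state { NOTIN, INCOM };

/*
 * Copy src to dst, dropping comment text.  Newlines inside a comment
 * are kept, mirroring the lex version, where <INCOM>. does not match
 * '\n' and the default rule echoes it.  dst must have room for
 * strlen(src) + 1 bytes.  Returns the number of bytes written.
 * (Hypothetical helper, for illustration only.)
 */
size_t strip_comments(const char *src, char *dst)
{
    enum state st = NOTIN;
    size_t n = 0;

    while (*src != '\0') {
        if (st == NOTIN && src[0] == '/' && src[1] == '*') {
            st = INCOM;             /* like: BEGIN INCOM */
            src += 2;
        } else if (st == INCOM && src[0] == '*' && src[1] == '/') {
            st = NOTIN;             /* like: BEGIN NOTIN */
            src += 2;
        } else {
            /* outside a comment copy everything; inside, only '\n' */
            if (st == NOTIN || *src == '\n')
                dst[n++] = *src;
            src++;
        }
    }
    dst[n] = '\0';
    return n;
}
```

A stream version would presumably do the same with getchar() and one
character of lookahead, which is why no large buffer is needed.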

While our version is not as short as Jim Hogue's submission,
	%%
	"/*"([^*]*"*"*"*"[^/*])*[^*]*"*"*"*/"	;
	.	ECHO;
	%%
(I've made it into a full program) it has the following advantages:
	1) it handles multi-line comments correctly.
	2) it does not overflow lex's input buffer (see the
	admonition in section 5 of the Lex manual:
	``Don't try to defeat this with expressions like ...
	... or equivalents; the Lex-generated program will try to read
	the entire input file, causing internal buffer overflows.'').
	This is what the APL version does.

	Nor does it suffice to simply ``expand Lex's buffers''.
	This will not work on some machines and certainly
	not on binary-only systems (they do exist!).
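
The one-line expression can also be tried outside lex. The sketch
below assumes a POSIX system with <regex.h>; the pattern is a hand
translation of the lex expression into an extended regular expression
(the quoted "*" becomes \*), anchored at both ends, and the name
is_whole_comment is invented for this example. It only shows which
strings the pattern accepts. It says nothing about lex's buffer
behaviour, which is the real problem described above.

```c
#include <regex.h>
#include <stddef.h>

/*
 * The one-line lex expression rewritten as a POSIX extended regex and
 * anchored at both ends.  The translation is an assumption of this
 * sketch: lex's quoted "*" becomes \* and the quotes are dropped.
 */
static const char hogue_ere[] =
    "^/\\*([^*]*\\**\\*[^/*])*[^*]*\\**\\*/$";

/*
 * Return 1 if s is exactly one comment, 0 if it is not,
 * and -1 if the pattern fails to compile.
 */
int is_whole_comment(const char *s)
{
    regex_t re;
    int rc;

    if (regcomp(&re, hogue_ere, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, s, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With this, is_whole_comment("/***/") and is_whole_comment("/*/*/")
should both return 1, while "/*/" is rejected.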

Here are the test cases used. Our program passes both.
'To pass' means to do the same thing that CPP does.
(CPP is the zeroth pass of the C compiler, the preprocessor,
which normally eats the comments from live C programs
when they are being compiled.) I diffed the output as follows:
	cc -P t.c	# produces t.i
	lexcom <t.c >t.i2
	diff t.i t.i2
(where t.c contains the first test case) and got no differences.
The second test case produces nothing but null lines
(cpp and our program) and a core dump (the one line program).

1) A simple C fragment:
	/************************************/
	/* this is a test program */
	/**/int i=2; /* initialise an int */
	/*/*/int j=3; /* init an int */
	/*
	bletch
	end of file in the middle of a comment - a good test.

2) A longer fragment, but still valid C:
	(echo '/*'; cat /etc/passwd; echo '*/') | a.out
(assuming that */ does not appear in /etc/passwd at any
given moment; ours doesn't). Our program produced many
newlines, as did CPP. The one-liner dumped core!

Is there a moral in all this? I think it's that a
program that is a few lines longer, but is readable
and conforms to the manual, is in fact a better program.

Ian Darwin, Toronto Canada {ihnp4|decvax}!utcsstat!ian
-- 
Ian Darwin, Toronto   uucp: utcsstat!ian   Arpa: decvax!utcsstat!ian at Berkeley


