comments in lex

Fri Jul 12 04:56:38 AEST 1985

Thanks for all the contributions on this subject.
Many people recommended embedding C in Lex to solve
my problem, pointing out some precedents for this.
This, of course, is tantamount to saying that Lex
can't hack it, and indeed amdahl!drivax!alan says
I shouldn't expect a finite state machine to do so.

Paul Haahr of Princeton offered a snappy answer:

"/*"([^*]|"*"[^/])*"*/"

but Glen Dudek of Harvard pointed out that this fails for

/***/, and it should have been:

"/*"("/"|("*"*[^*/]))*"*"+"/"

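The difference between the two patterns can be checked outside Lex. The following is a minimal sketch in C, assuming a POSIX system with <regex.h>; the helper matches_whole and the names FIRST_TRY and CORRECTED are made up for illustration, and the two strings are hand transcriptions of the Lex patterns into POSIX extended syntax (the first one read with its parentheses balanced), not anything Lex itself produces.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: return 1 if the whole of `text` matches the POSIX
   extended regular expression `pattern`, 0 if it does not, and -1 if the
   pattern fails to compile. */
static int matches_whole(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

/* Hand transcriptions of the two Lex patterns, anchored at both ends so
   that only a match of the entire string counts. */
static const char FIRST_TRY[] = "^/\\*([^*]|\\*[^/])*\\*/$";
static const char CORRECTED[] = "^/\\*(/|\\**[^*/])*\\*+/$";
```

Under these assumptions, matches_whole(FIRST_TRY, "/***/") should come back 0 and matches_whole(CORRECTED, "/***/") should come back 1, while both accept an ordinary comment such as "/* a */".
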
Several people pointed out the hazard of enormous block 
comments, and McQueer steered me to the use of START 
transitions, which I decided was sound advice when I 
discovered the yymore function.

I append the Lex code that I have arrived at, a program to
which you all contributed.  My special problem is a lexical
processor for a PL/1 pretty-printer, and in this environment
people typically have enormous comment blocks with some sort
of pattern or table in them.  Thus, although code can be torn 
to shreds and reformatted, block comments must be left undisturbed.
The problem of large blocks is solved by entering a "comment" mode
and tokenising each line separately.  The residual problem is that
the first line of such a block has to respect leading white space
even before the comment starts. This is solved by tokenising the
whole line, not just the comment part. Elsewhere, white space is
discarded.
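
The idea of a "comment" mode that handles one line at a time can be sketched in plain C, independently of Lex. This is only an illustration of the state carried from line to line, not the Lex program itself: the function name in_comment_after is hypothetical, and the sketch ignores string literals, which a real PL/1 scanner would have to respect.

```c
#include <string.h>

/* Hypothetical helper: given one line (without its newline) and a flag
   saying whether the scanner is already inside a comment, return 1 if the
   scanner is still inside a comment at the end of the line, else 0.
   No token ever spans more than one line, however long the comment is. */
static int in_comment_after(const char *line, int in_comment)
{
    const char *p = line;

    for (;;) {
        if (in_comment) {
            /* Look for the comment closer in the rest of the line. */
            const char *end = strstr(p, "*/");
            if (end == NULL)
                return 1;   /* comment continues onto the next line */
            p = end + 2;
            in_comment = 0;
        } else {
            /* Look for the next comment opener in the rest of the line. */
            const char *start = strstr(p, "/*");
            if (start == NULL)
                return 0;   /* ordinary code up to the end of the line */
            p = start + 2;
            in_comment = 1;
        }
    }
}
```

A caller would feed each input line in turn, passing the previous result back in as the flag, and leave a line untouched whenever the flag is set on entry or on exit.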

To my astonishment, a fault in Lex showed up which almost 
crippled me.  It is impossible to recognise just one \n
character under a START mode, although you can in normal mode.
Thus, my last rule looks for [\n]+ followed by any character and
then unputs that character.  Is this a known wart?

startcom \/\*
endcom \*\/
%START com maybecom

%%
{startcom} 	      {yymore();BEGIN com;}
[\n]+		      {printf("%s%u%c","nl(",yyleng,')');BEGIN maybecom;}
[ \t]+		      ;

<maybecom>[\ \t]*{startcom}  {yymore();BEGIN com;}

<com>[^\*\n]*{endcom}  {printf("%s%u%s%s%s","cm(",yyleng,",\"",yytext,"\")");BEGIN 0;}
<com>[^\*\n]*	    printf("%s%u%s%s%s","cm(",yyleng,",\"",yytext,"\")");
<com>[^\*\n]*\*	    yymore();
<com>[\n]+.	    {unput(yytext[yyleng-1]);printf("%s%u%c","nl(",yyleng-1,')');}

%%
int main() {while (yylex() != 0) ; return 0;}  /* yylex() returns 0 at end of input */
-- 
Ray Reeves,
CCA-UNIWORKS,20 William St,Wellesley, Ma. 02181. (617)235-2600
emacs!ray at CCA-UNIX