v07i021: Ed(1)/regex(3)-compatible reg. exp. package

sources-request at mirror.UUCP sources-request at mirror.UUCP
Wed Sep 10 07:16:06 AEST 1986


Submitted by: ihnp4!yetti!oz (Ozan Yigit)
Mod.sources: Volume 7, Issue 21
Archive-name: regexp

#!/bin/sh
# This is a shell archive, meaning:
# 1. Remove everything above the #!/bin/sh line.
# 2. Save the resulting text in a file.
# 3. Execute the file with /bin/sh (not csh) to create the files:
#	README.txt
#	regex.3
#	regex.c
#	re_fail.c
#	grep.c
#	makefile
# This archive created: Thu Aug 14 17:41:27 1986
export PATH; PATH=/bin:$PATH
echo shar: extracting "'README.txt'" '(1925 characters)'
if test -f 'README.txt'
then
	echo shar: over-writing existing file "'README.txt'"
fi
sed 's/^X//' << \SHAR_EOF > 'README.txt'
XThis is a pre-release of a bunch of berkelix regex(3)/ed(1)
Xcompatible regular-expression routines.
X
XThese routines are completely public domain. You can do whatever
Xyou like with them, and hopefully you are professional enough not
Xto strip out the authorship information, acknowledgements and
Xreferences.
X
XThe reason for this being a *pre-release* is that I received a lot
Xof useful suggestions about packaging, about additional routines etc.
Xfrom a few people. I do not have too much time to do those changes
Xright now, so I am putting this package out for those who needed
Xit yesterday. Next release will include other routines, and better
Xpackaging.
X
XThese routines are *not* tested under SysV, but they are tested
Xunder PRO/Venix (2.0) and BSD 4.2.
X
XThose uEmacs V30 hackers should take notice of this package,
Xas well as those who would love to replace the "restricted" regex
Xroutines in Jove. [Since parts of these routines are based on Conroy's
Xoriginal work, it would be only fitting to put these into uEmacs.]
X
XIn general, these routines run just as fast, or faster than regex library
Xroutines under BSD 4.2. In some cases, they are slightly slower. I did not
Xtry too hard to optimize the re_exec routine.
X
XCoding style is a la K&R, with lotsa short identifiers. I like it
Xthat way. All flames should be fed to yetti!dragon.
X
XAcknowledgements: Henry Spencer, Hugh Redelmeier and Drew Sullivan made
X		  a lot of important suggestions, some of which will be
X		  incorporated into the next version.
X
X
XPackage:
X	----------------------------------------------------
X	README.txt		this file.
X
X	regex.3			comprehensive man page
X	regex.c			goodies
X
X	re_fail.c		sample failure routine
X
X	grep.c			a tiny grep for timing tests
X
X	makefile		guess.
X	----------------------------------------------------
X
Xenjoy..		  oz	[...!utzoo!yetti!oz]
X			[oz at yusol.BITNET || oz at yuyetti.BITNET]
X
X			Dept. of Computer Science
X			York University
SHAR_EOF
if test 1925 -ne "`wc -c 'README.txt'`"
then
	echo shar: error transmitting "'README.txt'" '(should have been 1925 characters)'
fi
echo shar: extracting "'regex.3'" '(7847 characters)'
if test -f 'regex.3'
then
	echo shar: over-writing existing file "'regex.3'"
fi
sed 's/^X//' << \SHAR_EOF > 'regex.3'
X.TH regex 3 local
X.DA Jun 19 1986
X.SH NAME
Xre_comp, re_exec, re_subs, re_modw, re_fail  \- regular expression handling
X.SH ORIGIN
XDept. of Computer Science
X.br
XYork University
X.SH SYNOPSIS
X.B char *re_comp(pat)
X.br
X.B char *pat;
X.PP
X.B re_exec(str)
X.br
X.B char *str;
X.PP
X.B re_subs(src, dst)
X.br
X.B char *src;
X.br
X.B char *dst;
X.PP
X.B void re_fail(msg, op)
X.br
X.B char *msg;
X.br
X.B char op;
X.PP
X.B void re_modw(str)
X.br
X.B char *str;
X
X.SH DESCRIPTION
X.PP
XThese functions implement
X.IR ed (1)-style
Xpartial regular expressions and supporting facilities.
X.PP
X.I Re_comp
Xcompiles a pattern string into an internal form (a deterministic finite-state
Xautomaton) to be executed by
X.I re_exec
Xfor pattern matching.
X.I Re_comp
Xreturns 0 if the pattern is compiled successfully, otherwise it returns an
Xerror message string. If
X.I re_comp
Xis called with a 0 or a \fInull\fR string, it returns without changing the
Xcurrently compiled regular expression.
X.sp
X.I Re_comp
Xsupports the same limited set of
X.I regular expressions
Xfound in
X.I ed
Xand Berkeley
X.IR regex (3)
Xroutines:
X.sp
X.if n .in +1.6i
X.if t .in +1i
X.de Ti
X.if n .ti -1.6i
X.if t .ti -1i
X.. 
X.if n .ta 0.8i +0.8i +0.8i
X.if t .ta 0.5i +0.5i +0.5i
X.Ti 
X[1]	\fIchar\fR	Matches itself, unless it is a special
Xcharacter (meta-character): \fB. \\ [ ] * + ^ $\fR
X
X.Ti 
X[2]	\fB.\fR	Matches \fIany\fR character.
X
X.Ti 
X[3]	\fB\\\fR	Matches the character following it, except
Xwhen followed by a digit 1 to 9, \fB(\fR, fB)\fR, \fB<\fR or \fB>\fR.
X(see [7], [8] and [9]) It is used as an escape character for all 
Xother meta-characters, and itself. When used
Xin a set ([4]), it is treated as an ordinary
Xcharacter.
X
X.Ti
X[4]	\fB[\fIset\fB]\fR	Matches one of the characters in the set.
XIf the first character in the set is \fB^\fR,
Xit matches a character NOT in the set. A
Xshorthand 
X.IR S - E
Xis used to specify a set of
Xcharacters 
X.I S 
Xup to 
X.IR E , 
Xinclusive. The special
Xcharacters \fB]\fR and \fB-\fR have no special
Xmeaning if they appear as the first chars
Xin the set.
X.nf
X	examples:	match:
X	[a-z]		any lowercase alpha
X	[^]-]		any char except ] and -
X	[^A-Z]		any char except 
X			uppercase alpha
X	[a-zA-Z0-9]	any alphanumeric
X.fi
X
X.Ti 
X[5]	\fB*\fR	Any regular expression form [1] to [4], followed by
Xclosure char (*) matches zero or more matches of
Xthat form.
X
X.Ti 
X[6]	\fB+\fR	Same as [5], except it matches one or more.
X
X.Ti 
X[7]		A regular expression in the form [1] to [10], enclosed
Xas \\(\fIform\fR\\) matches what form matches. The enclosure
Xcreates a set of tags, used for [8] and for
Xpattern substitution in
X.I re_subs. 
XThe tagged forms are numbered
Xstarting from 1.
X
X.Ti 
X[8]		A \\ followed by a digit 1 to 9 matches whatever a
Xpreviously tagged regular expression ([7]) matched.
X
X.Ti
X[9]	\fB\\<\fR	Matches the beginning of a \fIword\fR,
Xthat is, an empty string followed by a
Xletter, digit, or _ and not preceded by
Xa letter, digit, or _ .
X.Ti
X	\fB\\>\fR	Matches the end of a \fIword\fR,
Xthat is, an empty string preceded
Xby a letter, digit, or _ , and not
Xfollowed by a letter, digit, or _ .
X
X.Ti 
X[10]		A composite regular expression 
X\fIxy\fR where \fIx\fR and \fIy\fR
Xare in the form of [1] to [10] matches the longest
Xmatch of \fIx\fR followed by a match for \fIy\fR.
X
X.Ti 
X[11]	\fB^ $\fR	a regular expression starting with a \fB^\fR character
Xand/or ending with a \fB$\fR character, restricts the
Xpattern matching to the beginning of the line,
Xand/or the end of line [anchors]. Elsewhere in the
Xpattern, \fB^\fR and \fB$\fR are treated as ordinary characters.
X.if n .in -1.6i
X.if t .in -1i
X
X.PP
X.I Re_exec
Xexecutes the internal form produced by
X.I re_comp
Xand searches the argument string for the regular expression described
Xby the internal
Xform. 
X.I Re_exec
Xreturns 1 if the last regular expression pattern is matched within the string,
X0 if no match is found. In case of an internal error (corrupted internal
Xform), 
X.I re_exec 
Xcalls the user-supplied
X.I re_fail
Xand returns 0.
X.PP
XThe strings passed to both
X.I re_comp
Xand
X.I re_exec
Xmay have trailing or embedded newline characters. The strings 
Xmust be terminated by nulls.
X.PP
X.I Re_subs
Xdoes
X.IR ed -style
Xpattern substitution, after a successful match is found by
X.I re_exec.
XThe source string parameter to
X.I re_subs
Xis copied to the destination string with the following interpretation;
X.sp
X.if n .in +1.6i
X.if t .in +1i
X.Ti
X[1]	&	Substitute the entire matched string in the destination.
X
X.Ti
X[2]	\\\fIn\fR	Substitute the substring matched by a tagged subpattern
Xnumbered \fIn\fR, where \fIn\fR is between 1 to 9, inclusive.
X
X.Ti
X[3]	\\\fIchar\fR	Treat the next character literally,
Xunless the character is a digit ([2]).
X.if n .in -1.6i
X.if t .in -1i
X
X.PP
XIf the copy operation with the substitutions is successful,
X.I re_subs
Xreturns 1.
XIf the source string is corrupted, or the last call to
X.I re_exec
Xfails, it returns 0.
X
X.I Re_modw
Xis used to 
Xadd new characters into an internal table to
Xchange the re_exec's understanding of what
Xa \fIword\fR should look like, when matching with \fB\\<\fR and \fB\\>\fR
Xconstructs. If the string parameter is 0 or null string,
Xthe table is reset back to the default, which contains \fBA-Z a-z 0-9 _\fR .
X
X.I Re_fail
Xis a user-supplied routine to handle internal errors.
X.I re_exec
Xcalls
X.I re_fail
Xwith an error message string, and the opcode character that caused the error.
XThe default
X.I re_fail
Xroutine simply prints the message and the opcode character to
X.I stderr
Xand invokes
X.IR exit (2).
X.SH EXAMPLES
XIn the examples below, the
X.I dfaform
Xdescribes the internal form after the pattern is compiled. For additional
Xdetails, refer to the sources.
X.PP
X.ta 0.5i +0.5i +0.5i
X.nf
Xfoo*.*
X	dfaform:	CHR f CHR o CLO CHR o END CLO ANY END END
X	matches:	\fIfo foo fooo foobar fobar foxx ...\fR
X
Xfo[ob]a[rz]
X	dfaform:	CHR f CHR o CCL 2 o b CHR a CCL 2 r z END
X	matches:	\fIfobar fooar fobaz fooaz\fR
X
Xfoo\\\\+
X	dfaform:	CHR f CHR o CHR o CHR \\ CLO CHR \\ END END
X	matches:	\fIfoo\\ foo\\\\ foo\\\\\\  ...\fR
X
X\\(foo\\)[1-3]\\1	(same as foo[1-3]foo, but takes less internal space)
X	dfaform:	BOT 1 CHR f CHR o CHR o EOT 1 CCL 3 1 2 3 REF 1 END
X	matches:	\fIfoo1foo foo2foo foo3foo\fR
X
X\\(fo.*\\)-\\1
X	dfaform:	BOT 1 CHR f CHR o CLO ANY END EOT 1 CHR - REF 1 END
X	matches:	\fIfoo-foo fo-fo fob-fob foobar-foobar ...\fR
X.SH DIAGNOSTICS
X.I Re_comp
Xreturns one of the following strings if an error occurs:
X.PP
X.nf
X.in +0.5i
X\fINo previous regular expression,
XEmpty closure,
XIllegal closure,
XCyclical reference,
XUndetermined reference,
XUnmatched \e(,
XMissing ],
XNull pattern inside \e(\e),
XNull pattern inside \e<\e>,
XToo many \e(\e) pairs,
XUnmatched \e)\fP.
X.in -0.5i
X.fi
X.SH REFERENCES
X.if n .ta 3i
X.if t .ta 1.8i
X.nf
X\fISoftware tools\fR	Kernighan & Plauger
X\fISoftware tools in Pascal\fR	Kernighan & Plauger
X\fIGrep sources\fR [rsx-11 C dist]	David Conroy
X\fIEd - text editor\fR	Unix Programmer's Manual
X\fIAdvanced editing on Unix\fR	B. W. Kernighan
X\fIRegExp sources\fR	Henry Spencer
X.fi
X.SH "HISTORY AND NOTES"
XThese routines are derived from various implementations
Xfound in 
X.I "Software Tools"
Xbooks, and David Conroy's 
X.I grep. 
XThey are NOT derived from licensed/restricted software.
XFor more interesting/academic/complicated implementations,
Xsee Henry Spencer's 
X.I regexp 
Xroutines (V8), or 
X.I "GNU Emacs"
Xpattern
Xmatching module.
X.PP
XThe
X.I re_comp
Xand
X.I re_exec
Xroutines perform
X.I almost
Xas well as their licensed counterparts, sometimes better. 
XIn very few instances, they
Xare about 10% to 15% slower.
X.SH AUTHOR
XOzan S. Yigit (oz)
X.br
Xusenet: utzoo!yetti!oz
X.br
Xbitnet: oz at yusol || oz at yuyetti
X.SH "SEE ALSO"
Xed(1), ex(1), egrep(1), fgrep(1), grep(1), regex(3)
X.SH BUGS
XThese routines are \fIPublic Domain\fR. You can get them
Xin source.
X.br
XThe internal storage for the \fIdfa form\fR is not checked for
Xoverflows. Currently, it is 1024 bytes.
X.br
XOthers, no doubt.
SHAR_EOF
if test 7847 -ne "`wc -c 'regex.3'`"
then
	echo shar: error transmitting "'regex.3'" '(should have been 7847 characters)'
fi
echo shar: extracting "'regex.c'" '(18779 characters)'
if test -f 'regex.c'
then
	echo shar: over-writing existing file "'regex.c'"
fi
sed 's/^X//' << \SHAR_EOF > 'regex.c'
X/*
X * regex - Regular expression pattern matching
X *         and replacement
X *
X *
X * By:  Ozan S. Yigit (oz)
X *      Dept. of Computer Science
X *      York University
X *
X *
X * These routines are the PUBLIC DOMAIN equivalents 
X * of regex routines as found in 4.nBSD UN*X, with minor
X * extensions.
X *
X * These routines are derived from various implementations
X * found in software tools books, and Conroy's grep. They
X * are NOT derived from licensed/restricted software.
X * For more interesting/academic/complicated implementations,
X * see Henry Spencer's regexp routines, or GNU Emacs pattern
X * matching module.
X *
X * Routines:
X *      re_comp:        compile a regular expression into
X *                      a DFA.
X *
X *			char *re_comp(s)
X *			char *s;
X *
X *      re_exec:        execute the DFA to match a pattern.
X *
X *			int re_exec(s)
X *			char *s;
X *
X *	re_modw		change re_exec's understanding of what
X *			a "word" looks like (for \< and \>)
X *			by adding into the hidden word-character 
X *			table.
X *
X *			void re_modw(s)
X *			char *s;
X *
X *      re_subs:	substitute the matched portions in
X *              	a new string.
X *
X *			int re_subs(src, dst)
X *			char *src;
X *			char *dst;
X *
X *	re_fail:	failure routine for re_exec.
X *
X *			void re_fail(msg, op)
X *			char *msg;
X *			char op;
X *  
X * Regular Expressions:
X *
X *      [1]     char    matches itself, unless it is a special
X *                      character (metachar): . \ [ ] * + ^ $
X *
X *      [2]     .       matches any character.
X *
X *      [3]     \       matches the character following it, except
X *			when followed by a left or right round bracket,
X *			a digit 1 to 9 or a left or right angle bracket. 
X *			(see [7], [8] and [9])
X *			It is used as an escape character for all 
X *			other meta-characters, and itself. When used
X *			in a set ([4]), it is treated as an ordinary
X *			character.
X *
X *      [4]     [set]   matches one of the characters in the set.
X *                      If the first character in the set is "^",
X *                      it matches a character NOT in the set. A
X *                      shorthand S-E is used to specify a set of
X *                      characters S upto E, inclusive. The special
X *                      characters "]" and "-" have no special
X *                      meaning if they appear as the first chars
X *                      in the set.
X *                      examples:        match:
X *
X *                              [a-z]    any lowercase alpha
X *
X *                              [^]-]    any char except ] and -
X *
X *                              [^A-Z]   any char except uppercase
X *                                       alpha
X *
X *                              [a-zA-Z] any alpha
X *
X *      [5]     *       any regular expression form [1] to [4], followed by
X *                      closure char (*) matches zero or more matches of
X *                      that form.
X *
X *      [6]     +       same as [5], except it matches one or more.
X *
X *      [7]             a regular expression in the form [1] to [10], enclosed
X *                      as \(form\) matches what form matches. The enclosure
X *                      creates a set of tags, used for [8] and for
X *                      pattern substution. The tagged forms are numbered
X *			starting from 1.
X *
X *      [8]             a \ followed by a digit 1 to 9 matches whatever a
X *                      previously tagged regular expression ([7]) matched.
X *
X *	[9]	\<	a regular expression starting with a \< construct
X *		\>	and/or ending with a \> construct, restricts the
X *			pattern matching to the beginning of a word, and/or
X *			the end of a word. A word is defined to be a character
X *			string beginning and/or ending with the characters
X *			A-Z a-z 0-9 and _. It must also be preceded and/or
X *			followed by any character outside those mentioned.
X *
X *      [10]            a composite regular expression xy where x and y
X *                      are in the form [1] to [10] matches the longest
X *                      match of x followed by a match for y.
X *
X *      [11]	^	a regular expression starting with a ^ character
X *		$	and/or ending with a $ character, restricts the
X *                      pattern matching to the beginning of the line,
X *                      or the end of line. [anchors] Elsewhere in the
X *			pattern, ^ and $ are treated as ordinary characters.
X *
X *
X * Acknowledgements:
X *
X *	HCR's Hugh Redelmeier has been most helpful in various
X *	stages of development. He convinced me to include BOW
X *	and EOW constructs, originally invented by Rob Pike at
X *	the University of Toronto.
X *
X * References:
X *              Software tools			Kernighan & Plauger
X *              Software tools in Pascal        Kernighan & Plauger
X *              Grep [rsx-11 C dist]            David Conroy
X *		ed - text editor		Un*x Programmer's Manual
X *		Advanced editing on Un*x	B. W. Kernighan
X *		RegExp routines			Henry Spencer
X *
X * Notes:
X *
X *	This implementation uses a bit-set representation for character
X *	classes for speed and compactness. Each character is represented 
X *	by one bit in a 128-bit block. Thus, CCL or NCL always takes a 
X *	constant 16 bytes in the internal dfa, and re_exec does a single
X *	bit comparison to locate the character in the set.
X *
X * Examples:
X *
X *	pattern:	foo*.*
X *	compile:	CHR f CHR o CLO CHR o END CLO ANY END END
X *	matches:	fo foo fooo foobar fobar foxx ...
X *
X *	pattern:	fo[ob]a[rz]	
X *	compile:	CHR f CHR o CCL 2 o b CHR a CCL bitset END
X *	matches:	fobar fooar fobaz fooaz
X *
X *	pattern:	foo\\+
X *	compile:	CHR f CHR o CHR o CHR \ CLO CHR \ END END
X *	matches:	foo\ foo\\ foo\\\  ...
X *
X *	pattern:	\(foo\)[1-3]\1	(same as foo[1-3]foo)
X *	compile:	BOT 1 CHR f CHR o CHR o EOT 1 CCL bitset REF 1 END
X *	matches:	foo1foo foo2foo foo3foo
X *
X *	pattern:	\(fo.*\)-\1
X *	compile:	BOT 1 CHR f CHR o CLO ANY END EOT 1 CHR - REF 1 END
X *	matches:	foo-foo fo-fo fob-fob foobar-foobar ...
X * 
X */
X
X#define MAXDFA  1024
X#define MAXTAG  10
X
X#define OKP     1
X#define NOP     0
X
X#define CHR     1
X#define ANY     2
X#define CCL     3
X#define NCL     4
X#define BOL     5
X#define EOL     6
X#define BOT     7
X#define EOT     8
X#define BOW	9
X#define EOW	10
X#define REF     11
X#define CLO     12
X
X#define END     0
X
X/*
X * The following defines are not meant
X * to be changeable. They are for readibility
X * only.
X *
X */
X#define MAXCHR	128
X#define CHRBIT	8
X#define BITBLK	MAXCHR/CHRBIT
X#define BLKIND	0170
X#define BITIND	07
X
X#define ASCIIB	0177
X
Xtypedef /*unsigned*/ char CHAR;
X
Xstatic int  tagstk[MAXTAG];             /* subpat tag stack..*/
Xstatic CHAR dfa[MAXDFA];		/* automaton..       */
Xstatic int  sta = NOP;               	/* status of lastpat */
X
Xstatic CHAR bittab[BITBLK];		/* bit table for CCL */
X
Xstatic void
Xchset(c) register CHAR c; { bittab[((c)&BLKIND)>>3] |= 1<<((c)&BITIND); }
X
X#define badpat(x)	return(*dfa = END, x)
X#define store(x)	*mp++ = x
X 
Xchar *     
Xre_comp(pat)
Xchar *pat;
X{
X	register char *p;               /* pattern pointer   */
X	register CHAR *mp=dfa;          /* dfa pointer       */
X	register CHAR *lp;              /* saved pointer..   */
X	register CHAR *sp=dfa;          /* another one..     */
X
X	register int tagi = 0;          /* tag stack index   */
X	register int tagc = 1;          /* actual tag count  */
X
X	register int n;
X	int c1, c2;
X		
X	if (!pat || !*pat)
X		if (sta)
X			return(0);
X		else
X			badpat("No previous regular expression");
X	sta = NOP;
X
X	for (p = pat; *p; p++) {
X		lp = mp;
X		switch(*p) {
X
X		case '.':               /* match any char..  */
X			store(ANY);
X			break;
X
X		case '^':               /* match beginning.. */
X			if (p == pat)
X				store(BOL);
X			else {
X				store(CHR);
X				store(*p);
X			}
X			break;
X
X		case '$':               /* match endofline.. */
X			if (!*(p+1))
X				store(EOL);
X			else {
X				store(CHR);
X				store(*p);
X			}
X			break;
X
X		case '[':               /* match char class..*/
X
X			if (*++p == '^') {
X				store(NCL);
X				p++;
X			}
X			else
X				store(CCL);
X
X			if (*p == '-')		/* real dash */
X				chset(*p++);
X			if (*p == ']')		/* real brac */
X				chset(*p++);
X			while (*p && *p != ']') {
X				if (*p == '-' && *(p+1) && *(p+1) != ']') {
X					p++;
X					c1 = *(p-2) + 1;
X					c2 = *p++;
X					while (c1 <= c2)
X						chset(c1++);
X				}
X#ifdef EXTEND
X				else if (*p == '\\' && *(p+1)) {
X					p++;
X					chset(*p++);
X				}
X#endif
X				else
X					chset(*p++);
X			}
X			if (!*p)
X				badpat("Missing ]");
X
X			for (n = 0; n < BITBLK; bittab[n++] = (char) 0)
X				store(bittab[n]);
X	
X			break;
X
X		case '*':               /* match 0 or more.. */
X		case '+':               /* match 1 or more.. */
X			if (p == pat)
X				badpat("Empty closure");
X			lp = sp;                /* previous opcode */
X			if (*lp == CLO)         /* equivalence..   */
X				break;
X			switch(*lp) {
X
X			case BOL:
X			case BOT:
X			case EOT:
X			case BOW:
X			case EOW:
X			case REF:
X				badpat("Illegal closure");
X			default:
X				break;
X			}
X
X			if (*p == '+')
X				for (sp = mp; lp < sp; lp++)
X					store(*lp);
X
X			store(END);
X			store(END);
X			sp = mp;
X			while (--mp > lp)
X				*mp = mp[-1];
X			store(CLO);
X			mp = sp;
X			break;
X
X		case '\\':              /* tags, backrefs .. */
X			switch(*++p) {
X
X			case '(':
X				if (tagc < MAXTAG) {
X					tagstk[++tagi] = tagc;
X					store(BOT);
X					store(tagc++);
X				}
X				else
X					badpat("Too many \\(\\) pairs");
X				break;
X			case ')':
X				if (*sp == BOT)
X					badpat("Null pattern inside \\(\\)");
X				if (tagi > 0) {
X					store(EOT);
X					store(tagstk[tagi--]);
X				}
X				else
X					badpat("Unmatched \\)");
X				break;
X			case '<':
X				store(BOW);
X				break;
X			case '>':
X				if (*sp == BOW)
X					badpat("Null pattern inside \\<\\>");
X				store(EOW);
X				break;
X			case '1':
X			case '2':
X			case '3':
X			case '4':
X			case '5':
X			case '6':
X			case '7':
X			case '8':
X			case '9':
X				n = *p-'0';
X				if (tagi > 0 && tagstk[tagi] == n)
X					badpat("Cyclical reference");
X				if (tagc > n) {
X					store(REF);
X					store(n);
X				}
X				else
X					badpat("Undetermined reference");
X				break;
X#ifdef EXTEND
X			case 'b':
X				store(CHR);
X				store('\b');
X				break;
X			case 'n':
X				store(CHR);
X				store('\n');
X				break;
X			case 'f':
X				store(CHR);
X				store('\f');
X				break;
X			case 'r':
X				store(CHR);
X				store('\r');
X				break;
X			case 't':
X				store(CHR);
X				store('\t');
X				break;
X#endif
X			default:
X				store(CHR);
X				store(*p);
X			}
X			break;
X
X		default :               /* an ordinary char  */
X			store(CHR);
X			store(*p);
X			break;
X		}
X		sp = lp;
X	}
X	if (tagi > 0)
X		badpat("Unmatched \\(");
X	store(END);
X	sta = OKP;
X	return(0);
X}
X
X
Xstatic char *bol;
Xstatic char *bopat[MAXTAG];
Xstatic char *eopat[MAXTAG];
Xchar *pmatch();
X
X/*
X * re_exec:
X * 	execute dfa to find a match.
X *
X *	special cases: (dfa[0])	
X *		BOL
X *			Match only once, starting from the
X *			beginning.
X *		CHR
X *			First locate the character without
X *			calling pmatch, and if found, call
X *			pmatch for the remaining string.
X *		END
X *			re_comp failed, poor luser did not
X *			check for it. Fail fast.
X *
X *	If a match is found, bopat[0] and eopat[0] are set
X *	to the beginning and the end of the matched fragment,
X *	respectively.
X *
X */
X
Xint
Xre_exec(lp)
Xregister char *lp;
X{
X	register char c;
X	register char *ep = 0;
X	register CHAR *ap = dfa;
X
X	bol = lp;
X
X	bopat[0] = 0;
X	bopat[1] = 0;
X	bopat[2] = 0;
X	bopat[3] = 0;
X	bopat[4] = 0;
X	bopat[5] = 0;
X	bopat[6] = 0;
X	bopat[7] = 0;
X	bopat[8] = 0;
X	bopat[9] = 0;
X
X	switch(*ap) {
X
X	case BOL:			/* anchored: match from BOL only */
X		ep = pmatch(lp,ap);
X		break;
X	case CHR:			/* ordinary char: locate it fast */
X		c = *(ap+1);
X		while (*lp && *lp != c)
X			lp++;
X		if (!*lp)		/* if EOS, fail, else fall thru. */
X			return(0);
X	default:			/* regular matching all the way. */
X		while (*lp) {
X			if ((ep = pmatch(lp,ap)))
X				break;
X			lp++;
X		}
X		break;
X	case END:			/* munged automaton. fail always */
X		return(0);
X	}
X	if (!ep)
X		return(0);
X
X	bopat[0] = lp;
X	eopat[0] = ep;
X	return(1);
X}
X
X/* 
X * pmatch: 
X *	internal routine for the hard part
X *
X * 	This code is mostly snarfed from an early
X * 	grep written by David Conroy. The backref and
X * 	tag stuff, and various other mods are by oZ.
X *
X *	special cases: (dfa[n], dfa[n+1])
X *		CLO ANY
X *			We KNOW ".*" will match ANYTHING
X *			upto the end of line. Thus, go to
X *			the end of line straight, without
X *			calling pmatch recursively. As in
X *			the other closure cases, the remaining
X *			pattern must be matched by moving
X *			backwards on the string recursively,
X *			to find a match for xy (x is ".*" and 
X *			y is the remaining pattern) where
X *			the match satisfies the LONGEST match
X *			for x followed by a match for y.
X *		CLO CHR
X *			We can again scan the string forward
X *			for the single char without recursion, 
X *			and at the point of failure, we execute 
X *			the remaining dfa recursively, as
X *			described above.
X *
X *	At the end of a successful match, bopat[n] and eopat[n]
X *	are set to the beginning and end of subpatterns matched
X *	by tagged expressions (n = 1 to 9).	
X *
X */
X
Xextern void re_fail();
X
X/*
X * character classification table for word boundary
X * operators BOW and EOW. the reason for not using 
X * ctype macros is that we can let the user add into 
X * our own table. see re_modw. This table is not in
X * the bitset form, since we may wish to extend it
X * in the future for other character classifications. 
X *
X *	TRUE for 0-9 A-Z a-z _
X */
Xstatic char chrtyp[MAXCHR] = {
X	0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
X	0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
X	0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
X	0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
X	0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 
X	1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 
X	0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 
X	1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
X	1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
X	1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 
X	1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
X	1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
X	1, 1, 1, 0, 0, 0, 0, 0
X	};
X
X#define inascii(x)	(0177&(x))
X#define iswordc(x) 	chrtyp[inascii(x)]
X#define isinset(x,y) 	((x)[((y)&BLKIND)>>3] & (1<<((y)&BITIND)))
X
X/*
X * skip values for CLO XXX to skip past the closure
X *
X */
X
X#define ANYSKIP	2 	/* CLO ANY END ...	   */
X#define CHRSKIP	3	/* CLO CHR chr END ...	   */
X#define CCLSKIP 18	/* CLO CCL 16bytes END ... */
X
Xstatic char *
Xpmatch(lp, ap)
Xregister char *lp;
Xregister CHAR *ap;
X{
X	register char *e;		/* extra pointer for CLO */
X	register char *bp;		/* beginning of subpat.. */
X	register char *ep;		/* ending of subpat..	 */
X	register int op, c, n;
X	char *are;			/* to save the line ptr. */
X
X	while ((op = *ap++) != END)
X		switch(op) {
X
X		case CHR:
X			if (*lp++ != *ap++)
X				return(0);
X			break;
X		case ANY:
X			if (!*lp++)
X				return(0);
X			break;
X		case CCL:
X			c = *lp++;
X			if (!isinset(ap,c))
X				return(0);
X			ap += BITBLK;
X			break;
X		case NCL:
X			c = *lp++;
X			if (isinset(ap,c))
X				return(0);
X			ap += BITBLK;
X			break;
X		case BOL:
X			if (lp != bol)
X				return(0);
X			break;
X		case EOL:
X			if (*lp)
X				return(0);
X			break;
X		case BOT:
X			bopat[*ap++] = lp;
X			break;
X		case EOT:
X			eopat[*ap++] = lp;
X			break;
X 		case BOW:
X			if (!(lp!=bol && iswordc(lp[-1])) && iswordc(*lp))
X				break;
X			return(0);
X		case EOW:
X			if ((lp!=bol && iswordc(lp[-1])) && !iswordc(*lp))
X				break;
X			return(0);
X		case REF:
X			n = *ap++;
X			bp = bopat[n];
X			ep = eopat[n];
X			while (bp < ep)
X				if (*bp++ != *lp++)
X					return(0);
X			break;
X		case CLO:
X			are = lp;
X			switch(*ap) {
X
X			case ANY:
X				while (*lp)
X					lp++;
X				n = ANYSKIP;
X				break;
X			case CHR:
X				c = *(ap+1);
X				while (*lp && c == *lp)
X					lp++;
X				n = CHRSKIP;
X				break;
X			case CCL:
X			case NCL:
X				while (*lp && (e = pmatch(lp, ap)))
X					lp = e;
X				n = CCLSKIP;
X				break;
X			default:
X				re_fail("closure: bad dfa.", *ap);
X				return(0);
X			}
X
X			ap += n;
X
X			while (lp >= are) {
X				if (e = pmatch(lp, ap))
X					return(e);
X				--lp;
X			}
X			return(0);
X		default:
X			re_fail("re_exec: bad dfa.", op);
X			return(0);
X		}
X	return(lp);
X}
X
X/*
X * re_modw:
X *	add new characters into the word table to
X *	change the re_exec's understanding of what
X *	a word should look like. Note that we only
X *	accept additions into the word definition.
X *
X *	If the string parameter is 0 or null string,
X *	the table is reset back to the default, which
X *	contains A-Z a-z 0-9 _. [We use the compact
X *	bitset representation for the default table]
X *
X */
X
Xstatic char deftab[16] = {	
X	0, 0, 0, 0, 0, 0, 377, 003, 376, 377, 377, 207,  
X	376, 377, 377, 007 
X}; 
X
Xvoid
Xre_modw(s)
Xregister char *s;
X{
X	register int i;
X
X	if (!s || !*s) {
X		for (i = 0; i < MAXCHR; i++)
X			if (!isinset(deftab,i))
X				iswordc(i) = 0;
X	}
X	else
X		while(*s)
X			iswordc(*s++) = 1;
X}
X
X/*
X * re_subs:
X *	substitute the matched portions of the src in
X *	dst.
X *
X *	&	substitute the entire matched pattern.
X *
X *	\digit	substitute a subpattern, with the given
X *		tag number. Tags are numbered from 1 to
X *		9. If the particular tagged subpattern
X *		does not exist, null is substituted.
X *
X */
Xint
Xre_subs(src, dst)
Xregister char *src;
Xregister char *dst;
X{
X	register char c;
X	register int  pin;
X	register char *bp;
X	register char *ep;
X
X	if (!*src || !bopat[0])
X		return(0);
X
X	while (c = *src++) {
X		switch(c) {
X
X		case '&':
X			pin = 0;
X			break;
X
X		case '\\':
X			c = *src++;
X			if (c >= '0' && c <= '9') {
X				pin = c - '0';
X				break;
X			}
X			
X		default:
X			*dst++ = c;
X			continue;
X		}
X
X		if ((bp = bopat[pin]) && (ep = eopat[pin])) {
X			while (*bp && bp < ep)
X				*dst++ = *bp++;
X			if (bp < ep)
X				return(0);
X		}
X	}
X	*dst = (char) 0;
X	return(1);
X}
X			
X#ifdef DEBUG
X/*
X * symbolic - produce a symbolic dump of the
X *            dfa
X */
Xsymbolic(s) 
Xchar *s;
X{
X	printf("pattern: %s\n", s);
X	printf("dfacode:\n");
X	dfadump(dfa);
X}
X
Xstatic	
Xdfadump(ap)
XCHAR *ap;
X{
X	register int n;
X
X	while (*ap != END)
X		switch(*ap++) {
X		case CLO:
X			printf("CLOSURE");
X			dfadump(ap);
X			switch(*ap) {
X			case CHR:
X				n = CHRSKIP;
X				break;
X			case ANY:
X				n = ANYSKIP;
X				break;
X			case CCL:
X			case NCL:
X				n = CCLSKIP;
X				break;
X			}
X			ap += n;
X			break;
X		case CHR:
X			printf("\tCHR %c\n",*ap++);
X			break;
X		case ANY:
X			printf("\tANY .\n");
X			break;
X		case BOL:
X			printf("\tBOL -\n");
X			break;
X		case EOL:
X			printf("\tEOL -\n");
X			break;
X		case BOT:
X			printf("BOT: %d\n",*ap++);
X			break;
X		case EOT:
X			printf("EOT: %d\n",*ap++);
X			break;
X		case BOW:
X			printf("BOW\n");
X			break;
X		case EOW:
X			printf("EOW\n");
X			break;
X		case REF:
X			printf("REF: %d\n",*ap++);
X			break;
X		case CCL:
X			printf("\tCCL [");
X			for (n = 0; n < MAXCHR; n++)
X				if (isinset(ap,(CHAR)n))
X					printf("%c",n);
X			printf("]\n");
X			ap += BITBLK;
X			break;
X		case NCL:
X			printf("\tNCL [");
X			for (n = 0; n < MAXCHR; n++)
X				if (isinset(ap,(CHAR)n))
X					printf("%c",n);
X			printf("]\n");
X			ap += BITBLK;
X			break;
X		default:
X			printf("bad dfa. opcode %o\n", ap[-1]);
X			exit(1);
X			break;
X		}
X}
X#endif
SHAR_EOF
if test 18779 -ne "`wc -c 'regex.c'`"
then
	echo shar: error transmitting "'regex.c'" '(should have been 18779 characters)'
fi
echo shar: extracting "'re_fail.c'" '(292 characters)'
if test -f 're_fail.c'
then
	echo shar: over-writing existing file "'re_fail.c'"
fi
sed 's/^X//' << \SHAR_EOF > 're_fail.c'
X
X#ifdef vms
X#include stdio
X#else
X#include <stdio.h>
X#endif
X
X/* 
X * re_fail:
X *	internal error handler for re_exec.
X *
X *	should probably do something like a
X *	longjump to recover gracefully.
X */ 
Xvoid	
Xre_fail(s, c)
Xchar *s;
Xchar c;
X{
X	fprintf(stderr, "%s [opcode %o]\n", s, c);
X	exit(1);
X}
SHAR_EOF
if test 292 -ne "`wc -c 're_fail.c'`"
then
	echo shar: error transmitting "'re_fail.c'" '(should have been 292 characters)'
fi
echo shar: extracting "'grep.c'" '(1014 characters)'
if test -f 'grep.c'
then
	echo shar: over-writing existing file "'grep.c'"
fi
sed 's/^X//' << \SHAR_EOF > 'grep.c'
X#ifdef vms
X#include stdio
X#else
X#include <stdio.h>
X#endif
X
X/*
X * Rudimentary grep to test regex routines.
X *
X * DEBUG is only applicable
X * to oZ version of regex. Make sure to 
X * compile regex.c with -DDEBUG as well.
X *
X */
Xextern char *re_comp();
Xextern re_exec();
X
Xmain(argc,argv)
Xchar *argv[];
X{
X	char str[512];
X	FILE *f;
X	register int n;
X	register char *p;
X
X	if (argc < 2)
X		error("usage: grep pat [file]");
X
X	if ((p = re_comp(argv[1])) != 0) {
X		printf("\t%s: %s\n", p, argv[1]);
X		exit(1);
X	}
X#ifdef DEBUG
X	symbolic(argv[1]);
X#endif
X	if (p = argv[2]) {
X		if ((f = fopen(p, "r")) == NULL) {
X			printf("cannot open %s\n", argv[2]);
X			exit(1);
X		}
X		while ((n = load(str, f)) != EOF)
X			if (re_exec(str))
X				printf("%s\n",str);
X
X	}
X	exit(0);
X}
Xload (s, f)
Xchar *s;
XFILE *f;
X{
X	register int c;
X	static int lineno = 0;
X
X	while ((c = getc(f)) != '\n' && c != EOF)
X		*s++ = c;
X	if (c == EOF)
X		return (EOF);
X	*s = (char) 0;
X	return (++lineno);
X}
X
Xerror(s) char *s ; 
X{ 
X	fprintf(stderr,"%s\n",s); 
X	exit(1); 
X}
SHAR_EOF
if test 1014 -ne "`wc -c 'grep.c'`"
then
	echo shar: error transmitting "'grep.c'" '(should have been 1014 characters)'
fi
echo shar: extracting "'makefile'" '(2472 characters)'
if test -f 'makefile'
then
	echo shar: over-writing existing file "'makefile'"
fi
sed 's/^X//' << \SHAR_EOF > 'makefile'
X#
X# regex routines (PUBLIC DOMAIN)
X#
X# by:	Ozan S. Yigit (oz)
X#	Dept. of Computer Science
X#	York University
X#
X# Applicable to BSD:
X#
X# If you have the generic(?) regex routines
X# than you can compare the timings of these
X# routines to the generic ones by:
X#
X#	make times
X#
X# which will create two rudimentary greps
X# lgrep and ogrep. lgrep will use the generic
X# regex routines, and ogrep will use oz version
X# of regex. Several patterns will be searched
X# in /usr/dict/words, and the output of the greps
X# will be compared. [for reasons of sanity]
X#
X# Surely, you will note, the time tests are somewhat
X# biased, since /usr/dict/words contain *short* lines,
X# thereby the real-life case of searching a complex
X# expression within a long line is not covered. You
X# will find, however, that the PD regex routines will
X# search *as fast* as the generic ones in most
X# cases, and about 10% slower in some cases, when
X# tested with files containing *long* lines. 
X# 
XCFLAGS = -O
X#
X# test patterns
X#
XPAT1 = '[d-f]zz*.*m'
XPAT2 = 'fo[ornt]*.*b[a-d]*'
XPAT3 = '.th.'
XPAT4 = '\(ab\)[a-d]*\1'
XPAT5 = 'burp'
X
XFILE = /usr/dict/words
XOUTD = /tmp/
X
XRSRC = regex.o re_fail.o
X
Xregex:  $(RSRC)
X#
X#	insert code to put these into a library
X#
Xrlint:
X	lint -phc regex.c
Xdebug:
X	cc -O -DDEBUG -o ogrep grep.c regex.c re_fail.c
X
Xlgrep:  grep.o
X	cc -o lgrep grep.o
X
Xogrep:  grep.o $(RSRC)
X	cc -o ogrep grep.o $(RSRC)
X
Xtimes:  lgrep ogrep
X	@echo generic regex vs oz regex
X#	@echo pattern: $(PAT1)
X	time ogrep $(PAT1) $(FILE) >$(OUTD)ogrep.out
X	time lgrep $(PAT1) $(FILE) >$(OUTD)lgrep.out
X	@echo output differences:
X	-diff $(OUTD)ogrep.out $(OUTD)lgrep.out
X	@echo "---"
X#	@echo pattern: $(PAT2)
X	time ogrep $(PAT2) $(FILE) >$(OUTD)ogrep.out
X	time lgrep $(PAT2) $(FILE) >$(OUTD)lgrep.out
X	@echo output differences:
X	-diff $(OUTD)ogrep.out $(OUTD)lgrep.out
X	@echo "---"
X#	echo pattern: $(PAT3)
X	time ogrep $(PAT3) $(FILE) >$(OUTD)ogrep.out
X	time lgrep $(PAT3) $(FILE) >$(OUTD)lgrep.out
X	@echo output differences:
X	-diff $(OUTD)ogrep.out $(OUTD)lgrep.out
X	@echo "---"
X#	echo pattern: $(PAT4)
X	time ogrep $(PAT4) $(FILE) >$(OUTD)ogrep.out
X	time lgrep $(PAT4) $(FILE) >$(OUTD)lgrep.out
X	@echo output differences:
X	-diff $(OUTD)ogrep.out $(OUTD)lgrep.out
X	@echo "---"
X#	echo pattern: $(PAT5)
X	time ogrep $(PAT5) $(FILE) >$(OUTD)ogrep.out
X	time lgrep $(PAT5) $(FILE) >$(OUTD)lgrep.out
X	@echo output differences:
X	-diff $(OUTD)ogrep.out $(OUTD)lgrep.out
X	@echo "---"
X	@rm $(OUTD)ogrep.out $(OUTD)lgrep.out
SHAR_EOF
if test 2472 -ne "`wc -c 'makefile'`"
then
	echo shar: error transmitting "'makefile'" '(should have been 2472 characters)'
fi
#	End of shell archive
exit 0



More information about the Mod.sources mailing list