Pattern matching with awk

Michael Nolan nolan at tssi.UUCP
Tue Mar 5 03:56:54 AEST 1991


lin at CS.WMICH.EDU (Lite Lin) writes:


>  This is a simple question, but I don't see it in "Frequently Asked
>Questions", so...
>  I'm trying to identify all the email addresses in email messages, i.e.,
>patterns with the format user@node.  Now I can use grep/sed/awk to find
>those lines containing user@node, but I can't figure out from the manual
>how or whether I can have access to the matching pattern (it can be
>anywhere in the line, and it doesn't have to be surrounded by spaces,
>i.e., it's not necessarily a separate "field" in awk).

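If you have nawk or gawk, use the match function.  match(string, regexp)
returns the position where the match starts (0 if there is no match) and
sets two variables:

RSTART - the position in the string where the match begins (0 if no match)
RLENGTH - the length of the matched text (-1 if no match)

substr(string, RSTART, RLENGTH) then gives you the matched text itself.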
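
As a rough illustration of match, RSTART and RLENGTH, here is a small
sketch that pulls the first run of digits out of each line.  It assumes
the awk on your system is a "new" awk (nawk, gawk, or any POSIX awk); the
function name first_number and the sample input are made up for the
example.

```sh
# first_number is a hypothetical helper: it prints the first run of
# digits found on each input line, wherever it sits in the line.
first_number() {
    awk '{
        # match() returns 0 when nothing matches, so the line is skipped.
        if (match($0, /[0-9]+/))
            print substr($0, RSTART, RLENGTH)
    }'
}

printf '%s\n' 'order 1234 shipped' 'no digits here' | first_number
# prints: 1234
```

The matched text does not have to be a separate field; substr cuts it out
of the line by position.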

A pattern to match any single mail address might be rather ugly, though.
If you assume all the following:

1.  Upper case and lower case letters and digits are permitted
2.  Dash, underscore, and period are permitted
3.  There is only one @ [I'm not sure this assumption is valid, though!]
4.  There may be several ! or % in the 'user' portion
5.  No commas or spaces 

Then that gives a pattern something like this:

[a-zA-Z0-9.\-_%!]+@[a-zA-Z0-9.\-_]+

I've escaped the dash so that it isn't read as a range; putting it last
inside the brackets would do the same job.  I suppose it might be necessary
to escape other characters as well.  Have I left anything out that might
occur in strange but otherwise valid mail addresses?
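
For what it's worth, here is a sketch of how such a pattern could be used
with match to print every address on a line, not just the first.  It uses
the same character sets as above, but with the dash moved to the end of
each bracket expression instead of escaped, which should be portable
across awks.  The function name extract_addresses and the sample addresses
are invented for the example, and it assumes a POSIX awk.

```sh
# extract_addresses is a hypothetical helper: it prints each
# user@node match found on every input line, one per output line.
extract_addresses() {
    awk '{
        line = $0
        # Keep matching in whatever is left after the previous match.
        while (match(line, /[a-zA-Z0-9._%!-]+@[a-zA-Z0-9._-]+/)) {
            print substr(line, RSTART, RLENGTH)
            line = substr(line, RSTART + RLENGTH)
        }
    }'
}

printf '%s\n' 'To: host!user@example.com, cc <someone@example.org>' |
    extract_addresses
# prints: host!user@example.com
#         someone@example.org
```

Note that a period is allowed in the node part, so an address at the very
end of a sentence would pick up the trailing period too.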
------------------------------------------------------------------------------
Michael Nolan                              "Software means never having
Tailored Software Services, Inc.            to say you're finished."       
Lincoln, Nebraska (402) 423-1490            --J. D. Hildebrand in UNIX REVIEW
UUCP:      tssi!nolan (or try sparky!dsndata!tssi!nolan)
Internet:  nolan at helios.unl.edu (if you can't get the other address to work) 


