Text Processing Question

Mon Mar 18 15:19:09 AEST 1991

From the keyboard of goer at ellis.uchicago.edu (Richard L. Goerwitz):
:In article <31134 at usc> rkumar at buddha.usc.edu (C.P. Ravikumar) writes:
:
:>I was wondering if there is a utility to check
:>for repetition of words in a document....
:>
:>I have the feeling this can be done using "awk".
:
:The hard part, as always, is settling on a field separator -

Perhaps.  I always thought the hard part was catching pairs of words that
extend over line boundaries.  Here's a perl version that catches these,
although I admit it's probably overkill to suck up the whole file into
memory before munging it.  Works fine on my machine. :-)

Here's the output when run on my C compiler man page:

/usr/man/man1/cc.1:
   39 compiler. Certain extensions, notably the [* long long *] type,
   57 Forces language and library interpretation based on [* the the *] original
  770 Each library has a profiled version whose name is formed [* by
  771 by *] inserting \(lq_p\(rq before the \(lq.a\(rq.

The precise definition of what constitutes a repeated word (and what
legit separators are) will vary according to tastes.  I chose identifier-
like tokens separated by white space.  Speed (and definitely memory)
optimizations are certainly possible, but this does the job well enough
for me.  The program (not line noise :-) follows:

--tom

#!/usr/bin/perl
undef $/; $* = 1; # process whole file
while ( $ARGV = shift ) { 
    if (!open ARGV) { warn "$ARGV: $!\n"; next; } 
    $_ = <>;
    s/\b(\s?)(([a-z]\w*)(\s+\3)+\b)/$1\200$2\200/g || next;
    split(/\n/);
    $n = 0; @hits = ();
    for (@_) { $n++; push(@hits, sprintf("%5d %s", $n, $_)) if /\200/; } 
    $_ = join("\n", @hits);
    s/\200([^\200]+)\200/[* $1 *]/g;
    print "$ARGV:\n$_\n";
}
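
For anyone trying this on a current Perl: the script above is written for
Perl 4, and later Perl 5 releases dropped both $* and the implicit split
into @_. What follows is only a sketch of the same approach under strict
and warnings, not a drop-in replacement. The sub name find_repeats and the
\x01 sentinel are my own assumptions; \x01 stands in for the original \200
and is assumed never to occur in the input. Like the original, it matches
only words that start with a lowercase letter, and it reports only lines
that carry a marker.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return a report of the numbered lines that contain a repeated word,
# with each run wrapped in [* ... *]. Returns '' when nothing repeats.
# find_repeats is a hypothetical name, not part of the original post.
sub find_repeats {
    my ($text) = @_;

    # An identifier-like word followed by one or more whitespace-separated
    # copies of itself. \s+ also matches newlines, so runs that straddle a
    # line boundary are caught. \x01 is an assumed-unused sentinel byte.
    $text =~ s/\b(([a-z]\w*)(?:\s+\2)+)\b/\x01$1\x01/g
        or return '';

    my $n = 0;
    my @hits;
    for my $line (split /\n/, $text) {
        $n++;
        # Keep only the lines that carry a sentinel, numbered as in the file.
        push @hits, sprintf("%5d %s", $n, $line) if index($line, "\x01") >= 0;
    }

    # Turn each sentinel pair into visible brackets; [^\x01]+ may span the
    # newline between two reported lines.
    my $report = join "\n", @hits;
    $report =~ s/\x01([^\x01]+)\x01/[* $1 *]/g;
    return $report;
}

# Run as a script: slurp each file named on the command line.
unless (caller) {
    local $/;    # undefined record separator: read the whole file at once
    for my $file (@ARGV) {
        open my $fh, '<', $file or do { warn "$file: $!\n"; next };
        my $report = find_repeats(<$fh> // '');
        close $fh;
        print "$file:\n$report\n" if length $report;
    }
}
```

Saved under any name you like and run with file names as arguments, it
should print the same kind of report as shown above for cc.1. Adding /i
to the first substitution would also catch pairs like "The the", at the
cost of departing from the original's lowercase-only rule.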