UNIX system to house 140 mbyte unformatted textual dbase?

crane grc at wjh12.UUCP
Mon May 28 09:56:09 AEST 1984


We have an unformatted textual database currently comprising 140
mbytes of text, which will grow to about 500 mbytes within the
next two years. Inverted indices have been built (a 50% overhead
on top of the 140 mbytes of text), but for some applications (such
as fixed phrases or combinations of common words) it is still
necessary to perform a linear search over the entire corpus.
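For scale: with that 50% index overhead, this is roughly 140 + 70 =
210 mbytes on disk today, and about 500 + 250 = 750 mbytes within
two years.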

a) I am interested in benchmarks to see how fast different machines
can perform linear searches. In particular, I would like to know
how fast the command "egrep xxx /usr/dict/words" (where
/usr/dict/words ~= 200K) runs on a GOULD, PYRAMID, ZILOG, or various
68K-based systems. We have access to a VAX 11/750 and 780, a PDP
11/44, and a PIXEL 100; benchmarks from any other systems would be
greatly appreciated (a sketch of how such a run could be timed
follows below). The PIXEL is quite fast in core, but the disks are
ruinously slow: an otherwise idle PIXEL 100 (with 40 mbyte disks)
spends only about 30% of its time on an egrep; the rest of the time
it is evidently twiddling electrons waiting for more disk blocks.
Does anybody out there have a Sun with the Fujitsu Eagle?
	This dbase has a limited clientele, and the machine would not
need to field more than 4 or so searches at a time, but we could
easily use a more powerful system and would just as soon not
dedicate a machine to this database.
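For concreteness, here is roughly how such a run could be timed (a
minimal sketch only; the pattern "xxx" and the scratch file /tmp/big
are just examples, and /bin/time will do if your shell has no time
builtin):

	# time a pattern that matches little, so output cost is negligible
	time egrep xxx /usr/dict/words > /dev/null

	# optionally build a bigger test file so the disks, not the
	# buffer cache, dominate the measurement
	cat /usr/dict/words /usr/dict/words /usr/dict/words > /tmp/big
	time egrep xxx /tmp/big > /dev/null

Elapsed (real) time on an otherwise idle machine is the figure of
interest.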

b) Does anyone out there know of a good way to handle searching this
much data on a UNIX system? Experiments in distributed processing
that could provide wide access cheaply? This is a read-only dbase, so
we could bypass the UNIX file system and store the data in big blocks
on a raw partition. Has anyone hung special hardware off a UNIX
system to perform this kind of task?
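One crude experiment along these lines: compare a big-block
sequential read of the raw device against the same data read through
the file system (again a sketch only; /dev/rra0c is a made-up
example, so substitute a raw partition that is safe to read on your
system):

	# read ~100 mbytes off the raw device in 64k chunks
	time dd if=/dev/rra0c of=/dev/null bs=64k count=1600

If the raw read runs much faster than egrep over the same amount of
data, the bottleneck is the file system or the scanner, not the
spindles.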

						Gregory Crane
						Harvard University


