Unix text files

Doug Gwyn <gwyn> gwyn at brl-tgr.ARPA
Tue Nov 5 15:29:30 AEST 1985


> Does anyone out there want to show those of us with weak knees how one
> would use this kind of data structure [used loosely] in a program?
> (In other words, as if the data were within the program not without.)
> Without additional support information, like keeping track of the number
> and lengths of lines.

Most data processing algorithms are (or should be) driven by the
structure of the data that they process; this is normally taught
these days in the "data structures" CS course.  It should be obvious
from the grammar how to structure code that e.g. gets a line
of text, processes it, and writes out the resulting line.  (There
is no need to bring in line numbering or "length of line".)  If
there is no (or only a fuzzy) definition of "line of text",
then it is not obvious how to get/put one, and some random
choice is made by the programmer.  (Which is what started this
discussion.)
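
As a rough illustration of code driven by the data structure, here is
a minimal C sketch. It assumes the grammar in question amounts to
"a text file is zero or more lines; a line is zero or more characters
followed by a newline". The names filter_lines and process_char, and
the upper-casing step, are invented for the example and do not come
from the original discussion.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Assumed grammar:  text-file = { line }     line = { char } newline  */

/* Hypothetical per-character processing step: fold to upper case. */
static int process_char(int c)
{
    return toupper(c);
}

/* Copy in to out one line at a time, processing each character.
 * Returns the number of lines copied, or -1 if the input ends in the
 * middle of a line (i.e. it does not match the assumed grammar). */
long filter_lines(FILE *in, FILE *out)
{
    long lines = 0;
    int c;

    assert(in != NULL && out != NULL);
    while ((c = getc(in)) != EOF) {     /* text-file = { line } */
        while (c != '\n') {             /* line = { char } ... */
            putc(process_char(c), out);
            if ((c = getc(in)) == EOF)
                return -1;              /* missing final newline */
        }
        putc('\n', out);                /* ... newline */
        ++lines;
    }
    return lines;
}
```

A filter's main program would simply call filter_lines(stdin, stdout).
Each loop corresponds to one repetition in the grammar, and no line
numbers or line lengths are tracked anywhere.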

For simplicity, I left out of the grammar one important constraint,
which is a limit of no more than 510 characters in a line of text
(exclusive of newline).  I had already stretched the notation a bit
and didn't want to invent yet another notation like { char }*510 .
This limit is actually important in allowing efficient get-line
implementations.
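
One way the limit can help, sketched in C under the assumption that
the caller supplies a fixed 512-byte buffer (510 characters, the
newline, and a terminating NUL): a single fgets() call fetches a whole
line with no dynamic allocation and no retry loop. The names get_line,
MAX_LINE and LINE_BUF are made up for this sketch, and the error
codes are an arbitrary choice.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_LINE 510                /* limit from the text, excluding newline */
#define LINE_BUF (MAX_LINE + 2)     /* room for the newline and a NUL */

/* Read one line into buf and strip its newline.
 * Returns the line length, -1 at end of file, or -2 if the input is
 * malformed (line longer than MAX_LINE, or no terminating newline).
 * Input containing NUL bytes is not handled by this sketch. */
int get_line(FILE *in, char buf[LINE_BUF])
{
    size_t len;

    assert(in != NULL && buf != NULL);
    if (fgets(buf, LINE_BUF, in) == NULL)
        return -1;
    len = strlen(buf);
    if (len == 0 || buf[len - 1] != '\n')
        return -2;
    buf[len - 1] = '\0';
    return (int)(len - 1);
}
```

Because the buffer size is known at compile time, the caller can
declare "char buf[LINE_BUF];" on the stack and loop while
get_line(in, buf) >= 0.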

> I think it would be a good example to the young of inherent complexity.

There is nothing complex about that grammar.  It is a remarkably
simple one, which was the point.  Note that it was decomposed
into meaningful subunits -- this is important!  Just having a
formal grammar (syntax) is not sufficient for good semantic
processing.  (People often forget this.)

> And I thought we were trying to make life simple!  The main problem here
> is that we are trying to impose structure on unstructured data, which
> is probably not the best approach.

Text files certainly are structured, although it's a rather
flexible structure.  One might argue that dividing text into
lines is artificial, but the concept of a "line of text" is
useful in many text-processing programs (e.g., "grep").

> Sentinels are a wonderful way of implementing lists, but a terrible way
> of implementing strings.  Hint, hint.

Oh, foo.  Both the count+data and NUL-terminated representations
for character strings have good and bad points.  I've used both
and prefer C's approach for most routine programming.
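
To make the trade-off concrete, here is a small C sketch of the two
representations. The struct counted_str and all function names are
hypothetical, chosen only for this comparison.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical count+data string: length kept beside the bytes. */
struct counted_str {
    size_t len;
    const char *data;
};

/* Length is a field lookup for count+data ... */
size_t counted_length(struct counted_str s)
{
    return s.len;
}

/* ... but a scan for the sentinel in the NUL-terminated form. */
size_t sentinel_length(const char *s)
{
    return strlen(s);
}

/* Dropping the first n characters: with a sentinel, a bare pointer
 * into the same storage is already a valid string. */
const char *sentinel_skip(const char *s, size_t n)
{
    assert(n <= strlen(s));
    return s + n;
}

/* With count+data, both fields must be adjusted together. */
struct counted_str counted_skip(struct counted_str s, size_t n)
{
    assert(n <= s.len);
    s.data += n;
    s.len -= n;
    return s;
}
```

The counted form can hold any byte, including NUL, and knows its
length at once; the sentinel form needs no bookkeeping and can be
passed around as a single pointer.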

If the correspondent's point was that the FILES-11 variable-length
record format is easier to work with, he deserves a large horse
laugh.  See "Software Tools" for examples of the use of UNIX-like
text file formats in programs.


