Unix text files
Doug Gwyn <gwyn>
gwyn at brl-tgr.ARPA
Tue Nov 5 15:29:30 AEST 1985
> Does anyone out there want to show those of us with weak knees how one
> would use this kind of data structure [used loosely] in a program?
> (In other words, as if the data were within the program not without.)
> Without additional support information, like keeping track of the number
> and lengths of lines.
Most data processing algorithms are (or should be) driven by the
structure of the data that they process; this is normally taught
these days in the "data structures" CS course. It should be obvious
from the grammar how to structure code that e.g. gets a line
of text, processes it, and writes out the resulting line. (There
is no need to bring in line numbering or "length of line".) If
there is no (or only a fuzzy) definition of "line of text",
then it is not obvious how to get/put one, and some random
choice is made by the programmer. (Which is what started this
discussion.)
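To make the get/process/put shape concrete, here is a minimal sketch in C, assuming a grammar of the form "a text file is zero or more lines, and each line is zero or more characters followed by a newline". The names filter_lines, process_line and MAXLINE are hypothetical, chosen only for illustration, and the upper-casing step is a stand-in for whatever real processing is wanted.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define MAXLINE 512   /* assumption: room for one line, its newline, and a NUL */

/* Hypothetical per-line step: upper-case the text in place. */
static void process_line(char *line)
{
    for (; *line != '\0'; ++line)
        *line = (char)toupper((unsigned char)*line);
}

/*
 * The loop mirrors the grammar: a file is a sequence of lines, so the
 * code is "while there is another line: get it, process it, put it".
 * No line numbers or stored lengths are needed.
 * Returns the number of fgets() reads, which equals the number of lines
 * as long as no line is longer than the buffer.
 */
static long filter_lines(FILE *in, FILE *out)
{
    char line[MAXLINE];
    long count = 0;

    assert(in != NULL && out != NULL);
    while (fgets(line, sizeof line, in) != NULL) {  /* get a line   */
        process_line(line);                         /* process it   */
        fputs(line, out);                           /* put the line */
        ++count;
    }
    return count;
}
```

A filter built this way would typically just call filter_lines(stdin, stdout) from main.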
For simplicity, I left out of the grammar one important constraint,
which is a limit of no more than 510 characters in a line of text
(exclusive of newline). I had already stretched the notation a bit
and didn't want to invent yet another notation like { char }*510 .
This limit is actually important in allowing efficient get-line
implementations.
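One plausible reading of why the limit helps: with at most 510 characters per line, a fixed 512-byte buffer always holds a whole line plus its newline and a terminating NUL, so a get-line routine needs no dynamic allocation or re-reading. The sketch below assumes that reading; get_line, TEXT_LINE_MAX and LINE_BUF_SIZE are hypothetical names, and the return codes are my own convention.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TEXT_LINE_MAX 510                  /* limit stated above, excluding newline */
#define LINE_BUF_SIZE (TEXT_LINE_MAX + 2)  /* + newline + terminating NUL = 512 */

/*
 * Read one line into buf and strip its newline.
 * Returns the line length (0..510) on success,
 *         -1 at end of file,
 *         -2 if the input violates the format (line too long, or the
 *            last line is missing its newline).
 * Assumes the text contains no embedded NUL characters.
 */
static int get_line(FILE *in, char buf[LINE_BUF_SIZE])
{
    size_t len;

    assert(in != NULL && buf != NULL);
    if (fgets(buf, LINE_BUF_SIZE, in) == NULL)
        return -1;
    len = strlen(buf);
    if (len == 0 || buf[len - 1] != '\n')
        return -2;                 /* no newline seen within the limit */
    buf[len - 1] = '\0';           /* drop the newline */
    return (int)(len - 1);
}
```

Because the buffer size is a compile-time constant, the caller can keep it on the stack and reuse it for every line.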
> I think it would be a good example to the young of inherent complexity.
There is nothing complex about that grammar. It is a remarkably
simple one, which was the point. Note that it was decomposed
into meaningful subunits -- this is important! Just having a
formal grammar (syntax) is not sufficient for good semantic
processing. (People often forget this.)
> And I thought we were trying to make life simple! The main problem here
> is that we are trying to impose structure on unstructured data, which
> is probably not the best approach.
Text files certainly are structured, although it's a rather
flexible structure. One might argue that dividing text into
lines is artificial, but the concept of a "line of text" is
useful in many text-processing programs (e.g., "grep").
> Sentinels are a wonderful way of implementing lists, but a terrible way
> of implementing strings. Hint, hint.
Oh, foo. Both the count+data and NUL-terminated representations
for character strings have good and bad points. I've used both
and prefer C's approach for most routine programming.
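For readers who have not used both, here is a rough sketch of the trade-off in C. The struct counted_str type and its helper functions are hypothetical, written only for this comparison; they are not taken from any particular system.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical count+data representation: the length travels with the data. */
struct counted_str {
    size_t len;
    const char *data;   /* need not be NUL-terminated; may contain NULs */
};

/* Wrap an ordinary C string; the O(n) scan for the NUL happens once here. */
static struct counted_str counted_from_cstr(const char *s)
{
    struct counted_str cs;

    assert(s != NULL);
    cs.len = strlen(s);
    cs.data = s;
    return cs;
}

/* Comparison can reject unequal lengths without looking at the data. */
static int counted_equal(struct counted_str a, struct counted_str b)
{
    return a.len == b.len && memcmp(a.data, b.data, a.len) == 0;
}
```

The counted form gives constant-time length, cheap substrings, and embedded NULs, at the cost of carrying an extra field everywhere. The NUL-terminated form is a single pointer that every C library routine already accepts, which is a large part of its convenience for routine work.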
If the point of the correspondent was that FILES-11 variable-
length record format is easier to work with, he deserves a
large horse laugh. See "Software Tools" for examples of the
use of UNIX-like text file formats in programs.