Need efficient way to read file in reverse

Jim Hutchison hutch at fps.com
Fri Sep 21 07:10:34 AEST 1990


In <1990Sep18.152818.1303 at phri.nyu.edu> roy at alanine.phri.nyu.edu (Roy Smith) writes:
>	Is there a standard, portable, efficient way to read a file in
>reverse?

No, yes, and pretty much.  There isn't much call for this, at least not
visibly, so there isn't enough mass behind it for a standard.  You can use
constants defined for your system, which keeps it relatively portable, and
you can get efficiency along with that portability.

>          I'm doing random seeks in a file and occasionally want to be able
>to find the beginning of the line into the middle of which I just seeked.
>Right now, I'm using a simple-minded revgetc() I wrote which is basically:
>
>	fseek (fp, -2L, SEEK_CUR);
>	getc (fp);

This will spend a lot of time calling fseek and read.  By "a lot" I mean on
the order of one seek and one read per character of the line you back over.
Remember that stdio tries to read ahead, so each time you back up it has to
throw away its buffer and start over.  It could be smarter about that, but
depending on it would be very non-portable.

>but which turns out to lose big, since it looks like it does a read(2) for
>every call to getc().  Life is made worse by the fact that occasionally I
>have very long (over 1kbyte) lines.  My application is currently spending
>about 98% of its CPU time in revgetc(), according to gprof.
>
>	Obviously, the thing to do is write a buffered revgetc().  Modifying
>the standard getc() macro to traverse the buffer in the opposite direction
>would be easy, but the result would be non-portable.  To make it portable, I
>want to write it using the stdio functions, and that's the hitch.  The
>warning in the fseek/ftell man page about stdio read pointers possibly being
>magic cookies makes it hard for me to calculate how far back to seek to read
>in the preceding, say, 1024 bytes.  What to do?

The ideal block size for reading your file can be gotten by doing a stat() or
fstat() on it and using the st_blksize field of the result (see stat(2)).  If
you setbuffer() the stream with a buffer of that size, you have a quasi-optimal
block size.  Back up to the previous block boundary, then getc() forward to the
current offset, remembering the last place you saw the start of a line; at
worst you may have to repeat this for several blocks.  This minimizes the
disk I/O and system calls, which are expensive.
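
For instance, here is a minimal sketch of the buffer setup.  It uses the ANSI
setvbuf() rather than the BSD setbuffer() mentioned above, and the function
name open_blocked() and the omitted error handling are mine, not anything
from the original:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Open 'path' and give the stream a buffer of the filesystem's preferred
 * block size; return the stream and store the block size in *blksize. */
FILE *open_blocked(const char *path, long *blksize)
{
    struct stat st;
    char *buf;
    FILE *fp = fopen(path, "r");

    if (fp == NULL || fstat(fileno(fp), &st) < 0)
	return NULL;
    *blksize = (long) st.st_blksize;	/* preferred I/O size for this file */
    buf = malloc((size_t) *blksize);
    if (buf != NULL)			/* must be set before the first read */
	setvbuf(fp, buf, _IOFBF, (size_t) *blksize);
    return fp;
}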

long end, base, trim, start_of_line;	/* fp, offset and blksize come from context */
int c;

start_of_line = 0;			/* if no newline turns up, the line starts at 0 */
end = offset;				/* scan stops just short of the original offset */
do {
    trim = end % blksize;		/* how far end is past a block boundary */
    if (trim)
	base = end - trim;		/* back up to that boundary */
    else
	base = end - blksize;		/* already on a boundary: take the previous block */

    fseek(fp, base, SEEK_SET);		/* one seek per block */
    for (offset = base; offset != end; offset++) {
	c = getc(fp);			/* stdio refills its buffer once per block */
	if (c == '\n')			/* a newline means the next byte starts a line */
	    start_of_line = offset + 1;	/* remember the most recent start */
    }
    end = base;				/* next pass scans the block before this one */
} while (start_of_line == 0 && base > 0);	/* keep going until a start or BOF */

You'd get better performance if you read() the file yourself and used your own
buffer instead of going through getc().  That way you could proceed in your
direction of choice, backwards.  I used getc() here because you seem to want
to stay with stdio.
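
As a rough sketch of that read()-based variant (the function name line_start(),
its calling convention, and the assumption that the caller supplies a buffer of
at least blksize bytes are all mine):

#include <unistd.h>
#include <sys/types.h>

/* Return the offset of the start of the line containing 'pos' on descriptor
 * 'fd', reading one block at a time and scanning each block backwards. */
long line_start(int fd, long pos, long blksize, char *buf)
{
    long base, n, i;

    while (pos > 0) {
	base = (pos % blksize) ? pos - pos % blksize : pos - blksize;
	n = pos - base;				/* bytes to examine in this block */
	if (lseek(fd, (off_t) base, SEEK_SET) == (off_t) -1)
	    break;
	if (read(fd, buf, (size_t) n) != n)	/* one read per block */
	    break;
	for (i = n - 1; i >= 0; i--)		/* scan backwards for a newline */
	    if (buf[i] == '\n')
		return base + i + 1;		/* the line starts just after it */
	pos = base;				/* move on to the previous block */
    }
    return 0;					/* no newline found: start of file */
}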

You can also back down from "blksize" if it gets to be too gross in terms of
your average line length (I'd guess more than 4x the average line is probably
too hefty).
--
Jim Hutchison		{dcdwest,ucbvax}!ucsd!fps!hutch
Disclaimer:  I am not an official spokesman for FPS computing


