Holey files, Batman!

Donn Seeley donn at utah-cs.UUCP
Fri Mar 18 19:33:58 AEST 1988


Mark Rosenthal (mbr at aoa.UUCP) wants to know how to copy files with
holes in them.  He makes one suggestion that he doesn't think is
'entirely satisfactory':

	Read through the bytes sequentially, but don't output any whose
	value is 0.  Use lseek() before writing the next non-zero
	byte.  To guarantee that the created file has the same extent,
	always write the last byte.  Disadvantage: When copying a file
	with huge holes and little data, this would be very slow.

There are faster ways to do this if you don't need to be general about
making holes, for example if you only deal with sequential copying of
files.  We produced a modified 'cp' here at Utah that detects aligned
blocks of zeroes in input and performs seeks instead of writes on
output.  The detection of blocks of zeroes can be very cheap -- we find
that it's faster to copy a file with zeroes in it using our modified
'cp' than to copy it using the standard 'cp'.  Of course it helps that
most files have very predictable patterns of 'zero' use: null bytes
typically appear either in large blocks (produced by unexec(), for
example) or among other data that has roughly random bits (such as VAX
instructions).  The effect of this is that testing a block of zeroes is
inexpensive since the cost is made up by the subsequent seek, while
testing a block of random data that contains nulls is inexpensive
because the test almost always fails very quickly.  It's not hard to
imagine a pathological file that defeats this algorithm (8191 nulls
followed by one non-null byte, repeated many times, will take care of
our 4.3 VAXen with 8k filesystems) but we don't see such files in
practice.

As for Mark Rosenthal's list of utilities, I believe that only
restor/restore will replace holes in files (because dump actually puts
block number information on a tape).

Donn Seeley    University of Utah CS Dept    donn at cs.utah.edu
40 46' 6"N 111 50' 34"W    (801) 581-5668    utah-cs!donn

PS -- Here's the code...  To build a cp that creates holes, compile
cp.c with -Dwrite=zwrite and -Dclose=zclose and load with an object
made from the following file.  This code is specifically written for
4.2 or 4.3 BSD on the VAX, but the algorithm should be easy to port.

------------------------------------------------------------------------
#ifndef lint
static char RCSid[] = "$Header: zwrite.c,v 1.4 85/07/01 03:53:18 lepreau Exp $";
#endif
/*
 * zwrite(): like write(2), but creates holes in files if possible.
 * If a write() fails zwrite returns -1 instead of the number of
 * bytes written-- what do you think about this?
 *
 * The current code is only optimal if zwrite's are done in lengths
 * evenly dividing the filesystem blksize-- must be fixed.
 * 
 * zclose() should be called instead of close() if fd's are to be re-used
 * or if you might have zeroes at the end of the file.  And you should
 * check that zclose returns 0, also!
 *
 * Current `vax' code assumes st_blksize (filesys blksize) is <= 64K;
 * non-vax code not yet tested.
 *
 * Donn Seeley and Jay Lepreau, Univ of Utah, April 1985.
 */

#include <sys/types.h>
#include <sys/stat.h>
#include <sys/file.h>

#define min(a,b)	((a) < (b) ? (a) : (b))

struct dtab {
	int blksize;
	off_t offset;
	char flags;
};
#define DF_LASTSEEK	0x2

static struct dtab (*bp)[];
static int dtabsize;

int
zwrite(fd, buf, len)
	int fd;
	register char *buf;		/* known to be r11 below */
	int len;
{
	register int count;		/* known to be r10 below */
	register int notzero = 0;	/* known to be r9 below */
	int savelen = len;
	int obsize;
	struct stat statb;

	if (dtabsize == 0) {
		register int i;
		extern char *malloc();

		dtabsize = getdtablesize();
		bp = (struct dtab (*)[]) malloc((unsigned) (dtabsize * sizeof(struct dtab)));
		if (bp == 0)
			return write(fd, buf, len);
		for (i = 0; i < dtabsize; i++)
			(*bp)[i].blksize = -1;
	}
	/* get output blocksize if we don't have it already */
	if ((*bp)[fd].blksize < 0) {
		if (fstat(fd, &statb) < 0 ||
		   (statb.st_mode & S_IFMT) == S_IFSOCK ||
		   (statb.st_mode & S_IFMT) == 0)	/* avoid kernel bug */
			(*bp)[fd].blksize = 0;	/* remember we are screwed */
		else
			(*bp)[fd].blksize = statb.st_blksize;
	}
	if ((obsize = (*bp)[fd].blksize) == 0)
		return write(fd, buf, len);

	while (len > 0) {
		count = min(len, obsize);

		/* Are there any non-zero chars in this block? */
#ifdef vax
		;asm("	skpc	$0,r10,(r11)")
		;asm("	beql	1f")		/* all zeroes */
		;asm("	incl	r9")		/* notzero++ */
		;asm("1:")
#else
		{
		register char *p;
		char *endbuf;
		
		p = buf;
		endbuf = buf + count;
		while (p < endbuf)
			if (*p++) {
				notzero++;
				break;
			}
		}
#endif
		if (notzero) {
			(*bp)[fd].flags &= ~DF_LASTSEEK;
			if (write(fd, buf, count) != count)
				return -1;
		}
		else {
			(*bp)[fd].flags |= DF_LASTSEEK;
			if (lseek(fd, (off_t) count, L_INCR) < 0)
				return -1;
		}
		len -= count;
		buf += count;
	}
	return savelen;
}

zclose(fd)
{
	int rc = 0;

	if (bp) {
		/* Can't just seek at end and expect to get anything! */
		if ((*bp)[fd].flags & DF_LASTSEEK) {
			if (lseek(fd, (off_t) -1, L_INCR) < 0) 
				rc = -1;
			else if (write(fd, "", 1) != 1)
				rc = -1;
		}
		(*bp)[fd].blksize = -1;
		(*bp)[fd].flags = 0;
	}
	if (close(fd) < 0)
		return -1;
	else
		return rc;
}
------------------------------------------------------------------------

Apologies for the asm()s -- they were fun at the time.  Didn't you ever
wonder what some of those CISC instructions were for?

Some typical timings on our 11/785:

% ls -ls /usr/site/src/psl/bin/bare-psl
 412 -rwxrwxr-x  1 psl      psl       1651920 Oct 19 08:17 /usr/site/src/psl/bin/bare-psl*
% time cp /usr/site/src/psl/bin/bare-psl /tmp
0.0u 4.0s 0:07 55% (11t+25ds+37avg+19max)k 106i+210o (0maj+7min)pf 0swaps
% rm /tmp/bare-psl
% time zcp /usr/site/src/psl/bin/bare-psl /tmp
0.2u 2.4s 0:04 65% (15t+27ds+43avg+23max)k 104i+60o (0maj+9min)pf 0swaps



More information about the Comp.unix.wizards mailing list