Data Corruption

Sun Jul 15 09:13:42 AEST 1990

smitty at essnj1.ESSNJAY.COM (Hibbard T. Smith JR) writes:

>I'm running ISC's 386/ix 2.0.2 on a UNISYS PW 825A (Intel 25 MHZ cache) with
>an Adaptec 1542A, a Micropolis 320 MB, and a Maxtor XLT-200.  The Maxtor is
>the primary drive (swap, root, and user plus a couple more file systems).  At
>random intervals, data gets written over what should be inodes at the low end
>of the Maxtor (drive 0).  Needless to say, this clobbers lots of files.  The 
>data being written in has belonged to /usr/lib/cron/log and /usr/adm/pacct.
>These files are both open for the entire time the system is active.  The 
>clobbers take place at night or in the wee hours of the morning.  At these
>times the system is mostly idle.  Even very low cron driven activity.

Sounds like your partition and/or file system size is wrong.  If you
create either the partition or the file system with too many blocks
(bad values in /etc/partitions or bad arguments to mkfs), there is
NO error checking.  The system will run at first but will write
"off the end of the disk" at some point.  At a minimum you will see
the symptoms you describe; at worst, a totally corrupted drive.
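
Since nothing checks this for you, a sanity check of your own is cheap.
What follows is only a rough sketch in Python, not anything shipped with
ISC: the function name, the tuple layout and the numbers are all made up
for illustration, and you would have to fill in the real sector counts
from your own drive and partition table.

```python
def find_overruns(disk_sectors, partitions):
    """Return the names of partitions that extend past the end of the disk.

    partitions is an iterable of (name, start_sector, length_in_sectors).
    All names and values here are hypothetical, for illustration only.
    """
    overruns = []
    for name, start, length in partitions:
        # The partition occupies sectors start .. start + length - 1.
        if start + length > disk_sectors:
            overruns.append(name)
    return overruns


# Made-up layout: a 1000-sector disk where "usr" is 50 sectors too long.
layout = [("root", 0, 400), ("swap", 400, 200), ("usr", 600, 450)]
print(find_overruns(1000, layout))  # prints ['usr']
```

Any partition this reports is one that will eventually write "off the
end of the disk".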

ISC 2.0.2 had a fairly well-known bug in its addharddisk routine that
would sometimes calculate the arguments to mkpart and mkfs incorrectly.
I had the problem at this site.  I resorted to calculating the arguments
directly and making the file systems manually.  This is almost a 
spiritual experience :-)  If you are in doubt about some of these
numbers, it is better to undersize them and sacrifice a cylinder or
two.
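
The arithmetic itself is simple enough.  As a hedged sketch (the
function name and the geometry are invented, and this is not what
addharddisk does internally), the size of a partition with a spare
cylinder held back can be worked out like this.  Check the mkfs manual
page for the unit it expects before passing it any such number.

```python
def safe_sector_count(first_cyl, last_cyl, heads, sectors_per_track,
                      spare_cyls=1):
    """Sectors in cylinders first_cyl..last_cyl inclusive, less a margin.

    spare_cyls cylinders are deliberately left unused so that a small
    miscalculation cannot run past the end of the partition.
    """
    cylinders = last_cyl - first_cyl + 1 - spare_cyls
    if cylinders <= 0:
        raise ValueError("partition too small for the requested margin")
    return cylinders * heads * sectors_per_track


# Made-up geometry: cylinders 100-199, 15 heads, 32 sectors per track.
# One cylinder is sacrificed, so 99 * 15 * 32 sectors remain.
print(safe_sector_count(100, 199, 15, 32))  # prints 47520
```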

You will also find that the ISC surface analysis utility (mkpart -[vV])
fails miserably with the Adaptec controller.  It never finds a single
bad block.  My Newbury/Maxtor 4380 (320 MB formatted) had a number
of bad blocks.  The only way I found to handle these was to set up a 
job that repeatedly copied the contents of /dev/dsk/xxx to /dev/null
and redirected the contents of /dev/osm (operating system messages
must be installed in the kernel) to a printer.  Each bad block
that was found was printed as a kernel error message to the printer.
Then mkpart -A can be used to mark that block as bad.  Before you
do this, get a copy of Neese's SCSI utilities and set the number of retries
on the drive to 1.  My drive has a lot of soft errors that were missed when
the default retry count of 5 was used.  Also, while you're running
the SCSI utilities, ensure that error correction is turned on.  On both
my Newbury and on a Seagate ST291(?) it was turned off.  The drive is
VASTLY more reliable now.
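
For illustration only, here is a Python sketch of a related read scan.
It is a different technique from the one I used: it records the blocks
where read() itself fails instead of watching kernel messages on
/dev/osm, and whether a given driver actually returns an error to the
reader for a bad block is an assumption you would have to verify.  The
function names and the 512-byte block size are hypothetical.

```python
import os


def scan_blocks(read_block, total_blocks):
    """Call read_block(n) for each block; return the numbers that failed."""
    bad = []
    for n in range(total_blocks):
        try:
            read_block(n)
        except OSError:
            bad.append(n)
    return bad


def device_reader(path, block_size=512):
    """Return (read_block, fd) for reading fixed-size blocks from path."""
    fd = os.open(path, os.O_RDONLY)

    def read_block(n):
        os.lseek(fd, n * block_size, os.SEEK_SET)
        data = os.read(fd, block_size)
        if len(data) != block_size:
            raise OSError("short read at block %d" % n)
        return data

    return read_block, fd


# Demonstration with a simulated drive on which blocks 3 and 7 fail.
def flaky(n):
    if n in (3, 7):
        raise OSError("simulated read error")
    return b"\0" * 512


print(scan_blocks(flaky, 10))  # prints [3, 7]
```

Against a real drive you would pass the read_block function returned by
device_reader() and close the file descriptor with os.close() when done.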

After you find your bad blocks, copy them to a file and store them on
a mountable floppy.  Also print out your disk's VTOC using the
mkpart -vpa command.  I can assure you from ghastly experience that
they will be invaluable sometime in the future.  You can theoretically 
put the bad blocks in /etc/partitions but I've not had real good luck
with that.  One other word of caution:  if your /etc/partitions
does not equal the actual VTOC on the drive (such as after you add a drive
and then restore an earlier /etc/partitions from tape), mkpart will
try to make the drive match the /etc/partitions file.  It will do this
even on a read-only operation such as -vpa and it will do it without
asking.  The result is a trashed disk.  More words of experience.

Hope that helps.  Check out comp.sys.i386.  A lot of these topics
get discussed there.

73 
John

-- 
John De Armond, WD4OQC  | We can no more blame our loss of freedom on congress
Radiation Systems, Inc. | than we can prostitution on pimps.  Both simply
Atlanta, Ga             | provide broker services for their customers.
{emory,uunet}!rsiatl!jgd|  - Dr. W Williams |                **I am the NRA**