Revectoring bad blocks on RA81 disks

Steve Grandi grandi at noao.UUCP
Sat Jun 29 07:51:33 AEST 1985


Rumor hath it that a program is available through DEC field service to
revector bad blocks on UDA disk drives (RA81s in particular).  Details
of the rumor are that the program is a "standalone" program written by
the Ultrix folks called /rabads that can be booted instead of vmunix
and that non-Ultrix sites running 4.2BSD can obtain the program
through their field service reps. 

Has any non-Ultrix site obtained this program?  Is there a part number
or any identifying information that our Friendly Field Service Man can use 
to pry it out of DEC's bureaucracy?

We are already running the Riacs UDA driver on our 750's that will try
to revector blocks that generate hard errors; our problem is blocks
that generate lots and lots of soft errors.  Since soft errors tend to
turn into hard errors and since these are rather important blocks (see
below) and revectoring blocks with hard errors often generates data
which is not guaranteed to be correct (a "forced error" in MSCP-speak),
I would dearly love to revector these marginal blocks now and avoid the 
massive pain that a trashed file system can bring. (Once burned, twice
shy; 5 times in the last year burned, 10**6 times shy!!).  Also, the
system REALLY slows down when the disk driver is printing error
messages on the console.

Obviously, we could probably hack the Riacs driver to give us a
utility to revector disk blocks, but another rumor hath it that the
procedure used in the driver is not REALLY correct (since DEC is
incredibly reluctant to reveal details of the very complicated song
and dance that has to be gone through to accomplish this feat, I'm
not surprised).  Also it would be nice to have a tool that our
Friendly Field Service Rep believed in as opposed to the incredulous
looks I get when I explain the history of our disk driver.

Two details of our problems might be of interest to students of MSCP soft 
error datagrams or of the 4.2BSD file system.  The "drive detected error" we
are getting is code 1A39 (that's the contents of word 27 of the SDI
error variant of the MSCP packet) which indicates a "servo fine
position error" generated when "a write command is attempted while the
positioner is off track (not detented)".  The servo boards and the R/W
boards in the drives showing these errors have all been replaced, so
the HDA is obviously showing marginal behavior at these locations.

The disk blocks showing errors are also interesting.  For several file 
systems on several disks on several 750's, relative block numbers 576 
and 577 are repeatedly showing up with fine-positioning errors (and 
these cases constitute about 75% of our total collection of these errors).  
A morning's study of the output from dumpfs(8) and fs.h indicates that for 
our 8K/2K and 8K/1K file systems, blocks 576-7 contain the csum structure,
which contains a summary of information about all the cylinder groups
(number of directories, number of free blocks, number of free inodes,
number of free frags).  Obviously, since our disks are
figuratively digging holes in the oxide at these blocks, this
structure is used a lot, presumably every time a file is created (and
extended?).  Is this structure a single point of failure?  If block
576 is destroyed, is the file-system totally trashed or just incapable
of creating new files?  (in other words, can I dump(8) the
file-system?).  Can fsck completely regenerate the data in the csum
structure? (I know fsck can correct things; one often sees "SUMMARY
INFORMATION ... BAD" messages on a post-crash reboot).
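
For anyone who wants to check the same arithmetic on their own file systems,
here is a minimal sketch of how a summary-area address in fragments maps onto
512-byte sector numbers. Everything in it is an assumption made for
illustration: the function name csum_sectors is hypothetical, the 512-byte
sector size is assumed, and the inputs (288 and 144 fragments, 60 cylinder
groups) are not copied from real dumpfs output but were picked only because
they reproduce the block numbers quoted above. The record layout (four 32-bit
counters per cylinder group) follows the four fields listed above.

```python
import struct

SECTOR_SIZE = 512  # bytes per disk sector (assumed)

# One summary record per cylinder group: four 32-bit counters in the order
# listed above (directories, free blocks, free inodes, free frags).
# "<4l" means little-endian, four 4-byte signed longs, as on a VAX.
CSUM_FORMAT = "<4l"
CSUM_SIZE = struct.calcsize(CSUM_FORMAT)  # 16 bytes per cylinder group


def csum_sectors(csaddr_frags, frag_size, ncg):
    """Return (first, last) 512-byte sector occupied by the summary records.

    csaddr_frags: start of the summary area, in fragments (hypothetical
                  input; in practice taken from dumpfs output).
    frag_size:    fragment size in bytes (1024 or 2048 for the file
                  systems discussed above).
    ncg:          number of cylinder groups (hypothetical input).
    """
    sectors_per_frag = frag_size // SECTOR_SIZE
    first = csaddr_frags * sectors_per_frag
    used_bytes = ncg * CSUM_SIZE
    used_sectors = -(-used_bytes // SECTOR_SIZE)  # ceiling division
    return first, first + used_sectors - 1


# Illustrative inputs only, chosen to match the sector numbers in the post.
print(csum_sectors(288, 1024, 60))  # 8K/1K file system
print(csum_sectors(144, 2048, 60))  # 8K/2K file system
```

With these illustrative inputs both calls give (576, 577): 60 records of 16
bytes each come to 960 bytes, which fits in two sectors. That would be
consistent with the same two block numbers showing up on both the 8K/1K and
the 8K/2K file systems.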

All in all, I think I might have been better off with Eagles....

Steve Grandi, National Optical Astronomy Observatories, Tucson, AZ, 602-325-9228
{arizona,decvax,hao,ihnp4,seismo}!noao!grandi  noao!grandi at lbl-csam.ARPA