Summary: Hard disk errors on a 3b1

Sat Feb 11 07:41:17 AEST 1989

Hello netland:

I got two responses to my posting about intermittent HD errors, with crashes,
on my 3b1 with a Seagate 4096.  This is the second such drive; the first one,
still under warranty, overflowed the bad block table.  Days can pass without
any problem, and then suddenly I'll get half a dozen errors in a row.

The actual error (with crash) is always as follows:

	Drive 0, cmd 0

#HDERR ST:51 ... (repeated usually 3 times)

panic: Hard disk timeout
Please record panic message.
Press hardware reset to reboot.

where `#HDERR ST:51 ...' has been at different times:

#HDERR ST:51 EF:10 CL:4257 CH:4203 SN:4208 SC:4204 SDH:4224 DMACNT:FFFF DCRREG:94 MCRREG:8900
#HDERR ST:51 EF:10 CL:4280 CH:4202 SN:420E SC:4202 SDH:4222 DMACNT:FFFF DCRREG:92 MCRREG:8B00
#HDERR ST:51 EF:10 CL:4257 CH:4203 SN:420A SC:4202 SDH:4226 DMACNT:FFFF DCRREG:96 MCRREG:8100
#HDERR ST:51 EF:10 CL:4257 CH:4203 SN:4200 SC:4204 SDH:4225 DMACNT:FFFF DCRREG:B5 MCRREG:8B00
#HDERR ST:51 EF:10 CL:4283 CH:4203 SN:4204 SC:4202 SDH:4224 DMACNT:FFFF DCRREG:94 MCRREG:8100
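
For anyone who wants to compare these lines field by field, here is a minimal
Python sketch that splits each #HDERR line into its NAME:VALUE pairs and reports
which fields change from one error to the next.  It assumes the values are
hexadecimal (they include digits such as B, E and F); the function names
parse_hderr and varying_fields are made up for this illustration, and the sketch
makes no claim about what each register means.

```python
import re

# One NAME:HEXVALUE pair, e.g. "ST:51" or "MCRREG:8900".
FIELD_RE = re.compile(r"([A-Z]+):([0-9A-Fa-f]+)")

# The five error lines quoted above.
SAMPLE = [
    "#HDERR ST:51 EF:10 CL:4257 CH:4203 SN:4208 SC:4204 SDH:4224 DMACNT:FFFF DCRREG:94 MCRREG:8900",
    "#HDERR ST:51 EF:10 CL:4280 CH:4202 SN:420E SC:4202 SDH:4222 DMACNT:FFFF DCRREG:92 MCRREG:8B00",
    "#HDERR ST:51 EF:10 CL:4257 CH:4203 SN:420A SC:4202 SDH:4226 DMACNT:FFFF DCRREG:96 MCRREG:8100",
    "#HDERR ST:51 EF:10 CL:4257 CH:4203 SN:4200 SC:4204 SDH:4225 DMACNT:FFFF DCRREG:B5 MCRREG:8B00",
    "#HDERR ST:51 EF:10 CL:4283 CH:4203 SN:4204 SC:4202 SDH:4224 DMACNT:FFFF DCRREG:94 MCRREG:8100",
]


def parse_hderr(line):
    """Return {field name: integer value} for one #HDERR line, else None."""
    if not line.startswith("#HDERR"):
        return None
    # Values are assumed to be hexadecimal.
    return {name: int(value, 16) for name, value in FIELD_RE.findall(line)}


def varying_fields(records):
    """Return the field names whose value differs between records."""
    return [name for name in records[0]
            if len({rec[name] for rec in records}) > 1]


if __name__ == "__main__":
    records = [parse_hderr(line) for line in SAMPLE]
    print("fields that vary:", varying_fields(records))
```

Run on the five lines above, ST, EF and DMACNT come out identical every time,
while CL, CH, SN, SC, SDH, DCRREG and MCRREG differ.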

From my original posting:

> Is it possible that the disk really does not have a bad spot but that a
> combination of factors triggers a software bug in the kernel or driver?

Christopher J. Calabrese from AT&T Bell Laboratories, Murray Hill, NJ said:

> I've never run across such problems with the disk drivers;
> however, it could be a bad disk controller chip, or a bad ribbon cable.
> I've seen that around here before.

Many, many thanks go also to Brant Cheikes, who sent a long and detailed
account of exactly the same problem I'm having.  On his advice, I called
Ben Wollberg (415)678-1353 (8a-5p PST), who fixed his machine.  Ben told
me that the first thing he would do would be to back up the whole disk and
reformat it.  He said that this might make the problem disappear.  If it
didn't, I would probably have to have the disk repaired.  Apparently the
test they run to find and map the bad sectors takes 7 hours.  I wonder if
I can do something similar with the test disk.  Can anybody out there tell
me what I need to tell the test program to do an exhaustive format-write-read
check that would detect all intermittent errors? In any case, a summary of
Brant's response follows:

> The problems all showed up as HDERR's logged to /usr/adm/unix.log.
> The errors would come in groups of three or four, and would always be
> accompanied by a mechanical whine from the disk.  I believe that noise
> indicates that the drive is "recalibrating," retracting the heads and
> resetting itself in some way.  The errors were highly intermittent; I
> could go several days without an error, then suddenly get several in
> one day.  Weather did not seem to be a factor, nor did temperature.  I
> checked the power output from my power supply, and found no variation
> even while the drive was running the random seek diagnostic test.
> 
> Occasionally, the errors would cause recoverable disk errors.  Things
> like missing blocks in the free list, things that fsck could fix.  No
> data was ever lost, to my knowledge, but it really sucked having to
> fsck the disk every few days.
> 
> Then, the machine started crashing.
> The accompanying whine in these cases lasted several seconds, and the
> system was hung while it was going on.  Then boom, the panic and a
> reset was necessary.
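
To check whether the errors in your own log really arrive in groups of three or
four, a small Python sketch like the following could count runs of consecutive
#HDERR lines.  It assumes only that each error is logged as a line beginning
with "#HDERR" and that any other non-blank line ends a group; the exact layout
of /usr/adm/unix.log is not shown in this thread, and the function name
hderr_bursts is made up for this illustration.

```python
def hderr_bursts(lines):
    """Return the sizes of runs of consecutive #HDERR lines.

    Blank lines are ignored; any other non-HDERR line ends the current run.
    """
    bursts = []
    run = 0
    for line in lines:
        text = line.strip()
        if text.startswith("#HDERR"):
            run += 1
        elif text:
            if run:
                bursts.append(run)
            run = 0
    if run:  # a run that reaches the end of the input
        bursts.append(run)
    return bursts


if __name__ == "__main__":
    import sys
    # Usage (assumed): python bursts.py < copy-of-unix.log
    print(hderr_bursts(sys.stdin))
```

A list made up mostly of 3s and 4s would match Brant's description.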

Well, I hope this helps someone out there...

Augustine F. Cano	<canoaf at dept.csci.unt.edu>