Reliability of System V 1K file system

Tue Sep 25 04:00:06 AEST 1990

In article <1990Sep23.184158.841 at hq.demos.su>
avg at hq.demos.su (Vadim G. Antonov) writes:
>Practically all machines provide power fail interrupts -

I know of a number that do not; but in any case:

>and I don't know why Unix device drivers have no "xxpwfail" entries.

Power failure interrupts are not any good unless they are guaranteed
to occur sufficiently early, and usually they are not.  The power supply
system on the main CPU (the thing that has a `power fail' interrupt)
is quite often completely independent of the power supply for the disk
drives.  If the electronics on the drive are in an indeterminate state,
nothing done at the CPU will guarantee anything.

>Anyway I'm quite sure *any* device can correctly handle power fails -
>if you handle device properly :-).

Power fail handling is like lightning protection: you can only do so much;
if Nature is out to get you, you are doomed.  (Lightning strikes have been
known to dance teasingly around all the grounding posts, giggle in circles
round and round as your hair stands on end, then viciously zap straight
into the heart of your computer.  Well, maybe not quite. :-) )

Designing systems that act properly on power failure is, however, tricky:

>It seems to me the best way to protect disks from accidental
>damaging by power fails is to start recalibrating or moving
>heads to landing zone - usually quite simple logic circuitry
>protects from writing while heads move.

Let me tell you about ... Century Data Systems T-300s.

(To be fair, I am not sure where the problem was located.  The T-300s were
merely the end of the chain.)

We have a couple of Xerox file servers with big washtub drives.  These
drives have a power fail system that retracts the heads (quite reasonably)
so that they will not land on the disk when it stops spinning.  Apparently
it turns off the write current at the same time, because a simple power
failure does not damage anything.

Unfortunately, there are not-so-simple power failures.  Thunderstorms
(those things that Californians never see :-) ) often cause momentary
power failures---anywhere from a fraction of a second to several
seconds.  (Power distribution systems have things called `lightning
arrestors' that temporarily open the circuit to prevent serious
overvoltages.  There are two major variants, air and oil.  Lightning
will jump a simple air gap so the air versions blow compressed gas
across the gap.  I know nothing about the oil versions, other than that
they explode very prettily, like oil-filled transformers. :-) )

Anyway, as it happens, under certain conditions the T-300s would detect
a power failure and begin retracting the heads.  Then the power would
come back on, the electronics would think, `oh, everything is OK', and
the write current would turn on---while the heads were still spiraling
down the pack.
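
The sequence above can be sketched as a tiny state model.  This is an
illustration only: how the drive logic was really built is not known
(nor, as noted, where in the chain the fault lay), so the class and
every method name here are hypothetical.  The `naive' gate looks only
at the power-good signal; the `interlocked' gate also refuses to write
until the retract has finished.

```python
from dataclasses import dataclass


@dataclass
class Drive:
    """Hypothetical model of a drive's power-fail logic (not the real design)."""
    power_ok: bool = True
    retracting: bool = False  # True while the heads are still moving off the pack

    def power_fail(self) -> None:
        # Power-fail detected: start pulling the heads back.
        self.power_ok = False
        self.retracting = True

    def power_restored(self) -> None:
        # Power returns; the retract already in progress is NOT cancelled.
        self.power_ok = True

    def retract_done(self) -> None:
        # Heads have reached the retracted position.
        self.retracting = False

    def write_gate_naive(self) -> bool:
        # Assumed faulty rule: write current follows power-good alone.
        return self.power_ok

    def write_gate_interlocked(self) -> bool:
        # Safer rule: no write current while the heads are in motion.
        return self.power_ok and not self.retracting


# A momentary outage: power drops and returns before the retract completes.
d = Drive()
d.power_fail()
d.power_restored()
print("naive gate allows write:      ", d.write_gate_naive())
print("interlocked gate allows write:", d.write_gate_interlocked())
```

With a momentary outage the naive gate re-enables writing in mid-retract,
which is the spiral-damage case; the interlocked gate stays closed until
retract_done() is called.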

The result was invariably a hopelessly damaged pack.  Hundreds of `bad'
sectors appeared in a spiral pattern, and the only means of recovery
was to reformat (followed by a tedious restore from the backup server,
all the while hoping desperately that another storm would not come up
during the multi-hour restore).  The problem has finally been fixed:
the file servers are now on a ten-minute UPS, and only a long-term power
failure---the kind the drives were engineered to handle---will get through.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 405 2750)
Domain:	chris at cs.umd.edu	Path:	uunet!mimsy!chris