strange problems (looking for help)

Dave Martindale dave at onfcanim.UUCP
Sat Apr 26 03:03:58 AEST 1986


In article <279 at entropy.UUCP> hubert at entropy.UUCP (Steve Hubert) writes:
>I wonder if anyone recognizes the following symptoms as symptoms of
>something concrete I can try to fix.  We are running 4.3BSD on a
>VAX11/785.  The disks are 3 RA81s on a single UDA.  The uda device
>driver is version 6.12 from Berkeley (9/16/85) which seems to be equal
>to or derived from a DEC driver from January 84.  I am not getting any
>kernel error messages at all.  Here is symptom number 1:
>
> [examples of cmp'ing a file with itself and getting non-repeatable errors,
> and C compiles which sometimes worked, sometimes not]

I had the same problem when installing our 780, and asked the disk
controller vendor to swap controller boards (Emulex SC780, driving
Eagles).  The problem remained.  The rest of the system passed DEC
diagnostics, so I didn't know where to look next.

Then we started occasionally getting soft ECC errors.  I like to keep
the memory system error-free, so I figured out which memory array board
the error was on and swapped it with another board, just to be sure.
The error remained in the same place!  So I swapped memory
controllers, and the problem did move.  (On the MS780-E memory system,
there are two controllers, one on either side of the central bus
interface board.)  So I pulled the bad controller entirely; the memory
reverted to non-interleaved operation on the remaining half of the
memory, and the mysterious data problems went away.  DEC has since
replaced the bad controller.

Moral of the story: a bad memory controller can mess up your data while
still passing DEC diagnostics and without giving any sort of error.
The memory ECC will catch bad RAM chips, and not much else.
There are also a number of places in the CPU unprotected by parity
checking where an intermittent hardware fault will damage data.
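
Since neither the diagnostics nor the kernel report this kind of fault,
one way to watch for it is to automate the "cmp a file with itself"
test from the original article.  What follows is only a minimal sketch
in Python, not anything from the systems discussed here: the function
names, the choice of SHA-256, and the default of 20 passes are all
assumptions made for illustration.

```python
import hashlib


def file_digest(path, chunk_size=65536):
    """Read the whole file in chunks and return its SHA-256 hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def count_unstable_reads(path, passes=20):
    """Re-read an unchanging file `passes` times (hypothetical helper).

    Returns how many re-reads produced a digest different from the first
    read.  On healthy hardware the result should be 0.
    """
    reference = file_digest(path)
    mismatches = 0
    for _ in range(passes):
        if file_digest(path) != reference:
            mismatches += 1
    return mismatches
```

A nonzero count on a file that nobody is writing to says that something
between the disk and the CPU is corrupting data, but not which
component.  Repeated reads of a small file may be served from the
buffer cache rather than the disk, so a large file exercises more of
the path.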


