VAX 785 won't fully boot

Bob Page page at ulowell.UUCP
Sun Nov 2 06:44:22 AEST 1986


[This message is somewhat long in the hope that I can provide
 enough information to get a resolution --Bob]

Our VAX 11/785 (Ultrix 1.2) crashed the other day, with the message:

machine check 0: cp read timeout or error confirmation fault
	[with 14 more lines of data not reproduced here]
panic: mchk

The message was printed twice (some of the values were different
the second time) then a dump was done (successfully) and the machine
rebooted itself.  Some files were corrupted & corrected by fsck, until
it couldn't do any more.  An operator ran fsck by hand and fixed
the bad/dup inodes, rebooted and things came back up fine & dandy.

A while later (could have been days, we weren't spitting dates at
the console every n minutes) another message:

uba0: uba error sr=10(IVMR) fmer=71 fubar=772150	[ honest, fubar! --BP ]
uqssp0 being reset					[ uqssp0 ? -- BP ]
uda50a0: hard error, 100040 0 0 a0100 c9424ec5 1060081 b0005 0 0 0 80065d08 0 0 0 0

which appeared five times, each time exactly the same, followed by two
more 'machine check 0: cp read timeout or error confirmation fault'
messages as described above, followed by another dump.  (Note that
the 1's could be 'l's; the LA100 console prints them identically.
Also note that 772150 is the address of the uda).
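
The match between fubar=772150 and the uda's "csr 172150" in the
autoconfiguration output can be checked with a little octal arithmetic.
The sketch below assumes the usual UNIBUS convention that the 8 KB I/O
page occupies the top of the 18-bit address space (760000-777777 octal)
and that CSRs are quoted as 16-bit addresses (160000-177777 octal).  The
function names are made up for illustration and are not taken from any
Ultrix or BSD source.

```python
# Assumption: 16-bit CSR addresses in the I/O page (0o160000-0o177777)
# correspond to 18-bit UNIBUS addresses 0o760000-0o777777, i.e. the two
# extra high-order address bits are both set.

HIGH_BITS = 0o600000   # UNIBUS address bits 16 and 17
MASK_16 = 0o177777     # low 16 bits


def csr_to_unibus18(csr16):
    """Return the 18-bit UNIBUS address for a 16-bit I/O-page CSR."""
    if not 0o160000 <= csr16 <= 0o177777:
        raise ValueError("not an I/O-page address: %o" % csr16)
    return csr16 | HIGH_BITS


def unibus18_to_csr(addr18):
    """Return the 16-bit CSR form of an 18-bit I/O-page address."""
    if not 0o760000 <= addr18 <= 0o777777:
        raise ValueError("not an I/O-page address: %o" % addr18)
    return addr18 & MASK_16


if __name__ == "__main__":
    # uda csr from the autoconfiguration listing, in octal
    print(format(csr_to_unibus18(0o172150), "o"))   # prints 772150
    # address reported in the uba error message, in octal
    print(format(unibus18_to_csr(0o772150), "o"))   # prints 172150
```

Under that assumption the failing address and the uda CSR are the same
location, which is consistent with the controller reset and the uda50
hard error that follow the uba error.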

Upon reboot, the root file system was partially trashed.  LOTS of
bad/dup inodes, incorrect block counts, etc.  We were able to bring
up a crippled system and were doing a level 0 dump of the root
to make sure we didn't lose anything else when - it crashed again
(same error messages as above).  After lots of prodding it would
come back up and then crash again.  This went on about six times.
Once, just after the device information, it went down again with:
	panic: sbi0flt
and another dump.  Other times it might mchk after printing the
device information.

Now it won't finish booting at all.  I can specify ra(0,0)vmunix
or our backup vmunix and it will boot them OK, but the machine
hangs without a message someplace between autoconfiguring the
devices (it gets them all OK) and printing the message "Automatic
reboot in progress..."  I never see the message.  I doubt it's
the /etc/rc file because it hangs even when I boot ANY at the
`>>>' prompt.  The CPU run light stays on.

It's possible a special file in /dev was trashed - we saw some
directories (like /usr) that had magically turned into normal files,
and also saw /dev/rmt0 turn into a normal file (which sure didn't
help us with the level 0 dump!  We flooded the root partition!)

So, I have a 785 that won't boot, and don't know why.  DEC FS has
been in and run every diagnostic, replaced boards, etc, all to
no avail.  The hardware is clean, according to DEC.

One suggestion DEC made is that Ultrix can't handle our 24-line
DMZ's DMA, and that's why it was hanging.  I don't buy that, but
if I get the system back up I'll turn off the DMA and see what
that gets me.  I can't shut off the DMA until I get the system
back up anyway.

----------------------------------

So, I don't know what to do, short of rebuilding the kernel and
restoring the root file system from the Ultrix distribution tape.
I'd rather not do that.  I'm about to restore a mini-root on
the swap partition and look at the ra0 partition, hopefully to
discover some critical file that must be restored/fixed before I
can reboot.

Even if I am successful in rebooting the system, I don't know
how to interpret all the mchk data (I don't have Ultrix source
and unfortunately can't put BSD on it for various reasons), so
I can't be sure why it's crashing or how to prevent it.  Surely,
'cp read timeout or error confirmation fault' errors look like
hardware problems.

Any help would be greatly welcomed, acknowledged and appreciated.

..Bob

PS The devices are all DEC:

VAX 11/785, serial no. 2079, hardware level =16
mcr0 (MS780-E) at address 0x20002000, 14Mbytes, internal interleave
uba0 at address 0x20006000
uda0 at uba0 csr 172150 vec 744, ipl 15
ra0 at uda0 slave 0				[ra81 - system disk]
tmscp0 at uba0 csr 174500 vec 770, ipl 15
tms0 at tmscp0 slave 0				[tu81]
lp0 at uba0 csr 177514 vec 200, ipl 14		[lp27]
uba1 at address 0x20008000
uda1 at uba1 csr 172150 vec 774, ipl 15
ra1 at uda1 slave 1				[ra60]
ra2 at uda1 slave 2				[ra60]
de1 at uba1 csr 174510 vec 120, ipl 15		[deuna]

[end of article]
-- 
UUCP: wanginst!ulowell!page	Bob Page, U of Lowell CS Dept
VOX:  +1 617 452 5000 x2976	Lowell MA 01854 USA


