Watchdog reset

Thu Aug 31 12:31:47 AEST 1989

In SunSpots v8n105, Daniel Ehrlich (ehrlich at cs.psu.edu) writes:

> My question is this:  what *causes* a 'watchdog reset', other than pushing
> the 'reset' button on the back of the machine?  In neither case was the
> button pushed.  
> 

A "Watchdog reset" occurs when the watchdog timer (a hardware circuit) on
the CPU board detects that the processor is halted.  The processor is
restarted and vectored into the PROMs at the watchdog reset handler, which
prints the "Watchdog reset" message on the system console.

At this point, the hardware maps and assorted other state have been reset,
so there's no chance to go back to Unix to run the core dump subroutines
in the kernel.  Reboot and start over.

> What I have been told by the folks at Sun is that the 'Watchdog reset'
> occurs when there is a double bit parity error on the VME bus.
> ...
> Sun's standard response is to replace the CPU board.  Although this may
> not be the real cause of the problem.

It can be a bad CPU board.  More generally, it can be provoked by bad
hardware of any kind - CPU, memory, or perhaps a peripheral.  But it can
also be a software problem.

Probably the most common cause of a processor halt is a double bus fault.
That is, the processor takes a fault trap while it is already processing
a fault trap.  The most common cause of that, in turn, is overflowing the
stack - especially the interrupt stack.

SunOS puts a guard page (an invalid page) below the interrupt stack in
kernel virtual address space.  If the processor is in the middle of pushing
an exception frame on the stack - perhaps a level 6 interrupt
interrupting a level 5 interrupt interrupting a level 4 ... (you get the
idea) - and the stack overflows into the guard page, the push itself
faults.  The processor then has nowhere to push the frame for that
fault, so that's a double bus fault.  'Taint nothing the thing can do but
give up.
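
To make the sequence concrete, here is a toy model in Python.  It is not
SunOS code: the addresses, the frame size, and names such as GUARD_TOP and
push_frame are all invented for illustration.  It treats the stack as
growing downward toward an invalid guard page.  A push that lands in the
guard page raises a bus fault, taking that fault needs another push on the
same stack, and when that push faults too the model halts.

```python
# Toy model of a downward-growing interrupt stack with a guard page below it.
# All numbers and names are illustrative assumptions, not SunOS values.

STACK_TOP = 0x2000   # hypothetical highest address of the interrupt stack
GUARD_TOP = 0x1000   # hypothetical: addresses below this are the guard page
FRAME_SIZE = 0x100   # hypothetical size of one exception frame plus locals


class BusFault(Exception):
    """Raised when a push would touch the guard page."""


class Halted(Exception):
    """Raised when a fault occurs while taking a fault (double bus fault)."""


def push_frame(sp):
    """Return the new stack pointer, or raise BusFault on a guard-page hit."""
    new_sp = sp - FRAME_SIZE
    if new_sp < GUARD_TOP:
        raise BusFault(hex(new_sp))
    return new_sp


def take_interrupt(sp):
    """Push an exception frame; a fault here forces a second push."""
    try:
        return push_frame(sp)
    except BusFault:
        # The processor now tries to push a bus-fault frame on the same,
        # already exhausted, stack.
        try:
            return push_frame(sp)
        except BusFault:
            raise Halted("double bus fault") from None


def nest_interrupts(depth):
    """Nest `depth` interrupts and report how the run ended."""
    sp = STACK_TOP
    try:
        for _ in range(depth):
            sp = take_interrupt(sp)
    except Halted:
        return "halted"
    return "ok"
```

With these made-up sizes the stack holds exactly 16 frames, so
nest_interrupts(16) returns "ok" and nest_interrupts(17) returns "halted".
In the real machine nothing catches the halt except the watchdog timer.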

Software can contribute to the problem by
	- not allocating a worst-case-size interrupt stack
	- recursing in an interrupt handler
	- using too big a local stack frame in an interrupt
		handler
	- reenabling interrupts before exiting from the
		interrupt handler
and similar sorts of screwup.
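
The first item in that list is a matter of arithmetic: if every interrupt
priority level can be preempted by each higher one, the worst case is all
levels nested at once.  A rough Python sketch follows.  The per-level
sizes, and the names EXCEPTION_FRAME and handler_frame, are assumptions
made up for illustration, not measurements of any SunOS driver.

```python
# Rough worst-case interrupt stack estimate. All sizes are hypothetical.

EXCEPTION_FRAME = 32   # assumed bytes the hardware pushes per interrupt

# Assumed deepest stack use (locals plus nested calls) per priority level,
# in bytes.
handler_frame = {1: 200, 2: 300, 3: 400, 4: 600, 5: 300, 6: 250, 7: 100}


def worst_case_stack(frames, exception_frame=EXCEPTION_FRAME, fault_frames=1):
    """Sum every level nested at once, plus room for one more fault frame."""
    nested = sum(exception_frame + size for size in frames.values())
    return nested + fault_frames * exception_frame


def fits(stack_bytes, frames):
    """True if the stack holds the worst-case nesting plus one fault frame."""
    return stack_bytes >= worst_case_stack(frames)
```

With these invented numbers, worst_case_stack(handler_frame) comes to 2406
bytes, so a 2048-byte interrupt stack would be too small and a 4096-byte
one would do.  A handler that recurses, or that reenables interrupts
before it exits, breaks the one-frame-per-level assumption, and then no
fixed figure is safe.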

By the way, there's no parity on the VMEbus, so the idea of a
"double bit parity error on the VME bus" is nonsense.  Someone might
have been referring to a "double bit ECC error on the memory bus"; that
results in another type of hardware interrupt.  Normally, Unix will catch
it and panic.  If the processor overflows the stack while trying to take
the ECC trap, then it's double bus fault time, as described above.

Misbehaving hardware can cause other types of interrupts as well,
for example, a timeout on an access to the VMEbus.  These all turn into
processor traps that try to push exception frames on the stack.

The comment in your followup mail:

> The "Watchdog reset" errors seem to occur when both 7053 disk controllers
> are busy.  One can usually generate a "Watchdog reset" in single user mode
> by running fsck(8) in parallel on disks attached to the two controllers.

unfortunately doesn't help resolve whether it's bad hardware or a
software bug.

The only way to deterministically figure out which is to blame is to
hook up a bus analyzer.  Non-deterministically, the usual procedure
is to swap hardware until it seems probable that the problem is generic
rather than a sample defect.

Beau James				beau at Ultra.COM
Ultra Network Technologies		{sun,ames}!ultra.com!beau