5000/200 HANGS intermittently. No error messages. ugh. help.
Van Rauch
van at triton.unm.edu
Fri Oct 19 07:37:49 AEST 1990
In article <1990Oct16.211229.18767 at ariel.unm.edu> van at triton.unm.edu (Van Rauch) writes:
>
>very strange problem with our 5000. About every 3 to 7 days,
>the system will HANG. No messages on the console, nothing in
>syserr.hostname.# (uerf), no core file in /usr/adm/crash (savecore is
>turned on). - nuthin!
>
I should have rtfm'd before I posted. There exists a doc
called:
"Starting the Crash Dump Routine Manually on RISC Processors"
in volume 3 of System and Network Management.
As far as I can tell there is a bug in the 4.0 kernel
that innocuous user and system processes are tripping over.
After spending a few hours with crash vmcore.# vmunix.#,
a trace of the processes that were runnable at the time of
each crash shows a different process each time, and each one
ends up calling panic and then boot. For example:
> proc -r
SLT S PID PPID PGRP UID PY CPU SIGS EVENT FLAGS
...
80 r 4324 3999 4324 7341 113 255 0 in trace pagi
...
> trace 80
Stack trace -- last called first
0 boot (paniced = 0, arghowto = 0) [../../machine/mips/machdep.c: ,545 0x80109ea8]
1 panic (s = 80159828) [../../sys/subr_prf.c: ,1159 0x800a3c18]
2 kn02trap_error (ep = ffffdcf8, code = 80112fcc, sr = 0008, signo = ffffdcd4
...
> ps 80
SLOT PID UID COMMAND
80 4324 7341 (sml)
>
where "sml" is a program made available to students for a cs class.
The $60,000 question is: how does one get the text string for
the argument to panic, e.g. panic (s = 80159828)? Or more
plainly, where do I go from here?
The consensus here is that without adb, one can't get it.
Does anyone know differently?
Each time our 5000 has hung, a different process leads to the
panic and boot, i.e. there is no consistency at the csh level in
which command is tripping the ?kernel? bug. Without more
help from /bin/crash I'm at a loss for how to find
the instruction that does the damage.
---
And now for something completely different...
cmp behaves differently under 4.0
Given two files, foo1 and foo2; foo1 is NONempty and foo2 is empty.
And the script, "cmp.csh":
#! /bin/csh
set x = `cmp foo1 foo2`
echo $x
echo $x[1]
---
under 3.x:
----
fornax.unm.edu:van -> cmp.csh
cmp: EOF on foo2
cmp:
---
under 4.0:
----
triton.unm.edu:van -> cmp.csh
cmp: EOF on foo2
Subscript out of range.
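This happens because cmp under 4.0 was changed to write its EOF
diagnostic to stderr instead of stdout. Under 3.x the EOF
diagnostic goes to stdout, so the backquotes capture it and
$x[1] is "cmp:". Under 4.0 the backquotes capture nothing,
$x is empty, and $x[1] is out of range.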
Yes I'm splitting hairs here, but
when your favorite prof comes to you pulling his/her
hair out because their homegrown script breaks on the "new"
system, it makes you appreciate consistency ;-)
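A script that only needs to know whether two files differ can avoid the problem by testing cmp's exit status instead of parsing its messages. What follows is a minimal Bourne-shell sketch. It assumes that cmp -s (silent mode) and its exit codes behave the same under 3.x and 4.0, which has not been checked on either release. The function name files_differ is made up for illustration.

```shell
#!/bin/sh
# files_differ FILE1 FILE2  (hypothetical helper, not part of any system)
# Returns 0 (true) when cmp reports a difference or cannot read a file,
# and 1 (false) when the two files are identical.
# cmp -s prints nothing; only its exit status is used, so it does not
# matter whether diagnostics would have gone to stdout or stderr.
files_differ() {
    if cmp -s "$1" "$2"; then
        return 1    # cmp exited 0: files are identical
    else
        return 0    # cmp exited non-zero: files differ or an error occurred
    fi
}
```

Called as "if files_differ foo1 foo2; then echo differ; fi", this never looks at the text cmp would have printed. In csh the same idea would be to run cmp -s and then test $status.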
---
Van Rauch van at triton.unm.edu
Application/Systems
University of NM, CIRT