probable fix for Sun 4 network problems

Mon Mar 6 04:49:16 AEST 1989

I believe we have found a fairly serious bug in the network support for
the Sun 4.  We have had various odd things happen on our 4's.  The most
consistent is one machine whose NFS service periodically hangs.  When it
is in this condition, all attempts to write files longer than a few K to
the machine via NFS time out.

I now believe this is because of an incorrect definition of splnet.  This
problem is specific to the Sun 4.  (Splnet is defined in system-specific
code.)  Splnet should raise the machine priority high enough to lock out
the interrupts that are used for incoming packets.  It is very hard to be
sure what level these interrupts are happening at, because the relevant
code is quite hardware-specific, and pulls rabbits out of hats in several
places.  However, there's some reason to think that the interrupt is
happening at priority 2.  Splsoftclock is definitely defined as priority
2, which supports this theory.  Unfortunately splnet is only priority 1.
This means that if the softcall interrupts are really priority 2, splnet
does not lock out ipintr or the various IP-related timeouts.  The result
would be disastrous to the network code, since it means that data
structures related to the network may be garbaged if a packet happens to
arrive while you are in any of the network-related system calls.  Talking
about this issue is complicated by the fact that the Sun 4 uses at least
two different sets of level numbers.  The softclock is definitely level 1.
However level N interrupts occur at CPU priority 2 * N.  The definitions
of the splxxx routines should be using 2 * N, not N.
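
To make the level arithmetic concrete, here is a small sketch in Python.
It is a model of the theory above, not Sun's kernel code.  The function
names are made up for illustration, and it assumes the usual SPARC rule
that an interrupt is taken only when its priority is greater than the
current processor priority (the non-maskable priority 15 is ignored).

```python
# Illustrative model only; names are hypothetical, not from the Sun kernel.

def cpu_priority(level):
    """Per the theory above: a level-N interrupt arrives at CPU priority 2 * N."""
    return 2 * level


def is_locked_out(spl_priority, interrupt_priority):
    """Assumed rule: an interrupt is held off unless its priority
    is strictly greater than the current processor priority."""
    return interrupt_priority <= spl_priority


SOFTCLOCK_LEVEL = 1
soft_priority = cpu_priority(SOFTCLOCK_LEVEL)  # level 1 -> CPU priority 2

for splnet_priority in (1, 2):
    if is_locked_out(splnet_priority, soft_priority):
        state = "locked out"
    else:
        state = "NOT locked out"
    print("splnet at priority %d: priority-%d interrupt is %s"
          % (splnet_priority, soft_priority, state))
```

If the model is right, splnet at priority 1 leaves the priority 2
interrupt free to arrive in the middle of a network system call, and
splnet at priority 2 holds it off.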

After six days, I finally got someone at the AnswerLine to call back.
However they haven't had a chance to check out my theory with any of the
Sun kernel experts.  It would be nice for someone who knew the hardware
specific details to verify that I'm right.  But we have been running for a
week with splnet raised from 1 to 2, and the NFS hangs I was seeing have
not occurred.  We were seeing them every morning before.  So it seems
likely that this problem is real.  I don't see any way that my proposed
patch could cause trouble, even if it turns out not to be necessary.  So
if you are having odd problems with Sun 4 networking, you might try this.
When I get a response from Sun, I'll post it.

If you have source, look in sys/sun4/subr.s.  You'll find that splnet is
defined as RAISE(1).  Change it to RAISE(2).

If you don't have source, try the following.  (I'm being cautious in the
example below and outputting the original in both assembly and hex form,
just so you can verify that your code is the same as ours before the
change.  If not, don't proceed.  This is not an actual transcript, since I
did the patch in source form.)

  cp /vmunix /vmunix.new
  adb -w /vmunix.new
  splnet+8?i      cmp     %g1, 0x100
  .?X
  _splnet+8:      80a06100
  .?W 80a06200
  splnet+1c?i     or      %g1, 0x100, %g1
  .?X
  _splnet+0x1c:   82106100
  .?W 82106200
  ^D

Now bring up /vmunix.new.  If you want to make the change permanent, you
can do the same edit to /sys/sun4/OBJ/subr.o by using "adb -w subr.o".
Then you would build a new kernel as usual.
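
If you want to cross-check the hex words in the adb example before
writing them, the following Python sketch pulls the immediate field out
of each word.  It assumes the standard SPARC encoding (13-bit immediate
in the low bits, flagged by bit 13) and assumes that the immediate here
is a value for the processor interrupt level field, which sits in bits
11..8 of the PSR.  The helper names are made up for illustration.

```python
# Illustrative check of the instruction words quoted in the adb example.
# Helper names are hypothetical; sign extension of the immediate is
# ignored because the values involved are small and positive.

def immediate(word):
    """Return the 13-bit immediate of a SPARC format-3 instruction,
    or None if the i bit (bit 13) says the operand is a register."""
    if not (word >> 13) & 1:
        return None
    return word & 0x1FFF


def interrupt_level(psr_bits):
    """Assumed layout: the processor interrupt level is PSR bits 11..8."""
    return (psr_bits >> 8) & 0xF


WORDS = [
    ("splnet+8    cmp  original", 0x80A06100),
    ("splnet+8    cmp  patched ", 0x80A06200),
    ("splnet+0x1c or   original", 0x82106100),
    ("splnet+0x1c or   patched ", 0x82106200),
]

for label, word in WORDS:
    imm = immediate(word)
    print("%s  %08x  imm=0x%x  level=%d"
          % (label, word, imm, interrupt_level(imm)))
```

Both original words carry an immediate of 0x100 (level 1) and both
patched words carry 0x200 (level 2); nothing outside the immediate field
differs between the original and patched forms.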