Sun-4 severe NFS problem

Dan Franklin dan at watson.bbn.com
Fri Mar 31 06:44:18 AEST 1989


We're having severe NFS problems involving our (only) Sun 4/110, running
SunOS 4.0.1. The symptom is that a cp of a "large" file (more than roughly
1k to 2k bytes) between this machine and any of several others, including
a Sun-3/160 (SunOS 3.4), a MicroVAX (Ultrix 2.3), and our diskless
Sun-3/50 machines (SunOS 3.4), will almost always hang.

We've seen the problem while copying:

1) from a Sun-4 directory to one on the Sun-3/160, while on the Sun-4,
2) from a MicroVAX directory to one on the Sun-4, while on the Sun-4,
3) from a Sun-4 directory to one on the Sun-3/160, while on a Sun-3/50.

We can copy tiny files without any problem.  But we get long delays when
copying larger files, ranging up to a delay of infinity :-) We ran
experiments mostly copying between Suns (cases 1 and 3). The definition of
"large" is not constant, but the threshold seems to be between 1k and 2k
bytes.  With files larger than that, the cp hangs; occasionally it
completes after a long delay, but usually it does not return at all until
it is interrupted.  The hang is generally accompanied by an "NFS server
<hostname> not responding still trying" error message.

The trace command reveals that the cp hangs in a variety of places: doing
a "stat" on the destination directory, or writing to the destination file,
or closing it--but always an operation involving the destination.

While a cp is hung, all of the machines involved in the cp operation
continue to respond to other commands, including other NFS commands.
However, on the initiating machine, you cannot access the destination
directory of the copy. For example, in case 1, an "ls", on the Sun-4, of
the remote directory the file is being copied into will also hang.  But
you can look at that file on the serving machine, as well as on other
machines besides the Sun-4 that have that filesystem mounted.

Other network services, including FTP and rlogin, work perfectly.

These symptoms seem to be quite different from those discussed in other
Sun-4 hanging situations.  No nfsd ever ends up in a permanent "D" wait
state on any of the machines, including the Sun-4.  Unrelated NFS
activities on the two machines in question work fine.

Our problem sounded a little like the interrupt priority bug discussed by
Charles Hedrick recently, so I tried raising the priority of splnet() to 2
and then to 3 by patching the kernel according to his instructions.  It
didn't help.

Naturally, we've called the Sun Hotline.  They said they'd call back in a
few hours; so far it's been two days with no response.

This situation renders our brand new Sun-4 completely useless for the
purpose we bought it for.  We desperately need to get it to work.  Any
suggestions, hints, things to try, wild guesses, etc. will be gratefully
received.

	Dan Franklin
	dfranklin at bbn.com or dan at bbn.com


