Daemons stuck in 'D' "short-term" wait state

Mark J. Kilgard mjk at fluffy.rice.edu
Thu Mar 2 17:10:19 AEST 1989


We recently experienced a problem similar to the ones reported by Rob
McMahon <cudcv%warwick.ac.uk at nss.cs.ucl.ac.uk> (v7n148) and, more
recently, by Dwight Ernest <mcvax!independent!dwight at uunet.uu.net>
(v7n161).

We started getting NFS file server not responding errors from one of our
file servers.  When I logged into the file server and did a ps, it showed
all the nfsd's in the 'D' "short-term" wait state.
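
For anyone who wants to check for this condition automatically, here is a
minimal sketch in Python that scans ps output for nfsd's in the D state.
It assumes the output starts with a header line containing PID, STAT and
COMMAND columns, as BSD-style ps prints.  The function name find_stuck is
made up for illustration and is not part of any existing tool.

```python
def find_stuck(ps_text, name="nfsd"):
    """Return (pid, stat, command) for processes whose command name
    contains `name` and whose state field starts with 'D'.

    Assumes a header line with PID, STAT and COMMAND columns.
    """
    lines = [ln for ln in ps_text.splitlines() if ln.strip()]
    if not lines:
        return []

    # Locate the columns by name instead of hard-coding positions.
    header = lines[0].split()
    pid_idx = header.index("PID")
    stat_idx = header.index("STAT")
    cmd_idx = header.index("COMMAND")

    stuck = []
    for line in lines[1:]:
        # Split only up to COMMAND so the command keeps its arguments.
        fields = line.split(None, cmd_idx)
        if len(fields) <= cmd_idx:
            continue  # malformed or truncated line
        stat, command = fields[stat_idx], fields[cmd_idx]
        if stat.startswith("D") and name in command.split()[0]:
            stuck.append((int(fields[pid_idx]), stat, command))
    return stuck


if __name__ == "__main__":
    import subprocess

    # "ps ax" is an assumption; use whatever flags your ps needs to
    # list all processes with a STAT column.
    out = subprocess.run(["ps", "ax"], capture_output=True, text=True).stdout
    for pid, stat, command in find_stuck(out):
        print(pid, stat, command)
```

Note that a process can pass through the D state briefly during normal
disk I/O, so a single hit does not prove a hang.  It is the same nfsd's
showing up in D on repeated runs that matches what we saw.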

Logically the file server worked fine, but all its clients hung when they
tried to access it over NFS.  It was impossible to kill the nfsd's, and
attempts to start a new set failed.  We were running 8 nfsd's, by the way,
but I don't think that is important.

We rebooted the machine with 'reboot' and the fsck failed.  We did a
manual fsck and 'fix'ed 5-10 inconsistencies.  We changed the
configuration to bring the machine up with only 4 nfsd's and brought the
machine up multi-user.

Everything was fine till about 27 hours later when those 4 nfsd's were
found again in the D state.  The machine was 'reboot'ed and again the fsck
failed.

With some inspection by John Deuel <kink at rice.edu>, an anomaly was found
in the /barn/lost+found directory, where the fsck had complained.  Link
counts were all messed up, and it appeared that there were two copies of
the lost+found inode?

John clri'ed the lost+found inode and did an fsck to fix the resulting
mess.  The machine was rebooted and has been running for 36 hours now.  It
is running with 8 nfsd's presently.  There don't seem to be any more
problems.

It seems reasonable to think that the nfsd's might get confused by
anomalies in the file system and hang in a D state.  Or could it be that
the nfsd's got screwed up and possibly created the anomaly?  I can't
explain the cause of the initial fsck problems - the system had been
running for nearly a week without downtime before the occurrence.

It seemed that the first fsck didn't fix the anomaly (or maybe it just
reappeared?).  Perhaps a small glitch in fsck?

Have people had similar experiences?  If so, what did you guess the cause
to be?  Were there fsck problems before?

- Mark


