I want my bug-free NFS patch!

der Mouse mouse at mcgill-vision.UUCP
Sun Jun 26 15:47:48 AEST 1988


In article <9514 at eddie.MIT.EDU>, nessus at athena.mit.edu (Doug Alan) writes:
> I just installed a patch to NFS that allows you to mount the entire
> filesystem of a remote computer, rather than having to mount all of
> its individual disk partitions.

> The most prominent bug is as follows: Let's say the NFS server is
> called "server" and you are using a client machine.  [Server has
> local disk partition /c.  server:/ is mounted on /@/server.  we cd to
> /@/server/c/foodir, pwd says /foodir instead.]

When I was writing my NFS server, I ran into similar problems.  This
sounds very much as though the inumbers in the returned attribute
structures are the real disk inumbers - which is wrong.  Since
everything under the one mount reaches the client with a single device
number, the client ends up seeing the rather unpleasant situation of
two distinct files having the same (dev,inum) pair.  My solution was to
stripe the space of available inumbers, based on the number of local
disk partitions on the server.
However, given that you're using a patch to an existing NFS
implementation, you don't have the freedom to do this.  I think you're
pretty much out of luck, unless you want to dive rather deeply into the
NFS implementation on the server.
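
For the curious, the striping itself is trivial; something along these
lines is all it takes (the names and the exact arithmetic here are only
illustrative, not lifted from any real implementation).  The cost is
that each partition gets only 1/Nth of the exported inumber space.

/*
 * Illustrative only: partition p (0 <= p < nparts) maps its real inode
 * number i onto the exported number i * nparts + p, so no two
 * partitions can ever hand out the same exported inumber, and the
 * mapping is trivially reversible when a request comes back in.
 */
unsigned long
export_ino(unsigned long real_ino, int part, int nparts)
{
    return real_ino * (unsigned long)nparts + (unsigned long)part;
}

void
import_ino(unsigned long exported, int nparts,
           unsigned long *real_ino, int *part)
{
    *part = (int)(exported % (unsigned long)nparts);
    *real_ino = exported / (unsigned long)nparts;
}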

Why does this cause the above anomaly with pwd?  Because getwd (in pwd)
reads .. looking for the entry whose inumber matches that of .; this
gives it the foodir part.
Then it looks at ../.. and notices it has the same inumber as ..,
because they're both roots of filesystems on the server (server's / and
/c are both filesystem roots on the server, so they're both inode 2 -
the device number gets lost by the time they reach the client).
Normally, the only time foo/ and foo/.. have the same (dev,inum) is
when foo is /, so getwd assumes it's reached /.  Try mounting the disk
on the server somewhere else, instead of /c.  Make it somewhere at
least two levels down in the hierarchy: /etc/bardir, say.  Then try cd
/@/server/etc/bardir/foodir and see what pwd has to say.  This time,
you see, getwd() will not see two consecutive directories with the same
inumber as it winds its way up through .., ../.., ../../.., etc.
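
If it isn't obvious why that keeps getwd() happy, here is a bare-bones
sketch of the sort of loop pwd/getwd uses - not the real library code,
just an illustration.  It ignores local mount points, does no overflow
checking, and leaves you chdir'd to wherever it stopped.

#include <sys/types.h>
#include <sys/stat.h>
#include <dirent.h>
#include <string.h>
#include <unistd.h>

int
sketch_getwd(char *path, int len)
{
    struct stat here, up;
    char name[256], tmp[1024];
    DIR *dp;
    struct dirent *de;

    path[0] = '\0';
    for (;;) {
        if (stat(".", &here) < 0 || stat("..", &up) < 0)
            return -1;

        /*
         * . and .. have the same (dev, inum): assume this is the root.
         * Over a mount of server:/, server's / and /c both show up as
         * inode 2 on the one NFS device, so the loop stops one level
         * too soon and the leading part of the path is lost.
         */
        if (here.st_dev == up.st_dev && here.st_ino == up.st_ino)
            break;

        /* Move up and find our name in the parent by inumber. */
        if (chdir("..") < 0 || (dp = opendir(".")) == NULL)
            return -1;
        name[0] = '\0';
        while ((de = readdir(dp)) != NULL)
            if (de->d_ino == here.st_ino) {
                strcpy(name, de->d_name);
                break;
            }
        closedir(dp);
        if (name[0] == '\0')
            return -1;

        /* Prepend "/name" to what we've built so far. */
        strcpy(tmp, "/");
        strcat(tmp, name);
        strcat(tmp, path);
        strncpy(path, tmp, len - 1);
        path[len - 1] = '\0';
    }
    if (path[0] == '\0')
        strcpy(path, "/");
    return 0;
}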

By now some of you may be wondering how come files with the same
inumber don't get confused with one another.  This is because files are
accessed by file handles, not by inumber.  And the file handles,
presumably, are different.  (If they weren't, such files *would* get
confused.)
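
For instance (purely hypothetical layout - the handle is just opaque
bytes as far as the client is concerned), a server might build its
handles from something like:

/*
 * Hypothetical contents of an opaque NFS file handle.  Because the
 * handle carries the partition id as well as the inode number, two
 * files that share an inumber on different partitions still get
 * distinct handles, even though the client sees identical (dev,inum)
 * attribute pairs for them.
 */
struct fh_contents {
    unsigned long fsid;        /* which local partition on the server */
    unsigned long ino;         /* real inode number on that partition */
    unsigned long generation;  /* guards against inode reuse */
};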

> I have also just noticed another problem since installing this patch.
> I cannot say whether or not this bug has always been there, or
> whether it appeared upon installing this patch.  This problem is
> intermittent and I can not reproduce it on demand.  I was looking at
> a text file that was on the remote machine.  Unfortunately, there
> appeared to be a bunch of nulls on the end of the file that weren't
> really there.  On this particular file, the problem was reproducable
> for a while, but eventually it stopped happening.

The problem persisted for as long as the block was present in the
client's buffer cache, I feel sure.  The question then is "how did it
get there?".  Are the filesystem mounts hard or soft (ie, do timeouts
cause the operation to fail or to retry indefinitely)?  With soft
mounts, the client implementation may wind up producing a bufferful of
nulls when it shouldn't.  If this is what's happening, it's a bug.

I could also see this being due to a race condition: the client tries
to read the last block of the file, based on its idea of the size of
the file.  However, in between its getting the size of the file and its
attempt to read the last block, someone else (another client, or a
process on the server) truncates the file to a shorter size.  The
result may well be a bufferful of nulls.
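
From user level the window looks something like this (the path is
hypothetical; the point is only that the size and the read come from
two separate operations, with room for someone else in between):

/*
 * Illustration of the getattr/read window, done with ordinary system
 * calls.  If somebody truncates the file between the fstat() and the
 * read(), what this client gets for the "last block" is up to the
 * implementation - and a bufferful of nulls is one plausible outcome
 * once bad data lands in the buffer cache.
 */
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

#define BLKSIZE 8192

int
main(void)
{
    char buf[BLKSIZE];
    struct stat st;
    off_t last;
    long n;
    int fd;

    fd = open("/@/server/c/foodir/somefile", O_RDONLY);  /* hypothetical */
    if (fd < 0 || fstat(fd, &st) < 0)
        return 1;

    last = (st.st_size / BLKSIZE) * BLKSIZE;   /* offset of the last block */

    /* ... the race: another client truncates the file right here ... */

    if (lseek(fd, last, SEEK_SET) < 0)
        return 1;
    n = read(fd, buf, sizeof(buf));
    printf("expected up to %ld bytes at %ld, got %ld\n",
           (long)(st.st_size - last), (long)last, n);
    close(fd);
    return 0;
}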

					der Mouse

			uucp: mouse at mcgill-vision.uucp
			arpa: mouse at larry.mcrcim.mcgill.edu


