NFS vs. flock, revisited

Fri Mar 31 06:24:08 AEST 1989

I got lots of replies to my query about a way to make advisory locking
work over NFS.  The consensus seems to be to use fcntl (or lockf), which
relies on lockd to talk to the NFS server.  Thanks (so far, the
replies are still coming in) to Vic Abell, Glenn Barry (who even sent me
source for a "flock" that uses fcntl), Mark Sommer, Stephen X. Nahm, Jim
Grande, Guy Harris, and Robert Claeson.

Now a problem: it doesn't seem to work.  There are a couple of things.
First, when we (the local Sun sys admin and I) ran my test for the first
time after starting up the statd and lockd processes on the clients, we
noticed that all of the test processes immediately hung in state D.  This
sounds like a problem with lockd that one of the respondents mentioned;
could someone elaborate on it?  Anyway, we looked for the lockd processes
we had started and they had *all died*.  After restarting them, the test
processes started running.  Bizarre, but non-fatal.  Second, even when the
test processes run OK, the results are NOT OK: the common log file (which
the processes are all trying to get exclusive locks on) contains much
overwritten text.

Just so you all know what I'm dealing with, I rewrote the locking function
I was talking about so that it uses the following algorithm.

	1. Try to lock the common log file using fcntl with type == F_WRLCK
	   and whence == start == len == 0.  If fcntl fails with errno EAGAIN
	   or EACCES (the latter is apparently equivalent to EAGAIN on
	   Pyramid OSx, despite claims in the manpage that that feature
	   hasn't been implemented yet :-), try up to 7 more times after
	   short sleeps.

	   If the lock cannot be granted after eight fcntl calls, or the
	   reason for failure is "non-trivial", return an error code.

	2. Seek (actually fseek()) to EOF on the common log file.

	3. Write the message (fprintf()/fflush()) to the common log file.

	4. Unlock the log file and return.

Does it sound like anything I'm doing here is wrong?  This code works
great on Pyramids running NFS; it only fails on our Suns.  I could, if I
absolutely had to, use write instead of fprintf.  It would be very ugly
and very painful to implement, so I don't think it buys anything, but I
could be wrong.

There is of course the possibility that we're not running the right lockd,
statd, or something else on the Suns.  I don't have the faintest idea how
to find out.  (We had to start up statd and lockd on our client
workstations just to test this, btw - they don't normally run those
daemons because they're sufficiently bulky to cause complaints about
response time.)  If anyone suspects this to be a problem, let me know, and
also let me know how I can find out - I'm a total novice at anything
NFS-ish.

Phil Kos