file locking issues, NFS, lockf

Wed May 1 05:21:17 AEST 1991

=This is a slight modification of a posting that appeared in comp.sys.sun.
=I received only a few answers which seemed to open as many questions as they
=answered. I now call upon the unix wizards to help me out.

I have written an application that is similar to a network database
application in which data is stored in an NFS-accessible file.  To protect
against multiple simultaneous updates, I have used the lockf subroutine to lock
the entire file.  I have had numerous problems with the lockf routine "locking
up".  The symptoms vary:

	S1. The client dies and the server doesn't realize it.  In order to
	avoid processes being killed when they own the lock, I catch the
	following signals: 
		signal( SIGHUP, clnp );
		signal( SIGQUIT, clnp );
		signal( SIGINT, clnp );
		signal( SIGILL, clnp );
		signal( SIGIOT, clnp );
		signal( SIGEMT, clnp );
		signal( SIGFPE, clnp );
		signal( SIGBUS, clnp );
		signal( SIGSEGV, clnp );
		signal( SIGSYS, clnp );
		signal( SIGTERM, clnp );
	Should I catch more?

	FYI, here's what the lock code looks like:

	  for(NumAttempts = 0;NumAttempts <= NUMPOLLS ; NumAttempts++){
	    if( lockf( fd, F_TLOCK, 0L ) != (-1))	{
	      success = TRUE;
	      break;
	    }
	    sleep(2);
	  }
	I avoid the indefinitely blocking lock (F_LOCK) because this appears to
	increase the probability that an error will occur.

	S2. Sometimes the client doesn't die--it just hangs.  Attaching a debugger
to the hung program shows that it is stuck inside fcntl.

	S3. Occasionally, I get messages like 
		unknown klm_reply proc(0)
		unknown klm_reply proc(40)

	Does anyone have any idea where these come from?

	Other questions include:
	1. Is there any known way to unconfuse our machines and reset the lock
state without rebooting them?  Killing statd and lockd is not
sufficient.

	2. I was once told that Sun released patches to their lock daemon, but
no one could direct me to them.  Does a wizard know where such things exist?

	3. If lockf cannot be made to work, would I be at risk using the old
technique of creating a "lock directory"?  I've read that with NFS this won't
work, but I've never read a good explanation of the problems with this approach.
Are there other workarounds (semaphores, etc.) that I should try?
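
To be concrete about what "lock directory" means here, the technique is
roughly the sketch below.  The names lockdir_acquire and lockdir_release are
made up for illustration.  It relies on mkdir failing with EEXIST when the
directory already exists; whether that test stays atomic between NFS clients
is exactly the open question, so treat this as a statement of the idea and not
as a known-good solution.  Note also that, unlike lockf, nothing removes the
directory if the holder dies.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: take the lock by creating a directory.
 * Returns  0 if we created it (we now hold the lock),
 *          1 if it already exists (someone else holds the lock),
 *         -1 on any other mkdir error (errno is left set). */
int lockdir_acquire(const char *path)
{
    if (mkdir(path, 0700) == 0)
        return 0;
    return (errno == EEXIST) ? 1 : -1;
}

/* Release the lock by removing the directory.
 * Returns 0 on success, -1 on error. */
int lockdir_release(const char *path)
{
    return rmdir(path);
}
```

A caller would poll lockdir_acquire in the same way as the lockf loop above
and call lockdir_release from its cleanup routine.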

I would prefer to get this to work properly using lockf, since this seems to
be exactly what lockf is designed for.

Our network consists of SPARCstation 1+ and IPC machines running SunOS 4.0.1,
4.1 or 4.1.1, and Sun-3s running SunOS 4.0.3.  In the near future we will also
be using DG AViiON workstations running DG/UX.

		Thanks for any help you can provide,

			-Rob