File locking on networks

Fri Jan 10 14:18:09 AEST 1986

In article <1106 at brl-tgr.ARPA>, gwyn at brl-tgr.ARPA (Doug Gwyn <gwyn>) writes:
> > Note also that a serious file locking mechanism on a network must provide
> > a way for a user program to be notified that the system has broken its lock.
> > This situation occurs when a process locks a file on another machine, 
> > and a comm link between the two machines goes down.  You clearly can't
> > keep your database down for hours while AT&T (grin) puts your long line
> > back in service, so the lock arbiter reluctantly breaks the lock.  (It
> > can't tell if your machine crashed or whether it was just a comm
> > line failure anyway.)  Now everybody can get at the file OK, but when the
> > comm link comes back up, the process will think it owns the lock and
> > will muck with the file.  So far nobody has designed a mechanism to tell
> > the process that this has happened, which means to be safe the system must
> > kill -9 any such process when this happens (e.g. it must make it *look*
> > like the system or process really did crash, even though it was just a
> > comm link failure).  I'm not sure how you even *detect* this situation
> > though.
> 
> I don't see a big problem.  There are three possible cases of failure...
> (2)  Communication link crashes.  (3)  Remote system crashes after
> planting a lock.  Cases (2) and (3) are the interesting ones, but they
> can be easily handled by simply pinging the locking system when a lock
> conflict occurs.  (Various strategies could be used to reduce pinging
> frequency, if desired, but I don't think it would be necessary.)  If the
> locker denies knowledge of the lock, then void it locally and proceed.

I don't see how the above proposal solves anything.  Take case (2).
The system that contains the data notices a lock conflict.  It pings
the system holding the lock.  It gets "network not reachable".  It
voids the lock and the database is now accessible.  OK, but the
database is in an inconsistent state.  Maybe when it breaks the lock it
does a database cleanup.  OK, now suppose the comm link comes back up.
The system that was out of touch still thinks it holds the lock; it's
been pinging the server trying to get an I/O request in (for example).
When the link comes up, the I/O request will get through.  What does the
server do with this request?  If it satisfies it, it has permitted the
database to be changed by someone who doesn't have the lock.  So the
server must reject the request (e.g. a Unix read() or write() call) with
some kind of lock failure error code.  But the application program on the
remote machine still thinks it owns the lock, so it must be written to go
back to the top of the transaction and try to obtain the lock again when
it gets this error code.  There are no such provisions in the System V locking
facilities.  Thus programs written for those facilities will break when
moved onto networks.
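
To make the requirement concrete, here is a minimal sketch in C of what
such provisions might look like.  Everything in it is hypothetical: the
error codes ELOCKBUSY_HYP and ELOCKLOST_HYP, the srv_*() calls, and the
"generation" counter are inventions for illustration, not part of System V
or any real system, and the "network" is just a table of flags.  The
generation counter is only one possible way for the server to recognise a
request made under a lock it has since broken.

```c
#include <assert.h>

/* Hypothetical error codes; nothing like these exists in System V. */
#define ELOCKBUSY_HYP (-1)   /* lock held by a reachable client */
#define ELOCKLOST_HYP (-2)   /* caller's lock was broken by the server */

#define MAX_CLIENTS 4

/* Toy server: one lockable datum and a simulated link table. */
struct server {
    int  lock_holder;            /* client id holding the lock, 0 if free */
    long lock_gen;               /* bumped each time the lock is granted */
    int  link_up[MAX_CLIENTS];   /* nonzero if that client is reachable */
    int  value;                  /* the protected "database" */
};

/* Take the lock.  On a conflict, "ping" the holder (here: look at the
 * link table) and void the lock if the holder cannot be reached.
 * Returns a generation number > 0 on success, ELOCKBUSY_HYP otherwise. */
long srv_lock(struct server *s, int client)
{
    assert(client > 0 && client < MAX_CLIENTS);
    if (s->lock_holder != 0 && s->lock_holder != client) {
        if (s->link_up[s->lock_holder])
            return ELOCKBUSY_HYP;    /* holder answered the ping */
        s->lock_holder = 0;          /* holder unreachable: break the lock */
    }
    s->lock_holder = client;
    return ++s->lock_gen;
}

/* Nonzero if (client, gen) still names the current lock. */
static int lock_valid(const struct server *s, int client, long gen)
{
    return s->lock_holder == client && s->lock_gen == gen;
}

/* Read and write refuse any request made under a broken lock. */
int srv_read(const struct server *s, int client, long gen, int *out)
{
    if (!lock_valid(s, client, gen))
        return ELOCKLOST_HYP;
    *out = s->value;
    return 0;
}

int srv_write(struct server *s, int client, long gen, int value)
{
    if (!lock_valid(s, client, gen))
        return ELOCKLOST_HYP;
    s->value = value;
    return 0;
}

void srv_unlock(struct server *s, int client, long gen)
{
    if (lock_valid(s, client, gen))
        s->lock_holder = 0;
}

/* Client transaction: add delta to the value.  If any step reports that
 * the lock was lost, go back to the top and obtain the lock again. */
int txn_add(struct server *s, int client, int delta, int max_tries)
{
    int attempt, seen;
    long gen;

    for (attempt = 0; attempt < max_tries; attempt++) {
        gen = srv_lock(s, client);
        if (gen < 0)
            continue;                            /* busy: try again */
        if (srv_read(s, client, gen, &seen) == ELOCKLOST_HYP)
            continue;                            /* restart transaction */
        if (srv_write(s, client, gen, seen + delta) == ELOCKLOST_HYP)
            continue;                            /* restart transaction */
        srv_unlock(s, client, gen);
        return 0;
    }
    return ELOCKBUSY_HYP;
}
```

The point of the sketch is the loop in txn_add(): a program that treats a
successful lock call as a guarantee for the rest of the transaction has no
place to put that "continue".  A write issued with an old generation after
the link comes back is refused by srv_write() rather than applied.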

How can I make this clearer?  I'd be glad to be convinced that there is
no problem, but I think there really is...