Doug Rabson dfdcada31e Add the new kernel-mode NFS Lock Manager. To use it instead of the
user-mode lock manager, build a kernel with the NFSLOCKD option and
add '-k' to 'rpc_lockd_flags' in rc.conf.

Highlights include:

* Thread-safe kernel RPC client - many threads can use the same RPC
  client handle safely with replies being de-multiplexed at the socket
  upcall (typically driven directly by the NIC interrupt) and handed
  off to whichever thread matches the reply. For UDP sockets, many RPC
  clients can share the same socket. This allows the use of a single
  privileged UDP port number to talk to an arbitrary number of remote
  hosts.

* Single-threaded kernel RPC server. Adding support for multi-threaded
  server would be relatively straightforward and would follow
  approximately the Solaris KPI. A single thread should be sufficient
  for the NLM since it should rarely block in normal operation.

* Kernel mode NLM server supporting cancel requests and granted
  callbacks. I've tested the NLM server reasonably extensively - it
  passes both my own tests and the NFS Connectathon locking tests
  running on Solaris, Mac OS X and Ubuntu Linux.

* Userland NLM client supported. While the NLM server doesn't have
  support for the local NFS client's locking needs, it does have to
  field async replies and granted callbacks from remote NLMs that the
  local client has contacted. We relay these replies to the userland
  rpc.lockd over a local domain RPC socket.

* Robust deadlock detection for the local lock manager. In particular
  it will detect deadlocks caused by a lock request that covers more
  than one blocking request. As required by the NLM protocol, all
  deadlock detection happens synchronously - a user is guaranteed that
  if a lock request isn't rejected immediately, the lock will
  eventually be granted. The old system allowed for a 'deferred
  deadlock' condition where a blocked lock request could wake up and
  find that some other deadlock-causing lock owner had beaten them to
  the lock.

* Since both local and remote locks are managed by the same kernel
  locking code, local and remote processes can safely use file locks
  for mutual exclusion. Local processes have no fairness advantage
  compared to remote processes when contending to lock a region that
  has just been unlocked - the local lock manager enforces a strict
  first-come first-served model for both local and remote lockers.

Sponsored by:	Isilon Systems
PR:		95247 107555 115524 116679
MFC after:	2 weeks
2008-03-26 15:23:12 +00:00

624 lines
15 KiB
Groff

.\" Copyright (c) 1983, 1993
.\" The Regents of the University of California. All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\" notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\" notice, this list of conditions and the following disclaimer in the
.\" documentation and/or other materials provided with the distribution.
.\" 4. Neither the name of the University nor the names of its contributors
.\" may be used to endorse or promote products derived from this software
.\" without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" @(#)fcntl.2 8.2 (Berkeley) 1/12/94
.\" $FreeBSD$
.\"
.Dd March 8, 2008
.Dt FCNTL 2
.Os
.Sh NAME
.Nm fcntl
.Nd file control
.Sh LIBRARY
.Lb libc
.Sh SYNOPSIS
.In fcntl.h
.Ft int
.Fn fcntl "int fd" "int cmd" "..."
.Sh DESCRIPTION
The
.Fn fcntl
system call provides for control over descriptors.
The argument
.Fa fd
is a descriptor to be operated on by
.Fa cmd
as described below.
Depending on the value of
.Fa cmd ,
.Fn fcntl
can take an additional third argument
.Fa "int arg" .
.Bl -tag -width F_GETOWNX
.It Dv F_DUPFD
Return a new descriptor as follows:
.Pp
.Bl -bullet -compact -offset 4n
.It
Lowest numbered available descriptor greater than or equal to
.Fa arg .
.It
Same object references as the original descriptor.
.It
New descriptor shares the same file offset if the object
was a file.
.It
Same access mode (read, write or read/write).
.It
Same file status flags (i.e., both file descriptors
share the same file status flags).
.It
The close-on-exec flag associated with the new file descriptor
is set to remain open across
.Xr execve 2
system calls.
.El
.It Dv F_DUP2FD
It is functionally equivalent to
.Bd -literal -offset indent
dup2(fd, arg)
.Ed
.Pp
The
.Dv F_DUP2FD
constant is not portable, so it should not be used if portability is needed.
Use
.Fn dup2
instead.
.It Dv F_GETFD
Get the close-on-exec flag associated with the file descriptor
.Fa fd
as
.Dv FD_CLOEXEC .
If the returned value ANDed with
.Dv FD_CLOEXEC
is 0,
the file will remain open across
.Fn exec ,
otherwise the file will be closed upon execution of
.Fn exec
.Fa ( arg
is ignored).
.It Dv F_SETFD
Set the close-on-exec flag associated with
.Fa fd
to
.Fa arg ,
where
.Fa arg
is either 0 or
.Dv FD_CLOEXEC ,
as described above.
.It Dv F_GETFL
Get descriptor status flags, as described below
.Fa ( arg
is ignored).
.It Dv F_SETFL
Set descriptor status flags to
.Fa arg .
.It Dv F_GETOWN
Get the process ID or process group
currently receiving
.Dv SIGIO
and
.Dv SIGURG
signals; process groups are returned
as negative values
.Fa ( arg
is ignored).
.It Dv F_SETOWN
Set the process or process group
to receive
.Dv SIGIO
and
.Dv SIGURG
signals;
process groups are specified by supplying
.Fa arg
as negative, otherwise
.Fa arg
is interpreted as a process ID.
.El
.Pp
The flags for the
.Dv F_GETFL
and
.Dv F_SETFL
flags are as follows:
.Bl -tag -width O_NONBLOCKX
.It Dv O_NONBLOCK
Non-blocking I/O; if no data is available to a
.Xr read 2
system call, or if a
.Xr write 2
operation would block,
the read or write call returns -1 with the error
.Er EAGAIN .
.It Dv O_APPEND
Force each write to append at the end of file;
corresponds to the
.Dv O_APPEND
flag of
.Xr open 2 .
.It Dv O_DIRECT
Minimize or eliminate the cache effects of reading and writing.
The system
will attempt to avoid caching the data you read or write.
If it cannot
avoid caching the data, it will minimize the impact the data has on the cache.
Use of this flag can drastically reduce performance if not used with care.
.It Dv O_ASYNC
Enable the
.Dv SIGIO
signal to be sent to the process group
when I/O is possible, e.g.,
upon availability of data to be read.
.El
.Pp
Several commands are available for doing advisory file locking;
they all operate on the following structure:
.Bd -literal
struct flock {
off_t l_start; /* starting offset */
off_t l_len; /* len = 0 means until end of file */
pid_t l_pid; /* lock owner */
short l_type; /* lock type: read/write, etc. */
short l_whence; /* type of l_start */
int l_sysid; /* remote system id or zero for local */
};
.Ed
The commands available for advisory record locking are as follows:
.Bl -tag -width F_SETLKWX
.It Dv F_GETLK
Get the first lock that blocks the lock description pointed to by the
third argument,
.Fa arg ,
taken as a pointer to a
.Fa "struct flock"
(see above).
The information retrieved overwrites the information passed to
.Fn fcntl
in the
.Fa flock
structure.
If no lock is found that would prevent this lock from being created,
the structure is left unchanged by this system call except for the
lock type which is set to
.Dv F_UNLCK .
.It Dv F_SETLK
Set or clear a file segment lock according to the lock description
pointed to by the third argument,
.Fa arg ,
taken as a pointer to a
.Fa "struct flock"
(see above).
.Dv F_SETLK
is used to establish shared (or read) locks
.Pq Dv F_RDLCK
or exclusive (or write) locks,
.Pq Dv F_WRLCK ,
as well as remove either type of lock
.Pq Dv F_UNLCK .
If a shared or exclusive lock cannot be set,
.Fn fcntl
returns immediately with
.Er EAGAIN .
.It Dv F_SETLKW
This command is the same as
.Dv F_SETLK
except that if a shared or exclusive lock is blocked by other locks,
the process waits until the request can be satisfied.
If a signal that is to be caught is received while
.Fn fcntl
is waiting for a region, the
.Fn fcntl
will be interrupted if the signal handler has not specified the
.Dv SA_RESTART
(see
.Xr sigaction 2 ) .
.El
.Pp
When a shared lock has been set on a segment of a file,
other processes can set shared locks on that segment
or a portion of it.
A shared lock prevents any other process from setting an exclusive
lock on any portion of the protected area.
A request for a shared lock fails if the file descriptor was not
opened with read access.
.Pp
An exclusive lock prevents any other process from setting a shared lock or
an exclusive lock on any portion of the protected area.
A request for an exclusive lock fails if the file was not
opened with write access.
.Pp
The value of
.Fa l_whence
is
.Dv SEEK_SET ,
.Dv SEEK_CUR ,
or
.Dv SEEK_END
to indicate that the relative offset,
.Fa l_start
bytes, will be measured from the start of the file,
current position, or end of the file, respectively.
The value of
.Fa l_len
is the number of consecutive bytes to be locked.
If
.Fa l_len
is negative,
.Fa l_start
means end edge of the region.
The
.Fa l_pid
and
.Fa l_sysid
fields are only used with
.Dv F_GETLK
to return the process ID of the process holding a blocking lock and
the system ID of the system that owns that process.
Locks created by the local system will have a system ID of zero.
After a successful
.Dv F_GETLK
request, the value of
.Fa l_whence
is
.Dv SEEK_SET .
.Pp
Locks may start and extend beyond the current end of a file,
but may not start or extend before the beginning of the file.
A lock is set to extend to the largest possible value of the
file offset for that file if
.Fa l_len
is set to zero.
If
.Fa l_whence
and
.Fa l_start
point to the beginning of the file, and
.Fa l_len
is zero, the entire file is locked.
If an application wishes only to do entire file locking, the
.Xr flock 2
system call is much more efficient.
.Pp
There is at most one type of lock set for each byte in the file.
Before a successful return from an
.Dv F_SETLK
or an
.Dv F_SETLKW
request when the calling process has previously existing locks
on bytes in the region specified by the request,
the previous lock type for each byte in the specified
region is replaced by the new lock type.
As specified above under the descriptions
of shared locks and exclusive locks, an
.Dv F_SETLK
or an
.Dv F_SETLKW
request fails or blocks respectively when another process has existing
locks on bytes in the specified region and the type of any of those
locks conflicts with the type specified in the request.
.Pp
This interface follows the completely stupid semantics of System V and
.St -p1003.1-88
that require that all locks associated with a file for a given process are
removed when
.Em any
file descriptor for that file is closed by that process.
This semantic means that applications must be aware of any files that
a subroutine library may access.
For example if an application for updating the password file locks the
password file database while making the update, and then calls
.Xr getpwnam 3
to retrieve a record,
the lock will be lost because
.Xr getpwnam 3
opens, reads, and closes the password database.
The database close will release all locks that the process has
associated with the database, even if the library routine never
requested a lock on the database.
Another minor semantic problem with this interface is that
locks are not inherited by a child process created using the
.Xr fork 2
system call.
The
.Xr flock 2
interface has much more rational last close semantics and
allows locks to be inherited by child processes.
The
.Xr flock 2
system call is recommended for applications that want to ensure the integrity
of their locks when using library routines or wish to pass locks
to their children.
.Pp
The
.Fn fcntl ,
.Xr flock 2 ,
and
.Xr lockf 3
locks are compatible.
Processes using different locking interfaces can cooperate
over the same file safely.
However, only one of such interfaces should be used within
the same process.
If a file is locked by a process through
.Xr flock 2 ,
any record within the file will be seen as locked
from the viewpoint of another process using
.Fn fcntl
or
.Xr lockf 3 ,
and vice versa.
Note that
.Fn fcntl F_GETLK
returns \-1 in
.Fa l_pid
if the process holding a blocking lock previously locked the
file descriptor by
.Xr flock 2 .
.Pp
All locks associated with a file for a given process are
removed when the process terminates.
.Pp
All locks obtained before a call to
.Xr execve 2
remain in effect until the new program releases them.
If the new program does not know about the locks, they will not be
released until the program exits.
.Pp
A potential for deadlock occurs if a process controlling a locked region
is put to sleep by attempting to lock the locked region of another process.
This implementation detects that sleeping until a locked region is unlocked
would cause a deadlock and fails with an
.Er EDEADLK
error.
.Sh RETURN VALUES
Upon successful completion, the value returned depends on
.Fa cmd
as follows:
.Bl -tag -width F_GETOWNX -offset indent
.It Dv F_DUPFD
A new file descriptor.
.It Dv F_DUP2FD
A file descriptor equal to
.Fa arg .
.It Dv F_GETFD
Value of flag (only the low-order bit is defined).
.It Dv F_GETFL
Value of flags.
.It Dv F_GETOWN
Value of file descriptor owner.
.It other
Value other than -1.
.El
.Pp
Otherwise, a value of -1 is returned and
.Va errno
is set to indicate the error.
.Sh ERRORS
The
.Fn fcntl
system call will fail if:
.Bl -tag -width Er
.It Bq Er EAGAIN
The argument
.Fa cmd
is
.Dv F_SETLK ,
the type of lock
.Pq Fa l_type
is a shared lock
.Pq Dv F_RDLCK
or exclusive lock
.Pq Dv F_WRLCK ,
and the segment of a file to be locked is already
exclusive-locked by another process;
or the type is an exclusive lock and some portion of the
segment of a file to be locked is already shared-locked or
exclusive-locked by another process.
.It Bq Er EBADF
The
.Fa fd
argument
is not a valid open file descriptor.
.Pp
The argument
.Fa cmd
is
.Dv F_DUP2FD ,
and
.Fa arg
is not a valid file descriptor.
.Pp
The argument
.Fa cmd
is
.Dv F_SETLK
or
.Dv F_SETLKW ,
the type of lock
.Pq Fa l_type
is a shared lock
.Pq Dv F_RDLCK ,
and
.Fa fd
is not a valid file descriptor open for reading.
.Pp
The argument
.Fa cmd
is
.Dv F_SETLK
or
.Dv F_SETLKW ,
the type of lock
.Pq Fa l_type
is an exclusive lock
.Pq Dv F_WRLCK ,
and
.Fa fd
is not a valid file descriptor open for writing.
.It Bq Er EDEADLK
The argument
.Fa cmd
is
.Dv F_SETLKW ,
and a deadlock condition was detected.
.It Bq Er EINTR
The argument
.Fa cmd
is
.Dv F_SETLKW ,
and the system call was interrupted by a signal.
.It Bq Er EINVAL
The
.Fa cmd
argument
is
.Dv F_DUPFD
and
.Fa arg
is negative or greater than the maximum allowable number
(see
.Xr getdtablesize 2 ) .
.Pp
The argument
.Fa cmd
is
.Dv F_GETLK ,
.Dv F_SETLK
or
.Dv F_SETLKW
and the data to which
.Fa arg
points is not valid.
.It Bq Er EMFILE
The argument
.Fa cmd
is
.Dv F_DUPFD
or
.Dv F_DUP2FD
and the maximum number of file descriptors permitted for the
process are already in use,
or no file descriptors greater than or equal to
.Fa arg
are available.
.It Bq Er ENOLCK
The argument
.Fa cmd
is
.Dv F_SETLK
or
.Dv F_SETLKW ,
and satisfying the lock or unlock request would result in the
number of locked regions in the system exceeding a system-imposed limit.
.It Bq Er EOPNOTSUPP
The argument
.Fa cmd
is
.Dv F_GETLK ,
.Dv F_SETLK
or
.Dv F_SETLKW
and
.Fa fd
refers to a file for which locking is not supported.
.It Bq Er EOVERFLOW
The argument
.Fa cmd
is
.Dv F_GETLK ,
.Dv F_SETLK
or
.Dv F_SETLKW
and an
.Fa off_t
calculation overflowed.
.It Bq Er EPERM
The
.Fa cmd
argument
is
.Dv F_SETOWN
and
the process ID or process group given as an argument is in a
different session than the caller.
.It Bq Er ESRCH
The
.Fa cmd
argument
is
.Dv F_SETOWN
and
the process ID given as argument is not in use.
.El
.Pp
In addition, if
.Fa fd
refers to a descriptor open on a terminal device (as opposed to a
descriptor open on a socket), a
.Fa cmd
of
.Dv F_SETOWN
can fail for the same reasons as in
.Xr tcsetpgrp 3 ,
and a
.Fa cmd
of
.Dv F_GETOWN
for the reasons as stated in
.Xr tcgetpgrp 3 .
.Sh SEE ALSO
.Xr close 2 ,
.Xr dup2 2 ,
.Xr execve 2 ,
.Xr flock 2 ,
.Xr getdtablesize 2 ,
.Xr open 2 ,
.Xr sigvec 2 ,
.Xr lockf 3 ,
.Xr tcgetpgrp 3 ,
.Xr tcsetpgrp 3
.Sh STANDARDS
The
.Dv F_DUP2FD
constant is non portable.
It is provided for compatibility with AIX and Solaris.
.Sh HISTORY
The
.Fn fcntl
system call appeared in
.Bx 4.2 .
.Pp
The
.Dv F_DUP2FD
constant first appeared in
.Fx 7.1 .