for unknown events.
A number of modules return EINVAL in this instance, and I have left
those alone for now and instead taught MOD_QUIESCE to accept this
as "didn't do anything".
install nfssvc(). It also updates the argument count, but did so
without setting SYF_MPSAFE, effectively removing the MPSAFE flag even
when syscalls.master indicates it doesn't require Giant. This change
forces the modevent to set MPSAFE as a flag to its internal notion of
an argument coutn.
Note: this duplication of information is a bad thing, but is a more
general problem I'm not currently willing to address.
mechanism so that early processing on mbufs can be performed before
a context switch to the NFS server threads. Because of this, if
the socket code is running without Giant, the NFS server also needs
to be able to run the upcall code without relying on the presence on
Giant. This change modifies the NFS server to run using a "giant
code lock" covering operation of the whole subsystem. Work is in
progress to move to data-based locking as part of the NFSv4 server
changes.
Introduce an NFS server subsystem lock, 'nfsd_mtx', and a set of
macros to operate on the lock:
NFSD_LOCK_ASSERT() Assert nfsd_mtx owned by current thread
NFSD_UNLOCK_ASSERT() Assert nfsd_mtx not owned by current thread
NFSD_LOCK_DONTCARE() Advisory: this function doesn't care
NFSD_LOCK() Lock nfsd_mtx
NFSD_UNLOCK() Unlock nfsd_mtx
Constify a number of global variables/structures in the NFS server
code, as they are not modified and contain constants only:
nfsrvv2_procid nfsrv_nfsv3_procid nonidempotent
nfsv2_repstat nfsv2_type nfsrv_nfsv3_procid
nfsrvv2_procid nfsrv_v2errmap nfsv3err_null
nfsv3err_getattr nfsv3err_setattr nfsv3err_lookup
nfsv3err_access nfsv3err_readlink nfsv3err_read
nfsv3err_write nfsv3err_create nfsv3err_mkdir
nfsv3err_symlink nfsv3err_mknod nfsv3err_remove
nfsv3err_rmdir nfsv3err_rename nfsv3err_link
nfsv3err_readdir nfsv3err_readdirplus nfsv3err_fsstat
nfsv3err_fsinfo nfsv3err_pathconf nfsv3err_commit
nfsrv_v3errmap
There are additional structures that should be constified but due
to their being passed into general purpose functions without const
arguments, I have not yet converted.
In general, acquire nfsd_mtx when accessing any of the global NFS
structures, including struct nfssvc_sock, struct nfsd, struct
nfsrv_descript.
Release nfsd_mtx whenever calling into VFS, and acquire Giant for
calls into VFS. Giant is not required for any part of the
operation of the NFS server with the exception of calls into VFS.
Giant will never by acquired in the upcall code path. However, it
may operate entirely covered by Giant, or not. If debug.mpsafenet
is set to 0, the system calls will acquire Giant across all
operations, and the upcall will assert Giant. As such, by default,
this enables locking and allows us to test assertions, but should not
cause any substantial new amount of code to be run without Giant.
Bugs should manifest in the form of lock assertion failures for now.
This approach is similar (but not identical) to modifications to the
BSD/OS NFS server code snapshot provided by BSDi as part of their
SMPng snapshot. The strategy is almost the same (single lock over
the NFS server), but differs in the following ways:
- Our NFS client and server code bases don't overlap, which means
both fewer bugs and easier locking (thanks Peter!). Also means
NFSD_*() as opposed to NFS_*().
- We make broad use of assertions, whereas the BSD/OS code does not.
- Made slightly different choices about how to handle macros building
packets but operating with side effects.
- We acquire Giant only when entering VFS from the NFS server daemon
threads.
- Serious bugs in BSD/OS implementation corrected -- the snapshot we
received was clearly a work in progress.
Based on ideas from: BSDi SMPng Snapshot
Reviewed by: rick@snowhite.cis.uoguelph.ca
Extensive testing by: kris
is not pretty, but it fixes the code so that it no longer violates the
vnode locking rules in the VFS API and doesn't trip any of the locking
assertions enabled by the DEBUG_VFS_LOCKS kernel configuration option.
There is one report that this patch fixed a "locking against myself"
panic on an NFS server that was tripped by a diskless client.
Approved by: re (scottl)
nfsm_srvpathsiz. This macro plucks a length out of an rpc request and
verifies that its size does not exceed NFS_MAXPATHLEN. If it does
it generates an ENAMETOOLONG response.
- Use this macro, and the existing nfsm_srvnamsiz macro in two places
where we deal with paths passed in by the client.
This fixes a linux interoperability bug. Linux was sending oversized path
components which would cause us to ignore the request all together. This
causes linux to hang indefinitly while it waits for a response. This
could still happen in other cases where we error out with EBADRPC.
Sponsored by: Isilon Systems, Inc.
Reviewed by: alfred, fabbri@isilon.com, neal@isilon.com
temporary storage. In the old NFS code it wasn't at all clear if
the value of `tl' was used across or after macro calls, but I'm
fairly confident that the convention was to keep its use local.
Each ex-macro function now uses a local version of this variable,
so all of the double-indirection goes away.
The only exception to the `local use' rule for `tl' is nfsm_clget(),
which is left unchanged by this commit.
Reviewed by: peter
next to equivalent m_len adjustments. Move the nfsm_subs.h macros
into groups depending on which phase they are used in, since that
affects the error recovery requirements. Collect some of the common error
checking into a single macro as preparation for unwinding some more.
Have nfs_rephead return a value instead of secretly modifying args.
Remove some unused function arguments that were being passed around.
Clarify nfsm_reply()'s error handling (I hope).
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.
Sorry john! (your next MFC will be a doosie!)
Reviewed by: peter@freebsd.org, dillon@freebsd.org
X-MFC after: ha ha ha ha
(this commit is just the first stage). Also add various GIANT_ macros to
formalize the removal of Giant, making it easy to test in a more piecemeal
fashion. These macros will allow us to test fine-grained locks to a degree
before removing Giant, and also after, and to remove Giant in a piecemeal
fashion via sysctl's on those subsystems which the authors believe can
operate without Giant.
vm_mtx does not recurse and is required for most low level
vm operations.
faults can not be taken without holding Giant.
Memory subsystems can now call the base page allocators safely.
Almost all atomic ops were removed as they are covered under the
vm mutex.
Alpha and ia64 now need to catch up to i386's trap handlers.
FFS and NFS have been tested, other filesystems will need minor
changes (grabbing the vm lock when twiddling page properties).
Reviewed (partially) by: jake, jhb
This is because calls with M_WAIT (now M_TRYWAIT) may not wait
forever when nothing is available for allocation, and may end up
returning NULL. Hopefully we now communicate more of the right thing
to developers and make it very clear that it's necessary to check whether
calls with M_(TRY)WAIT also resulted in a failed allocation.
M_TRYWAIT basically means "try harder, block if necessary, but don't
necessarily wait forever." The time spent blocking is tunable with
the kern.ipc.mbuf_wait sysctl.
M_WAIT is now deprecated but still defined for the next little while.
* Fix a typo in a comment in mbuf.h
* Fix some code that was actually passing the mbuf subsystem's M_WAIT to
malloc(). Made it pass M_WAITOK instead. If we were ever to redefine the
value of the M_WAIT flag, this could have became a big problem.
mail:
The problem seems to originate with NFS's postop_attr
information that is returned with a read or write RPC.
Within a vm_fault context, the code cannot deal with
vnode_pager_setsize() shrinking a vnode.
The workaround in the patch below stops the nfsm_postop_attr()
macro from ever shrinking a vnode. If the new size in the
postop_attr information is smaller, then it just sets the
nfsnode n_attrstamp to 0 to stop the wrong size getting
used in the future. This change only affects postop_attr
attributes; the nfsm_loadattr() macro works as normal.
The change is implemented by adding a new argument to
nfs_loadattrcache() called 'dontshrink'. When this is
non-zero, nfs_loadattrcache() will never reduce the
vnode/nfsnode size; instead it zeros n_attrstamp.
There remain other was processes can get stuck in vmopar.
Submitted by: Ian Dowse <iedowse@maths.tcd.ie>
Reviewed by: dillon
Tested by: Vadim Belman <voland@lflat.org>
with the new snapshot code.
Update addaliasu to correctly implement the semantics of the old
checkalias function. When a device vnode first comes into existence,
check to see if an anonymous vnode for the same device was created
at boot time by bdevvp(). If so, adopt the bdevvp vnode rather than
creating a new vnode for the device. This corrects a problem which
caused the kernel to panic when taking a snapshot of the root
filesystem.
Change the calling convention of vn_write_suspend_wait() to be the
same as vn_start_write().
Split out softdep_flushworklist() from softdep_flushfiles() so that
it can be used to clear the work queue when suspending filesystem
operations.
Access to buffers becomes recursive so that snapshots can recursively
traverse their indirect blocks using ffs_copyonwrite() when checking
for the need for copy on write when flushing one of their own indirect
blocks. This eliminates a deadlock between the syncer daemon and a
process taking a snapshot.
Ensure that softdep_process_worklist() can never block because of a
snapshot being taken. This eliminates a problem with buffer starvation.
Cleanup change in ffs_sync() which did not synchronously wait when
MNT_WAIT was specified. The result was an unclean filesystem panic
when doing forcible unmount with heavy filesystem I/O in progress.
Return a zero'ed block when reading a block that was not in use at
the time that a snapshot was taken. Normally, these blocks should
never be read. However, the readahead code will occationally read
them which can cause unexpected behavior.
Clean up the debugging code that ensures that no blocks be written
on a filesystem while it is suspended. Snapshots must explicitly
label the blocks that they are writing during the suspension so that
they do not cause a `write on suspended filesystem' panic.
Reorganize ffs_copyonwrite() to eliminate a deadlock and also to
prevent a race condition that would permit the same block to be
copied twice. This change eliminates an unexpected soft updates
inconsistency in fsck caused by the double allocation.
Use bqrelse rather than brelse for buffers that will be needed
soon again by the snapshot code. This improves snapshot performance.
<sys/bio.h>.
<sys/bio.h> is now a prerequisite for <sys/buf.h> but it shall
not be made a nested include according to bdes teachings on the
subject of nested includes.
Diskdrivers and similar stuff below specfs::strategy() should no
longer need to include <sys/buf.> unless they need caching of data.
Still a few bogus uses of struct buf to track down.
Repocopy by: peter
there is nothing we can do about it. In fact, after further review
there simply are not very many instances of the two structures NFS
checks for 'bloat' so I've decided to simply rip the checks out entirely.
Submitted by: Andrew Gallatin <gallatin@cs.duke.edu>
NFSSERVER defined, useful for userland fileservers that want to
use a filehandle type interface to the filesystem.
Submitted by: Assar Westerlund assar@stacken.kth.se
PR: kern/15452
NFS packets, mainly initializing structure pointers to NULL which
are conditionally freed prior to return.
PR: kern/15249
Submitted by: Ian Dowse <iedowse@maths.tcd.ie>
blocks of zeros could wind up in a file written to over NFS by a client.
The problem only occurs a few times per several gigabytes of data. This
problem turned out to be bug #3 below.
bug #1:
B_CLUSTEROK must be cleared when an NFS buffer is reverted from
stage 2 (ready for commit rpc) to stage 1 (ready for write).
Reversions can occur when a dirty NFS buffer is redirtied with new
data.
Otherwise the VFS/BIO system may end up thinking that a stage 1
NFS buffer is clusterable. Stage 1 NFS buffers are not clusterable.
bug #2:
B_CLUSTEROK was inappropriately set for a 'short' NFS buffer (short
buffers only occur near the EOF of the file). Change to only set
when the buffer is a full biosize (usually 8K). This bug has no
effect but should be fixed in -current anyway. It need not be
backported.
bug #3:
B_NEEDCOMMIT was inappropriately set in nfs_flush() (which is
typically only called by the update daemon). nfs_flush()
does a multi-pass loop but due to the lack of vnode locking it
is possible for new buffers to be added to the dirtyblkhd list
while a flush operation is going on. This may result in nfs_flush()
setting B_NEEDCOMMIT on a buffer which has *NOT* yet gone through its
stage 1 write, causing only the commit rpc to be made and thus
causing the contents of the buffer to be thrown away (never sent to
the server).
The patch also contains some cleanup, which only applies to the commit
into -current.
Reviewed by: dg, julian
Originally Reported by: Dan Nelson <dnelson@emsphone.com>