Remove hacks from the NFSv2/3 client intended to handle a lack of a
server-side RPC retransmission cache for non-idempotent operations: these
hacks substituted 0 (success) for the expected EEXIST in the event that
a target name already existed for LINK, SYMLINK, and MKDIR operations,
under the assumption that EEXIST represented a second application of the
original RPC rather than a true failure.
Background: certain NFS operations (in this case, LINK, SYMLINK, and
MKDIR) are not idempotent, as they leave behind persistent state on the
server that prevents them from being replayed without an error; if a UDP
RPC reply is lost, leading to a retransmission by the client, the second
reply will return EEXIST rather than success, as the new object has
already been created. The NFS client previously silently mapped the
EEXIST return into success to paper over this problem.
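In sketch form, the removed workaround amounted to the following (the
helper here is illustrative, not the literal nfs_vnops.c code):

	#include <errno.h>	/* EEXIST; sys/errno.h in the kernel */

	/*
	 * Old behavior: after a LINK, SYMLINK, or MKDIR RPC, map EEXIST
	 * to success on the assumption that the error was triggered by
	 * our own retransmitted request rather than a genuine collision.
	 */
	static int
	old_nonidempotent_fixup(int error)
	{
		if (error == EEXIST)
			error = 0;
		return (error);
	}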
However, in all modern NFS server implementations, a reply cache is kept
in order to retransmit the original reply to a retransmitted request,
rather than performing the operation a second time, allowing this hack
to be avoided. This allows link()-based file locking over NFS to operate
correctly, as an application requesting the creation of a new link for a
file can now tell atomically whether or not it succeeded.
Other NFS clients, including Solaris and Linux, generally follow this
behavior for the same reasons. Most clients also now default to TCP,
which also helps avoid the issue of retransmitted but non-idempotent
requests in most cases.
Reported by: Adam McDougall <mcdouga9 at egr dot msu dot edu>,
Timo Sirainen <tss at iki dot fi>
Reviewed by: mohans
NetApp filers return corrupt post-op attrs in the wcc on NFS error responses.
This is easy to reproduce for EROFS. I am not sure if the attrs can be corrupt
for other NFS error responses. For now, disabling wcc pre-op attr checks and
post-op attr loads on NFS errors (sysctl'ed).
rev. 1.11 of src/sys/geom/geom_vfs.c
rev. 1.516 of src/sys/kern/vfs_bio.c
rev. 1.35 of src/sys/nfs4client/nfs4_vnops.c
rev. 1.272 of src/sys/nfsclient/nfs_vnops.c
rev. 1.195 of src/sys/sys/buf.h
rev. 1.18 of src/sys/sys/bufobj.h
rev. 1.73 of src/sys/ufs/ffs/ffs_extern.h
rev. 1.133 of src/sys/ufs/ffs/ffs_snapshot.c
rev. 1.324 of src/sys/ufs/ffs/ffs_vfsops.c
Avoid dealing with buffers in bdwrite() that are from the other side of
the snaplock divisor in the lock order than the buffer being written. Add a
new BOP, bop_bdwrite(), to do dirty buffer flushing for the same vnode in
bdwrite(). The default implementation, bufbdflush(), refactors the code
from bdwrite(). For ffs device buffers, a specialized implementation is
used.
This commit changes the KPI/KBI, thus recompilation of out-of-tree kernel
modules is required.
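In sketch form, the new hook looks something like this (the types here
are assumptions for illustration; see the buf.h and bufobj.h revisions
above for the authoritative KPI):

	/* Assumed shape of the new buffer operation. */
	typedef void b_bdwrite_t(struct bufobj *bo, struct buf *bp);

	struct buf_ops {
		char		*bop_name;
		b_bdwrite_t	*bop_bdwrite;	/* flush dirty bufs at bdwrite() time */
		/* ... existing write/strategy/sync operations ... */
	};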
Approved by: re (kensmith)
This was actually contained in the MPSAFE NFS client changes, which are not
all being MFC'd; however, this fixes a bug in the previous fix to
nfs_flush().
Guidance from: mohans
- In nfs_flush(), clear the NMODIFIED bit only if there are no dirty
buffers *and* there are no buffers queued up for writing.
- Keep track of the number of in-progress async direct IO writes in the
nfsnode. Make fsync/close wait until all of these drain. Add a check to
nfs_getpage() and nfs_putpage().
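In sketch form (struct field names follow the commit text and may not
match the tree exactly):

	/*
	 * Clear NMODIFIED only when no dirty or queued buffers remain
	 * and all in-progress async direct IO writes have drained.
	 */
	if (vp->v_bufobj.bo_dirty.bv_cnt == 0 &&
	    np->n_directio_asyncwr == 0)
		np->n_flag &= ~NMODIFIED;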
Backing out an earlier change. It seems harmless for NFS to miss the "force
unmount" flag, making the acquisition of the MNT_ILOCK in nfs_request() and
nfs_sigintr() unnecessary. Pointed out by tegge@.
If B_NOCACHE is set, the pages of VM-backed buffers will be invalidated.
However, clean buffers can be backed by dirty VM pages, so invalidating them
can lead to data loss.
Add support for flushing dirty pages in the data invalidation functions
of some network file systems.
This fixes data loss during vnode recycling (and other code paths
using vinvalbuf(*, V_SAVE, *, *)) for data written through an mmap'ed file.
Fixes up the handling of shared vnode lock lookups in the NFS client,
adds a FS type specific flag indicating that the FS supports shared
vnode lock lookups, adds some logic in vfs_lookup.c to test this flag
and set lock flags appropriately.
This change fixes the general problem of cascading vnode locks when an
NFS server goes down.
Ideally, we wouldn't need these changes, as enabling shared vnode lock
lookups globally would work. Unfortunately, UFS, for example, isn't
ready for shared vnode lock lookups and crashes pretty quickly.
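In sketch form, the lookup-time test amounts to this (flag and field
names as recalled from that era; treat them as illustrative):

	/*
	 * In lookup(): take vnode locks shared only if the filesystem
	 * advertises support for shared-lock lookups.
	 */
	if (mp->mnt_kern_flag & MNTK_LOOKUP_SHARED)
		cnp->cn_lkflags = LK_SHARED;
	else
		cnp->cn_lkflags = LK_EXCLUSIVE;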
This change is the result of discussions with Stephan Uphoff (ups@).
Thanks to Kris for shaking out several bugs in NFS with shared vnode
lock lookups in current. MFC'ed per Kris' request.
Reviewed by: ups@
Fix races in the NFS client that could occur when the same
filehandle is looked up by 2 or more processes.
- Don't vrele() the losing vnode, as vfs_hash_insert() vput()'s it.
- Initialize mutexes on the losing nfsnode (as these get destroyed in the
nfsnode reclaim path).
- Move the initialization of the filehandle to before the vfs_hash_insert(), to
close some races which could result in multiple vnodes for the same
filehandle being inserted into the hash.
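The corrected ordering, in sketch form (paraphrasing the items above;
see nfs_nget() for the real code):

	/*
	 * 1. Allocate the nfsnode and initialize its mutexes.
	 * 2. Copy the filehandle into the nfsnode.
	 * 3. Only then vfs_hash_insert(): if the race is lost, the
	 *    winning vnode is returned and the losing vnode has already
	 *    been vput(), so it must not be vrele()'d again.
	 */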
Fix to readdir+ reply handling. When inserting an entry into the namecache,
initialize the nfsnode's ctime. Otherwise a subsequent lookup purges the
just-entered namecache entry.
Approved by: re
Fix a bug where failed NFS writes were not being
retransmitted. This bug results in data corruption. Writes are
silently dropped on EWOULDBLOCK (caused when the socket send buffer is
full and the sockbuf timer fires, with NFS/TCP).
Reviewed by: ups@
Approved by: re
Track the pending mount update in mnt_kern_flag. This eliminates a race
where the MNT_UPDATE flag could be lost when nmount() raced against
sync(), sync_fsync(), or quotactl().
Approved by: re (kensmith)
Fix for a NFS/TCP client bug which would cause the NFS/TCP stream to get
out of sync under heavy loads, forcing frequent reconnects and causing
EBADRPC errors, etc.
Approved by: re
Add a new kernel environment variable "boot.netif.mtu" which is used to
set the MTU prior to mounting root via NFS. This is required if the
server supports a higher-than-default MTU, because the client will not
see the responses otherwise.
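Example of use (in loader.conf(5); the jumbo-frame value is illustrative):
boot.netif.mtu="9000"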
Always supply curthread as argument to nfs_asyncio and nfs_doio
in nfs_strategy. Otherwise, for some buffers, signals would be ignored
on intr mounts.
Reviewed by: mohan
Approved by: pjd (mentor)
Signals may be delivered to the process as well as to the thread. Check for
thread-delivered signals in addition to process-delivered ones.
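In sketch form (proc/thread field names from that era; illustrative):

	/* A signal is pending if queued to either the process or the thread. */
	sigset_t pending;

	pending = p->p_siglist;
	SIGSETOR(pending, td->td_siglist);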
Reviewed by: mohan
Approved by: pjd (mentor)
Adjust minimum iod threads from 4 to 0 -- since we compile the NFS
client into the kernel by default, and many users won't use NFS,
don't start an extra 4 kernel threads that are unused. Once NFS
becomes active, it will start nfsiod's as it needs them.
We might consider mandating a minimum number of iods equal to the number
of active NFS mounts (capped at some value), which would force some
to remain available without having to create a new one if the file
system is mostly inactive.
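Example of use (raising the floor back to the old default at runtime;
assumes the client's vfs.nfs.iodmin sysctl):
sysctl vfs.nfs.iodmin=4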
PR: 70880
Prodded by: cel
Head nod: peter
Pointed out by: Joe <fbsd_user at a1poweruser dot com>
Use a long retransmit timeout for NFS over TCP, mimicking the NFS
reference implementation.
NFS over TCP does not need fast retransmit timeouts, since network loss
and congestion are managed by the transport (TCP), unlike with NFS over
UDP. A long timeout prevents the unnecessary retransmission of non-
idempotent NFS requests.
Reviewed by: mohans, silby, rees?
Sponsored by: Network Appliance, Incorporated
Rework the NFS retransmit timeout estimator, allowing
the estimator to be more easily tuned and maintained.
There should be no functional change except there is now a lower limit
on the retransmit timeout to prevent the client from retransmitting
faster than the server's disks can fill requests, and an upper limit
to prevent the estimator from taking too long to retransmit during a
server outage.
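The estimate is now clamped on both sides; in sketch form (the constant
names are illustrative, not necessarily the in-tree ones):

	/* Clamp the estimated retransmit timeout. */
	if (rto < NFS_MINRTO)
		rto = NFS_MINRTO;	/* don't outrun the server's disks */
	if (rto > NFS_MAXRTO)
		rto = NFS_MAXRTO;	/* recover promptly after an outage */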
Reviewed by: mohan, kris, silby
Sponsored by: Network Appliance, Incorporated
and src/sys/nfsclient/nfs_vnops.c,v 1.262 (by ps@):
- Always return success from NFS strategy. nfs_doio(), in the
event of an error, does the right thing, in terms of setting
the error flags in the buf header. That fixes a crash from
bstrategy().
- Treat ETIMEDOUT as a "recoverable" error, causing the buffer
to be re-dirtied. ETIMEDOUT can occur on soft mounts when
the number of retries is exceeded, and we don't want data loss
in that case.
Submitted by: Mohan Srinivasan
Approved by: re (scottl)
When the server returns NFS3ERR_JUKEBOX for a
request, the FreeBSD NFS client will quickly back off to an excessively
long wait (days, then weeks) before retrying the request.
Change the behavior of the FreeBSD NFS client to match the behavior of
the reference NFS client implementation (Solaris). This provides a fixed
delay of 10 seconds between each retry by default. A sysctl, called
nfs3_jukebox_delay, is now available to tune the delay. Unlike Solaris,
the sysctl value on FreeBSD is in seconds, rather than in HZ.
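Example of use (the vfs.nfs prefix is assumed; the commit names only the
variable itself):
sysctl vfs.nfs.nfs3_jukebox_delay=30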
MFC revision 1.136 to RELENG_6
Sponsored by: Network Appliance, Incorporated
Reviewed by: rick
Approved by: re (kensmith), silby
Fix a bug in the NFS/TCP retransmission path.
The bug was that earlier, if a request was retransmitted,
we would do subsequent retransmits every 10 msecs.
This can cause data corruption under moderate loads by reordering
operations as seen by the client NFS attribute cache, and on the
server side when the retransmission occurs after the original request
has left the duplicate cache, since the operation will be committed
for a second time.
Further work on retransmission handling is needed (e.g., retransmits are
still being sent too often, since they are scaled by HZ, and the size of
the dup cache is too small and easily overwhelmed on busy servers).
Submitted by: mohans
Approved by: re (mux)
Fix a bug in NFSv3 READDIRPLUS reply processing
The client's READDIRPLUS logic skips the attributes and
filehandle of the ".." entry. If the server doesn't send
attributes but does send a filehandle for "..", the
client's logic doesn't account for the extra "value
follows" field that indicates whether the filehandle is
present, causing the remaining entries in the reply
to be ignored.
This is an MFC of 1.264 in the CURRENT branch.
Sponsored by: Network Appliance, Inc.
Reviewed by: rick, mohans
Approved by: re, silby
Add boot.nfsroot.options loader tunable.
It allows specifying options for the NFS root file system.
Currently supported options are: soft, intr, conn, lockd.
I'm adding this functionality mostly for the 'lockd' option, which is only
honored when performing the initial mount and will be silently ignored
if used while updating the mount options.
This will allow the use of flock(2) without the need for varmfs or
rpc.lockd and friends.
Example of use:
boot.nfsroot.options="intr,lockd"
Approved by: re (scottl)
Fix the file size truncation bug in
vnode_create_vobject() while preserving the binary ABI
to filesystem modules in RELENG_6: introduce a new function
vnode_create_vobject_off() that takes the size argument
as off_t; move all stock file systems to it; re-implement
the old vnode_create_vobject() using vnode_create_vobject_off()
so that old or binary-only FS modules can work w/o hitting the
bug. The trick is to pass a size of 0 to vnode_create_vobject_off()
so that it will call VOP_GETATTR() and thus get the actual,
untruncated file size even if the calling module still uses
the old vnode_create_vobject().
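In sketch form (the argument types are an assumption; the authoritative
definitions are in the revisions being merged):

	/*
	 * Old entry point, preserved for binary compatibility.  Passing
	 * a size of 0 makes the new function ignore the possibly
	 * truncated caller-supplied size and query it via VOP_GETATTR().
	 */
	int
	vnode_create_vobject(struct vnode *vp, size_t isize, struct thread *td)
	{
		return (vnode_create_vobject_off(vp, (off_t)0, td));
	}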
PR: kern/92243
Approved by: re (scottl)
In nfs_dolock(), GC the now-unused ioflg argument, rendered obsolete when we
moved from using a fifo to talk to rpc.lockd to using a special device node.
Approved by: re (scottl)