fdvp and fvp vnodes are not locked, and race with reclaim cannot be handled
by the generic bypass routine.
Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31310
Handle it in fifo_close by checking for v_fifoinfo == NULL
Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31310
Caller of VOP_LOOKUP() passes dvp locked and expect it locked on return.
Relock of lower vnode in any case could leave upper vnode reclaimed and
unlocked.
Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31310
The upper vnode reference to the lower vnode is the only reference that
keeps our pointer to the lower vnode alive. If lower vnode is relocked
during the VOP call, upper vnode might become unlocked and reclaimed,
which invalidates our reference.
Add a transient vhold around VOP call.
Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31310
The advlock VOP takes the vnode unlocked, which makes the normal bypass
function racy. Same as null_pgcache_read(), nullfs implementation needs
to take interlock and reference lower vnode under it.
Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31310
In certain emergency cases such as media failure or removal, UFS will
initiate a forced unmount in order to prevent dirty buffers from
accumulating against the no-longer-usable filesystem. The presence
of a stacked filesystem such as nullfs or unionfs above the UFS mount
will prevent this forced unmount from succeeding.
This change addreses the situation by allowing stacked filesystems to
be recursively unmounted on a taskqueue thread when the MNT_RECURSE
flag is specified to dounmount(). This call will block until all upper
mounts have been removed unless the caller specifies the MNT_DEFERRED
flag to indicate the base filesystem should also be unmounted from the
taskqueue.
To achieve this, the recently-added vfs_pin_from_vp()/vfs_unpin() KPIs
have been combined with the existing 'mnt_uppers' list used by nullfs
and renamed to vfs_register_upper_from_vp()/vfs_unregister_upper().
The format of the mnt_uppers list has also been changed to accommodate
filesystems such as unionfs in which a given mount may be stacked atop
more than one lower mount. Additionally, management of lower FS
reclaim/unlink notifications has been split into a separate list
managed by a separate set of KPIs, as registration of an upper FS no
longer implies interest in these notifications.
Reviewed by: kib, mckusick
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D31016
For NFSv4.1/4.2, the client may set the "seqid" field of the
stateid to 0 in RPC requests. This indicates to the server that
it should not check the "seqid" or return NFSERR_OLDSTATEID if the
"seqid" value is not up to date w.r.t. Open/Lock operations
on the stateid. This "seqid" is incremented by the NFSv4 server
for each Open/OpenDowngrade/Lock/Locku operation done on the stateid.
Since a failure return of NFSERR_OLDSTATEID is of no use to
the client for I/O operations, it makes sense to set "seqid"
to 0 for the stateid argument for I/O operations.
This avoids server failure replies of NFSERR_OLDSTATEID,
although I am not aware of any case where this failure occurs.
This makes the FreeBSD NFSv4.1/4.2 client compatible with the
Linux NFSv4.1/4.2 client.
MFC after: 2 weeks
Since MAXPHYS now allows the FreeBSD NFS client
to do 1Mbyte I/O operations, add a sysctl called vfs.nfsd.srvmaxio
so that the maximum NFS server I/O size can be set up to 1Mbyte.
The Linux NFS client can also do 1Mbyte I/O operations.
The default of 128Kbytes for the maximum I/O size has
not been changed for two reasons:
- kern.ipc.maxsockbuf must be increased to support 1Mbyte I/O
- The limited benchmarking I can do actually shows a drop in I/O rate
when the I/O size is above 256Kbytes.
However, daveb@spectralogic.com reports seeing an increase
in I/O rate for the 1Mbyte I/O size vs 128Kbytes using a Linux client.
Reviewed by: asomers
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D30826
hst will be nul-terminated but the remaining space in the buffer is left
uninitialized. Avoid copying the entire buffer to ensure that
uninitialized bytes are not leaked via statfs(2).
Reported by: KMSAN
Reviewed by: rmacklem
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31167
Commit 844aa31c6d87 added cache_enter_time_flags(), specifically
so that the NFS client could specify that cache enter replace
any stale entry for the same name. Doing so avoids a KASSERT()
panic() in cache_enter_time(), as reported by the PR.
This patch uses cache_enter_time_flags() for Readdirplus, to
avoid the panic(), since it is impossible for the NFS client
to know if another client (or a local process on the NFS server)
has replaced a file with another file of the same name.
This patch only affects NFS mounts that use the "rdirplus"
mount option.
There may be other places in the NFS client where this needs
to be done, but no panic() has been observed during testing.
PR: 257043
MFC after: 2 weeks
The fi_rgen and fi_wgen fields are generation numbers used when sleeping
waiting for the other end of the fifo to be opened. The fields were not
explicitly initialized after allocation, but this was harmless. To
avoid false positives from KMSAN, though, ensure that they get
initialized to zero.
Reported by: KMSAN
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Linux has had an "nconnect" NFS mount option for some time.
It specifies that N (up to 16) TCP connections are to created for a mount,
instead of just one TCP connection.
A discussion on freebsd-net@ indicated that this could improve
client<-->server network bandwidth, if either the client or server
have one of the following:
- multiple network ports aggregated to-gether with lagg/lacp.
- a fast NIC that is using multiple queues
It does result in using more IP port#s and might increase server
peak load for a client.
One difference from the Linux implementation is that this implementation
uses the first TCP connection for all RPCs composed of small messages
and uses the additional TCP connections for RPCs that normally have
large messages (Read/Readdir/Write). The Linux implementation spreads
all RPCs across all TCP connections in a round robin fashion, whereas
this implementation spreads Read/Readdir/Write across the additional
TCP connections in a round robin fashion.
Reviewed by: markj
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D30970
When the setting of kern.ipc.maxsockbuf is less than what is
desired for I/O based on vfs.maxbcachebuf and vfs.nfs.bufpackets,
a console message of "Consider increasing kern.ipc.maxsockbuf".
is printed.
This patch modifies the message to provide a suggested value
for kern.ipc.maxsockbuf.
Note that the setting is only needed when the NFS rsize/wsize
is set to vfs.maxbcachebuf.
While here, make nfs_bufpackets global, so that it can be used
by a future patch that adds a sysctl to set the NFS server's
maximum I/O size. Also, remove "sizeof(u_int32_t)" from the maximum
packet length, since NFS_MAXXDR is already an "overestimate"
of the actual length.
MFC after: 2 weeks
Each unionfs node holds a reference to its parent directory vnode.
A single open file reference can therefore end up keeping an
arbitrarily deep vnode hierarchy in place. When that reference is
released, the resulting VOP_RECLAIM call chain can then exhaust the
kernel stack.
This is easily reproducible by running the unionfs.sh stress2 test.
Fix it by deferring recursive unionfs vnode release to taskqueue
context.
PR: 238883
Reviewed By: kib (earlier version), markj
Differential Revision: https://reviews.freebsd.org/D30748
During FUSE_SETLK, the owner field should uniquely identify the calling
process. The fusefs module now sets it to the process's pid.
Previously, it expected the calling process to set it directly, which
was wrong.
libfuse also apparently expects the owner field to be set during
FUSE_GETLK, though I'm not sure why.
PR: 256005
Reported by: Agata <chogata@moosefs.pro>
MFC after: 2 weeks
Reviewed by: pfg
Differential Revision: https://reviews.freebsd.org/D30622
When NFSv4.1 support was added to the client, the implementation was
still experimental and, as such, the default minor version was set to 0.
Since the NFSv4.1 client implementation is now believed to be solid
and the NFSv4.1/4.2 protocol is significantly better than NFSv4.0,
I beieve that NFSv4.1/4.2 should be used where possible.
This patch changes the default minor version for NFSv4 to be the highest
minor version supported by the NFSv4 server. If a specific minor version
is desired, the "minorversion" mount option can be used to override
this default. This is compatible with the Linux NFSv4 client behaviour.
This was discussed on freebsd-current@ in mid-May 2021 under
the subject "changing the default NFSv4 minor version" and
the consensus seemed to be support for this change.
It also appeared that changing this for FreeBSD 13.1 was
not considered a POLA violation, so long as UPDATING
and RELNOTES entries were made for it.
MFC after: 2 weeks
Allocate nameidata on stack and NDPREINIT() it, for compatibility with
assumptions from other filesystems' lookup code.
Reviewed by: mckusick
Discussed with: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D30041
Every FUSE operation has a unique value in its header. As the name
implies, these values are supposed to be unique among all outstanding
operations. And since FUSE_INTERRUPT is asynchronous and racy, it is
desirable that the unique values be unique among all operations that are
"close in time".
Ensure that they are actually unique by incrementing them whenever we
reuse a fuse_dispatcher object, for example during fsync, write, and
listextattr.
PR: 244686
MFC after: 2 weeks
Reviewed by: pfg
Differential Revision: https://reviews.freebsd.org/D30810
/dev/fuse is always ready for writing, so it's kind of dumb to poll it.
But some applications do it anyway. Better to return ready than EINVAL.
MFC after: 2 weeks
Reviewed by: emaste, pfg
Differential Revision: https://reviews.freebsd.org/D30784
The fusefs driver will print warning messages about FUSE servers that
commit protocol violations. Previously it would print those warnings on
every violation, but that could spam the console. Now it will print
each warning no more than once per lifetime of the mount. There is also
now a dtrace probe for each violation.
MFC after: 2 weeks
Sponsored by: Axcient
Reviewed by: emaste, pfg
Differential Revision: https://reviews.freebsd.org/D30780
When the NFSv4.0 client was implemented, acquisition of a clientid
via SetClientID/SetClientIDConfirm was done upon the first Open,
since that was when it was needed. NFSv4.1/4.2 acquires the clientid
during mount (via ExchangeID/CreateSession), since the associated
session is required during mount.
This patch modifies the NFSv4.0 mount so that it acquires the
clientid during mount. This simplifies the code and makes it
easy to implement "find the highest minor version supported by
the NFSv4 server", which will be done for the default minorversion
in a future commit.
The "start_renewthread" argument for nfscl_getcl() is replaced
by "tryminvers", which will be used by the aforementioned
future commit.
MFC after: 2 weeks
Michael Dexter <editor@callfortesting.org> reported
a crash in FreeNAS, where the first argument to
clnt_bck_svccall() was no longer valid.
This argument is a pointer to the callback CLIENT
structure, which is free'd when the associated
NFSv4 ClientID is free'd.
This appears to have occurred because a callback
reply was still in the socket receive queue when
the CLIENT structure was free'd.
This patch acquires a reference count on the CLIENT
that is not CLNT_RELEASE()'d until the socket structure
is destroyed. This should guarantee that the CLIENT
structure is still valid when clnt_bck_svccall() is called.
It also adds a check for closed or closing to
clnt_bck_svccall() so that it will not process the callback
RPC reply message after the ClientID is free'd.
Comments by: mav
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D30153
A problem was reported via email, where a large (130000+) accumulation
of NFSv4 opens on an NFSv4 mount caused significant lock contention
on the mutex used to protect the client mount's open/lock state.
Although the root cause for the accumulation of opens was not
resolved, it is obvious that the NFSv4 client is not designed to
handle 100000+ opens efficiently.
For a common case where delegations are not being issued by the
NFSv4 server, the code acquires the mutex lock for open/lock state,
finds the delegation list empty and just unlocks the mutex and returns.
This patch adds an NFS mount point flag that is set when a delegation
is issued for the mount. Then the patched code checks for this flag
before acquiring the open/lock mutex, avoiding the need to acquire
the lock for the case where delegations are not being issued by the
NFSv4 server.
This change appears to be performance neutral for a small number
of opens, but should reduce lock contention for a large number of opens
for the common case where server is not issuing delegations.
This commit should not affect the high level semantics of delegation
handling.
MFC after: 2 weeks
Pre-r318997 the code looked like:
if (vp->v_mount->mnt_stat.f_fsid.val[0] != (uint32_t)np->n_vattr.na_filesid[0])
vap->va_fsid = (uint32_t)np->n_vattr.na_filesid[0];
Doing this assignment got lost by r318997 and, as such, NFSv4 mounts
of servers with trees of file systems on the server is broken, due to duplicate
fileno values for the same st_dev/va_fsid.
Although I could have re-introduced the assignment, since the value of
na_filesid[0] is not guaranteed to be unique across the server file systems,
I felt it was better to always do the hash for na_filesid[0,1].
Since dev_t (st_dev/va_fsid) is now 64bits, I switched to a 64bit hash.
There is a slight chance of a hash conflict where 2 different na_filesid
values map to same va_fsid, which will be documented in the BUGS
section of the man page for mount_nfs(8). Using a table to keep track
of mappings to catch conflicts would not easily scale to 10,000+ server file
systems and, when the conflict occurs, it only results in fts(3) reporting
a "directory cycle" under certain circumstances.
Reviewed by: kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D30660
This is aimed at preventing stacked filesystems like nullfs and unionfs
from "losing" their lower mounts due to forced unmount. Otherwise,
VFS operations that are passed through to the lower filesystem(s) may
crash or otherwise cause unpredictable behavior.
Introduce two new functions: vfs_pin_from_vp() and vfs_unpin().
which are intended to be called on the lower mount(s) when the stacked
filesystem is mounted and unmounted, respectively.
Much as registration in the mnt_uppers list previously did, pinning
will prevent even forced unmount of the lower FS and will allow the
stacked FS to freely operate on the lower mount either by direct
use of the struct mount* or indirect use through a properly-referenced
vnode's v_mount field.
vfs_pin_from_vp() is modeled after vfs_ref_from_vp() in that it uses
the mount interlock coupled with re-checking vp->v_mount to ensure
that it will fail in the face of a pending unmount request, even if
the concurrent unmount fully completes.
Adopt these new functions in both nullfs and unionfs.
Reviewed By: kib, markj
Differential Revision: https://reviews.freebsd.org/D30401
Commit d224f05fcfc1 pre-parsed the next operation number for
the put file handle operations. This patch uses this next
operation number, plus the type of the file handle being set by
the put file handle operation, to implement the rules in RFC5661
Sec. 2.6 with respect to replying NFSERR_WRONGSEC.
This patch also adds a check to see if NFSERR_WRONGSEC should be
replied when about to perform Lookup, Lookupp or Open with a file
name component, so that the NFSERR_WRONGSEC reply is done for
these operations, as required by RFC5661 Sec. 2.6.
This patch does not have any practical effect for the FreeBSD NFSv4
client and I believe that the same is true for the Linux client,
since NFSERR_WRONGSEC is considered a fatal error at this time.
MFC after: 2 weeks
Commit 947bd2479ba9 added support for the Secinfo_no_name operation.
When a non-exported file system is being traversed, the list of
security flavors is empty. It turns out that the Linux client
mount attempt fails when the security flavors list in the
Secinfo_no_name reply is empty.
This patch modifies Secinfo/Secinfo_no_name so that it replies
with all four security flavors when the list is empty.
This fixes Linux NFSv4.1/4.2 mounts when the file system at
the NFSv4 root (as specified on a V4: exports(5) line) is
not exported.
MFC after: 2 weeks
RFC5661 Sec. 2.6 specifies when a NFSERR_WRONGSEC error reply can be done.
For the four operations PutFH, PutrootFH, PutpublicFH and RestoreFH,
NFSERR_WRONGSEC can or cannot be replied, depending upon what operation
follows one of these operations in the compound.
This patch modifies nfsrvd_compound() so that it parses the next operation
number before executing any of the above four operations, storing it in
"nextop".
A future commit will implement use of "nextop" to decide if NFSERR_WRONGSEC
can be replied for the above four operations.
This commit should not change the semantics of performing the compound RPC.
MFC after: 2 weeks
The 'nodup' option forces fdescfs to return real vnode behind file
descriptor instead of the fdescfs fd vnode, on lookup. The end result
is that e.g. stat("/dev/fd/3") returns the stat data for the underlying
vnode, if any. Similarly, fchdir(2) works in the expected way.
For open(2), if applied over file descriptor opened with O_PATH, it
effectively re-open that vnode into normal file descriptor which has the
specified access mode, assuming the current vnode permissions allow it.
If the file descriptor does not reference vnode, the behavior is unchanged.
This is done by a mount option, because permission check on open(2) breaks
established fdescfs open semantic of dup(2)-ing the descriptor. So it
is not suitable for /dev/fd mount.
Tested by: Andrew Walker <awalker@ixsystems.com>
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30140
Without this patch, nfsd_checkrootexp() returns failure
and then the NFSv4 operation would reply NFSERR_WRONGSEC.
RFC5661 Sec. 2.6 only allows a few NFSv4 operations, none
of which call nfsv4_checktootexp(), to return NFSERR_WRONGSEC.
This patch modifies nfsd_checkrootexp() to return the
error instead of a boolean and sets the returned error to an RPC
layer AUTH_ERR, as discussed on nfsv4@ietf.org.
The patch also fixes nfsd_errmap() so that the pseudo
error NFSERR_AUTHERR is handled correctly such that an RPC layer
AUTH_ERR is replied to the NFSv4 client.
The two new "enum auth_stat" values have not yet been assigned
by IANA, but are the expected next two values.
The effect on extant NFSv4 clients of this change appears
limited to reporting a different failure error when a
mount that does not use adequate security is attempted.
MFC after: 2 weeks
There are several NFSv4.1/4.2 server operation functions which
have unneeded checks for the NFSv4 root being set up.
The checks are not needed because the operations always follow
a Sequence operation, which performs the check.
This patch deletes these checks, simplifying the code so
that a future patch that fixes the checks to conform with
RFC5661 Sec. 2.6 will be less extension.
MFC after: 2 weeks
The Linux client is now attempting to use the Secinfo_no_name
operation for NFSv4.1/4.2 mounts. Although it does not seem to
mind the NFSERR_NOTSUPP reply, adding support for it seems
reasonable.
I also noticed that "savflag" needed to be 64bits in
nfsrvd_secinfo() since nd_flag in now 64bits, so I changed
the declaration of it there. I also added code to set "vp" NULL
after performing Secinfo/Secinfo_no_name, since these
operations consume the current FH, which is represented
by "vp" in nfsrvd_compound().
Fixing when the server replies NFSERR_WRONGSEC so that
it conforms to RFC5661 Sec. 2.6 still needs to be done
in a future commit.
MFC after: 2 weeks
Instead of requiring all implementations of vfs_quotactl to unbusy
the mount for Q_QUOTAON and Q_QUOTAOFF, add an "mp_busy" in/out param
to VFS_QUOTACTL(9). The implementation may then indicate to the caller
whether it needed to unbusy the mount.
Also, add stbool.h to libprocstat modules which #define _KERNEL
before including sys/mount.h. Otherwise they'll pull in sys/types.h
before defining _KERNEL and therefore won't have the bool definition
they need for mp_busy.
Reviewed By: kib, markj
Differential Revision: https://reviews.freebsd.org/D30556
Parts of libprocstat like to pretend they're kernel components for the
sake of including mount.h, and including sys/types.h in the _KERNEL
case doesn't fix the build for some reason. Revert both the
VFS_QUOTACTL() change and the follow-up "fix" for now.
Instead of requiring all implementations of vfs_quotactl to unbusy
the mount for Q_QUOTAON and Q_QUOTAOFF, add an "mp_busy" in/out param
to VFS_QUOTACTL(9). The implementation may then indicate to the caller
whether it needed to unbusy the mount.
Reviewed By: kib, markj
Differential Revision: https://reviews.freebsd.org/D30218
A problem was reported via email, where a large (130000+) accumulation
of NFSv4 opens on an NFSv4 mount caused significant lock contention
on the mutex used to protect the client mount's open/lock state.
Although the root cause for the accumulation of opens was not
resolved, it is obvious that the NFSv4 client is not designed to
handle 100000+ opens efficiently. When searching for an open,
usually for a match by file handle, a linear search of all opens
is done.
Commit 3f7e14ad9345 added a hash table of lists hashed on file handle
for the opens. This patch uses the hash lists for searching for
a matching open based of file handle instead of an exhaustive
linear search of all opens.
This change appears to be performance neutral for a small number
of opens, but should improve expected performance for a large
number of opens.
This commit should not affect the high level semantics of open
handling.
MFC after: 2 weeks
A problem was reported via email, where a large (130000+) accumulation
of NFSv4 opens on an NFSv4 mount caused significant lock contention
on the mutex used to protect the client mount's open/lock state.
Although the root cause for the accumulation of opens was not
resolved, it is obvious that the NFSv4 client is not designed to
handle 100000+ opens efficiently. When searching for an open,
usually for a match by file handle, a linear search of all opens
is done.
Commit 3f7e14ad9345 added a hash table of lists hashed on file handle
for the opens. This patch uses the hash lists for searching for
a matching open based of file handle instead of an exhaustive
linear search of all opens.
This change appears to be performance neutral for a small number
of opens, but should improve expected performance for a large
number of opens. This patch also moves any found match to the front
of the hash list, to try and maintain the hash lists in recently
used ordering (least recently used at the end of the list).
This commit should not affect the high level semantics of open
handling.
MFC after: 2 weeks
A problem was reported via email, where a large (130000+) accumulation
of NFSv4 opens on an NFSv4 mount caused significant lock contention
on the mutex used to protect the client mount's open/lock state.
Although the root cause for the accumulation of opens was not
resolved, it is obvious that the NFSv4 client is not designed to
handle 100000+ opens efficiently. When searching for an open,
usually for a match by file handle, a linear search of all opens
is done.
This patch adds a table of hash lists for the opens, hashed on
file handle. This table will be used by future commits to
search for an open based on file handle more efficiently.
MFC after: 2 weeks