r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.
This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.
Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT
Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718
If node attribute returned in the reply for read rpc indicate
truncation, and it happens that the vnode is exclusively locked,
update of the node attributes would try to shrink vnode size. Since
during the read some vnode pages were busied by the reading thread,
vnode_pager_setsize() deadlocks waiting for the busy state owned by
the caller.
Use a thread-local flag to indicate that NFS read owns some (s)busy
pages states and postpone the call to vnode_pager_setsize() until the
thread relinguishes the ownership.
Diagnosed by: rlibby
Tested by: pho, rlibby
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
O_SEARCH is defined by POSIX [0] to open a directory for searching, skipping
permissions checks on the directory itself after the initial open(). This is
close to the semantics we've historically applied for O_EXEC on a directory,
which is UB according to POSIX. Conveniently, O_SEARCH on a file is also
explicitly undefined behavior according to POSIX, so O_EXEC would be a fine
choice. The spec goes on to state that O_SEARCH and O_EXEC need not be
distinct values, but they're not defined to be the same value.
This was pointed out as an incompatibility with other systems that had made
its way into libarchive, which had assumed that O_EXEC was an alias for
O_SEARCH.
This defines compatibility O_SEARCH/FSEARCH (equivalent to O_EXEC and FEXEC
respectively) and expands our UB for O_EXEC on a directory. O_EXEC on a
directory is checked in vn_open_vnode already, so for completeness we add a
NOEXECCHECK when O_SEARCH has been specified on the top-level fd and do not
re-check that when descending in namei.
[0] https://pubs.opengroup.org/onlinepubs/9699919799/
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23247
Filesystems which want to use it in limited capacity can employ the
VOP_UNLOCK_FLAGS macro.
Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D21427
If nfsrpc_getdirpath() returns NFSERR_MINORVERMISMATCH, it would erroneously
get mapped to EIO. This was not particularily harmful, but would make it
hard for sysadmins to diagnose why an NFSv4 mount is failing.
mount_nfs.c still needs to be fixed so that it does not report
NFSERR_MINORVERMISMATCH as an unknown error 10021.
MFC after: 1 week
None of these case were actually using the variable(s) uninitialized, but
I figured that silencing the warnings via initializing them made sense.
Some of these predated r355677.
This patch adds support for NFSv4.2 (RFC-7862) and Extended Attributes
(RFC-8276) to the NFS client and server.
NFSv4.2 is comprised of several optional features that can be supported
in addition to NFSv4.1. This patch adds the following optional features:
- posix_fadvise(POSIX_FADV_WILLNEED/POSIX_FADV_DONTNEED)
- posix_fallocate()
- intra server file range copying via the copy_file_range(2) syscall
--> Avoiding data tranfer over the wire to/from the NFS client.
- lseek(SEEK_DATA/SEEK_HOLE)
- Extended attribute syscalls for "user" namespace attributes as defined
by RFC-8276.
Although this patch is fairly large, it should not affect support for
the other versions of NFS. However it does add two new sysctls that allow
a sysadmin to limit which minor versions of NFSv4 a server supports, allowing
a sysadmin to disable NFSv4.2.
Unfortunately, when the NFS stats structure was last revised, it was assumed
that there would be no additional operations added beyond what was
specified in RFC-7862. However RFC-8276 did add additional operations,
forcing the NFS stats structure to revised again. It now has extra unused
entries in all arrays, so that future extensions to NFSv4.2 can be
accomodated without revising this structure again.
A future commit will update nfsstat(1) to report counts for the new NFSv4.2
specific operations/procedures.
This patch affects the internal interface between the nfscommon, nfscl and
nfsd modules and, as such, they all must be upgraded simultaneously.
I will do a version bump (although arguably not needed), due to this.
This code has survived a "make universe" but has not been built with a
recent GCC. If you encounter build problems, please email me.
Relnotes: yes
The current vnode layout is not smp-friendly by having frequently read data
avoidably sharing cachelines with very frequently modified fields. In
particular v_iflag inspected for VI_DOOMED can be found in the same line with
v_usecount. Instead make it available in the same cacheline as the v_op, v_data
and v_type which all get read all the time.
v_type is avoidably 4 bytes while the necessary data will easily fit in 1.
Shrinking it frees up 3 bytes, 2 of which get used here to introduce a new
flag field with a new value: VIRF_DOOMED.
Reviewed by: kib, jeff
Differential Revision: https://reviews.freebsd.org/D22715
We might race with reclaim, and then this is no longer a nfs vnode, in
which case we do not need to handle deferred vnode_pager_setsize()
either.
Reported by: rk@ronald.org
PR: 242184
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
flag and use the same system.
This enables further fault locking improvements by allowing more faults to
proceed with a shared lock.
Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22116
Make the nfsclient always call vnode_pager_setsize() with the vnode
exclusively locked. This ensures that page fault always can find the
backing page if the object size check succeeded. Set VV_VMSIZEVNLOCK
flag on NFS nodes.
The main offender breaking the interface in nfsclient is
nfs_loadattrcache(), which is used whenever server responded with
updated attributes, which can happen on non-changing operations as
well. Also, iod threads only have buffers locked (and even that is
LK_KERNPROC), but they still may call nfs_loadattrcache() on RPC
response.
Instead of immediately calling vnode_pager_setsize() if server
response indicated changed file size, but the vnode is not exclusively
locked, set a new node flag NVNSETSZSKIP. When the vnode exclusively
locked, or when we can temporary upgrade the lock to exclusive, call
vnode_pager_setsize(), by providing the nfsclient VOP_LOCK() implementation.
Tested by: pho
Discussed with: rmacklem
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D21883
Atomics are used for page busy and valid state when the shared busy is
held. The details of the locking protocol and valid and dirty
synchronization are in the updated vm_page.h comments.
Reviewed by: kib, markj
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21594
To be consistent with replacing the mtx_lock()/mtx_unlock() calls on
the NFS node mutex (n_mtx) and ncl_iod_mutex, this patch replaces
all mtx_assert() calls on these mutexes with macros as well.
This will simplify changing these locks to sx locks in a future commit.
However, this change may be delayed indefinitely, since it appears there
is a deadlock when vnode_pager_setsize() is called to shrink the size
and the NFS node lock is held.
There is no semantic change as a result of this commit.
Suggested by: kib
MFC after: 1 week
Since the NFS node mutex needs to change to an sx lock so it can be held when
vnode_pager_setsize() is called and the iod lock is held when the NFS node lock
is acquired, the iod mutex will need to be changed to an sx lock as well.
To simply the future commit that changes both the NFS node lock and iod lock
to sx locks, this commit replaces all mtx_lock()/mtx_unlock() calls on the
iod lock with macros.
There is no semantic change as a result of this commit.
I don't know when the future commit will happen and be MFC'd, so I have
set the MFC on this commit to one week so that it can be MFC'd at the same
time.
Suggested by: kib
MFC after: 1 week
For a long time, some places in the NFS code have locked/unlocked the
NFS node lock with the macros NFSLOCKNODE()/NFSUNLOCKNODE() whereas
others have simply used mtx_lock()/mtx_unlock().
Since the NFS node mutex needs to change to an sx lock so it can be held when
vnode_pager_setsize() is called, replace all occurrences of mtx_lock/mtx_unlock
with the macros to simply making the change to an sx lock in future commit.
There is no semantic change as a result of this commit.
I am not sure if the change to an sx lock will be MFC'd soon, so I put
an MFC of 1 week on this commit so that it could be MFC'd with that commit.
Suggested by: kib
MFC after: 1 week
node lock when shrinking.
This is similar to r252528, applied to the above commit.
Apparently there is a race which makes necessary at least to keep the
n_size and pager size consistent when extending. Current suspect is
that iod threads perform vnode_pager_setsize() without taking the
vnode lock, which corrupts the file content.
Reported and tested by: Masachika ISHIZUKA <ish@amail.plala.or.jp>
Discussed with: rmacklem (related issues)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
vnode_pager_setsize() under the node mutex.
r248567 moved some calls of vnode_pager_setsize() after the node lock
is unlocked, do the rest now.
Reported and tested by: peterj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
These were fully neutered in r177676 (2008), but not removed at the time for
unclear reasons. They're totally dead code, so go ahead and yank them now.
No functional change.
Current implementation of vnode_create_vobject() and
vnode_destroy_vobject() is written so that it prepared to handle the
vm object destruction for live vnode. Practically, no filesystems use
this, except for some remnants that were present in UFS till today.
One of the consequences of that model is that each filesystem must
call vnode_destroy_vobject() in VOP_RECLAIM() or earlier, as result
all of them get rid of the v_object in reclaim.
Move the call to vnode_destroy_vobject() to vgonel() before
VOP_RECLAIM(). This makes v_object stable: either the object is NULL,
or it is valid vm object till the vnode reclamation. Remove code from
vnode_create_vobject() to handle races with the parallel destruction.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D21412
This silly code segment has existed in the sources since it was brought
into FreeBSD 10 years ago. I honestly have no idea why this was done.
It was possible that I thought that it might have been better to not
set B_ASYNC for the "else" case, but I can't remember.
Anyhow, this patch gets rid of the if/else that does the same thing
either way, since it looks silly and upsets a static analyser.
This will have no semantic effect on the NFS client.
PR: 238167
vtruncbuf takes a "struct ucred*" argument. AFAICT, it's been unused ever
since that function was first added in r34611. Remove it. Also, remove some
"struct ucred" arguments from fuse and nfs functions that were only used by
vtruncbuf.
Reviewed by: cem
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20377
The more appropriate place to do the flushing is VOP_OPEN(). This was
uncovered because VOP_SET_TEXT() is now called with the vnode'
vm_object rlocked, which is incompatible with the flush operations.
After the move, there is no need for NFS-specific VOP_SET_TEXT
overload.
Sponsored by: The FreeBSD Foundation
MFC after: 30 days
kern_execve() locks text vnode exclusive to be able to set and clear
VV_TEXT flag. VV_TEXT is mutually exclusive with the v_writecount > 0
condition.
The change removes VV_TEXT, replacing it with the condition
v_writecount <= -1, and puts v_writecount under the vnode interlock.
Each text reference decrements v_writecount. To clear the text
reference when the segment is unmapped, it is recorded in the
vm_map_entry backed by the text file as MAP_ENTRY_VN_TEXT flag, and
v_writecount is incremented on the map entry removal
The operations like VOP_ADD_WRITECOUNT() and VOP_SET_TEXT() check that
v_writecount does not contradict the desired change. vn_writecheck()
is now racy and its use was eliminated everywhere except access.
Atomic check for writeability and increment of v_writecount is
performed by the VOP. vn_truncate() now increments v_writecount
around VOP_SETATTR() call, lack of which is arguably a bug on its own.
nullfs bypasses v_writecount to the lower vnode always, so nullfs
vnode has its own v_writecount correct, and lower vnode gets all
references, since object->handle is always lower vnode.
On the text vnode' vm object dealloc, the v_writecount value is reset
to zero, and deadfs vop_unset_text short-circuit the operation.
Reclamation of lowervp always reclaims all nullfs vnodes referencing
lowervp first, so no stray references are left.
Reviewed by: markj, trasz
Tested by: mjg, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D19923
r340744 broke the NFSv4 client, because it replaced pfind_locked() with a
call to pfind(), since pfind() acquires the sx lock for the pid hash and
the NFSv4 already holds a mutex when it does the call.
The patch fixes the problem by recreating a pfind_any_locked() and adding the
functions pidhash_slockall() and pidhash_sunlockall to acquire/release
all of the pid hash locks.
These functions are then used by the NFSv4 client instead of acquiring
the allproc_lock and calling pfind().
Reviewed by: kib, mjg
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D19887
There are a few places that use hand crafted versions of the macros
from sys/netinet/in.h making it difficult to actually alter the
values in use by these macros. Correct that by replacing handcrafted
code with proper macro usage.
Reviewed by: karels, kristof
Approved by: bde (mentor)
MFC after: 3 weeks
Sponsored by: John Gilmore
Differential Revision: https://reviews.freebsd.org/D19317
that can happen when rerooting into NFSv4 rootfs with kernel
built with INVARIANTS.
I've talked to rmacklem@ (back in 2017), and while the root cause
is still unknown, the case guarded by assertion (nfscl_doclose()
being called from VOP_INACTIVE) is believed to be safe, and the
whole thing seems to run just fine.
Obtained from: CheriBSD
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
o In vm_pager_bufferinit() create pbuf_zone and start accounting on how many
pbufs are we going to have set.
In various subsystems that are going to utilize pbufs create private zones
via call to pbuf_zsecond_create(). The latter calls uma_zsecond_create(),
and sets a limit on created zone. After startup preallocate pbufs according
to requirements of all pbuf zones.
Subsystems that used to have a private limit with old allocator now have
private pbuf zones: md(4), fusefs, NFS client, smbfs, VFS cluster, FFS,
swap, vnode pager.
The following subsystems use shared pbuf zone: cam(4), nvme(4), physio(9),
aio(4). They should have their private limits, but changing that is out of
scope of this commit.
o Fetch tunable value of kern.nswbuf from init_param2() and while here move
NSWBUF_MIN to opt_param.h and eliminate opt_swap.h, that was holding only
this option.
Default values aren't touched by this commit, but they probably should be
reviewed wrt to modern hardware.
This change removes a tight bottleneck from sendfile(2) operation, that
uses pbufs in vnode pager. Other pagers also would benefit from faster
allocation.
Together with: gallatin
Tested by: pho
The NFS client code (nfsrpc_readdir() and nfsrpc_readdirplus()) wasn't
filling in parts of the readdir reply, such as d_pad[01] and the bytes
at the end of d_name within d_reclen. As such, data left in a buffer cache
block could be leaked to userland in the readdir reply.
This patch makes sure all of the data is filled in.
Reported by: Thomas Barabosch, Fraunhofer FKIE
Reviewed by: kib, markj
MFC after: 2 weeks
Prior to this patch, nfs_advlock() did NFSVOPUNLOCK(); return (error);
in many places. This patch replaces these code sequenences with a "goto out;"
and does the NFSVOPUNLOCK(); return (error); at the end of the function
in order to make the vnode locking simpler.
This patch does not change the semantics of nfs_advlock().
Suggested by: kib
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D17853
Leave ptrace(2) alone for the moment as it's defined to take a caddr_t.
Reviewed by: kib
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D17852
This will enable callers to take const paths as part of syscall
decleration improvements.
Where doing so is easy and non-distruptive carry the const through
implementations. In UFS the value is passed to an interface that must
take non-const values. In ZFS, const poisoning would touch code shared
with upstream and it's not worth adding diffs.
Bump __FreeBSD_version for external API consumers.
Reviewed by: kib (prior version)
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D17805
A crash was reported where the crash occurred in nfs_advlock() when the
NFS_ISV4(vp) macro was being executed. This was caused by the vnode
being VI_DOOMED due to a forced dismount in progress.
This patch fixes the problem by locking the vnode before executing the
NFS_ISV4() macro.
Tested by: rlibby
PR: 232673
Reviewed by: kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D17757
Use bypass to catch any NFS VOP dispatch and route it through the
wrapper which does sigdeferstop() and then dispatches original
VOP. NFS does not need a bypass below it, which is not supported.
The vop offset in the vop_vector is added since otherwise it is
impossible to get vop_op_t from the internal table, and I did not
wanted to create the layered fs only to wrap NFS VOPs.
VFS_OP()s wrap is straightforward.
Requested and reviewed by: mjg (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D17658
Newer versions of gcc generate "might not be initialized" warnings for
several variables in nfsrpc_doiods(). I have checked and all of these
variables are assigned values before they are used.
In the one case of "tdrpc", it could have passed garbage as an argument
to nfscl_dofflayoutio() when mirrorcnt is one. However nfscl_dofflayoutio() only
uses the argument when mirrorcnt > 1, so it wasn't actually broken.
This patch initializes "tdrpc" to avoid confusion and initializes the rest
to make the compiler happy.
Requested by: mmacy
Newer versions of gcc generate "set, but not used" warnings.
Add __unused macros to silence these warnings.
Although the variables are not being used, they are values parsed from
arguments to callback RPCs that might be needed in the future.
Requested by: mmacy
When a NFSv4.1 client mount using pNFS detects a failure trying to do a
Renew (actually just a Sequence operation), the code would simply try
again and again and again every 30sec.
This would tie up the "nfscl" thread, which should also be doing other
things like Renews on other DSs and the MDS.
This patch adds code which closes down the TCP connection and marks it
defunct when Renew detects an failure to communicate with the DS, so
further Renews will not be attempted until a new working TCP connection to
the DS is established.
It also makes the call to nfscl_cancelreqs() unconditional, since
nfscl_cancelreqs() checks the NFSCLDS_SAMECONN flag and does so while holding
the lock.
This fix only applies to the NFSv4.1 client whne using pNFS and without it
the only effect would have been an "nfscl" thread busy doing Renew attempts
on an unresponsive DS.
MFC after: 2 weeks
Without this patch, the client side NFSv4.1 pNFS code erroneously did writes
and commits to both DS mirrors using the TCP connection of the first one.
For my test setup this worked, since I have both DSs running on the same
machine, but it would have failed when the DSs are on separate machines.
This patch fixes the code to use the correct TCP connection for each DS.
This patch should only affect the NFSv4.1 client when using "pnfs" mounts
to mirrored DSs.
MFC after: 2 weeks
So long as the TCP connection to a pNFS DS isn't shared with other DSs,
it can be closed down when the DS is being disabled in the pNFS client.
This causes any RPCs in progress to fail.
This patch only affects the NFSv4.1 pNFS client when errors occur
while doing I/O on a DS.
MFC after: 2 weeks
an I/O attempt on a DS to the server via LayoutReturn.
The current FreeBSD client can generate these errors for an operational
DS while doing a recovery of a mirror after a mirrored DS has been repaired.
I am not sure why these errors occur, but my best current guess is a race
between the Layout Recall issued by the kernel code run from pnfsdscopymr(8)
and a Read operation on the DS for the file bing copied.
The errrors are not fatal, since the client falls back on doing I/O through
the MDS, which can do the I/O successfully as a proxy. (The fact that the
MDS can do this indicates that the file does still exist on the functioning
DS.)
This patch only affects behaviour of the pNFS client and only when using
Flexible File layouts.
MFC after: 2 weeks
Without this patch, the NFSv4.1 pNFS client shared a single TCP connection
for all DSs that resided on the same machine. This made disabling one of
the DSs impossible. Although unlikely, it is possible that the storage
subsystem has failed in such a way that the storage for one DS on a machine
is no longer functioning correctly, but the storage used by another DS on
the same machine is still ok. For this case, it would be nice if a system
can fail one of the DSs without failing them all.
This patch changes the default behaviour to use separate TCP connections
for each DS even if they reside on the same machine.
I do not believe that this will be a problem for extant pNFS servers, but
a sysctl can be set to restore the old behaviour if this change causes a
problem for an extant pNFS server.
This patch only affects the NFSv4.1 pNFS client.
MFC after: 2 weeks