If the FUSE server tells the kernel that a file's size has changed, then
the kernel must invalidate any portion of that file in cache. But the
kernel can't do that during VOP_STRATEGY, because the file's buffers are
already locked. Instead, proceed with the write.
PR: 256937
Reported by: Agata <chogata@moosefs.pro>
Tested by: Agata <chogata@moosefs.pro>
MFC after: 2 weeks
Reviewed by: pfg
Differential Revision: https://reviews.freebsd.org/D32332
fuse_vnop_bmap needs to know the file's size in order to calculate the
optimum amount of readahead. If the file's size is unknown, it must ask
the FUSE server. But if the file's data was previously cached and the
server reports that its size has shrunk, fusefs must invalidate the
cached data. That's not possible during VOP_BMAP because the buffer
object is already locked.
Fix the panic by not querying the FUSE server for the file's size during
VOP_BMAP if we don't need it. That's also a slight performance
optimization.
PR: 256937
Reported by: Agata <chogata@moosefs.pro>
Tested by: Agata <chogata@moosefs.pro>
MFC after: 2 weeks
If the FUSE server does something that would make our cache incoherent,
we should print a warning to the user. However, we previously warned in
some situations when we shouldn't, such as if the file's size changed on
the server _after_ our own attribute cache had expired. This change
suppresses the warning in cases like that. It also moves the warning
logic to a single place within the code.
PR: 256936
Reported by: Agata <chogata@moosefs.pro>
Tested by: Agata <chogata@moosefs.pro>, jSML4ThWwBID69YC@protonmail.com
MFC after: 2 weeks
These began to become obsolete in d6d64f0f2c (r137739) and the deal
was later sealed in 003e18aef4 (r137801) when vfs.fifofs.fops was
dropped and vop-bypass for pipes became mandatory.
PR: 225934
Suggested by: markj
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D32270
For a pNFS server configuration, an NFSv4.2 Deallocate operation
is proxied to the DS(s). The code that parsed the reply for the
proxy RPC was broken and did not process the pre-operation attributes.
This patch fixes the problem.
This bug would only affect pNFS servers built from recent main/FreeBSD14
sources.
For file systems that allow it, fusefs will skip FUSE_OPEN,
FUSE_RELEASE, FUSE_OPENDIR, and FUSE_RELEASEDIR operations, a minor
optimization.
MFC after: 2 weeks
Reviewed by: pfg
Differential Revision: https://reviews.freebsd.org/D32141
Commit 5e5ca4c8fc added a flag to a NFSv4 mount point that is set when
the first delegation is acquired from the NFSv4 server.
For a common case where delegations are not being issued by the
NFSv4 server, the nfscl_removedeleg() code acquires the mutex lock for
open/lock state, finds the delegation list empty, then just unlocks the
mutex and returns. This patch adds a check of the flag to avoid the
need to acquire the mutex for this common case.
This change appears to be performance neutral for a small number
of opens, but should reduce lock contention for a large number of opens
in the common case where the server is not issuing delegations.
This commit should not affect the high level semantics of delegation
handling.
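
A minimal userspace sketch of the idea, not the actual nfscl_removedeleg()
code: a flag that is set once the first delegation is recorded lets the
removal path return without taking the mutex. The struct and function
names, and the lock_acquisitions counter, are hypothetical and exist only
for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for a delegation and the per-mount state. */
struct deleg {
	struct deleg *next;
	int fileid;
};

struct mount_state {
	pthread_mutex_t state_lock;	/* stands in for the open/lock state mutex */
	atomic_bool deleg_issued;	/* set once the first delegation is acquired */
	struct deleg *delegs;		/* protected by state_lock */
	unsigned long lock_acquisitions; /* instrumentation for this sketch only */
};

static void
mount_state_init(struct mount_state *ms)
{
	pthread_mutex_init(&ms->state_lock, NULL);
	atomic_init(&ms->deleg_issued, false);
	ms->delegs = NULL;
	ms->lock_acquisitions = 0;
}

/* Record a delegation; the flag is set before the entry becomes visible. */
static void
add_deleg(struct mount_state *ms, struct deleg *dp)
{
	assert(dp != NULL);
	pthread_mutex_lock(&ms->state_lock);
	ms->lock_acquisitions++;
	atomic_store(&ms->deleg_issued, true);
	dp->next = ms->delegs;
	ms->delegs = dp;
	pthread_mutex_unlock(&ms->state_lock);
}

/* Returns true if a delegation for fileid was found and unlinked. */
static bool
remove_deleg(struct mount_state *ms, int fileid)
{
	struct deleg **dpp;
	bool found = false;

	/* Fast path: no delegation was ever issued, so skip the mutex. */
	if (!atomic_load(&ms->deleg_issued))
		return (false);

	pthread_mutex_lock(&ms->state_lock);
	ms->lock_acquisitions++;
	for (dpp = &ms->delegs; *dpp != NULL; dpp = &(*dpp)->next) {
		if ((*dpp)->fileid == fileid) {
			*dpp = (*dpp)->next;
			found = true;
			break;
		}
	}
	pthread_mutex_unlock(&ms->state_lock);
	return (found);
}
```

In this sketch the flag is never cleared, which mirrors the description
above of a flag that is set when the first delegation is acquired.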
MFC after: 2 weeks
This fixes an insta-panic when attempting to use unionfs with
DEBUG_VFS_LOCKS. Note that unionfs still has a long way to
go before it's generally stable or usable.
Reviewed by: kib (prior version), markj
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D31917
During VOP_GETPAGES, fusefs needs to determine the file's length, which
could require a FUSE_GETATTR operation. If that fails, it's better to
SIGBUS than panic.
MFC after: 1 week
Sponsored by: Axcient
Reviewed by: markj, kib
Differential Revision: https://reviews.freebsd.org/D31994
Unlike Copy, the NFSv4.2 Allocate and Deallocate operations do not
allow a reply with partial completion. As such, the only way to
limit the time the operation takes to provide a reasonable RPC RTT
is to limit the size of the allocation/deallocation in the NFSv4.2
client.
This patch uses the sysctl vfs.nfs.maxalloclen to set
the limit on the size of the Deallocate operation.
There is no way to know how long a server will take to do a
deallocate operation, but 64Mbytes results in a reasonable
RPC RTT for the slow hardware I test on.
For an 8Gbyte deallocation, the elapsed time for doing it in 64Mbyte
chunks was the same (within margin of variability) as the
elapsed time taken for a single large deallocation
operation for a FreeBSD server with a UFS file system.
Unlike Copy, the NFSv4.2 Allocate and Deallocate operations do not
allow a reply with partial completion. As such, the only way to
limit the time the operation takes to provide a reasonable RPC RTT
is to limit the size of the allocation/deallocation in the NFSv4.2
client.
This patch adds a sysctl called vfs.nfs.maxalloclen to set
the limit on the size of the Allocate operation.
There is no way to know how long a server will take to do an
allocate operation, but 64Mbytes results in a reasonable
RPC RTT for the slow hardware I test on, so that is what
the default value for vfs.nfs.maxalloclen is set to.
For an 8Gbyte allocation, the elapsed time for doing it in 64Mbyte
chunks was the same as the elapsed time taken for a single large
allocation operation for a FreeBSD server with a UFS file system.
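
As a rough illustration of the chunking described above (not the client
code itself), the range can be split into pieces no larger than the limit,
with one RPC issued per piece. The callback type, alloc_in_chunks() and
the counting callback are hypothetical names used only for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical callback standing in for one Allocate RPC; 0 means success. */
typedef int (*alloc_rpc_fn)(int64_t offset, int64_t len, void *arg);

/*
 * Split [offset, offset + len) into pieces no larger than maxlen and issue
 * one RPC per piece. Stops at the first failure and returns its error.
 */
static int
alloc_in_chunks(int64_t offset, int64_t len, int64_t maxlen,
    alloc_rpc_fn rpc, void *arg)
{
	int error;

	assert(offset >= 0 && len >= 0 && maxlen > 0);
	while (len > 0) {
		int64_t chunk = len < maxlen ? len : maxlen;

		error = rpc(offset, chunk, arg);
		if (error != 0)
			return (error);
		offset += chunk;
		len -= chunk;
	}
	return (0);
}

/* Example callback that only counts calls and bytes. */
struct alloc_stats {
	int calls;
	int64_t bytes;
};

static int
count_rpc(int64_t offset, int64_t len, void *arg)
{
	struct alloc_stats *st = arg;

	(void)offset;
	st->calls++;
	st->bytes += len;
	return (0);
}
```

With a 64Mbyte limit, an 8Gbyte request turns into 128 calls in this
sketch.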
MFC after: 2 weeks
By default, the NFS server reports the host UUID as the scope and owner
major values, and zero as the owner minor value. This works well for a
standalone server. But across a CARP-based HA cluster failover the values
should remain persistent, otherwise some clients, such as VMware ESXi,
get confused by the change and fail to reconnect automatically.
The patch makes server scope, major owner and minor owner values
configurable via sysctls. If not set (by default) the host UUID
value is still used.
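
The fallback rule can be sketched as below. This is only an illustration
of "use the configured value if set, else the host UUID"; the function
name is hypothetical and no real sysctl names are implied.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the value the server would report: the administrator-configured
 * string when one is set and non-empty, otherwise the host UUID.
 */
static const char *
effective_scope(const char *configured, const char *host_uuid)
{
	assert(host_uuid != NULL);
	if (configured != NULL && configured[0] != '\0')
		return (configured);
	return (host_uuid);
}
```

Setting the same configured value on every node of a cluster is what keeps
the reported identity stable across a failover in this model.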
Reviewed by: rmacklem
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D31952
As of commit 103b207536, the NFSv4.2 server will limit the size
of a Copy operation based upon a 1 second timeout. The Linux 5.2
kernel server also limits Copy operation size to 4Mbytes.
As such, the NFSv4.2 client can attempt a large Copy without
resulting in a long RPC RTT for these servers.
This patch changes vfs.nfs.maxcopyrange to 64bits and sets
the default to the maximum possible size of SSIZE_MAX, since
a larger size makes the Copy operation more efficient and
allows for copying to complete with fewer RPCs.
The sysctl may need to be made smaller for other non-FreeBSD
NFSv4.2 servers.
MFC after: 2 weeks
Although it is not specified in the RFCs, the concept that
the NFSv4 server should reply to an RPC request within a
reasonable time is accepted practice within the NFSv4 community.
Without this patch, the NFSv4.2 server attempts to reply to
a Copy operation within 1 second by limiting the copy to
vfs.nfs.maxcopyrange bytes (default 10Mbytes). This is crude at
best, given the large variation in I/O subsystem performance.
This patch uses the COPY_FILE_RANGE_TIMEO1SEC flag added by
commit c5128c48df to limit the reply time for a Copy
operation to approximately 1 second.
MFC after: 2 weeks
Running stress2 unionfs tests reliably produces a namei_zone corruption
panic due to unionfs_relookup() attempting to NUL-terminate a
newly-allocated pathname buffer without first validating the buffer length.
Instead, avoid allocating new pathname buffers in unionfs entirely,
using already-provided buffers while ensuring that the correct flags
are set in struct componentname to prevent freeing or manipulation
of those buffers at lower layers.
While here, also compute and store the path length once in the unionfs
node instead of constantly invoking strlen() on it.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D31728
Move the common kernel function signatures from machine/reg.h to a new
sys/reg.h. This is in preparation for adding PT_GETREGSET to ptrace(2).
Reviewed by: imp, markj
Sponsored by: DARPA, AFRL (original work)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19830
The NFSv4.2 Deallocate operation loops on VOP_DEALLOCATE()
while progress is being made (remaining length decreasing).
This patch changes the loop on VOP_ALLOCATE() for the NFSv4.2
Allocate operation to do the same, instead of stopping after
an arbitrary 20 iterations.
MFC after: 2 weeks
This patch adds a VOP_DEALLOCATE() to the NFS client.
For NFSv4.2 servers that support the Deallocate operation,
it is used. Otherwise, it falls back on calling
vop_stddeallocate().
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D31640
Similar to the UFS revision 8df4bc48c8
Reviewed by: mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31464
Use the same locking algorithm for msdosfs_rename() as used by ufs_rename().
Convert doscheckpath() to non-sleeping version.
Reported by: trasz
PR: 257522
In collaboration with: pho
Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31464
LK_EXCLUSIVE must always be passed, but some consumers also need the
ability to specify LK_NOWAIT.
Reviewed by: mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31464
and rename it to msdosfs_lookup_ino(), similarly to UFS
Reviewed by: mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31464
The recently added VOP_DEALLOCATE(9) VOP call allows
implementation of the Deallocate NFSv4.2 operation.
Since the Deallocate operation is a single succeed/fail
operation, the call to VOP_DEALLOCATE(9) loops so long
as progress is being made. It calls maybe_yield()
between loop iterations to allow other processes
to preempt it.
Where RFC 7862 underspecifies behaviour, the code
is written to be Linux NFSv4.2 server compatible.
Reviewed by: khng
Differential Revision: https://reviews.freebsd.org/D31624
Implement VOP_DEALLOCATE to allow hole-punching in the same manner as
POSIX shared memory's fspacectl(SPACECTL_DEALLOC) support.
Sponsored by: The FreeBSD Foundation
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D31684
The partial page invalidation code is factored out to be a separate
helper from tmpfs_reg_resize().
Sponsored by: The FreeBSD Foundation
Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D31683
The NFSv4.2 Allocate operation sanity checks the aa_offset
and aa_length arguments. Since they are assigned to variables
of type off_t (signed) it was possible for them to be negative.
It was also possible for aa_offset+aa_length to exceed OFF_MAX
when stored in lo_end, which is uint64_t.
This patch adds checks for these cases to the sanity check.
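
The two checks can be sketched as follows, assuming off_t is a signed
64-bit type. SKETCH_OFF_MAX and both function names are hypothetical, and
the real server code may order or express the tests differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption for this sketch: off_t is a signed 64-bit type. */
#define SKETCH_OFF_MAX	INT64_MAX

/* Reject negative values and ranges whose end would exceed OFF_MAX. */
static bool
alloc_args_valid(int64_t offset, int64_t length)
{
	if (offset < 0 || length < 0)
		return (false);
	/* Rearranged so the comparison itself cannot overflow. */
	if (offset > SKETCH_OFF_MAX - length)
		return (false);
	return (true);
}

/* The end of the range is only computed for validated arguments. */
static int64_t
range_end(int64_t offset, int64_t length)
{
	assert(alloc_args_valid(offset, length));
	return (offset + length);
}
```

Comparing offset against SKETCH_OFF_MAX - length avoids computing
offset + length before it is known to fit.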
Reviewed by: kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D31511
This patch adds a Lookup+Open compound RPC to the NFSv4.1/4.2
NFS client, which can be used by nfs_lookup() so that a
subsequent Open RPC is not required.
It uses the cn_flags OPENREAD, OPENWRITE added by commit c18c74a87c.
This reduced the number of RPCs by about 15% for a kernel
build over NFS.
For now, use of Lookup+Open is only done when the "oneopenown"
mount option is used. It may be possible for Lookup+Open to
be used for non-oneopenown NFSv4.1/4.2 mounts, but that will
require extensive further testing to determine if it works.
While here, I've added the changes to the nfscommon module
that are needed to implement the Deallocate NFSv4.2 operation.
This avoids needing another cycle of changes to the internal
KAPI between the NFS modules.
This commit has changed the internal KAPI between the NFS
modules and, as such, all need to be rebuilt from sources.
I have not bumped __FreeBSD_version, since it was bumped a
few days ago.
For NFSv4.1/4.2, if the "oneopenown" mount option is used,
there is, at most, only one open stateid for each NFS vnode.
When an open stateid for a file is acquired, set a pointer to
the open structure in the NFS vnode. This pointer can be used to
acquire the open stateid without searching the open linked list
when the following is true:
- No delegations have been issued for the file. Since delegations
can outlive an NFS vnode for a file, use the global
NFSMNTP_DELEGISSUED flag on the mount to determine this.
- No lock stateid has been issued for the file. To determine
this, a new NFS vnode flag called NMIGHTBELOCKED is set when a lock
stateid is issued, which can then be tested.
When this open structure pointer can be used, it avoids the need to
acquire the NFSCLSTATELOCK() and search the open structure list for
an open. The NFSCLSTATELOCK() can be highly contended when there are
a lot of opens issued for the NFSv4.1/4.2 mount.
This patch only affects NFSv4.1/4.2 mounts when the "oneopenown"
mount option is used.
MFC after: 2 weeks
For NFSv4.1/4.2, the client may use either an open, lock or
delegation stateid as the stateid argument for an I/O operation.
RFC 5661 defines an order of preference of delegation, then lock
and finally open stateid for the argument, although NFSv4.1/4.2
servers are expected to handle any stateid type.
For the "oneopenown" mount option, the lock owner was not being
correctly generated and, as such, the I/O operation would use an
open stateid, even when a lock stateid existed. Although this
did not and should not affect an NFSv4.1/4.2 server's behaviour,
this patch makes the behaviour for "oneopenown" the same as when
the mount option is not specified.
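
The order of preference described above (delegation, then lock, then open
stateid) can be sketched as a simple selection function. The types and
names are hypothetical and much simpler than the client's real state
structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stateid; only an identifier for the sketch. */
struct stateid_ref {
	int id;
};

/* Stateids currently held for a file; NULL means "none of this type". */
struct file_stateids {
	const struct stateid_ref *deleg;
	const struct stateid_ref *lock;
	const struct stateid_ref *open;
};

/* Pick the I/O stateid: delegation first, then lock, then open. */
static const struct stateid_ref *
pick_io_stateid(const struct file_stateids *fs)
{
	assert(fs != NULL);
	if (fs->deleg != NULL)
		return (fs->deleg);
	if (fs->lock != NULL)
		return (fs->lock);
	return (fs->open);
}
```

In this model, the bug described above corresponds to the lock entry
never being found, so the function would fall through to the open stateid.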
Found during inspection of packet captures. No failure during
testing against NFSv4.1/4.2 servers of the unpatched code occurred.
MFC after: 2 weeks
The fdvp and fvp vnodes are not locked, and a race with reclaim cannot be
handled by the generic bypass routine.
Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31310
Handle it in fifo_close by checking for v_fifoinfo == NULL
Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31310
The caller of VOP_LOOKUP() passes dvp locked and expects it locked on return.
Relocking the lower vnode in any case could leave the upper vnode reclaimed
and unlocked.
Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31310
The upper vnode reference to the lower vnode is the only reference that
keeps our pointer to the lower vnode alive. If lower vnode is relocked
during the VOP call, upper vnode might become unlocked and reclaimed,
which invalidates our reference.
Add a transient vhold around VOP call.
Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31310
The advlock VOP takes the vnode unlocked, which makes the normal bypass
function racy. As with null_pgcache_read(), the nullfs implementation needs
to take the interlock and reference the lower vnode under it.
Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31310
In certain emergency cases such as media failure or removal, UFS will
initiate a forced unmount in order to prevent dirty buffers from
accumulating against the no-longer-usable filesystem. The presence
of a stacked filesystem such as nullfs or unionfs above the UFS mount
will prevent this forced unmount from succeeding.
This change addresses the situation by allowing stacked filesystems to
be recursively unmounted on a taskqueue thread when the MNT_RECURSE
flag is specified to dounmount(). This call will block until all upper
mounts have been removed unless the caller specifies the MNT_DEFERRED
flag to indicate the base filesystem should also be unmounted from the
taskqueue.
To achieve this, the recently-added vfs_pin_from_vp()/vfs_unpin() KPIs
have been combined with the existing 'mnt_uppers' list used by nullfs
and renamed to vfs_register_upper_from_vp()/vfs_unregister_upper().
The format of the mnt_uppers list has also been changed to accommodate
filesystems such as unionfs in which a given mount may be stacked atop
more than one lower mount. Additionally, management of lower FS
reclaim/unlink notifications has been split into a separate list
managed by a separate set of KPIs, as registration of an upper FS no
longer implies interest in these notifications.
Reviewed by: kib, mckusick
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D31016
For NFSv4.1/4.2, the client may set the "seqid" field of the
stateid to 0 in RPC requests. This indicates to the server that
it should not check the "seqid" or return NFSERR_OLDSTATEID if the
"seqid" value is not up to date w.r.t. Open/Lock operations
on the stateid. This "seqid" is incremented by the NFSv4 server
for each Open/OpenDowngrade/Lock/Locku operation done on the stateid.
Since a failure return of NFSERR_OLDSTATEID is of no use to
the client for I/O operations, it makes sense to set "seqid"
to 0 for the stateid argument for I/O operations.
This avoids server failure replies of NFSERR_OLDSTATEID,
although I am not aware of any case where this failure occurs.
This makes the FreeBSD NFSv4.1/4.2 client compatible with the
Linux NFSv4.1/4.2 client.
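
A small sketch of building the stateid argument for an I/O operation with
"seqid" set to 0. The layout (a 32-bit seqid followed by a 12-byte opaque
"other" field) follows the stateid4 definition in RFC 5661. The struct and
function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_OTHER_SIZE	12	/* size of the opaque "other" field */

struct sketch_stateid {
	uint32_t seqid;
	uint8_t other[SKETCH_OTHER_SIZE];
};

/*
 * Return a copy of the current stateid with seqid forced to 0, which tells
 * the server not to check the seqid for this request.
 */
static struct sketch_stateid
stateid_for_io(const struct sketch_stateid *current)
{
	struct sketch_stateid io;

	assert(current != NULL);
	io.seqid = 0;
	memcpy(io.other, current->other, sizeof(io.other));
	return (io);
}
```

The stored stateid is left untouched in this sketch, so operations that
do need the current "seqid" can still use it.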
MFC after: 2 weeks
Since MAXPHYS now allows the FreeBSD NFS client
to do 1Mbyte I/O operations, add a sysctl called vfs.nfsd.srvmaxio
so that the maximum NFS server I/O size can be set up to 1Mbyte.
The Linux NFS client can also do 1Mbyte I/O operations.
The default of 128Kbytes for the maximum I/O size has
not been changed for two reasons:
- kern.ipc.maxsockbuf must be increased to support 1Mbyte I/O
- The limited benchmarking I can do actually shows a drop in I/O rate
when the I/O size is above 256Kbytes.
However, daveb@spectralogic.com reports seeing an increase
in I/O rate for the 1Mbyte I/O size vs 128Kbytes using a Linux client.
Reviewed by: asomers
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D30826