When a NFSv4 byte range write lock is unlocked, all
data modifications need to be flushed to the server
to satisfy the coherency requirements for byte range
locking. However, if a write delegation for the
file is held by the client, flushing is not required,
since no other NFSv4 client can have the file NFSv4
Opened.
Found by inspection as suggested by a similar change
that was done to the Linux NFSv4 client.
Commits 3ad1e1c1ce20 and 57014f21e754 added a Lookup+Open
RPC for NFSv4.1/4.2, which can reduce the RPC count by
10-20% for some loads. This has now received a fair amount
of testing, so I think it is ok to enable it.
Note that the Lookup+Open RPC is only used when the
"oneopenown" mount option is specified. As such, this
change won't affect most NFSv4.1/4.2 mounts.
When a NFSv4.1/4.2 session to the NFS server (not a pNFS DS) is
replaced, the old session should always be marked defunct by
nfsess_defunct being set non-zero.
However, the hang reported by the PR suggests that this might
be the case.
This patch adds a printf() to indicate this has somehow happened.
PR: 260011
MFC after: 2 weeks
The NFSERR_BADSESSION reply from a NFSv4.1/4.2 server
is handled by newnfs_request(). It should not be handled
separately after newnfs_request() has returned.
These two cases were spotted during code inspection.
One of them should only redo what newnfs_request() already
did by the same "nfscl" thread. The other might have
resulted in recovery being done twice, but the code is
only used for "pnfs" mounts, so that would be rare.
Also, since NFSERR_BADSESSION should only be replied by
a server after the server reboots, this would be extremely
rare.
MFC after: 2 weeks
* If during FUSE_CREATE, FUSE_MKDIR, etc the server returns the same
inode number for the new file as for its parent directory, reject it.
Previously this would triggers a recurse-on-non-recursive lock panic.
* If during FUSE_LINK the server returns a different inode number for
the new name as for the old one, reject it. Obviously, that can't be
a hard link.
* If during FUSE_LOOKUP the server returns the same inode number for the
new file as for its parent directory, reject it. Nothing good can
come of this.
PR: 263662
Reported by: Robert Morris <rtm@lcs.mit.edu>
MFC after: 2 weeks
Reviewed by: pfg
Differential Revision: https://reviews.freebsd.org/D35128
Robert Morris reported that, if a client sends an absurdly
large Owner/OwnerGroup string, the kernel malloc() for the
large size string can block forever.
This patch adds a sanity limit for Owner/OwnerGroup string
length. Since the RFCs do not specify any limit and FreeBSD
can handle a group name greater than 1Kbyte, the limit is
set at a generous 10Kbytes.
Reported by: rtm@lcs.mit.edu
PR: 260546
MFC after: 2 weeks
When the MDS of a pNFS service receives an Open/Create
and the file already exists, it must do a Setattr of
size == 0. Without this patch, this was eroneously
done via a VOP_SETAATR() call, which would set the
length of the MDS file to 0 (which is already is,
since all data lives on the DSs).
This patch fixes the problem by doing a nfsvno_setattr()
instead of VOP_SETATTR(), which knows to do a proxied
Setattr on the DSs.
For a non-pNFS server, the change has no effect, since
nfsvno_setattr() only does a VOP_SETATTR() for that case.
This was found during a recent IETF NFSv4 testing event.
MFC after: 2 weeks
When the NFSv4.1/4.2 client is doing a pnfs mount to
mirrored DS(s), asynchronous threads are used to do the
RPCs against the DS(s) concurrently. If a DS is slow
to reply, it is possible for the "cred" to be free'd
before the asynchronous thread is done with it, causing
a panic/crash.
This patch fixes the problem by acquiring a refcount on
the "cred" while it is being used by the asynchronous thread
for a DS RPC. This bug was found during a recent IETF
NFSv4 testing event.
This bug only affects "pnfs" mounts to mirrored pNFS
servers.
MFC after: 2 weeks
Without this patch the NFSv4.1/4.2 server erroneously
always frees session slot zero for callbacks. This only
affects 4.1/4.2 mounts if the server has delegations
enabled or is a pNFS configuration. Even for those
cases, the effect is mainly to only use slot 0 for
callbacks, serializing all of them. There is a slight
chance that callbacks will fail if the client performs
them in a different order than received on the TCP
connection.
If this bug affects your server, you will see console
messages like:
newnfs_request: Bad session slot
This patch fixes the problem. Found during a recent
IETF NFSv4 testing event.
PR: 263728
MFC after: 2 weeks
Robert Morris reported that, for the case of SecinfoNoname
with the Parent option, providing a non-directory could
cause a crash.
This patch adds a sanity check for v_type == VDIR for
this case, to avoid the crash.
Reported by: rtm@lcs.mit.edu
PR: 260300
MFC after: 2 weeks
For IO_APPEND VOP_WRITE()s, the code first does a
Getattr RPC to acquire the file's size, before it
can do the Write RPC.
Although NFS does not have an append write operation,
an NFSv4 compound can use a Verify operation to check
that the client's notion of the file's size is
correct, followed by the Write operation.
This patch modifies the NFSv4 client to use an Appendwrite
RPC, which does a Verify to check the file's size before
doing the Write. This avoids the need for a Getattr RPC
to preceed this RPC and reduces the RPC count by half for
IO_APPEND writes, so long as the client knows the file's
size.
The nfsd structure was moved from the stack to be malloc()'d,
since the kernel stack limit was being exceeded.
While here, fix the types of a few variables, although
there should not be any semantics change caused by these
type changes.
The daemon can specify fsname=XXX in its mount options. If so, the file
system should report f_mntfromname as XXX during statfs. This will show
up in the output of commands like mount and df.
Submitted by: Ali Abdallah <ali.abdallah@suse.com>
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D35090
Prior to fuse protocol version 7.9, the fuse_entry_out structure had a
smaller size. But fuse_vnop_create did not take that into account when
working with servers that use older protocols. The bug does not matter
for servers which don't use file handles or open flags (the only fields
affected).
PR: 263625
Submitted by: Ali Abdallah <ali.abdallah@suse.com>
MFC after: 2 weeks
During a FUSE_WRITE, the kernel requests the server to write a certain
amount of data, and the server responds with the amount that it actually
did write. It is obviously an error for the server to write more than
it was provided, and we always treated it as such, but there were two
problems:
* If the server responded with a huge amount, greater than INT_MAX, it
would trigger an integer overflow which would cause a panic.
* When extending the file, we wrongly set the file's size before
validing the amount written.
PR: 263263
Reported by: Robert Morris <rtm@lcs.mit.edu>
MFC after: 2 weeks
Sponsored by: Axcient
Reviewed by: emaste
Differential Revision: https://reviews.freebsd.org/D34955
The "void *stuff" (also called fstuff and dstuff) argument
was used by the Mac OSX port. For FreeBSD, this argument
is always NULL, so remove it to clean up the code.
This commit gets rid of "stuff" for assorted functions
local to nfs_clrpcops.c.
Future commits will do the same for other functions.
Formerly fusefs would pass up the stack any error value returned by the
fuse server. However, some values aren't valid for userland, but have
special meanings within the kernel. One of these, EJUSTRETURN, could
cause a kernel page fault if the server returned it in response to
FUSE_LOOKUP. Fix by validating all errors returned by the server.
Also, fix a data lifetime bug in the FUSE_DESTROY test.
PR: 263220
Reported by: Robert Morris <rtm@lcs.mit.edu>
MFC after: 3 weeks
Sponsored by: Axcient
Reviewed by: emaste
Differential Revision: https://reviews.freebsd.org/D34931
The "void *stuff" (also called fstuff and dstuff) argument
was used by the Mac OSX port. For FreeBSD, this argument
is always NULL, so remove it to clean up the code.
This commit gets rid of "stuff" for nfscl_nget().
Future commits will do the same for other functions.
The "void *stuff" (also called fstuff and dstuff) argument
was used by the Mac OSX port. For FreeBSD, this argument
is always NULL, so remove it to clean up the code.
This commit gets rid of "stuff" for nfscl_loadattrcache().
Future commits will do the same for other functions.
The "void *stuff" (also called fstuff and dstuff) argument
was used by the Mac OSX port. For FreeBSD, this argument
is always NULL, so remove it to clean up the code.
This commit gets rid of "stuff" for nfscl_request().
Future commits will do the same for other functions.
The "void *stuff" (also called fstuff and dstuff) argument
was used by the Mac OSX port. For FreeBSD, this argument
is always NULL, so remove it to clean up the code.
This commit gets rid of "stuff" for nfscl_postop_attr().
Future commits will do the same for other functions.
For IO_APPEND VOP_WRITE()s, the code first does a
Getattr RPC to acquire the file's size, before it
can do the Write RPC.
Although NFS does not have an append write operation,
an NFSv4 compound can use a Verify operation to check
that the client's notion of the file's size is
correct, followed by the Write operation.
This patch modifies nfscl_wcc_data() to optionally
acquire the file's size, for use with an AppendWrite.
Although the "stuff" arguments are always NULL
(these were used for the Mac OSX port and should be
cleared out someday), make the argument to
nfscl_wcc_data() explicitly NULL for clarity.
This patch does not cause any semantics change until
the AppendWrite is added in a future commit.
* We never send FUSE_LOOKUP for the root inode, since its inode number
is hard-coded to 1. Therefore, we should not send FUSE_FORGET for it,
lest the server see its lookup count fall below 0.
* During VOP_RECLAIM, if we are reclaiming the root inode, we must clear
the file system's vroot pointer. Otherwise it will be left pointing
at a reclaimed vnode, which will cause future VOP_LOOKUP operations to
fail. Previously we only cleared that pointer during VFS_UMOUNT. I
don't know of any real-world way to trigger this bug.
MFC after: 2 weeks
Reviewed by: pfg
Differential Revision: https://reviews.freebsd.org/D34753
For IO_APPEND VOP_WRITE()s, the code first does a
Getattr RPC to acquire the file's size, before it
can do the Write RPC.
Although NFS does not have an append write operation,
an NFSv4 compound can use a Verify operation to check
that the client's notion of the file's size is
correct before doing the Write operation.
This patch prepares the NFSv4 client for such an
RPC, which will be added in a future commit.
This patch does not cause any semantics change.
Commit 867c27c23a5c modified the NFS client so that
it did IO_APPEND writes directly to the NFS server
bypassing the buffer cache, via a call to
nfs_directio_write(). Unfortunately, this (very old)
function assumed that the uio iov was for user space
addresses. As such, a IO_APPEND VOP_WRITE() that
was for system space, such as ktrace(1) does, would
write bogus data.
This patch fixes nfs_directio_write() so that it
handles kernel space uio iovs.
Reported by: bz
Tested by: bz
MFC after: 2 weeks
A NFSv4.1/4.2 pNFS mount needs to do a
separate Open+LayoutGet RPC, so do not do
a Lookup+Open RPC for these mounts.
The Lookup+Open RPCs are still disabled,
until further testing is done, so this patch
has no effect at this time.
Use of the Lookup+Open RPC is currently disabled,
due to a problem detected during testing. This
patch fixes this problem. The problem was that
nfscl_postop_attr() does not parse the attributes
if nd_repstat != 0. It also would parse the
return status for the operation, where the
Lookup+Open code had already parsed it.
The first change in the patch does not make any
semantics change, but makes the code identical
to what is done later in the function, so that
it is apparent that the semantics should be the
same in both places.
Lookup+Open remains disabled while further
testing is being done, so this patch has no
effect at this time.
The Fsinfo RPC is exempt from the check for
Kerberized NFS being required, as recommended
by RFC2623. However, there is no reason to
exempt Fsinfo from the requirement to use TLS.
This patch fixes the code so that the exemption
only applies to Kerberized NFS and not
NFS-over-TLS.
This only affects NFS-over-TLS for an NFSv3
mount when it is required, but the client does
not do so.
MFC after: 1 month
ler@, markj@ reported a use after free in nfscl_cleanupkext().
They also provided two possible causes:
- In nfscl_cleanup_common(), "own" is the owner string
owp->nfsow_owner. If we free that particular
owner structure, than in subsequent comparisons
"own" will point to freed memory.
- nfscl_cleanup_common() can free more than one owner, so the use
of LIST_FOREACH_SAFE() in nfscl_cleanupkext() is not sufficient.
I also believe there is a 3rd:
- If nfscl_freeopenowner() or nfscl_freelockowner() is called
without the NFSCLSTATE mutex held, this could race with
nfscl_cleanupkext().
This could happen when the exclusive lock is held
on the client, such as when delegations are being returned
or when recovering from NFSERR_EXPIRED.
This patch fixes them as follows:
1 - Copy the owner string to a local variable before the
nfscl_cleanup_common() call.
2 - Modify nfscl_cleanup_common() so that it will never free more
than the first matching element. Normally there should only
be one element in each list with a matching open/lock owner
anyhow (but there might be a bug that results in a duplicate).
This should guarantee that the FOREACH_SAFE loops in
nfscl_cleanupkext() are adequate.
3 - Acquire the NFSCLSTATE mutex in nfscl_freeopenowner()
and nfscl_freelockowner(), if it is not already held.
This serializes all of these calls with the ones done in
nfscl_cleanup_common().
Reported by: ler
Reviewed by: markj
Tested by: cy
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D34334
When renaming a directory into a different parent directory, invalidate
the cached attributes of the new parent. Otherwise, stat will show the
wrong st_nlink value.
MFC after: 1 week
Reviewed by: ngie
Differential Revision: https://reviews.freebsd.org/D34336
This reverts commit dd08b84e35b6fdee0df5fd0e0533cd361dec71cb.
cy@ reported a problem caused by this patch. He will be
testing an alternate patch, but I'm reverting this one.
VOP_GETWRITEMOUNT() is called on the vn_start_write() path without any
vnode locks guaranteed to be held. It's therefore unsafe to blindly
access per-mount and per-vnode data. Instead, follow the approach taken
by nullfs and use the vnode interlock coupled with the hold count to
ensure the mount and the vnode won't be recycled while they are being
accessed.
Reviewed by: kib (earlier version), markj, pho
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D34282
ler@, markj@ reported a use after free in nfscl_cleanupkext().
They also provided two possible causes:
- In nfscl_cleanup_common(), "own" is the owner string
owp->nfsow_owner. If we free that particular
owner structure, than in subsequent comparisons
"own" will point to freed memory.
- nfscl_cleanup_common() can free more than one owner, so the use
of LIST_FOREACH_SAFE() in nfscl_cleanupkext() is not sufficient.
I also believe there is a 3rd:
- If nfscl_freeopenowner() or nfscl_freelockowner() is called
without the NFSCLSTATE mutex held, this could race with
nfscl_cleanupkext().
This could happen when the exclusive lock is held
on the client, such as when delegations are being returned.
This patch fixes them as follows:
1 - Copy the owner string to a local variable before the
nfscl_cleanup_common() call.
2 - Modify nfscl_cleanup_common() to return whether or not a
free was done.
When a free was done, do a goto to restart the loop, instead
of using FOREACH_SAFE, which was not safe in this case.
3 - Acquire the NFSCLSTATE mutex in nfscl_freeopenowner()
and nfscl_freelockowner(), if it not already held.
This serializes all of these calls with the ones done in
nfscl_cleanup_common().
Reported by: ler
Reviewed by: markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D34334
HugeSectors * BytesPerSec should be computed before converting
HugeSectors to a DEV_BSIZE-based count.
Fixes: ba2c98389b78 ("msdosfs: sanity check sector count from BPB")
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34264
The ESXi NFSv4.1 client bogusly sends the wrong value
for the csa_sequence argument for a Create_session operation.
RFC8881 requires this value to be the same as the sequence
reply from the ExchangeID operation most recently done for
the client ID.
Without this patch, the server replies NFSERR_STALECLIENTID,
which is the correct response for an NFSv4.0 SetClientIDConfirm
but is not the correct error for NFSv4.1/4.2, which is
specified as NFSERR_SEQMISORDERED in RFC8881.
This patch fixes this.
This change does not fix the issue reported in the PR, where
the ESXi client loops, attempting ExchangeID/Create_session
repeatedly.
Reported by: asomers
Tested by: asomers
PR: 261291
MFC after: 1 week