This will be used to break a deadlock in ZFS between the per-mountpoint
teardown lock and page busy locks. In particular, when purging data
from the page cache during dataset rollback, we want to avoid blocking
on the busy state of invalid pages since the busying thread may be
blocked on the teardown lock in zfs_getpages().
Add a helper, vn_pages_remove_valid(), for use by filesystems. Bump
__FreeBSD_version so that the OpenZFS port can make use of the new
helper.
PR: 258208
Reviewed by: avg, kib, sef
Tested by: pho (part of a larger patch)
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32931
When the NFSv4.2 server does a VOP_ALLOCATE(), it needs
the operation to be done for the RPC's credential and not
td_ucred. It also needs the writing to be done synchronously.
This patch adds "ioflag" and "cred" arguments to VOP_ALLOCATE()
and modifies vop_stdallocate() to use these arguments.
The VOP_ALLOCATE.9 man page will be patched separately.
Reviewed by: khng, kib
Differential Revision: https://reviews.freebsd.org/D32865
As with FIFOs, a path descriptor for a unix socket cannot be used with
kevent().
In principle connectat(2) and bindat(2) could be modified to support an
AT_EMPTY_PATH-like mode which operates on the socket referenced by an
O_PATH fd referencing a unix socket. That would eliminate the path
length limit imposed by sockaddr_un.
Update O_PATH tests.
Reviewed by: kib
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31970
Although it is not specified in the RFCs, the concept that
the NFSv4 server should reply to an RPC request within a
reasonable time is accepted practice within the NFSv4 community.
Without this patch, the NFSv4.2 server attempts to reply to
a Copy operation within 1second by limiting the copy to
vfs.nfs.maxcopyrange bytes (default 10Mbytes). This is crude at
best, given the large variation in I/O subsystem performance.
This patch adds a kernel only flag COPY_FILE_RANGE_TIMEO1SEC
that the NFSv4.2 can specify, which tells VOP_COPY_FILE_RANGE()
to return after approximately 1 second with a partial result and
implements this in vn_generic_copy_file_range(), used by
vop_stdcopyfilerange().
Modifying the NFSv4.2 server to set this flag will be done in
a separate patch. Also under consideration is exposing the
COPY_FILE_RANGE_TIMEO1SEC to userland for use on the FreeBSD
copy_file_range(2) syscall.
MFC after: 2 weeks
Reviewed by: khng
Differential Revision: https://reviews.freebsd.org/D31829
This changes vn_deallocate() to match the behavior of vn_rdwr() when
picking which ucred to use. That is, vn_deallocate() uses file_cred for
making VOP call if it is non-NULL, or use active_cred otherwise.
Sponsored by: The FreeBSD Foundation
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D31712
Yield at the end of each loop iteration if there are remaining works as
indicated by the value of *len updated by VOP_DEALLOCATE. Without this,
when calling vop_stddeallocate to zero a large region, the
implementation only zerofills a relatively small chunk and returns.
Sponsored by: The FreeBSD Foundation
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D31705
The addition of ioflag allows callers passing
IO_SYNC/IO_DATASYNC/IO_DIRECT down to the file system implementation.
The vop_stddeallocate fallback implementation is updated to pass the
ioflag to the file system implementation. vn_deallocate(9) internally is
also changed to pass ioflag to the VOP_DEALLOCATE call.
Sponsored by: The FreeBSD Foundation
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D31500
Converted vn_write to use this helper.
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D31513
This includes a style fix around ioflag checking as well.
Sponsored by: The FreeBSD Foundation
Reviewed by: kib, bcr
Differential Revision: https://reviews.freebsd.org/D31505
VOP_LOOKUP() is called with cn_flags bits ISLASTCN and ISOPEN
to indicate that the lookup is for the last component of a pathname
when doing open.
If the cn_flags also indicates if the open is for Reading, Writing or Both,
the NFSv4 client can do an NFSv4 Open operation in the same compound
RPC as Lookup, often avoiding the additional Open RPC now done when
VOP_OPEN() is called.
This patch defines two new cn_flags bits called OPENREAD and OPENWRITE
and sets these in open2nameif() based on FREAD, FWRITE flag bits.
This will allow a subsequent patch to the NFSv4 client to do the Open
operation in the same RPC as Lookup.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D31431
fspacectl(2) is a system call to provide space management support to
userspace applications. VOP_DEALLOCATE(9) is a VOP call to perform the
deallocation. vn_deallocate(9) is a public KPI for kmods' use.
The purpose of proposing a new system call, a KPI and a VOP call is to
allow bhyve or other hypervisor monitors to emulate the behavior of SCSI
UNMAP/NVMe DEALLOCATE on a plain file.
fspacectl(2) comprises of cmd and flags parameters to specify the
space management operation to be performed. Currently cmd has to be
SPACECTL_DEALLOC, and flags has to be 0.
fo_fspacectl is added to fileops.
VOP_DEALLOCATE(9) is added as a new VOP call. A trivial implementation
of VOP_DEALLOCATE(9) is provided.
Sponsored by: The FreeBSD Foundation
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D28347
vn_bmap_seekhole_locked() is factored out version of vn_bmap_seekhole().
This variant requires shared vnode lock being held around the call.
Sponsored by: The FreeBSD Foundation
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D31404
and remove repetetive code that calculates vnode locking type for write.
Reviewed by: khng, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31405
ca1ce50b2b5ef11d ("vfs: add more safety against concurrent forced
unmount to vn_write") has a side effect of only checking MNT_SYNCHRONOUS
if O_FSYNC is set.
Reviewed By: mjg
Differential Revision: https://reviews.freebsd.org/D30610
The previous:
if ((uoff_t)uio->uio_offset + uio->uio_resid > lim)
signal(....);
was replaced with:
if ((uoff_t)uio->uio_offset + uio->uio_resid < lim)
return;
signal(....);
Making (uoff_t)uio->uio_offset + uio->uio_resid == lim trip over the
limit, when it did not previously.
Unbreaks running 13.0 buildworld.
Instead of trying to partially ifdef out ktrace handling, define the
missing identifier to 0. Without this fix lack of ktrace in the kernel
also means there is no SIGXFSZ signal delivery.
When enabled, writes to ktrace.out that exceed the max file size limit
cause SIGXFSZ as it should be, but note that the limit is taken from
the process that initiated ktrace. When disabled, write is blocked,
but signal is not send.
Note that in either case ktrace for the affected process is stopped.
Requested and reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30257
and use the mark to stop applying file size limits on the write of
the accounting record. This allows to remove hack to clear process
limits in acct_process(), and avoids the bug with the clearing being
ineffective because limits are also cached in the thread structure.
Reported and reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30257
This combination does not make sense, and cannot be satisfied by lookup.
In particular, lookup cannot supply dvp, it only can directly return vp.
Reported and reviewed by: markj using syzkaller
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
It reopens the passed file descriptor, checking the file backing vnode'
current access rights against open mode. In particular, this flag allows
to convert file descriptor opened with O_PATH, into operable file
descriptor, assuming permissions allow that.
Reviewed by: markj
Tested by: Andrew Walker <awalker@ixsystems.com>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30148
PR#255523 reported that a file copy for a file with a large hole
to EOF on ZFS ran slowly over NFSv4.2.
The problem was that vn_generic_copy_file_range() would
loop around reading the hole's data and then see it is all
0s. It was coded this way since UFS always allocates a data
block near the end of the file, such that a hole to EOF never exists.
This patch modifies vn_generic_copy_file_range() to check for a
ENXIO returned from VOP_IOCTL(..FIOSEEKDATA..) and handle that
case as a hole to EOF. asomers@ confirms that it works for his
ZFS test case.
PR: 255523
Tested by: asomers
Reviewed by: asomers
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D30076
Filter on fifos is real filter for the object, and not a filesystem
events filter like EVFILT_VNODE.
Reported by: markj using syzkaller
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
If VOP_ADD_WRITECOUNT() or adv locking failed, so VOP_CLOSE() needs to
be called, we cannot use fp fo_close() when there is no fp. This occurs
when e.g. kernel code directly calls vn_open() instead of the open(2)
syscall.
In this case, VOP_CLOSE() can be called directly, after possible lock
upgrade.
Reported by: nvass@gmx.com
PR: 255119
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D29830
if VREAD access is checked as allowed during open
Requested by: wulf
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D29323
When O_NOFOLLOW is specified, namei() returns the symlink itself. In
this case, open(O_PATH) should be allowed, to denote the location of symlink
itself.
Prevent O_EXEC in this case, execve(2) code is not ready to try to execute
symlinks.
Reported by: wulf
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D29323
with the reasoning that the flags did not worked properly, and were not
shipped in a release.
O_RESOLVE_BENEATH is kept as useful.
Reviewed by: markj
Tested by: arichardson, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D28907
r366302 broke copy_file_range(2) for small values of
input file offset and len.
It was possible for rem to be greater than len and then
"len - rem" was a large value, since both variables are
unsigned.
Reported by: koobs, Pablo <pablogsal gmail com> (Python)
Reviewed by: asomers, koobs
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D28981
or EX_SHLOCK. Do it by setting a vnode iflag indicating that the locking
exclusive open is in progress, and not allowing F_LOCK request to make
a progress until the first open finishes.
Requested by: mckusick
Reviewed by: markj, mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D28697
If uio_offset is past end of the object size, calculated resid is negative.
Delegate handling this case to the locked read, as any other non-trivial
situation.
PR: 253158
Reported by: Harald Schmalzbauer <bugzilla.freebsd@omnilan.de>
Tested by: cy
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
The current list is limited to the cases where UFS needs to handle
vput(dvp) specially. Which means VOP_CREATE(), VOP_MKDIR(), VOP_MKNOD(),
VOP_LINK(), and VOP_SYMLINK().
Reviewed by: chs, mkcusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Most future operations on the returned file descriptor will fail
anyway, and application should be ready to handle that failures. Not
forcing it to understand the transient failure mode on open, which is
implementation-specific, should make us less special without loss of
reporting of errors.
Suggested by: chs
Reviewed by: chs, mckusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
POSIX O_DSYNC means that writes include an implicit fdatasync(2), just
as O_SYNC implies fsync(2).
VOP_WRITE() functions that understand the new IO_DATASYNC flag can act
accordingly, but we'll still pass down IO_SYNC so that file systems that
don't understand it will continue to provide the stronger O_SYNC
behaviour.
Flag also applies to fcntl(2).
Reviewed by: kib, delphij
Differential Revision: https://reviews.freebsd.org/D25090
vn_rdwr() must lock the entire file range for IO_APPEND
just like vn_io_fault() does for O_APPEND.
Reviewed by: kib, imp, mckusick
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D28008
There is no reason why vp->v_object cannot be NULL. If it is, it's
fine, handle it by delegating to VOP_READ().
Tested by: pho
Reviewed by: markj, mjg
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D27327