with the reasoning that the flags did not worked properly, and were not
shipped in a release.
O_RESOLVE_BENEATH is kept as useful.
Reviewed by: markj
Tested by: arichardson, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D28907
In particular, replace a note that reload through vget() is obsoleted,
with explanation why this code is required.
Reviewed by: chs, mckusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
The current list is limited to the cases where UFS needs to handle
vput(dvp) specially. Which means VOP_CREATE(), VOP_MKDIR(), VOP_MKNOD(),
VOP_LINK(), and VOP_SYMLINK().
Reviewed by: chs, mkcusick
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
This lets callers avoid atomic ops by initializing the count to required
value from the get go.
While here add falloc_abort to backpedal from this without having to
fdrop.
Both FreeBSD and Linux mkdir -p walk the tree up ignoring any EEXIST on
the way and both are used a lot when building respective kernels.
This poses a problem as spurious locking avoidably interferes with
concurrent operations like getdirentries on affected directories.
Work around the problem by adding FAILIFEXISTS flag. In case of lockless
lookup this manages to avoid any work to begin with, there is no speed
up for the locked case but perhaps this can be augmented later on.
For simplicity the only supported semantics are as used by mkdir.
Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D27789
No functional change intended.
Tracking these structures separately for each proc enables future work to
correctly emulate clone(2) in linux(4).
__FreeBSD_version is bumped (to 1300130) for consumption by, e.g., lsof.
Reviewed by: kib
Discussed with: markj, mjg
Differential Revision: https://reviews.freebsd.org/D27037
Restart syscalls and some sync operations when filesystem indicated
ERELOOKUP condition, mostly for VOPs operating on metdata. In
particular, lookup results cached in the inode/v_data is no longer
valid and needs recalculating. Right now this should be nop.
Assert that ERELOOKUP is catched everywhere and not returned to
userspace, by asserting that td_errno != ERELOOKUP on syscall return
path.
In collaboration with: pho
Reviewed by: mckusick (previous version), markj
Tested by: markj (syzkaller), pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D26136
mkdir -p /foo/bar/baz will mkdir each path component and ignore EEXIST.
The NOCACHE lookup will make the namecache unnecessarily evict the existing entry,
and then fallback to the fs lookup routine eventually leading namei to return an
error as the directory is already there.
For invocations like mkdir -p /usr/obj/usr/src/sys/GENERIC/modules this triggers
fallbacks to the slowpath for concurrently executing lookups.
Tested by: pho
Discussed with: kib
It is like O_BENEATH, but disables to walk out of the subtree rooted
in the starting directory. O_BENEATH does not care if path walks out
if it returned.
Requested by: Dan Gohman <dev@sunfishcode.online>
PR: 248335
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D25886
the helper to convert AT_ flags for *at() syscalls to namei flags.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D25886
Stop abusing internal namei flag NI_LCF_STRICTRELATIVE as indicator of
cap-restricted lookup. Add designated returned flag NIRES_STRICTREL
to inform kern_openat() that lookup was restricted.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D25886
The pointer to vnode is already stored into f_vnode, so f_data can be
reused. Fix all found users of f_data for DTYPE_VNODE.
Provide finit_vnode() helper to initialize file of DTYPE_VNODE type.
Reviewed by: markj (previous version)
Discussed with: freqlabs (openzfs chunk)
Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D26346
This removes a lot of special casing from the VFS layer.
Reviewed by: kib (previous version)
Tested by: pho (previous version)
Differential Revision: https://reviews.freebsd.org/D25612
The current scheme of calling VOP_GETATTR adds avoidable overhead.
An example with tmpfs doing fstat (ops/s):
before: 7488958
after: 7913833
Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D25910
The kernel would unlock already unlocked mutex if the buffer got filled up
before the mount list ended.
Reported by: pho
Fixes: r363069 ("vfs: depessimize getfsstat when only the count is requested")
For software like PostgreSQL and SQLite that sometimes reads sequentially
while also writing sequentially some distance behind with interleaved
syscalls on the same fd, performance is better on UFS if we do
sequential access heuristics separately for reads and writes.
Patch originally by Andrew Gierth in 2008, updated and proposed by me with
his permission.
Reviewed by: mjg, kib, tmunro
Approved by: mjg (mentor)
Obtained from: Andrew Gierth <andrew@tao11.riddles.org.uk>
Differential Revision: https://reviews.freebsd.org/D25024
During buildkernel there are very frequent calls to priv_check and they
all are for PRIV_VFS_GENERATION (coming from stat/fstat).
This results in branching on several potential privileges checking if
perhaps that's the one which has to be evaluated.
Instead of the kitchen-sink approach provide a way to have commonly used
privs directly evaluated.
This eliminates a branch from its consumers trading it for an extra call
if ktrace is enabled for curthread. Given that this is almost never true,
the tradeoff is worth it.
Otherwise we risk running into use-after-free.
In particular this codepath ends up dropping all protection before
suspending writes:
ufs_quotactl -> quotaoff_inchange -> vfs_write_suspend_umnt
Reported by: pho
This opens the door for other descriptor types to implement
posix_fallocate(2) as needed.
Reviewed by: kib, bcr (manpages)
Differential Revision: https://reviews.freebsd.org/D23042
The previous behavior of leaving VI_OWEINACT vnodes on the active list without
a hold count is eliminated. Hold count is kept and inactive processing gets
explicitly deferred by setting the VI_DEFINACT flag. The syncer is then
responsible for vdrop.
Reviewed by: kib (previous version)
Tested by: pho (in a larger patch, previous version)
Differential Revision: https://reviews.freebsd.org/D23036
Filesystems which want to use it in limited capacity can employ the
VOP_UNLOCK_FLAGS macro.
Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D21427
via sys_sync(2). Minor cleanup, no functional changes.
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19366
The current vnode layout is not smp-friendly by having frequently read data
avoidably sharing cachelines with very frequently modified fields. In
particular v_iflag inspected for VI_DOOMED can be found in the same line with
v_usecount. Instead make it available in the same cacheline as the v_op, v_data
and v_type which all get read all the time.
v_type is avoidably 4 bytes while the necessary data will easily fit in 1.
Shrinking it frees up 3 bytes, 2 of which get used here to introduce a new
flag field with a new value: VIRF_DOOMED.
Reviewed by: kib, jeff
Differential Revision: https://reviews.freebsd.org/D22715
Currently si_usecount is effectively a sum of usecounts from all associated
vnodes. This is maintained by special-casing for VCHR every time usecount is
modified. Apart from complicating the code a little bit, it has a scalability
impact since it forces a read from a cacheline shared with said count.
There are no consumers of the feature in the ports tree. In head there are only
2: revoke and devfs_close. Both can get away with a weaker requirement than the
exact usecount, namely just the count of active vnodes. Changing the meaning to
the latter means we only need to modify it on 0<->1 transitions, avoiding the
check plenty of times (and entirely in something like vrefact).
Reviewed by: kib, jeff
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22202
The current linux man page and testing done on a fairly recent linux5.n
kernel have identified two changes to the semantics of the linux
copy_file_range system call.
Since the copy_file_range(2) system call is intended to be linux compatible
and is only currently in head/current and not used by any commands,
it seems appropriate to update the system call to be compatible with
the current linux one.
The first of these semantic changes was changed to be compatible with
linux5.n by r354564.
For the second semantic change, the old linux man page stated that, if
infd and outfd referred to the same file, EBADF should be returned.
Now, the semantics is to allow infd and outfd to refer to the same file
so long as the byte ranges defined by the input file offset, output file offset
and len does not overlap. If the byte ranges do overlap, EINVAL should be
returned.
This patch modifies copy_file_range(2) to be linux5.n compatible for this
semantic change.
The struct is already populated on each mount (and remount). Fields are either
constant or not used by filesystem in the first place.
Some infrequently used functions use it to avoid having to allocate a new buffer
and are left alone.
The current code results in an avoidable copying single-threaded and significant
cache line bouncing multithreaded
While here deduplicate initial filling of the struct.
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21317
This patch adds support to the kernel for a Linux compatible
copy_file_range(2) syscall and the related VOP_COPY_FILE_RANGE(9).
This syscall/VOP can be used by the NFSv4.2 client to implement the
Copy operation against an NFSv4.2 server to do file copies locally on
the server.
The vn_generic_copy_file_range() function in this patch can be used
by the NFSv4.2 server to implement the Copy operation.
Fuse may also me able to use the VOP_COPY_FILE_RANGE() method.
vn_generic_copy_file_range() attempts to maintain holes in the output
file in the range to be copied, but may fail to do so if the input and
output files are on different file systems with different _PC_MIN_HOLE_SIZE
values.
Separate commits will be done for the generated syscall files and userland
changes. A commit for a compat32 syscall will be done later.
Reviewed by: kib, asomers (plus comments by brooks, jilles)
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D20584
descriptor, not the file descriptor. The file descriptor is used only for
verification so do not expect any additional capabilities on it.
Reported by: antoine
Tested by: antoine
Discussed with: kib, emaste, bapt
Sponsored by: Fudo Security
the file associated with the given file descriptor.
Reviewed by: kib, asomers
Reviewed by: cem, jilles, brooks (they reviewed previous version)
Discussed with: pjd, and many others
Differential Revision: https://reviews.freebsd.org/D14567
When renameat(2) is used with:
- absolute path for to;
- tofd not set to AT_FDCWD;
- the target exists
kern_renameat() requires CAP_UNLINK capability on tofd, but
corresponding namei ni_filecap is not initialized at all because the
lookup is absolute. As result, the check was done against empty filecap
and syscall fails erronously.
Fix it by creating a return flags namei member and reporting if the
lookup was absolute, then do not touch to.ni_filecaps at all.
PR: 222258
Reviewed by: jilles, ngie
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
X-MFC-note: KBI breakage
Differential revision: https://reviews.freebsd.org/D19096
There is no reason for it to behave differently from openat(fd, NULL).
Also the handling did not worked because the substituted path was from
the system address space, causing EFAULT.
Submitted by: Jack Halford <jack@gandi.net>
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D18501
When we detected that the vnode is not symlink, return immediately.
This moves the readlink code out of else branch and unindents it.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week