namei de facto expects that the nameidata object is properly initialized,
but at the same time it mixes consumer-passable and internal flags, while
tolerating this part by explicitly clearing some of them.
Tighten the interface instead.
While here, renumber the flags and denote the gap between the 2 variants.
Try to piggyback the renumber on the just-bumped __FreeBSD_version.
This means there is no expectation lookup will purge the terminal entry,
which simplifies lockless lookup.
Tested by: pho
Sponsored by: The FreeBSD Foundation
Almost all consumers use the NDF_ONLY_PNBUF macro, making them avoidably branch
a lot in the NDFREE routine. Also note most of them should not need to call
any cleanup anyway as they don't request HASBUF.
Provides full scalability as long as all visited filesystems support the
lookup and terminal vnodes are different.
Inner workings are explained in the comment above cache_fplookup.
Capabilities and fd-relative lookups are not supported and will result in
immediate fallback to regular code.
Symlinks, ".." in the path, mount points without support for lockless lookup
and mismatched counters will result in an attempt to get a reference to the
directory vnode and continue in regular lookup. If this fails, the entire
operation is aborted and regular lookup starts from scratch. However, care is
taken that data is not copied again from userspace.
Sample benchmark:
incremental -j 104 bzImage on tmpfs:
before: 142.96s user 1025.63s system 4924% cpu 23.731 total
after: 147.36s user 313.40s system 3216% cpu 14.326 total
Sample microbenchmark: access calls to separate files in /tmpfs, 104 workers, ops/s:
before: 2165816
after: 151216530
Reviewed by: kib
Tested by: pho (in a patchset)
Differential Revision: https://reviews.freebsd.org/D25578
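The "mismatched counters" fallback mentioned above can be pictured with a
generic sequence-counter check. This is a minimal userspace sketch, not the
code in cache_fplookup: struct node_model and the node_* helpers are
hypothetical names, and C11 atomics stand in for the kernel primitives.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model: even counter = stable, odd = modification in progress. */
struct node_model {
    atomic_uint seq;
};

static void
node_write_begin(struct node_model *n)
{
    atomic_fetch_add(&n->seq, 1);   /* counter becomes odd */
}

static void
node_write_end(struct node_model *n)
{
    atomic_fetch_add(&n->seq, 1);   /* counter becomes even again */
}

/* Snapshot the counter; returns false if a writer is active. */
static bool
node_read_begin(struct node_model *n, unsigned *snap)
{
    *snap = atomic_load(&n->seq);
    return ((*snap & 1) == 0);
}

/* True if nothing changed since the snapshot; false means "fall back". */
static bool
node_read_consistent(struct node_model *n, unsigned snap)
{
    return (atomic_load(&n->seq) == snap);
}
```

A reader that sees node_read_consistent() return false would, in this model,
abandon the lockless path and continue with the regular locked lookup.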
They are spurious since introduction of struct pwd, which provides them
implicitly.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23885
The new structure is copy-on-write. With the assumption that path lookups are
significantly more frequent than chdirs and chrooting this is a win.
This provides stable root and jail root vnodes without the need to reference
them on lookup, which in turn means less work on globally shared structures.
Note this also happens to fix a bug where jail vnode was never referenced,
meaning subsequent access on lookup could run into use-after-free.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23884
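The copy-on-write idea can be sketched in userspace as follows. This is an
illustration under stated assumptions, not the kernel structure: struct
pwd_model, struct proc_model and the pwd_* helpers are hypothetical, vnodes
are modeled as plain integers, and all synchronization is omitted.

```c
#include <stdlib.h>

/* Hypothetical immutable snapshot of current, root and jail directories. */
struct pwd_model {
    unsigned refs;
    int cdir;
    int rdir;
    int jdir;
};

struct proc_model {
    struct pwd_model *pwd;
};

static int
pwd_init(struct proc_model *p, int cdir, int rdir, int jdir)
{
    struct pwd_model *n = malloc(sizeof(*n));

    if (n == NULL)
        return (-1);
    n->refs = 1;
    n->cdir = cdir;
    n->rdir = rdir;
    n->jdir = jdir;
    p->pwd = n;
    return (0);
}

/* Lookup side: take one reference on the whole snapshot. */
static struct pwd_model *
pwd_hold(struct proc_model *p)
{
    p->pwd->refs++;
    return (p->pwd);
}

static void
pwd_drop(struct pwd_model *pwd)
{
    if (--pwd->refs == 0)
        free(pwd);
}

/* chdir side: never modify in place, publish a modified copy instead. */
static int
pwd_chdir(struct proc_model *p, int newcdir)
{
    struct pwd_model *old, *n = malloc(sizeof(*n));

    if (n == NULL)
        return (-1);
    *n = *p->pwd;
    n->refs = 1;
    n->cdir = newcdir;
    old = p->pwd;
    p->pwd = n;
    pwd_drop(old);
    return (0);
}
```

In this model a lookup that holds a snapshot keeps seeing the same cdir, rdir
and jdir even if another thread changes directory in the meantime.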
Duplicating the work was putting an avoidable requirement that the filedesc
lock is held across the entire operation (otherwise by the time audit reads
vnode pointers another thread in the same process can chdir somewhere else,
making audit log things using different vnode than the one which will be
used for actual lookup).
Do the obvious thing and pass down vnodes which will be used.
Most notably, we want to make sure we don't clobber any capabilities-related
errors. This is a regression from r357412 (O_SEARCH) that was picked up by
the capsicum tests.
PR: 243839
Reviewed by: kib (committed form recommended by)
Tested by: lwhsu
Differential Revision: https://reviews.freebsd.org/D23479
O_SEARCH is defined by POSIX [0] to open a directory for searching, skipping
permissions checks on the directory itself after the initial open(). This is
close to the semantics we've historically applied for O_EXEC on a directory,
which is UB according to POSIX. Conveniently, O_SEARCH on a file is also
explicitly undefined behavior according to POSIX, so O_EXEC would be a fine
choice. The spec goes on to state that O_SEARCH and O_EXEC need not be
distinct values, but they're not defined to be the same value.
This was pointed out as an incompatibility with other systems that had made
its way into libarchive, which had assumed that O_EXEC was an alias for
O_SEARCH.
This defines compatibility O_SEARCH/FSEARCH (equivalent to O_EXEC and FEXEC
respectively) and expands our UB for O_EXEC on a directory. O_EXEC on a
directory is checked in vn_open_vnode already, so for completeness we add a
NOEXECCHECK when O_SEARCH has been specified on the top-level fd and do not
re-check that when descending in namei.
[0] https://pubs.opengroup.org/onlinepubs/9699919799/
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23247
There are 2 back-to-back atomics on the vnode, but we can check upfront if one
is sufficient. Similarly we can handle relative lookups where current working
directory == root directory.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23427
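One possible reading of the optimization, as a userspace sketch: when the
starting directory and the root directory are the same vnode, a single atomic
add of 2 replaces two separate increments. struct vnode_model and
ref_start_and_root are hypothetical names, and C11 atomics stand in for the
kernel reference counting routines.

```c
#include <stdatomic.h>

/* Hypothetical vnode with only a use counter. */
struct vnode_model {
    atomic_uint usecount;
};

/*
 * Reference both the starting directory and the root directory.
 * Returns the number of atomic operations performed.
 */
static int
ref_start_and_root(struct vnode_model *start, struct vnode_model *root)
{
    if (start == root) {
        /* Same vnode: one atomic covers both references. */
        atomic_fetch_add(&root->usecount, 2);
        return (1);
    }
    atomic_fetch_add(&start->usecount, 1);
    atomic_fetch_add(&root->usecount, 1);
    return (2);
}
```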
Filesystems which want to use it in limited capacity can employ the
VOP_UNLOCK_FLAGS macro.
Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D21427
The current vnode layout is not smp-friendly by having frequently read data
avoidably sharing cachelines with very frequently modified fields. In
particular v_iflag inspected for VI_DOOMED can be found in the same line with
v_usecount. Instead make it available in the same cacheline as the v_op, v_data
and v_type which all get read all the time.
v_type is avoidably 4 bytes while the necessary data will easily fit in 1.
Shrinking it frees up 3 bytes, 2 of which get used here to introduce a new
flag field with a new value: VIRF_DOOMED.
Reviewed by: kib, jeff
Differential Revision: https://reviews.freebsd.org/D22715
NDFREE() calculates unlock_dvp after ndp->ni_vp is unlocked and zeroed
out. This makes the comparison of ni_dvp with ni_vp always fail.
Move the calculation of unlock_dvp right after unlock_vp, so that the
code sees correct ni_vp value.
Reproduced by
chdir("/usr");
open("/..", O_BENEATH | O_RDONLY);
Reported by: syzkaller
Reviewed by: markj, mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D20304
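The ordering problem in NDFREE() can be illustrated with a small userspace
model. This is a sketch only: struct nd_model, ndfree_buggy and ndfree_fixed
are hypothetical names, vnodes are plain pointers, and an unlock is modeled
as a counter increment rather than a real lock operation.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant nameidata fields. */
struct nd_model {
    void *ni_vp;    /* terminal vnode */
    void *ni_dvp;   /* parent directory vnode */
    int unlocks;    /* number of unlock operations performed */
};

/* Old order: unlock_dvp is computed after ni_vp has been cleared. */
static void
ndfree_buggy(struct nd_model *nd)
{
    bool unlock_vp, unlock_dvp;

    unlock_vp = nd->ni_vp != NULL;
    if (unlock_vp) {
        nd->unlocks++;
        nd->ni_vp = NULL;
    }
    /* ni_vp is NULL by now, so ni_dvp never compares equal to it. */
    unlock_dvp = nd->ni_dvp != NULL && nd->ni_dvp != nd->ni_vp;
    if (unlock_dvp)
        nd->unlocks++;
    nd->ni_dvp = NULL;
}

/* New order: both decisions are made before anything is cleared. */
static void
ndfree_fixed(struct nd_model *nd)
{
    bool unlock_vp, unlock_dvp;

    unlock_vp = nd->ni_vp != NULL;
    unlock_dvp = nd->ni_dvp != NULL && nd->ni_dvp != nd->ni_vp;
    if (unlock_vp) {
        nd->unlocks++;
        nd->ni_vp = NULL;
    }
    if (unlock_dvp)
        nd->unlocks++;
    nd->ni_dvp = NULL;
}
```

When ni_dvp and ni_vp point at the same vnode, the old order unlocks it
twice in this model while the new order unlocks it once.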
When renameat(2) is used with:
- absolute path for to;
- tofd not set to AT_FDCWD;
- the target exists
kern_renameat() requires CAP_UNLINK capability on tofd, but the
corresponding namei ni_filecaps is not initialized at all because the
lookup is absolute. As a result, the check was done against an empty
filecap and the syscall failed erroneously.
Fix it by creating a return flags namei member and reporting if the
lookup was absolute, then do not touch to.ni_filecaps at all.
PR: 222258
Reviewed by: jilles, ngie
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
X-MFC-note: KBI breakage
Differential revision: https://reviews.freebsd.org/D19096
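The shape of the fix can be sketched with a small model. Everything here is
hypothetical and for illustration only: MODEL_RES_ABS, struct lookup_model and
both functions are invented names, and capabilities are reduced to a single
boolean.

```c
#include <stdbool.h>

/* Hypothetical result flag: the lookup started from an absolute path. */
#define MODEL_RES_ABS 0x1u

struct lookup_model {
    unsigned resflags;      /* flags reported back by the lookup */
    bool filecaps_valid;    /* were the fd capabilities filled in? */
    bool has_cap_unlink;    /* meaningful only if filecaps_valid */
};

/* Model lookup: the directory fd is consulted only for relative paths. */
static void
model_lookup(struct lookup_model *l, const char *path, bool fd_has_unlink)
{
    l->resflags = 0;
    l->filecaps_valid = false;
    l->has_cap_unlink = false;
    if (path[0] == '/') {
        l->resflags |= MODEL_RES_ABS;
        return;
    }
    l->filecaps_valid = true;
    l->has_cap_unlink = fd_has_unlink;
}

/* Model check: skip the capability test when the fd was never used. */
static bool
model_unlink_allowed(const struct lookup_model *l)
{
    if ((l->resflags & MODEL_RES_ABS) != 0)
        return (true);
    return (l->filecaps_valid && l->has_cap_unlink);
}
```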
1) filecaps_init was unnecessarily a function call
2) an assignment at the end was preventing tail calling of cap_rights_init
Sponsored by: The FreeBSD Foundation
It is possible that we started with a relative path but during the
lookup, found an absolute symlink. In this case, BENEATH handling
code needs the latch, but it is too late to calculate it.
While there, somewhat improve the assertions. Clear the NI_LCF_LATCH
flag when the latch vnode is released, so that asserts know the state.
Assert that there is a latch if we entered beneath+abs path mode,
after the starting point is processed.
Reported by: wulf
With more input from: pho
Sponsored by: The FreeBSD Foundation
The path must have a tail which does not escape the starting/topping
directory. The documentation will come shortly; see the man pages
commit message for the reason for the separate commit.
Reviewed by: jilles (previous version)
Discussed with: emaste
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D17714
The flags prevent name lookup in open(2) and the *at(2) vfs syscalls from
escaping the starting directory. The interface is supposedly similar
to the flags proposed for Linux.
Reviewed by: jilles (code, previous version of manpages), 0mp (manpages)
Discussed with: allanjude, emaste, jonathan
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D17547
Mainly focus on files that use BSD 3-Clause license.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
supersede or replace the license texts.
Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
Previously, symlinks in FreeBSD were artificially limited to PATH_MAX-2.
Add a short test case to verify the change.
Submitted by: Gaurav Gangalwar <ggangalwar AT isilon.com>
Reviewed by: kib
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D12589
/compat/linux/path before /path. Stop following symbolic links when
looking up /compat/linux/path so dead symbolic links aren't ignored.
This allows syscalls like readlink(2) and lstat(2) to work on such links.
And open(2) will return an error now instead of trying /path.
uma_zcreate()'s alignment argument is supposed to be sizeof(foo) - 1,
and uma.h provides a set of helper macros for common types. Passing
sizeof(void *) results in all of the members being misaligned triggering
unaligned access faults on certain architectures (notably MIPS).
Reported by: brooks
Obtained from: CheriBSD
MFC after: 3 days
Sponsored by: DARPA / AFRL
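Why the argument has to be a mask (size minus one) rather than a size can be
seen from the usual round-up arithmetic. This is a generic sketch:
align_up_mask and is_aligned are hypothetical helpers, and the allocator's
internal computation may differ in detail.

```c
#include <stdbool.h>
#include <stdint.h>

/* Round addr up using an alignment mask, i.e. (power-of-two alignment - 1). */
static uintptr_t
align_up_mask(uintptr_t addr, uintptr_t mask)
{
    return ((addr + mask) & ~mask);
}

/* True if addr is a multiple of the power-of-two alignment. */
static bool
is_aligned(uintptr_t addr, uintptr_t alignment)
{
    return ((addr & (alignment - 1)) == 0);
}
```

With an 8-byte pointer, the correct mask is 7 and rounding 13 up yields 16.
Passing 8 as if it were a mask only clears a single bit, so 13 becomes 21,
which is not 8-byte aligned.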
a symlink and an autofs mount request. The crash was caused by namei()
calling bcopy() with a negative length, caused by numeric underflow:
in lookup(), in the relookup path, the ni_pathlen was decremented too
many times. The bug was introduced in r296715.
Big thanks to Alex Deiter for his help with debugging this.
Reviewed by: kib@
Tested by: Alex Deiter <alex.deiter at gmail.com>
MFC after: 1 month
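The class of bug described above, a remaining-length counter decremented once
too often, can be shown in isolation. This is a generic sketch and not the
actual fix: pathlen_consume is a hypothetical helper, and the length is
assumed to be unsigned here (with a signed type the result would simply go
negative).

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Subtract consumed bytes from the remaining length.
 * Refuses to wrap around and leaves *pathlen untouched on failure.
 */
static bool
pathlen_consume(size_t *pathlen, size_t consumed)
{
    if (consumed > *pathlen)
        return (false);
    *pathlen -= consumed;
    return (true);
}
```

An unchecked subtraction past zero wraps to a huge value, which a later copy
routine would then treat as the number of bytes to move.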
This allows blind increment of relevant counters which under contention
is cheaper than inc-not-zero loops at least on amd64.
Use it in some of the places which are guaranteed to see already active
vnodes.
Reviewed by: kib (previous version)
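The difference between the two acquisition styles can be sketched with C11
atomics. The function names are hypothetical and this is a userspace model,
not the kernel routines.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Generic path: the object may be inactive, so loop with compare-and-swap. */
static bool
ref_acquire_if_not_zero(atomic_uint *cnt)
{
    unsigned old = atomic_load(cnt);

    while (old != 0) {
        /* On failure, old is reloaded with the current value. */
        if (atomic_compare_exchange_weak(cnt, &old, old + 1))
            return (true);
    }
    return (false);
}

/* Caller guarantees the count is already non-zero: a single blind add. */
static void
ref_acquire_active(atomic_uint *cnt)
{
    atomic_fetch_add(cnt, 1);
}
```

The blind variant never retries, which is where the saving under contention
comes from, but it is only safe when the caller already knows the object is
active.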
Since the vnode is only expected to be shared locked, we can save a
little overhead by only pretending we are locking in the first place.
Reviewed by: kib
Tested by: pho
If dotdot lookup does not escape from the file descriptor passed as
the lookup root, we can allow the component traversal. Track the
directories traversed, and check the result of dotdot lookup against
the recorded list of the directory vnodes.
Dotdot lookups are enabled by sysctl vfs.lookup_cap_dotdot, currently
disabled by default until more verification of the approach is done.
Disallow non-local filesystems for dotdot, since a remote server might
conspire with the local process to allow it to escape the namespace.
This might be too cautious, so provide the knob
vfs.lookup_cap_dotdot_nonlocal to override it as well.
Idea by: rwatson
Discussed with: emaste, jonathan, rwatson
Reviewed by: mjg (previous version)
Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D8110
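The tracking scheme can be modeled in a few lines. This is a sketch under
assumptions: struct dir_tracker, MAX_TRACKED and both functions are
hypothetical, directories are identified by opaque pointers, and truncating
the list after a successful dotdot is a choice made for this sketch.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_TRACKED 32  /* arbitrary bound for the sketch */

/* Directories traversed so far, starting with the lookup root. */
struct dir_tracker {
    const void *dirs[MAX_TRACKED];
    size_t n;
};

/* Record a directory entered during the lookup. */
static bool
tracker_push(struct dir_tracker *t, const void *dir)
{
    if (t->n == MAX_TRACKED)
        return (false);
    t->dirs[t->n++] = dir;
    return (true);
}

/*
 * Validate the result of a ".." lookup: it is allowed only if it lands
 * on a directory that was already traversed.
 */
static bool
tracker_dotdot_ok(struct dir_tracker *t, const void *result)
{
    for (size_t i = t->n; i > 0; i--) {
        if (t->dirs[i - 1] == result) {
            t->n = i;   /* forget directories below the result */
            return (true);
        }
    }
    return (false);
}
```

A ".." that resolves to a directory outside the recorded list is treated as
an escape attempt and rejected.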