The new structure is copy-on-write. With the assumption that path lookups are
significantly more frequent than chdirs and chrooting this is a win.
This provides stable root and jail root vnodes without the need to reference
them on lookup, which in turn means less work on globally shared structures.
Note this also happens to fix a bug where jail vnode was never referenced,
meaning subsequent access on lookup could run into use-after-free.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23884
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.
This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.
Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT
Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718
realpath(3) is used a lot e.g., by clang and is a major source of getcwd
and fstatat calls. This can be done more efficiently in the kernel.
This works by performing a regular lookup while saving the name and found
parent directory. If the terminal vnode is a directory we can resolve it using
usual means. Otherwise we can use the name saved by lookup and resolve the
parent.
See the review for sample syscall counts.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23574
O_SEARCH is defined by POSIX [0] to open a directory for searching, skipping
permissions checks on the directory itself after the initial open(). This is
close to the semantics we've historically applied for O_EXEC on a directory,
which is UB according to POSIX. Conveniently, O_SEARCH on a file is also
explicitly undefined behavior according to POSIX, so O_EXEC would be a fine
choice. The spec goes on to state that O_SEARCH and O_EXEC need not be
distinct values, but they're not defined to be the same value.
This was pointed out as an incompatibility with other systems that had made
its way into libarchive, which had assumed that O_EXEC was an alias for
O_SEARCH.
This defines compatibility O_SEARCH/FSEARCH (equivalent to O_EXEC and FEXEC
respectively) and expands our UB for O_EXEC on a directory. O_EXEC on a
directory is checked in vn_open_vnode already, so for completeness we add a
NOEXECCHECK when O_SEARCH has been specified on the top-level fd and do not
re-check that when descending in namei.
[0] https://pubs.opengroup.org/onlinepubs/9699919799/
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23247
Filesystems which want to use it in limited capacity can employ the
VOP_UNLOCK_FLAGS macro.
Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D21427
The current vnode layout is not smp-friendly by having frequently read data
avoidably sharing cachelines with very frequently modified fields. In
particular v_iflag inspected for VI_DOOMED can be found in the same line with
v_usecount. Instead make it available in the same cacheline as the v_op, v_data
and v_type which all get read all the time.
v_type is avoidably 4 bytes while the necessary data will easily fit in 1.
Shrinking it frees up 3 bytes, 2 of which get used here to introduce a new
flag field with a new value: VIRF_DOOMED.
Reviewed by: kib, jeff
Differential Revision: https://reviews.freebsd.org/D22715
. entries are never created and .. can reuse existing entries,
meaning the early count bump is both spurious and leading to
overcounting in certain cases.
The current mechanism is bogus in several ways:
- the limit is a percentage of total entries added, which means negative
entries get evicted all the time even if there are plenty of resources
- evicting code is almost not concurrent, which makes it unable to
remove entries fast enough when doing something as simple as -j 104
buildworld
- there is no support for performing mass removal if necessary
Vast majority of negative entries never get any hits. Only evicting
them when the filesystem demands it results in a significant growth of
the namecache with almost no improvement in the hit ratio.
Sample result about afer 90 minutes of poudriere -j 104:
current no evict % of the original
numneg 219737 2013157 916
numneghits 266711906 263544562 98 [1]
[1] this may look funny but there is a certain dose of variation to the
build
The number was chosen as something which mostly eliminates spurious
evictions during lighter workloads but still keeps the total at bay.
Sponsored by: The FreeBSD Foundation
Continue protecting demotion from the hotlist and selection of the
target list with the ncneg_shrink_lock lock, but drop it before
relocking to zap the node.
While here count how many times we skipped shrinking due to the lock
being already taken.
Sponsored by: The FreeBSD Foundation
- track the total count of hot entries
- pre-read the lock when shrinking since it is typically already taken
- place the lock in its own cacheline
- shorten the hold time of hot lock list when zapping
Sponsored by: The FreeBSD Foundation
Due to lock ordering issues (bucket lock held, vnode locks wanted) the code
starts with trylocking which in face of contention often fails. Prior to
the change it would loop back with a possible yield.
Instead note we know what locks are needed and can take them in the right
order, avoiding retries. Then we can safely re-lookup and see if the entry
we are looking for is still there.
On a 104-way box poudriere would result in constant retries during an 11h
run as seen in the vfs.cache.zap_and_exit_bucket_fail counter.
before: 408866592
after : 0
However, a new stat reports:
vfs.cache.zap_and_exit_bucket_relock_success: 32638
Note this is only a bandaid over current design issues.
Tested by: pho
Sponsored by: The FreeBSD Foundation
It used to be mp_ncpus * 64, but this gives unnecessarily big values for small
machines and at the same time constraints bigger ones. In particular this helps
on a 104-way box for which the count is now doubled.
While here make cache_purgevfs less likely. Currently it is not efficient in
face of contention due to lock ordering issues. These are fixable but not worth
it at the moment.
Sponsored by: The FreeBSD Foundation
vnodes have 2 reference counts - holdcnt to keep the vnode itself from getting
freed and usecount to denote it is actively used.
Previously all operations bumping usecount would also bump holdcnt, which is
not necessary. We can detect if usecount is already > 1 (in which case holdcnt
is also > 1) and utilize it to avoid bumping holdcnt on our own. This saves
on atomic ops.
Reviewed by: kib
Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21471
Similar to r348026, exhaustive search for uses of CTRn() and cross reference
ktr.h includes. Where it was obvious that an OS compat header of some kind
included ktr.h indirectly, .c files were left alone. Some of these files
clearly got ktr.h via header pollution in some scenarios, or tinderbox would
not be passing prior to this revision, but go ahead and explicitly include it
in files using it anyway.
Like r348026, these CUs did not show up in tinderbox as missing the include.
Reported by: peterj (arm64/mp_machdep.c)
X-MFC-With: r347984
Sponsored by: Dell EMC Isilon
If bumping over the counter goes over the limit we have to decrement it back.
Previous code would only bump the counter after adding the entry (thus allowing
the cache to go over the limit).
Sponsored by: The FreeBSD Foundation
cache_lookup's documentation got dislocated by r324378. Relocate and expand
it.
Reviewed by: jhb, kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Implement a ddb function walking the namecache to do this.
Reviewed by: jhb, mjg
Inspired by: gdb macro from jhb (old version)
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D14898
Mainly focus on files that use BSD 3-Clause license.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
Since the case of an empty chain was already covered, it si very likely
that the existing entry is matching. Skipping readlocking saves on lock
upgrade.
It is used on each new entry addition to decide whether to whack an existing
negative entry in order to prevent a blow out in size, but the parameter was
set years ago and never revisited.
Building with poudriere results in about 400 evictions per second which
unnecessarily grab entries from the hot list.
With the new parameter there are next to no evictions of the sort.
Lookups of the sort are rare compared to regular ones and succesfull ones
result in removing entries from the cache.
In the current code buckets are rlocked and a trylock dance is performed,
which can fail and cause a restart. Fixing it will require a little bit
of surgery and in order to keep the code maintaineable the 2 cases have
to split.
MFC after: 1 week