The current mechanism is bogus in several ways:
- the limit is a percentage of total entries added, which means negative
entries get evicted all the time even if there are plenty of resources
- evicting code is almost not concurrent, which makes it unable to
remove entries fast enough when doing something as simple as -j 104
buildworld
- there is no support for performing mass removal if necessary
Vast majority of negative entries never get any hits. Only evicting
them when the filesystem demands it results in a significant growth of
the namecache with almost no improvement in the hit ratio.
Sample result about afer 90 minutes of poudriere -j 104:
current no evict % of the original
numneg 219737 2013157 916
numneghits 266711906 263544562 98 [1]
[1] this may look funny but there is a certain dose of variation to the
build
The number was chosen as something which mostly eliminates spurious
evictions during lighter workloads but still keeps the total at bay.
Sponsored by: The FreeBSD Foundation
Continue protecting demotion from the hotlist and selection of the
target list with the ncneg_shrink_lock lock, but drop it before
relocking to zap the node.
While here count how many times we skipped shrinking due to the lock
being already taken.
Sponsored by: The FreeBSD Foundation
- track the total count of hot entries
- pre-read the lock when shrinking since it is typically already taken
- place the lock in its own cacheline
- shorten the hold time of hot lock list when zapping
Sponsored by: The FreeBSD Foundation
Due to lock ordering issues (bucket lock held, vnode locks wanted) the code
starts with trylocking which in face of contention often fails. Prior to
the change it would loop back with a possible yield.
Instead note we know what locks are needed and can take them in the right
order, avoiding retries. Then we can safely re-lookup and see if the entry
we are looking for is still there.
On a 104-way box poudriere would result in constant retries during an 11h
run as seen in the vfs.cache.zap_and_exit_bucket_fail counter.
before: 408866592
after : 0
However, a new stat reports:
vfs.cache.zap_and_exit_bucket_relock_success: 32638
Note this is only a bandaid over current design issues.
Tested by: pho
Sponsored by: The FreeBSD Foundation
It used to be mp_ncpus * 64, but this gives unnecessarily big values for small
machines and at the same time constraints bigger ones. In particular this helps
on a 104-way box for which the count is now doubled.
While here make cache_purgevfs less likely. Currently it is not efficient in
face of contention due to lock ordering issues. These are fixable but not worth
it at the moment.
Sponsored by: The FreeBSD Foundation
vnodes have 2 reference counts - holdcnt to keep the vnode itself from getting
freed and usecount to denote it is actively used.
Previously all operations bumping usecount would also bump holdcnt, which is
not necessary. We can detect if usecount is already > 1 (in which case holdcnt
is also > 1) and utilize it to avoid bumping holdcnt on our own. This saves
on atomic ops.
Reviewed by: kib
Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21471
Similar to r348026, exhaustive search for uses of CTRn() and cross reference
ktr.h includes. Where it was obvious that an OS compat header of some kind
included ktr.h indirectly, .c files were left alone. Some of these files
clearly got ktr.h via header pollution in some scenarios, or tinderbox would
not be passing prior to this revision, but go ahead and explicitly include it
in files using it anyway.
Like r348026, these CUs did not show up in tinderbox as missing the include.
Reported by: peterj (arm64/mp_machdep.c)
X-MFC-With: r347984
Sponsored by: Dell EMC Isilon
If bumping over the counter goes over the limit we have to decrement it back.
Previous code would only bump the counter after adding the entry (thus allowing
the cache to go over the limit).
Sponsored by: The FreeBSD Foundation
cache_lookup's documentation got dislocated by r324378. Relocate and expand
it.
Reviewed by: jhb, kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Implement a ddb function walking the namecache to do this.
Reviewed by: jhb, mjg
Inspired by: gdb macro from jhb (old version)
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D14898
Mainly focus on files that use BSD 3-Clause license.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
Since the case of an empty chain was already covered, it si very likely
that the existing entry is matching. Skipping readlocking saves on lock
upgrade.
It is used on each new entry addition to decide whether to whack an existing
negative entry in order to prevent a blow out in size, but the parameter was
set years ago and never revisited.
Building with poudriere results in about 400 evictions per second which
unnecessarily grab entries from the hot list.
With the new parameter there are next to no evictions of the sort.
Lookups of the sort are rare compared to regular ones and succesfull ones
result in removing entries from the cache.
In the current code buckets are rlocked and a trylock dance is performed,
which can fail and cause a restart. Fixing it will require a little bit
of surgery and in order to keep the code maintaineable the 2 cases have
to split.
MFC after: 1 week
This fixes kernel crashes due to misaligned accesses to the 64-bit
time_t embedded in struct namecache_ts in MIPS n32 kernels.
MFC after: 1 week
Sponsored by: DARPA / AFRL
namecache_ts differs from mere namecache by few fields placed mid struct.
The access to the last element (the name) is thus special-cased.
The standard solution is to put new fields at the very beginning anad
embedd the original struct. The pointer shuffled around points to the
embedded part. If needed, access to new fields can be gained through
__containerof.
MFC after: 1 week
All hash sizes are power-of-2, but the compiler does not know that for sure
and 'foo % size' forces doing a division.
Store the size - 1 and use 'foo & hash' instead which allows mere shift.
The size can be changed by side effect of modifying kern.maxvnodes.
Since numbucketlocks was not modified, setting a sufficiently low value
would give more locks than actual buckets, which would then lead to
corruption.
Force the number of buckets to be not smaller.
Note this should not matter for real world cases.
Reported and tested by: pho
The negative list shrinker can demote an entry with only hotlist + neglist
locks held. On the other hand entry removal possibly sets the NCF_DVDROP
without aformentioned locks held prior to detaching it from the respective
netlist., which can lose the update made by the shrinker.
Reported and tested by: truckman
vp->v_mount->mnt_vnodecovered unlocked. This allowed unmount to race.
Lock vnode after we noticed the VV_ROOT flag. See comments for
explanation why unlocked check for the flag is considered safe.
Reported and tested by: avg
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
If no negative entry is found on the last list, the ncp pointer will be
left uninitialized and a non-null value will make the function assume an
entry was found.
Fix the problem by initializing to NULL on entry.
Reported by: glebius
This splits the ncneg_mtx lock while preserving the hit ratio at least
during buildworld.
Create N dedicated lists for new negative entries.
Entries with at least one hit get promoted to the hot list, where they
get requeued every M hits.
Shrinking demotes one hot entry and performs a round-robin shrinking of
regular lists.
Reviewed by: kib
Other uses of cache_purgevfs() do rely on the cache purge for correct
operations, when paths are invalidated without unmount.
Reported and tested by: jkim
Discussed with: mjg
Sponsored by: The FreeBSD Foundation