It gets rolled up to the global when deferred requeueing is performed.
A dedicated read routine makes sure to return a value only off by a certain
amount.
This soothes a global serialisation point for all 0<->1 hold count transitions.
Reviewed by: jeff
Differential Revision: https://reviews.freebsd.org/D23235
The vnode list lock is only needed to reclaim free vnodes or kick the vnlru
thread (or to block and not miss a wake up (but note the sleep has a timeout so
this would not be a correctness issue)). Try to get away without the lock by
just doing an atomic increment.
The lock is contended e.g., during poudriere -j 104 where about half of all
acquires come from vnode allocation code.
Note the entire scheme needs a rewrite, the above just reduces it's SMP impact.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23140
Semantics are almost identical. Some code is deduplicated and there are
fewer memory accesses.
Reviewed by: kib, jeff
Differential Revision: https://reviews.freebsd.org/D23158
Take advantage of global ordering introduced in r356672.
Reviewed by: mckusick (previous version)
Differential Revision: https://reviews.freebsd.org/D23067
Constant requeuing adds significant lock contention in certain
workloads. Lessen the problem by batching it.
Per-cpu areas are locked in order to synchronize against UMA freeing
memory.
vnode's v_mflag is converted to short to prevent the struct from
growing.
Sample result from an incremental make -s -j 104 bzImage on tmpfs:
stock: 122.38s user 1780.45s system 6242% cpu 30.480 total
patched: 144.84s user 985.90s system 4856% cpu 23.282 total
Reviewed by: jeff
Tested by: pho (in a larger patch, previous version)
Differential Revision: https://reviews.freebsd.org/D22998
The current notion of an active vnode is eliminated.
Vnodes transition between 0<->1 hold counts all the time and the
associated traversal between different lists induces significant
scalability problems in certain workloads.
Introduce a global list containing all allocated vnodes. They get
unlinked only when UMA reclaims memory and are only requeued when
hold count reaches 0.
Sample result from an incremental make -s -j 104 bzImage on tmpfs:
stock: 118.55s user 3649.73s system 7479% cpu 50.382 total
patched: 122.38s user 1780.45s system 6242% cpu 30.480 total
Reviewed by: jeff
Tested by: pho (in a larger patch, previous version)
Differential Revision: https://reviews.freebsd.org/D22997
This obviates the need to scan the entire active list looking for vnodes
of interest.
msync is handled by adding all vnodes with write count to the lazy list.
deferred inactive directly adds vnodes as it sets the VI_DEFINACT flag.
Vnodes get dequeued from the list when their hold count reaches 0.
Newly added MNT_VNODE_FOREACH_LAZY* macros support filtering so that
spurious locking is avoided in the common case.
Reviewed by: jeff
Tested by: pho (in a larger patch, previous version)
Differential Revision: https://reviews.freebsd.org/D22995
This creates a dedicated routine (vn_alloc) to allocate vnodes.
As a side effect code duplicationw with getnewvnode_reserve is eleminated.
Add vn_free for symmetry.
Having a reserved vnode count does not guarantee that getnewvnodes wont
block later. Said blocking partially defeats the purpose of reserving in
the first place.
Preallocate instaed. The only consumer was always passing "1" as count
and never nesting reservations.
vgone dooms the vnode while keeping VI_OWEINACT set and then drops the
interlock.
vputx can pick up the interlock and pass it to vdefer_inactive since the
flag is set.
The race is harmless, just don't defer anything as vgone will take care of it.
Reported by: pho
The previous behavior of leaving VI_OWEINACT vnodes on the active list without
a hold count is eliminated. Hold count is kept and inactive processing gets
explicitly deferred by setting the VI_DEFINACT flag. The syncer is then
responsible for vdrop.
Reviewed by: kib (previous version)
Tested by: pho (in a larger patch, previous version)
Differential Revision: https://reviews.freebsd.org/D23036
- use LK_NOWAIT instead of calling VOP_ISLOCKED before deciding to lock
- evaluate flags before looping over vnodes
Reviewed by: kib
Tested by: pho (in a larger patch, previous version)
Differential Revision: https://reviews.freebsd.org/D23035
Otherwise in code like this:
if (numvnodes > desiredvnodes)
vnlru_free_locked(numvnodes - desiredvnodes, NULL);
numvnodes can drop below desiredvnodes prior to the call and if the
compiler generated another read the subtraction would get a negative
value.
There was only one consumer and it was using it incorrectly.
It is given an equivalent hack.
Reviewed by: jeff
Differential Revision: https://reviews.freebsd.org/D23037
Filesystems which want to use it in limited capacity can employ the
VOP_UNLOCK_FLAGS macro.
Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D21427
1. The only place in the tree which calls getnewvnode with mp == NULL does it
for vp_crossmp which will never execute this codepath. Any vnode which legally
has ->v_mount == NULL is also doomed, which once more wont execute this code.
2. Remove an assertion for v_holdcnt from production kernels. It gets taken care
of by refcount macros in debug kernels.
Any code which would want to pass NULL mp can construct a fake one instead.
Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D22722
The current vnode layout is not smp-friendly by having frequently read data
avoidably sharing cachelines with very frequently modified fields. In
particular v_iflag inspected for VI_DOOMED can be found in the same line with
v_usecount. Instead make it available in the same cacheline as the v_op, v_data
and v_type which all get read all the time.
v_type is avoidably 4 bytes while the necessary data will easily fit in 1.
Shrinking it frees up 3 bytes, 2 of which get used here to introduce a new
flag field with a new value: VIRF_DOOMED.
Reviewed by: kib, jeff
Differential Revision: https://reviews.freebsd.org/D22715
1. replace hand-rolled macros for operation type with enum
2. unlock the vnode in vput itself, there is no need to branch on it. existence
of VPUTX_VPUT remains significant in that the inactive variant adds LK_NOWAIT
to locking request.
3. remove the useless v_usecount assertion. few lines above the checks if
v_usecount > 0 and leaves. should the value be negative, refcount would fail.
4. the CTR return vnode %p to the freelist is incorrect as vdrop may find the
vnode with holdcnt > 1. if the like should exist, it should be moved there
5. no need to error = 0 for everyone
Reviewed by: kib, jeff (previous version)
Differential Revision: https://reviews.freebsd.org/D22718
Currently si_usecount is effectively a sum of usecounts from all associated
vnodes. This is maintained by special-casing for VCHR every time usecount is
modified. Apart from complicating the code a little bit, it has a scalability
impact since it forces a read from a cacheline shared with said count.
There are no consumers of the feature in the ports tree. In head there are only
2: revoke and devfs_close. Both can get away with a weaker requirement than the
exact usecount, namely just the count of active vnodes. Changing the meaning to
the latter means we only need to modify it on 0<->1 transitions, avoiding the
check plenty of times (and entirely in something like vrefact).
Reviewed by: kib, jeff
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22202
flag and use the same system.
This enables further fault locking improvements by allowing more faults to
proceed with a shared lock.
Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22116
Create a sequence point by ending a full expression for call to
vspace() and use of the globals which are modified by vspace().
Reported and reviewed by: imp
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D22126
On many filesystems the traversal is effectively a no-op. Add a way to avoid
the overhead.
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22009
It is a more natural fit. vfs_msync only deals with active vnodes.
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22008
Root vnodes looekd up all the time, e.g. when crossing a mount point.
Currently used routines always perform a costly lookup which can be
trivially avoided.
Reviewed by: jeff (previous version), kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21646
MAXPATHLEN / PATH_MAX includes space for the terminating NUL, and namei
verifies the presence of the NUL. Thus there is no need to increase the
buffer size here.
The sysctl passes the string excluding the NUL, so req->newlen equal to
PATH_MAX is too long.
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21876
The two options are
* nocover/cover: Prevent/allow mounting over an existing root mountpoint.
E.g., "mount -t ufs -o nocover /dev/sd1a /usr/local" will fail if /usr/local
is already a mountpoint.
* emptydir/noemptydir: Prevent/allow mounting on a non-empty directory.
E.g., "mount -t ufs -o emptydir /dev/sd1a /usr" will fail.
Neither of these options is intended to be a default, for historical and
compatibility reasons.
Reviewed by: allanjude, kib
Differential Revision: https://reviews.freebsd.org/D21458
There are 3 counters modified all the time in this structure - one for
keeping the structure alive, one for preventing unmount and one for
tracking active writers. Exact values of these counters are very rarely
needed, which makes them a prime candidate for conversion to a per-cpu
scheme, resulting in much better performance.
Sample benchmark performing fstatfs (modifying 2 out of 3 counters) on
a 104-way 2 socket Skylake system:
before: 852393 ops/s
after: 76682077 ops/s
Reviewed by: kib, jeff
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21637
New primitive is introduced to denote sections can operate locklessly
on aspects of struct mount, but which can also be disabled if necessary.
This provides an opportunity to start scaling common case modifications
while providing stable state of the struct when facing unmount, write
suspendion or other events.
mnt_ref is the first counter to start being managed in this manner with
the intent to make it per-cpu.
Reviewed by: kib, jeff
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21425
1. If we release the last usecount we take ownership of the hold count, which
means the vnode will remain allocated until we vdrop it.
2. If someone else vrefs they will find no usecount and will proceed to add
their own hold count.
3. No code has a problem with v_usecount transitioning to 0 without the
interlock
These facts combined mean we can fetchadd instead of having a cmpset loop.
Reviewed by: kib (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21528
There are 2 problems:
- it introduces a funny bug where it can end up trylocking the same vnode [1]
- it exposes a pre-existing softdep deadlock [2]
Both are easier to run into that the bug which got fixed, so revert until
a complete solution is worked out.
Reported by: cy [1], pho [2]
Sponsored by: The FreeBSD Foundation
Currently the code only bumps holdcnt and clears the VI_FREE flag, not
performing actual vhold. Since the vnode is still visible elsewhere, a
potential new user can find it and incorrectly assume it is properly held.
Use vholdl instead to correctly hold the vnode. Another place recycling
(vlrureclaim) does this already.
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21522
vnodes have 2 reference counts - holdcnt to keep the vnode itself from getting
freed and usecount to denote it is actively used.
Previously all operations bumping usecount would also bump holdcnt, which is
not necessary. We can detect if usecount is already > 1 (in which case holdcnt
is also > 1) and utilize it to avoid bumping holdcnt on our own. This saves
on atomic ops.
Reviewed by: kib
Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21471
The page daemon periodically invokes uma_reclaim() to reclaim cached
items from each zone when the system is under memory pressure. This
is important since the size of these caches is unbounded by default.
However it also results in bursts of high latency when allocating from
heavily used zones as threads miss in the per-CPU caches and must
access the keg in order to allocate new items.
With r340405 we maintain an estimate of each zone's usage of its
(per-NUMA domain) cache of full buckets. Start making use of this
estimate to avoid reclaiming the entire cache when under memory
pressure. In particular, introduce TRIM, DRAIN and DRAIN_CPU
verbs for uma_reclaim() and uma_zone_reclaim(). When trimming, only
items in excess of the estimate are reclaimed. Draining a zone
reclaims all of the cached full buckets (the previous behaviour of
uma_reclaim()), and may further drain the per-CPU caches in extreme
cases.
Now, when under memory pressure, the page daemon will trim zones
rather than draining them. As a result, heavily used zones do not incur
bursts of bucket cache misses following reclamation, but large, unused
caches will be reclaimed as before.
Reviewed by: jeff
Tested by: pho (an earlier version)
MFC after: 2 months
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D16667
Current implementation of vnode_create_vobject() and
vnode_destroy_vobject() is written so that it prepared to handle the
vm object destruction for live vnode. Practically, no filesystems use
this, except for some remnants that were present in UFS till today.
One of the consequences of that model is that each filesystem must
call vnode_destroy_vobject() in VOP_RECLAIM() or earlier, as result
all of them get rid of the v_object in reclaim.
Move the call to vnode_destroy_vobject() to vgonel() before
VOP_RECLAIM(). This makes v_object stable: either the object is NULL,
or it is valid vm object till the vnode reclamation. Remove code from
vnode_create_vobject() to handle races with the parallel destruction.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D21412
vnode usecount drops to 0 all the time (e.g. for directories during path lookup).
When that happens the kernel would always lock the exclusive lock for the vnode
in order to call vinactive(). This blocks other threads who want to use the vnode
for looukp.
vinactive is very rarely needed and can be tested for without the vnode lock held.
This patch gives filesytems an opportunity to do it, sample total wait time for
tmpfs over 500 minutes of poudriere -j 104:
before: 557563641706 (lockmgr:tmpfs)
after: 46309603301 (lockmgr:tmpfs)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21371
The plan is to drop the flags argument. There is also a temporary bug
now that nullfs ignores the flag.
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21252
After all the changes, its dynamic scope is same as for MNTK_UNMOUNT,
but to allow the syncer vnode to be re-installed on unmount failure.
But the case of syncer was already handled by using the VV_FORCEINSMQ
flag for quite some time.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
v_inval_buf_range invalidates all buffers within a certain LBA range of a
file. It will be used by fusefs(5). This commit is a partial merge of
r346162, r346606, and r346756 from projects/fuse2.
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21032
Add the debug.try_reclaim_vnode sysctl. When a pathname is written to it, it
will be reclaimed, as long as it isn't already or doomed. The purpose is to
gain test coverage for vnode reclamation, which is otherwise hard to
achieve.
Add the debug.ftry_reclaim_vnode sysctl. It does the same thing, except
that its argument is a file descriptor instead of a pathname.
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20519
vtruncbuf takes a "struct ucred*" argument. AFAICT, it's been unused ever
since that function was first added in r34611. Remove it. Also, remove some
"struct ucred" arguments from fuse and nfs functions that were only used by
vtruncbuf.
Reviewed by: cem
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20377
Similar to r348026, exhaustive search for uses of CTRn() and cross reference
ktr.h includes. Where it was obvious that an OS compat header of some kind
included ktr.h indirectly, .c files were left alone. Some of these files
clearly got ktr.h via header pollution in some scenarios, or tinderbox would
not be passing prior to this revision, but go ahead and explicitly include it
in files using it anyway.
Like r348026, these CUs did not show up in tinderbox as missing the include.
Reported by: peterj (arm64/mp_machdep.c)
X-MFC-With: r347984
Sponsored by: Dell EMC Isilon
kern_execve() locks text vnode exclusive to be able to set and clear
VV_TEXT flag. VV_TEXT is mutually exclusive with the v_writecount > 0
condition.
The change removes VV_TEXT, replacing it with the condition
v_writecount <= -1, and puts v_writecount under the vnode interlock.
Each text reference decrements v_writecount. To clear the text
reference when the segment is unmapped, it is recorded in the
vm_map_entry backed by the text file as MAP_ENTRY_VN_TEXT flag, and
v_writecount is incremented on the map entry removal
The operations like VOP_ADD_WRITECOUNT() and VOP_SET_TEXT() check that
v_writecount does not contradict the desired change. vn_writecheck()
is now racy and its use was eliminated everywhere except access.
Atomic check for writeability and increment of v_writecount is
performed by the VOP. vn_truncate() now increments v_writecount
around VOP_SETATTR() call, lack of which is arguably a bug on its own.
nullfs bypasses v_writecount to the lower vnode always, so nullfs
vnode has its own v_writecount correct, and lower vnode gets all
references, since object->handle is always lower vnode.
On the text vnode' vm object dealloc, the v_writecount value is reset
to zero, and deadfs vop_unset_text short-circuit the operation.
Reclamation of lowervp always reclaims all nullfs vnodes referencing
lowervp first, so no stray references are left.
Reviewed by: markj, trasz
Tested by: mjg, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D19923
Using daddr_t instead of int avoids trunclbn to become negative when it
shouldn't.
This isssue was found by running syzkaller.
Reviewed by: mckusick, kib, markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D18763
If invalid, return EINVAL. Note that inode check-hashes greatly
reduce the chance that these errors will go undetected.
Reported by: Christopher Krah <krah@protonmail.com>
Reported as: FS-5-UFS-2: Denial Of Service in nmount-3 (ffs_read)
Reviewed by: kib
MFC after: 1 week
Sponsored by: Netflix
M sys/fs/ext2fs/ext2_vnops.c
M sys/kern/vfs_subr.c
M sys/ufs/ffs/ffs_snapshot.c
M sys/ufs/ufs/ufs_vnops.c
The presence of allocated v_object does not imply that the buffer is
necessary VMIO kind. Buffer might has been allocated before the
object created, then the buffer is malloced. Although we try to avoid
such situation, it seems to be still legitimate.
Reported and tested by: pho
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Some internal KASSERTs access the v_iflag field without the vnode
interlock held after such a refcount update. The fences are needed for
the assertions to be correct in the face of store reordering.
Reported and tested by: jhibbits
Reviewed by: kib, mjg
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D16756
This is needed to avoid a race between the VNASSERT() below, and another
thread updating the VI_FREE flag, on weakly-ordered architectures.
On a 72-thread POWER9, without this barrier a 'make -j72 buildworld' would
panic on the assert regularly.
It may be possible to use a weaker barrier, and I'll investigate that once
all stability issues are worked out on POWER9.
by doing most of the work in a new function prison_add_vfs in kern_jail.c
Now a jail-enabled filesystem need only mark itself with VFCF_JAIL, and
the rest is taken care of. This includes adding a jail parameter like
allow.mount.foofs, and a sysctl like security.jail.mount_foofs_allowed.
Both of these used to be a static list of known filesystems, with
predefined permission bits.
Reviewed by: kib
Differential Revision: D14681
opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compabibility improvements may add over 100 more so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.
Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c. A fake _COMPAT_LINUX option ensure opt_compat.h
is created on all architectures.
Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.
Reviewed by: kib, cem, jhb, jtl
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14941
It's not sufficient nor required to use the vnode interlock when
checking if we are going to drop the last use count as the code in
vputx() uses refcount (atomic) operations for both checking and
decrementing the use code. Apply the same method to vn_rele_async().
While here, remove vn_rele_inactive(), a wrapper around vrele() that
didn't add any value.
Also, the change required making vfs_refcount_release_if_not_last()
public. I've made vfs_refcount_acquire_if_not_zero() public as well.
They are in sys/refcount.h now. While making the move I've dropped the
vfs_ prefix.
Reviewed by: mjg
MFC after: 2 weeks
Sponsored by: Panzura
Differential Revision: https://reviews.freebsd.org/D14869
Provide multiple clean queues partitioned into 'domains'. Each domain manages
its own bufspace and has its own bufspace daemon. Each domain has a set of
subqueues indexed by the current cpuid to reduce lock contention on the cleanq.
Refine the sleep/wakeup around the bufspace daemon to use atomics as much as
possible.
Add a B_REUSE flag that is used to requeue bufs during the scan to approximate
LRU rather than locking the queue on every use of a frequently accessed buf.
Implement bufspace_reserve with only atomic_fetchadd to avoid loop restarts.
Reviewed by: markj
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14274
whose type depends on the type of vnode. Correct vn_printf so that
it correctly identifies the name of the pointer that it is printing.
Submitted by: Andreas Longwitz <longwitz at incore.de>
MFC after: 1 week
Skip vfs_refcount_release_if_not_last if the interlock is held and just
go straight to refcount_release.
While here do cosmetic rearrangement of _vhold to better show it contains
equivalent behaviour.
This reduces noise when kernel is compiled by newer GCC versions,
such as one used by external toolchain ports.
Reviewed by: kib, andrew(sys/arm and sys/arm64), emaste(partial), erj(partial)
Reviewed by: jhb (sys/dev/pci/* sys/kern/vfs_aio.c and sys/kern/kern_synch.c)
Differential Revision: https://reviews.freebsd.org/D10385
Mainly focus on files that use BSD 3-Clause license.
The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.
Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
The end of the loop must re-lookup the next buf since the bufobj lock
is dropped in the loop body. If the lookup fails, the loop is restarted.
This mechanism non-obviously also terminates the loop when the end of
the buf list is reached. Split up the two loops termination cases to
make the code a bit less fragile. No functional change intended.
Reviewed by: kib
MFC after: 1 week
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D12730
MNT_VNODE_FOREACH_ALL() is supposed to avoid returning doomed vnodes,
but the VI_DOOMED check it used was done without the vnode interlock
held, so it could race with a concurrent vgone().
Submitted by: Don Morris <don.morris@isilon.com>
Reviewed by: kib, mckusick
MFC after: 1 week
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D12704
One consequence of the patch is that msyncing unlinked file mappings
no longer reduces the amount of the dirty memory in the system, but I
do not think that there are users of msync(2) that utilize it for such
side-effect.
Reported and tested by: tjil
PR: 222356
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D12411
The old code allowed calling vdrop() before insmntque() to place the vnode back
onto the freelist for later recycling. Some downstream consumers may rely on
this support. Normally insmntque() failing is fine since is uses vgone() and
immediately frees the vnode rather than attempting to add it to the freelist if
vdrop() were used instead.
Also assert that vhold() cannot be used on such a vnode.
Reviewed by: kib, cem, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D12126
This mode allows other clean buffers to arrive while we flush the buf
lists for the vnode, which is fine for the targeted use. We only need
that all buffers existed at the time of the function start were
flushed. In fact, only one assert has to be relaxed.
In collaboration with: pho
Reviewed by: rmacklem
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
X-Differential revision: https://reviews.freebsd.org/D12083
Previously, when the VI_TRYLOCK failed, we would spin under the mutex
that protects the vnode active list until we either succeeded or
noticed that we had hogged the CPU. Since we were violating the lock
order, this would guarantee that we would become a hog under any
deadlock condition (e.g. a race with vdrop(9) on the same vnode). In
the presence of many concurrent threads in sync(2) or vdrop etc, the
victim could hang for a long time.
Now, avoid spinning by dropping and reacquiring the locks in the
conventional lock order when the trylock fails. This requires a dance
with the vnode hold count.
Submitted by: Tom Rix <trix@juniper.net>
Tested by: pho
Differential revision: https://reviews.freebsd.org/D10692
was issued during VM-initiated i/o (pageout), so that the function
does not try to flush or remove pages or wait for the vm object
paging-in-progress counter.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
X-Differential revision: https://reviews.freebsd.org/D10241
Don't zero unused pointer members again.
Per discussion with secteam we are not issuing an advisory for this
issue as we have no current evidence it leaks exploitable information.
Reviewed by: rwatson, glebius, delphij
MFC after: 1 day
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D10227
This allows blind increment of relevant counters which under contention
is cheaper than inc-not-zero loops at least on amd64.
Use it in some of the places which are guaranteed to see already active
vnodes.
Reviewed by: kib (previous version)
As of r234483, vnode deactivation causes non-VPO_NOSYNC pages to be
laundered. This behaviour has two problems:
1. Dirty VPO_NOSYNC pages must be laundered before the vnode can be
reclaimed, and this work may be unfairly deferred to the vnlru process
or an unrelated application when the system is under vnode pressure.
2. Deactivation of a vnode with dirty VPO_NOSYNC pages requires a scan of
the corresponding VM object's memq for non-VPO_NOSYNC dirty pages; if
the laundry thread needs to launder pages from an unreferenced such
vnode, it will reactivate and deactivate the vnode with each laundering,
potentially resulting in a large number of expensive scans.
Therefore, ensure that all dirty pages are laundered upon deactivation,
i.e., when all maps of the vnode are removed and all references are
released.
Reviewed by: alc, kib
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D8641
Previously free vnodes would always by directly returned to the global
LRU list. With this change up to mnt_free_list_batch vnodes are collected
first.
syncer runs always return the batch regardless of its size.
While vnodes on per-mnt lists are not counted as free, they can be
returned in case of vnode shortage.
Reviewed by: kib
Tested by: pho
if vnode is VMIO. For VMIO vnodes, set BO_DEAD in vm_object_terminate().
The vnode_destroy_object(), when calling into vm_object_terminate(),
must be able to flush buffers. BO_DEAD purpose is to quickly destroy
buffers on write when the underlying vnode is not operable any more
(one example is the devfs node after geom is gone). Setting BO_DEAD
for reclaiming vnode before object is terminated is premature, and
results in unability to flush buffers with live SU dependencies from
vinvalbuf() in vm_object_terminate().
Reported by: David Cross <dcrosstech@gmail.com>
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
does it under the vnode interlock, but the interlock is not owned by the
asserting thread. As result, we might read increased use counter but also
still see VI_OWEINACT.
In collaboration with: nwhitehorn
Hardware donated by: IBM LTC
Sponsored by: The FreeBSD Foundation (kib)
Approved by: re (gjb)
to mount points with the given filesystem type, specified by mount
vfs_ops pointer.
Based on patch by: mckusick
Reviewed by: avg, mckusick
Tested by: allanjude, madpilot
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
It seems that at present vfs_read_dirent() is used only with filesystems
that do not support cookies, so the bug never manifested itself.
MFC after: 1 week
While there, order EVFILT_VNODE notes descriptions alphabetically.
Based on submission, and tested by: Vladimir Kondratyev <wulf@cicgroup.ru>
MFC after: 2 weeks
the monitored directory as the result of rename(2) operation. The
renames staying in the directory are not reported.
Submitted by: Vladimir Kondratyev <wulf@cicgroup.ru>
MFC after: 2 weeks
rename removing or adding subdirectory entry.
Discussed with and tested by: Vladimir Kondratyev <wulf@cicgroup.ru>
NetBSD PR: 48958 (http://gnats.netbsd.org/48958)
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
the virtvnodes calculation. Include the size of fs-specific v_data as
the nfs nclnode inline, the NFS nclnode is bigger than either ZFS
znode or UFS inode. Include the size of namecache_ts and short cache
path element, multiplied by the name cache population factor, again
inline.
Inline defines are used to avoid pollution of the vnode.h with the
subsystem-private objects. Non-significant unsynchronized changes of
the definitions are fine, we do not care about that precision, and
e.g. ZFS consumes much malloced memory per vnode for reasons
unaccounted in the formula.
Lower the partition of kmem dedicated to vnodes, from 1/7 to 1/10.
The measures reduce vnode cache pressure on kmem and bring the vnode
cache memory use below some apparent thresholds that were exceeded by
r291244 due to more robust vnode reuse.
Reported and tested by: marius (i386, previous version)
Reviewed by: bde
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
numbers range.
This effectively skips indirect and extdata blocks on the buffer
queue. Since their logical block numbers are negative, bnoreuselist()
could loop infinitely.
Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
This API has no in-tree consumers at the moment but is useful to at least
one out-of-tree consumer, and naturally complements existing vnode refcount
functions (vholdl(9), vdropl(9)).
Obtained from: kib (sys/ portion)
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D4947
Differential Revision: https://reviews.freebsd.org/D4953
Advance the logical block number to the lblkno of the found block plus
one, instead of incrementing the block number which was used for
lookup. This change skips sparcely populated buffer ranges, similar
to r292325, instead of doing useless lookups.
Do not restart the bnoreuselist() from the start of the range if
buffer lock cannot be obtained without sleep. Only retry lookup and
lock for the same queue and same logical block number.
Reported by: benno
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
buffer for each block number in the range with gbincore(), look up the
next instantiated buffer with the logical block number which is
greater or equal to the next lblkno. This significantly speeds up the
iteration for sparce-populated range.
Move the iteration into new helper bnoreuselist(), which is structured
similarly to flushbuflist().
Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
the type stability of the buffers memory. Instead of memoizing
pointer to the next buffer and validating it, remember the next
logical block number in the bo list and re-lookup.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
For completeness add a VNASSERT that there are no threads waiting on a
range lock (this was previously checked on every vnode free).
Reported by; Rick Macklem
Fix from: Mateusz Guzik
PR: 204949
on every allocation and fully releases them on every free. These
are not trivial costs: it starts by zeroing a large structure then
initializes a mutex, a lock manager lock, an rw lock, four lists,
and six pointers. And looking at vfs.vnodes_created, these operations
are being done millions of times an hour on a busy machine.
As a performance optimization, this code update uses the uma_init
and uma_fini routines to do these initializations and cleanups only
as the vnodes enter and leave the vnode_zone. With this change the
initializations are only done kern.maxvnodes times at system startup
and then only rarely again. The frees are done only if the vnode_zone
shrinks which never happens in practice. For those curious about the
avoided work, look at the vnode_init() and vnode_fini() functions in
kern/vfs_subr.c to see the code that has been removed from the main
vnode allocation/free path.
Reviewed by: kib
Tested by: Peter Holm
active list, into the header comment for vdrop(), which is the
function that decides whether to leave the vnode on the list. Note
that dirty page write-out in vinactive() is asynchronous.
Discussed with: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
targets. See the comment above wantfreevnodes variable for the
description of the algorithm.
The vfs.vlru_alloc_cache_src sysctl is removed. New code frees
namecache sources as the last chance to satisfy the highest watermark,
instead of selecting the source vnodes randomly. This provides good
enough behaviour to keep vn_fullpath() working in most situations.
The filesystem layout with deep trees, where the removed knob was
required, is thus handled automatically.
Submitted by: bde
Discussed with: mckusick
Tested by: pho
MFC after: 1 month
In poll mode, check for and wake VBAD vnodes. (Vnodes that are VBAD at
registration will never be woken by the RECLAIM trigger.)
Add post-VOP_RECLAIM hook to trigger notes on vnode reclamation. (Vnodes that
were fine at registration but are vgoned while being monitored should signal
waiters.)
Reviewed by: kib
Approved by: markj (mentor)
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D3675
the size of the name cache hash table (mapping file names to vnodes)
and the vnode hash table (mapping mount point and inode number to vnode).
An appropriate locking strategy is the key to changing hash table sizes
while they are in active use.
Reviewed by: kib
Tested by: Peter Holm
Differential Revision: https://reviews.freebsd.org/D2265
MFC after: 2 weeks
reason this didn't result in an unclean shutdown is that devfs ignores
MNT_FORCE flag.
Reviewed by: kib@
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D3467
It looks like EVFILT_READ and EVFILT_WRITE trigger under the same
conditions as poll()'s POLLRDNORM and POLLWRNORM as described by POSIX.
The only difference is that POLLRDNORM has to be triggered on regular
files unconditionally, whereas EVFILT_READ only triggers when not EOF.
Introduce a new flag, NOTE_FILE_POLL, that can be used to make
EVFILT_READ and EVFILT_WRITE behave identically to poll(). This flag
will be used by cloudlibc's poll() function.
Reviewed by: jmg
Differential Revision: https://reviews.freebsd.org/D3303
Transitions 0->1 and 1->0 (which decide e.g. on putting the vnode on the free
list) of either counter are still guarded with vnode interlock.
Reviewed by: kib (earlier version)
Tested by: pho