Commit Graph

1150 Commits

Author SHA1 Message Date
Mateusz Guzik
2d0c620272 vfs: distribute freevnodes counter per-cpu
It gets rolled up to the global when deferred requeueing is performed.
A dedicated read routine makes sure to return a value only off by a certain
amount.

This soothes a global serialisation point for all 0<->1 hold count transitions.

Reviewed by:	jeff
Differential Revision:	https://reviews.freebsd.org/D23235
2020-01-18 01:29:02 +00:00
Mateusz Guzik
1ad72b270c vfs: shorten lock hold time in vdbatch_process 2020-01-17 14:39:00 +00:00
Mateusz Guzik
66f67d5e5e vfs: increment numvnodes without the vnode list lock unless under pressure
The vnode list lock is only needed to reclaim free vnodes or kick the vnlru
thread (or to block and not miss a wake up (but note the sleep has a timeout so
this would not be a correctness issue)). Try to get away without the lock by
just doing an atomic increment.

The lock is contended e.g., during poudriere -j 104 where about half of all
acquires come from vnode allocation code.

Note the entire scheme needs a rewrite, the above just reduces it's SMP impact.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D23140
2020-01-16 21:45:21 +00:00
Mateusz Guzik
b7f50b9ad1 vfs: refcator vnode allocation
Semantics are almost identical. Some code is deduplicated and there are
fewer memory accesses.

Reviewed by:	kib, jeff
Differential Revision:	https://reviews.freebsd.org/D23158
2020-01-16 21:43:13 +00:00
Mateusz Guzik
875cfc082d vfs: reimplement vlrureclaim to actually use LRU
Take advantage of global ordering introduced in r356672.

Reviewed by:	mckusick (previous version)
Differential Revision:	https://reviews.freebsd.org/D23067
2020-01-16 10:44:02 +00:00
Mateusz Guzik
0c236d3d52 vfs: per-cpu batched requeuing of free vnodes
Constant requeuing adds significant lock contention in certain
workloads. Lessen the problem by batching it.

Per-cpu areas are locked in order to synchronize against UMA freeing
memory.

vnode's v_mflag is converted to short to prevent the struct from
growing.

Sample result from an incremental make -s -j 104 bzImage on tmpfs:
stock:   122.38s user 1780.45s system 6242% cpu 30.480 total
patched: 144.84s user 985.90s system 4856% cpu 23.282 total

Reviewed by:	jeff
Tested by:	pho (in a larger patch, previous version)
Differential Revision:	https://reviews.freebsd.org/D22998
2020-01-13 02:39:41 +00:00
Mateusz Guzik
cc3593fbd9 vfs: rework vnode list management
The current notion of an active vnode is eliminated.

Vnodes transition between 0<->1 hold counts all the time and the
associated traversal between different lists induces significant
scalability problems in certain workloads.

Introduce a global list containing all allocated vnodes. They get
unlinked only when UMA reclaims memory and are only requeued when
hold count reaches 0.

Sample result from an incremental make -s -j 104 bzImage on tmpfs:
stock:   118.55s user 3649.73s system 7479% cpu 50.382 total
patched: 122.38s user 1780.45s system 6242% cpu 30.480 total

Reviewed by:	jeff
Tested by:	pho (in a larger patch, previous version)
Differential Revision:	https://reviews.freebsd.org/D22997
2020-01-13 02:37:25 +00:00
Mateusz Guzik
57083d2576 vfs: add per-mount vnode lazy list and use it for deferred inactive + msync
This obviates the need to scan the entire active list looking for vnodes
of interest.

msync is handled by adding all vnodes with write count to the lazy list.

deferred inactive directly adds vnodes as it sets the VI_DEFINACT flag.

Vnodes get dequeued from the list when their hold count reaches 0.

Newly added MNT_VNODE_FOREACH_LAZY* macros support filtering so that
spurious locking is avoided in the common case.

Reviewed by:	jeff
Tested by:	pho (in a larger patch, previous version)
Differential Revision:	https://reviews.freebsd.org/D22995
2020-01-13 02:34:02 +00:00
Mateusz Guzik
879e0604ee Add KERNEL_PANICKED macro for use in place of direct panicstr tests 2020-01-12 06:07:54 +00:00
Mateusz Guzik
91de98e6d4 vfs: only recalculate watermarks when limits are changing
Previously they would get recalculated all the time, in particular in:
getnewvnode -> vcheckspace -> vspace
2020-01-11 23:00:57 +00:00
Mateusz Guzik
e6ae744e0e vfs: deduplicate vnode allocation logic
This creates a dedicated routine (vn_alloc) to allocate vnodes.

As a side effect code duplicationw with getnewvnode_reserve is eleminated.

Add vn_free for symmetry.
2020-01-11 22:59:44 +00:00
Mateusz Guzik
b52d50cf69 vfs: prealloc vnodes in getnewvnode_reserve
Having a reserved vnode count does not guarantee that getnewvnodes wont
block later. Said blocking partially defeats the purpose of reserving in
the first place.

Preallocate instaed. The only consumer was always passing "1" as count
and never nesting reservations.
2020-01-11 22:58:14 +00:00
Mateusz Guzik
6928306764 vfs: incomplete pass at converting more ints to u_long
Most notably numvnodes and freevnodes were u_long, but parameters used to
govern them remained as ints.
2020-01-11 22:56:20 +00:00
Mateusz Guzik
bf62296f35 vfs: add missing CLTFLA_MPSAFE annotations
This covers all kern/vfs_*.c files.
2020-01-11 22:55:12 +00:00
Mateusz Guzik
a9a047bc87 vfs: handle doomed vnodes in vdefer_inactive
vgone dooms the vnode while keeping VI_OWEINACT set and then drops the
interlock.

vputx can pick up the interlock and pass it to vdefer_inactive since the
flag is set.

The race is harmless, just don't defer anything as vgone will take care of it.

Reported by:	pho
2020-01-07 20:24:21 +00:00
Mateusz Guzik
c8b3463dd0 vfs: reimplement deferred inactive to use a dedicated flag (VI_DEFINACT)
The previous behavior of leaving VI_OWEINACT vnodes on the active list without
a hold count is eliminated. Hold count is kept and inactive processing gets
explicitly deferred by setting the VI_DEFINACT flag. The syncer is then
responsible for vdrop.

Reviewed by:	kib (previous version)
Tested by:	pho (in a larger patch, previous version)
Differential Revision:	https://reviews.freebsd.org/D23036
2020-01-07 15:56:24 +00:00
Mateusz Guzik
b7cc9d1847 vfs: trylock in vfs_msync and refactor the func
- use LK_NOWAIT instead of calling VOP_ISLOCKED before deciding to lock
- evaluate flags before looping over vnodes

Reviewed by:	kib
Tested by:	pho (in a larger patch, previous version)
Differential Revision:	https://reviews.freebsd.org/D23035
2020-01-07 15:44:19 +00:00
Mateusz Guzik
c92fe112a7 vfs: use a dedicated counter for free vnode recycling
Otherwise vlrureclaim activitity is mixed in and it is hard to tell which
vnodes got reclaimed.
2020-01-07 15:42:01 +00:00
Mateusz Guzik
cc2b586d69 vfs: prevent numvnodes and freevnodes re-reads when appropriate
Otherwise in code like this:
if (numvnodes > desiredvnodes)
	vnlru_free_locked(numvnodes - desiredvnodes, NULL);

numvnodes can drop below desiredvnodes prior to the call and if the
compiler generated another read the subtraction would get a negative
value.
2020-01-07 04:34:03 +00:00
Mateusz Guzik
37fe521a6f vfs: annotate numvnodes and vnode_free_list_mtx with __exclusive_cache_line 2020-01-07 04:30:49 +00:00
Mateusz Guzik
478368ca41 vfs: eliminate v_tag from struct vnode
There was only one consumer and it was using it incorrectly.

It is given an equivalent hack.

Reviewed by:	jeff
Differential Revision:	https://reviews.freebsd.org/D23037
2020-01-07 04:29:34 +00:00
Mateusz Guzik
a91190c63e vfs: add a helper for allocating marker vnodes 2020-01-07 04:27:40 +00:00
Mateusz Guzik
8dbc63520c vfs: drop thread argument from vinactive 2020-01-05 00:59:47 +00:00
Mateusz Guzik
867fd730c6 vfs: patch up vnode count assertions to report found value 2020-01-05 00:59:16 +00:00
Mateusz Guzik
b249ce48ea vfs: drop the mostly unused flags argument from VOP_UNLOCK
Filesystems which want to use it in limited capacity can employ the
VOP_UNLOCK_FLAGS macro.

Reviewed by:	kib (previous version)
Differential Revision:	https://reviews.freebsd.org/D21427
2020-01-03 22:29:58 +00:00
Mateusz Guzik
57db0e12c8 vfs: drop an always-false check from vlrureclaim
The vnode gets held few lines prior, making the VI_FREE condition
illegal.
2020-01-01 22:51:17 +00:00
Mateusz Guzik
eb9764615d vfs: remove production kernel checks and mp == NULL support from vdrop
1. The only place in the tree which calls getnewvnode with mp == NULL does it
for vp_crossmp which will never execute this codepath. Any vnode which legally
has ->v_mount == NULL is also doomed, which once more wont execute this code.
2. Remove an assertion for v_holdcnt from production kernels. It gets taken care
of by refcount macros in debug kernels.

Any code which would want to pass NULL mp can construct a fake one instead.

Reviewed by:	kib (previous version)
Differential Revision:	https://reviews.freebsd.org/D22722
2019-12-27 11:26:12 +00:00
Mateusz Guzik
6fa079fc3f vfs: flatten vop vectors
This eliminates the following loop from all VOP calls:

while(vop != NULL && \
    vop->vop_spare2 == NULL && vop->vop_bypass == NULL)
        vop = vop->vop_default;

Reviewed by:	jeff
Tesetd by:	pho
Differential Revision:	https://reviews.freebsd.org/D22738
2019-12-16 00:06:22 +00:00
Mateusz Guzik
ff4486e827 vfs: refactor vhold and vdrop
No fuctional changes.
2019-12-10 00:08:05 +00:00
Mateusz Guzik
abd80ddb94 vfs: introduce v_irflag and make v_type smaller
The current vnode layout is not smp-friendly by having frequently read data
avoidably sharing cachelines with very frequently modified fields. In
particular v_iflag inspected for VI_DOOMED can be found in the same line with
v_usecount. Instead make it available in the same cacheline as the v_op, v_data
and v_type which all get read all the time.

v_type is avoidably 4 bytes while the necessary data will easily fit in 1.
Shrinking it frees up 3 bytes, 2 of which get used here to introduce a new
flag field with a new value: VIRF_DOOMED.

Reviewed by:	kib, jeff
Differential Revision:	https://reviews.freebsd.org/D22715
2019-12-08 21:30:04 +00:00
Mateusz Guzik
791a24c7ea vfs: clean up vputx a little
1. replace hand-rolled macros for operation type with enum
2. unlock the vnode in vput itself, there is no need to branch on it. existence
of VPUTX_VPUT remains significant in that the inactive variant adds LK_NOWAIT
to locking request.
3. remove the useless v_usecount assertion. few lines above the checks if
v_usecount > 0 and leaves. should the value be negative, refcount would fail.
4. the CTR return vnode %p to the freelist is incorrect as vdrop may find the
vnode with holdcnt > 1. if the like should exist, it should be moved there
5. no need to error = 0 for everyone

Reviewed by:	kib, jeff (previous version)
Differential Revision:	https://reviews.freebsd.org/D22718
2019-12-08 21:13:07 +00:00
Mateusz Guzik
fd6e0c43a6 vfs: factor out vnode destruction out of vdrop
Sponsored by:	The FreeBSD Foundation
2019-12-08 21:11:25 +00:00
Mateusz Guzik
12e483e5f7 vfs: clean up delmntque similarly to vdrop r355414 2019-12-07 12:56:24 +00:00
Mateusz Guzik
4f4d9a086a vfs: catch vn_printf up with reality
- add the missing VV_VMSIZEVNLOCK and VV_READLINK flags
- add decoding v_mflag

While here sort flags.
2019-12-07 12:55:58 +00:00
Mateusz Guzik
3eeb8a1fba vfs: remove 'active' variable from _vdrop
No functional changes.
2019-12-05 13:40:10 +00:00
Mateusz Guzik
d957f3a4f0 vfs: perform a more racy check in vfs_notify_upper
Locking mp does not buy anything interms of correctness and only contributes to
contention.
2019-11-20 12:07:54 +00:00
Mateusz Guzik
1fccb43c39 vfs: change si_usecount management to count used vnodes
Currently si_usecount is effectively a sum of usecounts from all associated
vnodes. This is maintained by special-casing for VCHR every time usecount is
modified. Apart from complicating the code a little bit, it has a scalability
impact since it forces a read from a cacheline shared with said count.

There are no consumers of the feature in the ports tree. In head there are only
2: revoke and devfs_close. Both can get away with a weaker requirement than the
exact usecount, namely just the count of active vnodes. Changing the meaning to
the latter means we only need to modify it on 0<->1 transitions, avoiding the
check plenty of times (and entirely in something like vrefact).

Reviewed by:	kib, jeff
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D22202
2019-11-20 12:05:59 +00:00
Jeff Roberson
67d0e29304 Replace OBJ_MIGHTBEDIRTY with a system using atomics. Remove the TMPFS_DIRTY
flag and use the same system.

This enables further fault locking improvements by allowing more faults to
proceed with a shared lock.

Reviewed by:	kib
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D22116
2019-10-29 21:06:34 +00:00
Konstantin Belousov
c92f130498 Fix undefined behavior.
Create a sequence point by ending a full expression for call to
vspace() and use of the globals which are modified by vspace().

Reported and reviewed by:	imp
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D22126
2019-10-23 16:06:47 +00:00
Konstantin Belousov
8076c4e7d1 vn_printf(): Decode VI_TEXT_REF.
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2019-10-23 15:51:26 +00:00
Mateusz Guzik
d1cbf3eeea vfs: add MNTK_NOMSYNC
On many filesystems the traversal is effectively a no-op. Add a way to avoid
the overhead.

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22009
2019-10-13 15:40:34 +00:00
Mateusz Guzik
737241cd51 vfs: return free vnode batches in sync instead of vfs_msync
It is a more natural fit. vfs_msync only deals with active vnodes.

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22008
2019-10-13 15:39:11 +00:00
Mateusz Guzik
dc20b834ca vfs: add optional root vnode caching
Root vnodes looekd up all the time, e.g. when crossing a mount point.
Currently used routines always perform a costly lookup which can be
trivially avoided.

Reviewed by:	jeff (previous version), kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21646
2019-10-06 22:14:32 +00:00
Eric van Gyzen
e61e783b83 Add CTLFLAG_STATS to some vfs sysctl OIDs
Add CTLFLAG_STATS to the following OIDs:

vfs.altbufferflushes
vfs.recursiveflushes
vfs.barrierwrites
vfs.flushwithdeps
vfs.reassignbufcalls

Refer to r353111.

MFC after:	2 weeks
Sponsored by:	Dell EMC Isilon
2019-10-04 21:43:43 +00:00
Ed Maste
f91dd6091b simplify path handling in sysctl_try_reclaim_vnode
MAXPATHLEN / PATH_MAX includes space for the terminating NUL, and namei
verifies the presence of the NUL.  Thus there is no need to increase the
buffer size here.

The sysctl passes the string excluding the NUL, so req->newlen equal to
PATH_MAX is too long.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21876
2019-10-02 21:01:23 +00:00
Sean Eric Fagan
ba7a55d934 Add two options to allow mount to avoid covering up existing mount points.
The two options are

* nocover/cover:  Prevent/allow mounting over an existing root mountpoint.
E.g., "mount -t ufs -o nocover /dev/sd1a /usr/local" will fail if /usr/local
is already a mountpoint.
* emptydir/noemptydir:  Prevent/allow mounting on a non-empty directory.
E.g., "mount -t ufs -o emptydir /dev/sd1a /usr" will fail.

Neither of these options is intended to be a default, for historical and
compatibility reasons.

Reviewed by:	allanjude, kib
Differential Revision:	https://reviews.freebsd.org/D21458
2019-09-23 04:28:07 +00:00
Mateusz Guzik
4cace859c2 vfs: convert struct mount counters to per-cpu
There are 3 counters modified all the time in this structure - one for
keeping the structure alive, one for preventing unmount and one for
tracking active writers. Exact values of these counters are very rarely
needed, which makes them a prime candidate for conversion to a per-cpu
scheme, resulting in much better performance.

Sample benchmark performing fstatfs (modifying 2 out of 3 counters) on
a 104-way 2 socket Skylake system:
before:   852393 ops/s
after:  76682077 ops/s

Reviewed by:	kib, jeff
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21637
2019-09-16 21:37:47 +00:00
Mateusz Guzik
ee831b2543 vfs: manage mnt_lockref with atomics
See r352424.

Reviewed by:	kib, jeff
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21574
2019-09-16 21:32:21 +00:00
Mateusz Guzik
a8c8e44bf0 vfs: manage mnt_ref with atomics
New primitive is introduced to denote sections can operate locklessly
on aspects of struct mount, but which can also be disabled if necessary.
This provides an opportunity to start scaling common case modifications
while providing stable state of the struct when facing unmount, write
suspendion or other events.

mnt_ref is the first counter to start being managed in this manner with
the intent to make it per-cpu.

Reviewed by:	kib, jeff
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21425
2019-09-16 21:31:02 +00:00
Mateusz Guzik
ce3ba63f67 vfs: release usecount using fetchadd
1. If we release the last usecount we take ownership of the hold count, which
means the vnode will remain allocated until we vdrop it.
2. If someone else vrefs they will find no usecount and will proceed to add
their own hold count.
3. No code has a problem with v_usecount transitioning to 0 without the
interlock

These facts combined mean we can fetchadd instead of having a cmpset loop.

Reviewed by:	kib (previous version)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21528
2019-09-13 15:49:04 +00:00
Mateusz Guzik
68c3c1abe1 vfs: temporarily revert r351825
There are 2 problems:
- it introduces a funny bug where it can end up trylocking the same vnode [1]
- it exposes a pre-existing softdep deadlock [2]

Both are easier to run into that the bug which got fixed, so revert until
a complete solution is worked out.

Reported by:	cy [1], pho [2]
Sponsored by:	The FreeBSD Foundation
2019-09-05 18:19:51 +00:00
Mateusz Guzik
c07d4a0a68 vfs: fully hold vnodes in vnlru_free_locked
Currently the code only bumps holdcnt and clears the VI_FREE flag, not
performing actual vhold. Since the vnode is still visible elsewhere, a
potential new user can find it and incorrectly assume it is properly held.

Use vholdl instead to correctly hold the vnode. Another place recycling
(vlrureclaim) does this already.

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21522
2019-09-04 19:23:18 +00:00
Mateusz Guzik
e3c3248cc7 vfs: implement usecount implying holdcnt
vnodes have 2 reference counts - holdcnt to keep the vnode itself from getting
freed and usecount to denote it is actively used.

Previously all operations bumping usecount would also bump holdcnt, which is
not necessary. We can detect if usecount is already > 1 (in which case holdcnt
is also > 1) and utilize it to avoid bumping holdcnt on our own. This saves
on atomic ops.

Reviewed by:	kib
Tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21471
2019-09-03 15:42:11 +00:00
Mark Johnston
08cfa56ea3 Extend uma_reclaim() to permit different reclamation targets.
The page daemon periodically invokes uma_reclaim() to reclaim cached
items from each zone when the system is under memory pressure.  This
is important since the size of these caches is unbounded by default.
However it also results in bursts of high latency when allocating from
heavily used zones as threads miss in the per-CPU caches and must
access the keg in order to allocate new items.

With r340405 we maintain an estimate of each zone's usage of its
(per-NUMA domain) cache of full buckets.  Start making use of this
estimate to avoid reclaiming the entire cache when under memory
pressure.  In particular, introduce TRIM, DRAIN and DRAIN_CPU
verbs for uma_reclaim() and uma_zone_reclaim().  When trimming, only
items in excess of the estimate are reclaimed.  Draining a zone
reclaims all of the cached full buckets (the previous behaviour of
uma_reclaim()), and may further drain the per-CPU caches in extreme
cases.

Now, when under memory pressure, the page daemon will trim zones
rather than draining them.  As a result, heavily used zones do not incur
bursts of bucket cache misses following reclamation, but large, unused
caches will be reclaimed as before.

Reviewed by:	jeff
Tested by:	pho (an earlier version)
MFC after:	2 months
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D16667
2019-09-01 22:22:43 +00:00
Mateusz Guzik
c2b600f98f vfs: add a missing VNODE_REFCOUNT_FENCE_REL to v_incr_usecount_locked
Sponsored by:	The FreeBSD Foundation
2019-08-30 21:54:45 +00:00
Mateusz Guzik
3bb8d8d8c9 vfs: tidy up assertions in vfs_subr
- assert unlocked vnode interlock in vref
- assert right counts in vputx
- print debug info for panic in vdrop

Sponsored by:	The FreeBSD Foundation
2019-08-30 00:45:53 +00:00
Konstantin Belousov
6470c8d3db Rework v_object lifecycle for vnodes.
Current implementation of vnode_create_vobject() and
vnode_destroy_vobject() is written so that it prepared to handle the
vm object destruction for live vnode.  Practically, no filesystems use
this, except for some remnants that were present in UFS till today.
One of the consequences of that model is that each filesystem must
call vnode_destroy_vobject() in VOP_RECLAIM() or earlier, as result
all of them get rid of the v_object in reclaim.

Move the call to vnode_destroy_vobject() to vgonel() before
VOP_RECLAIM().  This makes v_object stable: either the object is NULL,
or it is valid vm object till the vnode reclamation.  Remove code from
vnode_create_vobject() to handle races with the parallel destruction.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D21412
2019-08-29 07:50:25 +00:00
Mateusz Guzik
1e2f0ceb2f vfs: add VOP_NEED_INACTIVE
vnode usecount drops to 0 all the time (e.g. for directories during path lookup).
When that happens the kernel would always lock the exclusive lock for the vnode
in order to call vinactive(). This blocks other threads who want to use the vnode
for looukp.

vinactive is very rarely needed and can be tested for without the vnode lock held.

This patch gives filesytems an opportunity to do it, sample total wait time for
tmpfs over 500 minutes of poudriere -j 104:

before: 557563641706 (lockmgr:tmpfs)
after:   46309603301 (lockmgr:tmpfs)

Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21371
2019-08-28 20:34:24 +00:00
Mateusz Guzik
368cabbcb5 vfs: stop passing LK_INTERLOCK to VOP_UNLOCK
The plan is to drop the flags argument. There is also a temporary bug
now that nullfs ignores the flag.

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21252
2019-08-27 20:30:56 +00:00
Mateusz Guzik
0256405e98 vfs: add vholdnz (for already held vnodes)
Reviewed by:	kib (previous version)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21358
2019-08-25 05:11:43 +00:00
Konstantin Belousov
e671edac06 De-commision the MNTK_NOINSMNTQ kernel mount flag.
After all the changes, its dynamic scope is same as for MNTK_UNMOUNT,
but to allow the syncer vnode to be re-installed on unmount failure.
But the case of syncer was already handled by using the VV_FORCEINSMQ
flag for quite some time.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-08-23 19:40:10 +00:00
Jeff Roberson
cf27e0d125 Use an atomic reference count for paging in progress so that callers do not
require the object lock.

Reviewed by:	markj
Tested by:	pho (as part of a larger branch)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21311
2019-08-19 23:09:38 +00:00
Alan Somers
e7d8ebc8ca Better comments for vlrureclaim
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2019-07-28 16:07:27 +00:00
Alan Somers
2240d8c465 Add v_inval_buf_range, like vtruncbuf but for a range of a file
v_inval_buf_range invalidates all buffers within a certain LBA range of a
file. It will be used by fusefs(5). This commit is a partial merge of
r346162, r346606, and r346756 from projects/fuse2.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21032
2019-07-28 00:48:28 +00:00
Alan Somers
46f8169aea Add a testing facility to manually reclaim a vnode
Add the debug.try_reclaim_vnode sysctl. When a pathname is written to it, it
will be reclaimed, as long as it isn't already or doomed. The purpose is to
gain test coverage for vnode reclamation, which is otherwise hard to
achieve.

Add the debug.ftry_reclaim_vnode sysctl.  It does the same thing, except
that its argument is a file descriptor instead of a pathname.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20519
2019-06-06 15:04:50 +00:00
Alan Somers
65417f5e27 Remove "struct ucred*" argument from vtruncbuf
vtruncbuf takes a "struct ucred*" argument. AFAICT, it's been unused ever
since that function was first added in r34611. Remove it.  Also, remove some
"struct ucred" arguments from fuse and nfs functions that were only used by
vtruncbuf.

Reviewed by:	cem
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20377
2019-05-24 20:27:50 +00:00
Conrad Meyer
daec92844e Include ktr.h in more compilation units
Similar to r348026, exhaustive search for uses of CTRn() and cross reference
ktr.h includes.  Where it was obvious that an OS compat header of some kind
included ktr.h indirectly, .c files were left alone.  Some of these files
clearly got ktr.h via header pollution in some scenarios, or tinderbox would
not be passing prior to this revision, but go ahead and explicitly include it
in files using it anyway.

Like r348026, these CUs did not show up in tinderbox as missing the include.

Reported by:	peterj (arm64/mp_machdep.c)
X-MFC-With:	r347984
Sponsored by:	Dell EMC Isilon
2019-05-21 20:38:48 +00:00
Konstantin Belousov
5422b0e63d Fix rw->ro remount when there is a text vnode mapping.
Reported and tested by:	hrs
Sponsored by:	The FreeBSD Foundation
MFC after:	16 days
2019-05-19 09:18:09 +00:00
Konstantin Belousov
78022527bb Switch to use shared vnode locks for text files during image activation.
kern_execve() locks text vnode exclusive to be able to set and clear
VV_TEXT flag. VV_TEXT is mutually exclusive with the v_writecount > 0
condition.

The change removes VV_TEXT, replacing it with the condition
v_writecount <= -1, and puts v_writecount under the vnode interlock.
Each text reference decrements v_writecount.  To clear the text
reference when the segment is unmapped, it is recorded in the
vm_map_entry backed by the text file as MAP_ENTRY_VN_TEXT flag, and
v_writecount is incremented on the map entry removal

The operations like VOP_ADD_WRITECOUNT() and VOP_SET_TEXT() check that
v_writecount does not contradict the desired change.  vn_writecheck()
is now racy and its use was eliminated everywhere except access.
Atomic check for writeability and increment of v_writecount is
performed by the VOP.  vn_truncate() now increments v_writecount
around VOP_SETATTR() call, lack of which is arguably a bug on its own.

nullfs bypasses v_writecount to the lower vnode always, so nullfs
vnode has its own v_writecount correct, and lower vnode gets all
references, since object->handle is always lower vnode.

On the text vnode' vm object dealloc, the v_writecount value is reset
to zero, and deadfs vop_unset_text short-circuit the operation.
Reclamation of lowervp always reclaims all nullfs vnodes referencing
lowervp first, so no stray references are left.

Reviewed by:	markj, trasz
Tested by:	mjg, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
Differential revision:	https://reviews.freebsd.org/D19923
2019-05-05 11:20:43 +00:00
Kirk McKusick
c11cbfd957 Update the main loop in the flushbuflist() routine to properly select
buffers for flushing when requested to flush both normal and extended
attributes buffers.

Sponsored by: Netflix
2019-03-11 22:42:33 +00:00
Michael Tuexen
735835ed5c Avoid overfow in vtruncbuf()
Using daddr_t instead of int avoids trunclbn to become negative when it
shouldn't.
This isssue was found by running syzkaller.

Reviewed by:		mckusick, kib, markj
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D18763
2019-01-08 09:04:27 +00:00
Kirk McKusick
c0029546f8 When loading an inode from disk, verify that its mode is valid.
If invalid, return EINVAL. Note that inode check-hashes greatly
reduce the chance that these errors will go undetected.

Reported by:  Christopher Krah <krah@protonmail.com>
Reported as:  FS-5-UFS-2: Denial Of Service in nmount-3 (ffs_read)
Reviewed by:  kib
MFC after:    1 week
Sponsored by: Netflix

M    sys/fs/ext2fs/ext2_vnops.c
M    sys/kern/vfs_subr.c
M    sys/ufs/ffs/ffs_snapshot.c
M    sys/ufs/ufs/ufs_vnops.c
2018-12-27 07:18:53 +00:00
Konstantin Belousov
6c59824b31 Properly test for vmio buffer in bnoreuselist().
The presence of allocated v_object does not imply that the buffer is
necessary VMIO kind.  Buffer might has been allocated before the
object created, then the buffer is malloced.  Although we try to avoid
such situation, it seems to be still legitimate.

Reported and tested by:	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2018-12-23 18:52:02 +00:00
Mateusz Guzik
cc426dd319 Remove unused argument to priv_check_cred.
Patch mostly generated with cocinnelle:

@@
expression E1,E2;
@@

- priv_check_cred(E1,E2,0)
+ priv_check_cred(E1,E2)

Sponsored by:	The FreeBSD Foundation
2018-12-11 19:32:16 +00:00
Mark Johnston
1436ff1ebb Typo.
X-MFC with:	r337974
2018-08-17 16:07:06 +00:00
Mark Johnston
3ccbdc8254 Add INVARIANTS-only fences around lockless vnode refcount updates.
Some internal KASSERTs access the v_iflag field without the vnode
interlock held after such a refcount update.  The fences are needed for
the assertions to be correct in the face of store reordering.

Reported and tested by:	jhibbits
Reviewed by:	kib, mjg
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D16756
2018-08-17 15:41:01 +00:00
Justin Hibbits
3f9e1fc8ee Revert r334708
This is the wrong place to put the barrier.
Requested by:	kib,mjg
2018-06-06 15:12:19 +00:00
Justin Hibbits
32c369f40c Add a memory barrier after taking a reference on the vnode holdcnt in _vhold
This is needed to avoid a race between the VNASSERT() below, and another
thread updating the VI_FREE flag, on weakly-ordered architectures.

On a 72-thread POWER9, without this barrier a 'make -j72 buildworld' would
panic on the assert regularly.

It may be possible to use a weaker barrier, and I'll investigate that once
all stability issues are worked out on POWER9.
2018-06-06 12:57:11 +00:00
Matt Macy
84482abd21 vfs: annotate variables only used by debug builds as __unused 2018-05-19 04:59:39 +00:00
Jamie Gritton
0e5c6bd436 Make it easier for filesystems to count themselves as jail-enabled,
by doing most of the work in a new function prison_add_vfs in kern_jail.c
Now a jail-enabled filesystem need only mark itself with VFCF_JAIL, and
the rest is taken care of.  This includes adding a jail parameter like
allow.mount.foofs, and a sysctl like security.jail.mount_foofs_allowed.
Both of these used to be a static list of known filesystems, with
predefined permission bits.

Reviewed by:	kib
Differential Revision:	D14681
2018-05-04 20:54:27 +00:00
Brooks Davis
6469bdcdb6 Move most of the contents of opt_compat.h to opt_global.h.
opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compabibility improvements may add over 100 more so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.

Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c.  A fake _COMPAT_LINUX option ensure opt_compat.h
is created on all architectures.

Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.

Reviewed by:	kib, cem, jhb, jtl
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D14941
2018-04-06 17:35:35 +00:00
Andriy Gapon
f4043145f2 ZFS vn_rele_async: catch up with the use of refcount(9) for the vnode use count
It's not sufficient nor required to use the vnode interlock when
checking if we are going to drop the last use count as the code in
vputx() uses refcount (atomic) operations for both checking and
decrementing the use code.  Apply the same method to vn_rele_async().
While here, remove vn_rele_inactive(), a wrapper around vrele() that
didn't add any value.

Also, the change required making vfs_refcount_release_if_not_last()
public.  I've made vfs_refcount_acquire_if_not_zero() public as well.
They are in sys/refcount.h now.  While making the move I've dropped the
vfs_ prefix.

Reviewed by:	mjg
MFC after:	2 weeks
Sponsored by:	Panzura
Differential Revision: https://reviews.freebsd.org/D14869
2018-03-28 08:55:31 +00:00
Jeff Roberson
06220fa737 Further parallelize the buffer cache.
Provide multiple clean queues partitioned into 'domains'.  Each domain manages
its own bufspace and has its own bufspace daemon.  Each domain has a set of
subqueues indexed by the current cpuid to reduce lock contention on the cleanq.

Refine the sleep/wakeup around the bufspace daemon to use atomics as much as
possible.

Add a B_REUSE flag that is used to requeue bufs during the scan to approximate
LRU rather than locking the queue on every use of a frequently accessed buf.

Implement bufspace_reserve with only atomic_fetchadd to avoid loop restarts.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	Netflix, Dell/EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D14274
2018-02-20 00:06:07 +00:00
Kirk McKusick
5a35a04255 One of the vnode fields listed by vn_printf is the union of pointers
whose type depends on the type of vnode. Correct vn_printf so that
it correctly identifies the name of the pointer that it is printing.

Submitted by: Andreas Longwitz <longwitz at incore.de>
MFC after: 1 week
2018-01-31 22:49:50 +00:00
Mateusz Guzik
31c2c6e95e vfs: tidy up vdrop
Skip vfs_refcount_release_if_not_last if the interlock is held and just
go straight to refcount_release.

While here do cosmetic rearrangement of _vhold to better show it contains
equivalent behaviour.
2018-01-12 13:39:02 +00:00
Eitan Adler
caa7e52f3f kernel: Fix several typos and minor errors
- duplicate words
- typos
- references to old versions of FreeBSD

Reviewed by:	imp, benno
2017-12-27 03:23:21 +00:00
Alexander Kabaev
151ba7933a Do pass removing some write-only variables from the kernel.
This reduces noise when kernel is compiled by newer GCC versions,
such as one used by external toolchain ports.

Reviewed by: kib, andrew(sys/arm and sys/arm64), emaste(partial), erj(partial)
Reviewed by: jhb (sys/dev/pci/* sys/kern/vfs_aio.c and sys/kern/kern_synch.c)
Differential Revision: https://reviews.freebsd.org/D10385
2017-12-25 04:48:39 +00:00
Pedro F. Giffuni
51369649b0 sys: further adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
2017-11-20 19:43:44 +00:00
Mark Johnston
a3e8a25a52 Avoid the nbp lookup in the final loop iteration in flushbuflist().
The end of the loop must re-lookup the next buf since the bufobj lock
is dropped in the loop body. If the lookup fails, the loop is restarted.
This mechanism non-obviously also terminates the loop when the end of
the buf list is reached. Split up the two loops termination cases to
make the code a bit less fragile. No functional change intended.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D12730
2017-10-20 14:56:13 +00:00
Mark Johnston
fa00affd18 Fix a racy VI_DOOMED check in MNT_VNODE_FOREACH_ALL().
MNT_VNODE_FOREACH_ALL() is supposed to avoid returning doomed vnodes,
but the VI_DOOMED check it used was done without the vnode interlock
held, so it could race with a concurrent vgone().

Submitted by:	Don Morris <don.morris@isilon.com>
Reviewed by:	kib, mckusick
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D12704
2017-10-17 19:41:45 +00:00
Konstantin Belousov
5bf949377e For unlinked files, do not msync(2) or sync on the vnode deactivation.
One consequence of the patch is that msyncing unlinked file mappings
no longer reduces the amount of the dirty memory in the system, but I
do not think that there are users of msync(2) that utilize it for such
side-effect.

Reported and tested by:	tjil
PR:	222356
Reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D12411
2017-09-19 16:46:37 +00:00
Bryan Drewery
8359a6b7b3 Allow vdrop() of a vnode not yet on the per-mount list after r306512.
The old code allowed calling vdrop() before insmntque() to place the vnode back
onto the freelist for later recycling.  Some downstream consumers may rely on
this support.  Normally insmntque() failing is fine since is uses vgone() and
immediately frees the vnode rather than attempting to add it to the freelist if
vdrop() were used instead.

Also assert that vhold() cannot be used on such a vnode.

Reviewed by:	kib, cem, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D12126
2017-08-28 19:29:51 +00:00
Konstantin Belousov
b59ea73029 Allow vinvalbuf() to operate with the shared vnode lock.
This mode allows other clean buffers to arrive while we flush the buf
lists for the vnode, which is fine for the targeted use.  We only need
that all buffers existed at the time of the function start were
flushed.  In fact, only one assert has to be relaxed.

In collaboration with:	pho
Reviewed by:	rmacklem
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
X-Differential revision:	https://reviews.freebsd.org/D12083
2017-08-20 10:07:45 +00:00
Gleb Smirnoff
0c3c207ffd For UNIX sockets make vnode point not to the socket, but to the UNIX PCB,
since the latter is the thing that links together VFS and sockets.

While here, make the union in the struct vnode anonymous.
2017-06-02 17:31:25 +00:00
Konstantin Belousov
391aba32e6 mnt_vnode_next_active: use conventional lock order when trylock fails.
Previously, when the VI_TRYLOCK failed, we would spin under the mutex
that protects the vnode active list until we either succeeded or
noticed that we had hogged the CPU. Since we were violating the lock
order, this would guarantee that we would become a hog under any
deadlock condition (e.g. a race with vdrop(9) on the same vnode). In
the presence of many concurrent threads in sync(2) or vdrop etc, the
victim could hang for a long time.

Now, avoid spinning by dropping and reacquiring the locks in the
conventional lock order when the trylock fails. This requires a dance
with the vnode hold count.

Submitted by:	Tom Rix <trix@juniper.net>
Tested by:	pho
Differential revision:	https://reviews.freebsd.org/D10692
2017-05-15 10:02:45 +00:00
Konstantin Belousov
0226f65940 Add V_VMIO flag for vinvalbuf(9) to indicate that the flush request
was issued during VM-initiated i/o (pageout), so that the function
does not try to flush or remove pages or wait for the vm object
paging-in-progress counter.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
X-Differential revision:	https://reviews.freebsd.org/D10241
2017-04-05 16:57:53 +00:00
Brooks Davis
db3625531e Correct a kernel stack leak in 32-bit compat when vfc_name is short.
Don't zero unused pointer members again.

Per discussion with secteam we are not issuing an advisory for this
issue as we have no current evidence it leaks exploitable information.

Reviewed by:	rwatson, glebius, delphij
MFC after:	1 day
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D10227
2017-04-04 17:32:08 +00:00
Ian Lepore
e3f87f6c70 Change 'Hz' back to 'HZ'... it's referring to the kernel config option
named HZ, not being used as an abbreviation of the unit of measure.
2017-03-12 18:07:03 +00:00
Ian Lepore
8a3966405e Correct the abbreviations for microseconds (us, not ms), and for Hz (not HZ). 2017-03-12 17:43:45 +00:00
Mateusz Guzik
2d78a5531e vfs: use atomic_fcmpset in vfs_refcount_* 2017-02-05 03:23:16 +00:00
Edward Tomasz Napierala
8acac5a9f5 Improve debugging printf. 2017-01-22 15:27:14 +00:00
Mateusz Guzik
067115e050 vfs: hide the getvnode NULL mp message behind DIAGNOSTIC
Since crossmp vnode changes the message was being printed on each boot.

Reported by:	trasz
Discussed with:	kib
2017-01-21 16:59:50 +00:00
Mateusz Guzik
41b0046a4d vfs: switch nodes_created, recycles_count and free_owe_inact to counter(9)
Reviewed by:	kib
2016-12-31 19:59:31 +00:00
Mateusz Guzik
5afb134c32 vfs: add vrefact, to be used when the vnode has to be already active
This allows blind increment of relevant counters which under contention
is cheaper than inc-not-zero loops at least on amd64.

Use it in some of the places which are guaranteed to see already active
vnodes.

Reviewed by:	kib (previous version)
2016-12-12 15:37:11 +00:00
Mark Johnston
64910ddbff Launder VPO_NOSYNC pages upon vnode deactivation.
As of r234483, vnode deactivation causes non-VPO_NOSYNC pages to be
laundered. This behaviour has two problems:

1. Dirty VPO_NOSYNC pages must be laundered before the vnode can be
   reclaimed, and this work may be unfairly deferred to the vnlru process
   or an unrelated application when the system is under vnode pressure.
2. Deactivation of a vnode with dirty VPO_NOSYNC pages requires a scan of
   the corresponding VM object's memq for non-VPO_NOSYNC dirty pages; if
   the laundry thread needs to launder pages from an unreferenced such
   vnode, it will reactivate and deactivate the vnode with each laundering,
   potentially resulting in a large number of expensive scans.

Therefore, ensure that all dirty pages are laundered upon deactivation,
i.e., when all maps of the vnode are removed and all references are
released.

Reviewed by:	alc, kib
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D8641
2016-11-26 21:00:27 +00:00
Mateusz Guzik
c6c44ff7eb vfs: clear the tmp free list flag before taking the free vnode list lock
Safe access is already guaranteed because of the mnt_listmx lock.
2016-10-08 13:36:59 +00:00
Bryan Drewery
32641585a9 vrefl: Assert that the interlock is held.
Sponsored by:	Dell EMC Isilon
MFC after:	2 weeks
2016-10-06 18:10:19 +00:00
Bryan Drewery
5a22c9582c Add vrecyclel() to vrecycle() a vnode with the interlock already held.
Obtained from:	OneFS
Sponsored by:	Dell EMC Isilon
MFC after:	2 weeks
2016-10-06 18:09:22 +00:00
Bryan Drewery
0617f64ec6 Correct some comments after r294299.
Sponsored by:	Dell EMC Isilon
2016-10-04 21:44:20 +00:00
Mateusz Guzik
5bb81f9b2d vfs: batch free vnodes in per-mnt lists
Previously free vnodes would always by directly returned to the global
LRU list. With this change up to mnt_free_list_batch vnodes are collected
first.

syncer runs always return the batch regardless of its size.

While vnodes on per-mnt lists are not counted as free, they can be
returned in case of vnode shortage.

Reviewed by:	kib
Tested by:	pho
2016-09-30 17:27:17 +00:00
Mateusz Guzik
8660b707ff vfs: remove the __bo_vnode field from struct vnode
The pointer can be obtained using __containerof instead.

Reviewed by:	kib
2016-09-30 17:11:03 +00:00
Ed Maste
69a2875821 Renumber license clauses in sys/kern to avoid skipping #3 2016-09-15 13:16:20 +00:00
Edward Tomasz Napierala
f83cc0aae4 Print vnode details when vnode locking assertion gets triggered.
MFC after:	1 month
2016-08-12 22:20:52 +00:00
Edward Tomasz Napierala
411455a8fb Replace all remaining calls to vprint(9) with vn_printf(9), and remove
the old macro.

MFC after:	1 month
2016-08-10 16:12:31 +00:00
Edward Tomasz Napierala
7b255097eb Remove unused - never actually implemented - vnode lock types
from vnode_if.src.

MFC after:	1 month
2016-08-04 13:45:18 +00:00
Konstantin Belousov
725496ced5 Fix grammar.
Submitted by:	alc
MFC after:	2 weeks
2016-07-11 17:04:22 +00:00
Konstantin Belousov
19efd8a5a8 In vgonel(), postpone setting BO_DEAD until VOP_RECLAIM() is called,
if vnode is VMIO.  For VMIO vnodes, set BO_DEAD in vm_object_terminate().

The vnode_destroy_object(), when calling into vm_object_terminate(),
must be able to flush buffers.  BO_DEAD purpose is to quickly destroy
buffers on write when the underlying vnode is not operable any more
(one example is the devfs node after geom is gone).  Setting BO_DEAD
for reclaiming vnode before object is terminated is premature, and
results in unability to flush buffers with live SU dependencies from
vinvalbuf() in vm_object_terminate().

Reported by:	David Cross <dcrosstech@gmail.com>
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-07-11 14:19:09 +00:00
Konstantin Belousov
d12d6a8476 Remove racy assert. The thread which changes vnode usecount from 0 to 1
does it under the vnode interlock, but the interlock is not owned by the
asserting thread.  As result, we might read increased use counter but also
still see VI_OWEINACT.

In collaboration with: nwhitehorn
Hardware donated by: IBM LTC
Sponsored by:	The FreeBSD Foundation (kib)
Approved by:	re (gjb)
2016-07-03 01:56:48 +00:00
Konstantin Belousov
d9a503bec4 Fix typo. Note that atomic is still required even for interlocked case.
Sponsored by:	The FreeBSD Foundation
Approved by:	re (marius)
2016-06-20 15:45:50 +00:00
Mateusz Guzik
e896fb3bae vfs: ifdef out noop vop_* primitives on !DEBUG_VFS_LOCKS kernels
This removes calls to empty functions like vop_lock_{pre/post} from
common vfs routines.

Approved by:	re (gjb)
2016-06-17 19:41:30 +00:00
Konstantin Belousov
f8a75278dc Add VFS interface to flush specified amount of free vnodes belonging
to mount points with the given filesystem type, specified by mount
vfs_ops pointer.

Based on patch by:	mckusick
Reviewed by:	avg, mckusick
Tested by:	allanjude, madpilot
Sponsored by:	The FreeBSD Foundation
Approved by:	re (gjb)
2016-06-17 17:33:25 +00:00
Edward Tomasz Napierala
f7bd221730 Cosmetics - add missing space after ellipses in shutdown messages.
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2016-05-31 15:27:33 +00:00
Andriy Gapon
27d4b35f6e vfs_read_dirent: increment ncookies after adding a cookie
It seems that at present vfs_read_dirent() is used only with filesystems
that do not support cookies, so the bug never manifested itself.

MFC after:	1 week
2016-05-16 07:31:11 +00:00
Konstantin Belousov
c89e1b8739 Add EVFILT_VNODE open, read and close notifications.
While there, order EVFILT_VNODE notes descriptions alphabetically.

Based on submission, and tested by:	Vladimir Kondratyev <wulf@cicgroup.ru>
MFC after:	2 weeks
2016-05-03 15:17:43 +00:00
Konstantin Belousov
f7b71c8a5b Issue NOTE_EXTEND when a directory entry is added to or removed from
the monitored directory as the result of rename(2) operation.  The
renames staying in the directory are not reported.

Submitted by:	Vladimir Kondratyev <wulf@cicgroup.ru>
MFC after:	2 weeks
2016-05-02 13:18:17 +00:00
Konstantin Belousov
bd2ead6b2e Fix reporting of NOTE_LINK when directory link count changes due to
rename removing or adding subdirectory entry.

Discussed with and tested by:	Vladimir Kondratyev <wulf@cicgroup.ru>
NetBSD PR:	48958 (http://gnats.netbsd.org/48958)
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2016-05-02 13:13:32 +00:00
Pedro F. Giffuni
e3043798aa sys/kern: spelling fixes in comments.
No functional change.
2016-04-29 22:15:33 +00:00
Pedro F. Giffuni
55e0987aea sys: extend use of the howmany() macro when available.
We have a howmany() macro in the <sys/param.h> header that is
convenient to re-use as it makes things easier to read.
2016-04-26 15:38:17 +00:00
Konstantin Belousov
0791e0c0e7 Provide more correct sizing of the KVA consumed by a vnode, used by
the virtvnodes calculation.  Include the size of fs-specific v_data as
the nfs nclnode inline, the NFS nclnode is bigger than either ZFS
znode or UFS inode.  Include the size of namecache_ts and short cache
path element, multiplied by the name cache population factor, again
inline.

Inline defines are used to avoid pollution of the vnode.h with the
subsystem-private objects.  Non-significant unsynchronized changes of
the definitions are fine, we do not care about that precision, and
e.g. ZFS consumes much malloced memory per vnode for reasons
unaccounted in the formula.

Lower the partition of kmem dedicated to vnodes, from 1/7 to 1/10.

The measures reduce vnode cache pressure on kmem and bring the vnode
cache memory use below some apparent thresholds that were exceeded by
r291244 due to more robust vnode reuse.

Reported and tested by:	marius (i386, previous version)
Reviewed by:	bde
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2016-02-24 15:15:46 +00:00
Konstantin Belousov
fa48f413ef In bnoreuselist(), check both ends of the specified logical block
numbers range.

This effectively skips indirect and extdata blocks on the buffer
queue.  Since their logical block numbers are negative, bnoreuselist()
could loop infinitely.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2016-02-17 19:39:57 +00:00
Mark Johnston
793c381706 Add vrefl(), a locked variant of vref(9).
This API has no in-tree consumers at the moment but is useful to at least
one out-of-tree consumer, and naturally complements existing vnode refcount
functions (vholdl(9), vdropl(9)).

Obtained from:	kib (sys/ portion)
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D4947
Differential Revision:	https://reviews.freebsd.org/D4953
2016-01-18 22:21:46 +00:00
Konstantin Belousov
1041e09089 Two fixes for excessive iterations after r292326.
Advance the logical block number to the lblkno of the found block plus
one, instead of incrementing the block number which was used for
lookup.  This change skips sparcely populated buffer ranges, similar
to r292325, instead of doing useless lookups.

Do not restart the bnoreuselist() from the start of the range if
buffer lock cannot be obtained without sleep.  Only retry lookup and
lock for the same queue and same logical block number.

Reported by:	benno
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2016-01-05 14:48:40 +00:00
Konstantin Belousov
106ebb761a Optimize vop_stdadvise(POSIX_FADV_DONTNEED). Instead of looking up a
buffer for each block number in the range with gbincore(), look up the
next instantiated buffer with the logical block number which is
greater or equal to the next lblkno.  This significantly speeds up the
iteration for sparce-populated range.

Move the iteration into new helper bnoreuselist(), which is structured
similarly to flushbuflist().

Reported and tested by:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
2015-12-16 08:48:37 +00:00
Konstantin Belousov
8549b4b9fe Simplify the loop step in the flushbuflist() and make it independed on
the type stability of the buffers memory.  Instead of memoizing
pointer to the next buffer and validating it, remember the next
logical block number in the bo list and re-lookup.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2015-12-16 08:39:51 +00:00
Kirk McKusick
d9ea698c75 We need to zero out the clustering variables in a freed vnode structure.
For completeness add a VNASSERT that there are no threads waiting on a
range lock (this was previously checked on every vnode free).

Reported by; Rick Macklem
Fix from:    Mateusz Guzik
PR:          204949
2015-12-04 03:54:18 +00:00
Kirk McKusick
003a7c2b01 We need to zero out the union of pointers in a freed vnode structure.
PR:        204949
Fix from:  Mateusz Guzik
Tested by: Jason Unovitch
2015-12-03 02:04:22 +00:00
Kirk McKusick
41d4f10391 As the kernel allocates and frees vnodes, it fully initializes them
on every allocation and fully releases them on every free.  These
are not trivial costs: it starts by zeroing a large structure then
initializes a mutex, a lock manager lock, an rw lock, four lists,
and six pointers. And looking at vfs.vnodes_created, these operations
are being done millions of times an hour on a busy machine.

As a performance optimization, this code update uses the uma_init
and uma_fini routines to do these initializations and cleanups only
as the vnodes enter and leave the vnode_zone. With this change the
initializations are only done kern.maxvnodes times at system startup
and then only rarely again. The frees are done only if the vnode_zone
shrinks which never happens in practice. For those curious about the
avoided work, look at the vnode_init() and vnode_fini() functions in
kern/vfs_subr.c to see the code that has been removed from the main
vnode allocation/free path.

Reviewed by: kib
Tested by:   Peter Holm
2015-11-29 21:42:26 +00:00
Konstantin Belousov
f186a80d8a Remove VI_AGE vnode iflag, it is unused.
Noted by:	bde
Sponsored by:	The FreeBSD Foundation
2015-11-27 01:45:40 +00:00
Konstantin Belousov
b3162b45e1 Move the comment about resident pages preventing vnode from leaving
active list, into the header comment for vdrop(), which is the
function that decides whether to leave the vnode on the list.  Note
that dirty page write-out in vinactive() is asynchronous.

Discussed with:	alc
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-11-27 01:16:35 +00:00
Konstantin Belousov
547831b6fd Rework the vnode cache recycling to meet free and unused vnodes
targets.  See the comment above wantfreevnodes variable for the
description of the algorithm.

The vfs.vlru_alloc_cache_src sysctl is removed.  New code frees
namecache sources as the last chance to satisfy the highest watermark,
instead of selecting the source vnodes randomly. This provides good
enough behaviour to keep vn_fullpath() working in most situations.
The filesystem layout with deep trees, where the removed knob was
required, is thus handled automatically.

Submitted by:	bde
Discussed with:	mckusick
Tested by:	pho
MFC after:	1 month
2015-11-24 09:45:36 +00:00
Gleb Smirnoff
09c837b897 Remove remnants of the old NFS from vnode pager.
Reviewed by:	kib
Sponsored by:	Netflix
2015-11-20 23:52:27 +00:00
Mark Johnston
0a805de6f3 Remove a check for a condition that is always false by a preceding KASSERT
that was added in r144704.
2015-09-26 22:26:55 +00:00
Mark Johnston
d925c2e800 Fix argument ordering in vn_printf().
MFC after:	3 days
2015-09-26 22:16:54 +00:00
Conrad Meyer
55d33667ee kevent(2): Note DOOMED vnodes with NOTE_REVOKE
In poll mode, check for and wake VBAD vnodes.  (Vnodes that are VBAD at
registration will never be woken by the RECLAIM trigger.)

Add post-VOP_RECLAIM hook to trigger notes on vnode reclamation.  (Vnodes that
were fine at registration but are vgoned while being monitored should signal
waiters.)

Reviewed by:	kib
Approved by:	markj (mentor)
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D3675
2015-09-15 20:22:30 +00:00
Kirk McKusick
17518b1a2b Track changes to kern.maxvnodes and appropriately increase or decrease
the size of the name cache hash table (mapping file names to vnodes)
and the vnode hash table (mapping mount point and inode number to vnode).
An appropriate locking strategy is the key to changing hash table sizes
while they are in active use.

Reviewed by: kib
Tested by:   Peter Holm
Differential Revision: https://reviews.freebsd.org/D2265
MFC after:   2 weeks
2015-09-06 05:50:51 +00:00
Edward Tomasz Napierala
c9ba65040f Make vfs_unmountall() unmount /dev after /, not before. The only
reason this didn't result in an unclean shutdown is that devfs ignores
MNT_FORCE flag.

Reviewed by:	kib@
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D3467
2015-08-24 13:18:13 +00:00
Edward Tomasz Napierala
6e572e084b After r286237 it should be fine to call vgone(9) on a busy GEOM vnode;
remove KASSERT that would prevent forced devfs unmount from working.

MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2015-08-23 14:53:54 +00:00
Ed Schouten
2433a4eb04 Make it possible to implement poll(2) on top of kqueue(2).
It looks like EVFILT_READ and EVFILT_WRITE trigger under the same
conditions as poll()'s POLLRDNORM and POLLWRNORM as described by POSIX.
The only difference is that POLLRDNORM has to be triggered on regular
files unconditionally, whereas EVFILT_READ only triggers when not EOF.

Introduce a new flag, NOTE_FILE_POLL, that can be used to make
EVFILT_READ and EVFILT_WRITE behave identically to poll(). This flag
will be used by cloudlibc's poll() function.

Reviewed by:	jmg
Differential Revision:	https://reviews.freebsd.org/D3303
2015-08-05 07:34:29 +00:00
Edward Tomasz Napierala
57a73b26e0 Mark vgonel() as static. It was already declared static earlier;
no idea why compilers don't warn about this.

MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2015-08-04 08:51:56 +00:00
Mateusz Guzik
752fc07d33 vfs: implement v_holdcnt/v_usecount manipulation using atomic ops
Transitions 0->1 and 1->0 (which decide e.g. on putting the vnode on the free
list) of either counter are still guarded with vnode interlock.

Reviewed by:	kib (earlier version)
Tested by:	pho
2015-07-16 13:57:05 +00:00