Commit Graph

306 Commits

Author SHA1 Message Date
Mateusz Guzik
b403aa126e cache: add NCF_WIP flag
This allows making half-constructed entries visible to the lockless lookup,
which now can check for either "not yet fully constructed" and "no longer valid"
state.

This will be used for .. lookup.
2020-08-04 23:07:00 +00:00
Mateusz Guzik
6e10434c02 cache: add cache_purge_vgone
cache_purge locklessly checks whether the vnode at hand has any namecache
entries. This can race with a concurrent purge which managed to remove
the last entry, but may not be done touching the vnode.

Make sure we observe the relevant vnode lock as not taken before proceeding
with vgone.

Paired with the fact that doomed vnodes cannnot receive entries this restores
the invariant that there are no namecache-related writing users past cache_purge
in vgone.

Reported by:	pho
2020-08-04 23:04:29 +00:00
Mateusz Guzik
1164f7a566 cache: factor away failed vexec handling 2020-08-04 19:55:26 +00:00
Mateusz Guzik
0439b00ea8 cache: assorted tidy ups 2020-08-04 19:55:00 +00:00
Mateusz Guzik
18bd02e2ce cache: factor away lockless dot lookup and add missing stat + sdt probe 2020-08-04 19:54:37 +00:00
Mateusz Guzik
17a66c7087 vfs: add vfs_op_thread_enter/exit _crit variants
and employ them in the namecache. Eliminates all spurious checks for preemption.
2020-08-04 19:54:10 +00:00
Mateusz Guzik
0311b05fec cache: add missing numcache detrement on insertion failure 2020-08-04 19:52:52 +00:00
Mateusz Guzik
7ad2f1105e vfs: store precomputed namecache hash in the vnode
This significantly speeds up path lookup, Cascade Lake doing access(2) on ufs
on /usr/obj/usr/src/amd64.amd64/sys/GENERIC/vnode_if.c, ops/s:
before: 2535298
after: 2797621

Over +10%.

The reversed order of computation here does not seem to matter for hash
distribution.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D25921
2020-08-02 20:02:06 +00:00
Mateusz Guzik
838984de32 vfs: move namecache initialisation into cache_vnode_init 2020-08-02 19:42:06 +00:00
Mateusz Guzik
8a7ec17095 cache: reshuffle struct cache_fpl and nameidata_saved
Shaves 16 bytes.
2020-08-01 06:35:18 +00:00
Mateusz Guzik
5a3944334c cache: mark climb_mount as __noinline 2020-08-01 06:34:18 +00:00
Mateusz Guzik
cb90ef2875 cache: drop the useless numchecks counter 2020-07-30 22:52:18 +00:00
Mateusz Guzik
404927357d vfs: add support for WANTPARENT and LOCKPARENT to lockless lookup
This makes the realpath syscall operational with the new lookup. Note that the
walk to obtain the full path name still takes locks.

Tested by:      pho
Differential Revision:	https://reviews.freebsd.org/D23917
2020-07-30 15:45:11 +00:00
Mateusz Guzik
8230d29357 vfs: support negative entry promotion in lockless lookup
Tested by:	pho
2020-07-30 15:44:10 +00:00
Mateusz Guzik
4057e3eaaa vfs: add NOMACCHECK and AUDITVNODE2 to lockless lookup
They are both nops since lookup does not progress with either mac or audit enabled.

Tested by:	pho
2020-07-30 15:43:16 +00:00
Mateusz Guzik
9dbd12fb52 vfs: add support for !LOCKLEAF to lockless lookup
Tested by:      pho (in a patchset)
Differential Revision:	https://reviews.freebsd.org/D23916
2020-07-25 10:40:38 +00:00
Mateusz Guzik
c42b77e694 vfs: lockless lookup
Provides full scalability as long as all visited filesystems support the
lookup and terminal vnodes are different.

Inner workings are explained in the comment above cache_fplookup.

Capabilities and fd-relative lookups are not supported and will result in
immediate fallback to regular code.

Symlinks, ".." in the path, mount points without support for lockless lookup
and mismatched counters will result in an attempt to get a reference to the
directory vnode and continue in regular lookup. If this fails, the entire
operation is aborted and regular lookup starts from scratch. However, care is
taken that data is not copied again from userspace.

Sample benchmark:
incremental -j 104 bzImage on tmpfs:
before: 142.96s user 1025.63s system 4924% cpu 23.731 total
after: 147.36s user 313.40s system 3216% cpu 14.326 total

Sample microbenchmark: access calls to separate files in /tmpfs, 104 workers, ops/s:
before:   2165816
after:  151216530

Reviewed by:    kib
Tested by:      pho (in a patchset)
Differential Revision:	https://reviews.freebsd.org/D25578
2020-07-25 10:37:15 +00:00
Mateusz Guzik
29f3e5ea41 cache: make negative shrinker round robin on all lists every time
Previously it would check 4, 3, 2, 1 lists. In practice by the time
it is getting called all lists have some elements and consequently
this does not result in new evictions.

Nonetheless, the code is clearer.

Tested by:	pho
2020-07-14 21:19:33 +00:00
Mateusz Guzik
a110fa2ee1 cache: remove numcalls
The counter is not very useful and if necessary the value can be
found by summing up other counters.
2020-07-14 21:17:46 +00:00
Mateusz Guzik
4516c7eed9 cache: count dropped entries 2020-07-14 21:17:08 +00:00
Mateusz Guzik
654e644e80 cache: remove neg_locked argument from cache_zap_locked
Tested by:	pho
2020-07-14 21:16:48 +00:00
Mateusz Guzik
ffb0abddf1 cache: remove a useless argument from cache_negative_insert 2020-07-14 21:16:07 +00:00
Mateusz Guzik
9f8d452173 cache: create a dedicate struct for negative entries
.. and stuff if into the unused target vnode field

This gets rid of concurrent nc_flag modifications racing with the
shrinker and consequently fixes a bug where such a change could have
been missed when cache_ncp_invalidate was being issued..

Reported by:	zeising
Tested by:	pho, zeising
Fixes:	r362828 ("cache: lockless forward lookup with smr")
2020-07-14 21:14:59 +00:00
Mateusz Guzik
d23850207b cache: add missing call to cache_ncp_invalid for negative hits
Note the dtrace probe can fire even the entry is gone, but I don't think that's
worth fixing.
2020-07-02 12:56:20 +00:00
Mateusz Guzik
d129e0eba0 cache: fix misplaced fence in cache_ncp_invalidate
The intent was to mark the entry as invalid before cache_zap starts messing
with it.

While here add some comments.
2020-07-02 12:54:50 +00:00
Mateusz Guzik
5d1c042d32 cache: lockless forward lookup with smr
This eliminates the need to take bucket locks in the common case.

Concurrent lookup utilizng the same vnodes is still bottlenecked on referencing
and locking path components, this will be taken care of separately.

Reviewed by:	kib
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D23913
2020-07-01 05:59:08 +00:00
Mark Johnston
d869a17e62 Use COUNTER_U64_DEFINE_EARLY() in places where it simplifies things.
Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D23978
2020-03-06 19:10:00 +00:00
Mateusz Guzik
8d03b99b9d fd: move vnodes out of filedesc into a dedicated structure
The new structure is copy-on-write. With the assumption that path lookups are
significantly more frequent than chdirs and chrooting this is a win.

This provides stable root and jail root vnodes without the need to reference
them on lookup, which in turn means less work on globally shared structures.
Note this also happens to fix a bug where jail vnode was never referenced,
meaning subsequent access on lookup could run into use-after-free.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D23884
2020-03-01 21:53:46 +00:00
Pawel Biernacki
7029da5c36 Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE.  All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT

Approved by:	kib (mentor, blanket)
Commented by:	kib, gallatin, melifaro
Differential Revision:	https://reviews.freebsd.org/D23718
2020-02-26 14:26:36 +00:00
Mateusz Guzik
0573d0a9b8 vfs: add realpathat syscall
realpath(3) is used a lot e.g., by clang and is a major source of getcwd
and fstatat calls. This can be done more efficiently in the kernel.

This works by performing a regular lookup while saving the name and found
parent directory. If the terminal vnode is a directory we can resolve it using
usual means. Otherwise we can use the name saved by lookup and resolve the
parent.

See the review for sample syscall counts.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D23574
2020-02-20 16:58:19 +00:00
Kyle Evans
6a5abb1ee5 Provide O_SEARCH
O_SEARCH is defined by POSIX [0] to open a directory for searching, skipping
permissions checks on the directory itself after the initial open(). This is
close to the semantics we've historically applied for O_EXEC on a directory,
which is UB according to POSIX. Conveniently, O_SEARCH on a file is also
explicitly undefined behavior according to POSIX, so O_EXEC would be a fine
choice. The spec goes on to state that O_SEARCH and O_EXEC need not be
distinct values, but they're not defined to be the same value.

This was pointed out as an incompatibility with other systems that had made
its way into libarchive, which had assumed that O_EXEC was an alias for
O_SEARCH.

This defines compatibility O_SEARCH/FSEARCH (equivalent to O_EXEC and FEXEC
respectively) and expands our UB for O_EXEC on a directory. O_EXEC on a
directory is checked in vn_open_vnode already, so for completeness we add a
NOEXECCHECK when O_SEARCH has been specified on the top-level fd and do not
re-check that when descending in namei.

[0] https://pubs.opengroup.org/onlinepubs/9699919799/

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D23247
2020-02-02 16:34:57 +00:00
Mateusz Guzik
7739d92766 cache: replace kern___getcwd with vn_getcwd
The previous routine was resulting in extra data copies most notably in
linux_getcwd.
2020-02-01 20:38:38 +00:00
Mateusz Guzik
921e7210f8 cache: return the total length from vn_fullpath1
This removes strlen from getcwd.
2020-02-01 20:37:11 +00:00
Mateusz Guzik
4511dd9d41 cache: remove vnode -> path lookup disablement
It seems to be of little to no use even when debugging.

Interested parties can resurrect it and gate compilation with a macro.
2020-02-01 20:36:35 +00:00
Mateusz Guzik
45757984f8 vfs: consistently use size_t for buflen around VOP_VPTOCNP 2020-02-01 20:34:43 +00:00
Mateusz Guzik
6403455301 cache: revert r352613 now that vhold does not take locks 2020-01-20 19:52:23 +00:00
Mateusz Guzik
8bba93c7e0 cache: make numcachehv use counter(9) on all archs
Requested by:	kib
2020-01-20 14:42:11 +00:00
Mateusz Guzik
059cb4843b cache: counter_u64_add_protected -> counter_u64_add
Fixes booting on RISC-V where it does happen to not be equivalent.

Reported by:	lwhsu
2020-01-19 17:05:26 +00:00
Mateusz Guzik
1399033590 cache: convert numcachehv to counter(9) on 64-bit platforms 2020-01-19 05:37:27 +00:00
Mateusz Guzik
6928306764 vfs: incomplete pass at converting more ints to u_long
Most notably numvnodes and freevnodes were u_long, but parameters used to
govern them remained as ints.
2020-01-11 22:56:20 +00:00
Mateusz Guzik
b249ce48ea vfs: drop the mostly unused flags argument from VOP_UNLOCK
Filesystems which want to use it in limited capacity can employ the
VOP_UNLOCK_FLAGS macro.

Reviewed by:	kib (previous version)
Differential Revision:	https://reviews.freebsd.org/D21427
2020-01-03 22:29:58 +00:00
Mateusz Guzik
abd80ddb94 vfs: introduce v_irflag and make v_type smaller
The current vnode layout is not smp-friendly by having frequently read data
avoidably sharing cachelines with very frequently modified fields. In
particular v_iflag inspected for VI_DOOMED can be found in the same line with
v_usecount. Instead make it available in the same cacheline as the v_op, v_data
and v_type which all get read all the time.

v_type is avoidably 4 bytes while the necessary data will easily fit in 1.
Shrinking it frees up 3 bytes, 2 of which get used here to introduce a new
flag field with a new value: VIRF_DOOMED.

Reviewed by:	kib, jeff
Differential Revision:	https://reviews.freebsd.org/D22715
2019-12-08 21:30:04 +00:00
Mateusz Guzik
588e69e2fd cache: stop reusing .. entries on enter
It almost never happens in practice anyway. With this eliminated ->nc_vp
cannot change vnodes, removing an obstacle on the road to lockless
lookup.
2019-11-27 01:21:42 +00:00
Mateusz Guzik
2ac930e32c cache: fix numcache accounting on entry
. entries are never created and .. can reuse existing entries,
meaning the early count bump is both spurious and leading to
overcounting in certain cases.
2019-11-27 01:20:55 +00:00
Mateusz Guzik
36afce39ae cache: hide "doingcache" behind DEBUG_CACHE 2019-11-27 01:20:21 +00:00
Mateusz Guzik
d578a4256e cache: minor stat cleanup
Remove duplicated stats and move numcachehv from debug to vfs.cache.
2019-11-20 12:08:32 +00:00
Mateusz Guzik
708cf7eb6c cache: decrease ncnegfactor to 5
The current mechanism is bogus in several ways:
- the limit is a percentage of total entries added, which means negative
entries get evicted all the time even if there are plenty of resources
- evicting code is almost not concurrent, which makes it unable to
remove entries fast enough when doing something as simple as -j 104
buildworld
- there is no support for performing mass removal if necessary

Vast majority of negative entries never get any hits. Only evicting
them when the filesystem demands it results in a significant growth of
the namecache with almost no improvement in the hit ratio.

Sample result about afer 90 minutes of poudriere -j 104:

           current    no evict   % of the original
numneg     219737     2013157    916
numneghits 266711906  263544562  98 [1]

[1] this may look funny but there is a certain dose of variation to the
build

The number was chosen as something which mostly eliminates spurious
evictions during lighter workloads but still keeps the total at bay.

Sponsored by:	The FreeBSD Foundation
2019-09-27 19:14:03 +00:00
Mateusz Guzik
e643141838 cache: stop requeuing negative entries on the hot list
Turns out it does not improve hit ratio, but it does come with a cost
induces stemming from dirtying hit entries.

Sample result: hit counts of evicted entries after 2 buildworlds

before:

           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@                180865
               1 |@@@@@@@                                  49150
               2 |@@@                                      19067
               4 |@                                        9825
               8 |@                                        7340
              16 |@                                        5952
              32 |@                                        5243
              64 |@                                        4446
             128 |                                         3556
             256 |                                         3035
             512 |                                         1705
            1024 |                                         1078
            2048 |                                         365
            4096 |                                         95
            8192 |                                         34
           16384 |                                         26
           32768 |                                         23
           65536 |                                         8
          131072 |                                         6
          262144 |                                         0

after:
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@                184004
               1 |@@@@@@                                   47577
               2 |@@@                                      19446
               4 |@                                        10093
               8 |@                                        7470
              16 |@                                        5544
              32 |@                                        5475
              64 |@                                        5011
             128 |                                         3451
             256 |                                         3002
             512 |                                         1729
            1024 |                                         1086
            2048 |                                         363
            4096 |                                         86
            8192 |                                         26
           16384 |                                         25
           32768 |                                         24
           65536 |                                         7
          131072 |                                         5
          262144 |                                         0

Sponsored by:	The FreeBSD Foundation
2019-09-27 19:13:22 +00:00
Mateusz Guzik
312196df0f cache: make negative list shrinking a little bit concurrent
Continue protecting demotion from the hotlist and selection of the
target list with the ncneg_shrink_lock lock, but drop it before
relocking to zap the node.

While here count how many times we skipped shrinking due to the lock
being already taken.

Sponsored by:	The FreeBSD Foundation
2019-09-27 19:12:43 +00:00
Mateusz Guzik
95c6dd890a cache: stop recalculating upper limit each time a new entry is added
Sponsored by:	The FreeBSD Foundation
2019-09-27 19:12:20 +00:00