Commit Graph

17816 Commits

Author SHA1 Message Date
Alexander Motin
8b220f8915 Fix asymmetry in devstat(9) calls by GEOM.
Before this, GEOM passed the bio pointer to transaction start, but not end.
It was irrelevant until devstat(9) got DTrace hooks, which appeared to
provide the bio pointer on I/O completion, but not on submission.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2020-10-24 21:07:10 +00:00
Ruslan Bukin
f32f0095e9 o Add iommu de-initialization method for MSI interface.
o Add iommu_unmap_msi() to release the msi GAS entry.
o Provide default implementations for iommu init/deinit methods.

Reviewed by:	kib
Sponsored by:	Innovate DSbD
Differential Revision:	https://reviews.freebsd.org/D26906
2020-10-24 20:09:27 +00:00
Ryan Moeller
e58483c4fb sysctl+kern_sysctl: Honor SKIP for descendant nodes
Ensure we also skip descendants of SKIP nodes when iterating through children
of an explicitly specified node.

Reported by:	np
Reviewed by:	np
MFC after:	1 week
Sponsored by:	iXsystems, Inc.
Differential Revision:	https://reviews.freebsd.org/D26833
2020-10-24 16:17:07 +00:00
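
The pruning rule described in this commit can be illustrated with a small tree walk. This is a minimal sketch, not the kernel code: `struct node`, `NODE_SKIP`, and `count_visible()` are hypothetical names standing in for sysctl oids, the SKIP flag, and the iteration routine. The explicitly specified parent is always walked; any child marked SKIP is passed over without recursing, which hides its descendants too.

```c
#include <assert.h>
#include <stddef.h>

#define NODE_SKIP 0x1	/* hypothetical stand-in for the SKIP flag */

/* Hypothetical tree node: first-child / next-sibling representation. */
struct node {
	const char *name;
	int flags;
	struct node *child;	/* first child, or NULL */
	struct node *sibling;	/* next sibling, or NULL */
};

/*
 * Count the nodes a listing would visit beneath an explicitly
 * specified parent.  The parent's own flags are not consulted.
 */
static int
count_visible(const struct node *parent)
{
	int n = 0;

	assert(parent != NULL);
	for (const struct node *c = parent->child; c != NULL; c = c->sibling) {
		if (c->flags & NODE_SKIP)
			continue;	/* not recursing hides all descendants */
		n += 1 + count_visible(c);
	}
	return (n);
}
```

Naming a SKIP node directly still lists its children in this sketch, because only children are tested against the flag.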
Ryan Moeller
0595c12484 kern_sysctl: Misc code cleanup
Remove unused oidpp parameter from sysctl_sysctl_next_ls and
add high level comments to describe how it works.

No functional change.

Reviewed by:	imp
MFC after:	1 week
Sponsored by:	iXsystems, Inc.
Differential Revision:	https://reviews.freebsd.org/D26854
2020-10-24 14:46:38 +00:00
Kyle Evans
275c821d3d audit: correct reporting of *execve(2) success
r326145 corrected do_execve() to return EJUSTRETURN upon success so that
important registers are not clobbered. This had the side effect of tapping
out 'failures' for all *execve(2) audit records, which is less than useful
for auditing purposes.

Audit exec returns earlier, where we can know for sure that EJUSTRETURN
translates to success. Note that this unsets TDP_AUDITREC as we commit the
audit record, so the usual audit in the syscall return path will do nothing.

PR:		249179
Reported by:	Eirik Oeverby <ltning-freebsd anduin net>
Reviewed by:	csjp, kib
MFC after:	1 week
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D26922
2020-10-24 14:39:17 +00:00
Mateusz Guzik
eb65cde4f5 cache: assorted typo fixes 2020-10-24 13:31:40 +00:00
Mateusz Guzik
029cfccc71 cache: add the missing NC_NOMAKEENTRY and NC_KEEPPOSENTRY to lockless lookup
They are de facto ignored.
2020-10-24 13:31:25 +00:00
Mateusz Guzik
7cc1718613 vfs: fix a race where reclaim vholds freed vnodes
Reported by:	pho
Tested by:	pho (previous version)
Fixes:	r366974 ("vfs: stop taking the interlock in vnode reclaim")
2020-10-24 13:30:37 +00:00
Mateusz Guzik
acb41008f3 cache: batch updates to numcache in case of mass removal 2020-10-24 01:14:52 +00:00
Mateusz Guzik
208cb7c4b6 cache: refactor alloc/free
This in particular centralizes manipulation of numcache.
2020-10-24 01:14:17 +00:00
Mateusz Guzik
1d44405690 cache: fold branch prediction into cache_ncp_canuse 2020-10-24 01:13:47 +00:00
Mateusz Guzik
c13d7d1f98 cache: fix some typos 2020-10-24 01:13:16 +00:00
Mateusz Guzik
f878526f20 cache: drop write-only vars 2020-10-24 01:13:02 +00:00
Ruslan Bukin
9729b14985 Move the iommu stubs to a generic place, so they are available on all the
platforms.

This removes the dependency on the IOMMU macro in the AHCI driver.

Requested by:	kib
Suggested by:	andrew
Reviewed by:	kib
Sponsored by:	Innovate DSbD
Differential Revision:	https://reviews.freebsd.org/D26887
2020-10-23 21:27:48 +00:00
Mateusz Guzik
3862838921 cache: reduce memory waste in struct namecache
The previous scheme for calculating the total size was doing sizeof
on the struct and then adding the wanted space for the buffer.

nc_name is at offset 58 while sizeof(struct namecache) is 64.
With CACHE_PATH_CUTOFF of 39 bytes and 1 byte of padding we were
allocating 104 bytes for the entry and never accounting for the 6
byte padding, wasting that space.
2020-10-23 15:56:22 +00:00
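
The arithmetic in this commit message can be reproduced with a stand-in structure. This is a sketch under stated assumptions: `struct nc_sketch` is hypothetical and only mimics the quoted layout (name at offset 58, `sizeof` of 64) on a platform where `uint64_t` is 8-byte aligned. Computing the size from `offsetof` is one way to let the name reuse the tail padding; the actual change may differ in detail, and the allocator may round the result up again.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_PATH_CUTOFF 39	/* value quoted in the commit message */

/* Hypothetical stand-in for struct namecache: 58 header bytes, then the name. */
struct nc_sketch {
	uint64_t header[7];	/* 56 bytes standing in for links and pointers */
	uint8_t  nc_flag;	/* offset 56 */
	uint8_t  nc_nlen;	/* offset 57 */
	char     nc_name[];	/* offset 58, flexible array member */
};

/* sizeof rounds the struct up to its 8-byte alignment: 58 becomes 64. */
static_assert(sizeof(struct nc_sketch) == 64, "unexpected struct size");
static_assert(offsetof(struct nc_sketch, nc_name) == 58, "unexpected offset");

/* Old scheme: sizeof plus buffer, so the 6 padding bytes are never used. */
static size_t
nc_size_old(size_t namelen)
{
	return (sizeof(struct nc_sketch) + namelen + 1);
}

/* Alternative: count from the name's offset so the name overlaps the padding. */
static size_t
nc_size_new(size_t namelen)
{
	return (offsetof(struct nc_sketch, nc_name) + namelen + 1);
}
```

With a 39-byte name and one extra byte, the old calculation yields 104 bytes and the offset-based one yields 98.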
Mateusz Guzik
703f3fafa5 vfs: stop taking the interlock in vnode reclaim
It no longer protects any of the tested fields, keeping all the checks racy.

While here make vtryrecycle drop the vnode on its own. Avoids an additional
lock trip.
2020-10-23 15:49:18 +00:00
Mateusz Guzik
c7520caa4f vfs: prevent avoidable evictions on mkdir of existing directories
mkdir -p /foo/bar/baz will mkdir each path component and ignore EEXIST.

The NOCACHE lookup makes the namecache unnecessarily evict the existing entry
and then fall back to the fs lookup routine, eventually leading namei to return an
error as the directory is already there.

For invocations like mkdir -p /usr/obj/usr/src/sys/GENERIC/modules this triggers
fallbacks to the slowpath for concurrently executing lookups.

Tested by:	pho
Discussed with:	kib
2020-10-22 19:28:12 +00:00
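
For readers unfamiliar with the userland pattern the commit refers to, the following is a simplified sketch of how a `mkdir -p` style helper issues one mkdir(2) call per path component and ignores EEXIST. `mkdir_p()` and `MKDIR_P_MAX` are hypothetical names; this is not the mkdir(1) source, and it skips the check a real implementation makes that an existing component is actually a directory.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

#define MKDIR_P_MAX 1024	/* hypothetical limit on path length */

/* Create every component of path, treating "already exists" as success. */
static int
mkdir_p(const char *path, mode_t mode)
{
	char buf[MKDIR_P_MAX];
	size_t len;

	assert(path != NULL);
	len = strlen(path);
	if (len == 0) {
		errno = ENOENT;
		return (-1);
	}
	if (len >= sizeof(buf)) {
		errno = ENAMETOOLONG;
		return (-1);
	}
	memcpy(buf, path, len + 1);

	/* Start at buf + 1 so a leading '/' is not treated as a component end. */
	for (char *p = buf + 1; ; p++) {
		if (*p != '/' && *p != '\0')
			continue;
		char saved = *p;
		*p = '\0';	/* temporarily terminate at this component */
		if (mkdir(buf, mode) != 0 && errno != EEXIST)
			return (-1);
		if (saved == '\0')
			break;
		*p = saved;
	}
	return (0);
}
```

For a deep path whose leading components already exist, most of these calls fail with EEXIST, which is the case the commit keeps from evicting namecache entries.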
Mateusz Guzik
54f09403a3 cache: assert the created entry does not point to itself 2020-10-22 19:22:34 +00:00
Konstantin Belousov
18b8496c23 sysv_sem: semusz depends on semume.
Size of the per-process semaphore undo structure (semusz) depends on
the number of the per-process undos.  If kern.ipc.semume is adjusted,
semusz must be adjusted as well, and it makes no sense to delegate
adjustment to user.  Make it automatic.

Reported and tested by:	Olef <o.vandestadt@gmail.com>
PR:	250361
Reviewed by:	jhb, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D26826
2020-10-22 09:28:11 +00:00
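
The dependency between the two tunables can be sketched as follows. This assumes the undo structure is a fixed header followed by one record per undo, rounded up to `sizeof(long)`; all struct, field, and function names here are hypothetical and the real kernel layout may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-semaphore undo record. */
struct undo_ent_sketch {
	short adjval;
	short num;
	int id;
	unsigned short seq;
};

/* Hypothetical fixed header preceding the undo records. */
struct undo_hdr_sketch {
	void *next;
	void *prev;
	void *proc;
	short cnt;
};

/* Size of one per-process undo structure holding `semume` records. */
static size_t
undo_size(size_t semume)
{
	size_t bytes = sizeof(struct undo_hdr_sketch) +
	    semume * sizeof(struct undo_ent_sketch);

	/* Round up to a multiple of sizeof(long). */
	return ((bytes + sizeof(long) - 1) / sizeof(long) * sizeof(long));
}

struct sem_tunables_sketch {
	size_t semume;	/* undo records per process */
	size_t semusz;	/* derived, never set independently */
};

/* Changing semume recomputes semusz in the same step. */
static void
set_semume(struct sem_tunables_sketch *t, size_t semume)
{
	assert(t != NULL && semume > 0);
	t->semume = semume;
	t->semusz = undo_size(semume);
}
```

Deriving the size inside the setter means the two values cannot drift apart, which is the effect the commit describes.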
Hans Petter Selasky
2ae634c6db Implement mbuf hashing routines for IP over infiniband, IPoIB.
No functional change intended.

Differential Revision:	https://reviews.freebsd.org/D26254
Reviewed by:		melifaro@
MFC after:		1 week
Sponsored by:		Mellanox Technologies // NVIDIA Networking
2020-10-22 09:17:56 +00:00
Brooks Davis
44ca4575ea vmapbuf: don't smuggle address or length in buf
Instead, add arguments to vmapbuf.  Since this argument is
always a pointer use a type of void * and cast to vm_offset_t in
vmapbuf.  (In CheriBSD we've altered vm_fault_quick_hold_pages to
take a pointer and check its bounds.)

In no other situation does b_data contain a user pointer and vmapbuf
replaces b_data with the actual mapping.

Suggested by:	jhb
Reviewed by:	imp, jhb
Obtained from:	CheriBSD
MFC after:	1 week
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D26784
2020-10-21 16:00:15 +00:00
Mateusz Guzik
2f1c35053c cache: drop the spurious slash_prefixed argument 2020-10-21 05:57:25 +00:00
Mateusz Guzik
ab21ed17ed vfs: drop the de facto curthread argument from VOP_INACTIVE 2020-10-20 07:19:03 +00:00
Mateusz Guzik
8ecd87a3e7 vfs: drop spurious cred argument from VOP_VPTOCNP 2020-10-20 07:18:27 +00:00
Konstantin Belousov
c0baa3dc4a vgonel(): avoid recursing into VOP_INACTIVE().
It is a common pattern for filesystems' VOP_INACTIVE() implementation
to forcibly reclaim the vnode when its state is final.  For instance,
UFS vnode with zero link count is removed, and since it is
inactivated, the last open reference on it is dropped.

On the other hand, vnode might get spurious usecount reference for
many reasons.  If the spurious reference exists while vgonel() checks
for active state of the vnode, it would recurse into VOP_INACTIVE().

Fix it by checking and not doing inactivation when vgone() was called
from inactive VOP.

Reported and tested by:	pho
Discussed with:	mjg
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2020-10-19 19:20:23 +00:00
Mateusz Guzik
6d5d469fc1 cache: promote negative entries based on more than one hit
During tinderbox and similar workloads negative entries get at least one
hit before they get evicted. In the current scheme this avoidably promotes
them.

Be conservative and stick to 2 hits for now.
2020-10-19 18:51:51 +00:00
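
The two-hit policy can be illustrated with a small counter. This is a minimal sketch: `struct neg_entry_sketch`, `NEG_PROMOTE_HITS`, and `neg_hit()` are hypothetical names, and the real namecache keeps this state differently and updates it under its own synchronization.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NEG_PROMOTE_HITS 2	/* threshold from the commit message */

/* Hypothetical negative entry state. */
struct neg_entry_sketch {
	uint8_t hits;	/* hits seen while still cold */
	bool hot;	/* already promoted */
};

/* Record a hit; returns true only when this hit promotes the entry. */
static bool
neg_hit(struct neg_entry_sketch *ne)
{
	assert(ne != NULL);
	if (ne->hot)
		return (false);
	if (ne->hits < UINT8_MAX)
		ne->hits++;
	if (ne->hits < NEG_PROMOTE_HITS)
		return (false);
	ne->hot = true;
	return (true);
}
```

An entry that is hit once and then evicted never reaches the hot list under this rule.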
John Baldwin
6bcf3c46d8 Check TF_TOE not the tod pointer to determine if TOE is active.
The TF_TOE flag is the check used in the rest of the network stack to
determine if TOE is active on a socket.  There is at least one path in
the cxgbe(4) TOE driver that can leave the tod pointer non-NULL on a
socket not using TOE.

Reported by:	Sony Arpita Das <sonyarpitad@chelsio.com>
Reviewed by:	np
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D26803
2020-10-19 18:24:06 +00:00
Mark Johnston
d80126a6f4 link_elf_obj: Colour VM objects
This will cause the VM to back sufficiently large .text sections, such
as those in zfs.ko or amdgpu.ko on amd64, with superpage mappings when
possible.

Reviewed by:	alc, kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26802
2020-10-19 16:57:59 +00:00
Mark Johnston
6351771b7c vmem: Allocate btags before looping in vmem_xalloc()
BT_MAXALLOC (4) is the number of boundary tags required to complete an
allocation in the worst case: two to clip a free segment, and two to
import from a parent arena.  vmem_xalloc() preallocates four boundary
tags before attempting a search to simplify the segment allocation code.
It implements a loop that:
1) ensures that BT_MAXALLOC boundary tags are available,
2) attempts to find and clip a free segment satisfying the allocation
   constraints, and failing that,
3) attempts to import a segment.

On !UMA_MD_SMALL_ALLOC platforms the btag zone has to handle recursion:
it needs boundary tags to allocate boundary tags.  Thus we reserve
2 * BT_MAXALLOC * mp_ncpus tags for use when recursing: the factor of 2
is because there are two layers of vmem arenas, the per-domain arena and
global arena.  For a single thread, 2 * BT_MAXALLOC tags should be
sufficient.

Because of the way the loop is structured, BT_MAXALLOC tags are not
sufficient.  The first bt_fill() call may allocate BT_MAXALLOC tags,
then import a segment (consuming two tags), then attempt to top up the
preallocation before carving into the imported free segment, thus
requiring up to six tags in the worst case.  Because we don't
preallocate that many, this bug can cause deadlocks in rare scenarios.

Fix the problem by moving the preallocation out of the loop.  This assumes
that only a single import is ever required to satisfy an allocation
request.

Thanks to manu, emaste and lwhsu for helping test debug patches.

Reported by:	Jenkins (hardware CI lab)
Reviewed by:	alc, kib, rlibby
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26770
2020-10-19 16:54:06 +00:00
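
The worst-case tag count described in this commit can be checked with a toy simulation. It is only a model of the counting argument, not of vmem itself: `struct tag_sim` and the `sim_*` functions are hypothetical, and the model assumes the first search fails, one import succeeds, and the second pass clips the imported segment.

```c
#include <assert.h>
#include <stdbool.h>

#define BT_MAXALLOC 4	/* value from the commit message */

struct tag_sim {
	int held;	/* tags currently preallocated */
	int allocated;	/* total tags taken from the tag zone */
};

/* Top the preallocation up to BT_MAXALLOC tags. */
static void
sim_fill(struct tag_sim *s)
{
	if (s->held < BT_MAXALLOC) {
		s->allocated += BT_MAXALLOC - s->held;
		s->held = BT_MAXALLOC;
	}
}

/* Use up n preallocated tags. */
static void
sim_consume(struct tag_sim *s, int n)
{
	assert(s->held >= n);
	s->held -= n;
}

/* Returns the total number of tags allocated for one request. */
static int
sim_xalloc(bool fill_inside_loop)
{
	struct tag_sim s = { 0, 0 };
	bool imported = false;

	if (!fill_inside_loop)
		sim_fill(&s);		/* fixed: preallocate once */
	for (;;) {
		if (fill_inside_loop)
			sim_fill(&s);	/* old: top up on every pass */
		if (imported) {
			sim_consume(&s, 2);	/* clip the free segment */
			return (s.allocated);
		}
		sim_consume(&s, 2);	/* import from the parent arena */
		imported = true;
	}
}
```

In this model, filling inside the loop takes six tags in total, while filling once before the loop takes four.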
Mark Johnston
33a9bce62f vmem: Simplify bt_fill() callers a bit
No functional change intended.

Reviewed by:	alc, kib, rlibby
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26769
2020-10-19 16:52:27 +00:00
Ruslan Bukin
e707c8be4e Manage MSI iommu pages.
With this, the interrupt controller driver needs only a small change to
create a map for the page the device will write to in order to raise an interrupt.

Submitted by:	andrew
Reviewed by:	kib
Sponsored by:	Innovate DSbD
Differential Revision:	https://reviews.freebsd.org/D26705
2020-10-19 13:10:21 +00:00
Mateusz Guzik
665c8c3e7d cache: refactor negative promotion/demotion handling
This will simplify policy changes.
2020-10-19 09:52:52 +00:00
Bjoern A. Zeeb
f7a0bb0dec ddb: add show sysinit command
Add a show sysinit command to ddb (similar to show vnet_sysinit) which
proved to be helpful to debug some ordering issues on early-mid kernel
start panics.
2020-10-17 22:47:08 +00:00
Mateusz Guzik
4c4aa84848 cache: shorten names of debug stats 2020-10-17 21:30:46 +00:00
Mateusz Guzik
676557143f cache: don't automatically evict negative entries if usage is low
The previous scheme only looked at negative entry count in relation to the
total count, leading to tons of spurious evictions if the cache is not
significantly populated.

Instead, only try the above if negative entry count goes beyond namecache
capacity.
2020-10-17 21:22:40 +00:00
Mateusz Guzik
e98c3bc667 cache: rework sysctl vfs.cache tree
Split everything into neg, debug, param and stat categories.

The legacy nchstats sysctl (queried e.g., by systat) remains untouched.

While here rename some vars to be easier on the eye.
2020-10-17 13:06:29 +00:00
Mateusz Guzik
fa7c73d30c cache: factor negative lookup out of cache_fplookup_next 2020-10-17 13:04:46 +00:00
Mateusz Guzik
41e6b18422 cache: avoid smr in cache_neg_evict in favor of the already held bucket lock 2020-10-17 13:04:25 +00:00
Mateusz Guzik
c38d8e1eb2 cache: rework parts of negative entry management
- declutter sysctl vfs.cache by moving relevant entries into
vfs.cache.neg
- add a little more parallelism to eviction by replacing the
global lock with an atomically modified counter
- track more statistics

The code needs further effort.
2020-10-17 08:48:58 +00:00
Mateusz Guzik
b31b5e9cfd cache: remove entries before trying to add new ones, not after
Should allow positive entries to replace negative ones in case
the cache is full.
2020-10-17 08:48:32 +00:00
Mateusz Guzik
ad89066af4 vfs: annotate mountlist_mtx with __exclusive_cache_line 2020-10-17 08:47:08 +00:00
Mateusz Guzik
d6eee35004 cache: add a probe reporting addition of duplicate entries 2020-10-17 00:27:26 +00:00
Mateusz Guzik
a59b0ac3aa cache: flip inverted condition in previous
It happened to not affect correctness in that the fallback code would
simply neglect to promote the entry.
2020-10-16 02:19:33 +00:00
Mateusz Guzik
e7602e04c7 cache: support negative entry promotion in slowpath smr 2020-10-16 00:56:13 +00:00
Mateusz Guzik
571bc3d1af cache: elide vhold/vdrop around promoting negative entry 2020-10-16 00:55:57 +00:00
Mateusz Guzik
640e6162ee cache: dedup code for negative promotion 2020-10-16 00:55:31 +00:00
Mateusz Guzik
c97c8746c0 cache: neglist -> nl; negstate -> ns
No functional changes.
2020-10-16 00:55:09 +00:00
Mateusz Guzik
43777a207d cache: split hotlist between existing negative lists
This simplifies the code while allowing for concurrent negative eviction
down the road.

Cache misses increased slightly due to higher rate of evictions allowed by
the change.

The current algorithm remains too aggressive.
2020-10-15 17:44:17 +00:00
Mateusz Guzik
430dc4518d cache: make neglist an array given the static size 2020-10-15 17:42:22 +00:00
Brooks Davis
16e4a0c89c physio: Don't store user addresses in bio_data
Only assign the address from the iovec to bio_data if it is a kernel
address.  This was the single place where bio_data stored (however
briefly) a userspace pointer.

Reviewed by:	imp, markj
Obtained from:	CheriBSD
MFC after:	1 week
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D26783
2020-10-15 17:05:21 +00:00