Commit Graph

18026 Commits

Author SHA1 Message Date
Alan Somers
1868a91fac Regenerate syscall files after addition of aio_writev/aio_readv 2021-01-02 19:57:58 -07:00
Alan Somers
022ca2fc7f Add aio_writev and aio_readv
POSIX AIO is great, but it lacks vectored I/O functions. This commit
fixes that shortcoming by adding aio_writev and aio_readv. They aren't
part of the standard, but they're an obvious extension. They work just
like their synchronous equivalents pwritev and preadv.

It isn't yet possible to use vectored aiocbs with lio_listio, but that
could be added in the future.

Reviewed by:    jhb, kib, bcr
Relnotes:       yes
Differential Revision: https://reviews.freebsd.org/D27743
2021-01-02 19:57:58 -07:00
Jamie Gritton
b58a46347c jail: revert the attachment part of b4e87a6329
The change to kern_jail_set that was supposed to "also properly clean
up when attachment fails" didn't fix a memory leak but actually caused
a double free.  Back that part out, and leave the part that manages
allprison_lock state.
2020-12-31 19:55:49 -08:00
Mateusz Guzik
1365b5f86f cache: fold NCF_WHITE check into the rest
Tested by:	pho
2021-01-01 00:10:43 +00:00
Mateusz Guzik
d7c62d98c9 cache: call cache_fplookup_modifying in neg
Tested by:	pho
2021-01-01 00:10:43 +00:00
Mateusz Guzik
6fe7de1a25 cache: refactor cache_fpl_handle_root to fit the rest of the code better
Tested by:	pho
2021-01-01 00:10:43 +00:00
Mateusz Guzik
e17e01bd0e cache: refactor dot handling
Tested by:	pho
2021-01-01 00:10:43 +00:00
Mateusz Guzik
4651db56c7 cache: remove a branch from mount point checking
Tested by:	pho
2021-01-01 00:10:42 +00:00
Mateusz Guzik
0b5bd1afd8 cache: support lockless lookup of degenerate paths
Tested by:	pho
2021-01-01 00:10:42 +00:00
Mateusz Guzik
1d6eb97677 cache: save on branching when parsing the path by inserting a sentinel
Tested by:	pho
2021-01-01 00:10:42 +00:00
Mateusz Guzik
67297766b5 cache: hoist trailing slash and degenerate path handling out of the loop
Tested by:	pho
2021-01-01 00:10:42 +00:00
Mateusz Guzik
bb3a12f0e5 fd: inline pwd_get_smr
Tested by:	pho
2021-01-01 00:10:42 +00:00
John Baldwin
825d234144 Don't check P_INMEM in kdb_thr_*().
Not all debugger operations that enumerate threads require thread
stacks to be resident in memory to be useful.  Instead, push P_INMEM
checks (if needed) into callers.

Reviewed by:	kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D27827
2020-12-31 16:01:12 -08:00
John Baldwin
9acce1c992 Enumerate processes via the pid hash table in kdb_thr_*().
Processes part way through exit1() are not included in allproc.  Using
allproc to enumerate processes prevented getting the stack trace of a
thread in this part of exit1() via ddb.

Reviewed by:	kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D27826
2020-12-31 16:00:54 -08:00
John Baldwin
4e7d1b527c Add a proc_off_p_hash helper variable.
This is used by kernel debuggers to enumerate processes via the pid
hash table.

Reviewed by:	kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D27825
2020-12-31 16:00:33 -08:00
John Baldwin
47877889f2 ddb ps: Use the pidhash to enumerate processes not in allproc.
Exiting processes that have been removed from allproc but are still
executing are not yet marked PRS_ZOMBIE, so they were not listed (for
example, if a thread panics during exit1()).  To detect these
processes, clear p_list.le_prev to NULL explicitly after removing a
process from the allproc list and check for this sentinel rather than
PRS_ZOMBIE when walking the pidhash.

While here, simplify the pidhash walk to use a single outer loop.

Reviewed by:	kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D27824
2020-12-31 16:00:05 -08:00
Jamie Gritton
b4e87a6329 jail: Clean up allprison_lock handing in kern_jail_set
Keep explicit track of the allprison_lock state during the final part
of kern_jail_set, instead of deducing it from the JAIL_ATTACH flag.

Also properly clean up when the attachment fails, fixing a long-
standing (though minor) memory leak.
2020-12-31 15:18:43 -08:00
Mateusz Guzik
0c09f4b0cc cache: work around corner case of dvp == tvp in cache_fplookup_final_modifying
Fixes a panic where the kernel would unlock an unheld lock coming from
rename looking up "foo/." as the source.

Reported by:	markj (syzkaller)
2020-12-28 21:38:20 +00:00
Mateusz Guzik
4ab7d9f484 cache: reduce engrish in previous commit 2020-12-28 02:05:30 +00:00
Mateusz Guzik
0714f921cd cache: save on some branching in common case mount point traversal 2020-12-28 01:53:28 +00:00
Mateusz Guzik
8c9d74634a vfs: stop open-coding setting WILLBEDIR flag 2020-12-28 01:53:27 +00:00
Mateusz Guzik
002e18eb7f vfs: add FAILIFEXISTS flag
Both FreeBSD and Linux mkdir -p walk the tree up ignoring any EEXIST on
the way and both are used a lot when building respective kernels.

This poses a problem as spurious locking avoidably interferes with
concurrent operations like getdirentries on affected directories.

Work around the problem by adding FAILIFEXISTS flag. In case of lockless
lookup this manages to avoid any work to begin with, there is no speed
up for the locked case but perhaps this can be augmented later on.

For simplicity the only supported semantics are as used by mkdir.

Reviewed by:	kib (previous version)
Differential Revision:	https://reviews.freebsd.org/D27789
2020-12-28 01:53:27 +00:00
Mateusz Guzik
ff97bc034f cache: simplify lockless dot lookups 2020-12-28 01:53:27 +00:00
Mateusz Guzik
abd7ded451 cache: modification and last entry filling support in lockless lookup v2
The previous patch failed to set the ISDOTDOT flag when appropriate,
which in turn fail to properly handle degenerate lookups.

While here sprinkle some extra assertions.

Tested by:	pho (previous version)
2020-12-27 21:03:18 +00:00
Mateusz Guzik
623daa69f9 cache: assert internal flags are not passed by namei 2020-12-27 19:49:24 +00:00
Mateusz Guzik
a1fc1f10c6 Revert "cache: modification and last entry filling support in lockless lookup"
This reverts commit 6dbb07ed68.

Some ports unreliably fail to build with rmdir getting ENOTEMPTY.
2020-12-27 19:02:29 +00:00
Mateusz Guzik
6dbb07ed68 cache: modification and last entry filling support in lockless lookup
Tested by:	pho (previous version)
2020-12-27 17:22:25 +00:00
Konstantin Belousov
9dd48b87e6 Regen. 2020-12-27 12:57:27 +02:00
Konstantin Belousov
7a202823aa Expose eventfd in the native API/ABI using a new __specialfd syscall
eventfd is a Linux system call that produces special file descriptors
for event notification. When porting Linux software, it is currently
usually emulated by epoll-shim on top of kqueues.  Unfortunately, kqueues
are not passable between processes.  And, as noted by the author of
epoll-shim, even if they were, the library state would also have to be
passed somehow.  This came up when debugging strange HW video decode
failures in Firefox.  A native implementation would avoid these problems
and help with porting Linux software.

Since we now already have an eventfd implementation in the kernel (for
the Linuxulator), it's pretty easy to expose it natively, which is what
this patch does.

Submitted by:   greg@unrelenting.technology
Reviewed by:    markj (previous version)
MFC after:      2 weeks
Differential Revision:  https://reviews.freebsd.org/D26668
2020-12-27 12:57:26 +02:00
Jamie Gritton
7f4e724829 jail: add a missing lock around an osd_jail_call().
allprison_lock should be at least held shared when jail OSD methods
are called.  Add a shared lock around one such call where that wasn't
the case.

In another such call, change an exclusive lock grab to be shared in
what is likely the more common case.
2020-12-26 20:49:30 -08:00
Jamie Gritton
0fe74ae624 jail: Consistently handle the pr_allow bitmask
Return a boolean (i.e. 0 or 1) from prison_allow, instead of the flag
value itself, which is what sysctl expects.

Add prison_set_allow(), which can set or clear a permission bit, and
propagates cleared bits down to child jails.

Use prison_allow() and prison_set_allow() in the various jail.allow.*
sysctls, and others that depend on thoe permissions.

Add locking around checking both pr_allow and pr_enforce_statfs in
prison_priv_check().
2020-12-26 20:25:02 -08:00
Mark Johnston
26b23f07fb sendfile: Ensure that sfio->npages is initialized
We initialize sfio->npages only when some I/O is required to satisfy the
request.  However, sendfile_iodone() contains an INVARIANTS-only check
that references sfio->npages, and this check is executed even if no I/O
is performed, so the check may use an uninitialized value.

Fix the problem by initializing sfio->npages earlier.  Note that
sendfile_swapin() always initializes the page array.  In some rare cases
we need to trim the page array so ensure that sfio->npages gets updated
accordingly.

Reported by:		syzkaller (with KASAN)
Reviewed by:		kib
Sponsored by:		The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27726
2020-12-26 16:07:40 -05:00
Jamie Gritton
5d58f959d3 jail: Fix lock-free access to dynamic pr.allow flags
Use atomic access and a memory barrier to ensure that the flag parameter
in pr_flag_allow is indeed set after the rest of the structure is valid.

Simplify adding flag bits with pr_allow_all, a dynamic version of
PR_ALLOW_ALL_STATIC.
2020-12-26 12:53:28 -08:00
Jamie Gritton
7de883c82f jail: Fix an O(n^2) loop when adding jails
When a jail is added using the default (system-chosen) JID, and
non-default-JID jails already exist, a loop through the allprison
list could restart and result in unnecessary O(n^2) behaviour.
There should never be more than two list passes required.

Also clean up inefficient (though still O(n)) allprison list traversal
when finding jails by ID, or when adding jails in the common case of
all default JIDs.
2020-12-26 10:39:34 -08:00
Alan Somers
0120603891 AIO: remove the kaiocb->bio linkage
Vectored aio will require each aiocb to be associated with multiple
bios, so we can't store a link to the latter from the former.  But we
don't really need to.  aio_biowakeup already knows the bio it's using,
and the other fields can be stored within the bio and/or buf itself.

Also, remove the unused kaiocb.backend2 field.

Reviewed By:	kib
Differential Revision: https://reviews.freebsd.org/D27682
2020-12-23 16:06:15 +00:00
Mateusz Guzik
906a73e791 cache: fix up cache_hold_vnode comment 2020-12-23 07:24:29 +00:00
Andrew Gallatin
02bc3865aa Optionally bind ktls threads to NUMA domains
When ktls_bind_thread is 2, we pick a ktls worker thread that is
bound to the same domain as the TCP connection associated with
the socket. We use roughly the same code as netinet/tcp_hpts.c to
do this. This allows crypto to run on the same domain as the TCP
connection is associated with. Assuming TCP_REUSPORT_LB_NUMA
(D21636) is in place & in use, this ensures that the crypto source
and destination buffers are local to the same NUMA domain as we're
running crypto on.

This change (when TCP_REUSPORT_LB_NUMA, D21636, is used) reduces
cross-domain traffic from over 37% down to about 13% as measured
by pcm.x on a dual-socket Xeon using nginx and a Netflix workload.

Reviewed by:	jhb
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21648
2020-12-19 21:46:09 +00:00
Kyle Evans
54a837c8cc kern: cpuset: allow jails to modify child jails' roots
This partially lifts a restriction imposed by r191639 ("Prevent a superuser
inside a jail from modifying the dedicated root cpuset of that jail") that's
perhaps beneficial after r192895 ("Add hierarchical jails."). Jails still
cannot modify their own cpuset, but they can modify child jails' roots to
further restrict them or widen them back to the modifying jails' own mask.

As a side effect of this, the system root may once again widen the mask of
jails as long as they're still using a subset of the parent jails' mask.
This was previously prevented by the fact that cpuset_getroot of a root set
will return that root, rather than the root's parent -- cpuset_modify uses
cpuset_getroot since it was introduced in r327895, previously it was just
validating against set->cs_parent which allowed the system root to widen
jail masks.

Reviewed by:	jamie
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D27352
2020-12-19 03:30:06 +00:00
Konstantin Belousov
673e2dd652 Add ELF flag to disable ASLR stack gap.
Also centralize and unify checks to enable ASLR stack gap in a new
helper exec_stackgap().

PR:	239873
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2020-12-18 23:14:39 +00:00
John Baldwin
a095390344 Use a template assembly file for firmware object files.
Similar to r366897, this uses the .incbin directive to pull in a
firmware file's contents into a .fwo file.  The same scheme for
computing symbol names from the filename is used as before to maximize
compatiblity and not require rebuilding existing .fwo files for
NO_CLEAN builds.  Using ld -o binary requires extra hacks in linkers
to either specify ABI options (e.g. soft- vs hard-float) or to ignore
ABI incompatiblities when linking certain objects (e.g.  object files
with only data).  Using the compiler driver avoids the need for these
hacks as the compiler driver is able to set all the appropriate ABI
options.

Reviewed by:	imp, markj
Obtained from:	CheriBSD
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D27579
2020-12-17 20:31:17 +00:00
Konstantin Belousov
551e205f6d Fix a race in tty_signal_sessleader() with unlocked read of s_leader.
Since we do not own the session lock, a parallel killjobc() might
reset s_leader to NULL after we checked it.  Read s_leader only once
and ensure that compiler is not allowed to reload.

While there, make access to t_session somewhat more pretty by using
local variable.

PR:	251915
Submitted by:	Jakub Piecuch <j.piecuch96@gmail.com>
MFC after:	1 week
2020-12-17 19:51:39 +00:00
Mateusz Guzik
57efe26bcb fd: reimplement close_range to avoid spurious relocking 2020-12-17 18:52:30 +00:00
Mateusz Guzik
08a5615cfe audit: rework AUDIT_SYSCLOSE
This in particular avoids spurious lookups on close.
2020-12-17 18:52:04 +00:00
Mateusz Guzik
1e71e7c4f6 fd: refactor closefp in preparation for close_range rework 2020-12-17 18:51:09 +00:00
Mateusz Guzik
08241fedc4 fd: remove redundant saturation check from fget_unlocked_seq
refcount_acquire_if_not_zero returns true on saturation.
The case of 0 is handled by looping again, after which the originally
found pointer will no longer be there.

Noted by:	kib
2020-12-16 18:01:41 +00:00
Mateusz Guzik
6404d7ffc1 uipc: disable prediction in unp_pcb_lock_peer
The branch is not very predictable one way or the other, at least during
buildkernel where it only correctly matched 57% of calls.
2020-12-13 21:32:19 +00:00
Mateusz Guzik
8ab96e265d cache: fix ups bad predicts
- last level fallback normally sees CREATE; the code should be optimized to not
get there for said case
- fast path commonly fails with ENOENT
2020-12-13 21:29:39 +00:00
Mateusz Guzik
d48c2b8d29 vfs: correctly predict last fdrop on failed open
Arguably since the count is guaranteed to be 1 the code should be modified
to avoid the work.
2020-12-13 21:28:15 +00:00
Konstantin Belousov
203affb291 Fix TDP_WAKEUP/thr_wake(curthread->td_tid) after r366428.
Reported by:	arichardson
Reviewed by:	arichardson, markj
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D27597
2020-12-13 19:45:42 +00:00
Konstantin Belousov
0b459854bc Correct indent.
Sponsored by:	The FreeBSD Foundation
2020-12-13 19:43:45 +00:00