Commit Graph

18200 Commits

Author SHA1 Message Date
Mateusz Guzik
6dbb07ed68 cache: modification and last entry filling support in lockless lookup
Tested by:	pho (previous version)
2020-12-27 17:22:25 +00:00
Konstantin Belousov
9dd48b87e6 Regen. 2020-12-27 12:57:27 +02:00
Konstantin Belousov
7a202823aa Expose eventfd in the native API/ABI using a new __specialfd syscall
eventfd is a Linux system call that produces special file descriptors
for event notification. When porting Linux software, it is currently
usually emulated by epoll-shim on top of kqueues.  Unfortunately, kqueues
are not passable between processes.  And, as noted by the author of
epoll-shim, even if they were, the library state would also have to be
passed somehow.  This came up when debugging strange HW video decode
failures in Firefox.  A native implementation would avoid these problems
and help with porting Linux software.

Since we now already have an eventfd implementation in the kernel (for
the Linuxulator), it's pretty easy to expose it natively, which is what
this patch does.

Submitted by:   greg@unrelenting.technology
Reviewed by:    markj (previous version)
MFC after:      2 weeks
Differential Revision:  https://reviews.freebsd.org/D26668
2020-12-27 12:57:26 +02:00
Jamie Gritton
7f4e724829 jail: add a missing lock around an osd_jail_call().
allprison_lock should be at least held shared when jail OSD methods
are called.  Add a shared lock around one such call where that wasn't
the case.

In another such call, change an exclusive lock grab to be shared in
what is likely the more common case.
2020-12-26 20:49:30 -08:00
Jamie Gritton
0fe74ae624 jail: Consistently handle the pr_allow bitmask
Return a boolean (i.e. 0 or 1) from prison_allow, instead of the flag
value itself, which is what sysctl expects.

Add prison_set_allow(), which can set or clear a permission bit, and
propagates cleared bits down to child jails.

Use prison_allow() and prison_set_allow() in the various jail.allow.*
sysctls, and others that depend on thoe permissions.

Add locking around checking both pr_allow and pr_enforce_statfs in
prison_priv_check().
2020-12-26 20:25:02 -08:00
Mark Johnston
26b23f07fb sendfile: Ensure that sfio->npages is initialized
We initialize sfio->npages only when some I/O is required to satisfy the
request.  However, sendfile_iodone() contains an INVARIANTS-only check
that references sfio->npages, and this check is executed even if no I/O
is performed, so the check may use an uninitialized value.

Fix the problem by initializing sfio->npages earlier.  Note that
sendfile_swapin() always initializes the page array.  In some rare cases
we need to trim the page array so ensure that sfio->npages gets updated
accordingly.

Reported by:		syzkaller (with KASAN)
Reviewed by:		kib
Sponsored by:		The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27726
2020-12-26 16:07:40 -05:00
Jamie Gritton
5d58f959d3 jail: Fix lock-free access to dynamic pr.allow flags
Use atomic access and a memory barrier to ensure that the flag parameter
in pr_flag_allow is indeed set after the rest of the structure is valid.

Simplify adding flag bits with pr_allow_all, a dynamic version of
PR_ALLOW_ALL_STATIC.
2020-12-26 12:53:28 -08:00
Jamie Gritton
7de883c82f jail: Fix an O(n^2) loop when adding jails
When a jail is added using the default (system-chosen) JID, and
non-default-JID jails already exist, a loop through the allprison
list could restart and result in unnecessary O(n^2) behaviour.
There should never be more than two list passes required.

Also clean up inefficient (though still O(n)) allprison list traversal
when finding jails by ID, or when adding jails in the common case of
all default JIDs.
2020-12-26 10:39:34 -08:00
Alan Somers
0120603891 AIO: remove the kaiocb->bio linkage
Vectored aio will require each aiocb to be associated with multiple
bios, so we can't store a link to the latter from the former.  But we
don't really need to.  aio_biowakeup already knows the bio it's using,
and the other fields can be stored within the bio and/or buf itself.

Also, remove the unused kaiocb.backend2 field.

Reviewed By:	kib
Differential Revision: https://reviews.freebsd.org/D27682
2020-12-23 16:06:15 +00:00
Mateusz Guzik
906a73e791 cache: fix up cache_hold_vnode comment 2020-12-23 07:24:29 +00:00
Andrew Gallatin
02bc3865aa Optionally bind ktls threads to NUMA domains
When ktls_bind_thread is 2, we pick a ktls worker thread that is
bound to the same domain as the TCP connection associated with
the socket. We use roughly the same code as netinet/tcp_hpts.c to
do this. This allows crypto to run on the same domain as the TCP
connection is associated with. Assuming TCP_REUSPORT_LB_NUMA
(D21636) is in place & in use, this ensures that the crypto source
and destination buffers are local to the same NUMA domain as we're
running crypto on.

This change (when TCP_REUSPORT_LB_NUMA, D21636, is used) reduces
cross-domain traffic from over 37% down to about 13% as measured
by pcm.x on a dual-socket Xeon using nginx and a Netflix workload.

Reviewed by:	jhb
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21648
2020-12-19 21:46:09 +00:00
Kyle Evans
54a837c8cc kern: cpuset: allow jails to modify child jails' roots
This partially lifts a restriction imposed by r191639 ("Prevent a superuser
inside a jail from modifying the dedicated root cpuset of that jail") that's
perhaps beneficial after r192895 ("Add hierarchical jails."). Jails still
cannot modify their own cpuset, but they can modify child jails' roots to
further restrict them or widen them back to the modifying jails' own mask.

As a side effect of this, the system root may once again widen the mask of
jails as long as they're still using a subset of the parent jails' mask.
This was previously prevented by the fact that cpuset_getroot of a root set
will return that root, rather than the root's parent -- cpuset_modify uses
cpuset_getroot since it was introduced in r327895, previously it was just
validating against set->cs_parent which allowed the system root to widen
jail masks.

Reviewed by:	jamie
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D27352
2020-12-19 03:30:06 +00:00
Konstantin Belousov
673e2dd652 Add ELF flag to disable ASLR stack gap.
Also centralize and unify checks to enable ASLR stack gap in a new
helper exec_stackgap().

PR:	239873
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2020-12-18 23:14:39 +00:00
John Baldwin
a095390344 Use a template assembly file for firmware object files.
Similar to r366897, this uses the .incbin directive to pull in a
firmware file's contents into a .fwo file.  The same scheme for
computing symbol names from the filename is used as before to maximize
compatiblity and not require rebuilding existing .fwo files for
NO_CLEAN builds.  Using ld -o binary requires extra hacks in linkers
to either specify ABI options (e.g. soft- vs hard-float) or to ignore
ABI incompatiblities when linking certain objects (e.g.  object files
with only data).  Using the compiler driver avoids the need for these
hacks as the compiler driver is able to set all the appropriate ABI
options.

Reviewed by:	imp, markj
Obtained from:	CheriBSD
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D27579
2020-12-17 20:31:17 +00:00
Konstantin Belousov
551e205f6d Fix a race in tty_signal_sessleader() with unlocked read of s_leader.
Since we do not own the session lock, a parallel killjobc() might
reset s_leader to NULL after we checked it.  Read s_leader only once
and ensure that compiler is not allowed to reload.

While there, make access to t_session somewhat more pretty by using
local variable.

PR:	251915
Submitted by:	Jakub Piecuch <j.piecuch96@gmail.com>
MFC after:	1 week
2020-12-17 19:51:39 +00:00
Mateusz Guzik
57efe26bcb fd: reimplement close_range to avoid spurious relocking 2020-12-17 18:52:30 +00:00
Mateusz Guzik
08a5615cfe audit: rework AUDIT_SYSCLOSE
This in particular avoids spurious lookups on close.
2020-12-17 18:52:04 +00:00
Mateusz Guzik
1e71e7c4f6 fd: refactor closefp in preparation for close_range rework 2020-12-17 18:51:09 +00:00
Mateusz Guzik
08241fedc4 fd: remove redundant saturation check from fget_unlocked_seq
refcount_acquire_if_not_zero returns true on saturation.
The case of 0 is handled by looping again, after which the originally
found pointer will no longer be there.

Noted by:	kib
2020-12-16 18:01:41 +00:00
Mateusz Guzik
6404d7ffc1 uipc: disable prediction in unp_pcb_lock_peer
The branch is not very predictable one way or the other, at least during
buildkernel where it only correctly matched 57% of calls.
2020-12-13 21:32:19 +00:00
Mateusz Guzik
8ab96e265d cache: fix ups bad predicts
- last level fallback normally sees CREATE; the code should be optimized to not
get there for said case
- fast path commonly fails with ENOENT
2020-12-13 21:29:39 +00:00
Mateusz Guzik
d48c2b8d29 vfs: correctly predict last fdrop on failed open
Arguably since the count is guaranteed to be 1 the code should be modified
to avoid the work.
2020-12-13 21:28:15 +00:00
Konstantin Belousov
203affb291 Fix TDP_WAKEUP/thr_wake(curthread->td_tid) after r366428.
Reported by:	arichardson
Reviewed by:	arichardson, markj
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D27597
2020-12-13 19:45:42 +00:00
Konstantin Belousov
0b459854bc Correct indent.
Sponsored by:	The FreeBSD Foundation
2020-12-13 19:43:45 +00:00
Mateusz Guzik
edcdcefb88 fd: fix fdrop prediction when closing a fd
Most of the time this is the last reference, contrary to typical fdrop use.
2020-12-13 18:06:24 +00:00
Ryan Libby
d3bbf8af68 cache_fplookup: quiet gcc -Wreturn-type
Reviewed by:	markj, mjg
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D27555
2020-12-11 22:51:44 +00:00
Mateusz Guzik
0ecce93dca fd: make serialization in fdescfree_fds conditional on hold count
p_fd nullification in fdescfree serializes against new threads transitioning
the count 1 -> 2, meaning that fdescfree_fds observing the count of 1 can
safely assume there is nobody else using the table. Losing the race and
observing > 1 is harmless.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D27522
2020-12-10 17:17:22 +00:00
Mark Johnston
3309fa7403 Plug a race between fd table teardown and several loops
To export information from fd tables we have several loops which do
this:

FILDESC_SLOCK(fdp);
for (i = 0; fdp->fd_refcount > 0 && i <= lastfile; i++)
	<export info for fd i>;
FILDESC_SUNLOCK(fdp);

Before r367777, fdescfree() acquired the fd table exclusive lock between
decrementing fdp->fd_refcount and freeing table entries.  This
serialized with the loop above, so the file at descriptor i would remain
valid until the lock is dropped.  Now there is no serialization, so the
loops may race with teardown of file descriptor tables.

Acquire the exclusive fdtable lock after releasing the final table
reference to provide a barrier synchronizing with these loops.

Reported by:	pho
Reviewed by:	kib (previous version), mjg
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27513
2020-12-09 14:05:08 +00:00
Mark Johnston
4c1c90ea95 Use refcount_load(9) to load fd table reference counts
No functional change intended.

Reviewed by:	kib, mjg
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27512
2020-12-09 14:04:54 +00:00
Kyle Evans
f1b18a668d cpuset_set{affinity,domain}: do not allow empty masks
cpuset_modify() would not currently catch this, because it only checks that
the new mask is a subset of the root set and circumvents the EDEADLK check
in cpuset_testupdate().

This change both directly validates the mask coming in since we can
trivially detect an empty mask, and it updates cpuset_testupdate to catch
stuff like this going forward by always ensuring we don't end up with an
empty mask.

The check_mask argument has been renamed because the 'check' verbiage does
not imply to me that it's actually doing a different operation. We're either
augmenting the existing mask, or we are replacing it entirely.

Reported by:	syzbot+4e3b1009de98d2fabcda@syzkaller.appspotmail.com
Discussed with:	andrew
Reviewed by:	andrew, markj
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D27511
2020-12-08 18:47:22 +00:00
Kyle Evans
b2780e8537 kern: cpuset: resolve race between cpuset_lookup/cpuset_rel
The race plays out like so between threads A and B:

1. A ref's cpuset 10
2. B does a lookup of cpuset 10, grabs the cpuset lock and searches
   cpuset_ids
3. A rel's cpuset 10 and observes the last ref, waits on the cpuset lock
   while B is still searching and not yet ref'd
4. B ref's cpuset 10 and drops the cpuset lock
5. A proceeds to free the cpuset out from underneath B

Resolve the race by only releasing the last reference under the cpuset lock.
Thread A now picks up the spinlock and observes that the cpuset has been
revived, returning immediately for B to deal with later.

Reported by:	syzbot+92dff413e201164c796b@syzkaller.appspotmail.com
Reviewed by:	markj
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D27498
2020-12-08 18:45:47 +00:00
Kyle Evans
9c83dab96c kern: cpuset: plug a unr leak
cpuset_rel_defer() is supposed to be functionally equivalent to
cpuset_rel() but with anything that might sleep deferred until
cpuset_rel_complete -- this setup is used specifically for cpuset_setproc.

Add in the missing unr free to match cpuset_rel. This fixes a leak that
was observed when I wrote a small userland application to try and debug
another issue, which effectively did:

cpuset(&newid);
cpuset(&scratch);

newid gets leaked when scratch is created; it's off the list, so there's
no mechanism for anything else to relinquish it. A more realistic reproducer
would likely be a process that inherits some cpuset that it's the only ref
for, but it creates a new one to modify. Alternatively, administratively
reassigning a process' cpuset that it's the last ref for will have the same
effect.

Discovered through D27498.

MFC after:	1 week
2020-12-08 18:44:06 +00:00
Mateusz Guzik
8fcfd0e222 vfs: add cleanup on error missed in r368375
Noted by:	jrtc27
2020-12-06 19:24:38 +00:00
Mateusz Guzik
60e2a0d9a4 vfs: factor buffer allocation/copyin out of namei 2020-12-06 04:59:24 +00:00
Mateusz Guzik
0c23d26230 vfs: keep bad ops on vnode reclaim
They were only modified to accomodate a redundant assertion.

This runs into problems as lockless lookup can still try to use the vnode
and crash instead of getting an error.

The bug was only present in kernels with INVARIANTS.

Reported by:	kevans
2020-12-05 05:56:23 +00:00
Konstantin Belousov
be2535b0a6 Add kern_ntp_adjtime(9).
Reviewed by:	brooks, cy
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D27471
2020-12-04 18:56:44 +00:00
Kyle Evans
34af05ead3 kern: soclose: don't sleep on SO_LINGER w/ timeout=0
This is a valid scenario that's handled in the various protocol layers where
it makes sense (e.g., tcp_disconnect and sctp_disconnect). Given that it
indicates we should immediately drop the connection, it makes little sense
to sleep on it.

This could lead to panics with INVARIANTS. On non-INVARIANTS kernels, this
could result in the thread hanging until a signal interrupts it if the
protocol does not mark the socket as disconnected for whatever reason.

Reported by:	syzbot+e625d92c1dd74e402c81@syzkaller.appspotmail.com
Reviewed by:	glebius, markj
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D27407
2020-12-04 04:39:48 +00:00
Mark Johnston
b957b18594 Always use 64-bit physical addresses for dump_avail[] in minidumps
As of r365978, minidumps include a copy of dump_avail[].  This is an
array of vm_paddr_t ranges.  libkvm walks the array assuming that
sizeof(vm_paddr_t) is equal to the platform "word size", but that's not
correct on some platforms.  For instance, i386 uses a 64-bit vm_paddr_t.

Fix the problem by always dumping 64-bit addresses.  On platforms where
vm_paddr_t is 32 bits wide, namely arm and mips (sometimes), translate
dump_avail[] to an array of uint64_t ranges.  With this change, libkvm
no longer needs to maintain a notion of the target word size, so get rid
of it.

This is a no-op on platforms where sizeof(vm_paddr_t) == 8.

Reviewed by:	alc, kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27082
2020-12-03 17:12:31 +00:00
Oleksandr Tymoshenko
18ce865a4f Add support for hw.physmem tunable for ARM/ARM64/RISC-V platforms
hw.physmem tunable allows to limit number of physical memory available to the
system. It's handled in machdep files for x86 and PowerPC. This patch adds
required logic to the consolidated physmem management interface that is used by
ARM, ARM64, and RISC-V.

Submitted by:	Klara, Inc.
Reviewed by:	mhorne
Sponsored by:	Ampere Computing
Differential Revision:	https://reviews.freebsd.org/D27152
2020-12-03 05:39:27 +00:00
Mateusz Guzik
10e64782ed select: make sure there are no wakeup attempts after selfdfree returns
Prior to the patch returning selfdfree could still be racing against doselwakeup
which set sf_si = NULL and now locks stp to wake up the other thread.

A sufficiently unlucky pair can end up going all the way down to freeing
select-related structures before the lock/wakeup/unlock finishes.

This started manifesting itself as crashes since select data started getting
freed in r367714.
2020-12-02 00:48:15 +00:00
Konstantin Belousov
6814c2dac5 lio_listio(2): send signal even if number of jobs is zero.
Right now, if lio registered zero jobs, syscall frees lio job
structure, cleaning up queued ksi.  As result, the realtime signal is
dequeued and never delivered.

Fix it by allowing sendsig() to copy ksi when job count is zero.

PR: 220398
Reported and reviewed by:	asomers
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D27421
2020-12-01 22:53:33 +00:00
Konstantin Belousov
2933165666 vfs_aio.c: style.
Mostly re-wrap conditions to split after binary ops.

Reviewed by:	asomers
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D27421
2020-12-01 22:46:51 +00:00
Konstantin Belousov
5c5005ec20 vfs_aio.c: correct comment.
Reviewed by:	asomers
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D27421
2020-12-01 22:30:32 +00:00
Mark Johnston
dad22308a1 vmem: Revert r364744
A pair of bugs are believed to have caused the hangs described in the
commit log message for r364744:

1. uma_reclaim() could trigger reclamation of the reserve of boundary
   tags used to avoid deadlock.  This was fixed by r366840.
2. The loop in vmem_xalloc() would in some cases try to allocate more
   boundary tags than the expected upper bound of BT_MAXALLOC.  The
   reserve is sized based on the value BT_MAXMALLOC, so this behaviour
   could deplete the reserve without guaranteeing a successful
   allocation, resulting in a hang.  This was fixed by r366838.

PR:		248008
Tested by:	rmacklem
2020-12-01 16:06:31 +00:00
Alexander V. Chernikov
8db8bebf1f Move inner loop logic out of sysctl_sysctl_next_ls().
Refactor sysctl_sysctl_next_ls():
* Move huge inner loop out of sysctl_sysctl_next_ls() into a separate
 non-recursive function, returning the next step to be taken.
* Update resulting node oid parts only on successful lookup
* Make sysctl_sysctl_next_ls() return boolean success/failure instead of errno,
 slightly simplifying logic

Reviewed by:	freqlabs
Differential Revision:	https://reviews.freebsd.org/D27029
2020-11-30 21:59:52 +00:00
Toomas Soome
93b18e3730 vt: if loader did pass the font via metadata, use it
The built in 8x16 font may be way too small with large framebuffer
resolutions, to improve readability, use loader provied font.
2020-11-30 11:45:47 +00:00
Toomas Soome
a4a10b37d4 Add VT driver for VBE framebuffer device
Implement vt_vbefb to support Vesa Bios Extensions (VBE) framebuffer with VT.
vt_vbefb is built based on vt_efifb and is assuming similar data for
initialization, use MODINFOMD_VBE_FB to identify the structure vbe_fb
in kernel metadata.

struct vbe_fb, is populated by boot loader, and is passed to kernel via
metadata payload.

Differential Revision:	https://reviews.freebsd.org/D27373
2020-11-30 08:22:40 +00:00
Matt Macy
2338da0373 Import kernel WireGuard support
Data path largely shared with the OpenBSD implementation by
Matt Dunwoodie <ncon@nconroy.net>

Reviewed by:	grehan@freebsd.org
MFC after:	1 month
Sponsored by:	Rubicon LLC, (Netgate)
Differential Revision:	https://reviews.freebsd.org/D26137
2020-11-29 19:38:03 +00:00
Konstantin Belousov
a9d4fe977a bio aio: Destroy ephemeral mapping before unwiring page.
Apparently some architectures, like ppc in its hashed page tables
variants, account mappings by pmap_qenter() in the response from
pmap_is_page_mapped().

While there, eliminate useless userp variable.

Noted and reviewed by:	alc (previous version)
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D27409
2020-11-29 10:30:56 +00:00
Alexander Motin
83f6b50123 Remove alignment requirements for KVA buffer mapping.
After r368124 pbuf_zone has extra page to handle this particular case.
2020-11-29 01:30:17 +00:00
Konstantin Belousov
cd85379104 Make MAXPHYS tunable. Bump MAXPHYS to 1M.
Replace MAXPHYS by runtime variable maxphys. It is initialized from
MAXPHYS by default, but can be also adjusted with the tunable kern.maxphys.

Make b_pages[] array in struct buf flexible.  Size b_pages[] for buffer
cache buffers exactly to atop(maxbcachebuf) (currently it is sized to
atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1.
The +1 for pbufs allow several pbuf consumers, among them vmapbuf(),
to use unaligned buffers still sized to maxphys, esp. when such
buffers come from userspace (*).  Overall, we save significant amount
of otherwise wasted memory in b_pages[] for buffer cache buffers,
while bumping MAXPHYS to desired high value.

Eliminate all direct uses of the MAXPHYS constant in kernel and driver
sources, except a place which initialize maxphys.  Some random (and
arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted
straight.  Some drivers, which use MAXPHYS to size embeded structures,
get private MAXPHYS-like constant; their convertion is out of scope
for this work.

Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs,
dev/siis, where either submitted by, or based on changes by mav.

Suggested by: mav (*)
Reviewed by:	imp, mav, imp, mckusick, scottl (intermediate versions)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D27225
2020-11-28 12:12:51 +00:00
Kyle Evans
e07e3fa3c9 kern: cpuset: drop the lock to allocate domainsets
Restructure the loop a little bit to make it a little more clear how it
really operates: we never allocate any domains at the beginning of the first
iteration, and it will run until we've satisfied the amount we need or we
encounter an error.

The lock is now taken outside of the loop to make stuff inside the loop
easier to evaluate w.r.t. locking.

This fixes it to not try and allocate any domains for the freelist under the
spinlock, which would have happened before if we needed any new domains.

Reported by:	syzbot+6743fa07b9b7528dc561@syzkaller.appspotmail.com
Reviewed by:	markj
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D27371
2020-11-28 01:21:11 +00:00
Mark Johnston
0c56925bc2 callout(9): Remove some leftover APM BIOS support
This code is obsolete since r366546.

Reviewed by:	imp
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27267
2020-11-27 20:46:02 +00:00
Konstantin Belousov
99c66d3acf vn_read_from_obj(): fix handling of doomed vnodes.
There is no reason why vp->v_object cannot be NULL. If it is, it's
fine, handle it by delegating to VOP_READ().

Tested by:	pho
Reviewed by:	markj, mjg
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D27327
2020-11-26 18:13:33 +00:00
Konstantin Belousov
164438a7b9 More careful handling of the mount failure.
- VFS_UNMOUNT() requires vn_start_write() around it [*].
- call VFS_PURGE() before unmount.
- do not destroy mp if cleanup unmount did not succeed.
- set MNTK_UNMOUNT, and indicate forced unmount with MNTK_UNMOUNTF
  for VFS_UNMOUNT() in cleanup.

PR:	251320 [*]
Reported by:	Tong Zhang <ztong0001@gmail.com>
Reviewed by:	markj, mjg
Discussed with:	rmacklem
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D27327
2020-11-26 18:08:42 +00:00
Konstantin Belousov
3b1f974bfb Make max ticks for pause in vn_lock_pair() adjustable at runtime.
Reduce default value from hz / 10 to hz / 100.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
2020-11-26 18:00:26 +00:00
Mateusz Guzik
b83e94be53 thread: staticize thread_reap and move td_allocdomain
thread_init is a much better fit as the the value is constant after
initialization.
2020-11-26 06:59:27 +00:00
Mateusz Guzik
2e51c2bfd1 pipe: follow up cleanup to previous
The commited patch was incomplete.

- add back missing goto retry, noted by jhb
- 'if (error)'  -> 'if (error != 0)'
- consistently do:

if (error != 0)
    break;
continue;

instead of:

if (error != 0)
    break;
else
    continue;

This adds some 'continue' uses which are not needed, but line up with the
rest of pipe_write.
2020-11-25 22:53:21 +00:00
Mateusz Guzik
c8df8543fd pipe: drop spurious pipeunlock/pipelock cycle on write 2020-11-25 21:41:23 +00:00
Kyle Evans
d431dea5ac kern: cpuset: properly rebase when attaching to a jail
The current logic is a fine choice for a system administrator modifying
process cpusets or a process creating a new cpuset(2), but not ideal for
processes attaching to a jail.

Currently, when a process attaches to a jail, it does exactly what any other
process does and loses any mask it might have applied in the process of
doing so because cpuset_setproc() is entirely based around the assumption
that non-anonymous cpusets in the process can be replaced with the new
parent set.

This approach slightly improves the jail attach integration by modifying
cpuset_setproc() callers to indicate if they should rebase their cpuset to
the indicated set or not (i.e. cpuset_setproc_update_set).

If we're rebasing and the process currently has a cpuset assigned that is
not the containing jail's root set, then we will now create a new base set
for it hanging off the jail's root with the existing mask applied instead of
using the jail's root set as the new base set.

Note that the common case will be that the process doesn't have a cpuset
within the jail root, but the system root can freely assign a cpuset from
a jail to a process outside of the jail with no restriction. We assume that
that may have happened or that it could happen due to a race when we drop
the proc lock, so we must recheck both within the loop to gather up
sufficient freed cpusets and after the loop.

To recap, here's how it worked before in all cases:

0     4 <-- jail              0      4 <-- jail / process
|                             |
1                 ->          1
|
3 <-- process

Here's how it works now:

0     4 <-- jail             0       4 <-- jail
|                            |       |
1                 ->         1       5 <-- process
|
3 <-- process

or

0     4 <-- jail             0       4 <-- jail / process
|                            |
1 <-- process     ->         1

More importantly, in both cases, the attaching process still retains the
mask it had prior to attaching or the attach fails with EDEADLK if it's
left with no CPUs to run on or the domain policy is incompatible. The
author of this patch considers this almost a security feature, because a MAC
policy could grant PRIV_JAIL_ATTACH to an unprivileged user that's
restricted to some subset of available CPUs the ability to attach to a jail,
which might lift the user's restrictions if they attach to a jail with a
wider mask.

In most cases, it's anticipated that admins will use this to be able to,
for example, `cpuset -c -l 1 jail -c path=/ command=/long/running/cmd`,
and avoid the need for contortions to spawn a command inside a jail with a
more limited cpuset than the jail.

Reviewed by:	jamie
MFC after:	1 month (maybe)
Differential Revision:	https://reviews.freebsd.org/D27298
2020-11-25 03:14:25 +00:00
Kyle Evans
30b7c6f977 kern: cpuset: rename _cpuset_create() to cpuset_init()
cpuset_init() is better descriptor for what the function actually does. The
name was previously taken by a sysinit that setup cpuset_zero's mask
from all_cpus, it was removed in r331698 before stable/12 branched.

A comment referencing the removed sysinit has now also been removed, since
the setup previously done was moved into cpuset_thread0().

Suggested by:	markj
MFC after:	1 week
2020-11-25 02:12:24 +00:00
Kyle Evans
29d04ea8c3 kern: cpuset: allow cpuset_create() to take an allocated *setp
Currently, it must always allocate a new set to be used for passing to
_cpuset_create, but it doesn't have to. This is purely kern_cpuset.c
internal and it's sparsely used, so just change it to use *setp if it's
not-NULL and modify the two consumers to pass in the address of a NULL
cpuset.

This paves the way for consumers that want the unr allocation without the
possibility of sleeping as long as they've done their due diligence to
ensure that the mask will properly apply atop the supplied parent
(i.e. avoiding the free_unr() in the last failure path).

Reviewed by:	jamie, markj
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D27297
2020-11-25 01:42:32 +00:00
Kyle Evans
c7ef3490e2 kern: never restart syscalls calling closefp(), e.g. close(2)
All paths leading into closefp() will either replace or remove the fd from
the filedesc table, and closefp() will call fo_close methods that can and do
currently sleep without regard for the possibility of an ERESTART. This can
be dangerous in multithreaded applications as another thread could have
opened another file in its place that is subsequently operated on upon
restart.

The following are seemingly the only ones that will pass back ERESTART
in-tree:
- sockets (SO_LINGER)
- fusefs
- nfsclient

Reviewed by:	jilles, kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D27310
2020-11-25 01:08:57 +00:00
Cy Schubert
e5a307c6ac Fix a typo in a comment.
MFC after:	3 days
2020-11-24 06:42:32 +00:00
Mateusz Guzik
f90d57b808 locks: push lock_delay_arg_init calls down
Minor cleanup to skip doing them when recursing on locks and so that
they can act on found lock value if need be.
2020-11-24 03:49:37 +00:00
Mateusz Guzik
094c148b7a sx: drop spurious volatile keyword 2020-11-24 03:48:44 +00:00
Mateusz Guzik
598f2b8116 dtrace: stop using eventhandlers for the part compiled into the kernel
Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D27311
2020-11-23 18:27:21 +00:00
Mateusz Guzik
a9568cd2bc thread: stash domain id to work around vtophys problems on ppc64
Adding to zombie list can be perfomed by idle threads, which on ppc64 leads to
panics as it requires a sleepable lock.

Reported by:	alfredo
Reviewed by:	kib, markj
Fixes:	r367842 ("thread: numa-aware zombie reaping")
Differential Revision:	https://reviews.freebsd.org/D27288
2020-11-23 18:26:47 +00:00
Konstantin Belousov
87a9b18d22 Provide ABI modules hooks for process exec/exit and thread exit.
Exec and exit are same as corresponding eventhandler hooks.

Thread exit hook is called somewhat earlier, while thread is still
owned by the process and enough context is available.  Note that the
process lock is owned when the hook is called.

Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D27309
2020-11-23 17:29:25 +00:00
Edward Tomasz Napierala
9c8c797c1a Remove the 'wantparent' variable, unused since r145004.
Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D27193
2020-11-23 12:47:23 +00:00
Kyle Evans
dac521ebcf cpuset_setproc: use the appropriate parent for new anonymous sets
As far as I can tell, this has been the case since initially committed in
2008.  cpuset_setproc is the executor of cpuset reassignment; note this
excerpt from the description:

* 1) Set is non-null.  This reparents all anonymous sets to the provided
*    set and replaces all non-anonymous td_cpusets with the provided set.

However, reviewing cpuset_setproc_setthread() for some jail related work
unearthed the error: if tdset was not anonymous, we were replacing it with
`set`. If it was anonymous, then we'd rebase it onto `set` (i.e. copy the
thread's mask over and AND it with `set`) but give the new anonymous set
the original tdset as the parent (i.e. the base of the set we're supposed to
be leaving behind).

The primary visible consequences were that:

1.) cpuset_getid() following such assignment returns the wrong result, the
    setid that we left behind rather than the one we joined.
2.) When a process attached to the jail, the base set of any anonymous
    threads was a set outside of the jail.

This was initially bundled in D27298, but it's a minor fix that's fairly
easy to verify the correctness of.

A test is included in D27307 ("badparent"), which demonstrates the issue
with, effectively:

osetid = cpuset_getid()
newsetid = cpuset()
cpuset_setaffinity(thread)
cpuset_setid(osetid)
cpuset_getid(thread) -> observe that it matches newsetid instead of osetid.

MFC after:	1 week
2020-11-23 02:49:53 +00:00
Kyle Evans
60e60e73fd freebsd32: take the _umtx_op struct definitions back
Providing these in freebsd32.h facilitates local testing/measuring of the
structs rather than forcing one to locally recreate them. Sanity checking
offsets/sizes remains in kern_umtx.c where these are typically used.
2020-11-23 00:58:14 +00:00
Kyle Evans
f96078b8fe kern: dup: do not assume oldfde is valid
oldfde may be invalidated if the table has grown due to the operation that
we're performing, either via fdalloc() or a direct fdgrowtable_exp().

This was technically OK before rS367927 because the old table remained valid
until the filedesc became unused, but now it may be freed immediately if
it's an unshared table in a single-threaded process, so it is no longer a
good assumption to make.

This fixes dup/dup2 invocations that grow the file table; in the initial
report, it manifested as a kernel panic in devel/gmake's configure script.

Reported by:	Guy Yur <guyyur gmail com>
Reviewed by:	rew
Differential Revision:	https://reviews.freebsd.org/D27319
2020-11-23 00:33:06 +00:00
Kyle Evans
e0cb5b2a77 [2/2] _umtx_op: introduce 32-bit/i386 flags for operations
This patch takes advantage of the consolidation that happened to provide two
flags that can be used with the native _umtx_op(2): UMTX_OP___32BIT and
UMTX_OP__I386.

UMTX_OP__32BIT iindicates that we are being provided with 32-bit structures.
Note that this flag alone indicates a 64bit time_t, since this is the
majority case.

UMTX_OP__I386 has been provided so that we can emulate i386 as well,
regardless of whether the host is amd64 or not.

Both imply a different set of copyops in sysumtx_op. freebsd32__umtx_op
simply ignores the flags, since it's already doing a 32-bit operation and
it's unlikely we'll be running an emulator under compat32. Future work
could consider it, but the author sees little benefit.

This will be used by qemu-bsd-user to pass on all _umtx_op calls to the
native interface as long as the host/target endianness matches, effectively
eliminating most if not all of the remaining unresolved deadlocks for most.

This version changed a fair amount from what was under review, mostly in
response to refactoring of the prereq reorganization and battle-testing
it with qemu-bsd-user.  The main changes are as follows:

1.) The i386 flag got renamed to omit '32BIT' since this is redundant.
2.) The flags are now properly handled on 32-bit platforms to emulate other
    32-bit platforms.
3.) Robust list handling was fixed, and the 32-bit functionality that was
    previously gated by COMPAT_FREEBSD32 is now unconditional.
4.) Robust list handling was also improved, including the error reported
    when a process has already registered 32-bit ABI lists and also
    detecting if native robust lists have already been registered. Both
    scenarios now return EBUSY rather than EINVAL, because the input is
    technically valid but we're too busy with another ABI's lists.

libsysdecode/kdump/truss support will go into review soon-ish, along with
the associated manpage update.

Reviewed by:	kib (earlier version)
MFC after:	3 weeks
2020-11-22 05:47:45 +00:00
Kyle Evans
15eaec6a5c _umtx_op: move compat32 definitions back in
These are reasonably compact, and a future commit will blur the compat32
lines by supporting 32-bit operations with the native _umtx_op.
2020-11-22 05:34:51 +00:00
Robert Wing
3c85ca21d1 fd: free old file descriptor tables when not shared
During the life of a process, new file descriptor tables may be allocated. When
a new table is allocated, the old table is placed in a free list and held onto
until all processes referencing them exit.

When a new file descriptor table is allocated, the old file descriptor table
can be freed when the current process has a single-thread and the file
descriptor table is not being shared with any other processes.

Reviewed by:    kevans
Approved by:    kevans (mentor)
Differential Revision:  https://reviews.freebsd.org/D18617
2020-11-22 05:00:28 +00:00
Konstantin Belousov
e68c619144 Stop using eventhandlers for itimers subsystem exec and exit hooks.
While there, do some minor cleanup for kclocks.  They are only
registered from kern_time.c, make registration function static.
Remove event hooks, they are not used by both registered kclocks.
Add some consts.

Perhaps we can stop registering kclocks at all and statically
initialize them.

Reviewed by:	mjg
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D27305
2020-11-21 21:43:36 +00:00
Konstantin Belousov
5a2a4551f5 Remove unused prototype.
Missed part of r367918.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2020-11-21 10:58:19 +00:00
Konstantin Belousov
74a093eb98 Stop using eventhandler to invoke umtx_exec hook.
There is no point in dynamic registration, umtx hook is there always.

Reviewed by:	mjg
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D27303
2020-11-21 10:32:40 +00:00
Kirk McKusick
e75f0f2b48 Only attempt a VOP_UNLOCK() when the vn_lock() has been successful.
No MFC as this code is not present in 12-stable.

Reported by:  Peter Holm
Reviewed by:  Mateusz Guzik
Tested by:    Peter Holm
Sponsored by: Netflix
2020-11-20 20:22:01 +00:00
Michal Meloun
d9de80d614 Also pass interrupt binding request to non-root interrupt controllers.
There are message based controllers that can bind interrupts even if they are
not implemented as root controllers (such as the ITS subblock of GIC).

MFC after:	3 weeks
2020-11-20 09:05:36 +00:00
Mateusz Guzik
f9fe7b28bc pipe: thundering herd problem in pipelock
All reads and writes are serialized with a hand-rolled lock, but unlocking it
always wakes up all waiters. Existing flag fields get resized to make room for
introduction of waiter counter without growing the struct.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D27273
2020-11-19 19:25:47 +00:00
Mark Johnston
a33fef5e25 callout(9): Fix a race between CPU migration and callout_drain()
Suppose a running callout re-arms itself, and before the callout
finishes running another CPU calls callout_drain() and goes to sleep.
softclock_call_cc() will wake up the draining thread, which may not run
immediately if there is a lot of CPU load.  Furthermore, the callout is
still in the callout wheel so it can continue to run and re-arm itself.
Then, suppose that the callout migrates to another CPU before the
draining thread gets a chance to run.  The draining thread is in this
loop in _callout_stop_safe():

	while (cc_exec_curr(cc) == c) {
		CC_UNLOCK(cc);
		sleep();
		CC_LOCK(cc);
	}

but after the migration, cc points to the wrong CPU's callout state.
Then the draining thread goes off and removes the callout from the
wheel, but does so using the wrong lock and per-CPU callout state.

Fix the problem by doing a re-lookup of the callout CPU after sleeping.

Reported by:	syzbot+79569cd4d76636b2cc1c@syzkaller.appspotmail.com
Reported by:	syzbot+1b27e0237aa22d8adffa@syzkaller.appspotmail.com
Reported by:	syzbot+e21aa5b85a9aff90ef3e@syzkaller.appspotmail.com
Reviewed by:	emaste, hselasky
Tested by:	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27266
2020-11-19 18:37:28 +00:00
Mitchell Horne
c8a96cdcd9 Add an option for entering KDB on recursive panics
There are many cases where one would choose avoid entering the debugger
on a normal panic, opting instead to reboot and possibly save a kernel
dump. However, recursive kernel panics are an unusual case that might
warrant attention from a human, so provide a secondary tunable,
debug.debugger_on_recursive_panic, to allow entering the debugger only
when this occurs.

For for simplicity in maintaining existing behaviour, the tunable
defaults to zero.

Reviewed by:	cem, markj
Sponsored by:	NetApp, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D27271
2020-11-19 18:03:40 +00:00
Mateusz Guzik
d116b9f1ad thread: numa-aware zombie reaping
The current global list is a significant problem, in particular induces a lot
of cross-domain thread frees. When running poudriere on a 2 domain box about
half of all frees were of that nature.

Patch below introduces per-domain thread data containing zombie lists and
domain-aware reaping. By default it only reaps from the current domain, only
reaping from others if there is free TID shortage.

A dedicated callout is introduced to reap lingering threads if there happens
to be no activity.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D27185
2020-11-19 10:00:48 +00:00
Mateusz Guzik
b8cb628534 pipe: tidy up pipelock 2020-11-19 08:16:45 +00:00
Mateusz Guzik
89744405e6 pipe: allow for lockless pipe_stat
pipes get stated all thet time and this avoidably contributed to contention.
The pipe lock is only held to accomodate MAC and to check the type.

Since normally there is no probe for pipe stat depessimize this by having the
flag.

The pipe_state field gets modified with locks held all the time and it's not
feasible to convert them to use atomic store. Move the type flag away to a
separate variable as a simple cleanup and to provide stable field to read.
Use short for both fields to avoid growing the struct.

While here short-circuit MAC for pipe_poll as well.
2020-11-19 06:30:25 +00:00
Mateusz Guzik
2f5b0b48ac cred: fix minor nits in r367695
Noted by:	jhb
2020-11-19 04:28:39 +00:00
Mateusz Guzik
c48f897bbe smp: fix smp_rendezvous_cpus_retry usage before smp starts
Since none of the other CPUs are running there is nobody to clear their
entries and the routine spins indefinitely.
2020-11-19 04:27:51 +00:00
Mark Johnston
a28c28e6ef Remove NO_EVENTTIMERS support
The arm configs that required it have been removed from the tree.
Removing this option makes the callout code easier to read and
discourages developers from adding new configs without eventtimer
drivers.

Reviewed by:	ian, imp, mav
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27270
2020-11-19 02:50:48 +00:00
Mariusz Zaborski
f488d5b797 Add CTLFLAG_MPSAFE to the suser_enabled sysctl.
Pointed out by:	mjg
2020-11-18 21:26:14 +00:00
Mariusz Zaborski
05e1e482c7 jail: introduce per jail suser_enabled setting
The suser_enable sysctl allows to remove a privileged rights from uid 0.
This change introduce per jail setting which allow to make root a
normal user.

Reviewed by:	jamie
Previous version reviewed by:	kevans, emaste, markj, me_igalic.co
Discussed with:	pjd
Differential Revision:	https://reviews.freebsd.org/D27128
2020-11-18 21:07:08 +00:00
Mariusz Zaborski
21fe9441e1 Fix style nits. 2020-11-18 20:59:58 +00:00
John Baldwin
5335f6434b Fix a few nits in vn_printf().
- Mask out recently added VV_* bits to avoid printing them twice.

- Keep VI_LOCKed on the same line as the rest of the flags.

Reviewed by:	kib
Obtained from:	CheriBSD
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D27261
2020-11-18 16:21:37 +00:00
Kyle Evans
27a9392d54 _umtx_op: fix robust lists after r367744
A copy-pasto left us copying in 24-bytes at the address of the rb pointer
instead of the intended target.

Reported by:	sigsys@gmail.com
Sighing:	kevans
2020-11-18 03:30:31 +00:00
Conrad Meyer
f8f74aaa84 linux(4) clone(2): Correctly handle CLONE_FS and CLONE_FILES
The two flags are distinct and it is impossible to correctly handle clone(2)
without the assistance of fork1().  This change depends on the pwddesc split
introduced in r367777.

I've added a fork_req flag, FR2_SHARE_PATHS, which indicates that p_pd
should be treated the opposite way p_fd is (based on RFFDG flag).  This is a
little ugly, but the benefit is that existing RFFDG API is preserved.
Holding FR2_SHARE_PATHS disabled, RFFDG indicates both p_fd and p_pd are
copied, while !RFFDG indicates both should be cloned.

In Chrome, clone(2) is used with CLONE_FS, without CLONE_FILES, and expects
independent fd tables.

The previous conflation of CLONE_FS and CLONE_FILES was introduced in
r163371 (2006).

Discussed with:	markj, trasz (earlier version)
Differential Revision:	https://reviews.freebsd.org/D27016
2020-11-17 21:20:11 +00:00
Conrad Meyer
85078b8573 Split out cwd/root/jail, cmask state from filedesc table
No functional change intended.

Tracking these structures separately for each proc enables future work to
correctly emulate clone(2) in linux(4).

__FreeBSD_version is bumped (to 1300130) for consumption by, e.g., lsof.

Reviewed by:	kib
Discussed with:	markj, mjg
Differential Revision:	https://reviews.freebsd.org/D27037
2020-11-17 21:14:13 +00:00
Conrad Meyer
ede4af47ae unix(4): Enhance LOCAL_CREDS_PERSISTENT ABI
As this ABI is still fresh (r367287), let's correct some mistakes now:

- Version the structure to allow for future changes
- Include sender's pid in control message structure
- Use a distinct control message type from the cmsgcred / sockcred mess

Discussed with:	kib, markj, trasz
Differential Revision:	https://reviews.freebsd.org/D27084
2020-11-17 20:01:21 +00:00
Conrad Meyer
de774e422e linux(4): Implement name_to_handle_at(), open_by_handle_at()
They are similar to our getfhat(2) and fhopen(2) syscalls.

Differential Revision:	https://reviews.freebsd.org/D27111
2020-11-17 19:51:47 +00:00
Kyle Evans
bd4bcd14e3 Fix !COMPAT_FREEBSD32 kernel build
One of the last shifts inadvertently moved these static assertions out of a
COMPAT_FREEBSD32 block, which the relevant definitions are limited to.

Fix it.

Pointy hat:	kevans
2020-11-17 04:22:10 +00:00
Kyle Evans
63ecb272a0 umtx_op: reduce redundancy required for compat32
All of the compat32 variants are substantially the same, save for
copyin/copyout (mostly). Apply the same kind of technique used with kevent
here by having the syscall routines supply a umtx_copyops describing the
operations needed.

umtx_copyops carries the bare minimum needed- size of timespec and
_umtx_time are used for determining if copyout is needed in the sem2_wait
case.

Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D27222
2020-11-17 03:36:58 +00:00
Kyle Evans
4be0a1b587 _umtx_op: fix a compat32 bug in UMTX_OP_NWAKE_PRIVATE
Specifically, if we're waking up some value n > BATCH_SIZE, then the
copyin(9) is wrong on the second iteration due to upp being the wrong type.
upp is currently a uint32_t**, so upp + pos advances it by twice as many
elements as it should (host pointer size vs. compat32 pointer size).

Fix it by just making upp a uint32_t*; it's still technically a double
pointer, but the distinction doesn't matter all that much here since we're
just doing arithmetic on it.

Add a test case that demonstrates the problem, placed with the libthr tests
since one messing with _umtx_op should be running these tests. Running under
compat32, the new test case will hang as threads after the first 128 get
missed in the wake. it's not immediately clear how to hit it in practice,
since pthread_cond_broadcast() uses a smaller (sleepq batch?) size observed
to be around ~50 -- I did not spend much time digging into it.

The uintptr_t change makes no functional difference, but i've tossed it in
since it's more accurate (semantically).

Reported by:	Andrew Gierth (andrew_tao173.riddles.org.uk, inspection)
Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D27231
2020-11-17 03:34:01 +00:00
Konstantin Belousov
cb596eea82 vmem: trivial warning and style fixes.
Add __unused to some args.
Change type of the iterator variables to match loop control.
Remove excessive {}.

Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D27220
2020-11-17 02:18:34 +00:00
Mateusz Guzik
1a7bb89629 cpuset: refcount-clean 2020-11-17 00:04:05 +00:00
Mateusz Guzik
89deca0a33 malloc: make malloc_large closer to standalone
This moves entire large alloc handling out of all consumers, apart from
deciding to go there.

This is a step towards creating a fast path.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D27198
2020-11-16 17:56:58 +00:00
Mateusz Guzik
19d3e47dca select: call seltdfini on process and thread exit
Since thread_zone is marked NOFREE the thread_fini callback is never
executed, meaning memory allocated by seltdinit is never released.

Adding the call to thread_dtor is not sufficient as exiting processes
cache the main thread.
2020-11-16 03:12:21 +00:00
Mateusz Guzik
31b2ac4b5a select: replace reference counting with memory barriers in selfd
Refcounting was added to combat a race between selfdfree and doselwakup,
but it adds avoidable overhead.

selfdfree detects it can free the object by ->sf_si == NULL, thus we can
ensure that the condition only holds after all accesses are completed.
2020-11-16 03:09:18 +00:00
Mateusz Guzik
b77594bbbf sched: fix an incorrect comparison in sched_lend_user_prio_cond
Compare with sched_lend_user_prio.
2020-11-15 01:54:44 +00:00
Mateusz Guzik
f34a2f56c3 thread: batch credential freeing 2020-11-14 19:22:02 +00:00
Mateusz Guzik
fb8ab68084 thread: batch resource limit free calls 2020-11-14 19:21:46 +00:00
Mateusz Guzik
5ef7b7a0f3 thread: rework tid batch to use helpers 2020-11-14 19:20:58 +00:00
Mateusz Guzik
d1ca25be49 thread: pad tid lock
On a kernel with other changes this bumps 104-way thread creation/destruction
from 0.96 mln ops/s to 1.1 mln ops/s.
2020-11-14 19:19:27 +00:00
Mateusz Guzik
9b9bb9ffa5 malloc: retire MALLOC_PROFILE
The global array has prohibitive performance impact on multicore systems.

The same data (and more) can be obtained with dtrace.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D27199
2020-11-13 19:22:53 +00:00
Konstantin Belousov
441eb16a95 Allow some VOPs to return ERELOOKUP to indicate VFS operation restart at top level.
Restart syscalls and some sync operations when filesystem indicated
ERELOOKUP condition, mostly for VOPs operating on metdata.  In
particular, lookup results cached in the inode/v_data is no longer
valid and needs recalculating.  Right now this should be nop.

Assert that ERELOOKUP is catched everywhere and not returned to
userspace, by asserting that td_errno != ERELOOKUP on syscall return
path.

In collaboration with:	pho
Reviewed by:	mckusick (previous version), markj
Tested by:	markj (syzkaller), pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D26136
2020-11-13 09:42:32 +00:00
Konstantin Belousov
7cde2ec4fd Implement vn_lock_pair().
In collaboration with:	pho
Reviewed by:	mckusick (previous version), markj (previous version)
Tested by:	markj (syzkaller), pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D26136
2020-11-13 09:31:57 +00:00
Mateusz Guzik
9aa6d792b5 malloc: retire malloc_last_fail
The routine does not serve any practical purpose.

Memory can be allocated in many other ways and most consumers pass the
M_WAITOK flag, making malloc not fail in the first place.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D27143
2020-11-12 20:22:58 +00:00
Mateusz Guzik
62dbc992ad thread: move nthread management out of tid_alloc
While this adds more work single-threaded, it also enables SMP-related
speed ups.
2020-11-12 00:29:23 +00:00
Kyle Evans
38033780a3 umtx: drop incorrect timespec32 definition
This works for amd64, but none others -- drop it, because we already have a
proper definition in sys/compat/freebsd32/freebsd32.h that correctly uses
time32_t.

MFC after:	1 week
2020-11-11 22:35:23 +00:00
Mateusz Guzik
755341df4f thread: batch tid_free calls in thread_reap
This eliminates the highly pessimal pattern of relocking from multiple
CPUs in quick succession. Note this is still globally serialized.
2020-11-11 18:45:06 +00:00
Mateusz Guzik
c5315f5196 thread: lockless zombie list manipulation
This gets rid of the most contended spinlock seen when creating/destroying
threads in a loop. (modulo kstack)

Tested by:	alfredo (ppc64), bdragon (ppc64)
2020-11-11 18:43:51 +00:00
Mark Johnston
f52979098d Fix a pair of races in SIGIO registration
First, funsetownlst() list looks at the first element of the list to see
whether it's processing a process or a process group list.  Then it
acquires the global sigio lock and processes the list.  However, nothing
prevents the first sigio tracker from being freed by a concurrent
funsetown() before the sigio lock is acquired.

Fix this by acquiring the global sigio lock immediately after checking
whether the list is empty.  Callers of funsetownlst() ensure that new
sigio trackers cannot be added concurrently.

Second, fsetown() uses funsetown() to remove an existing sigio structure
from a file object.  However, funsetown() uses a racy check to avoid the
sigio lock, so two threads may call fsetown() on the same file object,
both observe that no sigio tracker is present, and enqueue two sigio
trackers for the same file object.  However, if the file object is
destroyed, funsetown() will only remove one sigio tracker, and
funsetownlst() may later trigger a use-after-free when it clears the
file object reference for each entry in the list.

Fix this by introducing funsetown_locked(), which avoids the racy check.

Reviewed by:	kib
Reported by:	pho
Tested by:	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27157
2020-11-11 13:44:27 +00:00
Mateusz Guzik
26007fe37c thread: add more fine-grained tidhash locking
Note this still does not scale but is enough to move it out of the way
for the foreseable future.

In particular a trivial benchmark spawning/killing threads stops contesting
on tidhash.
2020-11-11 08:51:04 +00:00
Mateusz Guzik
aae3547be3 thread: rework tidhash vs proc lock interaction
Apart from minor clean up this gets rid of proc unlock/lock cycle on thread
exit to work around LOR against tidhash lock.
2020-11-11 08:50:04 +00:00
Mateusz Guzik
cf31cadeb6 thread: fix thread0 tid allocation
Startup code hardcodes the value instead of allocating it.
The first spawned thread would then be a duplicate.

Pointy hat:	mjg
2020-11-11 08:48:43 +00:00
Mateusz Guzik
40aad3e477 thread: tidy up r367543
"locked" variable is spurious in the committed version.
2020-11-10 21:29:10 +00:00
Mateusz Guzik
5c5ca843b7 Allow rtprio_thread to operate on threads of any process
This in particular unbreaks rtkit.

The limitation was a leftover of previous state, to quote a
comment:

/*
 * Though lwpid is unique, only current process is supported
 * since there is no efficient way to look up a LWP yet.
 */

Long since then a global tid hash was introduced to remedy
the problem.

Permission checks still apply.

Submitted by:	greg_unrelenting.technology (Greg V)
Differential Revision:	https://reviews.freebsd.org/D27158
2020-11-10 18:10:50 +00:00
Mateusz Guzik
5c100123a3 thread: retire thread_find
tdfind should be used instead.
2020-11-10 01:57:48 +00:00
Mateusz Guzik
f837888a3e thread: use tdfind in sysctl_kern_proc_kstack
This treads linear scans for locked lookup, but more importantly removes
the only consumer of thread_find.
2020-11-10 01:57:19 +00:00
Mateusz Guzik
94275e3e69 threads: remove the unused TID_BUFFER_SIZE macro 2020-11-10 01:31:06 +00:00
Mateusz Guzik
934e7e5ec9 thread: adds newer bits for r367537
The committed patch was an older version.
2020-11-10 01:13:58 +00:00
Mateusz Guzik
35bb59edc5 threads: reimplement tid allocation on top of a bitmap
There are workloads with very bursty tid allocation and since unr tries very
hard to have small-sized bitmaps it keeps reallocating memory. Just doing
buildkernel gives almost 150k calls to free coming from unr.

This also gets rid of the hack which tried to postpone TID reuse.

Reviewed by:	kib, markj
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D27101
2020-11-09 23:05:28 +00:00
Mateusz Guzik
1bd3cf5de5 threads: introduce a limit for total number
The intent is to replace the current id allocation method and a known upper
bound will be useful.

Reviewed by:	kib (previous version), markj (previous version)
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D27100
2020-11-09 23:04:30 +00:00
Mateusz Guzik
f6dd1aefb7 vfs: group mount per-cpu vars into one struct
While here move frequently read stuff into the same cacheline.

This shrinks struct mount by 64 bytes.

Tested by:	pho
2020-11-09 23:02:13 +00:00
Mateusz Guzik
f0c90a0931 malloc: provide 384 byte zone
Total page count after buildworld on ZFS for 384 (if present) and 512 zones:
before: 29713
after: 25946

per-zone page use:
vm.uma.malloc_384.keg.domain.1.pages: 11621
vm.uma.malloc_384.keg.domain.0.pages: 11597
vm.uma.malloc_512.keg.domain.1.pages: 1280
vm.uma.malloc_512.keg.domain.0.pages: 1448

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D27145
2020-11-09 22:59:41 +00:00
Mateusz Guzik
8e6526e966 malloc: retire mt_stats_zone in favor of pcpu_zone_64
Reviewed by:	markj, imp
Differential Revision:	https://reviews.freebsd.org/D27142
2020-11-09 22:58:29 +00:00
Mateusz Guzik
3a440a421d Add more per-cpu zones.
This covers powers of 2 up to 64.

Example pending user is ZFS.
2020-11-09 00:34:23 +00:00
Mateusz Guzik
523d66730c procdesc: convert the zone to a malloc type
The object is 128 bytes in size.
2020-11-09 00:05:21 +00:00
Mateusz Guzik
e90afaa015 kqueue: save space by using only one func pointer for assertions 2020-11-09 00:04:35 +00:00
Edward Tomasz Napierala
a1bd83fede Move syscall_thread_{enter,exit}() into the slow path. This is only
needed for syscalls from unloadable modules.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	EPSRC
Differential Revision:	https://reviews.freebsd.org/D26988
2020-11-08 15:54:59 +00:00
Kyle Evans
8c28aa5e45 imgact_binmisc: limit the extent of match on incoming entries
imgact_binmisc matches magic/mask from imgp->image_header, which is only a
single page in size mapped from the first page of an image. One can specify
an interpreter that matches on, e.g., --offset 4096 --size 256 to read up to
256 bytes past the mapped first page.

The limitation is that we cannot specify a magic string that exceeds a
single page, and we can't allow offset + size to exceed a single page
either.  A static assert has been added in case someone finds it useful to
try and expand the size, but it does seem a little unlikely.

While this looks kind of exploitable at a sideways squinty-glance, there are
a couple of mitigating factors:

1.) imgact_binmisc is not enabled by default,
2.) entries may only be added by the superuser,
3.) trying to exploit this information to read what's mapped past the end
  would be worse than a root canal or some other relatably painful
  experience, and
4.) there's no way one could pull this off without it being completely
  obvious.

The first page is mapped out of an sf_buf, the implementation of which (or
lack thereof) depends on your platform.

MFC after:	1 week
2020-11-08 04:24:29 +00:00
Michael Tuexen
f908d8247e The ioctl() calls using FIONREAD, FIONWRITE, FIONSPACE, and SIOCATMARK
access the socket send or receive buffer. This is not possible for
listening sockets since r319722.
Because send()/recv() calls fail on listening sockets, fail also ioctl()
indicating EINVAL.

PR:			250366
Reported by:		Yong-Hao Zou
Reviewed by:		glebius, rscheff
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D26897
2020-11-07 21:17:49 +00:00
Kyle Evans
1024ef27fe imgact_binmisc: move some calculations out of the exec path
The offset we need to account for in the interpreter string comes in two
variants:

1. Fixed - macros other than #a that will not vary from invocation to
   invocation
2. Variable - #a, which is substitued with the argv0 that we're replacing

Note that we don't have a mechanism to modify an existing entry.  By
recording both of these offset requirements when the interpreter is added,
we can avoid some unnecessary calculations in the exec path.

Most importantly, we can know up-front whether we need to grab
calculate/grab the the filename for this interpreter. We also get to avoid
walking the string a first time looking for macros. For most invocations,
it's a swift exit as they won't have any, but there's no point entering a
loop and searching for the macro indicator if we already know there will not
be one.

While we're here, go ahead and only calculate the argv0 name length once per
invocation. While it's unlikely that we'll have more than one #a, there's no
reason to recalculate it every time we encounter an #a when it will not
change.

I have not bothered trying to benchmark this at all, because it's arguably a
minor and straightforward/obvious improvement.

MFC after:	1 week
2020-11-07 18:07:55 +00:00
Mateusz Guzik
42e7abd5db rms: several cleanups + debug read lockers handling
This adds a dedicated counter updated with atomics when INVARIANTS
is used. As a side effect one can reliably determine the lock is held
for reading by at least one thread, but it's still not possible to
find out whether curthread has the lock in said mode.

This should be good enough in practice.

Problem spotted by avg.
2020-11-07 16:57:53 +00:00
Kyle Evans
ecb4fdf943 imgact_binmisc: reorder members of struct imgact_binmisc_entry (NFC)
This doesn't change anything at the moment since the out-of-order elements
were a pair of uint32_t, but future additions may have caused unnecessary
padding by following the existing precedent.

MFC after:	1 week
2020-11-07 16:41:59 +00:00
Michal Meloun
eb20867f52 Add a method to determine whether given interrupt is per CPU or not.
MFC after:	2 weeks
2020-11-07 14:58:01 +00:00
Edward Tomasz Napierala
da45ea6bc6 Move TDB_USERWR check under 'if (traced)'.
If we hadn't been traced in the first place when syscallenter()
started executing, we can ignore TDB_USERWR.  TDB_USERWR can get set,
sure, but if it does, it's because the debugger raced with the syscall,
and it cannot depend on winning that race.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	EPSRC
Differential Revision:	https://reviews.freebsd.org/D26585
2020-11-07 13:09:51 +00:00
Kyle Evans
2192cd125f imgact_binmisc: abstract away the list lock (NFC)
This module handles relatively few execs (initial qemu-user-static, then
qemu-user-static handles exec'ing itself for binaries it's already running),
but all execs pay the price of at least taking the relatively expensive
sx/slock to check for a match when this module is loaded. Future work will
almost certainly swap this out for another lock, perhaps an rmslock.

The RLOCK/WLOCK phrasing was chosen based on what the callers are really
wanting, rather than using the verbiage typically appropriate for an sx.

MFC after:	1 week
2020-11-07 05:10:46 +00:00
Kyle Evans
7d3ed9777a imgact_binmisc: validate flags coming from userland
We may want to reserve bits in the future for kernel-only use, so start
rejecting any that aren't the two that we're currently expecting from
userland.

MFC after:	1 week
2020-11-07 04:10:23 +00:00
Kyle Evans
7667824ade epoch: support non-preemptible epochs checking in_epoch()
Previously, non-preemptible epochs could not check; in_epoch() would always
fail, usually because non-preemptible epochs don't imply THREAD_NO_SLEEPING.

For default epochs, it's easy enough to verify that we're in the given
epoch: if we're in a critical section and our record for the given epoch
is active, then we're in it.

This patch also adds some additional INVARIANTS bookkeeping. Notably, we set
and check the recorded thread in epoch_enter/epoch_exit to try and catch
some edge-cases for the caller. It also checks upon freeing that none of the
records had a thread in the epoch, which may make it a little easier to
diagnose some improper use if epoch_free() took place while some other
thread was inside.

This version differs slightly from what was just previously reviewed by the
below-listed, in that in_epoch() will assert that no CPU has this thread
recorded even if it *is* currently in a critical section. This is intended
to catch cases where the caller might have somehow messed up critical
section nesting, we can catch both if they exited the critical section or if
they exited, migrated, then re-entered (on the wrong CPU).

Reviewed by:	kib, markj (both previous version)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D27098
2020-11-07 03:29:04 +00:00
Kyle Evans
80083216cb imgact_binmisc: minor re-organization of imgact_binmisc_exec exits
Notably, streamline error paths through the existing 'done' label, making it
easier to quickly verify correct cleanup.

Future work might add a kernel-only flag to indicate that a interpreter uses
#a. Currently, all executions via imgact_binmisc pay the penalty of
constructing sname/fname, even if they will not use it. qemu-user-static
doesn't need it, the stock rc script for qemu-user-static certainly doesn't
use it, and I suspect these are the vast majority of (if not the only)
current users.

MFC after:	1 week
2020-11-07 03:28:32 +00:00
Mateusz Guzik
e25d8b67c3 malloc: tweak the version check in r367432 to include type name
While here fix a whitespace problem.
2020-11-07 01:32:16 +00:00
Mateusz Guzik
bdcc222644 malloc: move malloc_type_internal into malloc_type
According to code comments the original motivation was to allow for
malloc_type_internal changes without ABI breakage. This can be trivially
accomplished by providing spare fields and versioning the struct, as
implemented in the patch below.

The upshots are one less memory indirection on each alloc and disappearance
of mt_zone.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D27104
2020-11-06 21:33:59 +00:00
Konstantin Belousov
f10845877e Suspend all writeable local filesystems on power suspend.
This ensures that no writes are pending in memory, either metadata or
user data, but not including dirty pages not yet converted to fs writes.

Only filesystems declared local are suspended.

Note that this does not guarantee absence of the metadata errors or
leaks if resume is not done: for instance, on UFS unlinked but opened
inodes are leaked and require fsck to gc.

Reviewed by:	markj
Discussed with:	imp
Tested by:	imp (previous version), pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D27054
2020-11-05 20:52:49 +00:00
Mateusz Guzik
16b971ed6d malloc: add a helper returning size allocated for given request
Sample usage: kernel modules can decide whether to stick to malloc or
create their own zone.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D27097
2020-11-05 16:21:21 +00:00
Mateusz Guzik
2dee296a3d Rationalize per-cpu zones.
The 2 provided zones had inconsistent naming between each other
("int" and "64") and other allocator zones (which use bytes).

Follow malloc by naming them "pcpu-" + size in bytes.

This is a step towards replacing ad-hoc per-cpu zones with
general slabs.
2020-11-05 15:08:56 +00:00
Mateusz Guzik
ea33cca971 poll/select: change selfd_zone into a malloc type
On a sample box vmstat -z shows:

ITEM                   SIZE  LIMIT     USED     FREE      REQ
64:                      64,      0, 1043784, 4367538,3698187229
selfd:                   64,      0,    1520,   13726,182729008

But at the same time:
vm.uma.selfd.keg.domain.1.pages: 121
vm.uma.selfd.keg.domain.0.pages: 121

Thus 242 pages got pulled even though the malloc zone would likely accomodate
the load without using extra memory.
2020-11-05 12:24:37 +00:00
Mateusz Guzik
2fbb45c601 vfs: change nt_zone into a malloc type
Elements are small in size and allocated for short periods.
2020-11-05 12:06:50 +00:00
Kyle Evans
df69035d7f imgact_binmisc: fix up some minor nits
- Removed a bunch of redundant headers
- Don't explicitly initialize to 0
- The !error check prior to setting imgp->interpreter_name is redundant, all
  error paths should and do return or go to 'done'. We have larger problems
  otherwise.
2020-11-05 04:19:48 +00:00
Mateusz Guzik
3c50616fc1 fd: make all f_count uses go through refcount_* 2020-11-05 02:12:33 +00:00
Mateusz Guzik
d737e9eaf5 fd: hide _fdrop 0 count check behind INVARIANTS
While here use refcount_load and make sure to report the tested value.
2020-11-05 02:12:08 +00:00
Mateusz Guzik
331c21dd5e pipe: whitespace nit in previous 2020-11-04 23:17:41 +00:00
Mateusz Guzik
c22ba7bb06 pipe: fix POLLHUP handling if no events were specified
Linux allows polling without any events specified and it happens to be the case
in FreeBSD as well. POLLHUP has to be delivered regardless of the event mask
and this works fine if the condition is already present. However, if it is
missing, selrecord is only called if the eventmask has relevant bits set. This
in particular leads to a conditon where pipe_poll can return 0 events and
neglect to selrecord, while kern_poll takes it as an indication it has to go to
sleep, but then there is nobody to wake it up.

While the problem seems systemic to *_poll handlers the least we can do is fix
it up for pipes.

Reported by:	Jeremie Galarneau <jeremie.galarneau at efficios.com>
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D27094
2020-11-04 23:11:54 +00:00
Mateusz Guzik
6fc2b069ca rms: fixup concurrent writer handling and add more features
Previously the code had one wait channel for all pending writers.
This could result in a buggy scenario where after a writer switches
the lock mode form readers to writers goes off CPU, another writer
queues itself and then the last reader wakes up the latter instead
of the former.

Use a separate channel.

While here add features to reliably detect whether curthread has
the lock write-owned. This will be used by ZFS.
2020-11-04 21:18:08 +00:00
Mark Johnston
f7db0c9532 vmspace: Convert to refcount(9)
This is mostly mechanical except for vmspace_exit().  There, use the new
refcount_release_if_last() to avoid switching to vmspace0 unless other
processes are sharing the vmspace.  In that case, upon switching to
vmspace0 we can unconditionally release the reference.

Remove the volatile qualifier from vm_refcnt now that accesses are
protected using refcount(9) KPIs.

Reviewed by:	alc, kib, mmel
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27057
2020-11-04 16:30:56 +00:00
Brooks Davis
19647e76fc sysvshm: pass relevant uap members as arguments
Alter shmget_allocate_segment and shmget_existing to take the values
they want from struct shmget_args rather than passing the struct
around.  In general, uap structures should only be the interface to
sys_<foo> functions.

This makes on small functional change and records the allocated space
rather than the requested space.  If this turns out to be a problem (e.g.
if software tries to find undersized segments by exact size rather than
using keys), we can correct that easily.

Reviewed by:	kib
Obtained from:	CheriBSD
MFC after:	1 week
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D27077
2020-11-03 19:14:03 +00:00
Conrad Meyer
2de07e4096 unix(4): Add SOL_LOCAL:LOCAL_CREDS_PERSISTENT
This option is intended to be semantically identical to Linux's
SOL_SOCKET:SO_PASSCRED.  For now, it is mutually exclusive with the
pre-existing sockopt SOL_LOCAL:LOCAL_CREDS.

Reviewed by:	markj (penultimate version)
Differential Revision:	https://reviews.freebsd.org/D27011
2020-11-03 01:17:45 +00:00
Mateusz Guzik
e1b6a7f83f malloc: prefix zones with malloc-
Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D27038
2020-11-02 17:39:15 +00:00
Mateusz Guzik
828afdda17 malloc: export kernel zones instead of relying on them being power-of-2
Reviewed by:	markj (previous version)
Differential Revision:	https://reviews.freebsd.org/D27026
2020-11-02 17:38:08 +00:00
Stefan Eßer
1ebef47735 Make sysctl user.local a tunable that can be written at run-time
This sysctl value had been provided as a read-only variable that is
compiled into the C library based on the value of _PATH_LOCALBASE in
paths.h.

After this change, the value is compiled into the kernel as an empty
string, which is translated to _PATH_LOCALBASE by the C library.

This empty string can be overridden at boot time or by a privileged
user at run time and will then be returned by sysctl.

When set to an empty string, the value returned by sysctl reverts to
_PATH_LOCALBASE.

This update does not change the behavior on any system that does
not modify the default value of user.localbase.

I consider this change as experimental and would prefer if the run-time
write permission was reconsidered and the sysctl variable defined with
CLFLAG_RDTUN instead to restrict it to be set at boot time.

MFC after:	1 month
2020-10-31 23:48:41 +00:00
Mateusz Guzik
82c174a3b4 malloc: delegate M_EXEC handling to dedicacted routines
It is almost never needed and adds an avoidable branch.

While here do minior clean ups in preparation for larger changes.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D27019
2020-10-30 20:02:32 +00:00
Stefan Eßer
147eea393f Add read only sysctl variable user.localbase
The value is provided by the C library as for other sysctl variables in
the user tree. It is compiled in and returns the value of _PATH_LOCALBASE
defined in paths.h.

Reviewed by:	imp, scottl
Differential Revision:	https://reviews.freebsd.org/D27009
2020-10-30 18:48:09 +00:00
Mateusz Guzik
0685574968 vfs: change vnode poll to just a malloc type
The size is 120, close fit for 128 and rarely used. The infrequent use
avoidably populates per-CPU caches and ends up with more memory.
2020-10-30 14:02:56 +00:00
Mateusz Guzik
4bfebc8d2c cache: add cache_vop_mkdir and rename cache_rename to cache_vop_rename 2020-10-30 10:46:35 +00:00
John Baldwin
36e0a362ac Add m_snd_tag_alloc() as a wrapper around if_snd_tag_alloc().
This gives a more uniform API for send tag life cycle management.

Reviewed by:	gallatin, hselasky
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D27000
2020-10-29 23:28:39 +00:00
Mateusz Guzik
62568e886a vfs: add NAMEI_DBG_HADSTARTDIR handling lost in rewrite
Noted by:	rpokala
2020-10-29 18:43:37 +00:00
Mateusz Guzik
eebc2e450f vfs: add NDREINIT to facilitate repeated namei calls
struct nameidata mixes caller arguments, internal state and output, which
can be quite error prone.

Recent addition of valdiating ni_resflags uncovered a caller which could
repeatedly call namei, effectively operating on partially populated state.

Add bare minimium validation this does not happen. The real fix would
decouple aforementioned state.

Reported by:	pho
Tested by:	pho (different variant)
2020-10-29 12:56:02 +00:00
John Baldwin
521eac97f3 Support hardware rate limiting (pacing) with TLS offload.
- Add a new send tag type for a send tag that supports both rate
  limiting (packet pacing) and TLS offload (mostly similar to D22669
  but adds a separate structure when allocating the new tag type).

- When allocating a send tag for TLS offload, check to see if the
  connection already has a pacing rate.  If so, allocate a tag that
  supports both rate limiting and TLS offload rather than a plain TLS
  offload tag.

- When setting an initial rate on an existing ifnet KTLS connection,
  set the rate in the TCP control block inp and then reset the TLS
  send tag (via ktls_output_eagain) to reallocate a TLS + ratelimit
  send tag.  This allocates the TLS send tag asynchronously from a
  task queue, so the TLS rate limit tag alloc is always sleepable.

- When modifying a rate on a connection using KTLS, look for a TLS
  send tag.  If the send tag is only a plain TLS send tag, assume we
  failed to allocate a TLS ratelimit tag (either during the
  TCP_TXTLS_ENABLE socket option, or during the send tag reset
  triggered by ktls_output_eagain) and ignore the new rate.  If the
  send tag is a ratelimit TLS send tag, change the rate on the TLS tag
  and leave the inp tag alone.

- Lock the inp lock when setting sb_tls_info for a socket send buffer
  so that the routines in tcp_ratelimit can safely dereference the
  pointer without needing to grab the socket buffer lock.

- Add an IFCAP_TXTLS_RTLMT capability flag and associated
  administrative controls in ifconfig(8).  TLS rate limit tags are
  only allocated if this capability is enabled.  Note that TLS offload
  (whether unlimited or rate limited) always requires IFCAP_TXTLS[46].

Reviewed by:	gallatin, hselasky
Relnotes:	yes
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D26691
2020-10-29 00:23:16 +00:00
Konstantin Belousov
3cbf9dc81c Check for process group change in tty_wait_background().
The calling process's process group can change between PROC_UNLOCK(p)
and PGRP_LOCK(pg) in tty_wait_background(), e.g. by a setpgid() call
from another process. If that happens, the signal is not sent to the
calling process, even if the prior checks determine that one should be
sent.  Re-check that the process group hasn't changed after acquiring
the pgrp lock, and if it has, redo the checks.

PR:	250701
Submitted by:	Jakub Piecuch <j.piecuch96@gmail.com>
MFC after:	2 weeks
2020-10-28 22:12:47 +00:00
Edward Tomasz Napierala
bdc0cb4e2c Add local variable to store the sysent pointer. Just a cleanup,
no functional changes.

Reviewed by:	kib (earlier version)
MFC after:	2 weeks
Sponsored by:	EPSRC
Differential Revision:	https://reviews.freebsd.org/D26977
2020-10-28 14:43:38 +00:00
Edward Tomasz Napierala
bce7ee9d41 Drop "All rights reserved" from all my stuff. This includes
Foundation copyrights, approved by emaste@.  It does not include
files which carry other people's copyrights; if you're one
of those people, feel free to make similar change.

Reviewed by:	emaste, imp, gbe (manpages)
Differential Revision:	https://reviews.freebsd.org/D26980
2020-10-28 13:46:11 +00:00
Mateusz Guzik
11743b6e47 vfs: tidy up vnlru_free
Apart from cosmeatic changes make sure to only decrease the recycled counter
if vtryrecycle succeeded.

Tested by:	pho
2020-10-27 18:13:09 +00:00
Mateusz Guzik
68ac2b804c vfs: fix vnode reclaim races against getnwevnode
All vnodes allocated by UMA are present on the global list used by
vnlru. getnewvnode modifies the state of the vnode (most notably
altering v_holdcnt) but never locks it. Moreover filesystems also
modify it in arbitrary manners sometimes before taking the vnode
lock or adding any other indicator that the vnode can be used.

Picking up such a vnode by vnlru would be problematic.

To that end there are 2 fixes:
- vlrureclaim, not recycling v_holdcnt == 0 vnodes, takes the
interlock and verifies that v_mount has been set. It is an
invariant that the vnode lock is held by that point, providing
the necessary serialisation against locking after vhold.
- vnlru_free_locked, only wanting to free v_holdcnt == 0 vnodes,
now makes sure to only transition the count 0->1 and newly allocated
vnodes start with v_holdcnt == VHOLD_NO_SMR. getnewvnode will only
transition VHOLD_NO_SMR->1 once more making the hold fail

Tested by:	pho
2020-10-27 18:12:07 +00:00
Mateusz Guzik
d681c51d36 cache: add missing NIRES_ABS handling 2020-10-26 18:01:18 +00:00
Alexander Motin
3c0177b887 Enable bioq 'car limit' added at r335066 at 128 bios.
Without the 'car limit' enabled (before this), running sequential ZFS scrub
on HDD without command queuing support, I've measured latency on concurrent
random reads reaching 4 seconds (surprised that not more).  Enabling this
reduced the latency to 65 milliseconds, while scrub still doing ~180MB/s.

For disks with command queuing this does not make much difference (if any),
since most time all the requests are queued down to the disk or HBA, leaving
nothing in the queue to sort.  And even if something does not fit, staying on
the queue, it is likely not for long.  To not limit sorting in such bursty
scenarios I've added batched counter zeroing when the queue is getting empty.

The internal scheduler of the SAS HDD I was testing seems to be even more
loyal to random I/O, reducing the scrub speed to ~120MB/s.  So in case
somebody worried this is limit is too strict -- it actually looks relaxed.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2020-10-26 04:04:06 +00:00
Alexander Motin
8b220f8915 Fix asymmetry in devstat(9) calls by GEOM.
Before this GEOM passed bio pointer to transaction start, but not end.
It was irrelevant until devstat(9) got DTrace hooks, that appeared to
provide bio pointer on I/O completion, but not on submission.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2020-10-24 21:07:10 +00:00
Ruslan Bukin
f32f0095e9 o Add iommu de-initialization method for MSI interface.
o Add iommu_unmap_msi() to release the msi GAS entry.
o Provide default implementations for iommu init/deinit methods.

Reviewed by:	kib
Sponsored by:	Innovate DSbD
Differential Revision:	https://reviews.freebsd.org/D26906
2020-10-24 20:09:27 +00:00
Ryan Moeller
e58483c4fb sysctl+kern_sysctl: Honor SKIP for descendant nodes
Ensure we also skip descendants of SKIP nodes when iterating through children
of an explicitly specified node.

Reported by:	np
Reviewed by:	np
MFC after:	1 week
Sponsored by:	iXsystems, Inc.
Differential Revision:	https://reviews.freebsd.org/D26833
2020-10-24 16:17:07 +00:00
Ryan Moeller
0595c12484 kern_sysctl: Misc code cleanup
Remove unused oidpp parameter from sysctl_sysctl_next_ls and
add high level comments to describe how it works.

No functional change.

Reviewed by:	imp
MFC after:	1 week
Sponsored by:	iXsystems, Inc.
Differential Revision:	https://reviews.freebsd.org/D26854
2020-10-24 14:46:38 +00:00
Kyle Evans
275c821d3d audit: correct reporting of *execve(2) success
r326145 corrected do_execve() to return EJUSTRETURN upon success so that
important registers are not clobbered. This had the side effect of tapping
out 'failures' for all *execve(2) audit records, which is less than useful
for auditing purposes.

Audit exec returns earlier, where we can know for sure that EJUSTRETURN
translates to success. Note that this unsets TDP_AUDITREC as we commit the
audit record, so the usual audit in the syscall return path will do nothing.

PR:		249179
Reported by:	Eirik Oeverby <ltning-freebsd anduin net>
Reviewed by:	csjp, kib
MFC after:	1 week
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D26922
2020-10-24 14:39:17 +00:00
Mateusz Guzik
eb65cde4f5 cache: assorted typo fixes 2020-10-24 13:31:40 +00:00
Mateusz Guzik
029cfccc71 cache: add the missing NC_NOMAKEENTRY and NC_KEEPPOSENTRY to lockless lookup
They are de facto ignored.
2020-10-24 13:31:25 +00:00
Mateusz Guzik
7cc1718613 vfs: fix a race where reclaim vholds freed vnodes
Reported by:	pho
Tested by:	pho (previous version)
Fixes:	r366974 ("vfs: stop taking the interlock in vnode reclaim")
2020-10-24 13:30:37 +00:00
Mateusz Guzik
acb41008f3 cache: batch updates to numcache in case of mass removal 2020-10-24 01:14:52 +00:00
Mateusz Guzik
208cb7c4b6 cache: refactor alloc/free
This in particular centralizes manipulation of numcache.
2020-10-24 01:14:17 +00:00
Mateusz Guzik
1d44405690 cache: fold branch prediction into cache_ncp_canuse 2020-10-24 01:13:47 +00:00
Mateusz Guzik
c13d7d1f98 cache: fix some typos 2020-10-24 01:13:16 +00:00
Mateusz Guzik
f878526f20 cache: drop write-only vars 2020-10-24 01:13:02 +00:00
Ruslan Bukin
9729b14985 Move the iommu stubs to a generic place, so they are available on all the
platforms.

This allows to not depend on the IOMMU macro in AHCI driver.

Requested by:	kib
Suggested by:	andrew
Reviewed by:	kib
Sponsored by:	Innovate DSbD
Differential Revision:	https://reviews.freebsd.org/D26887
2020-10-23 21:27:48 +00:00
Mateusz Guzik
3862838921 cache: reduce memory waste in struct namecache
The previous scheme for calculating the total size was doing sizeof
on the struct and then adding the wanted space for the buffer.

nc_name is at offset 58 while sizeof(struct namecache) is 64.
With CACHE_PATH_CUTOFF of 39 bytes and 1 byte of padding we were
allocating 104 bytes for the entry and never accounting for the 6
byte padding, wasting that space.
2020-10-23 15:56:22 +00:00
Mateusz Guzik
703f3fafa5 vfs: stop taking the interlock in vnode reclaim
It no longer protects any of tested fields, keeping all the checks racy.

While here make vtryrecycle drop the vnode on its own. Avoids an additional
lock trip.
2020-10-23 15:49:18 +00:00
Mateusz Guzik
c7520caa4f vfs: prevent avoidable evictions on mkdir of existing directories
mkdir -p /foo/bar/baz will mkdir each path component and ignore EEXIST.

The NOCACHE lookup will make the namecache unnecessarily evict the existing entry,
and then fallback to the fs lookup routine eventually leading namei to return an
error as the directory is already there.

For invocations like mkdir -p /usr/obj/usr/src/sys/GENERIC/modules this triggers
fallbacks to the slowpath for concurrently executing lookups.

Tested by:	pho
Discussed with:	kib
2020-10-22 19:28:12 +00:00
Mateusz Guzik
54f09403a3 cache: assert the created entry does not point to itself 2020-10-22 19:22:34 +00:00
Konstantin Belousov
18b8496c23 sysv_sem: semusz depends on semume.
Size of the per-process semaphore undo structure (semusz) depends on
the number of the per-process undos.  If kern.ipc.semume is adjusted,
semusz must be adjusted as well, and it makes no sense to delegate
adjustment to user.  Make it automatic.

Reported and tested by:	Olef <o.vandestadt@gmail.com>
PR:	250361
Reviewed by:	jhb, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D26826
2020-10-22 09:28:11 +00:00
Hans Petter Selasky
2ae634c6db Implement mbuf hashing routines for IP over infiniband, IPoIB.
No functional change intended.

Differential Revision:	https://reviews.freebsd.org/D26254
Reviewed by:		melifaro@
MFC after:		1 week
Sponsored by:		Mellanox Technologies // NVIDIA Networking
2020-10-22 09:17:56 +00:00
Brooks Davis
44ca4575ea vmapbuf: don't smuggle address or length in buf
Instead, add arguments to vmapbuf.  Since this argument is
always a pointer use a type of void * and cast to vm_offset_t in
vmapbuf.  (In CheriBSD we've altered vm_fault_quick_hold_pages to
take a pointer and check its bounds.)

In no other situtation does b_data contain a user pointer and vmapbuf
replaces b_data with the actual mapping.

Suggested by:	jhb
Reviewed by:	imp, jhb
Obtained from:	CheriBSD
MFC after:	1 week
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D26784
2020-10-21 16:00:15 +00:00
Mateusz Guzik
2f1c35053c cache: drop the spurious slash_prefixed argument 2020-10-21 05:57:25 +00:00
Mateusz Guzik
ab21ed17ed vfs: drop the de facto curthread argument from VOP_INACTIVE 2020-10-20 07:19:03 +00:00
Mateusz Guzik
8ecd87a3e7 vfs: drop spurious cred argument from VOP_VPTOCNP 2020-10-20 07:18:27 +00:00
Konstantin Belousov
c0baa3dc4a vgonel(): avoid recursing into VOP_INACTIVE().
It is a common pattern for filesystems' VOP_INACTIVE() implementation
to forcibly reclaim the vnode when its state is final.  For instance,
UFS vnode with zero link count is removed, and since it is
inactivated, the last open reference on it is dropped.

On the other hand, vnode might get spurious usecount reference for
many reasons.  If the spurious reference exists while vgonel() checks
for active state of the vnode, it would recurse into VOP_INACTIVE().

Fix it by checking and not doing inactivation when vgone() was called
from inactive VOP.

Reported and tested by:	pho
Discussed with:	mjg
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2020-10-19 19:20:23 +00:00
Mateusz Guzik
6d5d469fc1 cache: promote negative entries based on more than one hit
During tinderbox and similar workloads negative entries get at least one
hit before they get evicted. In the current scheme this avoidably promotes
them.

Be conservative and stick to 2 hits for now.
2020-10-19 18:51:51 +00:00
John Baldwin
6bcf3c46d8 Check TF_TOE not the tod pointer to determine if TOE is active.
The TF_TOE flag is the check used in the rest of the network stack to
determine if TOE is active on a socket.  There is at least one path in
the cxgbe(4) TOE driver that can leave the tod pointer non-NULL on a
socket not using TOE.

Reported by:	Sony Arpita Das <sonyarpitad@chelsio.com>
Reviewed by:	np
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D26803
2020-10-19 18:24:06 +00:00
Mark Johnston
d80126a6f4 link_elf_obj: Colour VM objects
This will cause the VM to back sufficiently large .text sections, such
as those in zfs.ko or amdgpu.ko on amd64, with superpage mappings when
possible.

Reviewed by:	alc, kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26802
2020-10-19 16:57:59 +00:00
Mark Johnston
6351771b7c vmem: Allocate btags before looping in vmem_xalloc()
BT_MAXALLOC (4) is the number of boundary tags required to complete an
allocation in the worst case: two to clip a free segment, and two to
import from a parent arena.  vmem_xalloc() preallocates four boundary
tags before attempting a search to simplify the segment allocation code.
It implements a loop that:
1) ensures that BT_MAXALLOC boundary tags are available,
2) attempts to find and clip a free segment satisfying the allocation
   constraints, and failing that,
3) attempts to import a segment.

On !UMA_MD_SMALL_ALLOC platforms the btag zone has to handle recusion:
it needs boundary tags to allocate boundary tags.  Thus we reserve
2 * BT_MAXALLOC * mp_ncpus tags for use when recursing: the factor of 2
is because there are two layers of vmem arenas, the per-domain arena and
global arena.  For a single thread, 2 * BT_MAXALLOC tags should be
sufficient.

Because of the way the loop is structured, BT_MAXALLOC tags are not
sufficient.  The first bt_fill() call may allocate BT_MAXALLOC tags,
then import a segment (consuming two tags), then attempt to top up the
preallocation before carving into the imported free segment, thus
requiring up to six tags in the worst case.  Because we don't
preallocate that many, this bug can cause deadlocks in rare scenarios.

Fix the problem by moving the preallocation out the loop.  This assumes
that only a single import is ever required to satisfy an allocation
request.

Thanks to manu, emaste and lwhsu for helping test debug patches.

Reported by:	Jenkins (hardware CI lab)
Reviewed by:	alc, kib, rlibby
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26770
2020-10-19 16:54:06 +00:00
Mark Johnston
33a9bce62f vmem: Simplify bt_fill() callers a bit
No functional change intended.

Reviewed by:	alc, kib, rlibby
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D26769
2020-10-19 16:52:27 +00:00
Ruslan Bukin
e707c8be4e Manage MSI iommu pages.
This allows the interrupt controller driver only need a small change to
create a map for the page the device will write to raise an interrupt.

Submitted by:	andrew
Reviewed by:	kib
Sponsored by:	Innovate DSbD
Differential Revision:	https://reviews.freebsd.org/D26705
2020-10-19 13:10:21 +00:00
Mateusz Guzik
665c8c3e7d cache: refactor negative promotion/demotion handling
This will simplify policy changes.
2020-10-19 09:52:52 +00:00
Bjoern A. Zeeb
f7a0bb0dec ddb: add show sysinit command
Add a show sysinit command to ddb (similar to show vnet_sysinit) which
proved to be helpful to debug some ordering issues on early-mid kernel
start panics.
2020-10-17 22:47:08 +00:00
Mateusz Guzik
4c4aa84848 cache: shorten names of debug stats 2020-10-17 21:30:46 +00:00
Mateusz Guzik
676557143f cache: don't automatically evict negative entries if usage is low
The previous scheme only looked at negative entry count in relation to the
total count, leading to tons of spurious evictions if the cache is not
significantly populated.

Instead, only try the above if negative entry count goes beyond namecache
capacity.
2020-10-17 21:22:40 +00:00
Mateusz Guzik
e98c3bc667 cache: erwork sysctl vfs.cache tree
Split everything into neg, debug, param and stat categories.

The legacy nchstats sysctl (queried e.g., by systat) remains untouched.

While here rename some vars to be easier on the eye.
2020-10-17 13:06:29 +00:00
Mateusz Guzik
fa7c73d30c cache: factor negative lookup out of cache_fplookup_next 2020-10-17 13:04:46 +00:00
Mateusz Guzik
41e6b18422 cache: avoid smr in cache_neg_evict in favoro of the already held bucket lock 2020-10-17 13:04:25 +00:00
Mateusz Guzik
c38d8e1eb2 cache: rework parts of negative entry management
- declutter sysctl vfs.cache by moving relevant entries into
vfs.cache.neg
- add a little more parallelism to eviction by replacing the
global lock with an atomically modified counter
- track more statistics

The code needs further effort.
2020-10-17 08:48:58 +00:00
Mateusz Guzik
b31b5e9cfd cache: remove entries before trying to add new ones, not after
Should allow positive entries to replace negative ones in case
the cache is full.
2020-10-17 08:48:32 +00:00
Mateusz Guzik
ad89066af4 vfs: annotate mountlist_mtx with __exclusive_cache_line 2020-10-17 08:47:08 +00:00
Mateusz Guzik
d6eee35004 cache: add a probe reporting addition of duplicate entries 2020-10-17 00:27:26 +00:00
Mateusz Guzik
a59b0ac3aa cache: flip inverted condition in previous
It happened to not affect correctness in that the fallback code would
simply neglect to promote the entry.
2020-10-16 02:19:33 +00:00
Mateusz Guzik
e7602e04c7 cache: support negative entry promotion in slowpath smr 2020-10-16 00:56:13 +00:00
Mateusz Guzik
571bc3d1af cache: elide vhold/vdrop around promoting negative entry 2020-10-16 00:55:57 +00:00
Mateusz Guzik
640e6162ee cache: dedup code for negative promotion 2020-10-16 00:55:31 +00:00
Mateusz Guzik
c97c8746c0 cache: neglist -> nl; negstate -> ns
No functional changes.
2020-10-16 00:55:09 +00:00
Mateusz Guzik
43777a207d cache: split hotlist between existing negative lists
This simplifies the code while allowing for concurrent negative eviction
down the road.

Cache misses increased slightly due to higher rate of evictions allowed by
the change.

The current algorithm remains too aggressive.
2020-10-15 17:44:17 +00:00
Mateusz Guzik
430dc4518d cache: make neglist an array given the static size 2020-10-15 17:42:22 +00:00
Brooks Davis
16e4a0c89c physio: Don't store user addresses in bio_data
Only assign the address from the iovec to bio_data if it is a kernel
address.  This was the single place where bio_data stored (however
briefly) a userspace pointer.

Reviewed by:	imp, markj
Obtained from:	CheriBSD
MFC after:	1 week
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D26783
2020-10-15 17:05:21 +00:00
Mateusz Guzik
214eccf4b6 vfs: add VOP_EAGAIN
Can be used to stub fplookup for example.
2020-10-15 04:48:14 +00:00
Konstantin Belousov
6f3b523c9a Avoid dump_avail[] redefinition.
Move dump_avail[] extern declaration and inlines into a new header
vm/vm_dumpset.h.  This fixes default gcc build for mips.

Reviewed by:	alc, scottph
Tested by:	kevans (previous version)
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D26741
2020-10-14 22:51:40 +00:00
John Baldwin
c2a8fd6f05 Permit sending empty fragments for TLS 1.0.
Due to a weakness in the TLS 1.0 protocol, OpenSSL will periodically
send empty TLS records ("empty fragments").  These TLS records have no
payload (and thus a page count of zero).  m_uiotombuf_nomap() was
returning NULL instead of an empty mbuf, and a few places needed to be
updated to treat an empty TLS record as having a page count of "1" as
0 means "no work to do" (e.g. nothing to encrypt, or nothing to mark
ready via sbready()).

Reviewed by:	gallatin
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D26729
2020-10-13 17:30:34 +00:00
Warner Losh
e59db46854 newbus: use ssize_t to match sb's len and size, fix ordering of space check
Both s_len and s_size are ssize_t, so their differece is also more
properly a ssize_t not a size_t. Also, assert that len is <= size when
we enter. This should always be the case. Ensure that we have that one
byte that we write to the end of the buffer before we do so, though
the error should already be set on the buffer if not, and the only
times we supply 'partial' buffers they should be plenty large.

Reviewed by: cem, jhb (prior version, I did cem's suggestion)
Differential Revsion: https://reviews.freebsd.org/D26752
2020-10-12 22:07:44 +00:00
Conrad Meyer
f8e8a06d23 random(4) FenestrasX: Push root seed version to arc4random(3)
Push the root seed version to userspace through the VDSO page, if
the RANDOM_FENESTRASX algorithm is enabled.  Otherwise, there is no
functional change.  The mechanism can be disabled with
debug.fxrng_vdso_enable=0.

arc4random(3) obtains a pointer to the root seed version published by
the kernel in the shared page at allocation time.  Like arc4random(9),
it maintains its own per-process copy of the seed version corresponding
to the root seed version at the time it last rekeyed.  On read requests,
the process seed version is compared with the version published in the
shared page; if they do not match, arc4random(3) reseeds from the
kernel before providing generated output.

This change does not implement the FenestrasX concept of PCPU userspace
generators seeded from a per-process base generator.  That change is
left for future discussion/work.

Reviewed by:	kib (previous version)
Approved by:	csprng (me -- only touching FXRNG here)
Differential Revision:	https://reviews.freebsd.org/D22839
2020-10-10 21:52:00 +00:00
Mateusz Guzik
dd28b379cb vfs: support lockless dirfd lookups 2020-10-10 03:48:17 +00:00
Bryan Drewery
c2c6fb90e0 Use unlocked page lookup for inmem() to avoid object lock contention
Reviewed By:	kib, markj
Submitted by:	mlaier
Sponsored by:	Dell EMC
Differential Revision:	https://reviews.freebsd.org/D26653
2020-10-09 23:49:42 +00:00
Mateusz Guzik
deb1339f3f vfs: fix a panic when truncating comming from copy_file_range
Truncating requires an exclusive lock, but it was not taken if the
filesystem indicates support for shared writes. This only concerns
ZFS.

In particular fixes cp of files which have trailing holes.

Reported by:	bdrewery
2020-10-09 20:31:42 +00:00
John Baldwin
7e8bd70cff Don't invoke semunload() if seminit() fails during MOD_LOAD.
The module handler code invokes a MOD_UNLOAD event immediately if
MOD_LOAD fails.  The result was that if seminit() failed, semunload()
was invoked twice.  semunload() is not idempotent however and would
try to remove it's process_exit eventhandler twice resulting in a
panic.

Reviewed by:	kib, markj
Obtained from:	CheriBSD
MFC after:	1 month
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D26696
2020-10-09 20:20:42 +00:00
Mateusz Guzik
eb88fed446 cache: fix vexec panic when racing against vgone
Use of dead_vnodeops would result in a panic instead of returning the intended
EOPNOTSUPP error.

While here make sure to abort, not just try to return a partial result.
The former allows the regular lookup to restart from scratch, while the latter
makes it stuck with an unusable vnode.

Reported by:	kevans
2020-10-09 19:10:00 +00:00
Rick Macklem
19fe23fa2b Make vn_generic_copy_file_range() interruptible via a signal.
Without this patch, when vn_generic_copy_file_range() is
doing a large copy, it will remain in the function for a
considerable amount of time, delaying handling of any
outstanding signals until the copy completes.

This patch adds checks for signals that need to be
processed after each successful data copy cycle.
When sig_intr() returns non-zero, vn_generic_copy_file_range()
will return.
The check "if (len < savlen)" ensures that some data
has been copied, so that progress will be made.

Note that, since copy_file_range(2) is allowed to
return fewer bytes copied than requested, it
will never return EINTR/ERESTART when sig_intr()
returns non-zero.

Reviewed by:	kib, asomers
Differential Revision:	https://reviews.freebsd.org/D26620
2020-10-09 01:04:28 +00:00
Konstantin Belousov
203dda8a63 sig_intr(9): return early if AST is not scheduled.
Check td_flags for relevant AST requests lock-less.  This opens the
race slightly wider where sig_intr() returns false negative, but might
be it is worth it.

Requested by:	mjg
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2020-10-08 22:34:34 +00:00
Konstantin Belousov
4ea4966009 Do not allow to use O_BENEATH as an oracle.
Specifically, if lookup() returned any error and the topping directory
was not latched, which means that (non-existent) path did not returned
to the topping location, give ENOTCAPABLE a priority over the lookup()
error.

PR:	249960
Reviewed by:	emaste, ngie
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D26695
2020-10-08 22:31:11 +00:00
Mitchell Horne
841dad02e9 Fix a loop condition
The correct way to identify the end of the metadata is two adjacent
entries set to zero/MODINFO_END. I made a typo and this was checking the
first entry twice.

Reported by:	rpokala
Sponsored by:	NetApp, Inc.
Sponsored by:	Klara, Inc.
2020-10-08 18:29:17 +00:00
Mitchell Horne
22e6a67086 Add a routine to dump boot metadata
The boot metadata (also referred to as modinfo, or preload metadata)
provides information about the size and location of the kernel,
pre-loaded modules, and other metadata (e.g. the EFI framebuffer) to be
consumed during by the kernel during early boot. It is encoded as a
series of type-length-value entries and is usually constructed by
loader(8) and passed to the kernel. It is also faked on some
architectures when booted by other means.

Although much of the module information is available via kldstat(8),
there is no easy way to debug the metadata in its entirety. Add some
routines to parse this data and allow it to be printed to the console
during early boot or output via a sysctl.

Since the output can be lengthly, printing to the console is gated
behind the debug.dump_modinfo_at_boot kenv variable as well as the
BOOTVERBOSE flag. The sysctl to print the metadata is named
debug.dump_modinfo.

Reviewed by:	tsoome
Sponsored by:	NetApp, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D26687
2020-10-08 18:02:05 +00:00
Hans Petter Selasky
eccb214897 The ethernet header structure is read-only. Add const keyword.
(This is a diff reduction towards D26254)

MFC after:		1 week
Sponsored by:		Mellanox Technologies // NVIDIA Networking
2020-10-08 11:25:19 +00:00