Commit Graph

17921 Commits

Author SHA1 Message Date
Kirk McKusick
e75f0f2b48 Only attempt a VOP_UNLOCK() when the vn_lock() has been successful.
No MFC as this code is not present in 12-stable.

Reported by:  Peter Holm
Reviewed by:  Mateusz Guzik
Tested by:    Peter Holm
Sponsored by: Netflix
2020-11-20 20:22:01 +00:00
Michal Meloun
d9de80d614 Also pass interrupt binding request to non-root interrupt controllers.
There are message based controllers that can bind interrupts even if they are
not implemented as root controllers (such as the ITS subblock of GIC).

MFC after:	3 weeks
2020-11-20 09:05:36 +00:00
Mateusz Guzik
f9fe7b28bc pipe: thundering herd problem in pipelock
All reads and writes are serialized with a hand-rolled lock, but unlocking it
always wakes up all waiters. Existing flag fields get resized to make room for
introduction of waiter counter without growing the struct.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D27273
2020-11-19 19:25:47 +00:00
Mark Johnston
a33fef5e25 callout(9): Fix a race between CPU migration and callout_drain()
Suppose a running callout re-arms itself, and before the callout
finishes running another CPU calls callout_drain() and goes to sleep.
softclock_call_cc() will wake up the draining thread, which may not run
immediately if there is a lot of CPU load.  Furthermore, the callout is
still in the callout wheel so it can continue to run and re-arm itself.
Then, suppose that the callout migrates to another CPU before the
draining thread gets a chance to run.  The draining thread is in this
loop in _callout_stop_safe():

	while (cc_exec_curr(cc) == c) {
		CC_UNLOCK(cc);
		sleep();
		CC_LOCK(cc);
	}

but after the migration, cc points to the wrong CPU's callout state.
Then the draining thread goes off and removes the callout from the
wheel, but does so using the wrong lock and per-CPU callout state.

Fix the problem by doing a re-lookup of the callout CPU after sleeping.

Reported by:	syzbot+79569cd4d76636b2cc1c@syzkaller.appspotmail.com
Reported by:	syzbot+1b27e0237aa22d8adffa@syzkaller.appspotmail.com
Reported by:	syzbot+e21aa5b85a9aff90ef3e@syzkaller.appspotmail.com
Reviewed by:	emaste, hselasky
Tested by:	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27266
2020-11-19 18:37:28 +00:00
Mitchell Horne
c8a96cdcd9 Add an option for entering KDB on recursive panics
There are many cases where one would choose avoid entering the debugger
on a normal panic, opting instead to reboot and possibly save a kernel
dump. However, recursive kernel panics are an unusual case that might
warrant attention from a human, so provide a secondary tunable,
debug.debugger_on_recursive_panic, to allow entering the debugger only
when this occurs.

For for simplicity in maintaining existing behaviour, the tunable
defaults to zero.

Reviewed by:	cem, markj
Sponsored by:	NetApp, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D27271
2020-11-19 18:03:40 +00:00
Mateusz Guzik
d116b9f1ad thread: numa-aware zombie reaping
The current global list is a significant problem, in particular induces a lot
of cross-domain thread frees. When running poudriere on a 2 domain box about
half of all frees were of that nature.

Patch below introduces per-domain thread data containing zombie lists and
domain-aware reaping. By default it only reaps from the current domain, only
reaping from others if there is free TID shortage.

A dedicated callout is introduced to reap lingering threads if there happens
to be no activity.

Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D27185
2020-11-19 10:00:48 +00:00
Mateusz Guzik
b8cb628534 pipe: tidy up pipelock 2020-11-19 08:16:45 +00:00
Mateusz Guzik
89744405e6 pipe: allow for lockless pipe_stat
pipes get stated all thet time and this avoidably contributed to contention.
The pipe lock is only held to accomodate MAC and to check the type.

Since normally there is no probe for pipe stat depessimize this by having the
flag.

The pipe_state field gets modified with locks held all the time and it's not
feasible to convert them to use atomic store. Move the type flag away to a
separate variable as a simple cleanup and to provide stable field to read.
Use short for both fields to avoid growing the struct.

While here short-circuit MAC for pipe_poll as well.
2020-11-19 06:30:25 +00:00
Mateusz Guzik
2f5b0b48ac cred: fix minor nits in r367695
Noted by:	jhb
2020-11-19 04:28:39 +00:00
Mateusz Guzik
c48f897bbe smp: fix smp_rendezvous_cpus_retry usage before smp starts
Since none of the other CPUs are running there is nobody to clear their
entries and the routine spins indefinitely.
2020-11-19 04:27:51 +00:00
Mark Johnston
a28c28e6ef Remove NO_EVENTTIMERS support
The arm configs that required it have been removed from the tree.
Removing this option makes the callout code easier to read and
discourages developers from adding new configs without eventtimer
drivers.

Reviewed by:	ian, imp, mav
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27270
2020-11-19 02:50:48 +00:00
Mariusz Zaborski
f488d5b797 Add CTLFLAG_MPSAFE to the suser_enabled sysctl.
Pointed out by:	mjg
2020-11-18 21:26:14 +00:00
Mariusz Zaborski
05e1e482c7 jail: introduce per jail suser_enabled setting
The suser_enable sysctl allows to remove a privileged rights from uid 0.
This change introduce per jail setting which allow to make root a
normal user.

Reviewed by:	jamie
Previous version reviewed by:	kevans, emaste, markj, me_igalic.co
Discussed with:	pjd
Differential Revision:	https://reviews.freebsd.org/D27128
2020-11-18 21:07:08 +00:00
Mariusz Zaborski
21fe9441e1 Fix style nits. 2020-11-18 20:59:58 +00:00
John Baldwin
5335f6434b Fix a few nits in vn_printf().
- Mask out recently added VV_* bits to avoid printing them twice.

- Keep VI_LOCKed on the same line as the rest of the flags.

Reviewed by:	kib
Obtained from:	CheriBSD
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D27261
2020-11-18 16:21:37 +00:00
Kyle Evans
27a9392d54 _umtx_op: fix robust lists after r367744
A copy-pasto left us copying in 24-bytes at the address of the rb pointer
instead of the intended target.

Reported by:	sigsys@gmail.com
Sighing:	kevans
2020-11-18 03:30:31 +00:00
Conrad Meyer
f8f74aaa84 linux(4) clone(2): Correctly handle CLONE_FS and CLONE_FILES
The two flags are distinct and it is impossible to correctly handle clone(2)
without the assistance of fork1().  This change depends on the pwddesc split
introduced in r367777.

I've added a fork_req flag, FR2_SHARE_PATHS, which indicates that p_pd
should be treated the opposite way p_fd is (based on RFFDG flag).  This is a
little ugly, but the benefit is that existing RFFDG API is preserved.
Holding FR2_SHARE_PATHS disabled, RFFDG indicates both p_fd and p_pd are
copied, while !RFFDG indicates both should be cloned.

In Chrome, clone(2) is used with CLONE_FS, without CLONE_FILES, and expects
independent fd tables.

The previous conflation of CLONE_FS and CLONE_FILES was introduced in
r163371 (2006).

Discussed with:	markj, trasz (earlier version)
Differential Revision:	https://reviews.freebsd.org/D27016
2020-11-17 21:20:11 +00:00
Conrad Meyer
85078b8573 Split out cwd/root/jail, cmask state from filedesc table
No functional change intended.

Tracking these structures separately for each proc enables future work to
correctly emulate clone(2) in linux(4).

__FreeBSD_version is bumped (to 1300130) for consumption by, e.g., lsof.

Reviewed by:	kib
Discussed with:	markj, mjg
Differential Revision:	https://reviews.freebsd.org/D27037
2020-11-17 21:14:13 +00:00
Conrad Meyer
ede4af47ae unix(4): Enhance LOCAL_CREDS_PERSISTENT ABI
As this ABI is still fresh (r367287), let's correct some mistakes now:

- Version the structure to allow for future changes
- Include sender's pid in control message structure
- Use a distinct control message type from the cmsgcred / sockcred mess

Discussed with:	kib, markj, trasz
Differential Revision:	https://reviews.freebsd.org/D27084
2020-11-17 20:01:21 +00:00
Conrad Meyer
de774e422e linux(4): Implement name_to_handle_at(), open_by_handle_at()
They are similar to our getfhat(2) and fhopen(2) syscalls.

Differential Revision:	https://reviews.freebsd.org/D27111
2020-11-17 19:51:47 +00:00
Kyle Evans
bd4bcd14e3 Fix !COMPAT_FREEBSD32 kernel build
One of the last shifts inadvertently moved these static assertions out of a
COMPAT_FREEBSD32 block, which the relevant definitions are limited to.

Fix it.

Pointy hat:	kevans
2020-11-17 04:22:10 +00:00
Kyle Evans
63ecb272a0 umtx_op: reduce redundancy required for compat32
All of the compat32 variants are substantially the same, save for
copyin/copyout (mostly). Apply the same kind of technique used with kevent
here by having the syscall routines supply a umtx_copyops describing the
operations needed.

umtx_copyops carries the bare minimum needed- size of timespec and
_umtx_time are used for determining if copyout is needed in the sem2_wait
case.

Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D27222
2020-11-17 03:36:58 +00:00
Kyle Evans
4be0a1b587 _umtx_op: fix a compat32 bug in UMTX_OP_NWAKE_PRIVATE
Specifically, if we're waking up some value n > BATCH_SIZE, then the
copyin(9) is wrong on the second iteration due to upp being the wrong type.
upp is currently a uint32_t**, so upp + pos advances it by twice as many
elements as it should (host pointer size vs. compat32 pointer size).

Fix it by just making upp a uint32_t*; it's still technically a double
pointer, but the distinction doesn't matter all that much here since we're
just doing arithmetic on it.

Add a test case that demonstrates the problem, placed with the libthr tests
since one messing with _umtx_op should be running these tests. Running under
compat32, the new test case will hang as threads after the first 128 get
missed in the wake. it's not immediately clear how to hit it in practice,
since pthread_cond_broadcast() uses a smaller (sleepq batch?) size observed
to be around ~50 -- I did not spend much time digging into it.

The uintptr_t change makes no functional difference, but i've tossed it in
since it's more accurate (semantically).

Reported by:	Andrew Gierth (andrew_tao173.riddles.org.uk, inspection)
Reviewed by:	kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D27231
2020-11-17 03:34:01 +00:00
Konstantin Belousov
cb596eea82 vmem: trivial warning and style fixes.
Add __unused to some args.
Change type of the iterator variables to match loop control.
Remove excessive {}.

Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D27220
2020-11-17 02:18:34 +00:00
Mateusz Guzik
1a7bb89629 cpuset: refcount-clean 2020-11-17 00:04:05 +00:00
Mateusz Guzik
89deca0a33 malloc: make malloc_large closer to standalone
This moves entire large alloc handling out of all consumers, apart from
deciding to go there.

This is a step towards creating a fast path.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D27198
2020-11-16 17:56:58 +00:00
Mateusz Guzik
19d3e47dca select: call seltdfini on process and thread exit
Since thread_zone is marked NOFREE the thread_fini callback is never
executed, meaning memory allocated by seltdinit is never released.

Adding the call to thread_dtor is not sufficient as exiting processes
cache the main thread.
2020-11-16 03:12:21 +00:00
Mateusz Guzik
31b2ac4b5a select: replace reference counting with memory barriers in selfd
Refcounting was added to combat a race between selfdfree and doselwakup,
but it adds avoidable overhead.

selfdfree detects it can free the object by ->sf_si == NULL, thus we can
ensure that the condition only holds after all accesses are completed.
2020-11-16 03:09:18 +00:00
Mateusz Guzik
b77594bbbf sched: fix an incorrect comparison in sched_lend_user_prio_cond
Compare with sched_lend_user_prio.
2020-11-15 01:54:44 +00:00
Mateusz Guzik
f34a2f56c3 thread: batch credential freeing 2020-11-14 19:22:02 +00:00
Mateusz Guzik
fb8ab68084 thread: batch resource limit free calls 2020-11-14 19:21:46 +00:00
Mateusz Guzik
5ef7b7a0f3 thread: rework tid batch to use helpers 2020-11-14 19:20:58 +00:00
Mateusz Guzik
d1ca25be49 thread: pad tid lock
On a kernel with other changes this bumps 104-way thread creation/destruction
from 0.96 mln ops/s to 1.1 mln ops/s.
2020-11-14 19:19:27 +00:00
Mateusz Guzik
9b9bb9ffa5 malloc: retire MALLOC_PROFILE
The global array has prohibitive performance impact on multicore systems.

The same data (and more) can be obtained with dtrace.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D27199
2020-11-13 19:22:53 +00:00
Konstantin Belousov
441eb16a95 Allow some VOPs to return ERELOOKUP to indicate VFS operation restart at top level.
Restart syscalls and some sync operations when filesystem indicated
ERELOOKUP condition, mostly for VOPs operating on metdata.  In
particular, lookup results cached in the inode/v_data is no longer
valid and needs recalculating.  Right now this should be nop.

Assert that ERELOOKUP is catched everywhere and not returned to
userspace, by asserting that td_errno != ERELOOKUP on syscall return
path.

In collaboration with:	pho
Reviewed by:	mckusick (previous version), markj
Tested by:	markj (syzkaller), pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D26136
2020-11-13 09:42:32 +00:00
Konstantin Belousov
7cde2ec4fd Implement vn_lock_pair().
In collaboration with:	pho
Reviewed by:	mckusick (previous version), markj (previous version)
Tested by:	markj (syzkaller), pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D26136
2020-11-13 09:31:57 +00:00
Mateusz Guzik
9aa6d792b5 malloc: retire malloc_last_fail
The routine does not serve any practical purpose.

Memory can be allocated in many other ways and most consumers pass the
M_WAITOK flag, making malloc not fail in the first place.

Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D27143
2020-11-12 20:22:58 +00:00
Mateusz Guzik
62dbc992ad thread: move nthread management out of tid_alloc
While this adds more work single-threaded, it also enables SMP-related
speed ups.
2020-11-12 00:29:23 +00:00
Kyle Evans
38033780a3 umtx: drop incorrect timespec32 definition
This works for amd64, but none others -- drop it, because we already have a
proper definition in sys/compat/freebsd32/freebsd32.h that correctly uses
time32_t.

MFC after:	1 week
2020-11-11 22:35:23 +00:00
Mateusz Guzik
755341df4f thread: batch tid_free calls in thread_reap
This eliminates the highly pessimal pattern of relocking from multiple
CPUs in quick succession. Note this is still globally serialized.
2020-11-11 18:45:06 +00:00
Mateusz Guzik
c5315f5196 thread: lockless zombie list manipulation
This gets rid of the most contended spinlock seen when creating/destroying
threads in a loop. (modulo kstack)

Tested by:	alfredo (ppc64), bdragon (ppc64)
2020-11-11 18:43:51 +00:00
Mark Johnston
f52979098d Fix a pair of races in SIGIO registration
First, funsetownlst() list looks at the first element of the list to see
whether it's processing a process or a process group list.  Then it
acquires the global sigio lock and processes the list.  However, nothing
prevents the first sigio tracker from being freed by a concurrent
funsetown() before the sigio lock is acquired.

Fix this by acquiring the global sigio lock immediately after checking
whether the list is empty.  Callers of funsetownlst() ensure that new
sigio trackers cannot be added concurrently.

Second, fsetown() uses funsetown() to remove an existing sigio structure
from a file object.  However, funsetown() uses a racy check to avoid the
sigio lock, so two threads may call fsetown() on the same file object,
both observe that no sigio tracker is present, and enqueue two sigio
trackers for the same file object.  However, if the file object is
destroyed, funsetown() will only remove one sigio tracker, and
funsetownlst() may later trigger a use-after-free when it clears the
file object reference for each entry in the list.

Fix this by introducing funsetown_locked(), which avoids the racy check.

Reviewed by:	kib
Reported by:	pho
Tested by:	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D27157
2020-11-11 13:44:27 +00:00
Mateusz Guzik
26007fe37c thread: add more fine-grained tidhash locking
Note this still does not scale but is enough to move it out of the way
for the foreseable future.

In particular a trivial benchmark spawning/killing threads stops contesting
on tidhash.
2020-11-11 08:51:04 +00:00
Mateusz Guzik
aae3547be3 thread: rework tidhash vs proc lock interaction
Apart from minor clean up this gets rid of proc unlock/lock cycle on thread
exit to work around LOR against tidhash lock.
2020-11-11 08:50:04 +00:00
Mateusz Guzik
cf31cadeb6 thread: fix thread0 tid allocation
Startup code hardcodes the value instead of allocating it.
The first spawned thread would then be a duplicate.

Pointy hat:	mjg
2020-11-11 08:48:43 +00:00
Mateusz Guzik
40aad3e477 thread: tidy up r367543
"locked" variable is spurious in the committed version.
2020-11-10 21:29:10 +00:00
Mateusz Guzik
5c5ca843b7 Allow rtprio_thread to operate on threads of any process
This in particular unbreaks rtkit.

The limitation was a leftover of previous state, to quote a
comment:

/*
 * Though lwpid is unique, only current process is supported
 * since there is no efficient way to look up a LWP yet.
 */

Long since then a global tid hash was introduced to remedy
the problem.

Permission checks still apply.

Submitted by:	greg_unrelenting.technology (Greg V)
Differential Revision:	https://reviews.freebsd.org/D27158
2020-11-10 18:10:50 +00:00
Mateusz Guzik
5c100123a3 thread: retire thread_find
tdfind should be used instead.
2020-11-10 01:57:48 +00:00
Mateusz Guzik
f837888a3e thread: use tdfind in sysctl_kern_proc_kstack
This treads linear scans for locked lookup, but more importantly removes
the only consumer of thread_find.
2020-11-10 01:57:19 +00:00
Mateusz Guzik
94275e3e69 threads: remove the unused TID_BUFFER_SIZE macro 2020-11-10 01:31:06 +00:00