Track the the current lock/reference state in a single variable,
rather than deducing the proper prison_deref() flags from a
combination of equations and hard-coded values.
Instead of trying to maintain pg_jobc counter on each process group
update (and sometimes before), just calculate the counter when needed.
Still, for the benefit of the signal delivery code, explicitly mark
orphaned groups as such with the new process group flag.
This way we prevent bugs in the corner cases where updates to the counter
were missed due to complicated configuration of p_pptr/p_opptr/real_parent
(debugger).
Since we need to iterate over all children of the process on exit, this
change mostly affects the process group entry and leave, where we need
to iterate all process group members to detect orpaned status.
(For MFC, keep pg_jobc around but unused).
Reported by: jhb
Reviewed by: jilles
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27871
This improves code structure and allows to put the lock asserts right
into place where the locks are needed.
Also move zeroing of the kinfo_proc structure from fill_kinfo_proc_only()
to fill_kinfo_proc(), this looks more symmetrical.
Reviewed by: jilles
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27871
Proctree lock is needed for correct calculation and collection of the
job-control related data in kinfo_proc. There was even an XXX comment
about it.
Satisfy locking and lock ordering requirements by taking proctree lock
around pass over each bucket in proc_iterate(), and in sysctl_kern_proc()
and note_procstat_proc() for individual process reporting.
Reviewed by: jilles
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27871
Increase the scope of the process group lock ownership. This ensures that
we are consistent in returning EIO for tty write from an orphan and delivery
of TTYOUT signals.
Reviewed by: jilles
Tested by: pho
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27871
Often, we have a process locked and need to get locked process group.
In this case, because progress group lock is before process lock,
unlocking process allows the group to be freed. See for instance
tty_wait_background().
Make pgrp structures allocated from nofree zone, and ensure type stability
of the pgrp mutex.
Reviewed by: jilles
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27871
When using NOTE_NSECONDS in the kevent(2) API, US_TO_SBT should be
used instead of NS_TO_SBT, otherwise the timeout results are
misleading.
PR: 252539
Reviewed by: kevans, kib
Approved by: kevans
MFC after: 3 weeks
This is just like debug.kdb.panic, except the string that's passed in
is reported in the panic message. This allows people with automated
systems to collect kernel panics over a large fleet of machines to
flag panics better. Strings like "Warner look at this hang" or "see
JIRA ABC-1234 for details" allow these automated systems to route the
forced panic to the appropriate engineers like you can with other
types of panics. Other users are likely possible.
Relnotes: Yes
Sponsored by: Netflix
Reviewed by: allanjude (earlier version)
Suggestions from review folded in by: 0mp, emaste, lwhsu
Differential Revision: https://reviews.freebsd.org/D28041
Ext_pg mbufs allow carrying multiple pages per mbuf. This
reduces mbuf linked list traversals, especially in socket
buffers, thereby reducing cache misses and CPU use for
applications using sendfile. Note that ext_pages use
unmapped pages, eliminating KVA mapping costs on 32-bit
platforms.
Ext_pg mbufs are also required for ktls (KERN_TLS), and having
them disabled by default is a stumbling block for those
wishing to enable ktls.
Reviewed-by: jhb, glebius
Sponsored by: Netfix
The originally chosen numbers interfere with downstream projects'
syscalls. Move them to the end of the syscall table instead.
Reported by: jrtc27
Reviewed by: brooks
MFC-With: 022ca2fc7f
Differential Revision: 022ca2fc7f
aio_fsync(O_DSYNC, ...) is the asynchronous version of fdatasync(2).
Reviewed by: kib, asomers, jhb
Differential Review: https://reviews.freebsd.org/D25071
POSIX O_DSYNC means that writes include an implicit fdatasync(2), just
as O_SYNC implies fsync(2).
VOP_WRITE() functions that understand the new IO_DATASYNC flag can act
accordingly, but we'll still pass down IO_SYNC so that file systems that
don't understand it will continue to provide the stronger O_SYNC
behaviour.
Flag also applies to fcntl(2).
Reviewed by: kib, delphij
Differential Revision: https://reviews.freebsd.org/D25090
vn_rdwr() must lock the entire file range for IO_APPEND
just like vn_io_fault() does for O_APPEND.
Reviewed by: kib, imp, mckusick
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D28008
When INVARIANTS is configred, the sendfile_iodone() callback verifies
that pages attached to the sendfile header are wired, but we unwire all
such pages after a synchronous pager error, before calling
sendfile_iodone().
Reported by: pho
Tested by: pho
Sponsored by: The FreeBSD Foundation
We have the d_off field in struct dirent for providing the seek offset
of the next directory entry. Several filesystems were not initializing
the field, which ends up being copied out to userland.
Reported by: Syed Faraz Abrar <faraz@elttam.com>
Reviewed by: kib
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27792
POSIX AIO is great, but it lacks vectored I/O functions. This commit
fixes that shortcoming by adding aio_writev and aio_readv. They aren't
part of the standard, but they're an obvious extension. They work just
like their synchronous equivalents pwritev and preadv.
It isn't yet possible to use vectored aiocbs with lio_listio, but that
could be added in the future.
Reviewed by: jhb, kib, bcr
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D27743
The change to kern_jail_set that was supposed to "also properly clean
up when attachment fails" didn't fix a memory leak but actually caused
a double free. Back that part out, and leave the part that manages
allprison_lock state.
Not all debugger operations that enumerate threads require thread
stacks to be resident in memory to be useful. Instead, push P_INMEM
checks (if needed) into callers.
Reviewed by: kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D27827
Processes part way through exit1() are not included in allproc. Using
allproc to enumerate processes prevented getting the stack trace of a
thread in this part of exit1() via ddb.
Reviewed by: kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D27826
This is used by kernel debuggers to enumerate processes via the pid
hash table.
Reviewed by: kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D27825
Exiting processes that have been removed from allproc but are still
executing are not yet marked PRS_ZOMBIE, so they were not listed (for
example, if a thread panics during exit1()). To detect these
processes, clear p_list.le_prev to NULL explicitly after removing a
process from the allproc list and check for this sentinel rather than
PRS_ZOMBIE when walking the pidhash.
While here, simplify the pidhash walk to use a single outer loop.
Reviewed by: kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D27824
Keep explicit track of the allprison_lock state during the final part
of kern_jail_set, instead of deducing it from the JAIL_ATTACH flag.
Also properly clean up when the attachment fails, fixing a long-
standing (though minor) memory leak.
Both FreeBSD and Linux mkdir -p walk the tree up ignoring any EEXIST on
the way and both are used a lot when building respective kernels.
This poses a problem as spurious locking avoidably interferes with
concurrent operations like getdirentries on affected directories.
Work around the problem by adding FAILIFEXISTS flag. In case of lockless
lookup this manages to avoid any work to begin with, there is no speed
up for the locked case but perhaps this can be augmented later on.
For simplicity the only supported semantics are as used by mkdir.
Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D27789
The previous patch failed to set the ISDOTDOT flag when appropriate,
which in turn fail to properly handle degenerate lookups.
While here sprinkle some extra assertions.
Tested by: pho (previous version)
eventfd is a Linux system call that produces special file descriptors
for event notification. When porting Linux software, it is currently
usually emulated by epoll-shim on top of kqueues. Unfortunately, kqueues
are not passable between processes. And, as noted by the author of
epoll-shim, even if they were, the library state would also have to be
passed somehow. This came up when debugging strange HW video decode
failures in Firefox. A native implementation would avoid these problems
and help with porting Linux software.
Since we now already have an eventfd implementation in the kernel (for
the Linuxulator), it's pretty easy to expose it natively, which is what
this patch does.
Submitted by: greg@unrelenting.technology
Reviewed by: markj (previous version)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D26668
allprison_lock should be at least held shared when jail OSD methods
are called. Add a shared lock around one such call where that wasn't
the case.
In another such call, change an exclusive lock grab to be shared in
what is likely the more common case.
Return a boolean (i.e. 0 or 1) from prison_allow, instead of the flag
value itself, which is what sysctl expects.
Add prison_set_allow(), which can set or clear a permission bit, and
propagates cleared bits down to child jails.
Use prison_allow() and prison_set_allow() in the various jail.allow.*
sysctls, and others that depend on thoe permissions.
Add locking around checking both pr_allow and pr_enforce_statfs in
prison_priv_check().
We initialize sfio->npages only when some I/O is required to satisfy the
request. However, sendfile_iodone() contains an INVARIANTS-only check
that references sfio->npages, and this check is executed even if no I/O
is performed, so the check may use an uninitialized value.
Fix the problem by initializing sfio->npages earlier. Note that
sendfile_swapin() always initializes the page array. In some rare cases
we need to trim the page array so ensure that sfio->npages gets updated
accordingly.
Reported by: syzkaller (with KASAN)
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27726
Use atomic access and a memory barrier to ensure that the flag parameter
in pr_flag_allow is indeed set after the rest of the structure is valid.
Simplify adding flag bits with pr_allow_all, a dynamic version of
PR_ALLOW_ALL_STATIC.
When a jail is added using the default (system-chosen) JID, and
non-default-JID jails already exist, a loop through the allprison
list could restart and result in unnecessary O(n^2) behaviour.
There should never be more than two list passes required.
Also clean up inefficient (though still O(n)) allprison list traversal
when finding jails by ID, or when adding jails in the common case of
all default JIDs.
Vectored aio will require each aiocb to be associated with multiple
bios, so we can't store a link to the latter from the former. But we
don't really need to. aio_biowakeup already knows the bio it's using,
and the other fields can be stored within the bio and/or buf itself.
Also, remove the unused kaiocb.backend2 field.
Reviewed By: kib
Differential Revision: https://reviews.freebsd.org/D27682
When ktls_bind_thread is 2, we pick a ktls worker thread that is
bound to the same domain as the TCP connection associated with
the socket. We use roughly the same code as netinet/tcp_hpts.c to
do this. This allows crypto to run on the same domain as the TCP
connection is associated with. Assuming TCP_REUSPORT_LB_NUMA
(D21636) is in place & in use, this ensures that the crypto source
and destination buffers are local to the same NUMA domain as we're
running crypto on.
This change (when TCP_REUSPORT_LB_NUMA, D21636, is used) reduces
cross-domain traffic from over 37% down to about 13% as measured
by pcm.x on a dual-socket Xeon using nginx and a Netflix workload.
Reviewed by: jhb
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21648
This partially lifts a restriction imposed by r191639 ("Prevent a superuser
inside a jail from modifying the dedicated root cpuset of that jail") that's
perhaps beneficial after r192895 ("Add hierarchical jails."). Jails still
cannot modify their own cpuset, but they can modify child jails' roots to
further restrict them or widen them back to the modifying jails' own mask.
As a side effect of this, the system root may once again widen the mask of
jails as long as they're still using a subset of the parent jails' mask.
This was previously prevented by the fact that cpuset_getroot of a root set
will return that root, rather than the root's parent -- cpuset_modify uses
cpuset_getroot since it was introduced in r327895, previously it was just
validating against set->cs_parent which allowed the system root to widen
jail masks.
Reviewed by: jamie
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D27352
Also centralize and unify checks to enable ASLR stack gap in a new
helper exec_stackgap().
PR: 239873
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Similar to r366897, this uses the .incbin directive to pull in a
firmware file's contents into a .fwo file. The same scheme for
computing symbol names from the filename is used as before to maximize
compatiblity and not require rebuilding existing .fwo files for
NO_CLEAN builds. Using ld -o binary requires extra hacks in linkers
to either specify ABI options (e.g. soft- vs hard-float) or to ignore
ABI incompatiblities when linking certain objects (e.g. object files
with only data). Using the compiler driver avoids the need for these
hacks as the compiler driver is able to set all the appropriate ABI
options.
Reviewed by: imp, markj
Obtained from: CheriBSD
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D27579
Since we do not own the session lock, a parallel killjobc() might
reset s_leader to NULL after we checked it. Read s_leader only once
and ensure that compiler is not allowed to reload.
While there, make access to t_session somewhat more pretty by using
local variable.
PR: 251915
Submitted by: Jakub Piecuch <j.piecuch96@gmail.com>
MFC after: 1 week
refcount_acquire_if_not_zero returns true on saturation.
The case of 0 is handled by looping again, after which the originally
found pointer will no longer be there.
Noted by: kib
p_fd nullification in fdescfree serializes against new threads transitioning
the count 1 -> 2, meaning that fdescfree_fds observing the count of 1 can
safely assume there is nobody else using the table. Losing the race and
observing > 1 is harmless.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D27522
To export information from fd tables we have several loops which do
this:
FILDESC_SLOCK(fdp);
for (i = 0; fdp->fd_refcount > 0 && i <= lastfile; i++)
<export info for fd i>;
FILDESC_SUNLOCK(fdp);
Before r367777, fdescfree() acquired the fd table exclusive lock between
decrementing fdp->fd_refcount and freeing table entries. This
serialized with the loop above, so the file at descriptor i would remain
valid until the lock is dropped. Now there is no serialization, so the
loops may race with teardown of file descriptor tables.
Acquire the exclusive fdtable lock after releasing the final table
reference to provide a barrier synchronizing with these loops.
Reported by: pho
Reviewed by: kib (previous version), mjg
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27513
cpuset_modify() would not currently catch this, because it only checks that
the new mask is a subset of the root set and circumvents the EDEADLK check
in cpuset_testupdate().
This change both directly validates the mask coming in since we can
trivially detect an empty mask, and it updates cpuset_testupdate to catch
stuff like this going forward by always ensuring we don't end up with an
empty mask.
The check_mask argument has been renamed because the 'check' verbiage does
not imply to me that it's actually doing a different operation. We're either
augmenting the existing mask, or we are replacing it entirely.
Reported by: syzbot+4e3b1009de98d2fabcda@syzkaller.appspotmail.com
Discussed with: andrew
Reviewed by: andrew, markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D27511
The race plays out like so between threads A and B:
1. A ref's cpuset 10
2. B does a lookup of cpuset 10, grabs the cpuset lock and searches
cpuset_ids
3. A rel's cpuset 10 and observes the last ref, waits on the cpuset lock
while B is still searching and not yet ref'd
4. B ref's cpuset 10 and drops the cpuset lock
5. A proceeds to free the cpuset out from underneath B
Resolve the race by only releasing the last reference under the cpuset lock.
Thread A now picks up the spinlock and observes that the cpuset has been
revived, returning immediately for B to deal with later.
Reported by: syzbot+92dff413e201164c796b@syzkaller.appspotmail.com
Reviewed by: markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D27498
cpuset_rel_defer() is supposed to be functionally equivalent to
cpuset_rel() but with anything that might sleep deferred until
cpuset_rel_complete -- this setup is used specifically for cpuset_setproc.
Add in the missing unr free to match cpuset_rel. This fixes a leak that
was observed when I wrote a small userland application to try and debug
another issue, which effectively did:
cpuset(&newid);
cpuset(&scratch);
newid gets leaked when scratch is created; it's off the list, so there's
no mechanism for anything else to relinquish it. A more realistic reproducer
would likely be a process that inherits some cpuset that it's the only ref
for, but it creates a new one to modify. Alternatively, administratively
reassigning a process' cpuset that it's the last ref for will have the same
effect.
Discovered through D27498.
MFC after: 1 week
They were only modified to accomodate a redundant assertion.
This runs into problems as lockless lookup can still try to use the vnode
and crash instead of getting an error.
The bug was only present in kernels with INVARIANTS.
Reported by: kevans
This is a valid scenario that's handled in the various protocol layers where
it makes sense (e.g., tcp_disconnect and sctp_disconnect). Given that it
indicates we should immediately drop the connection, it makes little sense
to sleep on it.
This could lead to panics with INVARIANTS. On non-INVARIANTS kernels, this
could result in the thread hanging until a signal interrupts it if the
protocol does not mark the socket as disconnected for whatever reason.
Reported by: syzbot+e625d92c1dd74e402c81@syzkaller.appspotmail.com
Reviewed by: glebius, markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D27407
As of r365978, minidumps include a copy of dump_avail[]. This is an
array of vm_paddr_t ranges. libkvm walks the array assuming that
sizeof(vm_paddr_t) is equal to the platform "word size", but that's not
correct on some platforms. For instance, i386 uses a 64-bit vm_paddr_t.
Fix the problem by always dumping 64-bit addresses. On platforms where
vm_paddr_t is 32 bits wide, namely arm and mips (sometimes), translate
dump_avail[] to an array of uint64_t ranges. With this change, libkvm
no longer needs to maintain a notion of the target word size, so get rid
of it.
This is a no-op on platforms where sizeof(vm_paddr_t) == 8.
Reviewed by: alc, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27082
hw.physmem tunable allows to limit number of physical memory available to the
system. It's handled in machdep files for x86 and PowerPC. This patch adds
required logic to the consolidated physmem management interface that is used by
ARM, ARM64, and RISC-V.
Submitted by: Klara, Inc.
Reviewed by: mhorne
Sponsored by: Ampere Computing
Differential Revision: https://reviews.freebsd.org/D27152
Prior to the patch returning selfdfree could still be racing against doselwakeup
which set sf_si = NULL and now locks stp to wake up the other thread.
A sufficiently unlucky pair can end up going all the way down to freeing
select-related structures before the lock/wakeup/unlock finishes.
This started manifesting itself as crashes since select data started getting
freed in r367714.
Right now, if lio registered zero jobs, syscall frees lio job
structure, cleaning up queued ksi. As result, the realtime signal is
dequeued and never delivered.
Fix it by allowing sendsig() to copy ksi when job count is zero.
PR: 220398
Reported and reviewed by: asomers
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D27421
A pair of bugs are believed to have caused the hangs described in the
commit log message for r364744:
1. uma_reclaim() could trigger reclamation of the reserve of boundary
tags used to avoid deadlock. This was fixed by r366840.
2. The loop in vmem_xalloc() would in some cases try to allocate more
boundary tags than the expected upper bound of BT_MAXALLOC. The
reserve is sized based on the value BT_MAXMALLOC, so this behaviour
could deplete the reserve without guaranteeing a successful
allocation, resulting in a hang. This was fixed by r366838.
PR: 248008
Tested by: rmacklem
Refactor sysctl_sysctl_next_ls():
* Move huge inner loop out of sysctl_sysctl_next_ls() into a separate
non-recursive function, returning the next step to be taken.
* Update resulting node oid parts only on successful lookup
* Make sysctl_sysctl_next_ls() return boolean success/failure instead of errno,
slightly simplifying logic
Reviewed by: freqlabs
Differential Revision: https://reviews.freebsd.org/D27029
Implement vt_vbefb to support Vesa Bios Extensions (VBE) framebuffer with VT.
vt_vbefb is built based on vt_efifb and is assuming similar data for
initialization, use MODINFOMD_VBE_FB to identify the structure vbe_fb
in kernel metadata.
struct vbe_fb, is populated by boot loader, and is passed to kernel via
metadata payload.
Differential Revision: https://reviews.freebsd.org/D27373
Apparently some architectures, like ppc in its hashed page tables
variants, account mappings by pmap_qenter() in the response from
pmap_is_page_mapped().
While there, eliminate useless userp variable.
Noted and reviewed by: alc (previous version)
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D27409
Replace MAXPHYS by runtime variable maxphys. It is initialized from
MAXPHYS by default, but can be also adjusted with the tunable kern.maxphys.
Make b_pages[] array in struct buf flexible. Size b_pages[] for buffer
cache buffers exactly to atop(maxbcachebuf) (currently it is sized to
atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1.
The +1 for pbufs allow several pbuf consumers, among them vmapbuf(),
to use unaligned buffers still sized to maxphys, esp. when such
buffers come from userspace (*). Overall, we save significant amount
of otherwise wasted memory in b_pages[] for buffer cache buffers,
while bumping MAXPHYS to desired high value.
Eliminate all direct uses of the MAXPHYS constant in kernel and driver
sources, except a place which initialize maxphys. Some random (and
arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted
straight. Some drivers, which use MAXPHYS to size embeded structures,
get private MAXPHYS-like constant; their convertion is out of scope
for this work.
Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs,
dev/siis, where either submitted by, or based on changes by mav.
Suggested by: mav (*)
Reviewed by: imp, mav, imp, mckusick, scottl (intermediate versions)
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D27225
Restructure the loop a little bit to make it a little more clear how it
really operates: we never allocate any domains at the beginning of the first
iteration, and it will run until we've satisfied the amount we need or we
encounter an error.
The lock is now taken outside of the loop to make stuff inside the loop
easier to evaluate w.r.t. locking.
This fixes it to not try and allocate any domains for the freelist under the
spinlock, which would have happened before if we needed any new domains.
Reported by: syzbot+6743fa07b9b7528dc561@syzkaller.appspotmail.com
Reviewed by: markj
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D27371
There is no reason why vp->v_object cannot be NULL. If it is, it's
fine, handle it by delegating to VOP_READ().
Tested by: pho
Reviewed by: markj, mjg
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D27327
- VFS_UNMOUNT() requires vn_start_write() around it [*].
- call VFS_PURGE() before unmount.
- do not destroy mp if cleanup unmount did not succeed.
- set MNTK_UNMOUNT, and indicate forced unmount with MNTK_UNMOUNTF
for VFS_UNMOUNT() in cleanup.
PR: 251320 [*]
Reported by: Tong Zhang <ztong0001@gmail.com>
Reviewed by: markj, mjg
Discussed with: rmacklem
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D27327
The commited patch was incomplete.
- add back missing goto retry, noted by jhb
- 'if (error)' -> 'if (error != 0)'
- consistently do:
if (error != 0)
break;
continue;
instead of:
if (error != 0)
break;
else
continue;
This adds some 'continue' uses which are not needed, but line up with the
rest of pipe_write.
The current logic is a fine choice for a system administrator modifying
process cpusets or a process creating a new cpuset(2), but not ideal for
processes attaching to a jail.
Currently, when a process attaches to a jail, it does exactly what any other
process does and loses any mask it might have applied in the process of
doing so because cpuset_setproc() is entirely based around the assumption
that non-anonymous cpusets in the process can be replaced with the new
parent set.
This approach slightly improves the jail attach integration by modifying
cpuset_setproc() callers to indicate if they should rebase their cpuset to
the indicated set or not (i.e. cpuset_setproc_update_set).
If we're rebasing and the process currently has a cpuset assigned that is
not the containing jail's root set, then we will now create a new base set
for it hanging off the jail's root with the existing mask applied instead of
using the jail's root set as the new base set.
Note that the common case will be that the process doesn't have a cpuset
within the jail root, but the system root can freely assign a cpuset from
a jail to a process outside of the jail with no restriction. We assume that
that may have happened or that it could happen due to a race when we drop
the proc lock, so we must recheck both within the loop to gather up
sufficient freed cpusets and after the loop.
To recap, here's how it worked before in all cases:
0 4 <-- jail 0 4 <-- jail / process
| |
1 -> 1
|
3 <-- process
Here's how it works now:
0 4 <-- jail 0 4 <-- jail
| | |
1 -> 1 5 <-- process
|
3 <-- process
or
0 4 <-- jail 0 4 <-- jail / process
| |
1 <-- process -> 1
More importantly, in both cases, the attaching process still retains the
mask it had prior to attaching or the attach fails with EDEADLK if it's
left with no CPUs to run on or the domain policy is incompatible. The
author of this patch considers this almost a security feature, because a MAC
policy could grant PRIV_JAIL_ATTACH to an unprivileged user that's
restricted to some subset of available CPUs the ability to attach to a jail,
which might lift the user's restrictions if they attach to a jail with a
wider mask.
In most cases, it's anticipated that admins will use this to be able to,
for example, `cpuset -c -l 1 jail -c path=/ command=/long/running/cmd`,
and avoid the need for contortions to spawn a command inside a jail with a
more limited cpuset than the jail.
Reviewed by: jamie
MFC after: 1 month (maybe)
Differential Revision: https://reviews.freebsd.org/D27298
cpuset_init() is better descriptor for what the function actually does. The
name was previously taken by a sysinit that setup cpuset_zero's mask
from all_cpus, it was removed in r331698 before stable/12 branched.
A comment referencing the removed sysinit has now also been removed, since
the setup previously done was moved into cpuset_thread0().
Suggested by: markj
MFC after: 1 week
Currently, it must always allocate a new set to be used for passing to
_cpuset_create, but it doesn't have to. This is purely kern_cpuset.c
internal and it's sparsely used, so just change it to use *setp if it's
not-NULL and modify the two consumers to pass in the address of a NULL
cpuset.
This paves the way for consumers that want the unr allocation without the
possibility of sleeping as long as they've done their due diligence to
ensure that the mask will properly apply atop the supplied parent
(i.e. avoiding the free_unr() in the last failure path).
Reviewed by: jamie, markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D27297
All paths leading into closefp() will either replace or remove the fd from
the filedesc table, and closefp() will call fo_close methods that can and do
currently sleep without regard for the possibility of an ERESTART. This can
be dangerous in multithreaded applications as another thread could have
opened another file in its place that is subsequently operated on upon
restart.
The following are seemingly the only ones that will pass back ERESTART
in-tree:
- sockets (SO_LINGER)
- fusefs
- nfsclient
Reviewed by: jilles, kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D27310
Adding to zombie list can be perfomed by idle threads, which on ppc64 leads to
panics as it requires a sleepable lock.
Reported by: alfredo
Reviewed by: kib, markj
Fixes: r367842 ("thread: numa-aware zombie reaping")
Differential Revision: https://reviews.freebsd.org/D27288
Exec and exit are same as corresponding eventhandler hooks.
Thread exit hook is called somewhat earlier, while thread is still
owned by the process and enough context is available. Note that the
process lock is owned when the hook is called.
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D27309
As far as I can tell, this has been the case since initially committed in
2008. cpuset_setproc is the executor of cpuset reassignment; note this
excerpt from the description:
* 1) Set is non-null. This reparents all anonymous sets to the provided
* set and replaces all non-anonymous td_cpusets with the provided set.
However, reviewing cpuset_setproc_setthread() for some jail related work
unearthed the error: if tdset was not anonymous, we were replacing it with
`set`. If it was anonymous, then we'd rebase it onto `set` (i.e. copy the
thread's mask over and AND it with `set`) but give the new anonymous set
the original tdset as the parent (i.e. the base of the set we're supposed to
be leaving behind).
The primary visible consequences were that:
1.) cpuset_getid() following such assignment returns the wrong result, the
setid that we left behind rather than the one we joined.
2.) When a process attached to the jail, the base set of any anonymous
threads was a set outside of the jail.
This was initially bundled in D27298, but it's a minor fix that's fairly
easy to verify the correctness of.
A test is included in D27307 ("badparent"), which demonstrates the issue
with, effectively:
osetid = cpuset_getid()
newsetid = cpuset()
cpuset_setaffinity(thread)
cpuset_setid(osetid)
cpuset_getid(thread) -> observe that it matches newsetid instead of osetid.
MFC after: 1 week
Providing these in freebsd32.h facilitates local testing/measuring of the
structs rather than forcing one to locally recreate them. Sanity checking
offsets/sizes remains in kern_umtx.c where these are typically used.
oldfde may be invalidated if the table has grown due to the operation that
we're performing, either via fdalloc() or a direct fdgrowtable_exp().
This was technically OK before rS367927 because the old table remained valid
until the filedesc became unused, but now it may be freed immediately if
it's an unshared table in a single-threaded process, so it is no longer a
good assumption to make.
This fixes dup/dup2 invocations that grow the file table; in the initial
report, it manifested as a kernel panic in devel/gmake's configure script.
Reported by: Guy Yur <guyyur gmail com>
Reviewed by: rew
Differential Revision: https://reviews.freebsd.org/D27319
This patch takes advantage of the consolidation that happened to provide two
flags that can be used with the native _umtx_op(2): UMTX_OP___32BIT and
UMTX_OP__I386.
UMTX_OP__32BIT iindicates that we are being provided with 32-bit structures.
Note that this flag alone indicates a 64bit time_t, since this is the
majority case.
UMTX_OP__I386 has been provided so that we can emulate i386 as well,
regardless of whether the host is amd64 or not.
Both imply a different set of copyops in sysumtx_op. freebsd32__umtx_op
simply ignores the flags, since it's already doing a 32-bit operation and
it's unlikely we'll be running an emulator under compat32. Future work
could consider it, but the author sees little benefit.
This will be used by qemu-bsd-user to pass on all _umtx_op calls to the
native interface as long as the host/target endianness matches, effectively
eliminating most if not all of the remaining unresolved deadlocks for most.
This version changed a fair amount from what was under review, mostly in
response to refactoring of the prereq reorganization and battle-testing
it with qemu-bsd-user. The main changes are as follows:
1.) The i386 flag got renamed to omit '32BIT' since this is redundant.
2.) The flags are now properly handled on 32-bit platforms to emulate other
32-bit platforms.
3.) Robust list handling was fixed, and the 32-bit functionality that was
previously gated by COMPAT_FREEBSD32 is now unconditional.
4.) Robust list handling was also improved, including the error reported
when a process has already registered 32-bit ABI lists and also
detecting if native robust lists have already been registered. Both
scenarios now return EBUSY rather than EINVAL, because the input is
technically valid but we're too busy with another ABI's lists.
libsysdecode/kdump/truss support will go into review soon-ish, along with
the associated manpage update.
Reviewed by: kib (earlier version)
MFC after: 3 weeks
During the life of a process, new file descriptor tables may be allocated. When
a new table is allocated, the old table is placed in a free list and held onto
until all processes referencing them exit.
When a new file descriptor table is allocated, the old file descriptor table
can be freed when the current process has a single-thread and the file
descriptor table is not being shared with any other processes.
Reviewed by: kevans
Approved by: kevans (mentor)
Differential Revision: https://reviews.freebsd.org/D18617
While there, do some minor cleanup for kclocks. They are only
registered from kern_time.c, make registration function static.
Remove event hooks, they are not used by both registered kclocks.
Add some consts.
Perhaps we can stop registering kclocks at all and statically
initialize them.
Reviewed by: mjg
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D27305
There is no point in dynamic registration, umtx hook is there always.
Reviewed by: mjg
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D27303
There are message based controllers that can bind interrupts even if they are
not implemented as root controllers (such as the ITS subblock of GIC).
MFC after: 3 weeks
All reads and writes are serialized with a hand-rolled lock, but unlocking it
always wakes up all waiters. Existing flag fields get resized to make room for
introduction of waiter counter without growing the struct.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D27273
Suppose a running callout re-arms itself, and before the callout
finishes running another CPU calls callout_drain() and goes to sleep.
softclock_call_cc() will wake up the draining thread, which may not run
immediately if there is a lot of CPU load. Furthermore, the callout is
still in the callout wheel so it can continue to run and re-arm itself.
Then, suppose that the callout migrates to another CPU before the
draining thread gets a chance to run. The draining thread is in this
loop in _callout_stop_safe():
while (cc_exec_curr(cc) == c) {
CC_UNLOCK(cc);
sleep();
CC_LOCK(cc);
}
but after the migration, cc points to the wrong CPU's callout state.
Then the draining thread goes off and removes the callout from the
wheel, but does so using the wrong lock and per-CPU callout state.
Fix the problem by doing a re-lookup of the callout CPU after sleeping.
Reported by: syzbot+79569cd4d76636b2cc1c@syzkaller.appspotmail.com
Reported by: syzbot+1b27e0237aa22d8adffa@syzkaller.appspotmail.com
Reported by: syzbot+e21aa5b85a9aff90ef3e@syzkaller.appspotmail.com
Reviewed by: emaste, hselasky
Tested by: pho
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27266
There are many cases where one would choose avoid entering the debugger
on a normal panic, opting instead to reboot and possibly save a kernel
dump. However, recursive kernel panics are an unusual case that might
warrant attention from a human, so provide a secondary tunable,
debug.debugger_on_recursive_panic, to allow entering the debugger only
when this occurs.
For for simplicity in maintaining existing behaviour, the tunable
defaults to zero.
Reviewed by: cem, markj
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D27271
The current global list is a significant problem, in particular induces a lot
of cross-domain thread frees. When running poudriere on a 2 domain box about
half of all frees were of that nature.
Patch below introduces per-domain thread data containing zombie lists and
domain-aware reaping. By default it only reaps from the current domain, only
reaping from others if there is free TID shortage.
A dedicated callout is introduced to reap lingering threads if there happens
to be no activity.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D27185
pipes get stated all thet time and this avoidably contributed to contention.
The pipe lock is only held to accomodate MAC and to check the type.
Since normally there is no probe for pipe stat depessimize this by having the
flag.
The pipe_state field gets modified with locks held all the time and it's not
feasible to convert them to use atomic store. Move the type flag away to a
separate variable as a simple cleanup and to provide stable field to read.
Use short for both fields to avoid growing the struct.
While here short-circuit MAC for pipe_poll as well.
The arm configs that required it have been removed from the tree.
Removing this option makes the callout code easier to read and
discourages developers from adding new configs without eventtimer
drivers.
Reviewed by: ian, imp, mav
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27270
The suser_enable sysctl allows to remove a privileged rights from uid 0.
This change introduce per jail setting which allow to make root a
normal user.
Reviewed by: jamie
Previous version reviewed by: kevans, emaste, markj, me_igalic.co
Discussed with: pjd
Differential Revision: https://reviews.freebsd.org/D27128
- Mask out recently added VV_* bits to avoid printing them twice.
- Keep VI_LOCKed on the same line as the rest of the flags.
Reviewed by: kib
Obtained from: CheriBSD
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D27261
A copy-pasto left us copying in 24-bytes at the address of the rb pointer
instead of the intended target.
Reported by: sigsys@gmail.com
Sighing: kevans
The two flags are distinct and it is impossible to correctly handle clone(2)
without the assistance of fork1(). This change depends on the pwddesc split
introduced in r367777.
I've added a fork_req flag, FR2_SHARE_PATHS, which indicates that p_pd
should be treated the opposite way p_fd is (based on RFFDG flag). This is a
little ugly, but the benefit is that existing RFFDG API is preserved.
Holding FR2_SHARE_PATHS disabled, RFFDG indicates both p_fd and p_pd are
copied, while !RFFDG indicates both should be cloned.
In Chrome, clone(2) is used with CLONE_FS, without CLONE_FILES, and expects
independent fd tables.
The previous conflation of CLONE_FS and CLONE_FILES was introduced in
r163371 (2006).
Discussed with: markj, trasz (earlier version)
Differential Revision: https://reviews.freebsd.org/D27016
No functional change intended.
Tracking these structures separately for each proc enables future work to
correctly emulate clone(2) in linux(4).
__FreeBSD_version is bumped (to 1300130) for consumption by, e.g., lsof.
Reviewed by: kib
Discussed with: markj, mjg
Differential Revision: https://reviews.freebsd.org/D27037
As this ABI is still fresh (r367287), let's correct some mistakes now:
- Version the structure to allow for future changes
- Include sender's pid in control message structure
- Use a distinct control message type from the cmsgcred / sockcred mess
Discussed with: kib, markj, trasz
Differential Revision: https://reviews.freebsd.org/D27084
One of the last shifts inadvertently moved these static assertions out of a
COMPAT_FREEBSD32 block, which the relevant definitions are limited to.
Fix it.
Pointy hat: kevans
All of the compat32 variants are substantially the same, save for
copyin/copyout (mostly). Apply the same kind of technique used with kevent
here by having the syscall routines supply a umtx_copyops describing the
operations needed.
umtx_copyops carries the bare minimum needed- size of timespec and
_umtx_time are used for determining if copyout is needed in the sem2_wait
case.
Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D27222
Specifically, if we're waking up some value n > BATCH_SIZE, then the
copyin(9) is wrong on the second iteration due to upp being the wrong type.
upp is currently a uint32_t**, so upp + pos advances it by twice as many
elements as it should (host pointer size vs. compat32 pointer size).
Fix it by just making upp a uint32_t*; it's still technically a double
pointer, but the distinction doesn't matter all that much here since we're
just doing arithmetic on it.
Add a test case that demonstrates the problem, placed with the libthr tests
since one messing with _umtx_op should be running these tests. Running under
compat32, the new test case will hang as threads after the first 128 get
missed in the wake. it's not immediately clear how to hit it in practice,
since pthread_cond_broadcast() uses a smaller (sleepq batch?) size observed
to be around ~50 -- I did not spend much time digging into it.
The uintptr_t change makes no functional difference, but i've tossed it in
since it's more accurate (semantically).
Reported by: Andrew Gierth (andrew_tao173.riddles.org.uk, inspection)
Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D27231
Add __unused to some args.
Change type of the iterator variables to match loop control.
Remove excessive {}.
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D27220
This moves entire large alloc handling out of all consumers, apart from
deciding to go there.
This is a step towards creating a fast path.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D27198
Since thread_zone is marked NOFREE the thread_fini callback is never
executed, meaning memory allocated by seltdinit is never released.
Adding the call to thread_dtor is not sufficient as exiting processes
cache the main thread.
Refcounting was added to combat a race between selfdfree and doselwakup,
but it adds avoidable overhead.
selfdfree detects it can free the object by ->sf_si == NULL, thus we can
ensure that the condition only holds after all accesses are completed.
The global array has prohibitive performance impact on multicore systems.
The same data (and more) can be obtained with dtrace.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D27199
Restart syscalls and some sync operations when filesystem indicated
ERELOOKUP condition, mostly for VOPs operating on metdata. In
particular, lookup results cached in the inode/v_data is no longer
valid and needs recalculating. Right now this should be nop.
Assert that ERELOOKUP is catched everywhere and not returned to
userspace, by asserting that td_errno != ERELOOKUP on syscall return
path.
In collaboration with: pho
Reviewed by: mckusick (previous version), markj
Tested by: markj (syzkaller), pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D26136
The routine does not serve any practical purpose.
Memory can be allocated in many other ways and most consumers pass the
M_WAITOK flag, making malloc not fail in the first place.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D27143
This works for amd64, but none others -- drop it, because we already have a
proper definition in sys/compat/freebsd32/freebsd32.h that correctly uses
time32_t.
MFC after: 1 week
This gets rid of the most contended spinlock seen when creating/destroying
threads in a loop. (modulo kstack)
Tested by: alfredo (ppc64), bdragon (ppc64)
First, funsetownlst() list looks at the first element of the list to see
whether it's processing a process or a process group list. Then it
acquires the global sigio lock and processes the list. However, nothing
prevents the first sigio tracker from being freed by a concurrent
funsetown() before the sigio lock is acquired.
Fix this by acquiring the global sigio lock immediately after checking
whether the list is empty. Callers of funsetownlst() ensure that new
sigio trackers cannot be added concurrently.
Second, fsetown() uses funsetown() to remove an existing sigio structure
from a file object. However, funsetown() uses a racy check to avoid the
sigio lock, so two threads may call fsetown() on the same file object,
both observe that no sigio tracker is present, and enqueue two sigio
trackers for the same file object. However, if the file object is
destroyed, funsetown() will only remove one sigio tracker, and
funsetownlst() may later trigger a use-after-free when it clears the
file object reference for each entry in the list.
Fix this by introducing funsetown_locked(), which avoids the racy check.
Reviewed by: kib
Reported by: pho
Tested by: pho
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27157
Note this still does not scale but is enough to move it out of the way
for the foreseable future.
In particular a trivial benchmark spawning/killing threads stops contesting
on tidhash.
This in particular unbreaks rtkit.
The limitation was a leftover of previous state, to quote a
comment:
/*
* Though lwpid is unique, only current process is supported
* since there is no efficient way to look up a LWP yet.
*/
Long since then a global tid hash was introduced to remedy
the problem.
Permission checks still apply.
Submitted by: greg_unrelenting.technology (Greg V)
Differential Revision: https://reviews.freebsd.org/D27158
There are workloads with very bursty tid allocation and since unr tries very
hard to have small-sized bitmaps it keeps reallocating memory. Just doing
buildkernel gives almost 150k calls to free coming from unr.
This also gets rid of the hack which tried to postpone TID reuse.
Reviewed by: kib, markj
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D27101