They were only modified to accomodate a redundant assertion.
This runs into problems as lockless lookup can still try to use the vnode
and crash instead of getting an error.
The bug was only present in kernels with INVARIANTS.
Reported by: kevans
This is a valid scenario that's handled in the various protocol layers where
it makes sense (e.g., tcp_disconnect and sctp_disconnect). Given that it
indicates we should immediately drop the connection, it makes little sense
to sleep on it.
This could lead to panics with INVARIANTS. On non-INVARIANTS kernels, this
could result in the thread hanging until a signal interrupts it if the
protocol does not mark the socket as disconnected for whatever reason.
Reported by: syzbot+e625d92c1dd74e402c81@syzkaller.appspotmail.com
Reviewed by: glebius, markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D27407
As of r365978, minidumps include a copy of dump_avail[]. This is an
array of vm_paddr_t ranges. libkvm walks the array assuming that
sizeof(vm_paddr_t) is equal to the platform "word size", but that's not
correct on some platforms. For instance, i386 uses a 64-bit vm_paddr_t.
Fix the problem by always dumping 64-bit addresses. On platforms where
vm_paddr_t is 32 bits wide, namely arm and mips (sometimes), translate
dump_avail[] to an array of uint64_t ranges. With this change, libkvm
no longer needs to maintain a notion of the target word size, so get rid
of it.
This is a no-op on platforms where sizeof(vm_paddr_t) == 8.
Reviewed by: alc, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27082
hw.physmem tunable allows to limit number of physical memory available to the
system. It's handled in machdep files for x86 and PowerPC. This patch adds
required logic to the consolidated physmem management interface that is used by
ARM, ARM64, and RISC-V.
Submitted by: Klara, Inc.
Reviewed by: mhorne
Sponsored by: Ampere Computing
Differential Revision: https://reviews.freebsd.org/D27152
Prior to the patch returning selfdfree could still be racing against doselwakeup
which set sf_si = NULL and now locks stp to wake up the other thread.
A sufficiently unlucky pair can end up going all the way down to freeing
select-related structures before the lock/wakeup/unlock finishes.
This started manifesting itself as crashes since select data started getting
freed in r367714.
Right now, if lio registered zero jobs, syscall frees lio job
structure, cleaning up queued ksi. As result, the realtime signal is
dequeued and never delivered.
Fix it by allowing sendsig() to copy ksi when job count is zero.
PR: 220398
Reported and reviewed by: asomers
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D27421
A pair of bugs are believed to have caused the hangs described in the
commit log message for r364744:
1. uma_reclaim() could trigger reclamation of the reserve of boundary
tags used to avoid deadlock. This was fixed by r366840.
2. The loop in vmem_xalloc() would in some cases try to allocate more
boundary tags than the expected upper bound of BT_MAXALLOC. The
reserve is sized based on the value BT_MAXMALLOC, so this behaviour
could deplete the reserve without guaranteeing a successful
allocation, resulting in a hang. This was fixed by r366838.
PR: 248008
Tested by: rmacklem
Refactor sysctl_sysctl_next_ls():
* Move huge inner loop out of sysctl_sysctl_next_ls() into a separate
non-recursive function, returning the next step to be taken.
* Update resulting node oid parts only on successful lookup
* Make sysctl_sysctl_next_ls() return boolean success/failure instead of errno,
slightly simplifying logic
Reviewed by: freqlabs
Differential Revision: https://reviews.freebsd.org/D27029
Implement vt_vbefb to support Vesa Bios Extensions (VBE) framebuffer with VT.
vt_vbefb is built based on vt_efifb and is assuming similar data for
initialization, use MODINFOMD_VBE_FB to identify the structure vbe_fb
in kernel metadata.
struct vbe_fb, is populated by boot loader, and is passed to kernel via
metadata payload.
Differential Revision: https://reviews.freebsd.org/D27373
Apparently some architectures, like ppc in its hashed page tables
variants, account mappings by pmap_qenter() in the response from
pmap_is_page_mapped().
While there, eliminate useless userp variable.
Noted and reviewed by: alc (previous version)
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D27409
Replace MAXPHYS by runtime variable maxphys. It is initialized from
MAXPHYS by default, but can be also adjusted with the tunable kern.maxphys.
Make b_pages[] array in struct buf flexible. Size b_pages[] for buffer
cache buffers exactly to atop(maxbcachebuf) (currently it is sized to
atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1.
The +1 for pbufs allow several pbuf consumers, among them vmapbuf(),
to use unaligned buffers still sized to maxphys, esp. when such
buffers come from userspace (*). Overall, we save significant amount
of otherwise wasted memory in b_pages[] for buffer cache buffers,
while bumping MAXPHYS to desired high value.
Eliminate all direct uses of the MAXPHYS constant in kernel and driver
sources, except a place which initialize maxphys. Some random (and
arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted
straight. Some drivers, which use MAXPHYS to size embeded structures,
get private MAXPHYS-like constant; their convertion is out of scope
for this work.
Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs,
dev/siis, where either submitted by, or based on changes by mav.
Suggested by: mav (*)
Reviewed by: imp, mav, imp, mckusick, scottl (intermediate versions)
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D27225
Restructure the loop a little bit to make it a little more clear how it
really operates: we never allocate any domains at the beginning of the first
iteration, and it will run until we've satisfied the amount we need or we
encounter an error.
The lock is now taken outside of the loop to make stuff inside the loop
easier to evaluate w.r.t. locking.
This fixes it to not try and allocate any domains for the freelist under the
spinlock, which would have happened before if we needed any new domains.
Reported by: syzbot+6743fa07b9b7528dc561@syzkaller.appspotmail.com
Reviewed by: markj
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D27371
There is no reason why vp->v_object cannot be NULL. If it is, it's
fine, handle it by delegating to VOP_READ().
Tested by: pho
Reviewed by: markj, mjg
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D27327
- VFS_UNMOUNT() requires vn_start_write() around it [*].
- call VFS_PURGE() before unmount.
- do not destroy mp if cleanup unmount did not succeed.
- set MNTK_UNMOUNT, and indicate forced unmount with MNTK_UNMOUNTF
for VFS_UNMOUNT() in cleanup.
PR: 251320 [*]
Reported by: Tong Zhang <ztong0001@gmail.com>
Reviewed by: markj, mjg
Discussed with: rmacklem
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D27327
The commited patch was incomplete.
- add back missing goto retry, noted by jhb
- 'if (error)' -> 'if (error != 0)'
- consistently do:
if (error != 0)
break;
continue;
instead of:
if (error != 0)
break;
else
continue;
This adds some 'continue' uses which are not needed, but line up with the
rest of pipe_write.
The current logic is a fine choice for a system administrator modifying
process cpusets or a process creating a new cpuset(2), but not ideal for
processes attaching to a jail.
Currently, when a process attaches to a jail, it does exactly what any other
process does and loses any mask it might have applied in the process of
doing so because cpuset_setproc() is entirely based around the assumption
that non-anonymous cpusets in the process can be replaced with the new
parent set.
This approach slightly improves the jail attach integration by modifying
cpuset_setproc() callers to indicate if they should rebase their cpuset to
the indicated set or not (i.e. cpuset_setproc_update_set).
If we're rebasing and the process currently has a cpuset assigned that is
not the containing jail's root set, then we will now create a new base set
for it hanging off the jail's root with the existing mask applied instead of
using the jail's root set as the new base set.
Note that the common case will be that the process doesn't have a cpuset
within the jail root, but the system root can freely assign a cpuset from
a jail to a process outside of the jail with no restriction. We assume that
that may have happened or that it could happen due to a race when we drop
the proc lock, so we must recheck both within the loop to gather up
sufficient freed cpusets and after the loop.
To recap, here's how it worked before in all cases:
0 4 <-- jail 0 4 <-- jail / process
| |
1 -> 1
|
3 <-- process
Here's how it works now:
0 4 <-- jail 0 4 <-- jail
| | |
1 -> 1 5 <-- process
|
3 <-- process
or
0 4 <-- jail 0 4 <-- jail / process
| |
1 <-- process -> 1
More importantly, in both cases, the attaching process still retains the
mask it had prior to attaching or the attach fails with EDEADLK if it's
left with no CPUs to run on or the domain policy is incompatible. The
author of this patch considers this almost a security feature, because a MAC
policy could grant PRIV_JAIL_ATTACH to an unprivileged user that's
restricted to some subset of available CPUs the ability to attach to a jail,
which might lift the user's restrictions if they attach to a jail with a
wider mask.
In most cases, it's anticipated that admins will use this to be able to,
for example, `cpuset -c -l 1 jail -c path=/ command=/long/running/cmd`,
and avoid the need for contortions to spawn a command inside a jail with a
more limited cpuset than the jail.
Reviewed by: jamie
MFC after: 1 month (maybe)
Differential Revision: https://reviews.freebsd.org/D27298
cpuset_init() is better descriptor for what the function actually does. The
name was previously taken by a sysinit that setup cpuset_zero's mask
from all_cpus, it was removed in r331698 before stable/12 branched.
A comment referencing the removed sysinit has now also been removed, since
the setup previously done was moved into cpuset_thread0().
Suggested by: markj
MFC after: 1 week
Currently, it must always allocate a new set to be used for passing to
_cpuset_create, but it doesn't have to. This is purely kern_cpuset.c
internal and it's sparsely used, so just change it to use *setp if it's
not-NULL and modify the two consumers to pass in the address of a NULL
cpuset.
This paves the way for consumers that want the unr allocation without the
possibility of sleeping as long as they've done their due diligence to
ensure that the mask will properly apply atop the supplied parent
(i.e. avoiding the free_unr() in the last failure path).
Reviewed by: jamie, markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D27297
All paths leading into closefp() will either replace or remove the fd from
the filedesc table, and closefp() will call fo_close methods that can and do
currently sleep without regard for the possibility of an ERESTART. This can
be dangerous in multithreaded applications as another thread could have
opened another file in its place that is subsequently operated on upon
restart.
The following are seemingly the only ones that will pass back ERESTART
in-tree:
- sockets (SO_LINGER)
- fusefs
- nfsclient
Reviewed by: jilles, kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D27310
Adding to zombie list can be perfomed by idle threads, which on ppc64 leads to
panics as it requires a sleepable lock.
Reported by: alfredo
Reviewed by: kib, markj
Fixes: r367842 ("thread: numa-aware zombie reaping")
Differential Revision: https://reviews.freebsd.org/D27288
Exec and exit are same as corresponding eventhandler hooks.
Thread exit hook is called somewhat earlier, while thread is still
owned by the process and enough context is available. Note that the
process lock is owned when the hook is called.
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D27309
As far as I can tell, this has been the case since initially committed in
2008. cpuset_setproc is the executor of cpuset reassignment; note this
excerpt from the description:
* 1) Set is non-null. This reparents all anonymous sets to the provided
* set and replaces all non-anonymous td_cpusets with the provided set.
However, reviewing cpuset_setproc_setthread() for some jail related work
unearthed the error: if tdset was not anonymous, we were replacing it with
`set`. If it was anonymous, then we'd rebase it onto `set` (i.e. copy the
thread's mask over and AND it with `set`) but give the new anonymous set
the original tdset as the parent (i.e. the base of the set we're supposed to
be leaving behind).
The primary visible consequences were that:
1.) cpuset_getid() following such assignment returns the wrong result, the
setid that we left behind rather than the one we joined.
2.) When a process attached to the jail, the base set of any anonymous
threads was a set outside of the jail.
This was initially bundled in D27298, but it's a minor fix that's fairly
easy to verify the correctness of.
A test is included in D27307 ("badparent"), which demonstrates the issue
with, effectively:
osetid = cpuset_getid()
newsetid = cpuset()
cpuset_setaffinity(thread)
cpuset_setid(osetid)
cpuset_getid(thread) -> observe that it matches newsetid instead of osetid.
MFC after: 1 week
Providing these in freebsd32.h facilitates local testing/measuring of the
structs rather than forcing one to locally recreate them. Sanity checking
offsets/sizes remains in kern_umtx.c where these are typically used.
oldfde may be invalidated if the table has grown due to the operation that
we're performing, either via fdalloc() or a direct fdgrowtable_exp().
This was technically OK before rS367927 because the old table remained valid
until the filedesc became unused, but now it may be freed immediately if
it's an unshared table in a single-threaded process, so it is no longer a
good assumption to make.
This fixes dup/dup2 invocations that grow the file table; in the initial
report, it manifested as a kernel panic in devel/gmake's configure script.
Reported by: Guy Yur <guyyur gmail com>
Reviewed by: rew
Differential Revision: https://reviews.freebsd.org/D27319
This patch takes advantage of the consolidation that happened to provide two
flags that can be used with the native _umtx_op(2): UMTX_OP___32BIT and
UMTX_OP__I386.
UMTX_OP__32BIT iindicates that we are being provided with 32-bit structures.
Note that this flag alone indicates a 64bit time_t, since this is the
majority case.
UMTX_OP__I386 has been provided so that we can emulate i386 as well,
regardless of whether the host is amd64 or not.
Both imply a different set of copyops in sysumtx_op. freebsd32__umtx_op
simply ignores the flags, since it's already doing a 32-bit operation and
it's unlikely we'll be running an emulator under compat32. Future work
could consider it, but the author sees little benefit.
This will be used by qemu-bsd-user to pass on all _umtx_op calls to the
native interface as long as the host/target endianness matches, effectively
eliminating most if not all of the remaining unresolved deadlocks for most.
This version changed a fair amount from what was under review, mostly in
response to refactoring of the prereq reorganization and battle-testing
it with qemu-bsd-user. The main changes are as follows:
1.) The i386 flag got renamed to omit '32BIT' since this is redundant.
2.) The flags are now properly handled on 32-bit platforms to emulate other
32-bit platforms.
3.) Robust list handling was fixed, and the 32-bit functionality that was
previously gated by COMPAT_FREEBSD32 is now unconditional.
4.) Robust list handling was also improved, including the error reported
when a process has already registered 32-bit ABI lists and also
detecting if native robust lists have already been registered. Both
scenarios now return EBUSY rather than EINVAL, because the input is
technically valid but we're too busy with another ABI's lists.
libsysdecode/kdump/truss support will go into review soon-ish, along with
the associated manpage update.
Reviewed by: kib (earlier version)
MFC after: 3 weeks
During the life of a process, new file descriptor tables may be allocated. When
a new table is allocated, the old table is placed in a free list and held onto
until all processes referencing them exit.
When a new file descriptor table is allocated, the old file descriptor table
can be freed when the current process has a single-thread and the file
descriptor table is not being shared with any other processes.
Reviewed by: kevans
Approved by: kevans (mentor)
Differential Revision: https://reviews.freebsd.org/D18617
While there, do some minor cleanup for kclocks. They are only
registered from kern_time.c, make registration function static.
Remove event hooks, they are not used by both registered kclocks.
Add some consts.
Perhaps we can stop registering kclocks at all and statically
initialize them.
Reviewed by: mjg
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D27305
There is no point in dynamic registration, umtx hook is there always.
Reviewed by: mjg
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D27303
There are message based controllers that can bind interrupts even if they are
not implemented as root controllers (such as the ITS subblock of GIC).
MFC after: 3 weeks
All reads and writes are serialized with a hand-rolled lock, but unlocking it
always wakes up all waiters. Existing flag fields get resized to make room for
introduction of waiter counter without growing the struct.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D27273