to make it use the right aligned zone.
Reported by: melifaro
Reviewed by: alc, markj (previous version)
Discussed with: jrtc27
Tested by: pho (previous version)
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28219
UMA page_alloc() does not take an alignment, so UMA can only handle
alignment less then page size.
Noted by: alc
Reviewed by: alc, markj (previous version)
Discussed with: jrtc27
Tested by: pho (previous version)
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28219
Use refcount(9) for both pr_ref and pr_uref in struct prison. This
allows prisons to held and freed without requiring the prison mutex.
An exception to this is that dropping the last reference will still
lock the prison, to keep the guarantee that a locked prison remains
valid and alive (provided it was at the time it was locked).
Among other things, this honors the promise made in a comment in
crcopy(9), that it will not block, which hasn't been true for two
decades.
This allows slightly more efficient opcode testing in-kernel. It is
transparent to userland, except to applications that sneakily submit
aio fsync or aio mlock operations via lio_listio, which has never been
documented, requires the use of deliberately undefined constants
(LIO_SYNC and LIO_MLOCK), and is arguably a bug.
Reviewed by: jhb
Differential Revision: https://reviews.freebsd.org/D27942
This patch helped me debug why /sbin/init was not being loaded after
making changes to the image activator in CheriBSD.
Reviewed By: jhb (earlier version), kib
Differential Revision: https://reviews.freebsd.org/D28121
- Only check for empty domains if we actually tried to configure domain
affinity in the first place. Otherwise setting bind_threads=1 will
always cause the sysctl value to be reported as zero. This is
harmless since the threads end up being bound, but it's confusing.
- Try to improve the sysctl description a bit.
Reviewed by: gallatin, jhb
Submitted by: Klara, Inc.
Sponsored by: Ampere Computing
Differential Revision: https://reviews.freebsd.org/D28161
TRYUPGRADE requests kept failing when they should not have due to wrong
macro used to count readers.
Fixes: f6b091fbbd77cbb0 ("lockmgr: rewrite upgrade to stop always dropping the lock")
Noted by: asomers
Differential Revision: https://reviews.freebsd.org/D27947
Previously the code would branch on top find out whether it should
branch on SDT probe and bumping the numposhits counter, depending
on cache_fplookup_cross_mount.
Arguably it should be done regardless of what said function returns.
Move prison_hold, prison_hold_locked ,prison_proc_hold, and
prison_proc_free to a more intuitive part of the file (together with
with prison_free and prison_free_locked), and add or improve comments
to these and others, to better describe what's going in the prison
reference cycle.
No functional changes.
Use a machdep.nirq tunable intead of compile-time constant NIRQ
as a value for maximum number of interrupts. It allows keep a system
footprint small by default with an option to increase the limit
for large systems like server-grade ARM64
Reviewd by: mhorne
Differential Revision: https://reviews.freebsd.org/D27844
Submitted by: Klara, Inc.
Sponsored by: Ampere Computing
prison_isvalid() checks if a prison record can be used at all, i.e.
pr_ref > 0. This filters out prisons that aren't fully created, and
those that are either in the process of being dismantled, or will be
at the next opportunity. While the check for pr_ref > 0 is simple
enough to make without a convenience function, this prepares the way
for other measures of prison validity.
prison_isalive() checks not only validity as far as the useablity of
the prison structure, but also whether the prison is visible to user
space. It replaces a test for pr_uref > 0, which is currently only
used within kern_jail.c, and not often there.
Both of these functions also assert that either the prison mutex or
allprison_lock is held, since it's generally the case that unlocked
prisons aren't guaranteed to remain useable for any length of time.
This isn't entirely true, for example a thread can assume its own
prison is good, but most exceptions will exist inside of kern_jail.c.
Change the power-of-two malloc zones to require alignment equal to the
size [*]. Current uma allocator already provides such alignment, so in
fact this change does not change anything except providing future-proof
setup.
Suggested by: markj [*]
Reviewed by: andrew, jah, markj
Tested by: pho
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28147
disk failure.
Each vnode has an embedded lock that controls access to its contents.
However vnodes describing a UFS snapshot all share a single snapshot
lock to coordinate their access and update. As part of mounting a
UFS filesystem with snapshots, each of the vnodes describing a
snapshot has its individual lock replaced with the snapshot lock.
When the filesystem is unmounted the vnode's original lock is
returned replacing the snapshot lock.
When a disk fails while the UFS filesystem it contains is still
mounted (for example when a thumb drive is removed) UFS forcibly
unmounts the filesystem. The loss of the drive causes the GEOM
subsystem to orphan the provider, but the consumer remains until
the filesystem has finished with the unmount. Information describing
the snapshot locks was being prematurely cleared during the orphaning
causing the return of the snapshot vnode's original locks to fail.
The fix is to not clear the needed information prematurely.
Sponsored by: Netflix
The purpose of this KASSERT is to ensure that we do not run out of space
in the early devmap. However, the devmap grew beyond its initial size of
2MB in r336519, and this assertion did not grow with it.
A devmap mapping of a 1080p framebuffer requires 1920x1080 bytes, or
1.977 MB, so it is just barely able to fit without triggering the
assertion, provided no other devices are mapped before it. With the
addition of `options GDB` in GENERIC by bbfa199cbc16, the uart is now
mapped for the purposes of a debug port, before mapping the framebuffer.
The presence of both these conditions pushes the selected virtual
address just below the threshold, triggering the assertion.
To fix this, use the correct size of the devmap, defined by
PMAP_MAPDEV_EARLY_SIZE. Since this code is shared with RISC-V, define
it for that platform as well (although it is a different size).
PR: 25241
Reported by: gbe
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
This lets callers avoid atomic ops by initializing the count to required
value from the get go.
While here add falloc_abort to backpedal from this without having to
fdrop.
This prevents a situation where other thread modifies map entries
permissions between setting max_prot, then relocking, then setting prot,
confusing the operation outcome. E.g. you can get an error that is not
possible if operation is performed atomic.
Also enable setting rwx for max_prot even if map does not allow to set
effective rwx protection.
Reviewed by: brooks, markj (previous version)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28117
Even if sigfastblock block is non-zero, non-blockable signals must be
checked on ast and delivered now. This also affects debugger ability
to attach, because issignal() also calls ptracestop() if there is
a pending stop for debugee.
Instead of checking for sigfastblock, and either setting PENDING flag
for usermode or doing signal delivery loop, always do the loop after
checking, and then handle PENDING bit. issignal() already does the right
thing for fast-blocked case, allowing only STOPs and SIGKILL delivery to
happen.
Reported by: Vasily Postnicov <shamaz.mazum@gmail.com>, markj
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28089
User pending bit should not be set if kernel did not noted a pending signal.
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28089
Previously, we would accept any kind of LIO_* opcode, including ones
that were intended for in-kernel use only like LIO_SYNC (which is not
defined in userland). The situation became more serious with
022ca2fc7fe08d51f33a1d23a9be49e6d132914e. After that revision, setting
aio_lio_opcode to LIO_WRITEV or LIO_READV would trigger an assertion.
Note that POSIX does not specify what should happen if aio_lio_opcode is
invalid.
MFC-with: 022ca2fc7fe08d51f33a1d23a9be49e6d132914e
Reviewed by: jhb, tmunro, 0mp
Differential Revision: <https://reviews.freebsd.org/D28078
For rate-based resources that support throttling (e.g.
readiops/writeips), this fixes a divide-by-zero panic when rctl(8)
passes 0 as the throttle value. For these resources, treat
zero-throttle requests as requests to suspend forward progress as long
as possible using the duration specified in
kern.racct.rctl.throttle_max.
PR: 251803
Reported by: chris@cretaforce.gr
Reviewed by: kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D27858
After old vmspace is destroyed during execve(2), but before the new space
is fully constructed, an error during image activation cannot be returned
because there is no executing program to receive it.
In the relatively common case of failure to map stack, print some hints
on the control terminal. Note that user has enough knobs to cause stack
mapping error, and this is the most common reason for execve(2) aborting
the process.
Requested by: jhb
Reviewed by: emaste, jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28050
It is checked in vm_map_insert() and vm_map_protect() that PROT_WRITE |
PROT_EXEC are never specified together, if vm_map has MAP_WX flag set.
FreeBSD control flag allows specific binary to request WX exempt, and
there are per ABI boolean sysctls kern.elf{32,64}.allow_wx to enable/
disable globally.
Reviewed by: emaste, jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28050
- Implement a dtrace_getnanouptime(), matching the existing
dtrace_getnanotime(), to avoid DTrace calling out to a potentially
instrumentable function.
(These should probably both be under KDTRACE_HOOKS. Also, it's not clear
to me that they are correct implementations for the DTrace thread time
functions they are used in .. fixes for another commit.)
- Don't allow FBT to instrument functions involved in EL1 exception handling
that are involved in FBT trap processing: handle_el1h_sync() and
do_el1h_sync().
- Don't allow FBT to instrument DDB and KDB functions, as that makes it
rather harder to debug FBT problems.
Prior to these changes, use of FBT on FreeBSD/arm64 rapidly led to kernel
panics due to recursion in DTrace.
Reliable FBT on FreeBSD/arm64 is reliant on another change from @andrew to
have the aarch64 instrumentor more carefully check that instructions it
replaces are against the stack pointer, which can otherwise lead to memory
corruption. That change remains under review.
MFC after: 2 weeks
Reviewed by: andrew, kp, markj (earlier version), jrtc27 (earlier version)
Differential revision: https://reviews.freebsd.org/D27766
Track the the current lock/reference state in a single variable,
rather than deducing the proper prison_deref() flags from a
combination of equations and hard-coded values.
Instead of trying to maintain pg_jobc counter on each process group
update (and sometimes before), just calculate the counter when needed.
Still, for the benefit of the signal delivery code, explicitly mark
orphaned groups as such with the new process group flag.
This way we prevent bugs in the corner cases where updates to the counter
were missed due to complicated configuration of p_pptr/p_opptr/real_parent
(debugger).
Since we need to iterate over all children of the process on exit, this
change mostly affects the process group entry and leave, where we need
to iterate all process group members to detect orpaned status.
(For MFC, keep pg_jobc around but unused).
Reported by: jhb
Reviewed by: jilles
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27871
This improves code structure and allows to put the lock asserts right
into place where the locks are needed.
Also move zeroing of the kinfo_proc structure from fill_kinfo_proc_only()
to fill_kinfo_proc(), this looks more symmetrical.
Reviewed by: jilles
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27871
Proctree lock is needed for correct calculation and collection of the
job-control related data in kinfo_proc. There was even an XXX comment
about it.
Satisfy locking and lock ordering requirements by taking proctree lock
around pass over each bucket in proc_iterate(), and in sysctl_kern_proc()
and note_procstat_proc() for individual process reporting.
Reviewed by: jilles
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27871
Increase the scope of the process group lock ownership. This ensures that
we are consistent in returning EIO for tty write from an orphan and delivery
of TTYOUT signals.
Reviewed by: jilles
Tested by: pho
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27871
Often, we have a process locked and need to get locked process group.
In this case, because progress group lock is before process lock,
unlocking process allows the group to be freed. See for instance
tty_wait_background().
Make pgrp structures allocated from nofree zone, and ensure type stability
of the pgrp mutex.
Reviewed by: jilles
Tested by: pho
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D27871