- pget()/pfind() will acquire the PID hash bucket locks, which are
sleepable sx locks, but this means that the sigio mutex cannot be held
while calling these functions. Instead, use pget() to hold the
process, after which we lock the sigio and proc locks, respectively.
- funsetownlst() assumes that processes cannot be registered for SIGIO
once they have P_WEXIT set. However, pfind() will happily return
exiting processes, breaking the invariant. Add an explicit check for
P_WEXIT in fsetown() to fix this. [1]
Fixes: f52979098d ("Fix a pair of races in SIGIO registration")
Reported by: syzkaller [1]
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31661
rmsr.r_offset now is set to rqsr.r_offset plus the number of bytes
zeroed before hitting the end-of-file. After this change rmsr.r_offset
no longer contains the EOF when the requested operation range is
completely beyond the end-of-file. Instead in such case rmsr.r_offset is
equal to rqsr.r_offset. Callers can obtain the number of bytes zeroed
by subtracting rqsr.r_offset from rmsr.r_offset.
Sponsored by: The FreeBSD Foundation
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D31677
This avoids any unnecessary works in such case.
Sponsored by: The FreeBSD Foundation
Reviewed by: markj, kib
Differential Revision: https://reviews.freebsd.org/D31655
rmacklem@ spotted two things in the system call:
- Upon returning from a successful operation, vop_stddeallocate can
update rmsr.r_offset to a value greater than file size. This behavior,
although being harmless, can be confusing.
- The EINVAL return value for rqsr.r_offset + rqsr.r_len > OFF_MAX is
undocumented.
This commit has the following changes:
- vop_stddeallocate and shm_deallocate to bound the the affected area
further by the file size.
- The EINVAL case for rqsr.r_offset + rqsr.r_len > OFF_MAX is
documented.
- The fspacectl(2), vn_deallocate(9) and VOP_DEALLOCATE(9)'s return
len is explicitly documented the be the value 0, and the return offset
is restricted to be the smallest of off + len and current file size
suggested by kib@. This semantic allows callers to interact better
with potential file size growth after the call.
Sponsored by: The FreeBSD Foundation
Reviewed by: imp, kib
Differential Revision: https://reviews.freebsd.org/D31604
Allow multiple vector IOs to be started with one system call.
aio_readv() and aio_writev() already used these opcodes under the
covers. This commit makes them available to user space.
Being non-standard extensions, they're only visible if __BSD_VISIBLE is
defined, like the functions.
Reviewed by: asomers, kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D31627
Now that we allow recursive unmount attempts to be abandoned upon
exceeding the retry limit, we should avoid leaving an unkillable
thread when a synchronous unmount request was issued against the
base filesystem.
Reviewed by: kib (earlier revision), mkusick
Differential Revision: https://reviews.freebsd.org/D31450
A forcible unmount attempt may fail due to a transient condition, but
it may also fail due to some issue in the filesystem implementation
that will indefinitely prevent successful unmount. In such a case,
the retry logic in the recursive unmount facility will cause the
deferred unmount taskqueue to execute constantly.
Avoid this scenario by imposing a retry limit, with a default value
of 10, beyond which the recursive unmount facility will emit a log
message and give up. Additionally, introduce a grace period, with
a default value of 1s, between successive unmount retries on the
same mount.
Create a new sysctl node, vfs.deferred_unmount, to export the total
number of failed recursive unmount attempts since boot, and to allow
the retry limit and retry grace period to be tuned.
Reviewed by: kib (earlier revision), mkusick
Differential Revision: https://reviews.freebsd.org/D31450
This fixes some of the LORs seen on mount/unmount.
Complete fix will require taking care of unmount as well.
Reviewed by: kib
Tested by: pho (previous version)
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D31611
domain_init() gets reinvoked for each vnet on a system, so we must not
alter global state. Practically speaking, we were creating circular
lists and tying up a softclock thread into an infinite loop.
The breakage here was most easily observed by simply creating a jail
in a new vnet and watching the system suddenly become erratic.
Reported by: markj
Fixes: e0a17c3f06 ("uipc: create dedicated lists for fast ...")
Pointy hat: kevans
This lock no longer exists. It was removed in
a60100fdfc (if: Remove ifnet_rwlock, 2020-11-25)
Reviewed by: mjg
Pointed out by: Dheeraj Kandula <dheerajk@netapp.com>
Different Revision: https://reviews.freebsd.org/D31585
Introduce m_get3() which is similar to m_get2(), but can allocate up to
MJUM16BYTES bytes (m_get2() can only allocate up to MJUMPAGESIZE).
This simplifies the bpf improvement in f13da24715.
Suggested by: glebius
Differential Revision: https://reviews.freebsd.org/D31455
This avoids having to walk all possible protocols only to check if they
have one (vast majority does not).
Original patch by kevans@.
Reviewed by: kevans
Sponsored by: Rubicon Communications, LLC ("Netgate")
When a sigtimedwait(2) caller goes to sleep, it uses a wait channel of
p->p_sigacts with the proc lock as the interlock. However, p_sigacts
can be shared between processes if a child is created with
rfork(RFSIGSHARE | RFPROC). Thus we can end up with two threads
sleeping on the same wait channel using different locks, which is not
permitted.
Fix the problem simply by using a process-unique wait channel, following
the example of sigsuspend. The actual wait channel value is irrelevant
here, sleeping threads are awoken using sleepq_abort().
Reported by: syzbot+8c417afabadb50bb8827@syzkaller.appspotmail.com
Reported by: syzbot+1d89fc2a9ef92ef64fa8@syzkaller.appspotmail.com
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31563
TLS 1.0 empty fragment mbufs have no payload and thus m_epg_npgs is
zero. However, these mbufs need to occupy a "unit" of space for the
purposes of M_NOTREADY tracking similar to regular mbufs. Previously
this was done for the page count returned from ktls_frame() and passed
to ktls_enqueue() as well as the page count passed to pru_ready().
However, sbready() and mb_free_notready() only use m_epg_nrdy to
determine the number of "units" of space in an M_EXT mbuf, so when a
TLS 1.0 fragment was marked ready it would mark one unit of the next
mbuf in the socket buffer as ready as well. To fix, set m_epg_nrdy to
1 for empty fragments. This actually simplifies the code as now only
ktls_frame() has to handle TLS 1.0 fragments explicitly and the rest
of the KTLS functions can just use m_epg_nrdy.
Reviewed by: gallatin
MFC after: 2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D31536
I can see two concerns for adding domains after domainfinalize:
1.) The slow/fast callouts have already been setup.
2.) Userland could create a socket while we're in the middle of
initialization.
We can address #1 fairly easily by tracking whether the domain's been
initialized for at least the default vnet. There are still some concerns
about the callbacks being invoked while a vnet is in the process of
being created/destroyed, but this is a pre-existing issue that the
callbacks must coordinate anyways.
We should also address #2, but technically this has been an issue
anyways because we don't assert on post-domainfinalize additions; we
don't seem to hit it in practice.
Future work can fix that up to make sure we don't find partially
constructed domains, but care must be taken to make sure that at least,
e.g., the usages of pffindproto in ip_input.c can still find them.
Differential Revision: https://reviews.freebsd.org/D25459
This gives any given domain a chance to indicate that it's not actually
supported on the current system. If dom_probe isn't supplied, we assume
the domain is universally applicable as most of them are. Keeping
fully-initialized and registered domains around that physically can't
work on a large majority of FreeBSD deployments is sub-optimal and leads
to errors that aren't consistent with the reality of why the socket
can't be created (e.g. ESOCKTNOSUPPORT) because such scenario has to be
caught upon pru_attach, at which point kicking back the more-appropriate
EAFNOSUPPORT would seem weird.
The initial consumer of this will be hvsock, which is only available on
HyperV guests.
Reviewed by: cem (earlier version), bcr (manpages)
Differential Revision: https://reviews.freebsd.org/D25062
Set NIRES_EMPTYPATH earlies, to have use of EMPTYPATH recorded even if
we are going to return error. When namei_setup() refused to accept dirfd,
which is not of the vnode type, and indicated by ENOTDIR error return,
fall back to kern_fstat(dirfd).
Reported by: dchagin
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31530
This implements fspacectl(2) support on shared memory objects. The
semantic of SPACECTL_DEALLOC is equivalent to clearing the backing
store and free the pages within the affected range. If the call
succeeds, subsequent reads on the affected range return all zero.
tests/sys/posixshm/posixshm_tests.c is expanded to include a
fspacectl(2) functional test.
Sponsored by: The FreeBSD Foundation
Reviewed by: kevans, kib
Differential Revision: https://reviews.freebsd.org/D31490
The addition of ioflag allows callers passing
IO_SYNC/IO_DATASYNC/IO_DIRECT down to the file system implementation.
The vop_stddeallocate fallback implementation is updated to pass the
ioflag to the file system implementation. vn_deallocate(9) internally is
also changed to pass ioflag to the VOP_DEALLOCATE call.
Sponsored by: The FreeBSD Foundation
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D31500
Converted vn_write to use this helper.
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D31513
At least Linux x86 ABI's does not use carry bit and expects that the dx register
is preserved. For this add a new sv_set_fork_retval hook and call it from cpu_fork().
Add a short comment about touching dx in x86_set_fork_retval(), for more details
see phab comments from kib@ and imp@.
Reviewed by: kib
Differential revision: https://reviews.freebsd.org/D31472
MFC after: 2 weeks
Previously, if an encrypted netdump failed, such as due to a timeout or
network failure, the key was not saved, so a partial dump was
completely useless.
Send the key first, so the partial dump can be decrypted, because even a
partial dump can be useful.
Reviewed by: bdrewery, markj
MFC after: 1 week
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D31453
When cloning a ktls session (which is needed when we need to
switch output NICs for a NIC TLS session), we need to also
init the reset task, like we do when creating a new tls session.
Reviewed by: jhb
Sponsored by: Netflix
Make kdb_thr_first() and kdb_thr_next() return sane values if the
allproc list and pidhashtbl haven't been initialized yet. This can
happen if the debugger is entered very early on, for example with the
'-d' boot flag.
This allows remote gdb to attach at such a time, and fixes some ddb
commands like 'show threads'.
Be explicit about the static initialization of these variables. This
part has no functional change.
Reviewed by: markj, imp (previous version)
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D31495
This includes a style fix around ioflag checking as well.
Sponsored by: The FreeBSD Foundation
Reviewed by: kib, bcr
Differential Revision: https://reviews.freebsd.org/D31505
For now, just hook the allocation path: upon allocation, items are
marked as initialized (absent M_ZERO). Some zones are exempted from
this when it would otherwise raise false positives.
Use kmsan_orig() to update the origin map for UMA and malloc(9)
allocations. This allows KMSAN to print the return address when an
uninitialized UMA item is implicated in a report. For example:
panic: MSan: Uninitialized UMA memory from m_getm2+0x7fe
Sponsored by: The FreeBSD Foundation
Sanitizer instrumentation of course cannot automatically update shadow
state when devices write to host memory. KMSAN thus hooks into busdma,
both to update shadow state after a device write, and to verify that the
kernel does not publish uninitalized bytes to devices.
To implement this, when KMSAN is configured, each dmamap embeds a memory
descriptor describing the region currently loaded into the map.
bus_dmamap_sync() uses the operation flags to determine whether to
validate the loaded region or to mark it as initialized in the shadow
map.
Note that in cases where the amount of data written is less than the
buffer size, the entire buffer is marked initialized even when it is
not. For example, if a NIC writes a 128B packet into a 2KB buffer, the
entire buffer will be marked initialized, but subsequent accesses past
the first 128 bytes are likely caused by bugs.
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31338
Interrupt and exception handlers must call kmsan_intr_enter() prior to
calling any C code. This is because the KMSAN runtime maintains some
TLS in order to track initialization state of function parameters and
return values across function calls. Then, to ensure that this state is
kept consistent in the face of asynchronous kernel-mode excpeptions, the
runtime uses a stack of TLS blocks, and kmsan_intr_enter() and
kmsan_intr_leave() push and pop that stack, respectively.
Use these functions in amd64 interrupt and exception handlers. Note
that handlers for user->kernel transitions need not be annotated.
Also ensure that trap frames pushed by the CPU and by handlers are
marked as initialized before they are used.
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31467
- During boot, allocate PDP pages for the shadow maps. The region above
KERNBASE is currently not shadowed.
- Create a dummy shadow for the vm page array. For now, this array is
not protected by the shadow map to help reduce kernel memory usage.
- Grow shadows when growing the kernel map.
- Increase the default kernel stack size when KMSAN is enabled. As with
KASAN, sanitizer instrumentation appears to create stack frames large
enough that the default value is not sufficient.
- Disable UMA's use of the direct map when KMSAN is configured. KMSAN
cannot validate the direct map.
- Disable unmapped I/O when KMSAN configured.
- Lower the limit on paging buffers when KMSAN is configured. Each
buffer has a static MAXPHYS-sized allocation of KVA, which in turn
eats 2*MAXPHYS of space in the shadow map.
Reviewed by: alc, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31295
KMSAN enables the use of LLVM's MemorySanitizer in the kernel. This
enables precise detection of uses of uninitialized memory. As with
KASAN, this feature has substantial runtime overhead and is intended to
be used as part of some automated testing regime.
The runtime maintains a pair of shadow maps. One is used to track the
state of memory in the kernel map at bit-granularity: a bit in the
kernel map is initialized when the corresponding shadow bit is clear,
and is uninitialized otherwise. The second shadow map stores
information about the origin of uninitialized regions of the kernel map,
simplifying debugging.
KMSAN relies on being able to intercept certain functions which cannot
be instrumented by the compiler. KMSAN thus implements interceptors
which manually update shadow state and in some cases explicitly check
for uninitialized bytes. For instance, all calls to copyout() are
subject to such checks.
The runtime exports several functions which can be used to verify the
shadow map for a given buffer. Helpers provide the same functionality
for a few structures commonly used for I/O, such as CAM CCBs, BIOs and
mbufs. These are handy when debugging a KMSAN report whose
proximate and root causes are far away from each other.
Obtained from: NetBSD
Sponsored by: The FreeBSD Foundation
Some filesystems, e.g., devfs, do not populate va_birthtime in their
GETATTR implementations. To handle this, make sure that va_birthtime is
initialized to the quasi-standard value of { VNOVAL, 0 } before calling
VOP_GETATTR.
Reported by: KMSAN
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31468
When the device name is provided, we can simply run strncmp() for each
line to quickly skip unrelated ones, that is much faster than sscanf()
and only then strcmp().
MFC after: 2 weeks
These ones were unambiguous cases where the Foundation was the only
listed copyright holder (in the associated license block).
Sponsored by: The FreeBSD Foundation
VOP_LOOKUP() is called with cn_flags bits ISLASTCN and ISOPEN
to indicate that the lookup is for the last component of a pathname
when doing open.
If the cn_flags also indicates if the open is for Reading, Writing or Both,
the NFSv4 client can do an NFSv4 Open operation in the same compound
RPC as Lookup, often avoiding the additional Open RPC now done when
VOP_OPEN() is called.
This patch defines two new cn_flags bits called OPENREAD and OPENWRITE
and sets these in open2nameif() based on FREAD, FWRITE flag bits.
This will allow a subsequent patch to the NFSv4 client to do the Open
operation in the same RPC as Lookup.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D31431
Use the new PNOLOCK flag to tsleep() to indicate that
we are managing potential races, and don't need to
sleep with a lock, or have a backstop timeout.
Reviewed by: jhb
Sponsored by: Netflix
Add a PNOLOCK flag so that, in the race circumstance where
wakeup races are externally mitigated, tsleep() can be
called with a sleep time of 0 without triggering an
an assertion.
Reviewed by: jhb
Sponsored by: Netflix
98215005b7 introduced a new
thread that uses tsleep(..0) to sleep forever. This hit
an assert due to sleeping with a 0 timeout.
So spell "forever" using SBT_MAX instead, which does not
trigger the assert.
Pointy hat to: gallatin
Pointed out by: emaste
Sponsored by: Netflix
fspacectl(2) is a system call to provide space management support to
userspace applications. VOP_DEALLOCATE(9) is a VOP call to perform the
deallocation. vn_deallocate(9) is a public KPI for kmods' use.
The purpose of proposing a new system call, a KPI and a VOP call is to
allow bhyve or other hypervisor monitors to emulate the behavior of SCSI
UNMAP/NVMe DEALLOCATE on a plain file.
fspacectl(2) comprises of cmd and flags parameters to specify the
space management operation to be performed. Currently cmd has to be
SPACECTL_DEALLOC, and flags has to be 0.
fo_fspacectl is added to fileops.
VOP_DEALLOCATE(9) is added as a new VOP call. A trivial implementation
of VOP_DEALLOCATE(9) is provided.
Sponsored by: The FreeBSD Foundation
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D28347
vn_bmap_seekhole_locked() is factored out version of vn_bmap_seekhole().
This variant requires shared vnode lock being held around the call.
Sponsored by: The FreeBSD Foundation
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D31404
Ktls recently received an optimization where we allocate 16k
physically contiguous crypto destination buffers. This provides a
large (more than 5%) reduction in CPU use in our
workload. However, after several days of uptime, the performance
benefit disappears because we have frequent allocation failures
from the ktls buffer zone.
It turns out that when load drops off, the ktls buffer zone is
trimmed, and some 16k buffers are freed back to the OS. When load
picks back up again, re-allocating those 16k buffers fails after
some number of days of uptime because physical memory has become
fragmented. This causes allocations to fail, because they are
intentionally done without M_NORECLAIM, so as to avoid pausing
the ktls crytpo work thread while the VM system defragments
memory.
To work around this, this change starts one thread per VM domain
to allocate ktls buffers with M_NORECLAIM, as we don't care if
this thread is paused while memory is defragged. The thread then
frees the buffers back into the ktls buffer zone, thus allowing
future allocations to succeed.
Note that waking up the thread is intentionally racy, but neither
of the races really matter. In the worst case, we could have
either spurious wakeups or we could have to wait 1 second until
the next rate-limited allocation failure to wake up the thread.
This patch has been in use at Netflix on a handful of servers,
and seems to fix the issue.
Differential Revision: https://reviews.freebsd.org/D31260
Reviewed by: jhb, markj, (jtl, rrs, and dhw reviewed earlier version)
Sponsored by: Netflix
and remove repetetive code that calculates vnode locking type for write.
Reviewed by: khng, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31405
The spinning start time is missing from the calculation due to a
misplaced #endif. Return the #endif where it's supposed to be.
Submitted by: Alexander Alexeev <aalexeev@isilon.com>
Reviewed by: bdrewery, mjg
MFC after: 1 week
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D31384
Only perform this expensive operation when the unit number is a
potential candidate (i.e. not already in use), thereby reducing device
scan time on systems with many devices, unit numbers, and drivers.
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
X-NetApp-PR: #61
Differential Revision: https://reviews.freebsd.org/D31381
I don't think it changes anything, but why not.
While there, make cpu_search_highest() use all 8 lower load bits for
noise, since it does not use cs_prefer and the code is not shared
with cpu_search_lowest() any more.
MFC after: 1 month
On some load patterns it is possible for several CPUs to try steal
thread from the same CPU despite randomization introduced. It may
cause significant lock contention when holding one queue lock idle
thread tries to acquire another one. Use of trylock on the remote
queue allows both reduce the contention and handle lock ordering
easier. If we can't get lock inside tdq_trysteal() we just return,
allowing tdq_idled() handle it. If it happens in tdq_idled(), then
we repeat search for load skipping this CPU.
On 2-socket 80-thread Xeon system I am observing dramatic reduction
of the lock spinning time when doing random uncached 4KB reads from
12 ZVOLs, while IOPS increase from 327K to 403K.
MFC after: 1 month
When sched_highest() called for some CPU group returns nothing, idle
thread calls it for the parent CPU group. But the parent CPU group
also includes the CPU group we've just searched, and unless there is
a race going on, it is unlikely we find anything new this time.
Avoid the double search in case of parent group having only two sub-
groups (the most prominent case). Instead of escalating to the parent
group run the next search over the sibling subgroup and escalate two
levels up after if that fail too. In case of more than two siblings
the difference is less significant, while searching the parent group
can result in better decision if we find several candidate CPUs.
On 2-socket 40-core Xeon system I am measuring ~25% reduction of CPU
time spent inside cpu_search_highest() in both SMT (2x20x2) and non-
SMT (2x20) cases.
MFC after: 1 month
KASAN and KCSAN implement interceptors for various primitive operations
that are not instrumented by the compiler. KMSAN requires them as well.
Rather than adding new cases for each sanitizer which requires
interceptors, implement the following protocol:
- When interceptor definitions are required, define
SAN_NEEDS_INTERCEPTORS and SANITIZER_INTERCEPTOR_PREFIX.
- In headers that declare functions which need to be intercepted by a
sanitizer runtime, use SANITIZER_INTERCEPTOR_PREFIX to provide
declarations.
- When SAN_RUNTIME is defined, do not redefine the names of intercepted
functions. This is typically the case in files which implement
sanitizer runtimes but is also needed in, for example, files which
define ifunc selectors for intercepted operations.
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
This is required for KASAN: when a module is unloaded, poisoned regions
(e.g., pad areas between global variables) are left as such, so if they
are reused as KLDs are loaded, false positives can arise.
Reported by: pho, Jenkins
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31339
The umtx_pi_frop() will be used by Linux emulation layer.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D31238
MFC after: 2 weeks
The bitset is a Linux emulation layer extension. This 32-bit mask, in which at
least one bit must be set, is used to select which threads should be woken up.
The bitset is stored in the umtx_q structure, which is used to enqueue the waiter
into the umtx waitqueue. Put the bitset into the hole, that appeared on LP64 due
to data alignment, to prevent the growth of the struct umtx_q.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D31234
MFC after: 2 weeks
Add umtx_ prefix to all abs_timeout facility and add declaration for it.
For consistency with others abs_timeout mark inline abs_timeout_init2.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D31249
MFC after: 2 weeks
To prevent umtx.h polluting by future changes split it on two headers:
umtx.h - ABI header for userspace;
umtxvar.h - the kernel staff.
While here fix umtx_key_match style.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D31248
MFC after: 2 weeks
makesyscalls was rewritten in Lua and introduced in d3276301ab. In the
time since, no objections have risen and a warning was introduced long
ago on invocation of makesyscalls.sh that it would be removed before
FreeBSD 13. Belatedly follow through on that.
Remove cpu_search_both(), unused for many years. Without it there is
less sense for the trick of compiling common cpu_search() into separate
cpu_search_lowest() and cpu_search_highest(), so split them completely,
making code more readable. While there, split iteration over children
groups and CPUs, complicating code for very small deduplication.
Stop passing cpuset_t arguments by value and avoid some manipulations.
Since MAXCPU bump from 64 to 256, what was a single register turned
into 32-byte memory array, requiring memory allocation and accesses.
Splitting struct cpu_search into parameter and result parts allows to
even more reduce stack usage, since the first can be passed through
on recursion.
Remove CPU_FFS() from the hot paths, precalculating first and last CPU
for each CPU group in advance during initialization. Again, it was
not a problem for 64 CPUs before, but for 256 FFS needs much more code.
With these changes on 80-thread system doing ~260K uncached ZFS reads
per second I observe ~30% reduction of time spent in cpu_search_*().
MFC after: 1 month
genoffset used the fully generic ASSYM macro to generate the offsets
needed for the thread_lite structure. However, since these are offsets
into a structure, they will always be necessarily small and positive. As
such, just create a simple character array of the right size and use a
naming convention such that we can recover the field name, structure
name and type. Use nm -t d and sort -n to sort these into order, then
loop over the resutls to generate the thread_lite structure.
MFC After: 2 weeks
Reviewed by: kib, markj (earlier versions)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D31203
o Remove redunant $ in $(( )) expression.
o Quote arg passed to work so paths with spaces, etc will work.
MFC After: 2 weeks
Reviewed by: kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D31335
SO_RERROR indicates that receive buffer overflows should be handled as
errors. Historically receive buffer overflows have been ignored and
programs could not tell if they missed messages or messages had been
truncated because of overflows. Since programs historically do not
expect to get receive overflow errors, this behavior is not the
default.
This is really really important for programs that use route(4) to keep
in sync with the system. If we loose a message then we need to reload
the full system state, otherwise the behaviour from that point is
undefined and can lead to chasing bogus bug reports.
Reviewed by: philip (network), kbowling (transport), gbe (manpages)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D26652
to restore ABI compatibility for pre-10.x binaries.
It restores _umtx_lock() and _umtx_unlock() syscalls, and UMTX_OP_LOCK/
UMTX_OP_UNLOCK umtx_op(2) operations. UMUTEX_ERROR_CHECK flag is left
out for now, I do not think it makes a difference.
PR: 218571
Reviewed by: brooks (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31220
Use m_copydata() instead of a direct bcopy() when copying data out of
a source mbuf into a newly-allocated mbuf.
PR: 256610
Reported by: Niels Bakker <niels=freebsd@bakker.net>
Reviewed by: markj
MFC after: 2 weeks
We no longer allow upper filesystems to be unregistered from the base
mount while vfs_notify_upper() or any other upper operation is pending.
New upper mounts can still be registered during this period, but they
will be added at the end of the upper mount tailq. We therefore no
longer need to allocate marker nodes during vfs_notify_upper() to keep
our place in the iteration.
Reviewed by: kib, mckusick
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D31016
In certain emergency cases such as media failure or removal, UFS will
initiate a forced unmount in order to prevent dirty buffers from
accumulating against the no-longer-usable filesystem. The presence
of a stacked filesystem such as nullfs or unionfs above the UFS mount
will prevent this forced unmount from succeeding.
This change addreses the situation by allowing stacked filesystems to
be recursively unmounted on a taskqueue thread when the MNT_RECURSE
flag is specified to dounmount(). This call will block until all upper
mounts have been removed unless the caller specifies the MNT_DEFERRED
flag to indicate the base filesystem should also be unmounted from the
taskqueue.
To achieve this, the recently-added vfs_pin_from_vp()/vfs_unpin() KPIs
have been combined with the existing 'mnt_uppers' list used by nullfs
and renamed to vfs_register_upper_from_vp()/vfs_unregister_upper().
The format of the mnt_uppers list has also been changed to accommodate
filesystems such as unionfs in which a given mount may be stacked atop
more than one lower mount. Additionally, management of lower FS
reclaim/unlink notifications has been split into a separate list
managed by a separate set of KPIs, as registration of an upper FS no
longer implies interest in these notifications.
Reviewed by: kib, mckusick
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D31016
Mount options aren't solely ASCII strings. In addition, experience to
date suggests that the mount options are much less useful than was
originally supposed and the mount flags suffice to make decisions. Drop
the reporting of options for the mount/remount/unmount events.
Reviewed by: markj
Reported by: KASAN
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D31287
This variable is only used to rate-limit "Root mount waiting for: ..."
messages using ppsratecheck().
Reported by: KMSAN
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
KASAN hooks will not generate reports if panicstr != NULL, but then
there is a window after the initial panic() call where another report
may be raised. This can happen if a false positive occurs; to simplify
debugging of such problems, avoid recursing.
Sponsored by: The FreeBSD Foundation
ZFS creates some sysctl nodes that include a pool name, and '.' is an
allowed character in pool names. But it's the separator in the sysctl
tree, so it can't be included in a sysctl name. Replace it with "%25".
Handily, "%" is illegal in ZFS pool names, so there's no ambiguity
there.
PR: 257316
MFC after: 3 weeks
Sponsored by: Axcient
Reviewed by: freqlabs
Differential Revision: https://reviews.freebsd.org/D31265
parse_dir_md() opens /dev/mdctl but only closes the resulting fd on
success, not upon failure of the ioctl or when we exceed the md unit
max.
Reviewed by: kib (slightly previous version)
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
X-NetApp-PR: #62
Differential Revision: https://reviews.freebsd.org/D31229
This builds on recently introduced NO_NEW_PRIVS flag to implement
unprivileged chroot, enabled by `security.bsd.unprivileged_chroot`.
It allows non-root processes to chroot(2), provided they have the
NO_NEW_PRIVS flag set.
The chroot(8) utility gets a new flag, -n, which sets NO_NEW_PRIVS
before chrooting.
Reviewed By: kib
Sponsored By: EPSRC
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D30130
Temporary add stubs to the Linux emulation layer which calls the existing hook.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D30911
MFC after: 2 weeks
For future use in the Linux emulation layer call sv_onexec hook right after
the new process address space is created. It's safe, as sv_onexec used only
by Linux abi and linux_on_exec() does not depend on a state of process VA.
Reviewed by: kib
Differential revision: https://reviews.freebsd.org/D30899
MFC after: 2 weeks
For future use in the Linux emulation layer modify the exec_sysvec_init()
to allow non-native abi to fill sv_timekeep_base and sv_shared_page_obj.
Reviewed by: kib
Differential revision: https://reviews.freebsd.org/D30898
MFC after: 2 weeks
The early environment is typically cleared, so these new options
need the PRESERVE_EARLY_KENV kernel config(8) option. These environments
are reported as missing by kenv(1) if the option is not present in the
running kernel.
Reviewed by: imp
Differential Revision: https://reviews.freebsd.org/D30835
Some downstream configurations do not store secrets in the
early (loader/static) environments and desire a way to preserve these
for diagnostic reasons. Provide an option to do so.
Reviewed by: imp, jhb (earlier version)
Differential Revision: https://reviews.freebsd.org/D30834
The syscall number is stored in the same register as the syscall return
on amd64 (and possibly other architectures) and so it is impossible to
recover in the signal handler after the call has returned. This small
tweak delivers it in the `si_value` field of the signal, which is
sufficient to catch capability violations and emulate them with a call
to a more-privileged process in the signal handler.
This reapplies 3a522ba1bc with a fix for
the static assertion failure on i386.
Approved by: markj (mentor)
Reviewed by: kib, bcr (manpages)
Differential Revision: https://reviews.freebsd.org/D29185
One is allowed to use LIO_NOWAIT without specifying a sigevent. In this
case, lj->lioj_signal is left uninitialized, but several code paths
examine liov_signal.sigev_notify to figure out which notification to
post. Unconditionally initialize that field to SIGEV_NONE.
Add a dumb test case which triggers the bug.
Reported by: KMSAN+syzkaller
Reviewed by: asomers
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31197
Commit bb4a27f927 added the ability to allocate a span of blocks
crossing a meta node boundary. To ensure that blst_next_leaf_alloc()
does not walk past the end of the tree, an extra all-zero meta node
needs to be present at the end of the allocation, and
blst_next_leaf_alloc() is implemented such that the presence of this
node terminates the search.
blist_create() computes the number of nodes required. It had two
problems:
1. When the size of the blist is a power of BLIST_RADIX, we would
unnecessarily allocate an extra level in the tree.
2. When the size of the blist is a multiple of BLIST_RADIX, we would
fail to allocate a terminator node. In this case,
blst_next_leaf_alloc() could scan beyond the bounds of the
allocation. This was found using KASAN.
Modify blist_create() to handle these cases correctly.
Reported by: pho
Reviewed by: dougm
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D31158
Its callers do not make use the modified size that malloc_large() was
returning, so there's no need to pass a pointer. No functional change
intended.
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
The syscall number is stored in the same register as the syscall return
on amd64 (and possibly other architectures) and so it is impossible to
recover in the signal handler after the call has returned. This small
tweak delivers it in the `si_value` field of the signal, which is
sufficient to catch capability violations and emulate them with a call
to a more-privileged process in the signal handler.
Approved by: markj (mentor)
Reviewed by: kib, bcr (manpages)
Differential Revision: https://reviews.freebsd.org/D29185
It will be called during KLD unload to unpoison the redzones following
global variables. Otherwise, virtual address ranges previously used for
a KLD may be left tainted, triggering false positives when they are
recycled.
Reported by: pho
Sponsored by: The FreeBSD Foundation
The first release of an interrupt in a situation where the interrupt table
is full should schedule a full table check the next time an interrupt is
allocated. A full check is necessary to ensure maximum separation between
the order of allocation and the order of release.
Submitted by: ehem_freebsd@m5p.com (initial version)
Discussed in: https://reviews.freebsd.org/D29310
MFC after: 4 weeks
HPTS drives both rack and bbr, and yet there have been many complaints
about performance. This bit of work restructures hpts to help reduce CPU
overhead. It does this by now instead of relying on the timer/callout to
drive it instead use user return from a system call as well as lro flushes
to drive hpts. The timer becomes a backstop that dynamically adjusts
based on how "late" we are.
Reviewed by: tuexen, glebius
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D31083
Use sysentvec hooks to only call umtx_thread_exit/umtx_exec, which handle
robust mutexes, for native FreeBSD ABI. Similarly, there is no sense
in calling sigfastblock_clear() for non-native ABIs.
Requested by: dchagin
Reviewed by: dchagin, markj (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D30987
after itimers are stopped. This makes it more usable for e.g. native FreeBSD
ABI sysentvecs.
Reviewed by: dchagin, markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D30987
Unlike sv_onexec(), it is called from the old (pre-exec) sysentvec structure.
The old vmspace for the process is still intact during the call.
Reviewed by: dchagin, markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D30987
The intent is to eliminate the MT_NOINIT flag and consequently a branch
from the constructor.
Reviewed by: gallatin
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D31080
Ifnet (inline) hw kTLS NICs typically keep state within
a TLS record, so that when transmitting in-order,
they can continue encryption on each segment sent without
DMA'ing extra state from the host.
This breaks down when transmits are out of order (eg,
TCP retransmits). In this case, the NIC must re-DMA
the entire TLS record up to and including the segment
being retransmitted. This means that when re-transmitting
the last 1448 byte segment of a TLS record, the NIC will
have to re-DMA the entire 16KB TLS record. This can lead
to the NIC running out of PCIe bus bandwidth well before
it saturates the network link if a lot of TCP connections have
a high retransmoit rate.
This change introduces a new sysctl (kern.ipc.tls.ifnet_max_rexmit_pct),
where TCP connections with higher retransmit rate will be
switched to SW kTLS so as to conserve PCIe bandwidth.
Reviewed by: hselasky, markj, rrs
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D30908
Since commit 2dd1bdf183 in 2016 the r_start and r_end fields have been
rman_res_t, which was briefly unsigned long, but commit da1b038af9
changed the typedef to be uintmax_t instead. C99 is also something we
assume these days.
Reviewed by: imp
Differential Revision: https://reviews.freebsd.org/D30808
The intent is to remove all direct zone_mbuf consumers so that ctor/dtor
from that zone can be reimplemented as wrappers around uma, avoiding an
indirect function call.
Reviewed by: kbowling
Discussed with: gallatin
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D30959
The flag was added in 2016 but remains unused.
Reviewed by: kbowling
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D30958
This introduces a new, per-process flag, "NO_NEW_PRIVS", which
is inherited, preserved on exec, and cannot be cleared. The flag,
when set, makes subsequent execs ignore any SUID and SGID bits,
instead executing those binaries as if they not set.
The main purpose of the flag is implementation of Linux
PROC_SET_NO_NEW_PRIVS prctl(2), and possibly also unpriviledged
chroot.
Reviewed By: kib
Sponsored By: EPSRC
Differential Revision: https://reviews.freebsd.org/D30939
Instead of p_elf_machine use machine member of the Elf_Brandinfo which is now
cached in the struct proc at p_elf_brandinfo member.
Note to MFC: D30918, KBI
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D30926
MFC after: 2 weeks
To allow the ABI to make a dicision based on the Brandinfo add a link
to the Elf_Brandinfo into the struct proc. Add a note that the high 8 bits
of Elf_Brandinfo flags is private to the ABI.
Note to MFC: it breaks KBI.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D30918
MFC after: 2 weeks
This adds `sv_elf_core_osabi`, `sv_elf_core_abi_vendor`,
and `sv_elf_core_prepare_notes` fields to `struct sysentvec`,
and modifies imgact_elf.c to make use of them instead
of hardcoding FreeBSD-specific values. It also updates all
of the ABI definitions to preserve current behaviour.
This makes it possible to implement non-native ELF coredump
support without unnecessary code duplication. It will be used
for Linux coredumps.
Reviewed By: kib
Sponsored By: EPSRC
Differential Revision: https://reviews.freebsd.org/D30921
Introduce SLEEPQ_DROP sleepq_signal() flag, allowing one to drop the
sleep queue chain lock before returning. Reduced lock scope allows
significantly reduce lock contention inside taskqueue_enqueue() for
ZFS worker threads doing ~350K disk reads/s on 40-thread system.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
Its intent is to do the initialization of the future part of struct nameidata
which should be used across several namei() and VOPs. Right now it is NOP.
Reviewed by: mckusick
Discussed with: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D30041
Now that the upper layers all go through a layer to tie into these
information functions that translates an sbuf into char * and len. The
current interface suffers issues of what to do in cases of truncation,
etc. Instead, migrate all these functions to using struct sbuf and these
issues go away. The caller is also in charge of any memory allocation
and/or expansion that's needed during this process.
Create a bus_generic_child_{pnpinfo,location} and make it default. It
just returns success. This is for those busses that have no information
for these items. Migrate the now-empty routines to using this as
appropriate.
Document these new interfaces with man pages, and oversight from before.
Reviewed by: jhb, bcr
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D29937
The new buffer is somewhat larger, but there should be no functional
changes.
Reviewed By: kib, imp
Sponsored By: EPSRC
Differential Revision: https://reviews.freebsd.org/D30821
The i386 loader (and hopefully others to come) now passes tslog data
as a "preloaded module". Include this in the data returned by the
debug.tslog sysctl.
Reviewed by: kevans
armv6 and armv7 systems already were 1000Hz. The other armv5 were a
mix of 100 and 1000. This changes them to 1000. Should there be
issues, we can add options HZ=100 to the systems that have bad
performance at the drop of a hat.
mips is a lot more complicated. But most of the systems are already
1000HZ. The hardware exceptions are all fast enough to run at
1000Hz. MALTA is our primary emulator, and history has shown emulators
tend to like 100Hz better, so run those systems at 100Hz. As with arm,
any system that shows a huge performance regression can reverted to
100Hz easily.
This was going to be committed well in advance of the 13 branch, but
it was delayed and forgotten til now.
Discussed on: #bsdmips ages ago
Sponsored by: Netflix
The TOE driver might receive decrypted TLS records that are enqueued
to the socket buffer after ktls_try_toe() returns and before
ktls_enable_rx() locks the receive buffer to call sb_mark_notready().
In that case, sb_mark_notready() would incorrectly treat the decrypted
TLS record as an encrypted record and schedule it for decryption.
This always resulted in the connection being dropped as the data in
the control message did not look like a valid TLS header.
To fix, don't try to handle software decryption of existing buffers in
the socket buffer for TOE TLS in ktls_enable_rx(). If a TOE TLS
driver needs to decrypt existing data in the socket buffer, the driver
will need to manage that in its tod_alloc_tls_session method.
Sponsored by: Chelsio Communications
It seems that Linux does not dequeue siginfo for SIGCHLD when wait*(2)
reports status of the running process. In particular, sigwaitinfo(2)
and other signal querying syscalls can observe the siginfo after wait.
FreeBSD dequeued siginfo from the beginning, so we cannot change the
default ABI to be more compatible. Still, add a knob to enable to
change to the other behavior for debugging purposes.
Reported by: dchagin
Reviewed by: dchagin, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30675
Traditionally, BSD drops signals with the default action during send,
not even putting them to the destination process queue. This semantic
is not shared with other operating systems (Linux), which do queue
such signals. In particular, sigtimedwait(2) and related syscalls can
observe the delivery.
Add a global knob kern.sig_discard_ign which can be set to false to force
enqueuing of the signals with default action. Also add an ABI flag to
indicate that signals should be queued.
Note that it is not practical to run with the knob turned on, because almost
all software that care about the delivery of such signals, is aware of the
difference, and misbehaves if the signals are actually queued. The purpose
of the knob as is is to allow for easier diagnostic of the programs that
need the adjustments, to confirm the cause of problem.
Reported by: dchagin
Reviewed by: dchagin, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30675
Some code was using it already, but in many places we were testing
SO_ACCEPTCONN directly. As a small step towards fixing some bugs
involving synchronization with listen(2), make the kernel consistently
use SOLISTENING(). No functional change intended.
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
uipc_ktls.c was missing opt_ratelimit.h, so it was
never noticing that RATELIMIT was enabled. Once it was
enabled, it failed to compile as ktls_modify_txrtlmt()
had accrued a compilation error when it was not being
compiled in.
Sponsored by: Netflix
The kern_poll_kfds() operates on clear kernel data, kfds points to an
array in the kernel, while kern_poll() operates on user supplied pollfd.
Move nfds check to kern_poll_maxfds().
No functional changes, it's for future use in the Linux emulation layer.
Reviewd by: kib
Differential Revision: https://reviews.freebsd.org/D30690
MFC after: 2 weeks
The eventfd code was written by me, rdivacky@ copyrigth applicable only
to epoll part of the Linuxulator code. Roman is ok to retire his copyright
from sys/kern/sys_eventfd.c and 'All rights reserved.' lines from
sys/compat/linux/linux_event.[c|h] and sys/kern/sys_eventfd.c files.
Reviewed by: kib, emaste
Approved by: rdivacky
Differential Revision: https://reviews.freebsd.org/D30677
MFC after: 2 weeks
It was meant to suppress only the printf(), not the subsequent injection
of Giant-protected thunks for various file operations.
Fixes: fbeb4ccac9
Reported by: pho
Tested by: pho
MFC after: 6 days
Pointy hat: markj
During boot we warn that the kbd and openfirm drivers are Giant-locked
and may be deleted. Generally, the warning helps signal that certain
old drivers are not being maintained and are subject to removal, but
this doesn't really apply to certain drivers which are harder to
detangle from Giant.
Add a flag, D_GIANTOK, that devices can specify to suppress the
misleading warning. Use it in the kbd and openfirm drivers.
Reviewed by: imp, jhb
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D30649