Move the timerfd impelemntation from linux compat code to sys/kern. Use
it to implement the new system calls for timerfd. Add a hook to kern_tc
to allow timerfd to know when the system time has stepped. Add kqueue
support to timerfd. Adjust a few names to be less Linux centric.
RelNotes: YES
Reviewed by: markj (on irc), imp, kib (with reservations), jhb (slack)
Differential Revision: https://reviews.freebsd.org/D38459
We don't always know the size of the register set at compile time,
e.g. on arm64 the size of the SVE registers need to be queried on boot.
To support register sets that needs to be calculated at run time
query the correct size when it is zero.
Reviewed by: markj, kib (earlier version)
Sponsored by: Arm Ltd
Differential Revision: https://reviews.freebsd.org/D41302
This is an attempt at clean-room implementation of the Linux'
membarrier(2) syscall. For documentation, you would need to read
both membarrier(2) Linux man page, the comments in Linux
kernel/sched/membarrier.c implementation and possibly look at
actual uses.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32360
Restructure parts of pctrie code to make it more compatible with the
needs of vm_radix code.
1. End passing function pointers for memory management.
By breaking insertion into two functions, the call for allocating
memory can happen at the top level and be inlined, rather than
happening via an function pointer to a memory allocator.
By changing the remove function slightly, freeing of memory, when
necessary, can happen at the top level and be inlined.
By turning the reclamation code into two functions, one for starting
iteration over to-be-freed nodes and the other continuing it, all the
freeing can happen at the top level and be inlined.
2. Offer a version of remove that does not panic and returns the freed
value (or NULL).
3. Offer a 'replace' operation, to replace one leaf with another that
has the same key.
These are three of the roadblocks that prevent code sharing between
pctrie and vm_radix code.
Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D41396
This has two effects:
1. We can mergesort the sysinits instead of bubblesorting them, which
shaves about 2 ms off the boot time in Firecracker.
2. Adding more sysinits (e.g. from a KLD) can be performed by sorting
them and then merging the sorted lists, which is both faster than
the previous "append and sort" approach and avoids needing malloc.
Reviewed by: jhb (previous version)
Sponsored by: https://www.patreon.com/cperciva
Differential Revision: https://reviews.freebsd.org/D41075
It was noted in D41404 that these reallocations aren't actually
guaranteed to succeed, despite assertions to the contrary. We're
talking relatively small allocations, so just free up the individual
slot to be reused later as needed.
Note that this doesn't track the last active slot as of this moment, but
this could be done later if we find it's worth the complexity for what
little that would allow to be optimized (osd_call, slightly).
While we're here, fix the debug message that indicates which slot we
just allocated when we find an unused one; the slot # is actually one
higher than the index.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D41409
Per recent discussions on arch@ and at the BSDCan developer summit, we
are considering removing support for 32-bit platforms (in some form)
for 15.0 (at the earliest). A final decision on what will ship in
15.0 will be made closer to the release of 15.0. However, we should
communicate the potential deprecation in 14.0 to provide notice to
users.
This commit adds a warning during boot on 32-bit kernels that they are
deprecated and may be removed in 15.0. More details will be included
in a followup commit to RELNOTES.
Reviewed by: brooks, imp, emaste
Differential Revision: https://reviews.freebsd.org/D41163
As uncovered by e3ba0d6add we are copying lots of irrelevant options
from the listener to an accepted socket, even those that aren't relevant
to a non-listener, e.g. SO_REUSE*, SO_ACCEPTFILTER. Stop doing that
and provide a fixed opt-in list for options to be inherited. Ideally
we shall not inherit anything at all. For compatibility inherit a set
of options that are meaningful for a non-listening socket of a protocol
that can listen(2).
Differential Revision: https://reviews.freebsd.org/D41412
Fixes: e3ba0d6add
If a slot is freed that isn't the last one, we'll set its destructor to
NULL to indicate that it's been freed and leave a hole in the slot map.
Check osd_destructors in osd_call() to avoid dereferencing a method that
is potentially from a module that's been unloaded.
This scenario would most commonly surface when two modules are loaded
that osd_register(), then the earlier one deregisters and an osd_call()
is made after the fact. In the specific report that triggered the
investigation, kldload if_wg -> kldload linux* -> kldunload if_wg ->
destroy a jail -> panic.
Noted in the review, but left for follow-up work, is that the realloc
that may happen in osd_deregister() should likely go away and the
assumption that reallocating to a smaller size cannot fail is actually
not correct.
Reported by: dim
Reviewed by: markj, jamie
Differential Revision: https://reviews.freebsd.org/D41404
subr_rangeset.c is the only source file that calls functions like
pctrie_insert and pctrie_remove directly; other users of pctries use
the PCTRIE_DEFINE macro to define interfaces to pctrie that let them
ignore issues of offsets within structs and uint64_t return values.
Change subr_rangeset.c to use PCTRIE_DEFINE too. And change pctrie.h
to mark the lookup function as unused, to avoid warnings when
compiling files, like subr_rangeset.c, that don't invoke lookup().
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D41391
This reverts commit 96c76d9306.
There are other relatively common reasons why init might get killed
during reboot, the workaround was really not sparc64-specific.
Discussed with: marius
Sponsored by: The FreeBSD Foundation
This reverts commit 5b353925ff.
The reason is the lesser scalability of the vnode' rangelock comparing
with the vnode lock.
Requested by: mjg
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D41334
Previously the log message indicated only "(core dumped)" if a core was
successfully created, or nothing if it was not. This provides
insufficient information to faciliate debugging. Dtrace is no help as
coredump() is static and we cannot find the return value via fbt.
Expand the log message to include error return value information.
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D39942
Early during boot, thread0 runs with td->td_ucred == NULL. This is
fixed up in proc0_init() at SI_SUB_INTRINSIC. If a panic occurs before
then, rather than dereference a NULL pointer, simply allow the thread to
enter KDB.
Reported by: stevek
Reviewed by: mhorne, stevek
MFC after: 1 week
Fixes: cab1056105 ("kdb: Modify securelevel policy")
Differential Revision: https://reviews.freebsd.org/D41280
These are modeled on the API used for m_copydata/m_copyback and the
crypto buffer APIs. One day it might be nice to reduce the
proliferation of these by adding cursors and using memdesc directly
for crypto request buffers.
Reviewed by: markj
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D40615
The computation of keybarr(), the function that determines when a
search has failed at a non-leaf node, can be done in a way that
computes the 'slot' value when keybarr() fails, which is exactly when
slot() would next be invoked. Computing things this way saves space in
search loops.
This reduces the amd64 coding of the search loop in vm_radix_lookup
from 40 bytes to 28 bytes.
Reviewed by: alc
Tested by: pho (as part of a larger change)
Differential Revision: https://reviews.freebsd.org/D41235
The clev field in the node struct is almost always multiplied by
WIDTH; occasionally, it is incremented and then multiplied by
WIDTH. Instructions can be saved by storing it always multiplied by
WIDTH.
For the computation of slot(), this just eliminates a
multiplication. For trimkey(), where the caller always adds one to
clev before passing it as an argument, this change has the caller, not
the caller, do that. Trimkey() handles it not by adding WIDTH to the
input parameter, but by shifting COUNT, and not 1. That produces the
same result, and it relieves keybarr of the need to test to avoid
shifting by more than 63 bits, since level is always <= 63.
This takes 3 instrutions and 14 bytes out of the basic lookup loop on
amd64.
Reviewed by: kib
Tested by: pho (as part of a larger change)
Differential Revision: https://reviews.freebsd.org/D41226
NULL (non-leaf) pointers with NULL leaves, there is a NULL test
removed from every iteration of an index-based search loop.
This speeds up radix trie searches by few percent. If there are any
radix tries that are not initialized with the init() function, but
instead depend on zeroing everything being proper initialization, this
will break those tries.
Reviewed by: alc, kib
Tested by: pho (as part of a larger change)
Differential Revision: https://reviews.freebsd.org/D41171
Currently for the MFS, firmware and VDSO template assembly files we pass
the path to include with .incbin unquoted and use __XSTRING within the
assembly file to stringify it. However, __XSTRING doesn't just perform a
single level of expansion, it performs the normal full expansion of the
macro, and so if the path itself happens to tokenise to something that
includes a defined macro in it that will itself be substituted. For
example, with #define MACRO 1, a path like /path/containing/MACRO/in/it
will expand to /path/containing/1/in/it and then, when stringified, end
up as "/path/containing/1/in/it", not the intended string. Normally,
macros have names that start or end witih underscores and are unlikely
to appear in a tokenised path (even if technically they could), but now
that we've switched to GNU C as of commit ec41a96daa ("sys: Switch the
kernel's C standard from C99 to GNU99.") there are a few new macros
defined which don't start or end with underscores: unix, which is always
defined to 1, and i386, which is defined to 1 on i386. The former
probably doesn't appear in user paths in practice, but the latter has
been seen to and is likely quite common in the wild.
Fix this by defining the macro pre-quoted instead of using __XSTRING.
Note that technically we don't need to do this for vdso_wrap.S today as
all the paths passed to it are safe file names with no user-controlled
prefix but we should do it anyway for consistency and robustness against
future changes.
This allows make tinderbox to pass when built with source and object
directories inside ~/path-with-unix, which would otherwise expand to
~/path-with-1 and break.
PR: 272744
Fixes: ec41a96daa ("sys: Switch the kernel's C standard from C99 to GNU99.")
Upon discovering a violation kmsan_check_arg() passes a pointer to
function parameter shadow state to kmsan_report_hook().
kmsan_report_hook() uses that address to find the origin cells, assuming
that the passed address belongs to the kernel map. This has two
problems:
1) Function parameter origin state is also located in TLS, not in the
origin map, but kmsan_report_hook() doesn't know this.
2) KMSAN TLS for thread0 is statically allocated and thus isn't shadowed
(because the kernel itself is not shadowed).
These bugs could result in inaccuracies in KMSAN reports, or a page
fault when trying to report a KMSAN violation (which by default panics
the kernel anyway).
Fix the problem by making callers of kmsan_report_hook() provide a
pointer to origin cells.
Sponsored by: The FreeBSD Foundation
When we are sending terminating signal to the group, killpg() needs
to guarantee that all group members are to be terminated (it does not
need to ensure that they are terminated on return from killpg()). The
pg_killsx change eliminates the largest window there, but still, if
a multithreaded process is signalled, the following could happen:
- thread 1 is selected for the signal delivery and gets descheduled
- thread 2 waits for pg_killsx lock, obtains it and forks
- thread 1 continue executing and terminates the process
This scenario allows the child to escape still.
Fix it by single-threading forking parent if a conflict with pg_killsx
is noted. We try to lock pg_killsx without sleeping, and failure to
acquire it means that a parallel killpg(2) is executed. Then, stop
other threads for running and in particular, receive signals, to avoid
the situation explained above.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D41128
This should improve signal delivery latency and better expose the
process state to the executing threads.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D41128
This reverts commits 81a37995c7 and
565a343ae3.
There is still a leakage of the p_killpg_cnt, some but not all sources
of which were identified.
Second, and more important, is that there is a fundamental issue with
blocked signals having KSI_KILLPG flag set. Queueing of such signal
increments p_killpg_cnt, but it cannot be decremented until the signal
is delivered. If, for instance, a single-threaded process with blocked
signal receives killpg-kill and executes fork(2), the fork enter check
returns with ERESTART. And since signal is blocked, the condition
cannot be cleared.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D41128
To ensure atomicity of reads against parallel writes and truncates,
vnode lock was not enough at least since introduction of vn_io_fault().
That code only take rangelock when it was possible that vn_read() and
vn_write() could drop the vnode lock.
At least since the introduction of VOP_READ_PGCACHE() which generally
does not lock the vnode at all, rangelocks become required even
for filesystems that do not need vn_io_fault() workaround. For
instance, tmpfs.
PR: 272678
Analyzed and reviewed by: Andrew Gierth <andrew@tao11.riddles.org.uk>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D41158
These are no longer needed after commit c5312bd79e. This did
require adding an include of <sys/limits.h> instead for SIZE_T_MAX
which previously was dragged in via header pollution.
IFCAP2_* has the bit position and not the shifted value.
Reviewed by: kib@
MFC after: 1 week
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D41100
Replace the implementations of lookup_le and lookup_ge with ones
that do not use a stack or climb back up the tree, and instead
exploit the popmap field to quickly identify the place to resume
searching if the straightforward indexed search fails.
The code size of the original functions shrinks by a combined 160
bytes on amd64, and the cumulative cycle count per invocation of
the two functions together is reduced 20% in a buildworld test.
Reviewed by: alc, markj
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D40936
This routine is specific to CAM and no longer assumes any internal
bus_dma knowledge as it is simple wrapper around bus_dmamap_load_mem.
Fixes: 60381fd1ee memdesc: Retire MEMDESC_CCB.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D41058