Remove page zeroing code from consumers and stop specifying
VM_ALLOC_NOOBJ. In a few places, also convert an allocation loop to
simply use VM_ALLOC_WAITOK.
Similarly, convert vm_page_alloc_domain() callers.
Note that callers are now responsible for assigning the pindex.
Reviewed by: alc, hselasky, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31986
When iterating over the process group members, skip zombies same as it
is done by pfind() for single-process operation.
Reviewed by: emaste, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32513
for state control over TRACE, TRAPCAP, ASLR, PROTMAX, STACKGAP,
NO_NEWPRIVS, and WXMAP.
Reported by: emaste
Reviewed by: emaste, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32513
and remove zeroing of it from specific functions. This way it is
guaranteed that we do not leak kernel data.
Suggested by: markj
Reviewed by: emaste, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32513
Timecounter registration is dynamic, i.e., there is no requirement that
timecounters must be registered during single-threaded boot. Loadable
drivers may in principle register timecounters (which can be switched to
automatically). Timecounters cannot be unregistered, though this could
be implemented.
Registered timecounters belong to a global linked list. Add a mutex to
synchronize insertions and the traversals done by (mpsafe) sysctl
handlers. No functional change intended.
Reviewed by: imp, kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32511
Add a SIG_FOREACH macro that can be used to iterate over a signal set.
This is a bit cleaner and more efficient than calling sig_ffs() in a
loop. The implementation is based on BIT_FOREACH_ISSET(), except
that the bitset limbs are always 32 bits wide, and signal sets are
1-indexed rather than 0-indexed like bitset(9) sets.
issignal() cannot really be modified to use SIG_FOREACH() directly.
Take this opportunity to split the function into two explicit loops.
I've always found this function hard to read and think that this change
is an improvement.
Remove sig_ffs(), nothing uses it now.
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32473
In cases such as daemons launched via limits(1), a process may call
exec multiple times; the last name of the last binary executed is
usually (always?) more informative.
Fixes: 46dd801acb Add userland boot profiling to TSLOG
Sponsored by: https://www.patreon.com/cperciva
This is needed for LinuxKPI's _ioremap_attr. This reuses the generic
implementation introduced for aarch64, and itself requires implementing
pmap_kenter, which is trivial to do given riscv currently treats all
mapping attributes the same due to the Svpbmt extension not yet being
ratified and in hardware.
Reviewed by: markj, mhorne
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D32445
This avoids spurious drop offs as EMPTY is passed regardless of the
actual path name.
Pushign the work inside the lookup instead of just ignorign the flag
allows avoid checking for empty pathname for all other lookups.
On kernels compiled with 'options TSLOG', record for each process ID:
* The timestamp of the fork() which creates it and the parent
process ID,
* The first path passed to execve(), if any,
* The first path resolved by namei, if any, and
* The timestamp of the exit() which terminates the process.
Expose this information via a new sysctl, debug.tslog_user.
On kernels lacking 'options TSLOG' (the default), no information is
recorded and the sysctl does not exist.
Note that recording namei is needed in order to obtain the names of
rc.d scripts being launched, as the rc system sources them in a
subshell rather than execing the scripts.
With this commit it is now possible to generate flamecharts of the
entire boot process from the start of the loader to the end of
/etc/rc. The code needed to perform this processing is currently
found in github: https://github.com/cperciva/freebsd-boot-profiling
Reviewed by: mhorne
Sponsored by: https://www.patreon.com/cperciva
Differential Revision: https://reviews.freebsd.org/D32493
With stack gap enabled top of the stack is moved down by a random
amount of bytes. Because of that some multithreaded applications
which use kern.usrstack sysctl to calculate address of stacks for
their threads can fail. Add kern.stacktop sysctl, which can be used
to retrieve address of the stack after stack gap is applied to it.
Returns value identical to kern.usrstack for processes which have
no stack gap.
Reviewed by: kib
Obtained from: Semihalf
Sponsored by: Stormshield
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D31897
Calling setrlimit with stack gap enabled and with low values of stack
resource limit often caused the program to abort immediately after
exiting the syscall. This happened due to the fact that the resource
limit was calculated assuming that the stack started at sv_usrstack,
while with stack gap enabled the stack is moved by a random number
of bytes.
Save information about stack size in struct vmspace and adjust the
rlim_cur value. If the rlim_cur and stack gap is bigger than rlim_max,
then the value is truncated to rlim_max.
PR: 253208
Reviewed by: kib
Obtained from: Semihalf
Sponsored by: Stormshield
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D31516
Run ktls_init() when the first KTLS session is created rather than
unconditionally during boot. This avoids creating unused threads and
allocating unused resources on systems which do not use KTLS.
Reviewed by: gallatin, markj
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D32487
Without this change, unmounting smbfs filesystems with an INVARIANTS
kernel would panic after 10e64782ed.
Found by: markj
Reviewed by: markj, jhb
Obtained from: CheriBSD
MFC after: 3 days
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D32492
TLS 1.0 records are encrypted as one continuous CBC chain where the
last block of the previous record is used as the IV for the next
record. As a result, TLS 1.0 records cannot be encrypted out of order
but must be encrypted as a FIFO.
If the later pages of a sendfile(2) request complete before the first
pages, then TLS records can be encrypted out of order. For TLS 1.1
and later this is fine, but this can break for TLS 1.0.
To cope, add a queue in each TLS session to hold TLS records that
contain valid unencrypted data but are waiting for an earlier TLS
record to be encrypted first.
- In ktls_enqueue(), check if a TLS record being queued is the next
record expected for a TLS 1.0 session. If not, it is placed in
sorted order in the pending_records queue in the TLS session.
If it is the next expected record, queue it for SW encryption like
normal. In addition, check if this new record (really a potential
batch of records) was holding up any previously queued records in
the pending_records queue. Any of those records that are now in
order are also placed on the queue for SW encryption.
- In ktls_destroy(), free any TLS records on the pending_records
queue. These mbufs are marked M_NOTREADY so were not freed when the
socket buffer was purged in sbdestroy(). Instead, they must be
freed explicitly.
Reviewed by: gallatin, markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D32381
To implement -o emptydir, vfs_emptydir() checks that the passed
directory is empty. This should be done after checking whether the
vnode is of type VDIR, though, or vfs_emptydir() may end up calling
VOP_READDIR on a non-directory.
Reported by: syzbot+4006732c69fb0f792b2c@syzkaller.appspotmail.com
Reviewed by: kib, imp
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32475
Create the initial pool of kprocs on demand when the first socket AIO
request is submitted instead. The pool of kprocs used for other AIO
requests is similarly created on first use.
Reviewed by: asomers
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D32468
This gives the vfs layer a chance to provide handling for EVFILT_VNODE,
for instance. Change pipe_specops to use the default vop_kqfilter to
accommodate fifoops that don't specify the method (i.e. all in-tree).
Based on a patch by Jan Kokemüller.
PR: 225934
Reviewed by: kib, markj (both pre-KASSERT)
Differential Revision: https://reviews.freebsd.org/D32271
These calls do operate on vnodes only, not file contents.
This is useful for e.g. the xdg-document-portal fuse filesystem.
Reviewed by: kib, markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D32438
Support changing the protection of preloaded kernel modules by
implementing pmap_change_prot on arm64 and calling it from
preload_protect.
Reviewed by: alc (previous version)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32026
This can be disabled by sysctl kern.core_dump_can_intr
Reported and tested by: pho
Reviewed by: imp, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32313
Function returns an indicator that the process was killed with SIGKILL
Reviewed by: imp, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32313
Consumers that want the full allocation size will typically access the
full buffer, so mark the entire allocation as valid to avoid useless
KASAN reports.
Sponsored by: The FreeBSD Foundation
The flag should be accessible from non-current threads.
Reviewed by: markj
Tested by: trasz
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32252
This function is actively used by sbuf_vprintf(), so this simple
inlining in half reduces time of kern.geom.confxml generation.
MFC after: 2 weeks
Sponsored by: iXsystem, Inc.
Callout c_time is always bigger or equal than the scheduled time. It
is also smaller than sbinuptime() and can't change while the callback
is running. So we reliably can use it instead of sbinuptime() here.
In case there was a race and the callout was rescheduled to the later
time, the callback will be called again.
According to profiles it saves ~5% of the timer interrupt time even
with fast TSC timecounter.
MFC after: 1 month
Before this change kern.sched.interact sysctl setting above 32 gave
all interactive threads identical priority of PRI_MIN_INTERACT due to
((PRI_MAX_INTERACT - PRI_MIN_INTERACT + 1) / sched_interact) turning
zero. Setting the sysctl lower reduced the range of used priority
levels up to half, that is not great either.
Change of the operations order should fix the issue, always using full
range of priorities, while overflow is impossible there since both
score and priority values are small. While there, make the variables
unsigned as they really are.
MFC after: 1 month