This is needed to ensure that resolvers that reference global symbols
return correct results.
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33120
This avoids the need to keep a freebsd32-specific syscalls.master
in sync with the default ABI. As evidenced by the number of commits
required to sync the two, it is extremely easy for them to get out
of sync due to misunderstandings and user errors.
Reviewed by: kevans, kib
Add _Contains_ annotations indicating that the data pointed to by a
pointer argument contains types that vary between FreeBSD ABIs. The
supported set is long (including size_t), pointer (including
intptr_t), and time_t. The first two vary between 32- and 64-bit
ABIs. The laste betwen i386 and everything else.
These will be used to detect which syscalls require handling on
particular ABIs.
Reviewed by: kevans, kib
Optionally return errors when truncating dev_t, ino_t, and nlink_t.
In the interest of code reuse, use freebsd11_cvtstat() to perform the
truncation and error handling and then convert the resulting struct
freebsd11_stat to struct nstat.
Add missing freebsd32 compat syscalls. These syscalls require
translation because struct nstat contains four instances of struct
timespec which in turn contains a time_t and a long.
Reviewed by: kib
The last usage of this function was removed in e3b1c847a4.
There are no in-tree consumers of kernel_vmount().
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D32607
Hyper-V wants to register its MSR-based timecounter during
SI_SUB_HYPERVISOR, before SI_SUB_LOCK, since an emulated 8254 may not be
available for DELAY(). So we cannot use MTX_SYSINIT to initialize the
timecounter lock.
PR: 259878
Reviewed by: kib
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D33014
Add a boolean parameter to minidumpsys(), to indicate a live dump. When
requested, take a snapshot of important global state, and pass this to
the machine-dependent minidump function. For now this includes the
kernel message buffer, and the bitset of pages to be dumped. Beyond
this, we don't take much action to protect the integrity of the dump
from changes in the running system.
A new function msgbuf_duplicate() is added for snapshotting the message
buffer. msgbuf_copy() is insufficient for this purpose since it marks
any new characters it finds as read.
For now, nothing can actually trigger a live minidump. A future patch
will add the mechanism for this. For simplicity and safety, live dumps
are disallowed for mips.
Reviewed by: markj, jhb
MFC after: 2 weeks
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D31993
The minidump code is written assuming that certain global state will not
change, and rightly so, since it executes from a kernel debugger
context. In order to support taking minidumps of a live system, we
should allow copies of relevant global state that is likely to change to
be passed as parameters to the minidumpsys() function.
This patch does the work of parameterizing this function, by adding a
struct minidumpstate argument. For now, this struct allows for copies of
the kernel message buffer, and the bitset that tracks which pages should
be dumped (vm_page_dump). Follow-up changes will actually make use of
these arguments.
Notably, dump_avail[] does not need a snapshot, since it is not expected
to change after system initialization.
The existing minidumpsys() definitions are renamed, and a thin MI
wrapper is added to kern_dump.c, which handles the construction of
the state struct. Thus, calling minidumpsys() remains as simple as
before.
Reviewed by: kib, markj, jhb
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D31989
Reviewed by: kib
Obtained from: CheriBSD
Sponsored by: The University of Cambridge, Google Inc.
Differential Revision: https://reviews.freebsd.org/D33052
Reviewed by: kib
Obtained from: CheriBSD
Sponsored by: The University of Cambridge, Google Inc.
Differential Revision: https://reviews.freebsd.org/D33051
Add freebsd32 versions of getfsstat and freebsd11_getfsstat so that
bufsize is properly sign-extended if a negative value is passed.
Reject negative values before passing to kern_getfsstat as a size_t.
Reviewed by: kevans
Previously, the code would copy twice as many pointers as specified
and print pairs of them a single 64-bit pointer.
abort2 doesn't return so make the return type void
freebsd32_abort2 is in it's own file with a 2-clause BSD license
based on a discussion with Wojciech many years ago.
Reviewed by: kevans
Use this for COMPAT7 support. In practice it's the same as
union semun32 since the pointers become uint32_t's the it's more
symetric and is the logical thing to generate from semun_old.
Reviewed by: kevans
Match the function decleration which takes an int not a signed int.
No functional change as the range of valid values is 0-2.
Obtained from: CheriBSD
Reviewed by: kevans
Address Space Layout Randomization (ASLR) is an exploit mitigation
technique implemented in the majority of modern operating systems.
It involves randomly positioning the base address of an executable
and the position of libraries, heap, and stack, in a process's address
space. Although over the years ASLR proved to not guarantee full OS
security on its own, this mechanism can make exploitation more difficult.
Tests on the tier 1 64-bit architectures demonstrated that the ASLR is
stable and does not result in noticeable performance degradation,
therefore it should be safe to enable this mechanism by default.
Moreover its effectiveness is increased for PIE (Position Independent
Executable) binaries. Thanks to commit 9a227a2fd6 ("Enable PIE by
default on 64-bit architectures"), building from src is not necessary
to have PIE binaries. It is enough to control usage of ASLR in the
OS solely by setting the appropriate sysctls.
This patch toggles the kernel settings to use address map randomization
for PIE & non-PIE 64-bit binaries. It also disables SBRK, in order
to allow utilization of the bss grow region for mappings. The latter
has no effect if ASLR is disabled, so apply it to all architectures.
As for the drawbacks, a consequence of using the ASLR is more
significant VM fragmentation, hence the issues may be encountered
in the systems with a limited address space in high memory consumption
cases, such as buildworld. As a result, although the tests on 32-bit
architectures with ASLR enabled were mostly on par with what was
observed on 64-bit ones, the defaults for the former are not changed
at this time. Also, for the sake of safety keep the feature disabled
for 32-bit executables on 64-bit machines, too.
The committed change affects the overall OS operation, so the
following should be taken into consideration:
* Address space fragmentation.
* A changed ABI due to modified layout of address space.
* More complicated debugging due to:
* Non-reproducible address space layout between runs.
* Some debuggers automatically disable ASLR for spawned processes,
making target's environment different between debug and
non-debug runs.
In order to confirm/rule-out the dependency of any encountered issue
on ASLR it is strongly advised to re-run the test with the feature
disabled - it can be done by setting the following sysctls
in the /etc/sysctl.conf file:
kern.elf64.aslr.enable=0
kern.elf64.aslr.pie_enable=0
Co-developed by: Dawid Gorecki <dgr@semihalf.com>
Reviewed by: emaste, kib
Obtained from: Semihalf
Sponsored by: Stormshield
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D27666
Some upcoming changes will modify software checksum routines like
in_cksum() to operate using m_apply(), which uses the direct map to
access packet data for unmapped mbufs. This approach of course does not
work on platforms without a direct map, so we have to disallow the use
of unmapped mbufs on such platforms.
I believe this is the right tradeoff: we only configure KTLS on amd64
and arm64 today (and one KTLS consumer, NFS TLS, requires a direct map
already), and the use of unmapped mbufs with plain sendfile is a recent
optimization. If need be, m_apply() could be modified to create
CPU-private mappings of extpg mbuf pages as a fallback.
So, change mb_use_ext_pgs to be hard-wired to zero on systems without a
direct map. Note that PMAP_HAS_DMAP is not a compile-time constant on
some systems, so the default value of mb_use_ext_pgs has to be
determined during boot.
Reviewed by: jhb
Discussed with: gallatin
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32940
When doing an initial mount(8) with its -f (force) flag, the MNT_FORCE
flag is not passed through to the underlying filesystem mount routine.
MNT_FORCE is only passed through on later updates to an existing
mount. With this commit the MNT_FORCE flag is now passed through on the
initial mount.
Sanity check: kib
Sponsored by: Netflix
This is how most SYSINITs are defined. Also annotate the dummy
parameter with __unused. No functional change intended.
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Rename to match the naming of syscalls and allow 32 to be appended
without making an ugly name like kevent_freebsd1132.
While here, make the kevent changelist argument const.
Reviewed by: kib
This will be used to break a deadlock in ZFS between the per-mountpoint
teardown lock and page busy locks. In particular, when purging data
from the page cache during dataset rollback, we want to avoid blocking
on the busy state of invalid pages since the busying thread may be
blocked on the teardown lock in zfs_getpages().
Add a helper, vn_pages_remove_valid(), for use by filesystems. Bump
__FreeBSD_version so that the OpenZFS port can make use of the new
helper.
PR: 258208
Reviewed by: avg, kib, sef
Tested by: pho (part of a larger patch)
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32931
- Return an errno value upon failure, instead of 1.
- Provide a bus_translate_resource() wrapper.
- Implement the generic version, which traverses the hierarchy until a
bus driver with a non-trivial implementation is found, in subr_bus.c
like other similar default implementations.
- Make ofw_pcib_translate_resource() return an error if a matching PCI
address range is not found.
- Make generic_pcie_translate_resource_common() return an int instead of
a bool. Fix up callers.
No functional change intended.
Reviewed by: imp, jhb
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32855
We do not require devvp vnode locked for metadata io. It is typically
not needed indeed, since correctness of the file system using
corresponding block device ensures that there is no incorrect or racy
manipulations.
But right now DEBUG_VFS_LOCKS option excludes both character device
vnodes and completely destroyed (VBAD) vnodes from asserts. This is not
too bad since WITNESS still ensures that we do not leak locks. On the
other hand, asserts do not mean what they should, to the reader, and
reliance on them being enforced might result in wrong code.
Note that ASSERT_VOP_LOCKED() still silently accepts NULLVP, I think it
is worth fixing as well, in the next round.
In collaboration with: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D32761
For devfs vnodes, it is fine to not lock vnodes for VOP_FSYNC().
Otherwise vnode must be locked exclusively, except for MNT_SHARED_WRITES()
where the shared lock is enough.
Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32761
We were not including the requested starting offset in the page offset.
Reviewed by: jhb
Fixes: 3c7a01d773 ("Extend m_apply() to support unmapped mbufs.")
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32922
for compatibility with Linux.
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32901
Previously, sorele() always required the socket lock and dropped the
lock if the released reference was not the last reference. Many
callers locked the socket lock just before calling sorele() resulting
in a wasted lock/unlock when not dropping the last reference.
Move the previous implementation of sorele() into a new
sorele_locked() function and use it instead of sorele() for various
places in uipc_socket.c that called sorele() while already holding the
socket lock.
The sorele() macro now uses refcount_release_if_not_last() try to drop
the socket reference without locking the socket. If that shortcut
fails, it locks the socket and calls sorele_locked().
Reviewed by: kib, markj
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D32741
Some syscalls checked for invalid AT_* flags in sys_* and others in
kern_*.
Reviewed by: kib
Obtained from: CheriBSD
Sponsored by: The University of Cambridge, Google Inc.
Differential Revision: https://reviews.freebsd.org/D32864
When the NFSv4.2 server does a VOP_ALLOCATE(), it needs
the operation to be done for the RPC's credential and not
td_ucred. It also needs the writing to be done synchronously.
This patch adds "ioflag" and "cred" arguments to VOP_ALLOCATE()
and modifies vop_stdallocate() to use these arguments.
The VOP_ALLOCATE.9 man page will be patched separately.
Reviewed by: khng, kib
Differential Revision: https://reviews.freebsd.org/D32865
sched_throw() can no longer take a NULL thread, APs enter through
sched_ap_entry() instead. This completely removes branching in the
common case and cleans up both paths. No functional change intended.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D32829
This define will later on be used by coming TLS RX hardware offload patches.
No functional change intended.
Reviewed by: jhb@
MFC after: 1 week
Sponsored by: NVIDIA Networking
Normally setting kern.ipc.maxsockets returns EINVAL if the new value
is not greater than the previous value. This can cause spurious
error messages when sysctl.conf is processed multiple times, or when
automation systems try to ensure the sysctl is set to the correct
value. If the value is unchanged, then just do nothing.
PR: 243532
Reviewed by: markj
MFC after: 3 days
Sponsored by: Modirum MDPay
Sponsored by: Klara Inc.
Differential Revision: https://reviews.freebsd.org/D32775
Change the 'period' argument to 'duration' and change its type to
sbintime_t so we can more easily express different durations.
Reviewed by: tsoome, glebius
Differential Revision: https://reviews.freebsd.org/D32619
schedinit_ap() sets up an AP for a later call to sched_throw(NULL).
Currently, ULE sets up some pcpu bits and fixes the idlethread lock with
a call to sched_throw(NULL); this results in a window where curthread is
setup in platforms' init_secondary(), but it has the wrong td_lock.
Typical platform AP startup procedure looks something like:
- Setup curthread
- ... other stuff, including cpu_initclocks_ap()
- Signal smp_started
- sched_throw(NULL) to enter the scheduler
cpu_initclocks_ap() may have callouts to process (e.g., nvme) and
attempt to sched_add() for this AP, but this attempt fails because
of the noted violated assumption leading to locking heartburn in
sched_setpreempt().
Interrupts are still disabled until cpu_throw() so we're not really at
risk of being preempted -- just let the scheduler in on it a little
earlier as part of setting up curthread.
Reviewed by: alfredo, kib, markj
Triage help from: andrew, markj
Smoke-tested by: alfredo (ppc), kevans (arm64, x86), mhorne (arm)
Differential Revision: https://reviews.freebsd.org/D32797
When working on the ports these functions were slightly different, but
now there's no reason for them to be separate.
No functional change intended.
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
for strange case where queried process does not have text.
Reported by: Michael Butler <imb@protected-networks.net>
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
vn_fullpath() call was not converted to pass newtextvp, instead it used
imgp->vp which is still NULL there. As result vn_fullpath() always
returned EINVAL and execpath was recorded from the value of arg0.
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
The existing logic didn't take into account newly inserted mappings
wholly contained by an existing region (or vice versa), nor did it
account for weird overlap scenarios. The latter is probably unlikely
to happen, but the former may happen in UEFI: BootServicesData allocated
within a large chunk of ConventionalMemory. This situation blows up vm
initialization.
While we're here, remove the "exact match" logic as it's likely wrong;
if an exact match exists with conflicting flags, for instance, then we
should probably be doing something else. The new logic takes into
account exact matches as part of the overlapping efforts.
Reviewed by: kib, mhorne (both earlier version)
Differential Revision: https://reviews.freebsd.org/D32701
In 6e66030c4c, additional ptracestop was added in order
to implement PTRACE_EVENT_EXEC. Make it only apply to cases
where the debugger is a Linux processes; native FreeBSD
debuggers can trace Linux processes too, but they don't
expect that additonal ptracestop.
Fixes: 6e66030c4c
Reported By: kib
Reviewed By: kib
Sponsored By: EPSRC
Differential Revision: https://reviews.freebsd.org/D32726
This change is a slight performance optimization for systems with a slow
64-bit division.
The th->th_scale and th->th_large_delta values only depend on the
timecounter frequency and the th->th_adjustment. The timecounter
frequency of a timehand only changes when a new timecounter is activated
for the timehand. The th->th_adjustment is only changed by the NTP
second update. The NTP second update is not done for every call of
tc_windup().
Move the code block to recalculate the scaling factor and
the large delta of a timehand to the new helper function
recalculate_scaling_factor_and_large_delta().
Call recalculate_scaling_factor_and_large_delta() when a new timecounter
is activated and a NTP second update occurred.
MFC after: 1 week
This allows the pmap_remove(min, max) call to see empty pmap and exploit
empty pmap optimization.
Reviewed by: markj
Tested by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32569
While doing it, also move all the code to resolve pathnames and obtain
text vp and dvp, into single place. Besides simplifying the code, it
avoids spurious vnode relocks and validates the explanation why
a transient text reference on the script vnode is not harmful.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32611
For this, use vn_fullpath_hardlink() to resolve executable name for
execve(2).
This should provide the right hardlink name, used for execution, instead
of random hardlink pointing to this binary. Also this should make the
AT_EXECNAME reliable for execve(2), since kernel only needs to resolve
parent directory path, which should always succeed (except pathological
cases like unlinking a directory).
PR: 248184
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32611
Also re-align comments, and group booleans and char members together.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32611
An ordered series of BIO_READ and BIO_WRITE operations are
typically done as:
while (work to do) {
setup bp for I/O
g_io_request(bp, consumer);
biowait(bp);
}
Here you need to have biodone() called at the completion of
the I/O to set the BIO_DONE flag and awaken the biowait(). The
obvious way to do this would be to set bio_done = biodone, but
biodone() will only take the desired action if bio_done == NULL.
The relevant code at the end of biodone() is:
done = bp->bio_done;
if (done == NULL) {
mtxp = mtx_pool_find(mtxpool_sleep, bp);
mtx_lock(mtxp);
bp->bio_flags |= BIO_DONE;
wakeup(bp);
mtx_unlock(mtxp);
} else
done(bp);
This code would infinitely recurse if biodone() is specified as the
routine to use at completion. So before this change, a wrapper done
function had to be written:
static void
g_io_done(struct bio *bp)
{
bp->bio_done = NULL;
biodone(bp);
bp->bio_done = g_io_done;
}
This commit changes
if (done == NULL)
to
if (done == NULL || done == biodone)
which eliminates the need for the wrapper function.
Reviewed by: kib
Sponsored by: Netflix
In shm_largepage_phys_populate(), the result from vm_page_grab() is only
needed for assertion.
In shm_dotruncate_largepage(), there is a commented-out prototype code
for managed largepages. The oldobjsz is saved for its sake, so mark
the variable as __unused directly.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
The function ignores result returned by linker_release_module().
The FW_UNLOAD flag on the file is cleared, so even on error it would
not be tried again.
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
A future change to TOE TLS will require a software fallback for the
first few TLS records received. Future support for NIC TLS on receive
will also require a software fallback for certain cases.
Reviewed by: gallatin, hselasky
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D32566
In particular, ktls_pending_rx_info() determines which TLS record is
at the end of the current receive socket buffer (including
not-yet-decrypted data) along with how much data in that TLS record is
not yet present in the socket buffer.
This is useful for future changes to support NIC TLS receive offload
and enhancements to TOE TLS receive offload. Those use cases need a
way to synchronize a state machine on the NIC with the TLS record
boundaries in the TCP stream.
Reviewed by: gallatin, hselasky
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D32564
Remove page zeroing code from consumers and stop specifying
VM_ALLOC_NOOBJ. In a few places, also convert an allocation loop to
simply use VM_ALLOC_WAITOK.
Similarly, convert vm_page_alloc_domain() callers.
Note that callers are now responsible for assigning the pindex.
Reviewed by: alc, hselasky, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31986
When iterating over the process group members, skip zombies same as it
is done by pfind() for single-process operation.
Reviewed by: emaste, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32513
for state control over TRACE, TRAPCAP, ASLR, PROTMAX, STACKGAP,
NO_NEWPRIVS, and WXMAP.
Reported by: emaste
Reviewed by: emaste, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32513
and remove zeroing of it from specific functions. This way it is
guaranteed that we do not leak kernel data.
Suggested by: markj
Reviewed by: emaste, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32513
Timecounter registration is dynamic, i.e., there is no requirement that
timecounters must be registered during single-threaded boot. Loadable
drivers may in principle register timecounters (which can be switched to
automatically). Timecounters cannot be unregistered, though this could
be implemented.
Registered timecounters belong to a global linked list. Add a mutex to
synchronize insertions and the traversals done by (mpsafe) sysctl
handlers. No functional change intended.
Reviewed by: imp, kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32511
Add a SIG_FOREACH macro that can be used to iterate over a signal set.
This is a bit cleaner and more efficient than calling sig_ffs() in a
loop. The implementation is based on BIT_FOREACH_ISSET(), except
that the bitset limbs are always 32 bits wide, and signal sets are
1-indexed rather than 0-indexed like bitset(9) sets.
issignal() cannot really be modified to use SIG_FOREACH() directly.
Take this opportunity to split the function into two explicit loops.
I've always found this function hard to read and think that this change
is an improvement.
Remove sig_ffs(), nothing uses it now.
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32473
In cases such as daemons launched via limits(1), a process may call
exec multiple times; the last name of the last binary executed is
usually (always?) more informative.
Fixes: 46dd801acb Add userland boot profiling to TSLOG
Sponsored by: https://www.patreon.com/cperciva
This is needed for LinuxKPI's _ioremap_attr. This reuses the generic
implementation introduced for aarch64, and itself requires implementing
pmap_kenter, which is trivial to do given riscv currently treats all
mapping attributes the same due to the Svpbmt extension not yet being
ratified and in hardware.
Reviewed by: markj, mhorne
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D32445
This avoids spurious drop offs as EMPTY is passed regardless of the
actual path name.
Pushign the work inside the lookup instead of just ignorign the flag
allows avoid checking for empty pathname for all other lookups.
On kernels compiled with 'options TSLOG', record for each process ID:
* The timestamp of the fork() which creates it and the parent
process ID,
* The first path passed to execve(), if any,
* The first path resolved by namei, if any, and
* The timestamp of the exit() which terminates the process.
Expose this information via a new sysctl, debug.tslog_user.
On kernels lacking 'options TSLOG' (the default), no information is
recorded and the sysctl does not exist.
Note that recording namei is needed in order to obtain the names of
rc.d scripts being launched, as the rc system sources them in a
subshell rather than execing the scripts.
With this commit it is now possible to generate flamecharts of the
entire boot process from the start of the loader to the end of
/etc/rc. The code needed to perform this processing is currently
found in github: https://github.com/cperciva/freebsd-boot-profiling
Reviewed by: mhorne
Sponsored by: https://www.patreon.com/cperciva
Differential Revision: https://reviews.freebsd.org/D32493
With stack gap enabled top of the stack is moved down by a random
amount of bytes. Because of that some multithreaded applications
which use kern.usrstack sysctl to calculate address of stacks for
their threads can fail. Add kern.stacktop sysctl, which can be used
to retrieve address of the stack after stack gap is applied to it.
Returns value identical to kern.usrstack for processes which have
no stack gap.
Reviewed by: kib
Obtained from: Semihalf
Sponsored by: Stormshield
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D31897
Calling setrlimit with stack gap enabled and with low values of stack
resource limit often caused the program to abort immediately after
exiting the syscall. This happened due to the fact that the resource
limit was calculated assuming that the stack started at sv_usrstack,
while with stack gap enabled the stack is moved by a random number
of bytes.
Save information about stack size in struct vmspace and adjust the
rlim_cur value. If the rlim_cur and stack gap is bigger than rlim_max,
then the value is truncated to rlim_max.
PR: 253208
Reviewed by: kib
Obtained from: Semihalf
Sponsored by: Stormshield
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D31516
Run ktls_init() when the first KTLS session is created rather than
unconditionally during boot. This avoids creating unused threads and
allocating unused resources on systems which do not use KTLS.
Reviewed by: gallatin, markj
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D32487
Without this change, unmounting smbfs filesystems with an INVARIANTS
kernel would panic after 10e64782ed.
Found by: markj
Reviewed by: markj, jhb
Obtained from: CheriBSD
MFC after: 3 days
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D32492
TLS 1.0 records are encrypted as one continuous CBC chain where the
last block of the previous record is used as the IV for the next
record. As a result, TLS 1.0 records cannot be encrypted out of order
but must be encrypted as a FIFO.
If the later pages of a sendfile(2) request complete before the first
pages, then TLS records can be encrypted out of order. For TLS 1.1
and later this is fine, but this can break for TLS 1.0.
To cope, add a queue in each TLS session to hold TLS records that
contain valid unencrypted data but are waiting for an earlier TLS
record to be encrypted first.
- In ktls_enqueue(), check if a TLS record being queued is the next
record expected for a TLS 1.0 session. If not, it is placed in
sorted order in the pending_records queue in the TLS session.
If it is the next expected record, queue it for SW encryption like
normal. In addition, check if this new record (really a potential
batch of records) was holding up any previously queued records in
the pending_records queue. Any of those records that are now in
order are also placed on the queue for SW encryption.
- In ktls_destroy(), free any TLS records on the pending_records
queue. These mbufs are marked M_NOTREADY so were not freed when the
socket buffer was purged in sbdestroy(). Instead, they must be
freed explicitly.
Reviewed by: gallatin, markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D32381
To implement -o emptydir, vfs_emptydir() checks that the passed
directory is empty. This should be done after checking whether the
vnode is of type VDIR, though, or vfs_emptydir() may end up calling
VOP_READDIR on a non-directory.
Reported by: syzbot+4006732c69fb0f792b2c@syzkaller.appspotmail.com
Reviewed by: kib, imp
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32475
Create the initial pool of kprocs on demand when the first socket AIO
request is submitted instead. The pool of kprocs used for other AIO
requests is similarly created on first use.
Reviewed by: asomers
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D32468
This gives the vfs layer a chance to provide handling for EVFILT_VNODE,
for instance. Change pipe_specops to use the default vop_kqfilter to
accommodate fifoops that don't specify the method (i.e. all in-tree).
Based on a patch by Jan Kokemüller.
PR: 225934
Reviewed by: kib, markj (both pre-KASSERT)
Differential Revision: https://reviews.freebsd.org/D32271
These calls do operate on vnodes only, not file contents.
This is useful for e.g. the xdg-document-portal fuse filesystem.
Reviewed by: kib, markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D32438
Support changing the protection of preloaded kernel modules by
implementing pmap_change_prot on arm64 and calling it from
preload_protect.
Reviewed by: alc (previous version)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32026
This can be disabled by sysctl kern.core_dump_can_intr
Reported and tested by: pho
Reviewed by: imp, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32313
Function returns an indicator that the process was killed with SIGKILL
Reviewed by: imp, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32313
Consumers that want the full allocation size will typically access the
full buffer, so mark the entire allocation as valid to avoid useless
KASAN reports.
Sponsored by: The FreeBSD Foundation
The flag should be accessible from non-current threads.
Reviewed by: markj
Tested by: trasz
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32252
This function is actively used by sbuf_vprintf(), so this simple
inlining in half reduces time of kern.geom.confxml generation.
MFC after: 2 weeks
Sponsored by: iXsystem, Inc.
Callout c_time is always bigger or equal than the scheduled time. It
is also smaller than sbinuptime() and can't change while the callback
is running. So we reliably can use it instead of sbinuptime() here.
In case there was a race and the callout was rescheduled to the later
time, the callback will be called again.
According to profiles it saves ~5% of the timer interrupt time even
with fast TSC timecounter.
MFC after: 1 month
Before this change kern.sched.interact sysctl setting above 32 gave
all interactive threads identical priority of PRI_MIN_INTERACT due to
((PRI_MAX_INTERACT - PRI_MIN_INTERACT + 1) / sched_interact) turning
zero. Setting the sysctl lower reduced the range of used priority
levels up to half, that is not great either.
Change of the operations order should fix the issue, always using full
range of priorities, while overflow is impossible there since both
score and priority values are small. While there, make the variables
unsigned as they really are.
MFC after: 1 month
A core segment is bounded in size only by memory size. On 64-bit
architectures this means a segment can be much larger than 4GB.
However, compress_chunk() takes only a u_int, clamping segment size to
4GB-1, resulting in a truncated core. Everything else, including the
compressor internally, uses size_t, so use size_t at the boundary here.
This dates back to the original refactor back in 2015 (r279801 /
aa14e9b7).
MFC after: 1 week
Sponsored by: Juniper Networks, Inc.
NOTE_ABSTIME may also have a zero timeout, which indicates that we
should still fire immediately as an absolute time in the past. A test
has been added for this one as well.
Fixes: 9c999a259f ("kqueue: don't arbitrarily restrict long-past...")
Point hat: kevans
Reported by: syzbot+1c8d1154f560b3930042@syzkaller.appspotmail.com
NOTE_ABSTIME values are converted to values relative to boottime in
filt_timervalidate(), and negative values are currently rejected. We
don't reject times in the past in general, so clamp this up to 0 as
needed such that the timer fires immediately rather than imposing what
looks like an arbitrary restriction.
Another possible scenario is that the system clock had to be adjusted
by ~minutes or ~hours and we have less than that in terms of uptime,
making a reasonable short-timeout suddenly invalid. Firing it is still
a valid choice in this scenario so that applications can at least
expect a consistent behavior.
Reviewed by: kib, markj
Discussed with: allanjude
Differential Revision: https://reviews.freebsd.org/D32230
The implementation of the progress bar is simple, but duplicated for
most minidump implementations. Extract the common bits to kern_dump.c.
Ensure that the bar is reset with each subsequent dump; this was only
done on some platforms previously.
Reviewed by: markj
MFC after: 2 weeks
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D31885
This function was renamed to kern_reboot() in 2010, but the man page has
failed to keep in sync. Bring it up to date on the rename, add the
shutdown hooks to the synopsis, and document the (obvious) fact that
kern_reboot() does not return.
Fix an outdated reference to the old name in kern_reboot(), and leave a
reference to the man page so future readers might find it before any
large changes.
Reviewed by: imp, markj
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32085
If the default range of [0, ~0] is given, then (~0 - 0) + 1 == 0. This
in turn will cause any allocation of non-zero size to fail. Zero-sized
allocations are prohibited, so add a KASSERT to this effect.
History indicates it is part of the original rman code. This bug may in
fact be older than some contributors.
Reviewed by: mhorne
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D30280
e745d729be caused infinite loop with interrupts disabled in load
stealing code if steal_thresh set below 2. Such configuration should
not generally be used, but appeared some people are using it to
workaround some problems.
To fix the problem explicitly pass to sched_highest() minimum number
of transferrable threads, supported by the caller, instead of guessing.
MFC after: 25 days
For alignment we do not need to do anything to make it operational.
For size, upgrade zero sized request to one byte so that we do not
request insane amount of memory for placeholder.
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D32127
Before this device unit number match was coincidental and broke if I
disabled some CPU device(s). Aside of cosmetics, for some drivers
(may be considered broken) it caused talking to wrong CPUs.
When device driver probe method returns 0, i.e. absolute priority, do
not remove its class from the device just to set it back few lines
later, that may change the device unit number, etc. and after which
we'd better call the probe again.
If during search we found some driver with absolute priority, we do
not need to set device driver and class since we haven't removed them
before.
It should not happen, but if second probe method call failed, remove
the driver and possibly the class from the device as it was when we
started.
Reviewed by: imp, jhb
Differential Revision: https://reviews.freebsd.org/D32125
I did DF_REBID to allow for 'hoover' drivers that would attach to
otherwise unattached devices in the tree. This notion didn't catch on as
it was tricky to make work well and it was easier to just publish a /dev
node of some flavor by the parent device. It's been nothing but dead
weight for a long time.
Reviewed by: mav
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D32056
When this flag is set, operations that update an existing kevent will
not change the udata field. This can be used to NOTE_TRIGGER or
EV_{EN,DIS}ABLE events without overwriting the stashed pointer.
Reviewed by: Domagoj Stolfa <domagoj.stolfa@gmail.com>
Obtained from: CheriBSD
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D30286
Depending on hardware, NUMA nodes may match last level caches, or
they may be above them (AMD Zen 2/3) or below (Intel Xeon w/ SNC).
This information is provided by ACPI instead of CPUID, and it is
provided for each CPU individually instead of mask widths, but
this code should be able to properly handle all the above cases.
This change should immediately allow idle stealing in sched_ule(4)
to prefer load from NUMA-local CPUs to remote ones when the node
does not match LLC. Later we may think of how to better handle it
on sched_pickcpu() side.
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D32025
Avoid using atomics as it_wait is guarded by td_lock.
Report threshold calculation is done only if at least one PMC hook
is installed
Fixes:
* avoid unnecessary branching (if frame != null ...)
by having PMC_HOOK_INSTALLED_ANY
condition on the top of them, which should hint
the core not to execute speculatively anything
which us underneath;
* access intr_hwpmc_waiting_report_threshold cacheline
only if at least one hook is loaded;
Before this change long-term load balancer was unable to migrate
running threads, only ones waiting on run queues. But with growing
number of CPU cores it is quite typical now for system to not have
many waiting threads. But same time if due to some coincidence two
long-running CPU-bound threads ended up sharing same physical CPU
core, they could suffer from the SMT penalty indefinitely, and the
load balancer couldn't help.
Improve that by teaching the load balancer to hint running threads
to migrate by marking them with TDF_NEEDRESCHED and new TDF_PICKCPU
flag, making sched_pickcpu() to search for better CPU later, when
it is convenient.
Fix CPU search logic when balancing to limit round-robin migrations
in case of almost equal load to the group of physical cores. The
previous code bounced threads across all the system, that should be
pretty bad for caches and NUMA affinity, while additional fairness
was almost invisible, diminishing with number of cores in the group.
MFC after: 1 month
According to https://github.com/NuxiNL/cloudlibc:
CloudABI is no longer being maintained. It was an awesome experiment,
but it never got enough traction to be sustainable.
There is no reason to keep it in FreeBSD.
Approved by: ed (private mail)
Reviewed by: emaste
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D31923
In scenarios when first thread in the queue can migrate to specified
CPU, but later ones can't runq_steal_from() incorrectly returned NULL.
MFC after: 2 weeks
For signal send, copyout from the user FPU save area directly.
For sigreturn, we are in sleepable context and can do temporal
allocation of the transient save area. We cannot copying from userspace
directly to user save area because XSAVE state needs to be validated,
also partial copyins can corrupt it.
Requested by: jhb
Reviewed by: jhb, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31954
Instead do one more allocation at the thread creation time. This frees
a lot of space on the stack.
Also do not use alloca() for temporal storage in signal delivery sendsig()
function and signal return syscall sys_sigreturn(). This saves equal
amount of space, again by the cost of one more allocation at the thread
creation time.
A useful experiment now would be to reduce KSTACK_PAGES.
Reviewed by: jhb, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31954
Generialize bus specific property accessors. Those functions allow driver code
to access device specific information.
Currently there is only support for FDT and ACPI buses.
Reviewed by: manu, mw
Sponsored by: Semihalf
Differential revision: https://reviews.freebsd.org/D31597
We need to load the socket pointer after locking the PCB, otherwise
the socket may have been detached and freed by the time that unp_drop()
sets so_error.
This previously went unnoticed as the socket zone was _NOFREE.
Reported by: pho
MFC after: 1 week
As with FIFOs, a path descriptor for a unix socket cannot be used with
kevent().
In principle connectat(2) and bindat(2) could be modified to support an
AT_EMPTY_PATH-like mode which operates on the socket referenced by an
O_PATH fd referencing a unix socket. That would eliminate the path
length limit imposed by sockaddr_un.
Update O_PATH tests.
Reviewed by: kib
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31970
To handle shutdown(SHUT_RD) we flush the receive buffer of the socket.
This may involve searching for control messages of type SCM_RIGHTS,
since we need to close the file references. Closing arbitrary files
with socket buffer locks held is undesirable, mainly due to lock
ordering issues, so we instead make a copy of the socket buffer and
operate on that without any locks. Fields in the original buffer are
cleared.
This behaviour clobbered the AIO job queue associated with a receive
buffer. It could also cause us to leak a KTLS session reference.
Reorder socket buffer fields to address this.
An alternate solution would be to remove the hack in sorflush(), but
this is not quite feasible (yet). In particular, though sorflush()
flags the sockbuf with SBS_CANTRCVMORE, it is possible for more data to
be queued - the flag just prevents userspace from reading more data. I
suspect we should fix this; SBS_CANTRCVMORE represents a terminal state
and protocols can likely just drop any data destined for such a buffer.
Many of them already do, but in some cases the check is racy, and some
KPI churn will be needed to fix everything. This approach is more
straightforward for now.
Reported by: syzbot+104d8ee3430361cb2795@syzkaller.appspotmail.com
Reported by: syzbot+5bd2e7d05f84a59d0d1b@syzkaller.appspotmail.com
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31976
This flag was added during the transition away from the legacy zone
allocator, commit c897b81311. The old
zone allocator effectively provided _NOFREE semantics, but it seems that
they are not required for sockets. In particular, we use reference
counting to keep sockets live.
One somewhat dangerous case is sonewconn(), which returns a pointer to a
socket with reference count 0. This socket is still effectively owned
by the listening socket. Protocols must therefore be careful to
synchronize sonewconn() calls with their pru_close implementations,
since for listening sockets soclose() will abort the child sockets. For
example, TCP holds the listening socket's PCB read locked across the
sonewconn() call, which blocks tcp_usr_close(), and sofree()
synchronizes with a concurrent soabort() of the nascent socket.
However, _NOFREE semantics are not required here.
Eliminating _NOFREE has several benefits: it enables use-after-free
detection (e.g., by KASAN) and lets the system reclaim memory from the
socket zone under memory pressure. No functional change intended.
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31975
Sockets in a listen queue hold a reference to the parent listening
socket. Several code paths release this reference manually when moving
a child socket out of the queue.
Replace comments about the expected post-decrement refcount value with
assertions. Use refcount_load() instead of a plain load. No functional
change intended.
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31974
After releasing the fd reference to a socket "so", we should avoid
testing SOLISTENING(so) since the socket may have been freed. Instead,
directly test whether the list of unaccepted sockets is empty.
Fixes: f4bb1869dd ("Consistently use the SOLISTENING() macro")
Pointy hat: markj
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31973
ktls_get_(rx|tx)_mode() can return an errno value or a TLS mode, so
errors are effectively hidden. Fix this by using a separate output
parameter. Convert to the new socket buffer locking macros while here.
Note that the socket buffer lock is not needed to synchronize the
SOLISTENING check here, we can rely on the PCB lock.
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31977
It allows to override kern.elf{32,64}.allow_wx on per-process basis.
In particular, it makes it possible to run binaries without PT_GNU_STACK
and without elfctl note while allow_wx = 0.
Reviewed by: brooks, emaste, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31779