parse_dir_md() opens /dev/mdctl but only closes the resulting fd on
success, not upon failure of the ioctl or when we exceed the md unit
max.
Reviewed by: kib (slightly previous version)
Sponsored by: NetApp, Inc.
Sponsored by: Klara, Inc.
X-NetApp-PR: #62
Differential Revision: https://reviews.freebsd.org/D31229
This builds on recently introduced NO_NEW_PRIVS flag to implement
unprivileged chroot, enabled by `security.bsd.unprivileged_chroot`.
It allows non-root processes to chroot(2), provided they have the
NO_NEW_PRIVS flag set.
The chroot(8) utility gets a new flag, -n, which sets NO_NEW_PRIVS
before chrooting.
Reviewed By: kib
Sponsored By: EPSRC
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D30130
Temporary add stubs to the Linux emulation layer which calls the existing hook.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D30911
MFC after: 2 weeks
For future use in the Linux emulation layer call sv_onexec hook right after
the new process address space is created. It's safe, as sv_onexec used only
by Linux abi and linux_on_exec() does not depend on a state of process VA.
Reviewed by: kib
Differential revision: https://reviews.freebsd.org/D30899
MFC after: 2 weeks
For future use in the Linux emulation layer modify the exec_sysvec_init()
to allow non-native abi to fill sv_timekeep_base and sv_shared_page_obj.
Reviewed by: kib
Differential revision: https://reviews.freebsd.org/D30898
MFC after: 2 weeks
The early environment is typically cleared, so these new options
need the PRESERVE_EARLY_KENV kernel config(8) option. These environments
are reported as missing by kenv(1) if the option is not present in the
running kernel.
Reviewed by: imp
Differential Revision: https://reviews.freebsd.org/D30835
Some downstream configurations do not store secrets in the
early (loader/static) environments and desire a way to preserve these
for diagnostic reasons. Provide an option to do so.
Reviewed by: imp, jhb (earlier version)
Differential Revision: https://reviews.freebsd.org/D30834
The syscall number is stored in the same register as the syscall return
on amd64 (and possibly other architectures) and so it is impossible to
recover in the signal handler after the call has returned. This small
tweak delivers it in the `si_value` field of the signal, which is
sufficient to catch capability violations and emulate them with a call
to a more-privileged process in the signal handler.
This reapplies 3a522ba1bc852c3d4660a4fa32e4a94999d09a47 with a fix for
the static assertion failure on i386.
Approved by: markj (mentor)
Reviewed by: kib, bcr (manpages)
Differential Revision: https://reviews.freebsd.org/D29185
One is allowed to use LIO_NOWAIT without specifying a sigevent. In this
case, lj->lioj_signal is left uninitialized, but several code paths
examine liov_signal.sigev_notify to figure out which notification to
post. Unconditionally initialize that field to SIGEV_NONE.
Add a dumb test case which triggers the bug.
Reported by: KMSAN+syzkaller
Reviewed by: asomers
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31197
Commit bb4a27f927a1 added the ability to allocate a span of blocks
crossing a meta node boundary. To ensure that blst_next_leaf_alloc()
does not walk past the end of the tree, an extra all-zero meta node
needs to be present at the end of the allocation, and
blst_next_leaf_alloc() is implemented such that the presence of this
node terminates the search.
blist_create() computes the number of nodes required. It had two
problems:
1. When the size of the blist is a power of BLIST_RADIX, we would
unnecessarily allocate an extra level in the tree.
2. When the size of the blist is a multiple of BLIST_RADIX, we would
fail to allocate a terminator node. In this case,
blst_next_leaf_alloc() could scan beyond the bounds of the
allocation. This was found using KASAN.
Modify blist_create() to handle these cases correctly.
Reported by: pho
Reviewed by: dougm
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D31158
Its callers do not make use the modified size that malloc_large() was
returning, so there's no need to pass a pointer. No functional change
intended.
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
The syscall number is stored in the same register as the syscall return
on amd64 (and possibly other architectures) and so it is impossible to
recover in the signal handler after the call has returned. This small
tweak delivers it in the `si_value` field of the signal, which is
sufficient to catch capability violations and emulate them with a call
to a more-privileged process in the signal handler.
Approved by: markj (mentor)
Reviewed by: kib, bcr (manpages)
Differential Revision: https://reviews.freebsd.org/D29185
It will be called during KLD unload to unpoison the redzones following
global variables. Otherwise, virtual address ranges previously used for
a KLD may be left tainted, triggering false positives when they are
recycled.
Reported by: pho
Sponsored by: The FreeBSD Foundation
The first release of an interrupt in a situation where the interrupt table
is full should schedule a full table check the next time an interrupt is
allocated. A full check is necessary to ensure maximum separation between
the order of allocation and the order of release.
Submitted by: ehem_freebsd@m5p.com (initial version)
Discussed in: https://reviews.freebsd.org/D29310
MFC after: 4 weeks
HPTS drives both rack and bbr, and yet there have been many complaints
about performance. This bit of work restructures hpts to help reduce CPU
overhead. It does this by now instead of relying on the timer/callout to
drive it instead use user return from a system call as well as lro flushes
to drive hpts. The timer becomes a backstop that dynamically adjusts
based on how "late" we are.
Reviewed by: tuexen, glebius
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D31083
Use sysentvec hooks to only call umtx_thread_exit/umtx_exec, which handle
robust mutexes, for native FreeBSD ABI. Similarly, there is no sense
in calling sigfastblock_clear() for non-native ABIs.
Requested by: dchagin
Reviewed by: dchagin, markj (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D30987
after itimers are stopped. This makes it more usable for e.g. native FreeBSD
ABI sysentvecs.
Reviewed by: dchagin, markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D30987
Unlike sv_onexec(), it is called from the old (pre-exec) sysentvec structure.
The old vmspace for the process is still intact during the call.
Reviewed by: dchagin, markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D30987
The intent is to eliminate the MT_NOINIT flag and consequently a branch
from the constructor.
Reviewed by: gallatin
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D31080
Ifnet (inline) hw kTLS NICs typically keep state within
a TLS record, so that when transmitting in-order,
they can continue encryption on each segment sent without
DMA'ing extra state from the host.
This breaks down when transmits are out of order (eg,
TCP retransmits). In this case, the NIC must re-DMA
the entire TLS record up to and including the segment
being retransmitted. This means that when re-transmitting
the last 1448 byte segment of a TLS record, the NIC will
have to re-DMA the entire 16KB TLS record. This can lead
to the NIC running out of PCIe bus bandwidth well before
it saturates the network link if a lot of TCP connections have
a high retransmoit rate.
This change introduces a new sysctl (kern.ipc.tls.ifnet_max_rexmit_pct),
where TCP connections with higher retransmit rate will be
switched to SW kTLS so as to conserve PCIe bandwidth.
Reviewed by: hselasky, markj, rrs
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D30908
Since commit 2dd1bdf1834c in 2016 the r_start and r_end fields have been
rman_res_t, which was briefly unsigned long, but commit da1b038af9f9
changed the typedef to be uintmax_t instead. C99 is also something we
assume these days.
Reviewed by: imp
Differential Revision: https://reviews.freebsd.org/D30808
The intent is to remove all direct zone_mbuf consumers so that ctor/dtor
from that zone can be reimplemented as wrappers around uma, avoiding an
indirect function call.
Reviewed by: kbowling
Discussed with: gallatin
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D30959
The flag was added in 2016 but remains unused.
Reviewed by: kbowling
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D30958
This introduces a new, per-process flag, "NO_NEW_PRIVS", which
is inherited, preserved on exec, and cannot be cleared. The flag,
when set, makes subsequent execs ignore any SUID and SGID bits,
instead executing those binaries as if they not set.
The main purpose of the flag is implementation of Linux
PROC_SET_NO_NEW_PRIVS prctl(2), and possibly also unpriviledged
chroot.
Reviewed By: kib
Sponsored By: EPSRC
Differential Revision: https://reviews.freebsd.org/D30939
Instead of p_elf_machine use machine member of the Elf_Brandinfo which is now
cached in the struct proc at p_elf_brandinfo member.
Note to MFC: D30918, KBI
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D30926
MFC after: 2 weeks
To allow the ABI to make a dicision based on the Brandinfo add a link
to the Elf_Brandinfo into the struct proc. Add a note that the high 8 bits
of Elf_Brandinfo flags is private to the ABI.
Note to MFC: it breaks KBI.
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D30918
MFC after: 2 weeks
This adds `sv_elf_core_osabi`, `sv_elf_core_abi_vendor`,
and `sv_elf_core_prepare_notes` fields to `struct sysentvec`,
and modifies imgact_elf.c to make use of them instead
of hardcoding FreeBSD-specific values. It also updates all
of the ABI definitions to preserve current behaviour.
This makes it possible to implement non-native ELF coredump
support without unnecessary code duplication. It will be used
for Linux coredumps.
Reviewed By: kib
Sponsored By: EPSRC
Differential Revision: https://reviews.freebsd.org/D30921
Introduce SLEEPQ_DROP sleepq_signal() flag, allowing one to drop the
sleep queue chain lock before returning. Reduced lock scope allows
significantly reduce lock contention inside taskqueue_enqueue() for
ZFS worker threads doing ~350K disk reads/s on 40-thread system.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
Its intent is to do the initialization of the future part of struct nameidata
which should be used across several namei() and VOPs. Right now it is NOP.
Reviewed by: mckusick
Discussed with: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D30041
Now that the upper layers all go through a layer to tie into these
information functions that translates an sbuf into char * and len. The
current interface suffers issues of what to do in cases of truncation,
etc. Instead, migrate all these functions to using struct sbuf and these
issues go away. The caller is also in charge of any memory allocation
and/or expansion that's needed during this process.
Create a bus_generic_child_{pnpinfo,location} and make it default. It
just returns success. This is for those busses that have no information
for these items. Migrate the now-empty routines to using this as
appropriate.
Document these new interfaces with man pages, and oversight from before.
Reviewed by: jhb, bcr
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D29937
The new buffer is somewhat larger, but there should be no functional
changes.
Reviewed By: kib, imp
Sponsored By: EPSRC
Differential Revision: https://reviews.freebsd.org/D30821
The i386 loader (and hopefully others to come) now passes tslog data
as a "preloaded module". Include this in the data returned by the
debug.tslog sysctl.
Reviewed by: kevans
armv6 and armv7 systems already were 1000Hz. The other armv5 were a
mix of 100 and 1000. This changes them to 1000. Should there be
issues, we can add options HZ=100 to the systems that have bad
performance at the drop of a hat.
mips is a lot more complicated. But most of the systems are already
1000HZ. The hardware exceptions are all fast enough to run at
1000Hz. MALTA is our primary emulator, and history has shown emulators
tend to like 100Hz better, so run those systems at 100Hz. As with arm,
any system that shows a huge performance regression can reverted to
100Hz easily.
This was going to be committed well in advance of the 13 branch, but
it was delayed and forgotten til now.
Discussed on: #bsdmips ages ago
Sponsored by: Netflix
The TOE driver might receive decrypted TLS records that are enqueued
to the socket buffer after ktls_try_toe() returns and before
ktls_enable_rx() locks the receive buffer to call sb_mark_notready().
In that case, sb_mark_notready() would incorrectly treat the decrypted
TLS record as an encrypted record and schedule it for decryption.
This always resulted in the connection being dropped as the data in
the control message did not look like a valid TLS header.
To fix, don't try to handle software decryption of existing buffers in
the socket buffer for TOE TLS in ktls_enable_rx(). If a TOE TLS
driver needs to decrypt existing data in the socket buffer, the driver
will need to manage that in its tod_alloc_tls_session method.
Sponsored by: Chelsio Communications
It seems that Linux does not dequeue siginfo for SIGCHLD when wait*(2)
reports status of the running process. In particular, sigwaitinfo(2)
and other signal querying syscalls can observe the siginfo after wait.
FreeBSD dequeued siginfo from the beginning, so we cannot change the
default ABI to be more compatible. Still, add a knob to enable to
change to the other behavior for debugging purposes.
Reported by: dchagin
Reviewed by: dchagin, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30675
Traditionally, BSD drops signals with the default action during send,
not even putting them to the destination process queue. This semantic
is not shared with other operating systems (Linux), which do queue
such signals. In particular, sigtimedwait(2) and related syscalls can
observe the delivery.
Add a global knob kern.sig_discard_ign which can be set to false to force
enqueuing of the signals with default action. Also add an ABI flag to
indicate that signals should be queued.
Note that it is not practical to run with the knob turned on, because almost
all software that care about the delivery of such signals, is aware of the
difference, and misbehaves if the signals are actually queued. The purpose
of the knob as is is to allow for easier diagnostic of the programs that
need the adjustments, to confirm the cause of problem.
Reported by: dchagin
Reviewed by: dchagin, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D30675