When a device driver's probe method returns 0, i.e. absolute priority, do
not remove its class from the device just to set it back a few lines
later; that may change the device unit number, etc., after which we would
have to call the probe again.
If during the search we found a driver with absolute priority, we do
not need to set the device driver and class, since we did not remove
them earlier.
It should not happen, but if the second probe method call fails, remove
the driver and possibly the class from the device, restoring the state
we started with.
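A hedged sketch of the resulting flow in the probe loop (illustrative
names, not the literal subr_bus.c code):
    if (pri == 0) {
            /* Absolute priority: the winning driver and its class are
             * still set on the device, so keep them and skip the
             * second probe. */
            return (0);
    }
    /* Otherwise restore the best candidate and probe it again. */
    device_set_devclass(child, dc->name);
    device_set_driver(child, best->driver);
    result = DEVICE_PROBE(child);
    if (result != pri) {
            /* Should not happen; put the device back the way we
             * found it. */
            device_set_driver(child, NULL);
    }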
Reviewed by: imp, jhb
Differential Revision: https://reviews.freebsd.org/D32125
(cherry picked from commit f73c2bbf81)
Fix 0389e9be63 for the LINT build. An argument was removed only from code
under BUS_DEBUG without rebuilding LINT...
Sponsored by: Netflix
Fixes: 0389e9be63
(cherry picked from commit 67a9e76da6)
I added DF_REBID to allow for 'hoover' drivers that would attach to
otherwise unattached devices in the tree. The notion didn't catch on, as
it was tricky to make work well and it was easier to just publish a /dev
node of some flavor from the parent device. It has been nothing but dead
weight for a long time.
Reviewed by: mav
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D32056
(cherry picked from commit 0389e9be63)
In particular, we need to initialize efbuf->flags, since
export_vnode_to_sb() loads that field. This was mostly harmless since
the flag only determines whether the output kinfo_file is packed, and
KERN_PROC_CWD only ever emits a single kinfo_file anyway.
Sponsored by: The FreeBSD Foundation
(cherry picked from commit 327060bd77)
syslog(3) was recently changed to support larger messages, up to 8KB.
Our syslogd handles this fine, as it adjusts /dev/log's recv buffer to a
large size. rsyslog, however, uses the system default of 4KB. This
leads to problems since our syslog(3) retries indefinitely when a send()
returns ENOBUFS, but if the message is large enough this will never
succeed.
Increase the default recv buffer size for datagram sockets to support
8KB syslog messages without requiring the logging daemon to adjust its
buffers.
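For reference, a hedged userland sketch of how a logging daemon could
itself request a larger receive buffer on its /dev/log socket fd (not
part of this kernel change):
    #include <sys/socket.h>
    #include <err.h>

    int sz = 8 * 1024;
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &sz, sizeof(sz)) == -1)
            warn("setsockopt(SO_RCVBUF)");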
PR: 260126
Reviewed by: asomers
Sponsored by: The FreeBSD Foundation
(cherry picked from commit d157f2627b)
With various firmware files used by graphics and wireless drivers
we are exceeding the current 32-character module name (file path
in kldxref) length.
To overcome this, bump it to the maximum path length for the next
version.
To be able to MFC, provide backward-compatibility support for another
version of the struct, as the offsets of its second half change due to
the array size increase.
MAXMODNAME being defined to MAXPATHLEN requires param.h to be
included first. With only 7 modules (or the LinuxKPI module.h) not
doing that, adjust them rather than including param.h in module.h [1].
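For the handful of affected modules the include order now matters
(minimal sketch):
    #include <sys/param.h>      /* provides MAXPATHLEN, used by MAXMODNAME */
    #include <sys/module.h>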
Reported by: Greg V (greg unrelenting.technology)
Sponsored by: The FreeBSD Foundation
Suggested by: imp [1]
Reviewed by: imp (and others to different level)
Differential Revision: https://reviews.freebsd.org/D32383
(cherry picked from commit df38ada293)
With the stack gap enabled, the top of the stack is moved down by a
random number of bytes. Because of that, some multithreaded applications
that use the kern.usrstack sysctl to calculate the addresses of stacks
for their threads can fail. Add a kern.stacktop sysctl, which can be
used to retrieve the address of the top of the stack after the stack gap
has been applied to it. It returns a value identical to kern.usrstack
for processes that have no stack gap.
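A hedged userland sketch of how a threading library might consume the
new sysctl, falling back to kern.usrstack on kernels that lack it:
    #include <sys/types.h>
    #include <sys/sysctl.h>

    u_long stacktop;
    size_t len = sizeof(stacktop);
    if (sysctlbyname("kern.stacktop", &stacktop, &len, NULL, 0) != 0)
            (void)sysctlbyname("kern.usrstack", &stacktop, &len, NULL, 0);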
Reviewed by: kib
Obtained from: Semihalf
Sponsored by: Stormshield
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D31897
(cherry picked from commit a97d697122)
Calling setrlimit with the stack gap enabled and with low values of the
stack resource limit often caused the program to abort immediately after
exiting the syscall. This happened because the resource limit was
calculated assuming that the stack started at sv_usrstack, while with
the stack gap enabled the stack is moved down by a random number of
bytes.
Save information about the stack size in struct vmspace and adjust the
rlim_cur value. If rlim_cur plus the stack gap is bigger than rlim_max,
the value is truncated to rlim_max.
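A minimal sketch of the clamping described above (illustrative names,
not the literal kernel code):
    new_stack_limit = req->rlim_cur + stack_gap_size;   /* grow by the gap */
    if (new_stack_limit > req->rlim_max)
            new_stack_limit = req->rlim_max;            /* clamp to hard limit */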
PR: 253208
Reviewed by: kib
Obtained from: Semihalf
Sponsored by: Stormshield
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D31516
(cherry picked from commit 889b56c8cd)
Most of the nvme initialization time in my tests is being spent here
(via pause_sbt).
Sponsored by: https://www.patreon.com/cperciva
(cherry picked from commit bd11e253a9)
stand/common: Add file_addbuf()
libsa: Add support for timestamp logging (tslog)
stand/common: Add support for timestamp logging (tslog)
i386/loader: Call tslog_init
efi/loader: Call tslog_init (+ bugfix)
stand/common command_boot: Pass tslog to kernel
kern_tslog: Include tslog data from loader
loader: Use tslog to instrument some functions
Add userland boot profiling to TSLOG (+ bugfix)
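The instrumentation added in the commits above follows the usual TSLOG
pattern (sketch; the macros compile to nothing when TSLOG is not
configured, and the header location differs between the kernel and the
loader):
    TSENTER();              /* timestamp: entering this function */
    /* ... work being profiled ... */
    TSEXIT();               /* timestamp: leaving this function */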
Sponsored by: https://www.patreon.com/cperciva
(cherry picked from commit 60a978bec9)
(cherry picked from commit e193d3ba33)
(cherry picked from commit c8dfc327db)
(cherry picked from commit c4b65e954f)
(cherry picked from commit f49381ccb6)
(cherry picked from commit 537a44bf28)
(cherry picked from commit fe51b5a76d)
(cherry picked from commit 313724bab9)
(cherry picked from commit 46dd801acb)
(cherry picked from commit 52e125c2bd)
(cherry picked from commit 19e4f2f289)
When the NFSv4.2 server does a VOP_ALLOCATE(), it needs
the operation to be done for the RPC's credential and not
td_ucred. It also needs the writing to be done synchronously.
This patch adds "ioflag" and "cred" arguments to VOP_ALLOCATE()
and modifies vop_stdallocate() to use these arguments.
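The interface after this change (hedged sketch of the vnode_if.src
entry):
    int VOP_ALLOCATE(struct vnode *vp, off_t *offset, off_t *len,
        int ioflag, struct ucred *cred);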
The VOP_ALLOCATE.9 man page will be patched separately.
(cherry picked from commit f0c9847a6c)
Introduce a new MSGBUF_WRAP flag, indicating that the buffer has wrapped
at least once and no longer keeps zeroes from the last msgbuf_clear().
It allows msgbuf_peekbytes() to return only real data, so consumers no
longer have to trim the leading zeroes after doing a pointless copy.
The most visible effect is that the kern.msgbuf sysctl now always
returns a properly zero-terminated string, not only after the first
buffer wrap.
MFC after: 1 week
Sponsored by: iXsystems, Inc.
(cherry picked from commit 81dc00331d)
There are two places where we convert from a timecounter delta to
a bintime delta: tc_windup and bintime_off.
Both functions use the same calculations when the timecounter delta is
small. But for a large delta (greater than approximately an equivalent
of 1 second) the calculations were different. Both functions use
approximate calculations based on th_scale that avoid division. Both
produce values slightly greater than the true value that division by
tc_frequency would give. tc_windup is slightly more accurate, so its
result is closer to the true value and thus smaller than bintime_off's
result.
As a consequence there can be a jump back in time when time hands are
switched after a long period of time (a large delta). Just before the
switch the time would be calculated with a large delta from
th_offset_count in bintime_off. tc_windup does the switch using its own
calculations of a new th_offset using the large delta. As explained
earlier, the new th_offset may end up being less than the previously
produced binuptime. So, for a period of time new binuptime values may
be "back in time" comparing to values just before the switch.
Such a jump must never happen. All the code assumes that the uptime is
monotonically nondecreasing and some code works incorrectly when that
assumption is broken. For example, we have observed sleepq_timeout()
ignoring a timeout when the sbinuptime value obtained by the callout
code was greater than the expiration value, but the sbinuptime obtained
in sleepq_timeout() was less than it. In that case the target thread
would never get woken up.
The unified calculations should ensure the monotonic property of the
uptime.
The problem is quite rare as normally tc_windup should be called HZ
times per second (typically 1000 or 100). But it may happen in VMs on
very busy hypervisors where a VM's virtual CPU may not get an execution
time slot for a second or more.
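A hedged sketch of the now-shared computation (names approximate; the
real code lives in sys/kern/kern_tc.c). th_scale is a 0.64 fixed-point
fraction of a second per counter tick, and large_delta is roughly
2^64 / scale, precomputed:
    static void
    bintime_add_tc_delta(struct bintime *bt, uint64_t scale,
        uint64_t large_delta, uint64_t delta)
    {
            uint64_t x;

            if (__predict_false(delta >= large_delta)) {
                    /* scale * delta would overflow; split the product. */
                    x = (scale >> 32) * delta;
                    bt->sec += x >> 32;
                    bintime_addx(bt, x << 32);
                    bintime_addx(bt, (scale & 0xffffffff) * delta);
            } else {
                    bintime_addx(bt, scale * delta);
            }
    }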
Reviewed by: kib
Sponsored by: Panzura LLC
(cherry picked from commit 3d9d64aa18)
Add a boolean parameter to minidumpsys(), to indicate a live dump. When
requested, take a snapshot of important global state, and pass this to
the machine-dependent minidump function. For now this includes the
kernel message buffer, and the bitset of pages to be dumped. Beyond
this, we don't take much action to protect the integrity of the dump
from changes in the running system.
A new function msgbuf_duplicate() is added for snapshotting the message
buffer. msgbuf_copy() is insufficient for this purpose since it marks
any new characters it finds as read.
For now, nothing can actually trigger a live minidump. A future patch
will add the mechanism for this. For simplicity and safety, live dumps
are disallowed for mips.
Reviewed by: markj, jhb
MFC after: 2 weeks
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D31993
(cherry picked from commit 588ab3c774)
The minidump code is written assuming that certain global state will not
change, and rightly so, since it executes from a kernel debugger
context. In order to support taking minidumps of a live system, we
should allow copies of relevant global state that is likely to change to
be passed as parameters to the minidumpsys() function.
This patch does the work of parameterizing this function, by adding a
struct minidumpstate argument. For now, this struct allows for copies of
the kernel message buffer, and the bitset that tracks which pages should
be dumped (vm_page_dump). Follow-up changes will actually make use of
these arguments.
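A hedged sketch of the new argument (the real definition may differ in
detail):
    struct minidumpstate {
            struct msgbuf   *msgbufp;       /* snapshot of the kernel message buffer */
            struct bitset   *dump_bitset;   /* snapshot of vm_page_dump */
    };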
Notably, dump_avail[] does not need a snapshot, since it is not expected
to change after system initialization.
The existing minidumpsys() definitions are renamed, and a thin MI
wrapper is added to kern_dump.c, which handles the construction of
the state struct. Thus, calling minidumpsys() remains as simple as
before.
Reviewed by: kib, markj, jhb
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D31989
(cherry picked from commit 1adebe3cd6)
This is needed to ensure that resolvers that reference global symbols
return correct results.
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
(cherry picked from commit b11e6fd75b)
Some upcoming changes will modify software checksum routines like
in_cksum() to operate using m_apply(), which uses the direct map to
access packet data for unmapped mbufs. This approach of course does not
work on platforms without a direct map, so we have to disallow the use
of unmapped mbufs on such platforms.
I believe this is the right tradeoff: we only configure KTLS on amd64
and arm64 today (and one KTLS consumer, NFS TLS, requires a direct map
already), and the use of unmapped mbufs with plain sendfile is a recent
optimization. If need be, m_apply() could be modified to create
CPU-private mappings of extpg mbuf pages as a fallback.
So, change mb_use_ext_pgs to be hard-wired to zero on systems without a
direct map. Note that PMAP_HAS_DMAP is not a compile-time constant on
some systems, so the default value of mb_use_ext_pgs has to be
determined during boot.
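A minimal sketch of deciding the default at boot, assuming a SYSINIT
hook (the real code in kern_mbuf.c may differ):
    static void
    mb_ext_pgs_init(void *arg __unused)
    {
            /* PMAP_HAS_DMAP can be a runtime check on some platforms. */
            if (!PMAP_HAS_DMAP)
                    mb_use_ext_pgs = 0;
    }
    SYSINIT(mb_ext_pgs, SI_SUB_MBUF, SI_ORDER_MIDDLE, mb_ext_pgs_init, NULL);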
Reviewed by: jhb
Discussed with: gallatin
Sponsored by: The FreeBSD Foundation
(cherry picked from commit fcaa890c44)
This will be used to break a deadlock in ZFS between the per-mountpoint
teardown lock and page busy locks. In particular, when purging data
from the page cache during dataset rollback, we want to avoid blocking
on the busy state of invalid pages since the busying thread may be
blocked on the teardown lock in zfs_getpages().
Add a helper, vn_pages_remove_valid(), for use by filesystems. Bump
__FreeBSD_version so that the OpenZFS port can make use of the new
helper.
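The new helper (hedged sketch): it removes only valid pages in the given
range, skipping invalid pages whose busy state we would otherwise block
on.
    void vn_pages_remove_valid(struct vnode *vp, vm_pindex_t start,
        vm_pindex_t end);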
PR: 258208
Reviewed by: avg, kib, sef
Tested by: pho (part of a larger patch)
Sponsored by: The FreeBSD Foundation
(cherry picked from commit d28af1abf0)
A core segment is bounded in size only by memory size. On 64-bit
architectures this means a segment can be much larger than 4GB.
However, compress_chunk() takes only a u_int, clamping segment size to
4GB-1, resulting in a truncated core. Everything else, including the
compressor internally, uses size_t, so use size_t at the boundary here.
This dates back to the original refactoring in 2015 (r279801 /
aa14e9b7).
PR: 260006
Sponsored by: Juniper Networks, Inc.
(cherry picked from commit 63cb9308a7)
TLS 1.0 records are encrypted as one continuous CBC chain where the
last block of the previous record is used as the IV for the next
record. As a result, TLS 1.0 records cannot be encrypted out of order
but must be encrypted as a FIFO.
If the later pages of a sendfile(2) request complete before the first
pages, then TLS records can be encrypted out of order. For TLS 1.1
and later this is fine, but this can break for TLS 1.0.
To cope, add a queue in each TLS session to hold TLS records that
contain valid unencrypted data but are waiting for an earlier TLS
record to be encrypted first.
- In ktls_enqueue(), check if a TLS record being queued is the next
record expected for a TLS 1.0 session. If not, it is placed in
sorted order in the pending_records queue in the TLS session.
If it is the next expected record, queue it for SW encryption like
normal. In addition, check if this new record (really a potential
batch of records) was holding up any previously queued records in
the pending_records queue. Any of those records that are now in
order are also placed on the queue for SW encryption (see the sketch
after this list).
- In ktls_destroy(), free any TLS records on the pending_records
queue. These mbufs are marked M_NOTREADY so were not freed when the
socket buffer was purged in sbdestroy(). Instead, they must be
freed explicitly.
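A hedged sketch of the ktls_enqueue() ordering check described above
(field and helper names are approximate):
    if (needs_fifo_encryption(tls) && m->m_epg_seqno != tls->next_seqno) {
            /* Out of order for TLS 1.0: park the record, sorted by
             * sequence number, until earlier records are encrypted. */
            pending_records_insert(tls, m);
            return;
    }
    /* In order: queue for SW encryption, then also move any
     * now-contiguous records from pending_records to the work queue. */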
Reviewed by: gallatin, markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D32381
(cherry picked from commit 9f03d2c001)
I missed updating this counter when rebasing the changes in
9c64fc4029 after the switch to
COUNTER_U64_DEFINE_EARLY in 1755b2b989.
Fixes: 9c64fc4029 Add Chacha20-Poly1305 as a KTLS cipher suite.
Sponsored by: Netflix
(cherry picked from commit 90972f0402)
Chacha20-Poly1305 for TLS is an AEAD cipher suite for both TLS 1.2 and
TLS 1.3 (RFCs 7905 and 8446). For both versions, Chacha20 uses the
server and client IVs as implicit nonces XORed with the record
sequence number to generate the per-record nonce, matching the
construction used with AES-GCM for TLS 1.3.
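A hedged sketch of the per-record nonce construction (the 64-bit record
sequence number, big-endian, is XORed into the last 8 bytes of the
12-byte implicit IV; names approximate):
    uint8_t nonce[12];
    int i;

    memcpy(nonce, tls->params.iv, sizeof(nonce));
    for (i = 0; i < 8; i++)
            nonce[4 + i] ^= (uint8_t)(seqno >> (56 - 8 * i));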
Reviewed by: gallatin
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D27839
(cherry picked from commit 9c64fc4029)
Create the initial pool of kprocs on demand when the first socket AIO
request is submitted instead. The pool of kprocs used for other AIO
requests is similarly created on first use.
Reviewed by: asomers
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D32468
(cherry picked from commit d1b6fef075)
This is how most SYSINITs are defined. Also annotate the dummy
parameter with __unused. No functional change intended.
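The conventional shape being switched to (sketch with a hypothetical
name):
    static void
    foo_init(void *dummy __unused)
    {
            /* ... initialization ... */
    }
    SYSINIT(foo_init, SI_SUB_DRIVERS, SI_ORDER_ANY, foo_init, NULL);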
Sponsored by: The FreeBSD Foundation
(cherry picked from commit 2287ced2f5)
Hyper-V wants to register its MSR-based timecounter during
SI_SUB_HYPERVISOR, before SI_SUB_LOCK, since an emulated 8254 may not be
available for DELAY(). So we cannot use MTX_SYSINIT to initialize the
timecounter lock.
PR: 259878
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
(cherry picked from commit 3339950117)
Change the 'period' argument to 'duration' and change its type to
sbintime_t so we can more easily express different durations.
Reviewed by: tsoome, glebius
Differential Revision: https://reviews.freebsd.org/D32619
(cherry picked from commit 072d5b98c4)
This define will later be used by upcoming TLS RX hardware offload patches.
No functional change intended.
Reviewed by: jhb@
Sponsored by: NVIDIA Networking
(cherry picked from commit dd31400c3c)
Remove page zeroing code from consumers and stop specifying
VM_ALLOC_NOOBJ. In a few places, also convert an allocation loop to
simply use VM_ALLOC_WAITOK.
Similarly, convert vm_page_alloc_domain() callers.
Note that callers are now responsible for assigning the pindex.
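A hedged sketch of the typical conversion:
    /* Before: */
    m = vm_page_alloc(NULL, 0, VM_ALLOC_NOOBJ | VM_ALLOC_WIRED | VM_ALLOC_ZERO);
    /* After: VM_ALLOC_ZERO now guarantees a zeroed page, and the caller
     * assigns the pindex if anything relies on it. */
    m = vm_page_alloc_noobj(VM_ALLOC_WIRED | VM_ALLOC_ZERO);
    m->pindex = 0;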
Reviewed by: alc, hselasky, kib
Sponsored by: The FreeBSD Foundation
(cherry picked from commit a4667e09e6)
The callout's c_time is always greater than or equal to the scheduled
time. It is also smaller than sbinuptime() and can't change while the
callback is running, so we can reliably use it instead of sbinuptime()
here. If there was a race and the callout was rescheduled to a later
time, the callback will be called again.
According to profiles, this saves ~5% of the timer interrupt time even
with a fast TSC timecounter.
MFC after: 1 month
(cherry picked from commit 6df1359e55)
Similar to commit 3ead60236f ("Generalize bus_space(9) and atomic(9)
sanitizer interceptors"), use a more generic scheme for interposing
sanitizer implementations of routines like memcpy().
No functional change intended.
Sponsored by: The FreeBSD Foundation
(cherry picked from commit ec8f1ea8d5)
Make it easy to define interceptors for new sanitizer runtimes, rather
than assuming KCSAN. Lay a bit of groundwork for KASAN and KMSAN.
When a sanitizer is compiled in, atomic(9) and bus_space(9) definitions
in atomic_san.h are used by default instead of the inline
implementations in the platform's atomic.h. These definitions are
implemented in the sanitizer runtime, which includes
machine/{atomic,bus}.h with SAN_RUNTIME defined to pull in the actual
implementations.
No functional change intended.
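A hedged sketch of the include-time dispatch this creates in
machine/atomic.h (guard macro name approximate):
    #if defined(SAN_NEEDS_INTERCEPTORS) && !defined(SAN_RUNTIME)
    #include <sys/atomic_san.h>  /* prototypes satisfied by the sanitizer runtime */
    #else
    /* ... the usual inline atomic(9) implementations ... */
    #endif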
Sponsored by: The FreeBSD Foundation
(cherry picked from commit 3ead60236f)
KASAN hooks will not generate reports if panicstr != NULL, but then
there is a window after the initial panic() call where another report
may be raised. This can happen if a false positive occurs; to simplify
debugging of such problems, avoid recursing.
Sponsored by: The FreeBSD Foundation
(cherry picked from commit ea3fbe0707)
It will be called during KLD unload to unpoison the redzones following
global variables. Otherwise, virtual address ranges previously used for
a KLD may be left tainted, triggering false positives when they are
recycled.
Reported by: pho
Sponsored by: The FreeBSD Foundation
(cherry picked from commit 588c7a06df)
When copying from the old buffer to the new buffer, we don't know the
requested size of the old allocation, but only the size of the
allocation provided by UMA. This value is "alloc". Because the copy
may access bytes in the old allocation's red zone, we must mark the full
allocation valid in the shadow map. Do so using the correct size.
Reported by: kp
Tested by: kp
Sponsored by: The FreeBSD Foundation
(cherry picked from commit 9a7c2de364)
- Reuse some REDZONE bits to keep track of the requested and allocated
sizes, and use that to provide red zones.
- As in UMA, disable memory trashing to avoid unnecessary CPU overhead.
Sponsored by: The FreeBSD Foundation
(cherry picked from commit 06a53ecf24)
We cache mapped execve argument buffers to avoid the overhead of TLB
shootdowns. Mark them invalid when they are freed to the cache.
Sponsored by: The FreeBSD Foundation
(cherry picked from commit f1c3adefd9)
vnodes are a bit special in that they may exist on per-CPU lists even
while free. Add a KASAN-only destructor that poisons regions of each
vnode that are not expected to be accessed after a free.
Sponsored by: The FreeBSD Foundation
(cherry picked from commit b261bb4057)
The idea behind KASAN is to use a region of memory to track the validity
of buffers in the kernel map. This region is the shadow map. The
compiler inserts calls to the KASAN runtime for every emitted load
and store, and the runtime uses the shadow map to decide whether the
access is valid. Various kernel allocators call kasan_mark() to update
the shadow map.
Since the shadow map tracks only accesses to the kernel map, accesses to
other kernel maps are not validated by KASAN. UMA_MD_SMALL_ALLOC is
disabled when KASAN is configured to reduce usage of the direct map.
Currently we have no mechanism to completely eliminate uses of the
direct map, so KASAN's coverage is not comprehensive.
The shadow map uses one byte per eight bytes in the kernel map. In
pmap_bootstrap() we create an initial set of page tables for the kernel
and preloaded data.
When pmap_growkernel() is called, we call kasan_shadow_map() to extend
the shadow map. kasan_shadow_map() uses pmap_kasan_enter() to allocate
memory for the shadow region and map it.
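A hedged sketch of the address-to-shadow translation (one shadow byte
covers eight bytes of the kernel map; constants as on amd64):
    static inline vm_offset_t
    kasan_md_addr_to_shad(vm_offset_t addr)
    {
            return (((addr - VM_MIN_KERNEL_ADDRESS) >>
                KASAN_SHADOW_SCALE_SHIFT) + KASAN_MIN_ADDRESS);
    }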
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29417
(cherry picked from commit 6faf45b34b)
KASAN enables the use of LLVM's AddressSanitizer in the kernel. This
feature makes use of compiler instrumentation to validate memory
accesses in the kernel and detect several types of bugs, including
use-after-frees and out-of-bounds accesses. It is particularly
effective when combined with test suites or syzkaller. KASAN has high
CPU and memory usage overhead and so is not suited for production
environments.
The runtime and pmap maintain a shadow of the kernel map to store
information about the validity of memory mapped at a given kernel
address.
The runtime implements a number of functions defined by the compiler
ABI. These are prefixed by __asan. The compiler emits calls to
__asan_load*() and __asan_store*() around memory accesses, and the
runtime consults the shadow map to determine whether a given access is
valid.
kasan_mark() is called by various kernel allocators to update state in
the shadow map. Updates to those allocators will come in subsequent
commits.
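A hedged sketch of an allocator's calls: kasan_mark(addr, valid_size,
total_size, code) marks [addr, addr + valid_size) valid and poisons the
remainder of [addr, addr + total_size) with the given code (the code
constants here are illustrative):
    kasan_mark(buf, usable, total, KASAN_GENERIC_REDZONE); /* after allocation */
    kasan_mark(buf, 0, total, KASAN_GENERIC_FREED);        /* after free */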
The runtime also defines various interceptors. Some low-level routines
are implemented in assembly and are thus not amenable to compiler
instrumentation. To handle this, the runtime implements these routines
on behalf of the rest of the kernel. The sanitizer implementation
validates memory accesses manually before handing off to the real
implementation.
The sanitizer in a KASAN-configured kernel can be disabled by setting
the loader tunable debug.kasan.disable=1.
Obtained from: NetBSD
Sponsored by: The FreeBSD Foundation
(cherry picked from commit 38da497a4d)