18493 Commits

Author SHA1 Message Date
Mark Johnston
3388bf06d7 Generalize sanitizer interceptors for memory and string routines
Similar to commit 3ead60236f ("Generalize bus_space(9) and atomic(9)
sanitizer interceptors"), use a more generic scheme for interposing
sanitizer implementations of routines like memcpy().

No functional change intended.

Sponsored by:	The FreeBSD Foundation

(cherry picked from commit ec8f1ea8d5)
2021-11-01 10:20:50 -04:00
Mark Johnston
bf0986b742 Generalize bus_space(9) and atomic(9) sanitizer interceptors
Make it easy to define interceptors for new sanitizer runtimes, rather
than assuming KCSAN.  Lay a bit of groundwork for KASAN and KMSAN.

When a sanitizer is compiled in, atomic(9) and bus_space(9) definitions
in atomic_san.h are used by default instead of the inline
implementations in the platform's atomic.h.  These definitions are
implemented in the sanitizer runtime, which includes
machine/{atomic,bus}.h with SAN_RUNTIME defined to pull in the actual
implementations.

No functional change intended.

Sponsored by:	The FreeBSD Foundation

(cherry picked from commit 3ead60236f)
2021-11-01 10:16:39 -04:00
Mark Johnston
252b6ae3e6 KASAN: Disable checking before triggering a panic
KASAN hooks will not generate reports if panicstr != NULL, but then
there is a window after the initial panic() call where another report
may be raised.  This can happen if a false positive occurs; to simplify
debugging of such problems, avoid recursing.

Sponsored by:	The FreeBSD Foundation

(cherry picked from commit ea3fbe0707)
2021-11-01 10:07:45 -04:00
Mark Johnston
224a01a342 KASAN: Implement __asan_unregister_globals()
It will be called during KLD unload to unpoison the redzones following
global variables.  Otherwise, virtual address ranges previously used for
a KLD may be left tainted, triggering false positives when they are
recycled.

Reported by:	pho
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit 588c7a06df)
2021-11-01 10:07:13 -04:00
Mark Johnston
28c338b342 realloc: Fix KASAN(9) shadow map updates
When copying from the old buffer to the new buffer, we don't know the
requested size of the old allocation, but only the size of the
allocation provided by UMA.  This value is "alloc".  Because the copy
may access bytes in the old allocation's red zone, we must mark the full
allocation valid in the shadow map.  Do so using the correct size.

Reported by:	kp
Tested by:	kp
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit 9a7c2de364)
2021-11-01 10:05:22 -04:00
Mark Johnston
9710b74dd0 malloc: Add state transitions for KASAN
- Reuse some REDZONE bits to keep track of the requested and allocated
  sizes, and use that to provide red zones.
- As in UMA, disable memory trashing to avoid unnecessary CPU overhead.

Sponsored by:	The FreeBSD Foundation

(cherry picked from commit 06a53ecf24)
2021-11-01 10:03:36 -04:00
Mark Johnston
2748ecec95 execve: Mark exec argument buffers
We cache mapped execve argument buffers to avoid the overhead of TLB
shootdowns.  Mark them invalid when they are freed to the cache.

Sponsored by:	The FreeBSD Foundation

(cherry picked from commit f1c3adefd9)
2021-11-01 10:03:28 -04:00
Mark Johnston
75306778f1 vfs: Add KASAN state transitions for vnodes
vnodes are a bit special in that they may exist on per-CPU lists even
while free.  Add a KASAN-only destructor that poisons regions of each
vnode that are not expected to be accessed after a free.

Sponsored by:	The FreeBSD Foundation

(cherry picked from commit b261bb4057)
2021-11-01 10:03:19 -04:00
Mark Johnston
a3d4c8e21d amd64: Implement a KASAN shadow map
The idea behind KASAN is to use a region of memory to track the validity
of buffers in the kernel map.  This region is the shadow map.  The
compiler inserts calls to the KASAN runtime for every emitted load
and store, and the runtime uses the shadow map to decide whether the
access is valid.  Various kernel allocators call kasan_mark() to update
the shadow map.

Since the shadow map tracks only accesses to the kernel map, accesses to
other kernel maps are not validated by KASAN.  UMA_MD_SMALL_ALLOC is
disabled when KASAN is configured to reduce usage of the direct map.
Currently we have no mechanism to completely eliminate uses of the
direct map, so KASAN's coverage is not comprehensive.

The shadow map uses one byte per eight bytes in the kernel map.  In
pmap_bootstrap() we create an initial set of page tables for the kernel
and preloaded data.

When pmap_growkernel() is called, we call kasan_shadow_map() to extend
the shadow map.  kasan_shadow_map() uses pmap_kasan_enter() to allocate
memory for the shadow region and map it.
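
The one-byte-per-eight-bytes ratio described above implies a simple address translation. The following is a minimal sketch of that arithmetic only, not the amd64 pmap code; the base addresses and the names shadow_addr() and shadow_size() are hypothetical and chosen for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* From the text: one shadow byte covers eight bytes of the kernel map. */
#define SHADOW_SCALE_SHIFT	3

/* Hypothetical base addresses; the real kernel layout differs. */
#define KERNEL_MAP_BASE		0x100000u
#define SHADOW_BASE		0x800000u

/* Translate an address in the tracked region to its shadow byte's address. */
static uintptr_t
shadow_addr(uintptr_t addr)
{
	return (SHADOW_BASE + ((addr - KERNEL_MAP_BASE) >> SHADOW_SCALE_SHIFT));
}

/* Number of shadow bytes needed to cover `size` bytes, rounding up. */
static size_t
shadow_size(size_t size)
{
	return ((size + (1u << SHADOW_SCALE_SHIFT) - 1) >> SHADOW_SCALE_SHIFT);
}
```

Under these assumptions, growing the tracked region by one 4096-byte page requires 512 additional shadow bytes, which suggests why the shadow map has to be extended whenever the kernel map grows.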

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29417

(cherry picked from commit 6faf45b34b)
2021-11-01 09:57:30 -04:00
Mark Johnston
48d2c7cc30 Add the KASAN runtime
KASAN enables the use of LLVM's AddressSanitizer in the kernel.  This
feature makes use of compiler instrumentation to validate memory
accesses in the kernel and detect several types of bugs, including
use-after-frees and out-of-bounds accesses.  It is particularly
effective when combined with test suites or syzkaller.  KASAN has high
CPU and memory usage overhead and so is not suited for production
environments.

The runtime and pmap maintain a shadow of the kernel map to store
information about the validity of memory mapped at a given kernel
address.

The runtime implements a number of functions defined by the compiler
ABI.  These are prefixed by __asan.  The compiler emits calls to
__asan_load*() and __asan_store*() around memory accesses, and the
runtime consults the shadow map to determine whether a given access is
valid.

kasan_mark() is called by various kernel allocators to update state in
the shadow map.  Updates to those allocators will come in subsequent
commits.

The runtime also defines various interceptors.  Some low-level routines
are implemented in assembly and are thus not amenable to compiler
instrumentation.  To handle this, the runtime implements these routines
on behalf of the rest of the kernel.  The sanitizer implementation
validates memory accesses manually before handing off to the real
implementation.
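
As a rough userland illustration of the interceptor idea, the sketch below checks a toy shadow array before handing off to the real memcpy(). Everything prefixed with toy_ is hypothetical, the shadow encoding (0 means valid, nonzero means poisoned) is a simplification, and returning NULL stands in for raising a report; it is not the kernel runtime's code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_GRANULE	8	/* bytes covered by one shadow byte */
#define TOY_ARENA_SIZE	64

/* The only memory this toy tracks, and its shadow (0 = valid). */
static unsigned char toy_arena[TOY_ARENA_SIZE];
static unsigned char toy_shadow[TOY_ARENA_SIZE / TOY_GRANULE];

/* Check every granule touched by [p, p + len) against the shadow. */
static bool
toy_access_ok(const void *p, size_t len)
{
	uintptr_t base = (uintptr_t)toy_arena, start = (uintptr_t)p;
	size_t first, i, last, off;

	/* Accesses outside the tracked arena are not validated. */
	if (len == 0 || start < base || start >= base + TOY_ARENA_SIZE)
		return (true);
	off = start - base;
	if (len > TOY_ARENA_SIZE - off)
		return (false);		/* runs off the end of the arena */
	first = off / TOY_GRANULE;
	last = (off + len - 1) / TOY_GRANULE;
	for (i = first; i <= last; i++) {
		if (toy_shadow[i] != 0)
			return (false);
	}
	return (true);
}

/* Interceptor: validate both buffers, then call the real routine. */
static void *
toy_memcpy(void *dst, const void *src, size_t len)
{
	if (!toy_access_ok(dst, len) || !toy_access_ok(src, len))
		return (NULL);		/* stands in for a KASAN report */
	return (memcpy(dst, src, len));
}
```

Marking toy_shadow[1] as poisoned makes any copy that touches bytes 8 through 15 of the arena fail the check, while copies confined to other granules proceed as usual.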

The sanitizer in a KASAN-configured kernel can be disabled by setting
the loader tunable debug.kasan.disable=1.

Obtained from:	NetBSD
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit 38da497a4d)
2021-11-01 09:56:31 -04:00
Mark Johnston
bb5c81812f timecounter: Lock the timecounter list
Timecounter registration is dynamic, i.e., there is no requirement that
timecounters must be registered during single-threaded boot.  Loadable
drivers may in principle register timecounters (which can be switched to
automatically).  Timecounters cannot be unregistered, though this could
be implemented.

Registered timecounters belong to a global linked list.  Add a mutex to
synchronize insertions and the traversals done by (mpsafe) sysctl
handlers.  No functional change intended.
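
The locking pattern can be sketched in userland with a pthread mutex standing in for the kernel mutex. The structure and function names below are hypothetical and only illustrate serializing insertions against traversals of a global list.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical, much-reduced stand-in for a timecounter entry. */
struct toy_timecounter {
	const char		*tc_name;
	int			 tc_quality;
	struct toy_timecounter	*tc_next;
};

static struct toy_timecounter *toy_tc_list;
static pthread_mutex_t toy_tc_lock = PTHREAD_MUTEX_INITIALIZER;

/* Insert at the head of the global list while holding the lock. */
static void
toy_tc_register(struct toy_timecounter *tc)
{
	pthread_mutex_lock(&toy_tc_lock);
	tc->tc_next = toy_tc_list;
	toy_tc_list = tc;
	pthread_mutex_unlock(&toy_tc_lock);
}

/* Traverse under the same lock, as a sysctl handler might. */
static size_t
toy_tc_count(void)
{
	struct toy_timecounter *tc;
	size_t n = 0;

	pthread_mutex_lock(&toy_tc_lock);
	for (tc = toy_tc_list; tc != NULL; tc = tc->tc_next)
		n++;
	pthread_mutex_unlock(&toy_tc_lock);
	return (n);
}
```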

Reviewed by:	imp, kib
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit 621fd9dcb2)
2021-11-01 09:20:11 -04:00
Mark Johnston
943421bdf7 signal: Add SIG_FOREACH and refactor issignal()
Add a SIG_FOREACH macro that can be used to iterate over a signal set.
This is a bit cleaner and more efficient than calling sig_ffs() in a
loop.  The implementation is based on BIT_FOREACH_ISSET(), except
that the bitset limbs are always 32 bits wide, and signal sets are
1-indexed rather than 0-indexed like bitset(9) sets.
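
To make the two differences concrete (32-bit limbs and 1-indexed signal numbers), here is a small userland sketch of such an iterator. It is not the kernel's SIG_FOREACH; all toy_ and TOY_ names are hypothetical, and ffs() from <strings.h> is used to find the lowest set bit.

```c
#include <stdint.h>
#include <strings.h>	/* ffs() */

#define TOY_SIG_WORDS	4	/* 4 x 32 bits = signals 1..128 */

typedef struct {
	uint32_t bits[TOY_SIG_WORDS];
} toy_sigset_t;

/* Iterator state: current limb index and its not-yet-visited bits. */
struct toy_sig_iter {
	int		word;
	uint32_t	pending;
};

static void
toy_sigaddset(toy_sigset_t *set, int sig)
{
	/* Signal numbers start at 1, bit positions at 0. */
	set->bits[(sig - 1) / 32] |= (uint32_t)1 << ((sig - 1) % 32);
}

/* Return the next set signal in ascending order, or 0 when done. */
static int
toy_sig_next(const toy_sigset_t *set, struct toy_sig_iter *it)
{
	int bit;

	while (it->pending == 0) {
		if (++it->word >= TOY_SIG_WORDS)
			return (0);
		it->pending = set->bits[it->word];
	}
	bit = ffs((int)it->pending) - 1;
	it->pending &= it->pending - 1;	/* clear the lowest set bit */
	return (it->word * 32 + bit + 1);
}

#define TOY_SIG_FOREACH(sig, set)					\
	for (struct toy_sig_iter it_ = { -1, 0 };			\
	    ((sig) = toy_sig_next((set), &it_)) != 0; )
```

Each limb is loaded once and its bits are consumed locally, so the loop does not rescan the whole set for every signal it yields.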

issignal() cannot really be modified to use SIG_FOREACH() directly.
Take this opportunity to split the function into two explicit loops.
I've always found this function hard to read and think that this change
is an improvement.

Remove sig_ffs(), nothing uses it now.

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit 81f2e9063d)
2021-11-01 09:20:11 -04:00
Gordon Bergling
6ad1c6a826 jail(8): Fix a few common typos in source code comments
- s/phyiscal/physical/

(cherry picked from commit 70de1003da)
2021-10-30 09:48:43 +02:00
Konstantin Belousov
c3c880be15 uipc_shm: silence warnings about write-only variables in largepage code
(cherry picked from commit 3b5331dd8d)
2021-10-27 03:24:41 +03:00
Konstantin Belousov
17c83b7670 sig_ast_checksusp(): mark the local p as __diagused
(cherry picked from commit 3d2778515a)
2021-10-27 03:24:40 +03:00
Konstantin Belousov
ec235e162a subr_firmware.c::unloadentry(): remove write-only variable
(cherry picked from commit 6776747a0e)
2021-10-27 03:24:40 +03:00
Konstantin Belousov
485cc5549c procctl: stop using SA_*LOCKED, define local enum
(cherry picked from commit c7f38a2df1)
2021-10-26 05:26:27 +03:00
Konstantin Belousov
59447a02f1 kern_procctl: skip zombies for process group operations
(cherry picked from commit 49db81aa05)
2021-10-26 05:26:27 +03:00
Konstantin Belousov
8589a3470d kern_procctl.c: use td->td_proc instead of curproc
(cherry picked from commit 3692877a6c)
2021-10-26 05:26:27 +03:00
Konstantin Belousov
c802b970a5 procctl: actually require debug privileges over target
(cherry picked from commit f5bb6e5a6d)
2021-10-26 05:26:27 +03:00
Konstantin Belousov
c7d4bd7477 procctl: make it possible to specify that some operations require debug privilege over the target
(cherry picked from commit 1c4dbee5dd)
2021-10-26 05:26:27 +03:00
Konstantin Belousov
84722e8171 sys_procctl(): zero the data buffer once, on syscall entry
(cherry picked from commit 32026f5983)
2021-10-26 05:26:27 +03:00
Konstantin Belousov
a89f144b0d sys_procctl(): use table data to do copyin/copyout
(cherry picked from commit 56d5323b4d)
2021-10-26 05:26:27 +03:00
Konstantin Belousov
38506cebc1 kern_procctl_single(): convert to use table data
(cherry picked from commit 68dc5b381a)
2021-10-26 05:26:26 +03:00
Konstantin Belousov
3c7f03c25f procctl: convert PDEATHSIG_CTL/STATUS to regular kern_procctl_single() cases
(cherry picked from commit 34f39a8c0e)
2021-10-26 05:26:26 +03:00
Konstantin Belousov
2e69ba48b9 procctl(2): add consistent shortcut P_ID:0 as curproc
(cherry picked from commit f833ab9dd1)
2021-10-26 05:26:26 +03:00
Konstantin Belousov
19eec36599 kern_procctl(): convert the function to be table-driven
(cherry picked from commit 7ae879b14a)
2021-10-26 05:26:26 +03:00
Konstantin Belousov
1d72df1c3d sys_procctl(2): remove sysproto and argused
(cherry picked from commit 31faa565ed)
2021-10-26 05:26:26 +03:00
Andrew Turner
f803dd1e24 Add pmap_change_prot on arm64
Support changing the protection of preloaded kernel modules by
implementing pmap_change_prot on arm64 and calling it from
preload_protect.

Reviewed by:	alc (previous version)
Sponsored by:	The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32026

(cherry picked from commit a85ce4ad72)
2021-10-25 14:46:44 +01:00
Jessica Clarke
af818612a5 riscv: Implement pmap_mapdev_attr
This is needed for LinuxKPI's _ioremap_attr. This reuses the generic
implementation introduced for aarch64, and itself requires implementing
pmap_kenter, which is trivial to do given riscv currently treats all
mapping attributes the same due to the Svpbmt extension not yet being
ratified and in hardware.

Reviewed by:	markj, mhorne
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D32445

(cherry picked from commit 682c00a6ce)
2021-10-24 19:51:10 +01:00
Alexander Motin
1e7091ac7c sched_ule(4): Fix possible significance loss.
Before this change, setting the kern.sched.interact sysctl above 32 gave
all interactive threads the identical priority of PRI_MIN_INTERACT due to
((PRI_MAX_INTERACT - PRI_MIN_INTERACT + 1) / sched_interact) turning
zero.  Setting the sysctl lower reduced the range of used priority
levels by up to half, which is not great either.

Changing the order of operations should fix the issue, always using the
full range of priorities, while overflow is impossible there since both
score and priority values are small.  While there, make the variables
unsigned, as they really are.
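
The truncation can be reproduced with plain integer arithmetic. In the sketch below the TOY_ constants are hypothetical, picked only so that the range is 32 levels (the threshold mentioned above), and the multiply-first expression is an assumption about what reordering the operations looks like, not a copy of the scheduler code.

```c
/* Hypothetical bounds giving a range of 32 priority levels. */
#define TOY_PRI_MIN_INTERACT	88u
#define TOY_PRI_MAX_INTERACT	119u
#define TOY_PRI_RANGE	(TOY_PRI_MAX_INTERACT - TOY_PRI_MIN_INTERACT + 1)

/* Divide first: the quotient truncates to 0 once interact exceeds 32. */
static unsigned int
toy_pri_divide_first(unsigned int score, unsigned int interact)
{
	return (TOY_PRI_MIN_INTERACT + (TOY_PRI_RANGE / interact) * score);
}

/* Multiply first: precision is kept until the final division. */
static unsigned int
toy_pri_multiply_first(unsigned int score, unsigned int interact)
{
	return (TOY_PRI_MIN_INTERACT + TOY_PRI_RANGE * score / interact);
}
```

With interact set to 40, the divide-first form returns 88 for every score from 0 to 39, while the multiply-first form spreads the same scores over 88 through 119. With interact set to 20, divide-first tops out at 107 and multiply-first at 118.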

MFC after:	1 month

(cherry picked from commit 1c119e173d)
2021-10-21 18:24:36 -04:00
Alexander Motin
11f14b3362 sched_ule(4): Fix hang with steal_thresh < 2.
e745d729be caused an infinite loop with interrupts disabled in the load
stealing code if steal_thresh was set below 2.  Such a configuration
should not generally be used, but it appears some people are using it to
work around some problems.

To fix the problem, explicitly pass to sched_highest() the minimum number
of transferable threads supported by the caller, instead of guessing.

MFC after:	25 days

(cherry picked from commit 08063e9f98)
2021-10-21 18:24:36 -04:00
Alexander Motin
b5919ea4e6 x86: Add NUMA nodes into CPU topology.
Depending on hardware, NUMA nodes may match last level caches, or
they may be above them (AMD Zen 2/3) or below (Intel Xeon w/ SNC).
This information is provided by ACPI instead of CPUID, and it is
provided for each CPU individually instead of mask widths, but
this code should be able to properly handle all the above cases.

This change should immediately allow idle stealing in sched_ule(4)
to prefer load from NUMA-local CPUs to remote ones when the node
does not match LLC.  Later we may think of how to better handle it
on sched_pickcpu() side.

MFC after:	1 month

(cherry picked from commit ef50d5fbc3)
2021-10-21 18:24:36 -04:00
Alexander Motin
a3d50144cc Fix build without SMP.
MFC after:	1 month

(cherry picked from commit 8db1669959)
2021-10-21 18:24:35 -04:00
Alexander Motin
4808bab7fa sched_ule(4): Improve long-term load balancer.
Before this change the long-term load balancer was unable to migrate
running threads, only ones waiting on run queues.  But with the growing
number of CPU cores it is now quite typical for a system to not have
many waiting threads.  At the same time, if by coincidence two
long-running CPU-bound threads ended up sharing the same physical CPU
core, they could suffer from the SMT penalty indefinitely, and the
load balancer couldn't help.

Improve that by teaching the load balancer to hint running threads
to migrate by marking them with TDF_NEEDRESCHED and the new TDF_PICKCPU
flag, making sched_pickcpu() search for a better CPU later, when
it is convenient.

Fix the CPU search logic when balancing to limit round-robin migrations
in case of almost equal load across the group of physical cores.  The
previous code bounced threads across the whole system, which should be
pretty bad for caches and NUMA affinity, while the additional fairness
was almost invisible, diminishing with the number of cores in the group.

MFC after:	1 month

(cherry picked from commit e745d729be)
2021-10-21 18:24:35 -04:00
Alexander Motin
fa226878a5 sbuf(9): Microoptimize sbuf_put_byte()
This function is actively used by sbuf_vprintf(), so this simple
inlining halves the time of kern.geom.confxml generation.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.

(cherry picked from commit 7835b2cb4a)
2021-10-21 18:24:29 -04:00
John Baldwin
58d69f4ecf crypto: Add a new type of crypto buffer for a single mbuf.
This is intended for use in KTLS transmit where each TLS record is
described by a single mbuf that is itself queued in the socket buffer.
Using the existing CRYPTO_BUF_MBUF would result in
bus_dmamap_load_crp() walking additional mbufs in the socket buffer
that are not relevant, generating an S/G list that potentially
exceeds the limit of the tag (while also wasting CPU cycles).

Reviewed by:	markj
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D30136

(cherry picked from commit 883a0196b6)
2021-10-21 08:51:26 -07:00
John Baldwin
60b9ce7245 sglist: Add sglist_append_single_mbuf().
This function appends the contents of a single mbuf to an sglist
rather than an entire mbuf chain.

Reviewed by:	gallatin, markj
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D30135

(cherry picked from commit 6663f8a23e)
2021-10-21 08:51:26 -07:00
John Baldwin
da557f2fe6 Rename m_unmappedtouio() to m_unmapped_uiomove().
This function doesn't only copy data into a uio; it is a variant of
uiomove() similar to uiomove_fromphys().

Reviewed by:	gallatin, markj
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D30444

(cherry picked from commit aa341db39b)
2021-10-21 08:51:26 -07:00
John Baldwin
8efc88d0d6 Extend m_copyback() to support unmapped mbufs.
Reviewed by:	gallatin, markj
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D30133

(cherry picked from commit 3f9dac85cc)
2021-10-21 08:51:25 -07:00
John Baldwin
2ba824366c Extend m_apply() to support unmapped mbufs.
m_apply() invokes the callback function separately on each segment of
an unmapped mbuf: the TLS header, individual pages, and the TLS
trailer.
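
The per-segment behavior can be pictured with a reduced userland model. The sketch below is not the mbuf code: struct toy_unmapped, toy_apply(), and the callback type are hypothetical, and the offset and length arguments that m_apply() takes are left out for brevity.

```c
#include <stddef.h>

#define TOY_MAX_PAGES	4

/* Hypothetical model of an unmapped buffer: header, pages, trailer. */
struct toy_unmapped {
	const unsigned char	*hdr;
	size_t			 hdr_len;
	const unsigned char	*pages[TOY_MAX_PAGES];
	size_t			 page_len[TOY_MAX_PAGES];
	int			 npages;
	const unsigned char	*trail;
	size_t			 trail_len;
};

typedef int (*toy_apply_cb)(void *arg, const void *data, size_t len);

/* Invoke cb once per segment; stop at the first nonzero return. */
static int
toy_apply(const struct toy_unmapped *m, toy_apply_cb cb, void *arg)
{
	int error, i;

	if (m->hdr_len != 0 && (error = cb(arg, m->hdr, m->hdr_len)) != 0)
		return (error);
	for (i = 0; i < m->npages; i++) {
		if ((error = cb(arg, m->pages[i], m->page_len[i])) != 0)
			return (error);
	}
	if (m->trail_len != 0 &&
	    (error = cb(arg, m->trail, m->trail_len)) != 0)
		return (error);
	return (0);
}

/* Example callback: count invocations and total bytes seen. */
struct toy_stats {
	size_t	bytes;
	int	calls;
};

static int
toy_count_cb(void *arg, const void *data, size_t len)
{
	struct toy_stats *st = arg;

	(void)data;
	st->bytes += len;
	st->calls++;
	return (0);
}
```

A buffer with a header, two pages, and a trailer would therefore produce four callback invocations in this model, one per segment.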

Reviewed by:	markj
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D30132

(cherry picked from commit 3c7a01d773)
2021-10-21 08:51:25 -07:00
Mark Johnston
348fc38fd5 mount: Check for !VDIR mount points before handling -o emptydir
To implement -o emptydir, vfs_emptydir() checks that the passed
directory is empty.  This should be done after checking whether the
vnode is of type VDIR, though, or vfs_emptydir() may end up calling
VOP_READDIR on a non-directory.

Reported by:	syzbot+4006732c69fb0f792b2c@syzkaller.appspotmail.com
Reviewed by:	kib, imp
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit 03d5820f73)
2021-10-19 20:53:33 -04:00
John Baldwin
59a5099ec1 Document kern.log_wakeups_per_second.
PR:		148680

(cherry picked from commit c51e4962a3)
2021-10-19 16:53:26 -07:00
Brooks Davis
3b55b61371 selsocket: handle sopoll() errors correctly
Without this change, unmounting smbfs filesystems with an INVARIANTS
kernel would panic after 10e64782ed.

PR:		253079
Found by:	markj
Reviewed by:	markj, jhb
Obtained from:	CheriBSD
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D32492

(cherry picked from commit 04c91ac48a)
2021-10-20 00:19:57 +01:00
Brooks Davis
fe388671ac makesyscalls.lua: add a CAPENABLED flag
The CAPENABLED flag indicates that the syscall can be used in capsicum
capability mode.  It is intended to replace capabilities.conf.

Reviewed by:	kevans, emaste
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D31349

(cherry picked from commit 6945df3fff)
2021-10-20 00:19:56 +01:00
Brooks Davis
81184e92e0 makesyscalls.lua: Add a new syscall type: RESERVED
RESERVED syscall numbers are reserved for local/vendor use.  RESERVED is
identical to UNIMPL except that comments are ignored.

Reviewed by:	kevans
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D27988

(cherry picked from commit 119fa6ee8a)
2021-10-20 00:19:56 +01:00
Mark Johnston
54a01b5326 vfs: Permit unix sockets to be opened with O_PATH
As with FIFOs, a path descriptor for a unix socket cannot be used with
kevent().

In principle connectat(2) and bindat(2) could be modified to support an
AT_EMPTY_PATH-like mode which operates on the socket referenced by an
O_PATH fd referencing a unix socket.  That would eliminate the path
length limit imposed by sockaddr_un.

Update O_PATH tests.

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit 2bd9826995)
2021-10-17 17:15:44 -04:00
Mark Johnston
66f5f95864 timecounter: Let kern.timecounter.stepwarnings be set as a tunable
(cherry picked from commit fa9da1f590)
2021-10-16 09:31:19 -04:00
Greg V
1625e2db22 O_PATH: allow vfs_extattr syscalls
(cherry picked from commit 98dae405de)
2021-10-16 16:01:47 +03:00
Konstantin Belousov
f824a0d090 Style
(cherry picked from commit 1adebca1fc)
2021-10-15 23:39:07 +03:00