Commit Graph

18633 Commits

Author SHA1 Message Date
Alexander Motin
d97bfe3ff8 bus: Cleanup device_probe_child()
When a device driver's probe method returns 0, i.e. absolute priority, do
not remove its class from the device just to set it back a few lines
later; that may change the device unit number, etc., after which
we'd better call the probe again.

If during search we found some driver with absolute priority, we do
not need to set device driver and class since we haven't removed them
before.

It should not happen, but if second probe method call failed, remove
the driver and possibly the class from the device as it was when we
started.

Reviewed by:	imp, jhb
Differential Revision:	https://reviews.freebsd.org/D32125

(cherry picked from commit f73c2bbf81)
2022-01-04 12:10:55 -05:00
Warner Losh
dd39806d1d bus: Fix LINT / BUS_DEBUG build
Fix 0389e9be63 for the LINT build. Removed an arg only from code
under BUS_DEBUG w/o rebuilding LINT...

Sponsored by:		Netflix
Fixes: 0389e9be63

(cherry picked from commit 67a9e76da6)
2022-01-04 12:10:42 -05:00
Warner Losh
7be46aeea1 bus: retire DF_REBID
I did DF_REBID to allow for 'hoover' drivers that would attach to
otherwise unattached devices in the tree. This notion didn't catch on as
it was tricky to make work well and it was easier to just publish a /dev
node of some flavor by the parent device. It's been nothing but dead
weight for a long time.

Reviewed by:		mav
Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D32056

(cherry picked from commit 0389e9be63)
2022-01-04 12:00:53 -05:00
Konstantin Belousov
e163ee6ef5 Add kern.elf{32,64}.vdso knobs to enable/disable vdso preloading
(cherry picked from commit eb02958748)
2022-01-02 18:43:01 +02:00
Konstantin Belousov
e85becdf19 vdso for ia32 on amd64
(cherry picked from commit 98c8b62524)
2022-01-02 18:43:01 +02:00
Justin Hibbits
8851242d9d Fix assert check for SV_DSO_SIG in exec_sysvec_init_secondary()
(cherry picked from commit d2de68811a)
2022-01-02 18:43:01 +02:00
Konstantin Belousov
d00ebd9b9c Pass vdso address to userspace
(cherry picked from commit 01c77a436e)
2022-01-02 18:43:01 +02:00
Konstantin Belousov
203bcad731 amd64: wrap 64bit sigtramp into vdso
(cherry picked from commit ab4524b3d7)
2022-01-02 18:43:01 +02:00
Dmitry Chagin
25904983e8 Remove bogus cast from exec_sysvec_init().
(cherry picked from commit b39fa4770d)
2022-01-02 18:43:01 +02:00
Dmitry Chagin
0ceca7923f Modify exec_sysvec_init() to allow non-native abi to setup their sysentvecs.
(cherry picked from commit 21629e2a45)
2022-01-02 18:43:00 +02:00
Konstantin Belousov
0af1cbb038 itimers: strip unused bits from struct itimer and struct itimers
(cherry picked from commit 23ba59fbfb)
2022-01-02 18:43:00 +02:00
Konstantin Belousov
840b422b7c itimers_alloc: no need to initialize its_timers array
(cherry picked from commit 3f15708478)
2022-01-02 18:43:00 +02:00
Mark Johnston
2f9116e480 fd: Initialize more export_fd_buf fields in kern_proc_cwd_out()
In particular, we need to initialize efbuf->flags, since
export_vnode_to_sb() loads that field.  This was mostly harmless since
the flag only determines whether the output kinfo_file is packed, and
KERN_PROC_CWD only ever emits a single kinfo_file anyway.

Sponsored by:	The FreeBSD Foundation

(cherry picked from commit 327060bd77)
2021-12-31 09:26:07 -05:00
Mark Johnston
7a06849669 unix: Increase the default datagram recv buffer size
syslog(3) was recently changed to support larger messages, up to 8KB.
Our syslogd handles this fine, as it adjusts /dev/log's recv buffer to a
large size.  rsyslog, however, uses the system default of 4KB.  This
leads to problems since our syslog(3) retries indefinitely when a send()
returns ENOBUFS, but if the message is large enough this will never
succeed.

Increase the default recv buffer size for datagram sockets to support
8KB syslog messages without requiring the logging daemon to adjust its
buffers.

PR:		260126
Reviewed by:	asomers
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit d157f2627b)
2021-12-31 09:25:54 -05:00
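
The commit above notes that a logging daemon can enlarge its own receive buffer instead of relying on the system default. A minimal sketch of that adjustment using the standard socket API follows; the helper names and the buffer size a caller might pass are assumptions for illustration and are not taken from syslogd or rsyslog.

```c
#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: create an unbound AF_UNIX datagram socket and ask
 * the kernel for a receive buffer of at least `want` bytes.
 * Returns the descriptor, or -1 on failure.
 */
int
open_dgram_with_rcvbuf(int want)
{
	int fd;

	assert(want > 0);
	fd = socket(AF_UNIX, SOCK_DGRAM, 0);
	if (fd == -1)
		return (-1);
	if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &want, sizeof(want)) == -1) {
		close(fd);
		return (-1);
	}
	return (fd);
}

/* Report the receive buffer size the kernel actually granted, or -1. */
int
current_rcvbuf(int fd)
{
	int got = 0;
	socklen_t len = sizeof(got);

	if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &got, &len) == -1)
		return (-1);
	return (got);
}
```

A daemon that expects 8KB messages would request a buffer comfortably larger than one message; the kernel may round the granted size, so checking it with current_rcvbuf() is the safer habit.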
Bjoern A. Zeeb
a34668185b modules: increase MAXMODNAME and provide backward compat
With various firmware files used by graphics and wireless drivers
we are exceeding the current 32 character module name (file path
in kldxref) length.
In order to overcome this issue bump it to the maximum path length
for the next version.
To be able to MFC provide backward compat support for another version
of the struct as the offsets for the second half change due to the
array size increase.

MAXMODNAME being defined to MAXPATHLEN needs param.h to be
included first.  With only 7 modules (or LinuxKPI module.h) not
doing that adjust them rather than including param.h in module.h [1].

Reported by:	Greg V (greg unrelenting.technology)
Sponsored by:	The FreeBSD Foundation
Suggested by:	imp [1]
Reviewed by:	imp (and others to different level)
Differential Revision:	https://reviews.freebsd.org/D32383

(cherry picked from commit df38ada293)
2021-12-30 18:26:18 +00:00
Dawid Gorecki
532d925b6f kern_exec: Add kern.stacktop sysctl.
With stack gap enabled, the top of the stack is moved down by a random
number of bytes. Because of that, some multithreaded applications
which use the kern.usrstack sysctl to calculate the addresses of stacks
for their threads can fail. Add the kern.stacktop sysctl, which can be
used to retrieve the address of the stack after the stack gap is applied
to it. It returns a value identical to kern.usrstack for processes which
have no stack gap.

Reviewed by: kib
Obtained from: Semihalf
Sponsored by: Stormshield
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D31897

(cherry picked from commit a97d697122)
2021-12-30 16:25:22 +01:00
Dawid Gorecki
16a900ae02 setrlimit: Take stack gap into account.
Calling setrlimit with stack gap enabled and with low values of stack
resource limit often caused the program to abort immediately after
exiting the syscall. This happened due to the fact that the resource
limit was calculated assuming that the stack started at sv_usrstack,
while with stack gap enabled the stack is moved by a random number
of bytes.

Save information about stack size in struct vmspace and adjust the
rlim_cur value. If the sum of rlim_cur and the stack gap is bigger than
rlim_max, then the value is truncated to rlim_max.

PR: 253208
Reviewed by: kib
Obtained from: Semihalf
Sponsored by: Stormshield
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D31516

(cherry picked from commit 889b56c8cd)
2021-12-30 16:24:59 +01:00
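
The adjustment described in the setrlimit commit above amounts to a saturating addition. The sketch below is only an illustration of that arithmetic; stack_limit_with_gap() is a hypothetical name and the kernel's actual code may be organized differently.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: widen the requested soft stack limit by the size
 * of the stack gap, truncating the result to the hard limit.
 */
uint64_t
stack_limit_with_gap(uint64_t rlim_cur, uint64_t gap, uint64_t rlim_max)
{
	assert(rlim_cur <= rlim_max);
	/* Compare via subtraction so rlim_cur + gap cannot overflow. */
	if (gap > rlim_max - rlim_cur)
		return (rlim_max);
	return (rlim_cur + gap);
}
```

For example, a soft limit of 1000 bytes with a 200 byte gap and a hard limit of 4096 yields 1200, while a soft limit of 4000 with the same gap is truncated to 4096.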
Colin Percival
f388ad85c2 Add _sleep to TSLOG
Most of the nvme initialization time in my tests is being spent here
(via pause_sbt).

Sponsored by:	https://www.patreon.com/cperciva

(cherry picked from commit bd11e253a9)
2021-12-29 14:53:19 -08:00
Colin Percival
89a9852f32 MFC loader+userland TSLOG support
stand/common: Add file_addbuf()
libsa: Add support for timestamp logging (tslog)
stand/common: Add support for timestamp logging (tslog)
i386/loader: Call tslog_init
efi/loader: Call tslog_init (+ bugfix)
stand/common command_boot: Pass tslog to kernel
kern_tslog: Include tslog data from loader
loader: Use tslog to instrument some functions
Add userland boot profiling to TSLOG (+ bugfix)

Sponsored by:	https://www.patreon.com/cperciva

(cherry picked from commit 60a978bec9)
(cherry picked from commit e193d3ba33)
(cherry picked from commit c8dfc327db)
(cherry picked from commit c4b65e954f)
(cherry picked from commit f49381ccb6)
(cherry picked from commit 537a44bf28)
(cherry picked from commit fe51b5a76d)
(cherry picked from commit 313724bab9)
(cherry picked from commit 46dd801acb)
(cherry picked from commit 52e125c2bd)
(cherry picked from commit 19e4f2f289)
2021-12-29 14:53:18 -08:00
Konstantin Belousov
bafbbd46ca Regen
2021-12-20 02:29:11 +02:00
Konstantin Belousov
1791debf4a swapoff: add one more variant of the syscall
For MFC, COMPAT_FREEBSD13 braces were removed.

(cherry picked from commit 5346570276)
2021-12-20 02:29:11 +02:00
Florian Walpen
30c3a5f248 Add idle priority scheduling privilege group to MAC/priority
(cherry picked from commit a9545eede4)
2021-12-19 04:42:51 +02:00
Florian Walpen
5719dba765 Add PRIV_SCHED_IDPRIO
(cherry picked from commit a20a2450cd)
2021-12-19 04:42:51 +02:00
Konstantin Belousov
c35b12cc7d exec_elf: use intermediate u_long variable to correct mismatched type
(cherry picked from commit e499988f0c)
2021-12-19 04:42:51 +02:00
Konstantin Belousov
f1b1fa3505 imgact_elf: avoid mapsz overflow
(cherry picked from commit bf83941638)
2021-12-19 04:42:51 +02:00
Konstantin Belousov
f1f4d58a6b imgact_elf: check that the alignment of PT_LOAD segment is power of two
(cherry picked from commit 36df8f540f)
2021-12-19 04:42:51 +02:00
Konstantin Belousov
681e834b02 imgact_elf: exclude invalid alignment requests
(cherry picked from commit 714d6d09b5)
2021-12-19 04:42:51 +02:00
Konstantin Belousov
29ef89ab1b rnd_elf: add comment explaining the interface
(cherry picked from commit a4007ae10c)
2021-12-19 04:42:51 +02:00
Konstantin Belousov
e0dc92e185 elf image activator: convert asserts into errors
(cherry picked from commit 9cf78c1cf6)
2021-12-19 04:42:51 +02:00
Konstantin Belousov
5b1800d62f exec_elf: assert that the image vnode is still locked on return
(cherry picked from commit b4b20492cd)
2021-12-19 04:42:50 +02:00
Konstantin Belousov
995954f0dd Style
(cherry picked from commit 88dd7a0a39)
2021-12-19 04:42:50 +02:00
Rick Macklem
18f5b477ee vfs: Add "ioflag" and "cred" arguments to VOP_ALLOCATE
When the NFSv4.2 server does a VOP_ALLOCATE(), it needs
the operation to be done for the RPC's credential and not
td_ucred. It also needs the writing to be done synchronously.

This patch adds "ioflag" and "cred" arguments to VOP_ALLOCATE()
and modifies vop_stdallocate() to use these arguments.

The VOP_ALLOCATE.9 man page will be patched separately.

(cherry picked from commit f0c9847a6c)
2021-12-18 14:30:25 -08:00
Alexander Motin
d87f1e2e36 Make msgbuf_peekbytes() not return leading zeroes.
Introduce new MSGBUF_WRAP flag, indicating that buffer has wrapped
at least once and does not keep zeroes from the last msgbuf_clear().
It allows msgbuf_peekbytes() to return only real data, not requiring
every consumer to trim the leading zeroes after doing pointless copy.
The most visible effect is that kern.msgbuf sysctl now always returns
proper zero-terminated string, not only after the first buffer wrap.

MFC after:	1 week
Sponsored by:	iXsystems, Inc.

(cherry picked from commit 81dc00331d)
2021-12-17 20:36:23 -05:00
Andriy Gapon
9c0050b0a6 kern_tc: unify timecounter to bintime delta conversion
There are two places where we convert from a timecounter delta to
a bintime delta: tc_windup and bintime_off.
Both functions use the same calculations when the timecounter delta is
small.  But for a large delta (greater than approximately an equivalent
of 1 second) the calculations were different.  Both functions use
approximate calculations based on th_scale that avoid division.  Both
produce values slightly greater than a true value, calculated with
division by tc_frequency, would be.  tc_windup is slightly more
accurate, so its result is closer to the true value and, thus, smaller
than bintime_off result.

As a consequence there can be a jump back in time when time hands are
switched after a long period of time (a large delta).  Just before the
switch the time would be calculated with a large delta from
th_offset_count in bintime_off.  tc_windup does the switch using its own
calculations of a new th_offset using the large delta.  As explained
earlier, the new th_offset may end up being less than the previously
produced binuptime.  So, for a period of time new binuptime values may
be "back in time" compared to values just before the switch.

Such a jump must never happen.  All the code assumes that the uptime is
monotonically nondecreasing and some code works incorrectly when that
assumption is broken.  For example, we have observed sleepq_timeout()
ignoring a timeout when the sbinuptime value obtained by the callout
code was greater than the expiration value, but the sbinuptime obtained
in sleepq_timeout() was less than it.  In that case the target thread
would never get woken up.

The unified calculations should ensure the monotonic property of the
uptime.

The problem is quite rare as normally tc_windup should be called HZ
times per second (typically 1000 or 100).  But it may happen in VMs on
very busy hypervisors where a VM's virtual CPU may not get an execution
time slot for a second or more.

Reviewed by:	kib
Sponsored by:	Panzura LLC

(cherry picked from commit 3d9d64aa18)
2021-12-17 09:28:24 +02:00
Konstantin Belousov
4305fd126c Kernel linkers: add emergency sysctl to restore old behavior
PR:	207898

(cherry picked from commit ecd8245e0d)
2021-12-15 03:41:29 +02:00
Konstantin Belousov
da536d64b7 kernel linker: do not read debug symbol tables for non-debug symbols
PR:	207898

(cherry picked from commit 95c20faf11)
2021-12-15 03:41:29 +02:00
Konstantin Belousov
b23c24558b linker_debug_symbol_values(): use proper linker interface to get debug values
(cherry picked from commit 72f6662662)
2021-12-15 03:41:29 +02:00
Konstantin Belousov
8cca53de0c Style
(cherry picked from commit c37c6f994f)
2021-12-13 02:58:22 +02:00
Konstantin Belousov
9258e9e3a8 fcntl(2): add F_KINFO operation
(cherry picked from commit 794d3e8e63)
2021-12-13 02:58:22 +02:00
Konstantin Belousov
c14695417b Add declaration for static export_file_to_kinfo()
(cherry picked from commit 6e51d61a96)
2021-12-13 02:58:22 +02:00
Konstantin Belousov
9e75c46527 imgact_aout.c: some style
(cherry picked from commit 290e05dde0)
2021-12-10 04:32:18 +02:00
Konstantin Belousov
aa1d548128 imgact_aout.c: We do not expect the aout support to be ported
(cherry picked from commit 9da5257e1c)
2021-12-10 04:32:18 +02:00
Wuyang Chung
815c26d4e1 Correct the name of the second parameter of biowait to wmesg
This parameter is passed directly to msleep, and the name of the msleep
parameter is wmesg. Make them match.

Pull Request: https://github.com/freebsd/freebsd-src/pull/557

(cherry picked from commit 8587d75255)
2021-12-06 08:55:55 -07:00
Konstantin Belousov
aebdfa9515 Expand comment explaining reasons for automatic swapoff on shutdown
(cherry picked from commit a5c2d59ed3)
2021-12-06 02:29:43 +02:00
Konstantin Belousov
c1abd6bd3d shutdown: unmount filesystems after swapoff
(cherry picked from commit 08bb51f8d6)
2021-12-06 02:29:43 +02:00
Gordon Bergling
33daf0eb60 kern: Correct a typo in a sysctl description
- s/osbolete/obsolete/

(cherry picked from commit fe96f62d61)
2021-12-05 10:07:36 +01:00
Konstantin Belousov
2c52eba4f4 linker_kldload_busy(): allow recursion
PR:	259748

(cherry picked from commit 4f924a786a)
2021-12-05 03:02:57 +02:00
Gordon Bergling
2c68c93e2e vfs: Fix a typo in a sysctl description
- s/dependecies/dependencies/

(cherry picked from commit b6f4818a7e)
2021-12-03 16:51:32 +01:00
Mitchell Horne
86aa46c79c Allow minidumps to be performed on the live system
Add a boolean parameter to minidumpsys(), to indicate a live dump. When
requested, take a snapshot of important global state, and pass this to
the machine-dependent minidump function. For now this includes the
kernel message buffer, and the bitset of pages to be dumped. Beyond
this, we don't take much action to protect the integrity of the dump
from changes in the running system.

A new function msgbuf_duplicate() is added for snapshotting the message
buffer. msgbuf_copy() is insufficient for this purpose since it marks
any new characters it finds as read.

For now, nothing can actually trigger a live minidump. A future patch
will add the mechanism for this. For simplicity and safety, live dumps
are disallowed for mips.

Reviewed by:	markj, jhb
MFC after:	2 weeks
Sponsored by:	Juniper Networks, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D31993

(cherry picked from commit 588ab3c774)
2021-12-03 10:02:03 -04:00
Mitchell Horne
eb2ea57ef1 minidump: Parameterize minidumpsys()
The minidump code is written assuming that certain global state will not
change, and rightly so, since it executes from a kernel debugger
context. In order to support taking minidumps of a live system, we
should allow copies of relevant global state that is likely to change to
be passed as parameters to the minidumpsys() function.

This patch does the work of parameterizing this function, by adding a
struct minidumpstate argument. For now, this struct allows for copies of
the kernel message buffer, and the bitset that tracks which pages should
be dumped (vm_page_dump). Follow-up changes will actually make use of
these arguments.

Notably, dump_avail[] does not need a snapshot, since it is not expected
to change after system initialization.

The existing minidumpsys() definitions are renamed, and a thin MI
wrapper is added to kern_dump.c, which handles the construction of
the state struct. Thus, calling minidumpsys() remains as simple as
before.

Reviewed by:	kib, markj, jhb
Sponsored by:	Juniper Networks, Inc.
Sponsored by:	Klara, Inc.
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D31989

(cherry picked from commit 1adebe3cd6)
2021-12-03 10:02:03 -04:00
Mark Johnston
2e837779a7 link_elf_obj: Process global ifunc relocs after other global relocs
This is needed to ensure that resolvers that reference global symbols
return correct results.

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit b11e6fd75b)
2021-12-02 09:15:15 -05:00
Mark Johnston
02e3eb8d48 mbuf: Only allow extpg mbufs if the system has a direct map
Some upcoming changes will modify software checksum routines like
in_cksum() to operate using m_apply(), which uses the direct map to
access packet data for unmapped mbufs.  This approach of course does not
work on platforms without a direct map, so we have to disallow the use
of unmapped mbufs on such platforms.

I believe this is the right tradeoff: we only configure KTLS on amd64
and arm64 today (and one KTLS consumer, NFS TLS, requires a direct map
already), and the use of unmapped mbufs with plain sendfile is a recent
optimization.  If need be, m_apply() could be modified to create
CPU-private mappings of extpg mbuf pages as a fallback.

So, change mb_use_ext_pgs to be hard-wired to zero on systems without a
direct map.  Note that PMAP_HAS_DMAP is not a compile-time constant on
some systems, so the default value of mb_use_ext_pgs has to be
determined during boot.

Reviewed by:	jhb
Discussed with:	gallatin
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit fcaa890c44)
2021-11-29 20:34:54 -05:00
Mark Johnston
fdd27db348 vm: Add a mode to vm_object_page_remove() which skips invalid pages
This will be used to break a deadlock in ZFS between the per-mountpoint
teardown lock and page busy locks.  In particular, when purging data
from the page cache during dataset rollback, we want to avoid blocking
on the busy state of invalid pages since the busying thread may be
blocked on the teardown lock in zfs_getpages().

Add a helper, vn_pages_remove_valid(), for use by filesystems.  Bump
__FreeBSD_version so that the OpenZFS port can make use of the new
helper.

PR:		258208
Reviewed by:	avg, kib, sef
Tested by:	pho (part of a larger patch)
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit d28af1abf0)
2021-11-29 09:09:28 -05:00
Justin Hibbits
2949655427 Fix segment size in compressing core dumps
A core segment is bounded in size only by memory size.  On 64-bit
architectures this means a segment can be much larger than 4GB.
However, compress_chunk() takes only a u_int, clamping segment size to
4GB-1, resulting in a truncated core.  Everything else, including the
compressor internally, uses size_t, so use size_t at the boundary here.

This dates back to the original refactor back in 2015 (r279801 /
aa14e9b7).

PR:		260006
Sponsored by:	Juniper Networks, Inc.

(cherry picked from commit 63cb9308a7)
2021-11-29 09:08:11 -05:00
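
The core dump fix above is an instance of a general C hazard: a 64-bit length passed through a 32-bit unsigned parameter keeps only its low 32 bits. The following sketch shows the effect with fixed-width types; the function names are hypothetical and do not correspond to compress_chunk() itself.

```c
#include <assert.h>
#include <stdint.h>

/* Models a callee whose length parameter is only 32 bits wide. */
uint64_t
len_through_u32(uint64_t len)
{
	uint32_t narrow = (uint32_t)len;	/* high 32 bits are discarded */

	return (narrow);
}

/* Number of bytes that silently go missing for a given segment length. */
uint64_t
bytes_lost(uint64_t len)
{
	uint64_t seen = len_through_u32(len);

	assert(seen <= len);
	return (len - seen);
}
```

A 5 GiB segment seen through the narrow parameter appears to be 1 GiB long, so 4 GiB would be missing from the output. Widening the parameter to size_t, as the commit does, removes the conversion.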
Gordon Bergling
c8f21cc79f sched_ule(4): Fix two typos in source code comments
- s/conditons/conditions/
- s/unconditonally/unconditionally/

(cherry picked from commit 15b5c347f1)
2021-11-28 12:41:11 +01:00
John Baldwin
94280c5811 ktls: Reject some invalid cipher suites.
- Reject AES-CBC cipher suites for TLS 1.0 and TLS 1.1 using auth
  algorithms other than SHA1-HMAC.

- Reject AES-GCM cipher suites for TLS versions older than 1.2.

Reviewed by:	markj
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D32842

(cherry picked from commit 900a28fe33)
2021-11-23 15:11:53 -08:00
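
The two rejection rules in the commit above can be written as a small predicate. This is a sketch under assumed names: the enums and suite_allowed() are invented for illustration and are not the kernel's constants or its validation function, and only the two checks listed in that commit are modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical identifiers for this sketch; not the kernel's constants. */
enum tls_version { TLS_1_0, TLS_1_1, TLS_1_2, TLS_1_3 };
enum cipher { CIPHER_AES_CBC, CIPHER_AES_GCM };
enum auth { AUTH_NONE, AUTH_SHA1_HMAC, AUTH_SHA2_256_HMAC };

bool
suite_allowed(enum tls_version v, enum cipher c, enum auth a)
{
	assert(v >= TLS_1_0 && v <= TLS_1_3);

	switch (c) {
	case CIPHER_AES_CBC:
		/* TLS 1.0 and 1.1 pair AES-CBC only with SHA1-HMAC. */
		if ((v == TLS_1_0 || v == TLS_1_1) && a != AUTH_SHA1_HMAC)
			return (false);
		return (true);
	case CIPHER_AES_GCM:
		/* AES-GCM requires TLS 1.2 or newer. */
		return (v >= TLS_1_2);
	}
	return (false);
}
```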
John Baldwin
ba6b771d17 ktls: Ensure FIFO encryption order for TLS 1.0.
TLS 1.0 records are encrypted as one continuous CBC chain where the
last block of the previous record is used as the IV for the next
record.  As a result, TLS 1.0 records cannot be encrypted out of order
but must be encrypted as a FIFO.

If the later pages of a sendfile(2) request complete before the first
pages, then TLS records can be encrypted out of order.  For TLS 1.1
and later this is fine, but this can break for TLS 1.0.

To cope, add a queue in each TLS session to hold TLS records that
contain valid unencrypted data but are waiting for an earlier TLS
record to be encrypted first.

- In ktls_enqueue(), check if a TLS record being queued is the next
  record expected for a TLS 1.0 session.  If not, it is placed in
  sorted order in the pending_records queue in the TLS session.

  If it is the next expected record, queue it for SW encryption like
  normal.  In addition, check if this new record (really a potential
  batch of records) was holding up any previously queued records in
  the pending_records queue.  Any of those records that are now in
  order are also placed on the queue for SW encryption.

- In ktls_destroy(), free any TLS records on the pending_records
  queue.  These mbufs are marked M_NOTREADY so were not freed when the
  socket buffer was purged in sbdestroy().  Instead, they must be
  freed explicitly.

Reviewed by:	gallatin, markj
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D32381

(cherry picked from commit 9f03d2c001)
2021-11-23 15:11:44 -08:00
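
The queueing scheme from the TLS 1.0 commit above can be sketched as follows. This is a simplified model, not the kernel code: records are reduced to bare sequence numbers, each enqueue carries one record rather than a batch, fixed-size arrays stand in for mbuf queues, and every name (struct session, enqueue_record(), and so on) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	MAX_RECORDS	16

/* Hypothetical per-session state; names are illustrative only. */
struct session {
	uint64_t next_seq;			/* next record allowed to encrypt */
	uint64_t pending[MAX_RECORDS];		/* waiting records, ascending */
	size_t npending;
	uint64_t encrypted[MAX_RECORDS];	/* order handed to encryption */
	size_t nencrypted;
};

/* Stand-in for queueing a record for software encryption. */
static void
encrypt_now(struct session *s, uint64_t seq)
{
	assert(s->nencrypted < MAX_RECORDS);
	s->encrypted[s->nencrypted++] = seq;
	s->next_seq = seq + 1;
}

void
enqueue_record(struct session *s, uint64_t seq)
{
	size_t i, j;

	if (seq != s->next_seq) {
		/* Out of order: insert into the sorted pending queue. */
		assert(s->npending < MAX_RECORDS);
		for (i = 0; i < s->npending && s->pending[i] < seq; i++)
			;
		for (j = s->npending; j > i; j--)
			s->pending[j] = s->pending[j - 1];
		s->pending[i] = seq;
		s->npending++;
		return;
	}

	encrypt_now(s, seq);

	/* Release pending records that this one was holding up. */
	for (i = 0; i < s->npending && s->pending[i] == s->next_seq; i++)
		encrypt_now(s, s->pending[i]);
	for (j = i; j < s->npending; j++)
		s->pending[j - i] = s->pending[j];
	s->npending -= i;
}
```

Starting from a zeroed session, enqueueing records 2, 1 and then 0 leaves nothing encrypted until record 0 arrives, at which point all three are released in the order 0, 1, 2.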
John Baldwin
0053fedc1b ktls: Reject attempts to enable AES-CBC with TLS 1.3.
AES-CBC cipher suites are not supported in TLS 1.3.

Reported by:	syzbot+ab501c50033ec01d53c6@syzkaller.appspotmail.com
Reviewed by:	tuexen, markj
Differential Revision:	https://reviews.freebsd.org/D32404

(cherry picked from commit a63752cce6)
2021-11-23 15:11:44 -08:00
John Baldwin
6afc00ed13 ktls: Use COUNTER_U64_DEFINE_EARLY for the ktls_toe_chacha20 counter.
I missed updating this counter when rebasing the changes in
9c64fc4029 after the switch to
COUNTER_U64_DEFINE_EARLY in 1755b2b989.

Fixes:		9c64fc4029 Add Chacha20-Poly1305 as a KTLS cipher suite.
Sponsored by:	Netflix

(cherry picked from commit 90972f0402)
2021-11-23 15:11:44 -08:00
John Baldwin
b7f27a60ac Add Chacha20-Poly1305 as a KTLS cipher suite.
Chacha20-Poly1305 for TLS is an AEAD cipher suite for both TLS 1.2 and
TLS 1.3 (RFCs 7905 and 8446).  For both versions, Chacha20 uses the
server and client IVs as implicit nonces xored with the record
sequence number to generate the per-record nonce matching the
construction used with AES-GCM for TLS 1.3.

Reviewed by:	gallatin
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D27839

(cherry picked from commit 9c64fc4029)
2021-11-23 15:11:44 -08:00
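
The nonce construction mentioned in the Chacha20-Poly1305 commit above follows RFC 8446, section 5.3: the 64-bit record sequence number is encoded big-endian, left-padded with zeros to the IV length, and XORed with the implicit IV. A minimal sketch, with per_record_nonce() as a hypothetical name rather than a kernel function:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	NONCE_LEN	12

/* Derive the per-record nonce from the implicit IV and sequence number. */
void
per_record_nonce(const uint8_t iv[NONCE_LEN], uint64_t seq,
    uint8_t nonce[NONCE_LEN])
{
	size_t i;

	assert(iv != NULL && nonce != NULL);

	for (i = 0; i < NONCE_LEN; i++)
		nonce[i] = iv[i];
	/* XOR the big-endian sequence number into the last 8 bytes. */
	for (i = 0; i < 8; i++)
		nonce[NONCE_LEN - 1 - i] ^= (uint8_t)(seq >> (8 * i));
}
```

The first four bytes of the IV pass through unchanged because the zero padding of the sequence number leaves them untouched.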
John Baldwin
b07b1f890e Stop creating socket aio kprocs during boot.
Create the initial pool of kprocs on demand when the first socket AIO
request is submitted instead.  The pool of kprocs used for other AIO
requests is similarly created on first use.

Reviewed by:	asomers
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D32468

(cherry picked from commit d1b6fef075)
2021-11-23 15:11:43 -08:00
Mark Johnston
35dfdb88ea unix: Remove a write-only local variable
Reported by:	clang
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit 42188bb5c1)
2021-11-23 09:32:46 -05:00
Mark Johnston
d16fbc488e clock: Group the "clocks" SYSINIT with the function definition
This is how most SYSINITs are defined.  Also annotate the dummy
parameter with __unused.  No functional change intended.

Sponsored by:	The FreeBSD Foundation

(cherry picked from commit 2287ced2f5)
2021-11-22 08:45:47 -05:00
Mark Johnston
686b143f37 timecounter: Initialize tc_lock earlier
Hyper-V wants to register its MSR-based timecounter during
SI_SUB_HYPERVISOR, before SI_SUB_LOCK, since an emulated 8254 may not be
available for DELAY().  So we cannot use MTX_SYSINIT to initialize the
timecounter lock.

PR:		259878
Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit 3339950117)
2021-11-22 08:44:49 -05:00
Konstantin Belousov
7dab2a7cf5 Kernel linkers: some style
(cherry picked from commit a7e4eb1422)
2021-11-21 02:27:44 +02:00
Warner Losh
87586bff11 sysbeep: Adjust interface to take a duration as a sbt
Change the 'period' argument to 'duration' and change its type to
sbintime_t so we can more easily express different durations.

Reviewed by:	tsoome, glebius
Differential Revision:	https://reviews.freebsd.org/D32619

(cherry picked from commit 072d5b98c4)
2021-11-18 21:52:22 -07:00
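
For readers unfamiliar with the type named in the sysbeep commit above: sbintime_t is a signed 64-bit fixed-point count of seconds, with 32 integer bits and 32 fractional bits. The sketch below uses a local stand-in typedef and hypothetical helpers to show how a millisecond duration maps onto that representation; it does not use the kernel headers.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in for the kernel's sbintime_t (32.32 fixed point). */
typedef int64_t sbt_t;

#define	SBT_ONE_SECOND	((sbt_t)1 << 32)

/* Hypothetical helper: convert a millisecond count to 32.32 fixed point. */
sbt_t
ms_to_sbt(int32_t ms)
{
	assert(ms >= 0);
	return ((sbt_t)ms * SBT_ONE_SECOND / 1000);
}

/* Whole seconds contained in a fixed-point duration. */
int64_t
sbt_whole_seconds(sbt_t d)
{
	assert(d >= 0);
	return (d >> 32);
}
```

With this representation a caller can express half a second as ms_to_sbt(500), which is exactly half of SBT_ONE_SECOND.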
Konstantin Belousov
19f2755d9e DEBUG_VFS_LOCKS: stop excluding devfs and doomed vnode from asserts
(cherry picked from commit d032cda0d0)
2021-11-19 06:25:29 +02:00
Konstantin Belousov
b9283ea323 Make locking assertions for VOP_FSYNC() and VOP_FDATASYNC() more correct
(cherry picked from commit 47b248ac65)
2021-11-19 06:25:29 +02:00
Konstantin Belousov
3a12ea648f freevnode(): lock the freeing vnode around destroy_vpollinfo()
(cherry picked from commit d1d675cb30)
2021-11-19 06:25:29 +02:00
Konstantin Belousov
4c04226222 getblk(): do not require devvp vnodes to be locked
(cherry picked from commit a7b4a54d2c)
2021-11-19 06:25:28 +02:00
Konstantin Belousov
5bd64640f7 start_init: use 'p'
(cherry picked from commit 8660813153)
2021-11-18 02:32:32 +02:00
Hans Petter Selasky
4a36455c41 Factor out flags preserved during mbuf demote into a separate define.
This define will later on be used by coming TLS RX hardware offload patches.

No functional change intended.

Reviewed by:	jhb@
Sponsored by:	NVIDIA Networking

(cherry picked from commit dd31400c3c)
2021-11-12 15:33:54 +01:00
Konstantin Belousov
9de9a33050 fexecve(2): allow O_PATH file descriptors opened without O_EXEC
(cherry picked from commit be10c0a910)
2021-11-06 04:12:33 +02:00
Konstantin Belousov
5291b294d3 proc_get_binpath(): provide syntactically correct value for unused NDINIT arg
(cherry picked from commit 7ac82c96fe)
2021-11-06 04:12:32 +02:00
Konstantin Belousov
392fbf5cce proc_get_binpath(): return empty string instead of NULL
(cherry picked from commit 02de91d740)
2021-11-06 04:12:32 +02:00
Konstantin Belousov
17aab23bf7 fexecve(2): restore the attempts to calculate the executable path
(cherry picked from commit e4ce23b238)
2021-11-06 04:12:32 +02:00
Konstantin Belousov
0303cc4be8 Extract proc_get_binpath() from sysctl_kern_proc_pathname()
(cherry picked from commit f34fc6ba06)
2021-11-06 04:12:32 +02:00
Konstantin Belousov
ea4e8e191c sysctl kern.proc.procname: report right hardlink name
PR:	248184

(cherry picked from commit ee92c8a842)
2021-11-06 04:12:32 +02:00
Konstantin Belousov
d39bd6d14d exec: store parent directory and hardlink name of the binary in struct proc
(cherry picked from commit 351d5f7fc5)
2021-11-06 04:12:32 +02:00
Konstantin Belousov
a69fb7452e exec: provide right hardlink name in AT_EXECPATH
PR:	248184

(cherry picked from commit 0c10648fbb)
2021-11-06 04:12:31 +02:00
Konstantin Belousov
b94df11d52 Make vn_fullpath_hardlink() externally callable
(cherry picked from commit 9a0bee9f6a)
2021-11-06 04:12:31 +02:00
Konstantin Belousov
1849361644 struct image_params: use bool type for boolean members
(cherry picked from commit 15bf81f354)
2021-11-06 04:12:31 +02:00
Konstantin Belousov
3b4baefca9 do_execve(): switch boolean locals to use bool type
(cherry picked from commit 9d58243fbc)
2021-11-06 04:12:31 +02:00
Konstantin Belousov
0b06c284ae kern_exec.c: style
(cherry picked from commit 143dba3a91)
2021-11-06 04:12:31 +02:00
Konstantin Belousov
3e322ded35 Unmap shared page manually before doing vm_map_remove() on exit or exec
(cherry picked from commit 1c69690319)
2021-11-04 02:56:39 +02:00
Sebastian Huber
b765d3da06 kern_tc.c: Scaling/large delta recalculation
(cherry picked from commit ae750fbac7)
2021-11-04 02:56:38 +02:00
Mark Johnston
66cb1858f4 Convert vm_page_alloc() callers to use vm_page_alloc_noobj().
Remove page zeroing code from consumers and stop specifying
VM_ALLOC_NOOBJ.  In a few places, also convert an allocation loop to
simply use VM_ALLOC_WAITOK.

Similarly, convert vm_page_alloc_domain() callers.

Note that callers are now responsible for assigning the pindex.

Reviewed by:	alc, hselasky, kib
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit a4667e09e6)
2021-11-03 13:39:36 -04:00
Mark Johnston
b5e5020260 rmslock: Update td_locks during lock and unlock operations
Reviewed by:	mjg
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit 71f31d784e)
2021-11-03 09:15:05 -04:00
Mark Johnston
10d94487df kasan: Use vm_offset_t for the first parameter to kasan_shadow_map()
No functional change intended.

Sponsored by:	The FreeBSD Foundation

(cherry picked from commit 20e3b9d8bd)
2021-11-02 18:17:58 -04:00
Alexander Motin
aac9d07f93 sleepqueue(9): Remove sbinuptime() from sleepq_timeout().
Callout c_time is always greater than or equal to the scheduled time.  It
is also smaller than sbinuptime() and can't change while the callback
is running.  So we reliably can use it instead of sbinuptime() here.
In case there was a race and the callout was rescheduled to the later
time, the callback will be called again.

According to profiles it saves ~5% of the timer interrupt time even
with fast TSC timecounter.

MFC after:	1 month

(cherry picked from commit 6df1359e55)
2021-11-01 20:24:07 -04:00
Mark Johnston
3388bf06d7 Generalize sanitizer interceptors for memory and string routines
Similar to commit 3ead60236f ("Generalize bus_space(9) and atomic(9)
sanitizer interceptors"), use a more generic scheme for interposing
sanitizer implementations of routines like memcpy().

No functional change intended.

Sponsored by:	The FreeBSD Foundation

(cherry picked from commit ec8f1ea8d5)
2021-11-01 10:20:50 -04:00
Mark Johnston
bf0986b742 Generalize bus_space(9) and atomic(9) sanitizer interceptors
Make it easy to define interceptors for new sanitizer runtimes, rather
than assuming KCSAN.  Lay a bit of groundwork for KASAN and KMSAN.

When a sanitizer is compiled in, atomic(9) and bus_space(9) definitions
in atomic_san.h are used by default instead of the inline
implementations in the platform's atomic.h.  These definitions are
implemented in the sanitizer runtime, which includes
machine/{atomic,bus}.h with SAN_RUNTIME defined to pull in the actual
implementations.

No functional change intended.

Sponsored by:	The FreeBSD Foundation

(cherry picked from commit 3ead60236f)
2021-11-01 10:16:39 -04:00
Mark Johnston
252b6ae3e6 KASAN: Disable checking before triggering a panic
KASAN hooks will not generate reports if panicstr != NULL, but then
there is a window after the initial panic() call where another report
may be raised.  This can happen if a false positive occurs; to simplify
debugging of such problems, avoid recursing.

Sponsored by:	The FreeBSD Foundation

(cherry picked from commit ea3fbe0707)
2021-11-01 10:07:45 -04:00
Mark Johnston
224a01a342 KASAN: Implement __asan_unregister_globals()
It will be called during KLD unload to unpoison the redzones following
global variables.  Otherwise, virtual address ranges previously used for
a KLD may be left tainted, triggering false positives when they are
recycled.

Reported by:	pho
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit 588c7a06df)
2021-11-01 10:07:13 -04:00
Mark Johnston
28c338b342 realloc: Fix KASAN(9) shadow map updates
When copying from the old buffer to the new buffer, we don't know the
requested size of the old allocation, but only the size of the
allocation provided by UMA.  This value is "alloc".  Because the copy
may access bytes in the old allocation's red zone, we must mark the full
allocation valid in the shadow map.  Do so using the correct size.

Reported by:	kp
Tested by:	kp
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit 9a7c2de364)
2021-11-01 10:05:22 -04:00
Mark Johnston
9710b74dd0 malloc: Add state transitions for KASAN
- Reuse some REDZONE bits to keep track of the requested and allocated
  sizes, and use that to provide red zones.
- As in UMA, disable memory trashing to avoid unnecessary CPU overhead.

Sponsored by:	The FreeBSD Foundation

(cherry picked from commit 06a53ecf24)
2021-11-01 10:03:36 -04:00
Mark Johnston
2748ecec95 execve: Mark exec argument buffers
We cache mapped execve argument buffers to avoid the overhead of TLB
shootdowns.  Mark them invalid when they are freed to the cache.

Sponsored by:	The FreeBSD Foundation

(cherry picked from commit f1c3adefd9)
2021-11-01 10:03:28 -04:00
Mark Johnston
75306778f1 vfs: Add KASAN state transitions for vnodes
vnodes are a bit special in that they may exist on per-CPU lists even
while free.  Add a KASAN-only destructor that poisons regions of each
vnode that are not expected to be accessed after a free.

Sponsored by:	The FreeBSD Foundation

(cherry picked from commit b261bb4057)
2021-11-01 10:03:19 -04:00
Mark Johnston
a3d4c8e21d amd64: Implement a KASAN shadow map
The idea behind KASAN is to use a region of memory to track the validity
of buffers in the kernel map.  This region is the shadow map.  The
compiler inserts calls to the KASAN runtime for every emitted load
and store, and the runtime uses the shadow map to decide whether the
access is valid.  Various kernel allocators call kasan_mark() to update
the shadow map.

Since the shadow map tracks only accesses to the kernel map, accesses to
other kernel maps are not validated by KASAN.  UMA_MD_SMALL_ALLOC is
disabled when KASAN is configured to reduce usage of the direct map.
Currently we have no mechanism to completely eliminate uses of the
direct map, so KASAN's coverage is not comprehensive.

The shadow map uses one byte per eight bytes in the kernel map.  In
pmap_bootstrap() we create an initial set of page tables for the kernel
and preloaded data.
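
The one-to-eight ratio above can be illustrated with a small user-space sketch. It is not the amd64 pmap code: the base addresses and the helper names `shadow_address()` and `shadow_bytes_needed()` are assumptions made up for illustration. Only the scale (one shadow byte per eight bytes of kernel map) comes from the text above.

```python
# Toy model of the address arithmetic behind a 1:8 shadow map.
# Every constant and function name here is hypothetical.

SHADOW_SCALE_SHIFT = 3                    # one shadow byte covers 2**3 = 8 bytes
SHADOW_SCALE = 1 << SHADOW_SCALE_SHIFT

# Hypothetical base addresses, chosen only to make the example concrete.
KERNEL_MAP_BASE = 0xFFFF_8000_0000_0000
SHADOW_BASE = 0xFFFF_F000_0000_0000


def shadow_address(addr):
    """Return the address of the shadow byte that describes `addr`."""
    if addr < KERNEL_MAP_BASE:
        raise ValueError("address is below the tracked kernel map")
    return SHADOW_BASE + ((addr - KERNEL_MAP_BASE) >> SHADOW_SCALE_SHIFT)


def shadow_bytes_needed(nbytes):
    """Return how many shadow bytes cover `nbytes` of kernel map, rounded up."""
    return (nbytes + SHADOW_SCALE - 1) >> SHADOW_SCALE_SHIFT
```

Under these assumptions, growing the kernel map by one 4096-byte page needs 512 more bytes of shadow, which is why the shadow region has to be extended whenever the kernel map grows.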

When pmap_growkernel() is called, we call kasan_shadow_map() to extend
the shadow map.  kasan_shadow_map() uses pmap_kasan_enter() to allocate
memory for the shadow region and map it.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29417

(cherry picked from commit 6faf45b34b)
2021-11-01 09:57:30 -04:00
Mark Johnston
48d2c7cc30 Add the KASAN runtime
KASAN enables the use of LLVM's AddressSanitizer in the kernel.  This
feature makes use of compiler instrumentation to validate memory
accesses in the kernel and detect several types of bugs, including
use-after-frees and out-of-bounds accesses.  It is particularly
effective when combined with test suites or syzkaller.  KASAN has high
CPU and memory usage overhead and so is not suited for production
environments.

The runtime and pmap maintain a shadow of the kernel map to store
information about the validity of memory mapped at a given kernel
address.

The runtime implements a number of functions defined by the compiler
ABI.  These are prefixed by __asan.  The compiler emits calls to
__asan_load*() and __asan_store*() around memory accesses, and the
runtime consults the shadow map to determine whether a given access is
valid.

kasan_mark() is called by various kernel allocators to update state in
the shadow map.  Updates to those allocators will come in subsequent
commits.
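
As a rough illustration of how an allocator-side mark and a load/store check interact, here is a user-space toy model. The class `ShadowMap`, its methods `mark()` and `access_ok()`, and the poison value are hypothetical and are not the kernel's kasan_mark() interface. The sketch assumes the usual AddressSanitizer shadow encoding: 0 means all eight bytes are valid, 1 to 7 means only that many leading bytes are valid, and values of 0x80 and above are poison codes.

```python
# Toy model of a shadow map with allocator marking and access checking.
# All names are hypothetical; this is not the kernel runtime.

SCALE = 8        # bytes of memory described by one shadow byte
POISON = 0xFA    # hypothetical poison code (any value >= 0x80)


class ShadowMap:
    def __init__(self, nbytes):
        # Start with every byte poisoned; allocators mark regions valid.
        self.shadow = bytearray([POISON]) * ((nbytes + SCALE - 1) // SCALE)

    def mark(self, addr, size, total, code=POISON):
        """Mark [addr, addr+size) valid and [addr+size, addr+total) as `code`.

        `addr` and `total` must be multiples of SCALE in this model.
        """
        assert addr % SCALE == 0 and total % SCALE == 0 and size <= total
        for i in range(total // SCALE):
            remaining = size - i * SCALE
            if remaining >= SCALE:
                value = 0            # whole 8-byte granule is valid
            elif remaining > 0:
                value = remaining    # only the first `remaining` bytes are valid
            else:
                value = code         # red zone
            self.shadow[addr // SCALE + i] = value

    def access_ok(self, addr, size):
        """Return True if every byte in [addr, addr+size) is marked valid."""
        for a in range(addr, addr + size):
            s = self.shadow[a // SCALE]
            if s == 0:
                continue
            if s >= 0x80 or (a % SCALE) >= s:
                return False
        return True
```

For example, after `mark(16, 13, 32)` on a 64-byte map, an access of 13 bytes at offset 16 passes, while a 14-byte access at the same offset reaches into the red zone and fails.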

The runtime also defines various interceptors.  Some low-level routines
are implemented in assembly and are thus not amenable to compiler
instrumentation.  To handle this, the runtime implements these routines
on behalf of the rest of the kernel.  The sanitizer implementation
validates memory accesses manually before handing off to the real
implementation.
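
The interceptor pattern described above can be sketched in user space as a wrapper that validates both buffers and then hands off to the real routine. The names `real_memcpy()` and `make_interceptor()`, and the validity predicate passed in, are hypothetical stand-ins. A real runtime would consult the shadow map and report the failure rather than raise a Python exception.

```python
# Toy model of a sanitizer interceptor: validate, then call the real routine.
# All names are hypothetical; this is not the kernel implementation.

def real_memcpy(mem, dst, src, n):
    """Stand-in for a low-level routine that the compiler cannot instrument."""
    mem[dst:dst + n] = mem[src:src + n]


def make_interceptor(real, is_valid):
    """Wrap `real` so both ranges are checked with `is_valid(addr, n)` first."""
    def intercepted(mem, dst, src, n):
        if not is_valid(src, n):
            raise MemoryError("invalid read of %d bytes at %d" % (n, src))
        if not is_valid(dst, n):
            raise MemoryError("invalid write of %d bytes at %d" % (n, dst))
        return real(mem, dst, src, n)
    return intercepted
```

The rest of the code calls the wrapper under the routine's usual name, so the check happens even though the routine itself carries no instrumentation.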

The sanitizer in a KASAN-configured kernel can be disabled by setting
the loader tunable debug.kasan.disable=1.

Obtained from:	NetBSD
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit 38da497a4d)
2021-11-01 09:56:31 -04:00