Commit Graph

18640 Commits

Author SHA1 Message Date
Kyle Evans
7a129c973b kern: add an option for preserving the early kenv
Some downstream configurations do not store secrets in the
early (loader/static) environments and desire a way to preserve these
for diagnostic reasons.  Provide an option to do so.

Reviewed by:	imp, jhb (earlier version)
Differential Revision:	https://reviews.freebsd.org/D30834
2021-07-18 23:05:48 -05:00
David Chisnall
cf98bc28d3 Pass the syscall number to capsicum permission-denied signals
The syscall number is stored in the same register as the syscall return
on amd64 (and possibly other architectures) and so it is impossible to
recover in the signal handler after the call has returned.  This small
tweak delivers it in the `si_value` field of the signal, which is
sufficient to catch capability violations and emulate them with a call
to a more-privileged process in the signal handler.

This reapplies 3a522ba1bc with a fix for
the static assertion failure on i386.

Approved by:	markj (mentor)

Reviewed by:	kib, bcr (manpages)

Differential Revision: https://reviews.freebsd.org/D29185
2021-07-16 18:06:44 +01:00
Mark Johnston
c1aff72cfa callout: Make cc_cpu local to kern_timeout.c
No functional change intended.

MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2021-07-15 22:41:10 -04:00
Mark Johnston
2e5f615295 lio_listio: Don't post a completion notification if none was requested
One is allowed to use LIO_NOWAIT without specifying a sigevent.  In this
case, lj->lioj_signal is left uninitialized, but several code paths
examine liov_signal.sigev_notify to figure out which notification to
post.  Unconditionally initialize that field to SIGEV_NONE.

Add a dumb test case which triggers the bug.

Reported by:	KMSAN+syzkaller
Reviewed by:	asomers
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31197
2021-07-15 22:41:10 -04:00
Konstantin Belousov
0bdb2cbf9d procctl(PROC_ASLR_STATUS): fix vmspace leak
Reported by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2021-07-15 03:02:50 +03:00
Mark Johnston
2783335cae blist: Correct the node count computed in blist_create()
Commit bb4a27f927 added the ability to allocate a span of blocks
crossing a meta node boundary.  To ensure that blst_next_leaf_alloc()
does not walk past the end of the tree, an extra all-zero meta node
needs to be present at the end of the allocation, and
blst_next_leaf_alloc() is implemented such that the presence of this
node terminates the search.

blist_create() computes the number of nodes required.  It had two
problems:
1. When the size of the blist is a power of BLIST_RADIX, we would
   unnecessarily allocate an extra level in the tree.
2. When the size of the blist is a multiple of BLIST_RADIX, we would
   fail to allocate a terminator node.  In this case,
   blst_next_leaf_alloc() could scan beyond the bounds of the
   allocation.  This was found using KASAN.

Modify blist_create() to handle these cases correctly.

Reported by:	pho
Reviewed by:	dougm
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D31158
2021-07-13 17:47:27 -04:00
Mark Johnston
45e2357113 malloc: Pass the allocation size to malloc_large() by value
Its callers do not make use the modified size that malloc_large() was
returning, so there's no need to pass a pointer.  No functional change
intended.

MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-07-13 17:47:02 -04:00
Mateusz Guzik
844aa31c6d cache: add cache_enter_time_flags 2021-07-12 07:03:14 +02:00
David Chisnall
d2b558281a Revert "Pass the syscall number to capsicum permission-denied signals"
This broke the i386 build.

This reverts commit 3a522ba1bc.
2021-07-10 20:26:01 +01:00
David Chisnall
3a522ba1bc Pass the syscall number to capsicum permission-denied signals
The syscall number is stored in the same register as the syscall return
on amd64 (and possibly other architectures) and so it is impossible to
recover in the signal handler after the call has returned.  This small
tweak delivers it in the `si_value` field of the signal, which is
sufficient to catch capability violations and emulate them with a call
to a more-privileged process in the signal handler.

Approved by:	markj (mentor)

Reviewed by:	kib, bcr (manpages)

Differential Revision: https://reviews.freebsd.org/D29185
2021-07-10 17:19:52 +01:00
Alexander Motin
63ca9ea4f3 Use sleepq_signal(SLEEPQ_DROP) in cv_signal().
Same as wakeup_one()/wakeup_any() commit before it reduces the lock
hold time and so contention.

MFC after:	1 week
2021-07-09 20:57:58 -04:00
Mark Johnston
588c7a06df KASAN: Implement __asan_unregister_globals()
It will be called during KLD unload to unpoison the redzones following
global variables.  Otherwise, virtual address ranges previously used for
a KLD may be left tainted, triggering false positives when they are
recycled.

Reported by:	pho
Sponsored by:	The FreeBSD Foundation
2021-07-09 20:38:50 -04:00
Michal Meloun
e88c3b1b02 intrng: remove now redundant shadow variable.
Should not be a functional change.

Submitted by: 	ehem_freebsd@m5p.com
Discussed in:	https://reviews.freebsd.org/D29310
MFC after:	4 weeks
2021-07-08 08:46:41 +02:00
Michal Meloun
a49f208d94 intrng: Releasing interrupt source should clear interrupt table full state.
The first release of an interrupt in a situation where the interrupt table
is full should schedule a full table check the next time an interrupt is
allocated. A full check is necessary to ensure maximum separation between
the order of allocation and the order of release.

Submitted by:	ehem_freebsd@m5p.com (initial version)
Discussed in:	https://reviews.freebsd.org/D29310
MFC after:	4 weeks
2021-07-08 08:16:46 +02:00
Andrew Gallatin
4150a5a87e ktls: fix NOINET build
Reported by: mjguzik
Sponsored by: Netflix
2021-07-07 10:40:02 -04:00
Randall Stewart
d7955cc0ff tcp: HPTS performance enhancements
HPTS drives both rack and bbr, and yet there have been many complaints
about performance. This bit of work restructures hpts to help reduce CPU
overhead. It does this by now instead of relying on the timer/callout to
drive it instead use user return from a system call as well as lro flushes
to drive hpts. The timer becomes a backstop that dynamically adjusts
based on how "late" we are.

Reviewed by: tuexen, glebius
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D31083
2021-07-07 07:22:35 -04:00
Konstantin Belousov
28a66fc3da Do not call FreeBSD-ABI specific code for all ABIs
Use sysentvec hooks to only call umtx_thread_exit/umtx_exec, which handle
robust mutexes, for native FreeBSD ABI.  Similarly, there is no sense
in calling sigfastblock_clear() for non-native ABIs.

Requested by:	dchagin
Reviewed by:	dchagin, markj (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D30987
2021-07-07 14:12:07 +03:00
Konstantin Belousov
55976ce11a Move sv_onexit() sysentvec hook slightly later
after itimers are stopped.  This makes it more usable for e.g. native FreeBSD
ABI sysentvecs.

Reviewed by:	dchagin, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D30987
2021-07-07 14:12:07 +03:00
Konstantin Belousov
71ab344524 Add sv_onexec_old() sysent hook for exec event
Unlike sv_onexec(), it is called from the old (pre-exec) sysentvec structure.
The old vmspace for the process is still intact during the call.

Reviewed by:	dchagin, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D30987
2021-07-07 14:12:07 +03:00
Mateusz Guzik
c2c34ee540 mbuf: add m_get_raw and m_gethdr_raw
The intent is to eliminate the MT_NOINIT flag and consequently a branch
from the constructor.

Reviewed by:	gallatin
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D31080
2021-07-07 11:05:46 +00:00
Mateusz Guzik
0a718a6e6e mbuf: replace all direct uma_zfree(zone_mbuf) calls with m_free_raw
Reviewed by:	donner
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D31082
2021-07-07 11:05:46 +00:00
Andrew Gallatin
28d0a740dd ktls: auto-disable ifnet (inline hw) kTLS
Ifnet (inline) hw kTLS NICs typically keep state within
a TLS record, so that when transmitting in-order,
they can continue encryption on each segment sent without
DMA'ing extra state from the host.

This breaks down when transmits are out of order (eg,
TCP retransmits).  In this case, the NIC must re-DMA
the entire TLS record up to and including the segment
being retransmitted.  This means that when re-transmitting
the last 1448 byte segment of a TLS record, the NIC will
have to re-DMA the entire 16KB TLS record. This can lead
to the NIC running out of PCIe bus bandwidth well before
it saturates the network link if a lot of TCP connections have
a high retransmoit rate.

This change introduces a new sysctl (kern.ipc.tls.ifnet_max_rexmit_pct),
where TCP connections with higher retransmit rate will be
switched to SW kTLS so as to conserve PCIe bandwidth.

Reviewed by:	hselasky, markj, rrs
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D30908
2021-07-06 10:28:32 -04:00
Jessica Clarke
55c57a7811 rman: Remove an outdated comment that no longer applies
Since commit 2dd1bdf183 in 2016 the r_start and r_end fields have been
rman_res_t, which was briefly unsigned long, but commit da1b038af9
changed the typedef to be uintmax_t instead. C99 is also something we
assume these days.

Reviewed by:	imp
Differential Revision:	https://reviews.freebsd.org/D30808
2021-07-05 16:15:03 +01:00
Mateusz Guzik
904a08f342 ktls: switch bare zone_mbuf use to m_free_raw
Reviewed by:	gallatin
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D30955
2021-07-02 08:30:22 +00:00
Mateusz Guzik
05462babd4 mbuf: add m_free_raw to be used instead of directly calling uma_zfree
The intent is to remove all direct zone_mbuf consumers so that ctor/dtor
from that zone can be reimplemented as wrappers around uma, avoiding an
indirect function call.

Reviewed by:	kbowling
Discussed with:	gallatin
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D30959
2021-07-02 08:30:22 +00:00
Mateusz Guzik
fb32c8dbeb iflib: retire MB_DTOR_SKIP
The flag was added in 2016 but remains unused.

Reviewed by:	kbowling
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D30958
2021-07-02 08:30:22 +00:00
Edward Tomasz Napierala
db8d680ebe procctl(2): add PROC_NO_NEW_PRIVS_CTL, PROC_NO_NEW_PRIVS_STATUS
This introduces a new, per-process flag, "NO_NEW_PRIVS", which
is inherited, preserved on exec, and cannot be cleared.  The flag,
when set, makes subsequent execs ignore any SUID and SGID bits,
instead executing those binaries as if they not set.

The main purpose of the flag is implementation of Linux
PROC_SET_NO_NEW_PRIVS prctl(2), and possibly also unpriviledged
chroot.

Reviewed By:	kib
Sponsored By:	EPSRC
Differential Revision:	https://reviews.freebsd.org/D30939
2021-07-01 09:42:07 +01:00
Dmitry Chagin
5d9f790191 Eliminate p_elf_machine from struct proc.
Instead of p_elf_machine use machine member of the Elf_Brandinfo which is now
cached in the struct proc at p_elf_brandinfo member.

Note to MFC: D30918, KBI

Reviewed by:		kib, markj
Differential Revision:	https://reviews.freebsd.org/D30926
MFC after:		2 weeks
2021-06-29 20:18:29 +03:00
Dmitry Chagin
615f22b2fb Add a link to the Elf_Brandinfo into the struc proc.
To allow the ABI to make a dicision based on the Brandinfo add a link
to the Elf_Brandinfo into the struct proc. Add a note that the high 8 bits
of Elf_Brandinfo flags is private to the ABI.

Note to MFC: it breaks KBI.

Reviewed by:		kib, markj
Differential Revision:	https://reviews.freebsd.org/D30918
MFC after:		2 weeks
2021-06-29 20:15:08 +03:00
Edward Tomasz Napierala
435754a59e Add infrastructure required for Linux coredump support
This adds `sv_elf_core_osabi`, `sv_elf_core_abi_vendor`,
and `sv_elf_core_prepare_notes` fields to `struct sysentvec`,
and modifies imgact_elf.c to make use of them instead
of hardcoding FreeBSD-specific values.  It also updates all
of the ABI definitions to preserve current behaviour.

This makes it possible to implement non-native ELF coredump
support without unnecessary code duplication.  It will be used
for Linux coredumps.

Reviewed By:	kib
Sponsored By:	EPSRC
Differential Revision:	https://reviews.freebsd.org/D30921
2021-06-29 08:49:12 +01:00
Edward Tomasz Napierala
61b4c62718 imgact_elf.c: style, remove unnecessary casts
Remove unnecessary type casts and redundant brackets.
No functional changes.

Suggested By:	kib
Reviewed By:	kib
Sponsored By:	EPSRC
Differential Revision:	https://reviews.freebsd.org/D30841
2021-06-27 17:05:59 +01:00
Alexander Motin
6df35af4d8 Allow sleepq_signal() to drop the lock.
Introduce SLEEPQ_DROP sleepq_signal() flag, allowing one to drop the
sleep queue chain lock before returning.  Reduced lock scope allows
significantly reduce lock contention inside taskqueue_enqueue() for
ZFS worker threads doing ~350K disk reads/s on 40-thread system.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2021-06-25 14:12:21 -04:00
Konstantin Belousov
802cf4ab0e namei: add NDPREINIT() macro
Its intent is to do the initialization of the future part of struct nameidata
which should be used across several namei() and VOPs.  Right now it is NOP.

Reviewed by:	mckusick
Discussed with:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D30041
2021-06-23 23:46:15 +03:00
Warner Losh
ddfc9c4c59 newbus: Move from bus_child_{pnpinfo,location}_src to bus_child_{pnpinfo,location} with sbuf
Now that the upper layers all go through a layer to tie into these
information functions that translates an sbuf into char * and len. The
current interface suffers issues of what to do in cases of truncation,
etc. Instead, migrate all these functions to using struct sbuf and these
issues go away. The caller is also in charge of any memory allocation
and/or expansion that's needed during this process.

Create a bus_generic_child_{pnpinfo,location} and make it default. It
just returns success. This is for those busses that have no information
for these items. Migrate the now-empty routines to using this as
appropriate.

Document these new interfaces with man pages, and oversight from before.

Reviewed by:		jhb, bcr
Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D29937
2021-06-22 20:52:06 -06:00
Edward Tomasz Napierala
06250515cf imgact_elf: compute auxv buffer size instead of using magic value
The new buffer is somewhat larger, but there should be no functional
changes.

Reviewed By:	kib, imp
Sponsored By:	EPSRC
Differential Revision:	https://reviews.freebsd.org/D30821
2021-06-21 17:07:07 +01:00
Colin Percival
fe51b5a76d kern_tslog: Include tslog data from loader
The i386 loader (and hopefully others to come) now passes tslog data
as a "preloaded module".  Include this in the data returned by the
debug.tslog sysctl.

Reviewed by:	kevans
2021-06-20 20:09:47 -07:00
Warner Losh
0a99422970 Move mips and arm to 1000Hz by default.
armv6 and armv7 systems already were 1000Hz. The other armv5 were a
mix of 100 and 1000. This changes them to 1000. Should there be
issues, we can add options HZ=100 to the systems that have bad
performance at the drop of a hat.

mips is a lot more complicated. But most of the systems are already
1000HZ. The hardware exceptions are all fast enough to run at
1000Hz. MALTA is our primary emulator, and history has shown emulators
tend to like 100Hz better, so run those systems at 100Hz. As with arm,
any system that shows a huge performance regression can reverted to
100Hz easily.

This was going to be committed well in advance of the 13 branch, but
it was delayed and forgotten til now.

Discussed on:	#bsdmips ages ago
Sponsored by:	Netflix
2021-06-16 20:00:14 -06:00
John Baldwin
faf0224ff2 ktls: Don't mark existing received mbufs notready for TOE TLS.
The TOE driver might receive decrypted TLS records that are enqueued
to the socket buffer after ktls_try_toe() returns and before
ktls_enable_rx() locks the receive buffer to call sb_mark_notready().
In that case, sb_mark_notready() would incorrectly treat the decrypted
TLS record as an encrypted record and schedule it for decryption.
This always resulted in the connection being dropped as the data in
the control message did not look like a valid TLS header.

To fix, don't try to handle software decryption of existing buffers in
the socket buffer for TOE TLS in ktls_enable_rx().  If a TOE TLS
driver needs to decrypt existing data in the socket buffer, the driver
will need to manage that in its tod_alloc_tls_session method.

Sponsored by:	Chelsio Communications
2021-06-15 17:45:21 -07:00
Konstantin Belousov
a12e901a5a Add a knob to disable dequeueing SIGCHLD on waiting for live process
It seems that Linux does not dequeue siginfo for SIGCHLD when wait*(2)
reports status of the running process.  In particular, sigwaitinfo(2)
and other signal querying syscalls can observe the siginfo after wait.

FreeBSD dequeued siginfo from the beginning, so we cannot change the
default ABI to be more compatible.  Still, add a knob to enable to
change to the other behavior for debugging purposes.

Reported by:	dchagin
Reviewed by:	dchagin, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D30675
2021-06-16 02:00:19 +03:00
Konstantin Belousov
bc38762474 Add a knob to not drop signal with default ignored or ignored actions
Traditionally, BSD drops signals with the default action during send,
not even putting them to the destination process queue.  This semantic
is not shared with other operating systems (Linux), which do queue
such signals.  In particular, sigtimedwait(2) and related syscalls can
observe the delivery.

Add a global knob kern.sig_discard_ign which can be set to false to force
enqueuing of the signals with default action.  Also add an ABI flag to
indicate that signals should be queued.

Note that it is not practical to run with the knob turned on, because almost
all software that care about the delivery of such signals, is aware of the
difference, and misbehaves if the signals are actually queued.  The purpose
of the knob as is is to allow for easier diagnostic of the programs that
need the adjustments, to confirm the cause of problem.

Reported by:	dchagin
Reviewed by:	dchagin, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D30675
2021-06-16 02:00:19 +03:00
Konstantin Belousov
acced8b043 sigwait: add comment explaining EINTR/ERESTART details
Reviewed by:	dchagin, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D30675
2021-06-16 02:00:19 +03:00
Konstantin Belousov
afb36e289c sigwait(2) and sigtimedwait(2) must not be restarted.
Reported by:	dchagin
Reviewed by:	dchagin, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D30675
2021-06-16 02:00:18 +03:00
Mark Johnston
a100217489 Consistently use the SOCKBUF_MTX() and SOCK_MTX() macros
This makes it easier to change the socket locking protocols.  No
functional change intended.

MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2021-06-14 17:32:32 -04:00
Mark Johnston
f4bb1869dd Consistently use the SOLISTENING() macro
Some code was using it already, but in many places we were testing
SO_ACCEPTCONN directly.  As a small step towards fixing some bugs
involving synchronization with listen(2), make the kernel consistently
use SOLISTENING().  No functional change intended.

MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2021-06-14 17:32:27 -04:00
Andrew Gallatin
ed5e13cfc2 ktls: Fix interaction with RATELIMIT
uipc_ktls.c was missing opt_ratelimit.h, so it was
never noticing that RATELIMIT was enabled.  Once it was
enabled, it failed to compile as  ktls_modify_txrtlmt()
had accrued a compilation error when it was not being
compiled in.

Sponsored by: Netflix
2021-06-14 10:51:16 -04:00
Dmitry Chagin
e884512ad1 Split kern_poll() on two counterparts.
The kern_poll_kfds() operates on clear kernel data, kfds points to an
array in the kernel, while kern_poll() operates on user supplied pollfd.
Move nfds check to kern_poll_maxfds().

No functional changes, it's for future use in the Linux emulation layer.

Reviewd by:		kib
Differential Revision:	https://reviews.freebsd.org/D30690
MFC after:		2 weeks
2021-06-10 15:11:25 +03:00
Dmitry Chagin
f570a6723e Fix copyright, remove "all rights reserved".
The eventfd code was written by me, rdivacky@ copyrigth applicable only
to epoll part of the Linuxulator code. Roman is ok to retire his copyright
from sys/kern/sys_eventfd.c and 'All rights reserved.' lines from
sys/compat/linux/linux_event.[c|h] and sys/kern/sys_eventfd.c files.

Reviewed by:		kib, emaste
Approved by:		rdivacky
Differential Revision:	https://reviews.freebsd.org/D30677
MFC after:		2 weeks
2021-06-08 08:18:00 +03:00
Mark Johnston
887c753c9f Fix handling of D_GIANTOK
It was meant to suppress only the printf(), not the subsequent injection
of Giant-protected thunks for various file operations.

Fixes:		fbeb4ccac9
Reported by:	pho
Tested by:	pho
MFC after:	6 days
Pointy hat:	markj
2021-06-07 16:45:50 -04:00
Mark Johnston
fbeb4ccac9 Suppress D_NEEDGIANT warnings for some drivers
During boot we warn that the kbd and openfirm drivers are Giant-locked
and may be deleted.  Generally, the warning helps signal that certain
old drivers are not being maintained and are subject to removal, but
this doesn't really apply to certain drivers which are harder to
detangle from Giant.

Add a flag, D_GIANTOK, that devices can specify to suppress the
misleading warning.  Use it in the kbd and openfirm drivers.

Reviewed by:	imp, jhb
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30649
2021-06-06 16:44:46 -04:00
Konstantin Belousov
2d423f7671 sysent: allow ABI to disable setid on exec.
Reviewed by:	dchagin
Tested by:	trasz
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28154
2021-06-06 21:42:52 +03:00
Konstantin Belousov
19e6043a44 kern_exec.c: Add execve_nosetid() helper
Reviewed by:	dchagin
Tested by:	trasz
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28154
2021-06-06 21:42:41 +03:00
Jason A. Harmening
59409cb90f Add a generic mechanism for preventing forced unmount
This is aimed at preventing stacked filesystems like nullfs and unionfs
from "losing" their lower mounts due to forced unmount.  Otherwise,
VFS operations that are passed through to the lower filesystem(s) may
crash or otherwise cause unpredictable behavior.

Introduce two new functions: vfs_pin_from_vp() and vfs_unpin().
which are intended to be called on the lower mount(s) when the stacked
filesystem is mounted and unmounted, respectively.
Much as registration in the mnt_uppers list previously did, pinning
will prevent even forced unmount of the lower FS and will allow the
stacked FS to freely operate on the lower mount either by direct
use of the struct mount* or indirect use through a properly-referenced
vnode's v_mount field.

vfs_pin_from_vp() is modeled after vfs_ref_from_vp() in that it uses
the mount interlock coupled with re-checking vp->v_mount to ensure
that it will fail in the face of a pending unmount request, even if
the concurrent unmount fully completes.

Adopt these new functions in both nullfs and unionfs.

Reviewed By:	kib, markj
Differential Revision: https://reviews.freebsd.org/D30401
2021-06-05 18:20:36 -07:00
wiklam
43521b46fc Correcting comment about "sched_interact_score".
Reviewed by:	jrtc@, imp@
Pull Request:	https://github.com/freebsd/freebsd-src/pull/431

Sponsored by:		Netflix
2021-06-02 21:50:57 -06:00
Warner Losh
9f3d1a98dd regen after tweaks to getgroups and setgroups
Sponsored by:		Netflix
2021-06-02 13:24:50 -06:00
Moritz Buhl
4bc2174a1b kern: fail getgroup and setgroup with negative int
Found using
https://github.com/NetBSD/src/blob/trunk/tests/lib/libc/sys/t_getgroups.c

getgroups/setgroups want an int and therefore casting it to u_int
resulted in `getgroups(-1, ...)` not returning -1 / errno = EINVAL.

imp@ updated syscall.master and made changes markj@ suggested

PR:			189941
Tested by:		imp@
Reviewed by:		markj@
Pull Request:		https://github.com/freebsd/freebsd-src/pull/407
Differential Revision:	https://reviews.freebsd.org/D30617
2021-06-02 13:22:57 -06:00
Mateusz Guzik
c9f8dcda85 kqueue: replace kq_ncallouts loop with atomic_fetchadd 2021-06-02 15:14:58 +00:00
Rich Ercolani
a19ae1b099 vfs: fix MNT_SYNCHRONOUS check in vn_write
ca1ce50b2b ("vfs: add more safety against concurrent forced
unmount to vn_write") has a side effect of only checking MNT_SYNCHRONOUS
if O_FSYNC is set.

Reviewed By: mjg
Differential Revision: https://reviews.freebsd.org/D30610
2021-06-02 13:42:02 +00:00
Kyle Evans
2d741f33bd kern: ether_gen_addr: randomize on default hostuuid, too
Currently, this will still hash the default (all zero) hostuuid and
potentially arrive at a MAC address that has a high chance of collision
if another interface of the same name appears in the same broadcast
domain on another host without a hostuuid, e.g., some virtual machine
setups.

Instead of using the default hostuuid, just treat it as a failure and
generate a random LA unicast MAC address.

Reviewed by:	bz, gbe, imp, kbowling, kp
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D29788
2021-06-01 22:59:21 -05:00
Mark Johnston
283e60fb31 ktrace: Fix an inverted comparison added in commit f3851b235
Fixes:		f3851b235 ("ktrace: Fix a race with fork()")
Reported by:	dchagin, phk
2021-06-01 09:15:35 -04:00
Konstantin Belousov
d3f7975fcb thread_reap_barrier(): remove unused variable
Noted by:	alc
Sponsored by:	Mellanox Technologies/NVidia Networking
MFC after:	1 week
2021-05-31 23:03:42 +03:00
Konstantin Belousov
f62c7e54e9 Add thread_reap_barrier()
Reviewed by:	hselasky,markj
Sponsored by:	Mellanox Technologies/NVidia Networking
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D30468
2021-05-31 18:09:22 +03:00
Konstantin Belousov
3a68546d23 quisce_cpus(): add special handling for PDROP
Currently passing PDROP to the quisce_cpus() function does not make sense.
Add special meaning for it, by not waiting for the idle thread to schedule.

Also avoid allocating u_int[MAXCPU] on the stack.

Reviewed by:	hselasky, markj
Sponsored by:	Mellanox Technologies/NVidia Networking
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D30468
2021-05-31 18:09:22 +03:00
Konstantin Belousov
845d77974b kern_thread.c: wrap too long lines
Reviewed by:	hselasky, markj
Sponsored by:	Mellanox Technologies/NVidia Networking
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D30468
2021-05-31 18:09:22 +03:00
Konstantin Belousov
e266a0f7f0 kern linker: do not allow more than one kldload and kldunload syscalls simultaneously
kld_sx is dropped e.g. for executing sysinits, which allows user
to initiate kldunload while module is not yet fully initialized.

Reviewed by:	markj
Differential revision:	https://reviews.freebsd.org/D30456
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2021-05-31 18:09:22 +03:00
Konstantin Belousov
27006229f7 vinvalbuf: do not panic if we were unable to flush dirty buffers
Return EBUSY instead and let caller to handle the issue.

For vgone()/vnode reclamation, caller first does vinvalbuf(V_SAVE),
which return EBUSY in case dirty buffers where not flushed. Then caller
calls vinvalbuf(0) due to non-zero return, which gets rid of all dirty
buffers without dependencies.

PR:	238565
Reviewed by:	asomers, mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D30555
2021-05-31 01:20:53 +03:00
Jason A. Harmening
a4b07a2701 VFS_QUOTACTL(9): allow implementation to indicate busy state changes
Instead of requiring all implementations of vfs_quotactl to unbusy
the mount for Q_QUOTAON and Q_QUOTAOFF, add an "mp_busy" in/out param
to VFS_QUOTACTL(9).  The implementation may then indicate to the caller
whether it needed to unbusy the mount.

Also, add stbool.h to libprocstat modules which #define _KERNEL
before including sys/mount.h.  Otherwise they'll pull in sys/types.h
before defining _KERNEL and therefore won't have the bool definition
they need for mp_busy.

Reviewed By:	kib, markj
Differential Revision: https://reviews.freebsd.org/D30556
2021-05-30 14:53:47 -07:00
Jason A. Harmening
271fcf1c28 Revert commits 6d3e78ad6c and 54256e7954
Parts of libprocstat like to pretend they're kernel components for the
sake of including mount.h, and including sys/types.h in the _KERNEL
case doesn't fix the build for some reason.  Revert both the
VFS_QUOTACTL() change and the follow-up "fix" for now.
2021-05-29 17:48:02 -07:00
Mateusz Guzik
3cf75ca220 vfs: retire unused vn_seqc_write_begin_unheld* 2021-05-29 22:04:09 +00:00
Mateusz Guzik
d81aefa8b7 vfs: use the sentinel trick in locked lookup path parsing 2021-05-29 22:04:09 +00:00
Mateusz Guzik
478c52f1e3 vfs: slightly rework vn_rlimit_fsize 2021-05-29 22:04:09 +00:00
Mateusz Guzik
9bfddb3ac4 fd: use PROC_WAIT_UNLOCKED when clearing p_fd/p_pd 2021-05-29 22:04:09 +00:00
Jason A. Harmening
6d3e78ad6c VFS_QUOTACTL(9): allow implementation to indicate busy state changes
Instead of requiring all implementations of vfs_quotactl to unbusy
the mount for Q_QUOTAON and Q_QUOTAOFF, add an "mp_busy" in/out param
to VFS_QUOTACTL(9).  The implementation may then indicate to the caller
whether it needed to unbusy the mount.

Reviewed By:	kib, markj
Differential Revision: https://reviews.freebsd.org/D30218
2021-05-29 14:05:39 -07:00
Mark Johnston
f3851b235b ktrace: Fix a race with fork()
ktrace(2) may toggle trace points in any of
1. a single process
2. all members of a process group
3. all descendents of the processes in 1 or 2

In the first two cases, we do not permit the operation if the process is
being forked or not visible. However, in case 3 we did not enforce this
restriction for descendents. As a result, the assertions about the child
in ktrprocfork() may be violated.

Move these checks into ktrops() so that they are applied consistently.

Allow KTROP_CLEAR for nascent processes. Otherwise, there is a window
where we cannot clear trace points for a nascent child if they are
inherited from the parent.

Reported by:	syzbot+d96676592978f137e05c@syzkaller.appspotmail.com
Reported by:	syzbot+7c98fcf84a4439f2817f@syzkaller.appspotmail.com
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30481
2021-05-27 15:52:20 -04:00
Mark Johnston
e00bae5c18 kevent: Prohibit negative change and event list lengths
Previously, a negative change list length would be treated the same as
an empty change list.  A negative event list length would result in
bogus copyouts.  Make kevent(2) return EINVAL for both cases so that
application bugs are more easily found, and to be more robust against
future changes to kevent internals.

Reviewed by:	imp, kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30480
2021-05-27 15:52:20 -04:00
Mark Johnston
f885100773 ktrace: Handle negative array sizes in ktrstructarray
ktrstructarray() may be used to create copies of kevent(2) change and
event arrays.  It is called before parameter validation is done and so
should check for bogus array lengths before allocating a copy.

Reported by:	syzkaller
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30479
2021-05-27 15:52:20 -04:00
Edward Tomasz Napierala
905d192d6f Unstaticize parts of coredumping code
This makes it possible to call __elfN(size_segments) and __elfN(puthdr)
from Linux coredump code.

Reviewed By:	kib
Sponsored By:	EPSRC
Differential Revision:	https://reviews.freebsd.org/D30455
2021-05-26 11:51:57 +01:00
John Baldwin
6b313a3a60 Include the trailer in the original dst_iov.
This avoids creating a duplicate copy on the stack just to
append the trailer.

Reviewed by:	gallatin, markj
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D30139
2021-05-25 16:59:19 -07:00
John Baldwin
21e3c1fbe2 Assume OCF is the only KTLS software backend.
This removes support for loadable software backends.  The KTLS OCF
support is now always included in kernels with KERN_TLS and the
ktls_ocf.ko module has been removed.  The software encryption routines
now take an mbuf directly and use the TLS mbuf as the crypto buffer
when possible.

Bump __FreeBSD_version for software backends in ports.

Reviewed by:	gallatin, markj
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D30138
2021-05-25 16:59:19 -07:00
John Baldwin
883a0196b6 crypto: Add a new type of crypto buffer for a single mbuf.
This is intended for use in KTLS transmit where each TLS record is
described by a single mbuf that is itself queued in the socket buffer.
Using the existing CRYPTO_BUF_MBUF would result in
bus_dmamap_load_crp() walking additional mbufs in the socket buffer
that are not relevant, but generating a S/G list that potentially
exceeds the limit of the tag (while also wasting CPU cycles).

Reviewed by:	markj
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D30136
2021-05-25 16:59:18 -07:00
John Baldwin
6663f8a23e sglist: Add sglist_append_single_mbuf().
This function appends the contents of a single mbuf to an sglist
rather than an entire mbuf chain.

Reviewed by:	gallatin, markj
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D30135
2021-05-25 16:59:18 -07:00
John Baldwin
aa341db39b Rename m_unmappedtouio() to m_unmapped_uiomove().
This function doesn't only copy data into a uio but instead is a
variant of uiomove() similar to uiomove_fromphys().

Reviewed by:	gallatin, markj
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D30444
2021-05-25 16:59:18 -07:00
John Baldwin
3f9dac85cc Extend m_copyback() to support unmapped mbufs.
Reviewed by:	gallatin, markj
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D30133
2021-05-25 16:59:18 -07:00
John Baldwin
3c7a01d773 Extend m_apply() to support unmapped mbufs.
m_apply() invokes the callback function separately on each segment of
an unmapped mbuf: the TLS header, individual pages, and the TLS
trailer.

Reviewed by:	markj
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D30132
2021-05-25 16:59:18 -07:00
Edward Tomasz Napierala
3b9971c8da Clean up some of the core dumping code.
No functional changes.

Reviewed By:	kib
Sponsored By:	EPSRC
Differential Revision:	https://reviews.freebsd.org/D30397
2021-05-25 16:30:32 +01:00
Konstantin Belousov
fd3ac06f45 ptrace: add an option to not kill debuggees on debugger exit
Requested by:	markj
Reviewed by:	jhb (previous version)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differrential revision:	https://reviews.freebsd.org/D30351
2021-05-25 18:22:34 +03:00
Konstantin Belousov
d7a7ea5be6 sys_process.c: extract ptrace_unsuspend()
Reviewed by:	jhb
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differrential revision:	https://reviews.freebsd.org/D30351
2021-05-25 18:22:27 +03:00
Mateusz Guzik
a269183875 vfs: elide vnode locking when it is only needed for audit if possible 2021-05-23 19:37:16 +00:00
Mark Johnston
6f6cd1e8e8 ktrace: Remove vrele() at the end of ktr_writerequest()
As of commit fc369a353 we no longer ref the vnode when writing a record.
Drop the corresponding vrele() call in the error case.

Fixes:	fc369a353 ("ktrace: fix a race between writes and close")
Reported by:	syzbot+9b96ea7a5ff8917d3fe4@syzkaller.appspotmail.com
Reported by:	syzbot+6120ebbb354cd52e5107@syzkaller.appspotmail.com
Reviewed by:	kib
MFC after:	6 days
Differential Revision:	https://reviews.freebsd.org/D30404
2021-05-23 14:13:01 -04:00
Mateusz Guzik
e2ab16b1a6 lockprof: move panic check after inspecting the state 2021-05-23 17:55:27 +00:00
Mateusz Guzik
6a467cc5e1 lockprof: pass lock type as an argument instead of reading the spin flag 2021-05-23 17:55:27 +00:00
Hans Petter Selasky
ef0f7ae934 The old thread priority must be stored as part of the EPOCH(9) tracker.
Else recursive use of EPOCH(9) may cause the wrong priority to be restored.

Bump the __FreeBSD_version due to changing the thread and epoch tracker
structure.

Differential Revision:	https://reviews.freebsd.org/D30375
Reviewed by:	markj@
MFC after:	1 week
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-05-23 10:53:25 +02:00
Mateusz Guzik
138f78e94b umtx: convert umtxq_lock to a macro
Then LOCK_PROFILING starts reporting callers instead of the inline.
2021-05-22 21:01:05 +00:00
Mateusz Guzik
e71d5c7331 Fix limit testing after 1762f674cc ktrace commit.
The previous:

if ((uoff_t)uio->uio_offset + uio->uio_resid > lim)
	signal(....);

was replaced with:

if ((uoff_t)uio->uio_offset + uio->uio_resid < lim)
	return;
signal(....);

Making (uoff_t)uio->uio_offset + uio->uio_resid == lim trip over the
limit, when it did not previously.

Unbreaks running 13.0 buildworld.
2021-05-22 20:18:21 +00:00
Konstantin Belousov
fc369a353b ktrace: fix a race between writes and close
It was possible that termination of ktrace session occured during some
record write, in which case write occured after the close of the vnode.
Use ktr_io_params refcounting to avoid this situation, by taking the
reference on the structure instead of vnode.

Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D30400
2021-05-22 23:14:13 +03:00
Mateusz Guzik
48235c377f Fix a braino in previous.
Instead of trying to partially ifdef out ktrace handling, define the
missing identifier to 0. Without this fix lack of ktrace in the kernel
also means there is no SIGXFSZ signal delivery.
2021-05-22 19:53:40 +00:00
Mateusz Guzik
154f0ecc10 Fix tinderbox build after 1762f674cc ktrace commit. 2021-05-22 19:41:19 +00:00
Mateusz Guzik
a0842e69aa lockprof: add contested-only profiling
This allows tracking all wait times with much smaller runtime impact.

For example when doing -j 104 buildkernel on tmpfs:

no profiling:	2921.70s user 282.72s system 6598% cpu 48.562 total
all acquires:	2926.87s user 350.53s system 6656% cpu 49.237 total
contested only:	2919.64s user 290.31s system 6583% cpu 48.756 total
2021-05-22 19:28:37 +00:00
Mateusz Guzik
fca5cfd584 lockprof: retire lock_prof_skipcount
The implementation uses a global variable for *ALL* calls, defeating the
point of sampling in the first place. Remove it as it clearly remains
unused.
2021-05-22 19:28:37 +00:00
Mateusz Guzik
cf74b2be53 vfs: retire the now unused vnlru_free routine 2021-05-22 18:42:30 +00:00
Mark Johnston
e4b16f2fb1 ktrace: Avoid recursion in namei()
sys_ktrace() calls namei(), which may call ktrnamei().  But sys_ktrace()
also calls ktrace_enter() first, so if the caller is itself being
traced, the assertion in ktrace_enter() is triggered.  And, ktrnamei()
does not check for recursion like most other ktrace ops do.

Fix the bug by simply deferring the ktrace_enter() call.

Also make the parameter to ktrnamei() const and convert to ANSI.

Reported by:	syzbot+d0a4de45e58d3c08af4b@syzkaller.appspotmail.com
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30340
2021-05-22 12:07:32 -04:00
Konstantin Belousov
f784da883f Move mnt_maxsymlinklen into appropriate fs mount data structures
Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
X-MFC-Note:	struct mount layout
Differential revision:	https://reviews.freebsd.org/D30325
2021-05-22 15:16:09 +03:00
Konstantin Belousov
ea2b64c241 ktrace: add a kern.ktrace.filesize_limit_signal knob
When enabled, writes to ktrace.out that exceed the max file size limit
cause SIGXFSZ as it should be, but note that the limit is taken from
the process that initiated ktrace.   When disabled, write is blocked,
but signal is not send.

Note that in either case ktrace for the affected process is stopped.

Requested and reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D30257
2021-05-22 15:16:09 +03:00
Konstantin Belousov
02645b886b ktrace: use the limit of the trace initiator for file size limit on writes
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D30257
2021-05-22 15:16:09 +03:00
Konstantin Belousov
1762f674cc ktrace: pack all ktrace parameters into allocated structure ktr_io_params
Ref-count the ktr_io_params structure instead of vnode/cred.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D30257
2021-05-22 15:16:08 +03:00
Konstantin Belousov
a6144f713c ktrace: do not stop tracing other processes if our cannot write to this vnode
Other processes might still be able to write, make the decision to stop
based on the per-process situation.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D30257
2021-05-22 15:16:08 +03:00
Konstantin Belousov
9bb84c23e7 accounting: explicitly mark the exiting thread as doing accounting
and use the mark to stop applying file size limits on the write of
the accounting record.  This allows to remove hack to clear process
limits in acct_process(), and avoids the bug with the clearing being
ineffective because limits are also cached in the thread structure.

Reported and reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D30257
2021-05-22 15:16:08 +03:00
Konstantin Belousov
70c05850e2 kern_descrip.c: Style
Wrap too long lines.

Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D30257
2021-05-22 15:16:08 +03:00
Konstantin Belousov
d713bf7927 vn_need_pageq_flush(): simplify
There is no need to own vnode interlock, since v_object is type stable
and can only change to/from NULL, and no other checks in the function
access fields protected by the interlock.  Remove the need variable, the
result of the test is directly usable as return value.

Tested by:	mav, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2021-05-22 12:29:44 +03:00
Edward Tomasz Napierala
33621dfc19 Refactor core dumping code a bit
This makes it possible to use core_write(), core_output(),
and sbuf_drain_core_output(), in Linux coredump code.  Moving
them out of imgact_elf.c is necessary because of the weird way
it's being built.

Reviewed By:	kib
Sponsored By:	EPSRC
Differential Revision:	https://reviews.freebsd.org/D30369
2021-05-22 09:59:00 +01:00
Mark Johnston
916c61a5ed Fix handling of errors from pru_send(PRUS_NOTREADY)
PRUS_NOTREADY indicates that the caller has not yet populated the chain
with data, and so it is not ready for transmission.  This is used by
sendfile (for async I/O) and KTLS (for encryption).  In particular, if
pru_send returns an error, the caller is responsible for freeing the
chain since other implicit references to the data buffers exist.

For async sendfile, it happens that an error will only be returned if
the connection was dropped, in which case tcp_usr_ready() will handle
freeing the chain.  But since KTLS can be used in conjunction with the
regular socket I/O system calls, many more error cases - which do not
result in the connection being dropped - are reachable.  In these cases,
KTLS was effectively assuming success.

So:
- Change sosend_generic() to free the mbuf chain if
  pru_send(PRUS_NOTREADY) fails.  Nothing else owns a reference to the
  chain at that point.
- Similarly, in vn_sendfile() change the !async I/O && KTLS case to free
  the chain.
- If async I/O is still outstanding when pru_send fails in
  vn_sendfile(), set an error in the sfio structure so that the
  connection is aborted and the mbuf chain is freed.

Reviewed by:	gallatin, tuexen
Discussed with:	jhb
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30349
2021-05-21 17:45:19 -04:00
Hans Petter Selasky
c82c200622 Accessing the epoch structure should happen after the INIT_CHECK().
Else the epoch pointer may be NULL.

MFC after:	1 week
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-05-21 11:21:32 +02:00
Hans Petter Selasky
f33168351b Properly define EPOCH(9) function macro.
No functional change intended.

MFC after:	1 week
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-05-21 11:21:32 +02:00
Hans Petter Selasky
cc9bb7a9b8 Rework for-loop in EPOCH(9) to reduce indentation level.
No functional change intended.

MFC after:	1 week
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-05-21 11:21:32 +02:00
Lv Yunlong
b295c5ddce socket: Release cred reference later in sodealloc()
We dereference so->so_cred to update the per-uid socket buffer
accounting, so the crfree() call must be deferred until after that
point.

PR:		255869
MFC after:	1 week
2021-05-18 15:25:40 -04:00
Konstantin Belousov
8cf912b017 ttydev_write: prevent stops while terminal is busied
Since busy state is checked by all blocked writes, stopping a process
which waits in ttydisc_write() causes cascade.  Utilize sigdeferstop()
to avoid the issue.

Submitted by:	Jakub Piecuch <j.piecuch96@gmail.com>
PR:	255816
MFC after:	1 week
2021-05-18 20:52:03 +03:00
Mateusz Guzik
cc6f46ac2f vfs: refactor vdrop
In particular move vunlazy into its own routine.
2021-05-18 15:30:28 +00:00
Mateusz Guzik
715fcc0d34 vfs: change vn_freevnodes_* prefix to idiomatic vfs_freevnodes_* 2021-05-18 15:30:28 +00:00
Colin Percival
b6be9566d2 Fix buffer overflow in preloaded hostuuid cleaning
When a module of type "hostuuid" is provided by the loader,
prison0_init strips any trailing whitespace and ASCII control
characters by (a) adjusting the buffer length, and (b) zeroing out
the characters in question, before storing it as the system's
hostuuid.

The buffer length adjustment was correct, but the zeroing overwrote
one byte higher in memory than intended -- in the typical case,
zeroing one byte past the end of the hostuuid buffer.  Due to the
layout of buffers passed by the boot loader to the kernel, this will
be the first byte of a subsequent buffer.

This was *probably* harmless; prison0_init runs after preloaded kernel
modules have been linked and after the preloaded /boot/entropy cache
has been processed, so in both cases having the first byte overwritten
will not cause problems.  We cannot however rule out the possibility
that other objects which are preloaded by the loader could suffer from
having the first byte overwritten.

Since the zeroing does not in fact serve any purpose, remove it and
trim trailing whitespace and ASCII control characters by adjusting
the buffer length alone.

Fixes:		c3188289 Preload hostuuid for early-boot use
Reviewed by:	kevans, markj
MFC after:	3 days
2021-05-17 20:07:49 -07:00
Colin Percival
330f110bf1 Fix 'hostuuid: preload data malformed' warning
If the preloaded hostuuid value is invalid and verbose booting is
enabled, a warning is printed.  This printf had two bugs:

1. It was missing a trailing \n character.
2. The malformed UUID is printed with %s even though it is not known
to be NUL-terminated.

This commit adds the missing \n and uses %.*s with the (already known)
length of the preloaded UUID to ensure that we don't read past the end
of the buffer.

Reported by:	kevans
Fixes:		c3188289 Preload hostuuid for early-boot use
MFC after:	3 days
2021-05-17 20:07:49 -07:00
Kirk McKusick
9a2fac6ba6 Fix handling of embedded symbolic links (and history lesson).
The original filesystem release (4.2BSD) had no embedded sysmlinks.
Historically symbolic links were just a different type of file, so
the content of the symbolic link was contained in a single disk block
fragment. We observed that most symbolic links were short enough that
they could fit in the area of the inode that normally holds the block
pointers. So we created embedded symlinks where the content of the
link was held in the inode's pointer area thus avoiding the need to
seek and read a data fragment and reducing the pressure on the block
cache. At the time we had only UFS1 with 32-bit block pointers,
so the test for a fastlink was:

	di_size < (NDADDR + NIADDR) * sizeof(daddr_t)

(where daddr_t would be ufs1_daddr_t today).

When embedded symlinks were added, a spare field in the superblock
with a known zero value became fs_maxsymlinklen. New filesystems
set this field to (NDADDR + NIADDR) * sizeof(daddr_t). Embedded
symlinks were assumed when di_size < fs->fs_maxsymlinklen. Thus
filesystems that preceeded this change always read from blocks
(since fs->fs_maxsymlinklen == 0) and newer ones used embedded
symlinks if they fit. Similarly symlinks created on pre-embedded
symlink filesystems always spill into blocks while newer ones will
embed if they fit.

At the same time that the embedded symbolic links were added, the
on-disk directory structure was changed splitting the former
u_int16_t d_namlen into u_int8_t d_type and u_int8_t d_namlen.
Thus fs_maxsymlinklen <= 0 (as used by the OFSFMT() macro) can
be used to distinguish old directory formats. In retrospect that
should have just been an added flag, but we did not realize we
needed to know about that change until it was already in production.

Code was split into ufs/ffs so that the log structured filesystem could
use ufs functionality while doing its own disk layout. This meant
that no ffs superblock fields could be used in the ufs code. Thus
ffs superblock fields that were needed in ufs code had to be copied
to fields in the mount structure. Since ufs_readlink needed to know
if a link was embedded, fs_maxlinklen gets copied to mnt_maxsymlinklen.

The kernel panic that arose to making this fix was triggered when a
disk error created an inode of type symlink with no allocated data
blocks but a large size. When readlink was called the uiomove was
attempted which segment faulted.

static int
ufs_readlink(ap)
	struct vop_readlink_args /* {
		struct vnode *a_vp;
		struct uio *a_uio;
		struct ucred *a_cred;
	} */ *ap;
{
	struct vnode *vp = ap->a_vp;
	struct inode *ip = VTOI(vp);
	doff_t isize;

	isize = ip->i_size;
	if ((isize < vp->v_mount->mnt_maxsymlinklen) ||
	    DIP(ip, i_blocks) == 0) { /* XXX - for old fastlink support */
		return (uiomove(SHORTLINK(ip), isize, ap->a_uio));
	}
	return (VOP_READ(vp, ap->a_uio, 0, ap->a_cred));
}

The second part of the "if" statement that adds

	DIP(ip, i_blocks) == 0) { /* XXX - for old fastlink support */

is problematic. It never appeared in BSD released by Berkeley because
as noted above mnt_maxsymlinklen is 0 for old format filesystems, so
will always fall through to the VOP_READ as it should. I had to dig
back through `git blame' to find that Rodney Grimes added it as
part of ``The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.''
He must have brought it across from an earlier FreeBSD. Unfortunately
the source-control logs for FreeBSD up to the merger with the
AT&T-blessed 4.4BSD-Lite conversion were destroyed as part of the
agreement to let FreeBSD remain unencumbered, so I cannot pin-point
where that line got added on the FreeBSD side.

The one change needed here is that mnt_maxsymlinklen is declared as
an `int' and should be changed to be `u_int64_t'.

This discovery led us to check out the code that deletes symbolic
links. Specifically

	if (vp->v_type == VLNK &&
	    (ip->i_size < vp->v_mount->mnt_maxsymlinklen ||
	     datablocks == 0)) {
		if (length != 0)
			panic("ffs_truncate: partial truncate of symlink");
		bzero(SHORTLINK(ip), (u_int)ip->i_size);
		ip->i_size = 0;
		DIP_SET(ip, i_size, 0);
		UFS_INODE_SET_FLAG(ip, IN_SIZEMOD | IN_CHANGE | IN_UPDATE);
		if (needextclean)
			goto extclean;
		return (ffs_update(vp, waitforupdate));
	}

Here too our broken symlink inode with no data blocks allocated
and a large size will segment fault as we are incorrectly using the
test that we have no data blocks to decide that it is an embdedded
symbolic link and attempting to bzero past the end of the inode.
The test for datablocks == 0 is unnecessary as the test for
ip->i_size < vp->v_mount->mnt_maxsymlinklen will do the right
thing in all cases.

The test for datablocks == 0 was added by David Greenman in this commit:

Author: David Greenman <dg@FreeBSD.org>
Date:   Tue Aug 2 13:51:05 1994 +0000

    Completed (hopefully) the kernel support for old style "fastlinks".

    Notes:
	svn path=/head/; revision=1821

I am guessing that he likely earlier added the incorrect test in the
ufs_readlink code.

I asked David if he had any recollection of why he made this change.
Amazingly, he still had a recollection of why he had made a one-line
change more than twenty years ago. And unsurpisingly it was because
he had been stuck between a rock and a hard place.

FreeBSD was up to 1.1.5 before the switch to the 4.4BSD-Lite code
base. Prior to that, there were three years of development in all
areas of the kernel, including the filesystem code, from the combined
set of people including Bill Jolitz, Patchkit contributors, and
FreeBSD Project members. The compatibility issue at hand was caused
by the FASTLINKS patches from Curt Mayer. In merging in the 4.4BSD-Lite
changes David had to find a way to provide compatibility with both
the changes that had been made in FreeBSD 1.1.5 and with 4.4BSD-Lite.
He felt that these changes would provide compatibility with both systems.

In his words:
``My recollection is that the 'FASTLINKS' symlinks support in
FreeBSD-1.x, as implemented by Curt Mayer, worked differently than
4.4BSD. He used a spare field in the inode to duplicately store the
length. When the 4.4BSD-Lite merge was done, the optimized symlinks
support for existing filesystems (those that were initialized in
FreeBSD-1.x) were broken due to the FFS on-disk structure of
4.4BSD-Lite differing from FreeBSD-1.x. My commit was needed to
restore the backward compatibility with FreeBSD-1.x filesystems.
I think it was the best that could be done in the somewhat urgent
circumstances of the post Berkeley-USL settlement. Also, regarding
Rod's massive commit with little explanation, some context: John
Dyson and I did the initial re-port of the 4.4BSD-Lite kernel to
the 386 platform in just 10 days. It was by far the most intense
hacking effort of my life. In addition to the porting of tons of
FreeBSD-1 code, I think we wrote more than 30,000 lines of new code
in that time to deal with the missing pieces and architectural
changes of 4.4BSD-Lite. We didn't make many notes along the way.
There was a lot of pressure to get something out to the rest of the
developer community as fast as possible, so detailed discrete commits
didn't happen - it all came as a giant wad, which is why Rod's
commit message was worded the way it was.''

Reported by:  Chuck Silvers
Tested by:    Chuck Silvers
History by:   David Greenman Lawrence
MFC after:    1 week
Sponsored by: Netflix
2021-05-16 17:04:11 -07:00
Mateusz Guzik
852088f6af vfs: add missing atomic conversion to writecount adjustment
Fixes:	("vfs: lockless writecount adjustment in set/unset text")
2021-05-14 17:42:05 +02:00
Mateusz Guzik
ca1ce50b2b vfs: add more safety against concurrent forced unmount to vn_write
1. stop re-reading ->v_mount (can become NULL)
2. stop re-reading ->v_type (can change to VBAD)
2021-05-14 14:22:22 +00:00
Mateusz Guzik
b5fb9ae687 vfs: lockless writecount adjustment in set/unset text
... for cases where this is not the first/last exec.
2021-05-14 14:22:21 +00:00
Mark Johnston
2cca77ee01 kqueue timer: Remove detached knotes from the process stop queue
There are some scenarios where a timer event may be detached when it is
on the process' kqueue timer stop queue.  If kqtimer_proc_continue() is
called after that point, it will iterate over the queue and access freed
timer structures.

It is also possible, at least in a multithreaded program, for a stopped
timer event to be scheduled without removing it from the process' stop
queue.  Ensure that we do not doubly enqueue the event structure in this
case.

Reported by:	syzbot+cea0931bb4e34cd728bd@syzkaller.appspotmail.com
Reported by:	syzbot+9e1a2f3734652015998c@syzkaller.appspotmail.com
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30251
2021-05-14 10:08:14 -04:00
Ed Maste
2c9764f36b regen syscall files after d51198d63b63 2021-05-13 14:09:58 -04:00
Mark Johnston
8b3c4231ab posix timers: Check for overflow when converting to ns
Disallow a time or timer period value when the conversion to nanoseconds
would overflow.  Otherwise it is possible to trigger a divison by zero
in realtime_expire_l(), where we compute the number of overruns by
dividing by the timer interval.

Fixes:	7995dae9 ("posix timers: Improve the overrun calculation")
Reported by:	syzbot+5ab360bd3d3e3c5a6e0e@syzkaller.appspotmail.com
Reported by:	syzbot+157b74ff493140d86eac@syzkaller.appspotmail.com
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30233
2021-05-13 08:34:03 -04:00
Mark Johnston
9246b3090c fork: Suspend other threads if both RFPROC and RFMEM are not set
Otherwise, a multithreaded parent process may trigger races in
vm_forkproc() if one thread calls rfork() with RFMEM set and another
calls rfork() without RFMEM.

Also simplify vm_forkproc() a bit, vmspace_unshare() already checks to
see if the address space is shared.

Reported by:	syzbot+0aa7c2bec74c4066c36f@syzkaller.appspotmail.com
Reported by:	syzbot+ea84cb06937afeae609d@syzkaller.appspotmail.com
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30220
2021-05-13 08:33:23 -04:00
Mateusz Guzik
cef8a95acb vfs: fix vnode use count leak in O_EMPTY_PATH support
The vnode returned by namei_setup is already referenced.

Reported by:	pho
2021-05-13 09:39:27 +00:00
Konstantin Belousov
6de3cf14c4 vn_open_cred(): disallow O_CREAT | O_EMPTY_PATH
This combination does not make sense, and cannot be satisfied by lookup.
In particular, lookup cannot supply dvp, it only can directly return vp.

Reported and reviewed by:	markj using syzkaller
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2021-05-13 02:32:04 +03:00
Mark Johnston
d8acd2681b Fix mbuf leaks in various pru_send implementations
The various protocol implementations are not very consistent about
freeing mbufs in error paths.  In general, all protocols must free both
"m" and "control" upon an error, except if PRUS_NOTREADY is specified
(this is only implemented by TCP and unix(4) and requires further work
not handled in this diff), in which case "control" still must be freed.

This diff plugs various leaks in the pru_send implementations.

Reviewed by:	tuexen
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30151
2021-05-12 13:00:09 -04:00
Mateusz Guzik
12288bd999 cache: fix lockless absolute symlink traversal to non-fp mounts
Said lookups would incorrectly fail with EOPNOTSUP.

Reported by:	kib
2021-05-11 04:30:12 +00:00
Mark Johnston
c8bbb1272c vfs: Fix error handling in vn_fullpath_hardlink()
vn_fullpath_any_smr() will return a positive error number if the
caller-supplied buffer isn't big enough.  In this case the error must be
propagated up, otherwise we may copy out uninitialized bytes.

Reported by:	syzkaller+KMSAN
Reviewed by:	mjg, kib
MFC aftr:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30198
2021-05-10 20:22:27 -04:00
Konstantin Belousov
5e7cdf1817 openat(2): add O_EMPTY_PATH
It reopens the passed file descriptor, checking the file backing vnode'
current access rights against open mode. In particular, this flag allows
to convert file descriptor opened with O_PATH, into operable file
descriptor, assuming permissions allow that.

Reviewed by:	markj
Tested by:	Andrew Walker <awalker@ixsystems.com>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D30148
2021-05-11 02:39:24 +03:00
Konstantin Belousov
d474440ab3 Constify vm_pager-related virtual tables.
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D30070
2021-05-07 17:08:03 +03:00
Mark Johnston
9a7c2de364 realloc: Fix KASAN(9) shadow map updates
When copying from the old buffer to the new buffer, we don't know the
requested size of the old allocation, but only the size of the
allocation provided by UMA.  This value is "alloc".  Because the copy
may access bytes in the old allocation's red zone, we must mark the full
allocation valid in the shadow map.  Do so using the correct size.

Reported by:	kp
Tested by:	kp
Sponsored by:	The FreeBSD Foundation
2021-05-05 17:12:51 -04:00
Warner Losh
a512d0ab00 kern: clarify boot time
In FreeBSD, the current time is computed from uptime + boottime. Uptime
is a continuous, smooth function that's monotonically increasing. To
effect changes to the current time, boottime is adjusted.  boottime is
mutable and shouldn't be cached against future need. Document the
current implementation, with the caveat that we may stop stepping
boottime on resume in the future and will step uptime instead (noted in
the commit message, but not in the code).

Sponsored by:		Netflix
Reviewed by:		phk, rpokala
Differential Revision:	https://reviews.freebsd.org/D30116
2021-05-05 12:32:13 -06:00
Elliott Mitchell
a3c7da3d08 kern/intr: declare interrupt vectors unsigned
These should never get values large enough for sign to matter, but one
of them becoming negative could cause problems.

MFC after:	1 week
Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D29327
2021-05-03 13:24:30 -04:00
Mark Johnston
2b2d77e720 VOP_STAT: Provide a default value for va_gen
Some filesystems, e.g., pseudofs and the NFSv3 client, do not provide
one.

Reviewed by:	kib
Reported by:	KMSAN
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30091
2021-05-03 13:24:30 -04:00
Mark Johnston
cdfcfc607a smp: Initialize arg->cpus sooner in smp_rendezvous_cpus_retry()
Otherwise, if !smp_started is true, then smp_rendezvous_cpus_done() will
harmlessly perform an atomic RMW on an uninitialized variable.

Reported by:	KMSAN
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2021-05-03 13:24:30 -04:00
Konstantin Belousov
7cb40543e9 filt_timerexpire: do not iterate over the interval
User-supplied data might make this loop too time-consuming. Divide
directly, and handle both the possibility that we were woken up earlier,
and arithmetic overflows/underflows from the calculation.

Reported and tested by:	pho (previous version)
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D30069
2021-05-03 19:49:54 +03:00
Konstantin Belousov
87a64872cd Add ptrace(PT_COREDUMP)
It writes the core of live stopped process to the file descriptor
provided as an argument.

Based on the initial version from https://reviews.freebsd.org/D29691,
submitted by Michał Górny <mgorny@gentoo.org>.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29955
2021-05-03 19:18:26 +03:00
Konstantin Belousov
68d311b666 ptracestop: mark threads suspended there with the new TDB_SSWITCH flag
This way threads in ptracestop can be discovered by debugger

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29955
2021-05-03 19:18:25 +03:00
Konstantin Belousov
9ebf9100ba ptrace: do not allow for parallel ptrace requests
Set a new P2_PTRACEREQ flag around the request Wait for the target     .
process P2_PTRACEREQ flag to clear before setting ours                 .

Otherwise, we rely on the moment that the process lock is not dropped
until the stopped target state is important.  This is going to be no
longer true after some future change.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29955
2021-05-03 19:16:30 +03:00
Konstantin Belousov
54c8baa021 kern_ptrace(): extract code to determine ptrace eligibility into helper
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29955
2021-05-03 19:13:48 +03:00
Konstantin Belousov
2bd0506c8d kern_ptrace: change type of proctree_locked to bool
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29955
2021-05-03 19:13:48 +03:00
Konstantin Belousov
af928fded0 Add thread_run_flash() helper
It unsuspends single suspended thread, passed as the argument.
It is up to the caller to arrange the target thread to suspend later,
since the state of the process is not changed from stopped.  In particular,
the unsuspended thread must not leave to userspace, since boundary code
is not prepared to this situation.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29955
2021-05-03 19:13:47 +03:00
Konstantin Belousov
15465a2c25 Add sleepq_remove_nested()
The helper removes the thread from a sleep queue, assuming that it would
need to sleep. The sleepq_remove_nested() function is intended for quite
special case, where suspended thread from traced stopped process is
temporary unsuspended to do some work on behalf of the debugger in the
target context, and this work might require sleep.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29955
2021-05-03 19:13:47 +03:00
Konstantin Belousov
86ffb3d1a0 ELF coredump: define several useful flags for the coredump operations
- SVC_ALL request dumping all map entries, including those marked as
  non-dumpable
- SVC_NOCOMPRESS disallows compressing the dump regardless of the sysctl
  policy
- SVC_PC_COREDUMP is provided for future use by userspace core dump
  request

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29955
2021-05-03 19:13:47 +03:00
Konstantin Belousov
5bc3c61780 imgact_elf: consistently pass flags from coredump down to helper functions
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29955
2021-05-03 19:13:47 +03:00
Rick Macklem
4f592683c3 copy_file_range(2): improve copying of a large hole to EOF
PR#255523 reported that a file copy for a file with a large hole
to EOF on ZFS ran slowly over NFSv4.2.
The problem was that vn_generic_copy_file_range() would
loop around reading the hole's data and then see it is all
0s. It was coded this way since UFS always allocates a data
block near the end of the file, such that a hole to EOF never exists.

This patch modifies vn_generic_copy_file_range() to check for a
ENXIO returned from VOP_IOCTL(..FIOSEEKDATA..) and handle that
case as a hole to EOF. asomers@ confirms that it works for his
ZFS test case.

PR:	255523
Tested by:	asomers
Reviewed by:	asomers
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D30076
2021-05-02 16:04:27 -07:00
Konstantin Belousov
2082565798 O_PATH: disable kqfilter for fifos
Filter on fifos is real filter for the object, and not a filesystem
events filter like EVFILT_VNODE.

Reported by:	markj using syzkaller
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2021-04-30 17:43:45 +03:00
Mark Johnston
20e3b9d8bd kasan: Use vm_offset_t for the first parameter to kasan_shadow_map()
No functional change intended.

Sponsored by:	The FreeBSD Foundation
2021-04-29 11:39:02 -04:00
Mateusz Guzik
074abaccfa cache: remove incomplete lockless lockout support during resize
This is already properly handled thanks to 2 step hash replacement.
2021-04-28 19:53:25 +00:00
Mark Johnston
d1e9441583 pipe: Avoid calling selrecord() on a closing pipe
pipe_poll() may add the calling thread to the selinfo lists of both ends
of a pipe.  It is ok to do this for the local end, since we know we hold
a reference on the file and so the local end is not closed.  It is not
ok to do this for the remote end, which may already be closed and have
called seldrain().  In this scenario, when the polling thread wakes up,
it may end up referencing a freed selinfo.

Guard the selrecord() call appropriately.

Reviewed by:	kib
Reported by:	syzkaller+KASAN
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D30016
2021-04-28 10:43:29 -04:00
Thomas Munro
3aaaa2efde poll(2): Add POLLRDHUP.
Teach poll(2) to support Linux-style POLLRDHUP events for sockets, if
requested.  Triggered when the remote peer shuts down writing or closes
its end.

Reviewed by:	kib
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D29757
2021-04-28 23:00:31 +12:00
Mark Johnston
409ab7e109 imgact_elf: Ensure that the return value in parse_notes is initialized
parse_notes relies on the caller-supplied callback to initialize "res".
Two callbacks are used in practice, brandnote_cb and note_fctl_cb, and
the latter fails to initialize res.  Fix it.

In the worst case, the bug would cause the inner loop of check_note to
examine more program headers than necessary, and the note header usually
comes last anyway.

Reviewed by:	kib
Reported by:	KMSAN
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29986
2021-04-26 14:53:16 -04:00
Edward Tomasz Napierala
5d1d844a77 kern_linkat: modify to accept AT_ flags instead of FOLLOW/NOFOLLOW
This makes this API match other kern_xxxat() functions.

Reviewed By:	kib
Sponsored By:	EPSRC
Differential Revision:	https://reviews.freebsd.org/D29776
2021-04-25 14:13:12 +01:00
Robert Watson
af14713d49 Support run-time configuration of the PIPE_MINDIRECT threshold.
PIPE_MINDIRECT determines at what (blocking) write size one-copy
optimizations are applied in pipe(2) I/O.  That threshold hasn't
been tuned since the 1990s when this code was originally
committed, and allowing run-time reconfiguration will make it
easier to assess whether contemporary microarchitectures would
prefer a different threshold.

(On our local RPi4 baords, the 8k default would ideally be at least
32k, but it's not clear how generalizable that observation is.)

MFC after:	3 weeks
Reviewers:	jrtc27, arichardson
Differential Revision: https://reviews.freebsd.org/D29819
2021-04-24 20:04:28 +01:00
Mark Johnston
8e8f1cc9bb Re-enable network ioctls in capability mode
This reverts a portion of 274579831b ("capsicum: Limit socket
operations in capability mode") as at least rtsol and dhcpcd rely on
being able to configure network interfaces while in capability mode.

Reported by:	bapt, Greg V
Sponsored by:	The FreeBSD Foundation
2021-04-23 09:22:49 -04:00
Warner Losh
df456a1fcf newbus: style nit (align comments)
Sponsored by:		Netflix
2021-04-21 15:37:24 -06:00
Warner Losh
1eebd6158c newbus: Optimize/Simplify kobj_class_compile_common a little
"i" is not used in this loop at all. There's no need to initialize and
increment it.

Reviewed by:		markj@
Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D29898
2021-04-21 15:37:24 -06:00
Konstantin Belousov
54f98c4dbf vn_open_vnode(): handle error when fp == NULL
If VOP_ADD_WRITECOUNT() or adv locking failed, so VOP_CLOSE() needs to
be called, we cannot use fp fo_close() when there is no fp.  This occurs
when e.g. kernel code directly calls vn_open() instead of the open(2)
syscall.

In this case, VOP_CLOSE() can be called directly, after possible lock
upgrade.

Reported by:	nvass@gmx.com
PR:	255119
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D29830
2021-04-21 18:06:51 +03:00
Konstantin Belousov
ecfbddf0cd sysctl vm.objects: report backing object and swap use
For anonymous objects, provide a handle kvo_me naming the object,
and report the handle of the backing object.  This allows userspace
to deconstruct the shadow chain.  Right now the handle is the address
of the object in KVA, but this is not guaranteed.

For the same anonymous objects, report the swap space used for actually
swapped out pages, in kvo_swapped field.  I do not believe that it is
useful to report full 64bit counter there, so only uint32_t value is
returned, clamped to the max.

For kinfo_vmentry, report anonymous object handle backing the entry,
so that the shadow chain for the specific mapping can be deconstructed.

Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29771
2021-04-19 21:32:01 +03:00
Konstantin Belousov
4342ba184c sysctl_handle_string: do not malloc when SYSCTL_IN cannot fault
In particular, this avoids malloc(9) calls when from early tunable handling,
with no working malloc yet.

Reported and tested by:	mav
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2021-04-19 21:32:01 +03:00
Konstantin Belousov
578c26f31c linkat(2): check NIRES_EMPTYPATH on the first fd arg
Reported by:	arichardson
Reviewed by:	markj
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29834
2021-04-19 21:32:01 +03:00
Warner Losh
571a1a64b1 Minor style tidy: if( -> if (
Fix a few 'if(' to be 'if (' in a few places, per style(9) and
overwhelming usage in the rest of the kernel / tree.

MFC After:		3 days
Sponsored by:		Netflix
2021-04-18 11:19:15 -06:00
Warner Losh
f1f9870668 Minor style cleanup
We prefer 'while (0)' to 'while(0)' according to grep and stlye(9)'s
space after keyword rule. Remove a few stragglers of the latter.
Many of these usages were inconsistent within the file.

MFC After:		3 days
Sponsored by:		Netflix
2021-04-18 11:14:17 -06:00
Konstantin Belousov
bbf7a4e878 O_PATH: allow vnode kevent filter on such files
if VREAD access is checked as allowed during open

Requested by:	wulf
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29323
2021-04-15 12:49:18 +03:00
Konstantin Belousov
f9b923af34 O_PATH: Allow to open symlink
When O_NOFOLLOW is specified, namei() returns the symlink itself.  In
this case, open(O_PATH) should be allowed, to denote the location of symlink
itself.

Prevent O_EXEC in this case, execve(2) code is not ready to try to execute
symlinks.

Reported by:	wulf
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29323
2021-04-15 12:49:09 +03:00
Konstantin Belousov
a5970a529c Make files opened with O_PATH to not block non-forced unmount
by only keeping hold count on the vnode, instead of the use count.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29323
2021-04-15 12:48:27 +03:00
Konstantin Belousov
8d9ed174f3 open(2): Implement O_PATH
Reviewed by:	markj
Tested by:	pho
Discussed with:	walker.aj325_gmail.com, wulf
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29323
2021-04-15 12:48:24 +03:00
Konstantin Belousov
509124b626 Add AT_EMPTY_PATH for several *at(2) syscalls
It is currently allowed to fchownat(2), fchmodat(2), fchflagsat(2),
utimensat(2), fstatat(2), and linkat(2).

For linkat(2), PRIV_VFS_FHOPEN privilege is required to exercise the flag.
It allows to link any open file.

Requested by:	trasz
Tested by:	pho, trasz
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D29111
2021-04-15 12:48:11 +03:00
Konstantin Belousov
437c241d0c vfs_vnops.c: Make vn_statfile() non-static
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29323
2021-04-15 12:47:56 +03:00
Konstantin Belousov
42be0a7b10 Style.
Add missed spaces, wrap long lines.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29323
2021-04-15 12:47:46 +03:00
Mateusz Guzik
4f0279e064 cache: extend mismatch vnode assert print to include the name 2021-04-15 07:55:43 +00:00
Mark Johnston
29bb6c19f0 domainset: Define additional global policies
Add global definitions for first-touch and interleave policies.  The
former may be useful for UMA, which implements a similar policy without
using domainset iterators.

No functional change intended.

Reviewed by:	mav
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29104
2021-04-14 13:03:33 -04:00
Konstantin Belousov
75c5cf7a72 filt_timerexpire: avoid process lock recursion
Found by:	syzkaller
Reported and reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29746
2021-04-14 10:53:28 +03:00
Konstantin Belousov
5cc1d19941 realtimer_expire: avoid proc lock recursion when called from itimer_proc_continue()
It is fine to drop the process lock there, process cannot exit until its
timers are cleared.

Found by:	syzkaller
Reported and reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29746
2021-04-14 10:53:19 +03:00
Konstantin Belousov
116f26f947 sbuf_uionew(): sbuf_new() takes int as length
and length should be not less than SBUF_MINSIZE

Reported and tested by:	pho
Noted and reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D29752
2021-04-14 10:23:20 +03:00
Mark Johnston
06a53ecf24 malloc: Add state transitions for KASAN
- Reuse some REDZONE bits to keep track of the requested and allocated
  sizes, and use that to provide red zones.
- As in UMA, disable memory trashing to avoid unnecessary CPU overhead.

MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29461
2021-04-13 17:42:21 -04:00
Mark Johnston
f1c3adefd9 execve: Mark exec argument buffers
We cache mapped execve argument buffers to avoid the overhead of TLB
shootdowns.  Mark them invalid when they are freed to the cache.

MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29460
2021-04-13 17:42:21 -04:00
Mark Johnston
b261bb4057 vfs: Add KASAN state transitions for vnodes
vnodes are a bit special in that they may exist on per-CPU lists even
while free.  Add a KASAN-only destructor that poisons regions of each
vnode that are not expected to be accessed after a free.

MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29459
2021-04-13 17:42:21 -04:00
Mark Johnston
6faf45b34b amd64: Implement a KASAN shadow map
The idea behind KASAN is to use a region of memory to track the validity
of buffers in the kernel map.  This region is the shadow map.  The
compiler inserts calls to the KASAN runtime for every emitted load
and store, and the runtime uses the shadow map to decide whether the
access is valid.  Various kernel allocators call kasan_mark() to update
the shadow map.

Since the shadow map tracks only accesses to the kernel map, accesses to
other kernel maps are not validated by KASAN.  UMA_MD_SMALL_ALLOC is
disabled when KASAN is configured to reduce usage of the direct map.
Currently we have no mechanism to completely eliminate uses of the
direct map, so KASAN's coverage is not comprehensive.

The shadow map uses one byte per eight bytes in the kernel map.  In
pmap_bootstrap() we create an initial set of page tables for the kernel
and preloaded data.

When pmap_growkernel() is called, we call kasan_shadow_map() to extend
the shadow map.  kasan_shadow_map() uses pmap_kasan_enter() to allocate
memory for the shadow region and map it.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29417
2021-04-13 17:42:20 -04:00
Mark Johnston
38da497a4d Add the KASAN runtime
KASAN enables the use of LLVM's AddressSanitizer in the kernel.  This
feature makes use of compiler instrumentation to validate memory
accesses in the kernel and detect several types of bugs, including
use-after-frees and out-of-bounds accesses.  It is particularly
effective when combined with test suites or syzkaller.  KASAN has high
CPU and memory usage overhead and so is not suited for production
environments.

The runtime and pmap maintain a shadow of the kernel map to store
information about the validity of memory mapped at a given kernel
address.

The runtime implements a number of functions defined by the compiler
ABI.  These are prefixed by __asan.  The compiler emits calls to
__asan_load*() and __asan_store*() around memory accesses, and the
runtime consults the shadow map to determine whether a given access is
valid.

kasan_mark() is called by various kernel allocators to update state in
the shadow map.  Updates to those allocators will come in subsequent
commits.

The runtime also defines various interceptors.  Some low-level routines
are implemented in assembly and are thus not amenable to compiler
instrumentation.  To handle this, the runtime implements these routines
on behalf of the rest of the kernel.  The sanitizer implementation
validates memory accesses manually before handing off to the real
implementation.

The sanitizer in a KASAN-configured kernel can be disabled by setting
the loader tunable debug.kasan.disable=1.

Obtained from:	NetBSD
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29416
2021-04-13 17:42:20 -04:00
Mitchell Horne
2816bd8442 rmlock(9): add an RM_DUPOK flag
Allows for duplicate locks to be acquired without witness complaining.
Similar flags exists already for rwlock(9) and sx(9).

Reviewed by:	markj
MFC after:	3 days
Sponsored by:	NetApp, Inc.
Sponsored by:	Klara, Inc.
NetApp PR:	52
Differential Revision:	https://reviews.freebsd.org/D29683n
2021-04-12 11:42:21 -03:00
Mark Johnston
dfff37765c Rename struct device to struct _device
types.h defines device_t as a typedef of struct device *.  struct device
is defined in subr_bus.c and almost all of the kernel uses device_t.
The LinuxKPI also defines a struct device, so type confusion can occur.

This causes bugs and ambiguity for debugging tools.  Rename the FreeBSD
struct device to struct _device.

Reviewed by:	gbe (man pages)
Reviewed by:	rpokala, imp, jhb
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29676
2021-04-12 09:32:30 -04:00
Andrew Turner
5d2d599d3f Create VM_MEMATTR_DEVICE on all architectures
This is intended to be used with memory mapped IO, e.g. from
bus_space_map with no flags, or pmap_mapdev.

Use this new memory type in the map request configured by
resource_init_map_request, and in pciconf.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D29692
2021-04-12 06:15:31 +00:00
Konstantin Belousov
a091c35323 ptrace: restructure comments around reparenting on PT_DETACH
style code, and use {} for both branches.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2021-04-11 14:44:30 +03:00
Konstantin Belousov
9d7e450b64 ptrace: remove dead call to FIX_SSTEP()
It was an alias for procfs_fix_sstep() long time ago.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2021-04-11 14:44:30 +03:00
Piotr Pawel Stefaniak
a212f56d10 Balance parentheses in sysctl descriptions 2021-04-11 10:30:55 +02:00
Konstantin Belousov
2fd1ffefaa Stop arming kqueue timers on knote owner suspend or terminate
This way, even if the process specified very tight reschedule
intervals, it should be stoppable/killable.

Reported and reviewed by:	markj
Tested by:	markj, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D29106
2021-04-09 23:43:51 +03:00
Konstantin Belousov
533e5057ed Add helper for kqueue timers callout scheduling
Reviewed by:	markj
Tested by:	markj, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D29106
2021-04-09 23:42:56 +03:00
Konstantin Belousov
4d27d8d2f3 Stop arming realtime posix process timers on suspend or terminate
Reported and reviewed by:	markj
Tested by:	markj, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D29106
2021-04-09 23:42:51 +03:00
Konstantin Belousov
dc47fdf131 Stop arming periodic process timers on suspend or terminate
Reported and reviewed by:	markj
Tested by:	markj, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D29106
2021-04-09 23:42:44 +03:00
Mateusz Guzik
72b3b5a941 vfs: replace vfs_smr_quiesce with vfs_smr_synchronize
This ends up using a smr specific method.

Suggested by:	markj
Tested by:	pho
2021-04-08 11:14:45 +00:00
Mark Johnston
0f07c234ca Remove more remnants of sio(4)
Reviewed by:	imp
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29626
2021-04-07 14:33:02 -04:00
Mark Johnston
274579831b capsicum: Limit socket operations in capability mode
Capsicum did not prevent certain privileged networking operations,
specifically creation of raw sockets and network configuration ioctls.
However, these facilities can be used to circumvent some of the
restrictions that capability mode is supposed to enforce.

Add capability mode checks to disallow network configuration ioctls and
creation of sockets other than PF_LOCAL and SOCK_DGRAM/STREAM/SEQPACKET
internet sockets.

Reviewed by:	oshogbo
Discussed with:	emaste
Reported by:	manu
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D29423
2021-04-07 14:32:56 -04:00
Mateusz Guzik
13b3862ee8 cache: update an assert on CACHE_FPL_STATUS_ABORTED
Since symlink support it can get upgraded to CACHE_FPL_STATUS_DESTROYED.

Reported by:	bdrewery
2021-04-06 22:31:58 +02:00
Mark Johnston
2425f5e912 mount: Disallow mounting over a jail root
Discussed with:	jamie
Approved by:	so
Security:	CVE-2020-25584
Security:	FreeBSD-SA-21:10.jail_mount
2021-04-06 14:49:36 -04:00
Edward Tomasz Napierala
7f6157f7fd lock_delay(9): improve interaction with restrict_starvation
After e7a5b3bd05, the la->delay value was adjusted after
being set by the starvation_limit code block, which is wrong.

Reported By:	avg
Reviewed By:	avg
Fixes:		e7a5b3bd05
Sponsored By:	NetApp, Inc.
Sponsored By:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D29513
2021-04-03 13:08:53 +01:00
Mark Johnston
52a99c72b5 sendfile: Fix error initialization in sendfile_getobj()
Reviewed by:	chs, kib
Reported by:	jhb
Fixes:		faa998f6ff
MFC after:	1 day
Differential Revision:	https://reviews.freebsd.org/D29540
2021-04-02 17:42:38 -04:00
Richard Scheffenegger
cad4fd0365 Make sbuf_drain safe for external use
While sbuf_drain was an internal function, two
KASSERTS checked the sanity of it being called.
However, an external caller may be ignorant if
there is any data to drain, or if an error has
already accumulated. Be nice and return immediately
with the accumulated error.

MFC after: 2 weeks
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29544
2021-04-02 20:12:11 +02:00
Konstantin Belousov
aa3ea612be x86: remove gcov kernel support
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D29529
2021-04-02 15:41:51 +03:00
Mateusz Guzik
f79bd71def cache: add high level overview
Differential Revision:	https://reviews.freebsd.org/D28675
2021-04-02 05:11:05 +02:00
Mateusz Guzik
dc532884d5 cache: fix resizing in face of lockless lookup
Reported by:	pho
Tested by:	pho
2021-04-02 05:11:05 +02:00
Lawrence Stewart
1eb402e47a stats(3): Improve t-digest merging of samples which result in mu adjustment underflow.
Allow the calculation of the mu adjustment factor to underflow instead of
rejecting the VOI sample from the digest and logging an error. This trades off
some (currently unquantified) additional centroid error in exchange for better
fidelity of the distribution's density, which is the right trade off at the
moment until follow up work to better handle and track accumulated error can be
undertaken.

Obtained from:	Netflix
MFC after:	immediately
2021-04-02 13:17:53 +11:00
Richard Scheffenegger
c804c8f2c5 Export sbuf_drain to orchestrate lock and drain action
While exporting large amounts of data to a sysctl
request, datastructures may need to be locked.

Exporting the sbuf_drain function allows the
coordination between drain events and held
locks, to avoid stalls.

PR:		254333
Reviewed By:	jhb
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29481
2021-03-31 19:17:37 +02:00
Konstantin Belousov
e243367b64 mbuf: add a way to mark flowid as calculated from the internal headers
In some settings offload might calculate hash from decapsulated packet.
Reserve a bit in packet header rsstype to indicate that.

Add m_adj_decap() that acts similarly to m_adj, but also either clear
flowid if it is not marked as inner, or transfer it to the decapsulated
header, clearing inner indicator. It depends on the internals of m_adj()
that reuses the argument packet header for the result.

Use m_adj_decap() for decapsulating vxlan(4) and gif(4) input packets.

Reviewed by:	ae, hselasky, np
Sponsored by:	Nvidia Networking / Mellanox Technologies
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D28773
2021-03-31 14:38:26 +03:00
Mark Johnston
3428b6c050 Fix several dev_clone callbacks to avoid out-of-bounds reads
Use strncmp() instead of bcmp(), so that we don't have to find the
minimum of the string lengths before comparing.

Reviewed by:	kib
Reported by:	KASAN
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29463
2021-03-28 11:08:36 -04:00
Mark Johnston
71c160a8f6 vfs: Add an assertion around name length limits
Some filesystems assume that they can copy a name component, with length
bounded by NAME_MAX, into a dirent buffer of size MAXNAMLEN.  These
constants have the same value; add a compile-time assertion to that
effect.

Reported by:	Alexey Kulaev <alex.qart@gmail.com>
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29431
2021-03-27 13:45:19 -04:00
Mark Johnston
653a437c04 accept_filter: Fix filter parameter handling
For filters which implement accf_create, the setsockopt(2) handler
caches the filter name in the socket, but it also incorrectly frees the
buffer containing the copy, leaving a dangling pointer.  Note that no
accept filters provided in the base system are susceptible to this, as
they don't implement accf_create.

Reported by:	Alexey Kulaev <alex.qart@gmail.com>
Discussed with:	emaste
Security:	kernel use-after-free
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
2021-03-25 17:55:46 -04:00
Mark Johnston
ec8f1ea8d5 Generalize sanitizer interceptors for memory and string routines
Similar to commit 3ead60236f ("Generalize bus_space(9) and atomic(9)
sanitizer interceptors"), use a more generic scheme for interposing
sanitizer implementations of routines like memcpy().

No functional change intended.

MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2021-03-24 19:46:22 -04:00
Mark Johnston
3ead60236f Generalize bus_space(9) and atomic(9) sanitizer interceptors
Make it easy to define interceptors for new sanitizer runtimes, rather
than assuming KCSAN.  Lay a bit of groundwork for KASAN and KMSAN.

When a sanitizer is compiled in, atomic(9) and bus_space(9) definitions
in atomic_san.h are used by default instead of the inline
implementations in the platform's atomic.h.  These definitions are
implemented in the sanitizer runtime, which includes
machine/{atomic,bus}.h with SAN_RUNTIME defined to pull in the actual
implementations.

No functional change intended.

MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
2021-03-22 22:21:53 -04:00
Adrian Chadd
25bfa44860 Add device and ifnet logging methods, similar to device_printf / if_printf
* device_printf() is effectively a printf
* if_printf() is effectively a LOG_INFO

This allows subsystems to log device/netif stuff using different log levels,
rather than having to invent their own way to prefix unit/netif  names.

Differential Revision: https://reviews.freebsd.org/D29320
Reviewed by: imp
2021-03-22 00:02:34 +00:00
Alex Richardson
6ceacebdf5 Unbreak MSG_CMSG_CLOEXEC
MSG_CMSG_CLOEXEC has not been working since 2015 (SVN r284380) because
_finstall expects O_CLOEXEC and not UF_EXCLOSE as the flags argument.
This was probably not noticed because we don't have a test for this flag
so this commit adds one. I found this problem because one of the
libwayland tests was failing.

Fixes:		ea31808c3b ("fd: move out actual fp installation to _finstall")
MFC after:	3 days
Reviewed By:	mjg, kib
Differential Revision: https://reviews.freebsd.org/D29328
2021-03-18 20:52:20 +00:00
Mateusz Guzik
e9272225e6 vfs: fix vnlru marker handling for filtered/unfiltered cases
The global list has a marker with an invariant that free vnodes are
placed somewhere past that. A caller which performs filtering (like ZFS)
can move said marker all the way to the end, across free vnodes which
don't match. Then a caller which does not perform filtering will fail to
find them. This makes vn_alloc_hard sleep for 1 second instead of
reclaiming, resulting in significant stalls.

Fix the problem by requiring an explicit marker by callers which do
filtering.

As a temporary measure extend vnlru_free to restart if it fails to
reclaim anything.

Big thanks go to the reporter for testing several iterations of the
patch.

Reported by:	Yamagi <lists yamagi.org>
Tested by:	Yamagi <lists yamagi.org>
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D29324
2021-03-18 14:59:03 +00:00
Kyle Evans
f187d6dfbf base: remove if_wg(4) and associated utilities, manpage
After length decisions, we've decided that the if_wg(4) driver and
related work is not yet ready to live in the tree.  This driver has
larger security implications than many, and thus will be held to
more scrutiny than other drivers.

Please also see the related message sent to the freebsd-hackers@
and freebsd-arch@ lists by Kyle Evans <kevans@FreeBSD.org> on
2021/03/16, with the subject line "Removing WireGuard Support From Base"
for additional context.
2021-03-17 09:14:48 -05:00
Mark Johnston
4aa157dd5b link_elf_obj: Add a case missing from 5e6989ba4f
Fixes:		5e6989ba4f
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
2021-03-16 15:01:41 -04:00
Kyle Evans
74ae3f3e33 if_wg: import latest fixup work from the wireguard-freebsd project
This is the culmination of about a week of work from three developers to
fix a number of functional and security issues.  This patch consists of
work done by the following folks:

- Jason A. Donenfeld <Jason@zx2c4.com>
- Matt Dunwoodie <ncon@noconroy.net>
- Kyle Evans <kevans@FreeBSD.org>

Notable changes include:
- Packets are now correctly staged for processing once the handshake has
  completed, resulting in less packet loss in the interim.
- Various race conditions have been resolved, particularly w.r.t. socket
  and packet lifetime (panics)
- Various tests have been added to assure correct functionality and
  tooling conformance
- Many security issues have been addressed
- if_wg now maintains jail-friendly semantics: sockets are created in
  the interface's home vnet so that it can act as the sole network
  connection for a jail
- if_wg no longer fails to remove peer allowed-ips of 0.0.0.0/0
- if_wg now exports via ioctl a format that is future proof and
  complete.  It is additionally supported by the upstream
  wireguard-tools (which we plan to merge in to base soon)
- if_wg now conforms to the WireGuard protocol and is more closely
  aligned with security auditing guidelines

Note that the driver has been rebased away from using iflib.  iflib
poses a number of challenges for a cloned device trying to operate in a
vnet that are non-trivial to solve and adds complexity to the
implementation for little gain.

The crypto implementation that was previously added to the tree was a
super complex integration of what previously appeared in an old out of
tree Linux module, which has been reduced to crypto.c containing simple
boring reference implementations.  This is part of a near-to-mid term
goal to work with FreeBSD kernel crypto folks and take advantage of or
improve accelerated crypto already offered elsewhere.

There's additional test suite effort underway out-of-tree taking
advantage of the aforementioned jail-friendly semantics to test a number
of real-world topologies, based on netns.sh.

Also note that this is still a work in progress; work going further will
be much smaller in nature.

MFC after:	1 month (maybe)
2021-03-14 23:52:04 -05:00
Jonathan T. Looney
dbec10e088 Fetch the sigfastblock value in syscalls that wait for signals
We have seen several cases of processes which have become "stuck" in
kern_sigsuspend(). When this occurs, the kernel's td_sigblock_val
is set to 0x10 (one block outstanding) and the userspace copy of the
word is set to 0 (unblocked). Because the kernel's cached value
shows that signals are blocked, kern_sigsuspend() blocks almost all
signals, which means the process hangs indefinitely in sigsuspend().

It is not entirely clear what is causing this condition to occur.
However, it seems to make sense to add some protection against this
case by fetching the latest sigfastblock value from userspace for
syscalls which will sleep waiting for signals. Here, the change is
applied to kern_sigsuspend() and kern_sigtimedwait().

Reviewed by:	kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D29225
2021-03-12 18:14:17 +00:00
John Baldwin
640d54045b Set TDP_KTHREAD before calling cpu_fork() and cpu_copy_thread().
This permits these routines to use special logic for initializing MD
kthread state.

For the kproc case, this required moving the logic to set these flags
from kproc_create() into do_fork().

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D29207
2021-03-12 09:48:20 -08:00
Konstantin Belousov
44691b33cc vlrureclaim: only skip vnode with resident pages if it own the pages
Nullfs vnode which shares vm_object and pages with the lower vnode should
not be exempt from the reclaim just because lower vnode cached a lot.
Their reclamation is actually very cheap and should be preferred over
real fs vnodes, but this change is already useful.

Reported and tested by:	pho
Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D29178
2021-03-12 13:31:08 +02:00
Warner Losh
e52368365d config_intrhook: provide config_intrhook_drain
config_intrhook_drain will remove the hook from the list as
config_intrhook_disestablish does if the hook hasn't been called.  If it has,
config_intrhook_drain will wait for the hook to be disestablished in the normal
course (or expedited, it's up to the driver to decide how and when
to call config_intrhook_disestablish).

This is intended for removable devices that use config_intrhook and might be
attached early in boot, but that may be removed before the kernel can call the
config_intrhook or before it ends. To prevent all races, the detach routine will
need to call config_intrhook_train.

Sponsored by:		Netflix, Inc
Reviewed by:		jhb, mav, gde (in D29006 for man page)
Differential Revision:	https://reviews.freebsd.org/D29005
2021-03-11 09:45:10 -07:00
Hans Petter Selasky
6eb60f5b7f Use the word "LinuxKPI" instead of "Linux compatibility", to not confuse with
user-space Linux compatibility support. No functional change.

MFC after:	1 week
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-03-10 12:35:16 +01:00
Kyle Evans
1ae20f7c70 kern: malloc: fix panic on M_WAITOK during THREAD_NO_SLEEPING()
Simple condition flip; we wanted to panic here after epoch_trace_list().

Reviewed by:	glebius, markj
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D29125
2021-03-09 05:16:39 -06:00
Warner Losh
88a5591203 config_intrhook: Move from TAILQ to STAILQ and padding
config_intrhook doesn't need to be a two-pointer TAILQ. We rarely add/delete
from this and so those need not be optimized. Instaed, use the one-pointer
STAILQ plus a uintptr_t to be used as a flags word. This will allow these
changes to be MFC'd to 12 and 13 to fix a race in removable devices.

Feedback from: jhb
Reviewed by: mav
Differential Revision:	https://reviews.freebsd.org/D29004
2021-03-08 15:59:00 -07:00
Mark Johnston
435c7cfb24 Rename _cscan_atomic.h and _cscan_bus.h to atomic_san.h and bus_san.h
Other kernel sanitizers (KMSAN, KASAN) require interceptors as well, so
put these in a more generic place as a step towards importing the other
sanitizers.

No functional change intended.

MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29103
2021-03-08 12:39:06 -05:00
Mark Johnston
7995dae9d3 posix timers: Improve the overrun calculation
timer_settime(2) may be used to configure a timeout in the past.  If
the timer is also periodic, we also try to compute the number of timer
overruns that occurred between the initial timeout and the time at which
the timer fired.  This is done in a loop which iterates once per period
between the initial timeout and now.  If the period is small and the
initial timeout was a long time ago, this loop can take forever to run,
so the system is effectively DOSed.

Replace the loop with a more direct calculation of
(now - initial timeout) / period to compute the number of overruns.

Reported by:	syzkaller
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29093
2021-03-08 12:39:06 -05:00
Mark Johnston
60d12ef952 posix timers: Sprinkle some style fixes
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2021-03-08 12:39:06 -05:00
Mark Johnston
8ff2b41c05 posix timers: Declare unexported functions as static
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2021-03-08 12:39:06 -05:00
Konstantin Belousov
56b9bee63a Make kern.timecounter.hardware tunable
Noted and reviewed by:	kevans
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D29122
2021-03-08 03:48:21 +02:00
Hans Petter Selasky
c743a6bd4f Implement mallocarray_domainset(9) variant of mallocarray(9).
Reviewed by:	kib @
MFC after:	1 week
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2021-03-06 11:38:55 +01:00
Konstantin Belousov
ead7697f04 Restore AT_RESOLVE_BENEATH support for funlinkat(2)/unlinkat(2).
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2021-03-06 07:24:18 +02:00
Mark Johnston
89b650872b ktls: Hide initialization message behind bootverbose
We don't typically print anything when a subsystem initializes itself,
and KTLS is currently disabled by default anyway.

Reviewed by:	jhb
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29097
2021-03-05 13:11:02 -05:00
Mark Johnston
5e6989ba4f link_elf_obj: Handle init_array sections in KLDs
Reuse existing handling for .ctors, print a warning if multiple
constructor sections are present.   Destructors are not handled as of
yet.

This is required for KASAN.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29049
2021-03-04 10:07:10 -05:00
Mark Johnston
49f6925ca3 ktls: Cache output buffers for software encryption
Maintain a cache of physically contiguous runs of pages for use as
output buffers when software encryption is configured and in-place
encryption is not possible.  This makes allocation and free cheaper
since in the common case we avoid touching the vm_page structures for
the buffer, and fewer calls into UMA are needed.  gallatin@ reports a
~10% absolute decrease in CPU usage with sendfile/KTLS on a Xeon after
this change.

It is possible that we will not be able to allocate these buffers if
physical memory is fragmented.  To avoid frequently calling into the
physical memory allocator in this scenario, rate-limit allocation
attempts after a failure.  In the failure case we fall back to the old
behaviour of allocating a page at a time.

N.B.: this scheme could be simplified, either by simply using malloc()
and looking up the PAs of the pages backing the buffer, or by falling
back to page by page allocation and creating a mapping in the cache
zone.  This requires some way to save a mapping of an M_EXTPG page array
in the mbuf, though.  m_data is not really appropriate.  The second
approach may be possible by saving the mapping in the plinks union of
the first vm_page structure of the array, but this would force a vm_page
access when freeing an mbuf.

Reviewed by:	gallatin, jhb
Tested by:	gallatin
Sponsored by:	Ampere Computing
Submitted by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D28556
2021-03-03 17:34:01 -05:00
Alan Somers
ab63da3564 Speed up geom_stats_resync in the presence of many devices
The old code had a O(n) loop, where n is the size of /dev/devstat.
Multiply that by another O(n) loop in devstat_mmap for a total of
O(n^2).

This change adds DIOCGMEDIASIZE support to /dev/devstat so userland can
quickly determine the right amount of memory to map, eliminating the
O(n) loop in userland.

This change decreases the time to run "gstat -bI0.001" with 16,384 md
devices from 29.7s to 4.2s.

Also, fix a memory leak first reported as PR 203097.

Sponsored by:	Axcient
Reviewed by:	mav, imp
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D28968
2021-03-02 18:33:45 -07:00
Robert Wing
0cc746f193 filt_fsevent: only record interested events
Respect filter-specific flags for the EVFILT_FS filter.

When a kevent is registered with the EVFILT_FS filter, it is always
triggered when an EVFILT_FS event occurs, regardless of the
filter-specific flags used. Fix that.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D28974
2021-03-02 14:19:22 -09:00
Konstantin Belousov
28cd3a673e O_RELATIVE_BENEATH: return ENOTCAPABLE instead of EINVAL for abs path
Requested and reviewed by:	markj
Tested by:	arichardson,  pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D28907
2021-03-02 20:21:40 +02:00
Konstantin Belousov
49c98a4bf3 nameicap_check_dotdot: trim tracker on check
Tracker should contain exactly the path from the starting directory to
the current lookup point. Otherwise we might not detect some cases of
dotdot escape. Consequently, if we are walking up the tree by dotdot
lookup, we must remove an entries below the walked directory.

Reviewed by:	markj
Tested by:	arichardson, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D28907
2021-03-02 20:21:35 +02:00
Konstantin Belousov
e8a2862aa0 Add nameicap_cleanup_from(), to clean tracker list starting from some element
Reviewed by:	markj
Tested by:	arichardson, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D28907
2021-03-02 20:21:30 +02:00
Konstantin Belousov
2388ad7c29 nameicap_tracker_add: avoid duplicates in the tracker list
Reviewed by:	markj
Tested by:	arichardson, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D28907
2021-03-02 20:21:23 +02:00
Konstantin Belousov
59e7494281 Do not call nameicap_tracker_add() for dotdot case.
Reviewed by:	markj
Tested by:	arichardson, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D28907
2021-03-02 20:21:14 +02:00
Konstantin Belousov
20e91ca36a open(2): Remove O_BENEATH and AT_BENEATH
with the reasoning that the flags did not worked properly, and were not
shipped in a release.

O_RESOLVE_BENEATH is kept as useful.

Reviewed by:	markj
Tested by:	arichardson, pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D28907
2021-03-02 20:16:55 +02:00
Kyle Evans
60c4ec806d jail: allow root to implicitly widen its cpuset to attach
The default behavior for attaching processes to jails is that the jail's
cpuset augments the attaching processes, so that it cannot be used to
escalate a user's ability to take advantage of more CPUs than the
administrator wanted them to.

This is problematic when root needs to manage jails that have disjoint
sets with whatever process is attaching, as this would otherwise result
in a deadlock. Therefore, if we did not have an appropriate common
subset of cpus/domains for our new policy, we now allow the process to
simply take on the jail set *if* it has the privilege to widen its mask
anyways.

With the new logic, root can still usefully cpuset a process that
attaches to a jail with the desire of maintaining the set it was given
pre-attachment while still retaining the ability to manage child jails
without jumping through hoops.

A test has been added to demonstrate the issue; cpuset of a process
down to just the first CPU and attempting to attach to a jail without
access to any of the same CPUs previously resulted in EDEADLK and now
results in taking on the jail's mask for privileged users.

PR:		253724
Reviewed by:	jamie (also discussed with)
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D28952
2021-03-01 12:38:31 -06:00
Rick Macklem
a5f9fe2bab copy_file_range(2): Fix for small values of input file offset and len
r366302 broke copy_file_range(2) for small values of
input file offset and len.

It was possible for rem to be greater than len and then
"len - rem" was a large value, since both variables are
unsigned.

Reported by: koobs, Pablo <pablogsal gmail com> (Python)
Reviewed by:	asomers, koobs
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D28981
2021-03-01 06:31:10 -08:00
Konstantin Belousov
b5449c92b4 Use atomic_interrupt_fence() instead of bare __compiler_membar()
for the which which definitely use membar to sync with interrupt handlers.

libc and rtld uses of __compiler_membar() seems to want compiler barriers
proper.

The barrier in sched_unpin_lite() after td_pinned decrement seems to be not
needed and removed, instead of convertion.

Reviewed by:	markj
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28956
2021-02-28 01:27:29 +02:00
Mateusz Guzik
1239a72221 cache: temporarily drop the assert that dvp != vp when adding an entry
Historically it was allowed for any names, but arguably should never be
even attempted. Allow it again since there is a release pending and
allowing it is bug-compatible with previous behavior.

Reported by:	otis
2021-02-27 22:29:50 +00:00
Jamie Gritton
589e4c1df4 jail: Add safety around prison_deref() flags.
do_jail_attach() now only uses the PD_XXX flags that refer to lock
status, so make sure that something else like PD_KILL doesn't slip
through.

Add a KASSERT() in prison_deref() to catch any further PD_KILL misuse.
2021-02-25 20:10:42 -08:00
Jamie Gritton
108a9384e9 jail: Fix locking on an early jail_set error.
I had locked allprison_lock without immediately setting PD_LIST_LOCKED.
2021-02-25 19:52:58 -08:00