Commit Graph

18718 Commits

Author SHA1 Message Date
Mark Johnston
a8aa6f1f78 socket: Avoid clearing SS_ISCONNECTING if soconnect() fails
This behaviour appears to date from the 4.4 BSD import.  It has two
problems:

1. The update to so_state is not protected by the socket lock, so
   concurrent updates to so_state may be lost.
2. Suppose two threads race to call connect(2) on a socket, and one
   succeeds while the other fails.  Then the failing thread may
   incorrectly clear SS_ISCONNECTING, confusing the state machine.

Simply remove the update.  It does not appear to be necessary:
pru_connect implementations which call soisconnecting() only do so after
all failure modes have been handled.  For instance, tcp_connect() and
tcp6_connect() will never return an error after calling soisconnected().
However, we cannot correctly assert that SS_ISCONNECTED is not set after
an error from soconnect() since the socket lock is not held across the
pru_connect call, so a concurrent connect(2) may have set the flag.

MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31699
2021-09-07 17:12:09 -04:00
Mark Johnston
523d58aad1 socket: Remove unneeded SOLISTENING checks
Now that SOCK_IO_*_LOCK() checks for listening sockets, we can eliminate
some racy SOLISTENING() checks.  No functional change intended.

Reviewed by:	tuexen
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31660
2021-09-07 17:12:09 -04:00
Mark Johnston
bd4a39cc93 socket: Properly interlock when transitioning to a listening socket
Currently, most protocols implement pru_listen with something like the
following:

	SOCK_LOCK(so);
	error = solisten_proto_check(so);
	if (error) {
		SOCK_UNLOCK(so);
		return (error);
	}
	solisten_proto(so);
	SOCK_UNLOCK(so);

solisten_proto_check() fails if the socket is connected or connecting.
However, the socket lock is not used during I/O, so this pattern is
racy.

The change modifies solisten_proto_check() to additionally acquire
socket buffer locks, and the calling thread holds them until
solisten_proto() or solisten_proto_abort() is called.  Now that the
socket buffer locks are preserved across a listen(2), this change allows
socket I/O paths to properly interlock with listen(2).

This fixes a large number of syzbot reports, only one is listed below
and the rest will be dup'ed to it.

Reported by:	syzbot+9fece8a63c0e27273821@syzkaller.appspotmail.com
Reviewed by:	tuexen, gallatin
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31659
2021-09-07 17:11:43 -04:00
Mark Johnston
f94acf52a4 socket: Rename sb(un)lock() and interlock with listen(2)
In preparation for moving sockbuf locks into the containing socket,
provide alternative macros for the sockbuf I/O locks:
SOCK_IO_SEND_(UN)LOCK() and SOCK_IO_RECV_(UN)LOCK().  These operate on a
socket rather than a socket buffer.  Note that these locks are used only
to prevent concurrent readers and writters from interleaving I/O.

When locking for I/O, return an error if the socket is a listening
socket.  Currently the check is racy since the sockbuf sx locks are
destroyed during the transition to a listening socket, but that will no
longer be true after some follow-up changes.

Modify a few places to check for errors from
sblock()/SOCK_IO_(SEND|RECV)_LOCK() where they were not before.  In
particular, add checks to sendfile() and sorflush().

Reviewed by:	tuexen, gallatin
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31657
2021-09-07 15:06:48 -04:00
Warner Losh
ecfbb2e302 genoffset.sh: Use 10 X's instead of 5 for pick mkdtemp implementations
Linux fails to build now because the mkdtemp in the bootstrapped
environment wants 6 or more X's. Use 10 out of an abundance of caution.

Sponsored by:		Netflix
Reviewed by:		arichards
Differential Revision:	https://reviews.freebsd.org/D31863
2021-09-07 10:08:51 -06:00
Konstantin Belousov
98168a6e6c kqueue: drain kqueue taskqueue if syscall tickled it
Otherwise return from the syscall and next syscall, which could be
kevent(2) on the kqueue that should be notified, races with the kqueue
taskqueue thread, and potentially misses the wakeup.  This is reliably
visible when kevent(2) only peeks into events using zeroed timeout.

PR:	258310
Reported by:	arichardson, Jan Kokemüller <jan.kokemueller@gmail.com>
Reviewed by:	arichardson, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D31858
2021-09-07 02:43:34 +03:00
Colin Percival
bd11e253a9 Add _sleep to TSLOG
Most of the nvme initialization time in my tests is being spent here
(via pause_sbt).
2021-09-05 12:50:15 -07:00
Colin Percival
7347dfce01 Add run_interrupt_driven_config_hooks to TSLOG
The 'intr_config_hooks' SYSINIT is now taking a nontrivial amount of
time in my profiling; run_interrupt_driven_config_hooks is responsible
for most of it, so this adds useful information to the resulting
flamecharts.
2021-09-05 12:45:29 -07:00
Alexander Motin
a264594d4f Unify console output.
Without this change when virtual console enabled depending on buffer
presence and state different parts of output go to different consoles.

MFC after:	1 month
2021-09-03 23:13:42 -04:00
Alexander Motin
bd6085c6ae Re-implement virtual console (constty).
Protect conscallout with tty lock instead of Giant.  In addition to
Giant removal it also closes race on console unset.

Introduce additional lock to protect against concurrent console sets.

Remove consbuf free on console unset as unsafe, making impossible to
change buffer size after first allocation.  Instead increase default
buffer size from 8KB to 64KB and processing rate from 5Hz to 10-15Hz
to make the output more smooth.

MFC after:	1 month
2021-09-03 22:18:51 -04:00
Alexander Motin
4730a8972b callout(9): Allow spin locks use with callout_init_mtx().
Implement lock_spin()/unlock_spin() lock class methods, moving the
assertion to _sleep() instead.  Change assertions in callout(9) to
allow spin locks for both regular and C_DIRECT_EXEC cases. In case of
C_DIRECT_EXEC callouts spin locks are the only locks allowed actually.

As the first use case allow taskqueue_enqueue_timeout() use on fast
task queues.  It actually becomes more efficient due to avoided extra
context switches in callout(9) thanks to C_DIRECT_EXEC.

MFC after:	2 weeks
Reviewed by:	hselasky
Differential Revision:	https://reviews.freebsd.org/D31778
2021-09-02 21:16:46 -04:00
Konstantin Belousov
5cc82c563e cluster_write(): do not access buffer after it is released
The issue was reported by
Alexander Lochmann <alexander.lochmann@tu-dortmund.de>,
who found the problem by performing lock analysis using LockDoc,
see https://doi.org/10.1145/3302424.3303948.

Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D31780
2021-09-02 21:36:33 +03:00
Mateusz Guzik
6352bbf7be vmem: disable debug.vmem_check by default
It has a prohibitive performance impact when running real workloads.

Note this only affects kernels with DIAGNOSTIC.

Reviewed by:	markj
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D31784
2021-09-02 18:28:45 +00:00
Brooks Davis
6bc90e8acf syscalls.master: correct formatting issues
Reviewed by:	kevans, emaste
MFC after:	1 week
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D31351
2021-09-01 21:58:22 +01:00
Brooks Davis
df501bac69 syscalls.master: switch to CAPENABLED flags
Switch the main syscall table to use CAPENABLED flags rather than
capabilities.conf.  This avoid synchronization issues between
syscalls.master and capabilities.conf (e.g. when renaming a syscall
during development).

For now, move capabilities.conf to sys/compat/freebsd32 and use it
there.  Use of sys/compat/freebsd32/syscalls.master should be replaced
by makesyscalls.lua enhancements to allow the main one to be used.

This change results in no changes to generated files after running
`make sysent`.

Reviewed by:	kevans, emaste
MFC after:	1 week
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D31350
2021-09-01 21:58:16 +01:00
Brooks Davis
6945df3fff makesyscalls.lua: add a CAPENABLED flag
The CAPENABLED flag indicates that the syscall can be used in capsicum
capability mode.  It is intended to replace capabilities.conf.

Reviewed by:	kevans, emaste
MFC after:	1 week
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D31349
2021-09-01 21:58:06 +01:00
Mark Johnston
c511383de7 kevent: Fix races between timer detach and kqtimer_proc_continue()
- When detaching a knote, we need to double check the enqueued flag
  after acquiring the process lock, as kqtimer_proc_continue() may have
  toggled it.
- kqtimer_proc_continue() could in principle reschedule a stopped
  callout after filt_timerdetach() drains the callout.  So, we need to
  re-check.

Reported by:	syzbot+4a4cebb3ec07892cb040@syzkaller.appspotmail.com
Reported by:	syzbot+a9c04bc76078a3b7dd8d@syzkaller.appspotmail.com
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31772
2021-09-01 14:18:58 -04:00
Ka Ho Ng
92bb74fd4f vfs: Use file_cred for VOP_DEALLOCATE in vn_deallocate if non-NULL
This changes vn_deallocate() to match the behavior of vn_rdwr() when
picking which ucred to use. That is, vn_deallocate() uses file_cred for
making VOP call if it is non-NULL, or use active_cred otherwise.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D31712
2021-09-01 20:19:08 +08:00
Alexander Motin
706b1a5724 Align taskqueue_enqueue_timeout() to hardclock.
It is done for all other KPIs using HZ, but was missed here.

MFC after:	2 weeks
2021-08-31 23:50:35 -04:00
Mark Johnston
3138392a46 itimer: Serialize access to the p_itimers array
Fix the following race between itimer_proc_continue() and process exit.

itimer_proc_continue() may be called via realitexpire(), the real
interval timer.  Note that exit1() drains this timer _after_ draining
and freeing itimers.  Moreover, itimers_exit() is called without the
process lock held; it only acquires the proc lock when deleting
individual itimers, so once they are drained we free p->p_itimers
without any synchronization.  Thus, itimer_proc_continue() may load a
non-NULL p->p_itimers array and iterate over it after it has been freed.

Fix the problem by using the process lock when clearing p->p_itimers, to
synchronize with itimer_proc_continue().  Formally, accesses to this
field should be protected by the process lock anyway, and since the
array is allocated lazily this will not incur any overhead in the common
case.

Reported by:	syzbot+c40aa8bf54fe333fc50b@syzkaller.appspotmail.com
Reported by:	syzbot+929be2f32503bbc3844f@syzkaller.appspotmail.com
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31759
2021-08-31 16:38:05 -04:00
John Baldwin
470e851c4b ktls: Support asynchronous dispatch of AEAD ciphers.
KTLS OCF support was originally targeted at software backends that
used host CPU cycles to encrypt TLS records.  As a result, each KTLS
worker thread queued a single TLS record at a time and waited for it
to be encrypted before processing another TLS record.  This works well
for software backends but limits throughput on OCF drivers for
coprocessors that support asynchronous operation such as qat(4) or
ccr(4).  This change uses an alternate function (ktls_encrypt_async)
when encrypt TLS records via a coprocessor.  This function queues TLS
records for encryption and returns.  It defers the work done after a
TLS record has been encrypted (such as marking the mbufs ready) to a
callback invoked asynchronously by the coprocessor driver when a
record has been encrypted.

- Add a struct ktls_ocf_state that holds the per-request state stored
  on the stack for synchronous requests.  Asynchronous requests malloc
  this structure while synchronous requests continue to allocate this
  structure on the stack.

- Add a ktls_encrypt_async() variant of ktls_encrypt() which does not
  perform request completion after dispatching a request to OCF.
  Instead, the ktls_ocf backends invoke ktls_encrypt_cb() when a TLS
  record request completes for an asynchronous request.

- Flag AEAD software TLS sessions as async if the backend driver
  selected by OCF is an async driver.

- Pull code to create and dispatch an OCF request out of
  ktls_encrypt() into a new ktls_encrypt_one() function used by both
  ktls_encrypt() and ktls_encrypt_async().

- Pull code to "finish" the VM page shuffling for a file-backed TLS
  record into a helper function ktls_finish_noanon() used by both
  ktls_encrypt() and ktls_encrypt_cb().

Reviewed by:	markj
Tested on:	ccr(4) (jhb), qat(4) (markj)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D31665
2021-08-30 13:11:52 -07:00
Andrew Turner
b792434150 Create sys/reg.h for the common code previously in machine/reg.h
Move the common kernel function signatures from machine/reg.h to a new
sys/reg.h. This is in preperation for adding PT_GETREGSET to ptrace(2).

Reviewed by:	imp, markj
Sponsored by:	DARPA, AFRL (original work)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D19830
2021-08-30 12:50:53 +01:00
Ka Ho Ng
a58e222b3b vfs: yield in vn_deallocate_impl() loop
Yield at the end of each loop iteration if there are remaining works as
indicated by the value of *len updated by VOP_DEALLOCATE. Without this,
when calling vop_stddeallocate to zero a large region, the
implementation only zerofills a relatively small chunk and returns.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D31705
2021-08-29 16:26:00 +08:00
Mark Johnston
7326e8589c fsetown: Avoid process group lock recursion
Restore the pre-1d874ba4f8ba behaviour of disassociating the current
SIGIO recipient before looking up the specified process or process
group.  This avoids a lock recursion in the scenario where a process
group is configured to receive SIGIO for an fd when it has already been
so configured.

Reported by:	pho
Tested by:	pho
Reviewed by:	kib
MFC after:	3 days
2021-08-28 15:50:44 -04:00
Rick Macklem
da779f262c vfs_default: Change vop_stddeallocate() from static to global
A future commit to the NFS client uses vop_stddeallocate() for
cases where the NFS server does not support a Deallocate operation.
Change vop_stddeallocate() from static to global so that it can
be called by the NFS client.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D31640
2021-08-27 18:25:44 -07:00
Konstantin Belousov
f19063ab02 vfs_hash_rehash(): require the vnode to be exclusively locked
Rehash updates v_hash.  Also, rehash moves the vnode to different hash
bucket, which should be noticed in vfs_hash_get() after sleeping for
the vnode lock.

Reviewed by:	mckusick, rmacklem
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D31464
2021-08-27 18:39:45 +03:00
Konstantin Belousov
7c1e4aab79 vfs_hash_insert: ensure that predicate is true
After vnode lock, recheck v_hash. When vfs_hash_insert() is used with
a predicate, recheck it after the selected vnode is locked. Since
vfs_hash_lock is dropped, vnode could be rehashed during the sleep for
the vnode lock, which could go unnoticed there.

Reported and tested by:	pho
Reviewed by:	mckusick, rmacklem
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D31464
2021-08-27 18:39:45 +03:00
Mark Johnston
091869def9 connect: Use soconnectat() unconditionally in kern_connect()
soconnect(...) is equivalent to soconnectat(AT_FDCWD, ...), so rely on
this to save a branch.  No functional change intended.

MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2021-08-27 08:32:07 -04:00
Mateusz Guzik
f1e2cc1c66 vfs: drop dedicated sysinit for mountlist_mtx
Sponsored by:	Rubicon Communications, LLC ("Netgate")
2021-08-26 20:52:03 +02:00
Mateusz Guzik
0d28d014c8 vfs: refactor kern_unmount
Split unmounting by path and id in preparation for other changes.

Sponsored by:	Rubicon Communications, LLC ("Netgate")
2021-08-26 13:58:28 +02:00
Mateusz Guzik
7b2561b46b vfs: stop open-coding vfs_getvfs in kern_unmount
Sponsored by:	Rubicon Communications, LLC ("Netgate")
2021-08-26 11:38:31 +00:00
Mark Johnston
a507a40f3b fsetown: Simplify error handling
No functional change intended.

Suggested by:	kib
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31671
2021-08-25 16:20:07 -04:00
Mark Johnston
1d874ba4f8 fsetown: Fix process lookup bugs
- pget()/pfind() will acquire the PID hash bucket locks, which are
  sleepable sx locks, but this means that the sigio mutex cannot be held
  while calling these functions.  Instead, use pget() to hold the
  process, after which we lock the sigio and proc locks, respectively.
- funsetownlst() assumes that processes cannot be registered for SIGIO
  once they have P_WEXIT set.  However, pfind() will happily return
  exiting processes, breaking the invariant.  Add an explicit check for
  P_WEXIT in fsetown() to fix this. [1]

Fixes:	f52979098d ("Fix a pair of races in SIGIO registration")
Reported by:	syzkaller [1]
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31661
2021-08-25 16:18:10 -04:00
Ka Ho Ng
9e202d036d fspacectl(2): Changes on rmsr.r_offset's minimum value returned
rmsr.r_offset now is set to rqsr.r_offset plus the number of bytes
zeroed before hitting the end-of-file. After this change rmsr.r_offset
no longer contains the EOF when the requested operation range is
completely beyond the end-of-file. Instead in such case rmsr.r_offset is
equal to rqsr.r_offset.  Callers can obtain the number of bytes zeroed
by subtracting rqsr.r_offset from rmsr.r_offset.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D31677
2021-08-26 00:03:37 +08:00
Ka Ho Ng
5c1428d2c4 uipc_shm: Handle offset on shm_size as if it is beyond shm_size
This avoids any unnecessary works in such case.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	markj, kib
Differential Revision:	https://reviews.freebsd.org/D31655
2021-08-24 23:49:18 +08:00
Ka Ho Ng
1eaa36523c fspacectl(2): Clarifies the return values
rmacklem@ spotted two things in the system call:
- Upon returning from a successful operation, vop_stddeallocate can
  update rmsr.r_offset to a value greater than file size. This behavior,
  although being harmless, can be confusing.
- The EINVAL return value for rqsr.r_offset + rqsr.r_len > OFF_MAX is
  undocumented.

This commit has the following changes:
- vop_stddeallocate and shm_deallocate to bound the the affected area
  further by the file size.
- The EINVAL case for rqsr.r_offset + rqsr.r_len > OFF_MAX is
  documented.
- The fspacectl(2), vn_deallocate(9) and VOP_DEALLOCATE(9)'s return
  len is explicitly documented the be the value 0, and the return offset
  is restricted to be the smallest of off + len and current file size
  suggested by kib@. This semantic allows callers to interact better
  with potential file size growth after the call.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	imp, kib
Differential Revision:	https://reviews.freebsd.org/D31604
2021-08-24 17:08:28 +08:00
Mateusz Guzik
b65ad70195 cache: retire cache_fast_revlookup sysctl
Sponsored by:	Rubicon Communications, LLC ("Netgate")
2021-08-23 15:31:44 +02:00
Mateusz Guzik
7fd856ba07 vfs: s/__unused/__diagused in crossmp_*
Sponsored by:	Rubicon Communications, LLC ("Netgate")
2021-08-23 15:23:42 +02:00
Mateusz Guzik
614faa3269 vfs: fix cache-relatecd LOR introduced in the previous change
Reported by:	kib
Sponsored by:	Rubicon Communications, LLC ("Netgate")
2021-08-22 16:20:07 +00:00
Thomas Munro
f30a1ae8d5 lio_listio(2): Allow LIO_READV and LIO_WRITEV.
Allow multiple vector IOs to be started with one system call.
aio_readv() and aio_writev() already used these opcodes under the
covers.  This commit makes them available to user space.

Being non-standard extensions, they're only visible if __BSD_VISIBLE is
defined, like the functions.

Reviewed by:    asomers, kib
MFC after:      2 weeks
Differential Revision:  https://reviews.freebsd.org/D31627
2021-08-22 23:00:42 +12:00
Jason A. Harmening
e81e71b0e9 Use interruptible wait for blocking recursive unmounts
Now that we allow recursive unmount attempts to be abandoned upon
exceeding the retry limit, we should avoid leaving an unkillable
thread when a synchronous unmount request was issued against the
base filesystem.

Reviewed by:	kib (earlier revision), mkusick
Differential Revision:  https://reviews.freebsd.org/D31450
2021-08-20 13:21:56 -07:00
Jason A. Harmening
a8c732f4e5 VFS: add retry limit and delay for failed recursive unmounts
A forcible unmount attempt may fail due to a transient condition, but
it may also fail due to some issue in the filesystem implementation
that will indefinitely prevent successful unmount.  In such a case,
the retry logic in the recursive unmount facility will cause the
deferred unmount taskqueue to execute constantly.

Avoid this scenario by imposing a retry limit, with a default value
of 10, beyond which the recursive unmount facility will emit a log
message and give up.  Additionally, introduce a grace period, with
a default value of 1s, between successive unmount retries on the
same mount.

Create a new sysctl node, vfs.deferred_unmount, to export the total
number of failed recursive unmount attempts since boot, and to allow
the retry limit and retry grace period to be tuned.

Reviewed by:	kib (earlier revision), mkusick
Differential Revision:  https://reviews.freebsd.org/D31450
2021-08-20 13:20:50 -07:00
Mateusz Guzik
5d75ffdd0c vfs: remove an unused variable from nameicap_tracker_add
Reported by cc --analyze

Sponsored by:	Rubicon Communications, LLC ("Netgate")
2021-08-20 17:52:24 +00:00
Mateusz Guzik
dbc689cdef vfs: use vn_lock_pair to avoid establishing an ordering on mount
This fixes some of the LORs seen on mount/unmount.

Complete fix will require taking care of unmount as well.

Reviewed by:	kib
Tested by:	pho (previous version)
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D31611
2021-08-20 17:52:24 +00:00
Kyle Evans
d7e1bdfeba uipc: avoid circular pr_{slow,fast}timos
domain_init() gets reinvoked for each vnet on a system, so we must not
alter global state.  Practically speaking, we were creating circular
lists and tying up a softclock thread into an infinite loop.

The breakage here was most easily observed by simply creating a jail
in a new vnet and watching the system suddenly become erratic.

Reported by:	markj
Fixes:	e0a17c3f06 ("uipc: create dedicated lists for fast ...")
Pointy hat:	kevans
2021-08-18 12:46:54 -05:00
Kristof Provost
07edc89c39 witness: remove ifnet_rw
This lock no longer exists. It was removed in
a60100fdfc (if: Remove ifnet_rwlock, 2020-11-25)

Reviewed by:		mjg
Pointed out by:		Dheeraj Kandula <dheerajk@netapp.com>
Different Revision:	https://reviews.freebsd.org/D31585
2021-08-18 08:51:26 +02:00
Kristof Provost
a051ca72e2 Introduce m_get3()
Introduce m_get3() which is similar to m_get2(), but can allocate up to
MJUM16BYTES bytes (m_get2() can only allocate up to MJUMPAGESIZE).

This simplifies the bpf improvement in f13da24715.

Suggested by:	glebius
Differential Revision:	https://reviews.freebsd.org/D31455
2021-08-18 08:48:27 +02:00
Mateusz Guzik
e0a17c3f06 uipc: create dedicated lists for fast and slow timeout callbacks
This avoids having to walk all possible protocols only to check if they
have one (vast majority does not).

Original patch by kevans@.

Reviewed by:	kevans
Sponsored by:	Rubicon Communications, LLC ("Netgate")
2021-08-17 21:56:05 +02:00
Mark Johnston
c4feb1ab0a sigtimedwait: Use a unique wait channel for sleeping
When a sigtimedwait(2) caller goes to sleep, it uses a wait channel of
p->p_sigacts with the proc lock as the interlock.  However, p_sigacts
can be shared between processes if a child is created with
rfork(RFSIGSHARE | RFPROC).  Thus we can end up with two threads
sleeping on the same wait channel using different locks, which is not
permitted.

Fix the problem simply by using a process-unique wait channel, following
the example of sigsuspend.  The actual wait channel value is irrelevant
here, sleeping threads are awoken using sleepq_abort().

Reported by:	syzbot+8c417afabadb50bb8827@syzkaller.appspotmail.com
Reported by:	syzbot+1d89fc2a9ef92ef64fa8@syzkaller.appspotmail.com
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31563
2021-08-16 15:11:15 -04:00
John Baldwin
d16cb228c1 ktls: Fix accounting for TLS 1.0 empty fragments.
TLS 1.0 empty fragment mbufs have no payload and thus m_epg_npgs is
zero.  However, these mbufs need to occupy a "unit" of space for the
purposes of M_NOTREADY tracking similar to regular mbufs.  Previously
this was done for the page count returned from ktls_frame() and passed
to ktls_enqueue() as well as the page count passed to pru_ready().

However, sbready() and mb_free_notready() only use m_epg_nrdy to
determine the number of "units" of space in an M_EXT mbuf, so when a
TLS 1.0 fragment was marked ready it would mark one unit of the next
mbuf in the socket buffer as ready as well.  To fix, set m_epg_nrdy to
1 for empty fragments.  This actually simplifies the code as now only
ktls_frame() has to handle TLS 1.0 fragments explicitly and the rest
of the KTLS functions can just use m_epg_nrdy.

Reviewed by:	gallatin
MFC after:	2 weeks
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D31536
2021-08-16 10:42:46 -07:00
Konstantin Belousov
81b895a95b pipe_paircreate(): do not leak pipepair memory on error
Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2021-08-16 17:08:44 +03:00
Kyle Evans
29e400e994 domain: make it safer to add domains post-domainfinalize
I can see two concerns for adding domains after domainfinalize:

1.) The slow/fast callouts have already been setup.
2.) Userland could create a socket while we're in the middle of
  initialization.

We can address #1 fairly easily by tracking whether the domain's been
initialized for at least the default vnet. There are still some concerns
about the callbacks being invoked while a vnet is in the process of
being created/destroyed, but this is a pre-existing issue that the
callbacks must coordinate anyways.

We should also address #2, but technically this has been an issue
anyways because we don't assert on post-domainfinalize additions; we
don't seem to hit it in practice.

Future work can fix that up to make sure we don't find partially
constructed domains, but care must be taken to make sure that at least,
e.g., the usages of pffindproto in ip_input.c can still find them.

Differential Revision:	https://reviews.freebsd.org/D25459
2021-08-16 00:59:56 -05:00
Kyle Evans
239aebee61 domain: give domains a chance to probe for availability
This gives any given domain a chance to indicate that it's not actually
supported on the current system. If dom_probe isn't supplied, we assume
the domain is universally applicable as most of them are. Keeping
fully-initialized and registered domains around that physically can't
work on a large majority of FreeBSD deployments is sub-optimal and leads
to errors that aren't consistent with the reality of why the socket
can't be created (e.g. ESOCKTNOSUPPORT) because such scenario has to be
caught upon pru_attach, at which point kicking back the more-appropriate
EAFNOSUPPORT would seem weird.

The initial consumer of this will be hvsock, which is only available on
HyperV guests.

Reviewed by:	cem (earlier version), bcr (manpages)
Differential Revision:	https://reviews.freebsd.org/D25062
2021-08-16 00:59:56 -05:00
Konstantin Belousov
9446d9e88f fstatat(2): handle non-vnode file descriptors for AT_EMPTY_PATH
Set NIRES_EMPTYPATH earlies, to have use of EMPTYPATH recorded even if
we are going to return error.  When namei_setup() refused to accept dirfd,
which is not of the vnode type, and indicated by ENOTDIR error return,
fall back to kern_fstat(dirfd).

Reported by:	dchagin
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D31530
2021-08-14 00:17:18 +03:00
Ka Ho Ng
454bc887f2 uipc_shm: Implements fspacectl(2) support
This implements fspacectl(2) support on shared memory objects. The
semantic of SPACECTL_DEALLOC is equivalent to clearing the backing
store and free the pages within the affected range. If the call
succeeds, subsequent reads on the affected range return all zero.

tests/sys/posixshm/posixshm_tests.c is expanded to include a
fspacectl(2) functional test.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	kevans, kib
Differential Revision:	https://reviews.freebsd.org/D31490
2021-08-12 23:04:18 +08:00
Ka Ho Ng
a638dc4ebc vfs: Add ioflag to VOP_DEALLOCATE(9)
The addition of ioflag allows callers passing
IO_SYNC/IO_DATASYNC/IO_DIRECT down to the file system implementation.
The vop_stddeallocate fallback implementation is updated to pass the
ioflag to the file system implementation. vn_deallocate(9) internally is
also changed to pass ioflag to the VOP_DEALLOCATE call.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D31500
2021-08-12 23:03:49 +08:00
Ka Ho Ng
c15384f896 vfs: Add get_write_ioflag helper to calculate ioflag
Converted vn_write to use this helper.

Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D31513
2021-08-12 17:35:34 +08:00
Dmitry Chagin
71854d9b2b fork: Remove the unnecessary spaces.
MFC after:		2 weeks
2021-08-12 11:58:17 +03:00
Dmitry Chagin
de8374df28 fork: Allow ABI to specify fork return values for child.
At least Linux x86 ABI's does not use carry bit and expects that the dx register
is preserved. For this add a new sv_set_fork_retval hook and call it from cpu_fork().

Add a short comment about touching dx in x86_set_fork_retval(), for more details
see phab comments from kib@ and imp@.

Reviewed by:		kib
Differential revision:	https://reviews.freebsd.org/D31472
MFC after:		2 weeks
2021-08-12 11:45:25 +03:00
Eric van Gyzen
13a58148de netdump: send key before dump, in case dump fails
Previously, if an encrypted netdump failed, such as due to a timeout or
network failure, the key was not saved, so a partial dump was
completely useless.

Send the key first, so the partial dump can be decrypted, because even a
partial dump can be useful.

Reviewed by:	bdrewery, markj
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D31453
2021-08-11 10:54:56 -05:00
Mark Johnston
10a8e93da1 kmsan: Export kmsan_mark_mbuf() and kmsan_mark_bio()
Sponsored by:	The FreeBSD Foundation
2021-08-11 16:33:41 -04:00
Andrew Gallatin
95c51fafa4 ktls: Init reset tag task for cloned sessions
When cloning a ktls session (which is needed when we need to
switch output NICs for a NIC TLS session), we need to also
init the reset task, like we do when creating a new tls session.

Reviewed by: jhb
Sponsored by: Netflix
2021-08-11 14:06:43 -04:00
Mitchell Horne
4ccaa87f69 kdb: Handle process enumeration before procinit()
Make kdb_thr_first() and kdb_thr_next() return sane values if the
allproc list and pidhashtbl haven't been initialized yet. This can
happen if the debugger is entered very early on, for example with the
'-d' boot flag.

This allows remote gdb to attach at such a time, and fixes some ddb
commands like 'show threads'.

Be explicit about the static initialization of these variables. This
part has no functional change.

Reviewed by:	markj, imp (previous version)
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D31495
2021-08-11 14:44:22 -03:00
Ka Ho Ng
4a9b832a2a vfs: Rename ioflg to ioflag in vn_deallocate
This includes a style fix around ioflag checking as well.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	kib, bcr
Differential Revision:	https://reviews.freebsd.org/D31505
2021-08-11 17:45:47 +08:00
Alexander Motin
67f508db84 Mark some sysctls as CTLFLAG_MPSAFE.
MFC after:	2 weeks
2021-08-10 22:18:26 -04:00
Mark Johnston
100949103a uma: Add KMSAN hooks
For now, just hook the allocation path: upon allocation, items are
marked as initialized (absent M_ZERO).  Some zones are exempted from
this when it would otherwise raise false positives.

Use kmsan_orig() to update the origin map for UMA and malloc(9)
allocations.  This allows KMSAN to print the return address when an
uninitialized UMA item is implicated in a report.  For example:
  panic: MSan: Uninitialized UMA memory from m_getm2+0x7fe

Sponsored by:	The FreeBSD Foundation
2021-08-10 21:27:54 -04:00
Mark Johnston
693c9516fa busdma: Add KMSAN integration
Sanitizer instrumentation of course cannot automatically update shadow
state when devices write to host memory.  KMSAN thus hooks into busdma,
both to update shadow state after a device write, and to verify that the
kernel does not publish uninitalized bytes to devices.

To implement this, when KMSAN is configured, each dmamap embeds a memory
descriptor describing the region currently loaded into the map.
bus_dmamap_sync() uses the operation flags to determine whether to
validate the loaded region or to mark it as initialized in the shadow
map.

Note that in cases where the amount of data written is less than the
buffer size, the entire buffer is marked initialized even when it is
not.  For example, if a NIC writes a 128B packet into a 2KB buffer, the
entire buffer will be marked initialized, but subsequent accesses past
the first 128 bytes are likely caused by bugs.

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31338
2021-08-10 21:27:54 -04:00
Mark Johnston
b0f71f1bc5 amd64: Add MD bits for KMSAN
Interrupt and exception handlers must call kmsan_intr_enter() prior to
calling any C code.  This is because the KMSAN runtime maintains some
TLS in order to track initialization state of function parameters and
return values across function calls.  Then, to ensure that this state is
kept consistent in the face of asynchronous kernel-mode excpeptions, the
runtime uses a stack of TLS blocks, and kmsan_intr_enter() and
kmsan_intr_leave() push and pop that stack, respectively.

Use these functions in amd64 interrupt and exception handlers.  Note
that handlers for user->kernel transitions need not be annotated.

Also ensure that trap frames pushed by the CPU and by handlers are
marked as initialized before they are used.

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31467
2021-08-10 21:27:53 -04:00
Mark Johnston
8978608832 amd64: Populate the KMSAN shadow maps and integrate with the VM
- During boot, allocate PDP pages for the shadow maps.  The region above
  KERNBASE is currently not shadowed.
- Create a dummy shadow for the vm page array.  For now, this array is
  not protected by the shadow map to help reduce kernel memory usage.
- Grow shadows when growing the kernel map.
- Increase the default kernel stack size when KMSAN is enabled.  As with
  KASAN, sanitizer instrumentation appears to create stack frames large
  enough that the default value is not sufficient.
- Disable UMA's use of the direct map when KMSAN is configured.  KMSAN
  cannot validate the direct map.
- Disable unmapped I/O when KMSAN configured.
- Lower the limit on paging buffers when KMSAN is configured.  Each
  buffer has a static MAXPHYS-sized allocation of KVA, which in turn
  eats 2*MAXPHYS of space in the shadow map.

Reviewed by:	alc, kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31295
2021-08-10 21:27:53 -04:00
Mark Johnston
5dda15adbc kern: Ensure that thread-local KMSAN state is available
Sponsored by:	The FreeBSD Foundation
2021-08-10 21:27:53 -04:00
Mark Johnston
a422084abb Add the KMSAN runtime
KMSAN enables the use of LLVM's MemorySanitizer in the kernel.  This
enables precise detection of uses of uninitialized memory.  As with
KASAN, this feature has substantial runtime overhead and is intended to
be used as part of some automated testing regime.

The runtime maintains a pair of shadow maps.  One is used to track the
state of memory in the kernel map at bit-granularity: a bit in the
kernel map is initialized when the corresponding shadow bit is clear,
and is uninitialized otherwise.  The second shadow map stores
information about the origin of uninitialized regions of the kernel map,
simplifying debugging.

KMSAN relies on being able to intercept certain functions which cannot
be instrumented by the compiler.  KMSAN thus implements interceptors
which manually update shadow state and in some cases explicitly check
for uninitialized bytes.  For instance, all calls to copyout() are
subject to such checks.

The runtime exports several functions which can be used to verify the
shadow map for a given buffer.  Helpers provide the same functionality
for a few structures commonly used for I/O, such as CAM CCBs, BIOs and
mbufs.  These are handy when debugging a KMSAN report whose
proximate and root causes are far away from each other.

Obtained from:	NetBSD
Sponsored by:	The FreeBSD Foundation
2021-08-10 21:27:53 -04:00
Mark Johnston
eca9ac5a32 vfs: Avoid a comparison with an uninitialized field in setutimes()
Some filesystems, e.g., devfs, do not populate va_birthtime in their
GETATTR implementations.  To handle this, make sure that va_birthtime is
initialized to the quasi-standard value of { VNOVAL, 0 } before calling
VOP_GETATTR.

Reported by:	KMSAN
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31468
2021-08-09 13:27:20 -04:00
Alexander Motin
696fca3fd4 Optimize res_find().
When the device name is provided, we can simply run strncmp() for each
line to quickly skip unrelated ones, that is much faster than sscanf()
and only then strcmp().

MFC after:	2 weeks
2021-08-08 21:54:49 -04:00
Ed Maste
9feff969a0 Remove "All Rights Reserved" from FreeBSD Foundation sys/ copyrights
These ones were unambiguous cases where the Foundation was the only
listed copyright holder (in the associated license block).

Sponsored by:	The FreeBSD Foundation
2021-08-08 10:42:24 -04:00
Mateusz Guzik
b30e7cb7fa cache: add OPENREAD and OPENWRITE to fast path lookup 2021-08-07 13:02:38 +02:00
Rick Macklem
c18c74a87c namei: Add cn_flags bits for OPENREAD and OPENWRITE
VOP_LOOKUP() is called with cn_flags bits ISLASTCN and ISOPEN
to indicate that the lookup is for the last component of a pathname
when doing open.

If the cn_flags also indicates if the open is for Reading, Writing or Both,
the NFSv4 client can do an NFSv4 Open operation in the same compound
RPC as Lookup, often avoiding the additional Open RPC now done when
VOP_OPEN() is called.

This patch defines two new cn_flags bits called OPENREAD and OPENWRITE
and sets these in open2nameif() based on FREAD, FWRITE flag bits.
This will allow a subsequent patch to the NFSv4 client to do the Open
operation in the same RPC as Lookup.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D31431
2021-08-06 18:41:11 -07:00
Andrew Gallatin
09066b9866 ktls: Use the new PNOLOCK flag
Use the new PNOLOCK flag to tsleep() to indicate that
we are managing potential races, and don't need to
sleep with a lock, or have a backstop timeout.

Reviewed by: jhb
Sponsored by: Netflix
2021-08-05 17:19:12 -04:00
Andrew Gallatin
1b97a054f3 tsleep: Add a PNOLOCK flag
Add a PNOLOCK flag so that, in the race circumstance where
wakeup races are externally mitigated, tsleep() can be
called with a sleep time of 0 without triggering an
an assertion.

Reviewed by: jhb
Sponsored by: Netflix
2021-08-05 17:16:30 -04:00
Andrew Gallatin
2694c869ff ktls: fix a panic with INVARIANTS
98215005b7 introduced a new
thread that uses tsleep(..0) to sleep forever.  This hit
an assert due to sleeping with a 0 timeout.

So spell "forever" using SBT_MAX instead, which does not
trigger the assert.

Pointy hat to: gallatin
Pointed out by: emaste
Sponsored by: Netflix
2021-08-05 13:09:06 -04:00
Ka Ho Ng
da9fe3529b Regen after 0dc332bff2 2021-08-05 23:22:02 +08:00
Ka Ho Ng
0dc332bff2 Add fspacectl(2), vn_deallocate(9) and VOP_DEALLOCATE(9).
fspacectl(2) is a system call to provide space management support to
userspace applications. VOP_DEALLOCATE(9) is a VOP call to perform the
deallocation. vn_deallocate(9) is a public KPI for kmods' use.

The purpose of proposing a new system call, a KPI and a VOP call is to
allow bhyve or other hypervisor monitors to emulate the behavior of SCSI
UNMAP/NVMe DEALLOCATE on a plain file.

fspacectl(2) comprises of cmd and flags parameters to specify the
space management operation to be performed. Currently cmd has to be
SPACECTL_DEALLOC, and flags has to be 0.

fo_fspacectl is added to fileops.
VOP_DEALLOCATE(9) is added as a new VOP call. A trivial implementation
of VOP_DEALLOCATE(9) is provided.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D28347
2021-08-05 23:20:42 +08:00
Ka Ho Ng
abbb57d5a6 vfs: Introduce vn_bmap_seekhole_locked()
vn_bmap_seekhole_locked() is factored out version of vn_bmap_seekhole().
This variant requires shared vnode lock being held around the call.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	kib, markj
Differential Revision:	https://reviews.freebsd.org/D31404
2021-08-05 22:52:26 +08:00
Andrew Gallatin
98215005b7 ktls: start a thread to keep the 16k ktls buffer zone populated
Ktls recently received an optimization where we allocate 16k
physically contiguous crypto destination buffers. This provides a
large (more than 5%) reduction in CPU use in our
workload. However, after several days of uptime, the performance
benefit disappears because we have frequent allocation failures
from the ktls buffer zone.

It turns out that when load drops off, the ktls buffer zone is
trimmed, and some 16k buffers are freed back to the OS. When load
picks back up again, re-allocating those 16k buffers fails after
some number of days of uptime because physical memory has become
fragmented. This causes allocations to fail, because they are
intentionally done without M_NORECLAIM, so as to avoid pausing
the ktls crytpo work thread while the VM system defragments
memory.

To work around this, this change starts one thread per VM domain
to allocate ktls buffers with M_NORECLAIM, as we don't care if
this thread is paused while memory is defragged. The thread then
frees the buffers back into the ktls buffer zone, thus allowing
future allocations to succeed.

Note that waking up the thread is intentionally racy, but neither
of the races really matter. In the worst case, we could have
either spurious wakeups or we could have to wait 1 second until
the next rate-limited allocation failure to wake up the thread.

This patch has been in use at Netflix on a handful of servers,
and seems to fix the issue.

Differential Revision: https://reviews.freebsd.org/D31260
Reviewed by: jhb, markj,  (jtl, rrs, and dhw reviewed earlier version)
Sponsored by: Netflix
2021-08-05 10:19:12 -04:00
John Baldwin
c51e4962a3 Document kern.log_wakeups_per_second.
PR:		148680
MFC after:	2 weeks
2021-08-04 11:50:34 -07:00
Konstantin Belousov
0ef5eee9d9 Add vn_lktype_write()
and remove repetetive code that calculates vnode locking type for write.

Reviewed by:	khng, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D31405
2021-08-04 19:40:13 +03:00
Kyle Evans
04cc0c393c malloc(9): provide missing malloc_aligned implementation
Pointy hat:	kevans
Fixes:	6162cf885c ("malloc(9): Document/complete aligned variants")
2021-08-02 21:12:39 -05:00
Eric van Gyzen
428624130a Fix lockstat:::thread-spin dtrace probe with LOCK_PROFILING
The spinning start time is missing from the calculation due to a
misplaced #endif.  Return the #endif where it's supposed to be.

Submitted by:	Alexander Alexeev <aalexeev@isilon.com>
Reviewed by:	bdrewery, mjg
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D31384
2021-08-02 14:44:23 -05:00
Adam Fenn
8ca384eb1d devclass_alloc_unit: move "at" hint test to after device-in-use test
Only perform this expensive operation when the unit number is a
potential candidate (i.e. not already in use), thereby reducing device
scan time on systems with many devices, unit numbers, and drivers.

Sponsored by:	NetApp, Inc.
Sponsored by:	Klara, Inc.
X-NetApp-PR:	#61
Differential Revision:	https://reviews.freebsd.org/D31381
2021-08-02 11:27:17 -05:00
Alexander Motin
ca34553b6f sched_ule(4): Pre-seed sched_random().
I don't think it changes anything, but why not.

While there, make cpu_search_highest() use all 8 lower load bits for
noise, since it does not use cs_prefer and the code is not shared
with cpu_search_lowest() any more.

MFC after:	1 month
2021-08-02 10:55:28 -04:00
Alexander Motin
8bb173fb5b sched_ule(4): Use trylock when stealing load.
On some load patterns it is possible for several CPUs to try steal
thread from the same CPU despite randomization introduced.  It may
cause significant lock contention when holding one queue lock idle
thread tries to acquire another one.  Use of trylock on the remote
queue allows both reduce the contention and handle lock ordering
easier.  If we can't get lock inside tdq_trysteal() we just return,
allowing tdq_idled() handle it.  If it happens in tdq_idled(), then
we repeat search for load skipping this CPU.

On 2-socket 80-thread Xeon system I am observing dramatic reduction
of the lock spinning time when doing random uncached 4KB reads from
12 ZVOLs, while IOPS increase from 327K to 403K.

MFC after:	1 month
2021-08-01 22:42:01 -04:00
Alexander Motin
2668bb2add sched_ule(4): Reduce duplicate search for load.
When sched_highest() called for some CPU group returns nothing, idle
thread calls it for the parent CPU group.  But the parent CPU group
also includes the CPU group we've just searched, and unless there is
a race going on, it is unlikely we find anything new this time.

Avoid the double search in case of parent group having only two sub-
groups (the most prominent case). Instead of escalating to the parent
group run the next search over the sibling subgroup and escalate two
levels up after if that fail too.  In case of more than two siblings
the difference is less significant, while searching the parent group
can result in better decision if we find several candidate CPUs.

On 2-socket 40-core Xeon system I am measuring ~25% reduction of CPU
time spent inside cpu_search_highest() in both SMT (2x20x2) and non-
SMT (2x20) cases.

MFC after:	1 month
2021-08-01 22:07:51 -04:00
Mark Johnston
6f179693c5 Add interceptors for atomic operations on userspace memory
Implement them for KASAN.  KCSAN interceptors are left unimplemented for
now.

MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-07-29 21:14:36 -04:00
Mark Johnston
a90d053b84 Simplify kernel sanitizer interceptors
KASAN and KCSAN implement interceptors for various primitive operations
that are not instrumented by the compiler.  KMSAN requires them as well.
Rather than adding new cases for each sanitizer which requires
interceptors, implement the following protocol:
- When interceptor definitions are required, define
  SAN_NEEDS_INTERCEPTORS and SANITIZER_INTERCEPTOR_PREFIX.
- In headers that declare functions which need to be intercepted by a
  sanitizer runtime, use SANITIZER_INTERCEPTOR_PREFIX to provide
  declarations.
- When SAN_RUNTIME is defined, do not redefine the names of intercepted
  functions.  This is typically the case in files which implement
  sanitizer runtimes but is also needed in, for example, files which
  define ifunc selectors for intercepted operations.

MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-07-29 21:13:32 -04:00
Mark Johnston
9e575fadf4 link_elf_obj: Invoke fini callbacks
This is required for KASAN: when a module is unloaded, poisoned regions
(e.g., pad areas between global variables) are left as such, so if they
are reused as KLDs are loaded, false positives can arise.

Reported by:	pho, Jenkins
Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31339
2021-07-29 09:46:25 -04:00
Dmitry Chagin
9e32efa79b umtx: Split do_unlock_pi on two counterparts.
The umtx_pi_frop() will be used by Linux emulation layer.

Reviewed by:		kib
Differential Revision:	https://reviews.freebsd.org/D31238
MFC after:		2 weeks
2021-07-29 12:47:39 +03:00
Dmitry Chagin
09f55e6002 umtx: Expose some of the pi umtx structures and API to the rest of the kernel.
Differential Revision:	https://reviews.freebsd.org/D31237
MFC after:		2 weeks
2021-07-29 12:46:58 +03:00
Dmitry Chagin
8e4d22c01d umtx: Add umtxq_requeue Linux emulation layer extension.
Reviewed by:		kib
Differential Revision:	https://reviews.freebsd.org/D31235
MFC after:		2 weeks
2021-07-29 12:43:07 +03:00
Dmitry Chagin
7caa29115b umtx: Add bitset conditional wakeup functionality.
The bitset is a Linux emulation layer extension. This 32-bit mask, in which at
least one bit must be set, is used to select which threads should be woken up.

The bitset is stored in the umtx_q structure, which is used to enqueue the waiter
into the umtx waitqueue. Put the bitset into the hole, that appeared on LP64 due
to data alignment, to prevent the growth of the struct umtx_q.

Reviewed by:		kib
Differential Revision:	https://reviews.freebsd.org/D31234
MFC after:		2 weeks
2021-07-29 12:42:49 +03:00
Dmitry Chagin
1fdcc87cfd umtx: Expose some of the umtx structures and API to the rest of the kernel.
Differential Revision:	https://reviews.freebsd.org/D31233
MFC after:		2 weeks
2021-07-29 12:42:17 +03:00
Dmitry Chagin
307a3dd35c umtx: Expose struct abs_timeout to the rest of the kernel.
Add umtx_ prefix to all abs_timeout facility and add declaration for it.
For consistency with others abs_timeout mark inline abs_timeout_init2.

Reviewed by:		kib
Differential Revision:	https://reviews.freebsd.org/D31249
MFC after:		2 weeks
2021-07-29 12:41:58 +03:00
Dmitry Chagin
af29f39958 umtx: Split umtx.h on two counterparts.
To prevent umtx.h polluting by future changes split it on two headers:
umtx.h - ABI header for userspace;
umtxvar.h - the kernel staff.

While here fix umtx_key_match style.

Reviewed by:		kib
Differential Revision:	https://reviews.freebsd.org/D31248
MFC after:		2 weeks
2021-07-29 12:41:29 +03:00
Kyle Evans
e3707726c1 kern: remove deprecated makesyscalls.sh
makesyscalls was rewritten in Lua and introduced in d3276301ab.  In the
time since, no objections have risen and a warning was introduced long
ago on invocation of makesyscalls.sh that it would be removed before
FreeBSD 13. Belatedly follow through on that.
2021-07-28 22:22:23 -05:00
Alexander Motin
aefe0a8c32 Refactor/optimize cpu_search_*().
Remove cpu_search_both(), unused for many years.  Without it there is
less sense for the trick of compiling common cpu_search() into separate
cpu_search_lowest() and cpu_search_highest(), so split them completely,
making code more readable.  While there, split iteration over children
groups and CPUs, complicating code for very small deduplication.

Stop passing cpuset_t arguments by value and avoid some manipulations.
Since MAXCPU bump from 64 to 256, what was a single register turned
into 32-byte memory array, requiring memory allocation and accesses.
Splitting struct cpu_search into parameter and result parts allows to
even more reduce stack usage, since the first can be passed through
on recursion.

Remove CPU_FFS() from the hot paths, precalculating first and last CPU
for each CPU group in advance during initialization.  Again, it was
not a problem for 64 CPUs before, but for 256 FFS needs much more code.

With these changes on 80-thread system doing ~260K uncached ZFS reads
per second I observe ~30% reduction of time spent in cpu_search_*().

MFC after:	1 month
2021-07-28 22:00:29 -04:00
Warner Losh
824897a3ae genoffset: simplify and rewrite in sh
genoffset used the fully generic ASSYM macro to generate the offsets
needed for the thread_lite structure. However, since these are offsets
into a structure, they will always be necessarily small and positive. As
such, just create a simple character array of the right size and use a
naming convention such that we can recover the field name, structure
name and type. Use nm -t d and sort -n to sort these into order, then
loop over the resutls to generate the thread_lite structure.

MFC After:		2 weeks
Reviewed by:		kib, markj (earlier versions)
Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D31203
2021-07-28 13:50:09 -06:00
Warner Losh
46dd3ef033 genassym.sh: Fix two minor issues found by shellcheck
o Remove redunant $ in $(( )) expression.
o Quote arg passed to work so paths with spaces, etc will work.

MFC After:		2 weeks
Reviewed by:		kib
Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D31335
2021-07-28 13:49:16 -06:00
Roy Marples
7045b1603b socket: Implement SO_RERROR
SO_RERROR indicates that receive buffer overflows should be handled as
errors. Historically receive buffer overflows have been ignored and
programs could not tell if they missed messages or messages had been
truncated because of overflows. Since programs historically do not
expect to get receive overflow errors, this behavior is not the
default.

This is really really important for programs that use route(4) to keep
in sync with the system. If we loose a message then we need to reload
the full system state, otherwise the behaviour from that point is
undefined and can lead to chasing bogus bug reports.

Reviewed by:	philip (network), kbowling (transport), gbe (manpages)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D26652
2021-07-28 09:35:09 -07:00
Konstantin Belousov
273728b125 Regen 2021-07-28 13:21:22 +03:00
Konstantin Belousov
9b6b793bd7 Revert most of ce42e79310
to restore ABI compatibility for pre-10.x binaries.

It restores _umtx_lock() and _umtx_unlock() syscalls, and UMTX_OP_LOCK/
UMTX_OP_UNLOCK umtx_op(2) operations. UMUTEX_ERROR_CHECK flag is left
out for now, I do not think it makes a difference.

PR:	218571
Reviewed by:	brooks (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D31220
2021-07-28 13:21:12 +03:00
John Baldwin
be79f30d6c m_dup: Handle unmapped mbufs as an input mbuf.
Use m_copydata() instead of a direct bcopy() when copying data out of
a source mbuf into a newly-allocated mbuf.

PR:	      256610
Reported by:  Niels Bakker <niels=freebsd@bakker.net>
Reviewed by:  markj
MFC after:    2 weeks
2021-07-26 14:09:16 -07:00
Jason A. Harmening
2bc16e8aaf VFS: remove MNTK_MARKER
We no longer allow upper filesystems to be unregistered from the base
mount while vfs_notify_upper() or any other upper operation is pending.
New upper mounts can still be registered during this period, but they
will be added at the end of the upper mount tailq.  We therefore no
longer need to allocate marker nodes during vfs_notify_upper() to keep
our place in the iteration.

Reviewed by:	kib, mckusick
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D31016
2021-07-24 12:52:32 -07:00
Jason A. Harmening
c746ed724d Allow stacked filesystems to be recursively unmounted
In certain emergency cases such as media failure or removal, UFS will
initiate a forced unmount in order to prevent dirty buffers from
accumulating against the no-longer-usable filesystem.  The presence
of a stacked filesystem such as nullfs or unionfs above the UFS mount
will prevent this forced unmount from succeeding.

This change addreses the situation by allowing stacked filesystems to
be recursively unmounted on a taskqueue thread when the MNT_RECURSE
flag is specified to dounmount().  This call will block until all upper
mounts have been removed unless the caller specifies the MNT_DEFERRED
flag to indicate the base filesystem should also be unmounted from the
taskqueue.

To achieve this, the recently-added vfs_pin_from_vp()/vfs_unpin() KPIs
have been combined with the existing 'mnt_uppers' list used by nullfs
and renamed to vfs_register_upper_from_vp()/vfs_unregister_upper().
The format of the mnt_uppers list has also been changed to accommodate
filesystems such as unionfs in which a given mount may be stacked atop
more than one lower mount.  Additionally, management of lower FS
reclaim/unlink notifications has been split into a separate list
managed by a separate set of KPIs, as registration of an upper FS no
longer implies interest in these notifications.

Reviewed by:	kib, mckusick
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D31016
2021-07-24 12:52:00 -07:00
Warner Losh
6475667f7b devctl: don't publish the mount options
Mount options aren't solely ASCII strings. In addition, experience to
date suggests that the mount options are much less useful than was
originally supposed and the mount flags suffice to make decisions. Drop
the reporting of options for the mount/remount/unmount events.

Reviewed by:		markj
Reported by:		KASAN
Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D31287
2021-07-24 09:03:53 -06:00
Mark Johnston
ebf9886654 imgact_elf: Avoid redefining suword()
Otherwise this interferes with the definition for sanitizer
interceptors.

MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2021-07-23 15:40:54 -04:00
Mark Johnston
048cd371f3 vfs: Initialize "lastfail" in vfs_mountroot_wait()
This variable is only used to rate-limit "Root mount waiting for: ..."
messages using ppsratecheck().

Reported by:	KMSAN
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2021-07-23 12:04:02 -04:00
Mark Johnston
ea3fbe0707 KASAN: Disable checking before triggering a panic
KASAN hooks will not generate reports if panicstr != NULL, but then
there is a window after the initial panic() call where another report
may be raised.  This can happen if a false positive occurs; to simplify
debugging of such problems, avoid recursing.

Sponsored by:	The FreeBSD Foundation
2021-07-23 10:47:14 -04:00
Mark Johnston
0dcef81de9 Add required sysctl name length checks to various handlers
Reported by:	KMSAN
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2021-07-23 10:47:13 -04:00
Mark Johnston
cae3f9dd01 select: Define select_flags[] as const
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2021-07-23 10:47:13 -04:00
Mark Johnston
90959dd1e5 acct: Zero pad bytes in accounting records
Reported by:	KMSAN
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2021-07-23 10:29:57 -04:00
Mark Johnston
5c18bf9d5f ktrace: Zero request structures when populating the pool
Otherwise uninitialized pad bytes may be copied into the ktrace log
file.

Reported by:	KMSAN
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2021-07-23 10:29:53 -04:00
Alan Somers
6c95065590 Escape any '.' characters in sysctl node names
ZFS creates some sysctl nodes that include a pool name, and '.' is an
allowed character in pool names.  But it's the separator in the sysctl
tree, so it can't be included in a sysctl name.  Replace it with "%25".
Handily, "%" is illegal in ZFS pool names, so there's no ambiguity
there.

PR:		257316
MFC after:	3 weeks
Sponsored by:	Axcient
Reviewed by:	freqlabs
Differential Revision: https://reviews.freebsd.org/D31265
2021-07-22 10:22:48 -06:00
Kyle Evans
23ecfa9d5b kern: mountroot: avoid fd leak in .md parsing
parse_dir_md() opens /dev/mdctl but only closes the resulting fd on
success, not upon failure of the ioctl or when we exceed the md unit
max.

Reviewed by:	kib (slightly previous version)
Sponsored by:	NetApp, Inc.
Sponsored by:	Klara, Inc.
X-NetApp-PR:	#62
Differential Revision:	https://reviews.freebsd.org/D31229
2021-07-21 10:18:09 -05:00
Edward Tomasz Napierala
a40cf4175c Implement unprivileged chroot
This builds on recently introduced NO_NEW_PRIVS flag to implement
unprivileged chroot, enabled by `security.bsd.unprivileged_chroot`.
It allows non-root processes to chroot(2), provided they have the
NO_NEW_PRIVS flag set.

The chroot(8) utility gets a new flag, -n, which sets NO_NEW_PRIVS
before chrooting.

Reviewed By:	kib
Sponsored By:	EPSRC
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D30130
2021-07-20 08:57:53 +00:00
Dmitry Chagin
1ca6b15bbd Drop "All rights reserved" from my copyright statements.
Add email and fixup years while here.

Reviewed by:		imp
Differential Revision:	https://reviews.freebsd.org/D30912
MFC after:		2 weeks
2021-07-20 10:05:50 +03:00
Dmitry Chagin
5fd9cd53d2 linux(4): Modify sv_onexec hook to return an error.
Temporary add stubs to the Linux emulation layer which calls the existing hook.

Reviewed by:		kib
Differential Revision:	https://reviews.freebsd.org/D30911
MFC after:		2 weeks
2021-07-20 09:56:25 +03:00
Dmitry Chagin
62ba4cd340 Call sv_onexec hook after the process VA is created.
For future use in the Linux emulation layer call sv_onexec hook right after
the new process address space is created. It's safe, as sv_onexec used only
by Linux abi and linux_on_exec() does not depend on a state of process VA.

Reviewed by:		kib
Differential revision:	https://reviews.freebsd.org/D30899
MFC after:		2 weeks
2021-07-20 09:55:14 +03:00
Dmitry Chagin
b39fa4770d Remove bogus cast from exec_sysvec_init().
Reviewed by:		kib
Differential Revision:	https://reviews.freebsd.org/D30910
MFC after:		2 weeks
2021-07-20 09:54:09 +03:00
Dmitry Chagin
21629e2a45 Modify exec_sysvec_init() to allow non-native abi to setup their sysentvecs.
For future use in the Linux emulation layer modify the exec_sysvec_init()
to allow non-native abi to fill sv_timekeep_base and sv_shared_page_obj.

Reviewed by:		kib
Differential revision:	https://reviews.freebsd.org/D30898
MFC after:		2 weeks
2021-07-20 09:53:21 +03:00
Kyle Evans
db0f264393 kenv: allow listing of static kernel environments
The early environment is typically cleared, so these new options
need the PRESERVE_EARLY_KENV kernel config(8) option. These environments
are reported as missing by kenv(1) if the option is not present in the
running kernel.

Reviewed by:	imp
Differential Revision:	https://reviews.freebsd.org/D30835
2021-07-18 23:06:19 -05:00
Kyle Evans
7a129c973b kern: add an option for preserving the early kenv
Some downstream configurations do not store secrets in the
early (loader/static) environments and desire a way to preserve these
for diagnostic reasons.  Provide an option to do so.

Reviewed by:	imp, jhb (earlier version)
Differential Revision:	https://reviews.freebsd.org/D30834
2021-07-18 23:05:48 -05:00
David Chisnall
cf98bc28d3 Pass the syscall number to capsicum permission-denied signals
The syscall number is stored in the same register as the syscall return
on amd64 (and possibly other architectures) and so it is impossible to
recover in the signal handler after the call has returned.  This small
tweak delivers it in the `si_value` field of the signal, which is
sufficient to catch capability violations and emulate them with a call
to a more-privileged process in the signal handler.

This reapplies 3a522ba1bc with a fix for
the static assertion failure on i386.

Approved by:	markj (mentor)

Reviewed by:	kib, bcr (manpages)

Differential Revision: https://reviews.freebsd.org/D29185
2021-07-16 18:06:44 +01:00
Mark Johnston
c1aff72cfa callout: Make cc_cpu local to kern_timeout.c
No functional change intended.

MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2021-07-15 22:41:10 -04:00
Mark Johnston
2e5f615295 lio_listio: Don't post a completion notification if none was requested
One is allowed to use LIO_NOWAIT without specifying a sigevent.  In this
case, lj->lioj_signal is left uninitialized, but several code paths
examine liov_signal.sigev_notify to figure out which notification to
post.  Unconditionally initialize that field to SIGEV_NONE.

Add a dumb test case which triggers the bug.

Reported by:	KMSAN+syzkaller
Reviewed by:	asomers
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31197
2021-07-15 22:41:10 -04:00
Konstantin Belousov
0bdb2cbf9d procctl(PROC_ASLR_STATUS): fix vmspace leak
Reported by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2021-07-15 03:02:50 +03:00
Mark Johnston
2783335cae blist: Correct the node count computed in blist_create()
Commit bb4a27f927 added the ability to allocate a span of blocks
crossing a meta node boundary.  To ensure that blst_next_leaf_alloc()
does not walk past the end of the tree, an extra all-zero meta node
needs to be present at the end of the allocation, and
blst_next_leaf_alloc() is implemented such that the presence of this
node terminates the search.

blist_create() computes the number of nodes required.  It had two
problems:
1. When the size of the blist is a power of BLIST_RADIX, we would
   unnecessarily allocate an extra level in the tree.
2. When the size of the blist is a multiple of BLIST_RADIX, we would
   fail to allocate a terminator node.  In this case,
   blst_next_leaf_alloc() could scan beyond the bounds of the
   allocation.  This was found using KASAN.

Modify blist_create() to handle these cases correctly.

Reported by:	pho
Reviewed by:	dougm
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D31158
2021-07-13 17:47:27 -04:00
Mark Johnston
45e2357113 malloc: Pass the allocation size to malloc_large() by value
Its callers do not make use the modified size that malloc_large() was
returning, so there's no need to pass a pointer.  No functional change
intended.

MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-07-13 17:47:02 -04:00
Mateusz Guzik
844aa31c6d cache: add cache_enter_time_flags 2021-07-12 07:03:14 +02:00
David Chisnall
d2b558281a Revert "Pass the syscall number to capsicum permission-denied signals"
This broke the i386 build.

This reverts commit 3a522ba1bc.
2021-07-10 20:26:01 +01:00
David Chisnall
3a522ba1bc Pass the syscall number to capsicum permission-denied signals
The syscall number is stored in the same register as the syscall return
on amd64 (and possibly other architectures) and so it is impossible to
recover in the signal handler after the call has returned.  This small
tweak delivers it in the `si_value` field of the signal, which is
sufficient to catch capability violations and emulate them with a call
to a more-privileged process in the signal handler.

Approved by:	markj (mentor)

Reviewed by:	kib, bcr (manpages)

Differential Revision: https://reviews.freebsd.org/D29185
2021-07-10 17:19:52 +01:00
Alexander Motin
63ca9ea4f3 Use sleepq_signal(SLEEPQ_DROP) in cv_signal().
Same as wakeup_one()/wakeup_any() commit before it reduces the lock
hold time and so contention.

MFC after:	1 week
2021-07-09 20:57:58 -04:00
Mark Johnston
588c7a06df KASAN: Implement __asan_unregister_globals()
It will be called during KLD unload to unpoison the redzones following
global variables.  Otherwise, virtual address ranges previously used for
a KLD may be left tainted, triggering false positives when they are
recycled.

Reported by:	pho
Sponsored by:	The FreeBSD Foundation
2021-07-09 20:38:50 -04:00
Michal Meloun
e88c3b1b02 intrng: remove now redundant shadow variable.
Should not be a functional change.

Submitted by: 	ehem_freebsd@m5p.com
Discussed in:	https://reviews.freebsd.org/D29310
MFC after:	4 weeks
2021-07-08 08:46:41 +02:00
Michal Meloun
a49f208d94 intrng: Releasing interrupt source should clear interrupt table full state.
The first release of an interrupt in a situation where the interrupt table
is full should schedule a full table check the next time an interrupt is
allocated. A full check is necessary to ensure maximum separation between
the order of allocation and the order of release.

Submitted by:	ehem_freebsd@m5p.com (initial version)
Discussed in:	https://reviews.freebsd.org/D29310
MFC after:	4 weeks
2021-07-08 08:16:46 +02:00
Andrew Gallatin
4150a5a87e ktls: fix NOINET build
Reported by: mjguzik
Sponsored by: Netflix
2021-07-07 10:40:02 -04:00
Randall Stewart
d7955cc0ff tcp: HPTS performance enhancements
HPTS drives both rack and bbr, and yet there have been many complaints
about performance. This bit of work restructures hpts to help reduce CPU
overhead. It does this by now instead of relying on the timer/callout to
drive it instead use user return from a system call as well as lro flushes
to drive hpts. The timer becomes a backstop that dynamically adjusts
based on how "late" we are.

Reviewed by: tuexen, glebius
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D31083
2021-07-07 07:22:35 -04:00
Konstantin Belousov
28a66fc3da Do not call FreeBSD-ABI specific code for all ABIs
Use sysentvec hooks to only call umtx_thread_exit/umtx_exec, which handle
robust mutexes, for native FreeBSD ABI.  Similarly, there is no sense
in calling sigfastblock_clear() for non-native ABIs.

Requested by:	dchagin
Reviewed by:	dchagin, markj (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D30987
2021-07-07 14:12:07 +03:00
Konstantin Belousov
55976ce11a Move sv_onexit() sysentvec hook slightly later
after itimers are stopped.  This makes it more usable for e.g. native FreeBSD
ABI sysentvecs.

Reviewed by:	dchagin, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D30987
2021-07-07 14:12:07 +03:00
Konstantin Belousov
71ab344524 Add sv_onexec_old() sysent hook for exec event
Unlike sv_onexec(), it is called from the old (pre-exec) sysentvec structure.
The old vmspace for the process is still intact during the call.

Reviewed by:	dchagin, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D30987
2021-07-07 14:12:07 +03:00
Mateusz Guzik
c2c34ee540 mbuf: add m_get_raw and m_gethdr_raw
The intent is to eliminate the MT_NOINIT flag and consequently a branch
from the constructor.

Reviewed by:	gallatin
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D31080
2021-07-07 11:05:46 +00:00
Mateusz Guzik
0a718a6e6e mbuf: replace all direct uma_zfree(zone_mbuf) calls with m_free_raw
Reviewed by:	donner
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D31082
2021-07-07 11:05:46 +00:00
Andrew Gallatin
28d0a740dd ktls: auto-disable ifnet (inline hw) kTLS
Ifnet (inline) hw kTLS NICs typically keep state within
a TLS record, so that when transmitting in-order,
they can continue encryption on each segment sent without
DMA'ing extra state from the host.

This breaks down when transmits are out of order (eg,
TCP retransmits).  In this case, the NIC must re-DMA
the entire TLS record up to and including the segment
being retransmitted.  This means that when re-transmitting
the last 1448 byte segment of a TLS record, the NIC will
have to re-DMA the entire 16KB TLS record. This can lead
to the NIC running out of PCIe bus bandwidth well before
it saturates the network link if a lot of TCP connections have
a high retransmoit rate.

This change introduces a new sysctl (kern.ipc.tls.ifnet_max_rexmit_pct),
where TCP connections with higher retransmit rate will be
switched to SW kTLS so as to conserve PCIe bandwidth.

Reviewed by:	hselasky, markj, rrs
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D30908
2021-07-06 10:28:32 -04:00
Jessica Clarke
55c57a7811 rman: Remove an outdated comment that no longer applies
Since commit 2dd1bdf183 in 2016 the r_start and r_end fields have been
rman_res_t, which was briefly unsigned long, but commit da1b038af9
changed the typedef to be uintmax_t instead. C99 is also something we
assume these days.

Reviewed by:	imp
Differential Revision:	https://reviews.freebsd.org/D30808
2021-07-05 16:15:03 +01:00
Mateusz Guzik
904a08f342 ktls: switch bare zone_mbuf use to m_free_raw
Reviewed by:	gallatin
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D30955
2021-07-02 08:30:22 +00:00
Mateusz Guzik
05462babd4 mbuf: add m_free_raw to be used instead of directly calling uma_zfree
The intent is to remove all direct zone_mbuf consumers so that ctor/dtor
from that zone can be reimplemented as wrappers around uma, avoiding an
indirect function call.

Reviewed by:	kbowling
Discussed with:	gallatin
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D30959
2021-07-02 08:30:22 +00:00
Mateusz Guzik
fb32c8dbeb iflib: retire MB_DTOR_SKIP
The flag was added in 2016 but remains unused.

Reviewed by:	kbowling
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D30958
2021-07-02 08:30:22 +00:00
Edward Tomasz Napierala
db8d680ebe procctl(2): add PROC_NO_NEW_PRIVS_CTL, PROC_NO_NEW_PRIVS_STATUS
This introduces a new, per-process flag, "NO_NEW_PRIVS", which
is inherited, preserved on exec, and cannot be cleared.  The flag,
when set, makes subsequent execs ignore any SUID and SGID bits,
instead executing those binaries as if they not set.

The main purpose of the flag is implementation of Linux
PROC_SET_NO_NEW_PRIVS prctl(2), and possibly also unpriviledged
chroot.

Reviewed By:	kib
Sponsored By:	EPSRC
Differential Revision:	https://reviews.freebsd.org/D30939
2021-07-01 09:42:07 +01:00
Dmitry Chagin
5d9f790191 Eliminate p_elf_machine from struct proc.
Instead of p_elf_machine use machine member of the Elf_Brandinfo which is now
cached in the struct proc at p_elf_brandinfo member.

Note to MFC: D30918, KBI

Reviewed by:		kib, markj
Differential Revision:	https://reviews.freebsd.org/D30926
MFC after:		2 weeks
2021-06-29 20:18:29 +03:00
Dmitry Chagin
615f22b2fb Add a link to the Elf_Brandinfo into the struc proc.
To allow the ABI to make a dicision based on the Brandinfo add a link
to the Elf_Brandinfo into the struct proc. Add a note that the high 8 bits
of Elf_Brandinfo flags is private to the ABI.

Note to MFC: it breaks KBI.

Reviewed by:		kib, markj
Differential Revision:	https://reviews.freebsd.org/D30918
MFC after:		2 weeks
2021-06-29 20:15:08 +03:00
Edward Tomasz Napierala
435754a59e Add infrastructure required for Linux coredump support
This adds `sv_elf_core_osabi`, `sv_elf_core_abi_vendor`,
and `sv_elf_core_prepare_notes` fields to `struct sysentvec`,
and modifies imgact_elf.c to make use of them instead
of hardcoding FreeBSD-specific values.  It also updates all
of the ABI definitions to preserve current behaviour.

This makes it possible to implement non-native ELF coredump
support without unnecessary code duplication.  It will be used
for Linux coredumps.

Reviewed By:	kib
Sponsored By:	EPSRC
Differential Revision:	https://reviews.freebsd.org/D30921
2021-06-29 08:49:12 +01:00
Edward Tomasz Napierala
61b4c62718 imgact_elf.c: style, remove unnecessary casts
Remove unnecessary type casts and redundant brackets.
No functional changes.

Suggested By:	kib
Reviewed By:	kib
Sponsored By:	EPSRC
Differential Revision:	https://reviews.freebsd.org/D30841
2021-06-27 17:05:59 +01:00
Alexander Motin
6df35af4d8 Allow sleepq_signal() to drop the lock.
Introduce SLEEPQ_DROP sleepq_signal() flag, allowing one to drop the
sleep queue chain lock before returning.  Reduced lock scope allows
significantly reduce lock contention inside taskqueue_enqueue() for
ZFS worker threads doing ~350K disk reads/s on 40-thread system.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2021-06-25 14:12:21 -04:00
Konstantin Belousov
802cf4ab0e namei: add NDPREINIT() macro
Its intent is to do the initialization of the future part of struct nameidata
which should be used across several namei() and VOPs.  Right now it is NOP.

Reviewed by:	mckusick
Discussed with:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D30041
2021-06-23 23:46:15 +03:00
Warner Losh
ddfc9c4c59 newbus: Move from bus_child_{pnpinfo,location}_src to bus_child_{pnpinfo,location} with sbuf
Now that the upper layers all go through a layer to tie into these
information functions that translates an sbuf into char * and len. The
current interface suffers issues of what to do in cases of truncation,
etc. Instead, migrate all these functions to using struct sbuf and these
issues go away. The caller is also in charge of any memory allocation
and/or expansion that's needed during this process.

Create a bus_generic_child_{pnpinfo,location} and make it default. It
just returns success. This is for those busses that have no information
for these items. Migrate the now-empty routines to using this as
appropriate.

Document these new interfaces with man pages, and oversight from before.

Reviewed by:		jhb, bcr
Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D29937
2021-06-22 20:52:06 -06:00
Edward Tomasz Napierala
06250515cf imgact_elf: compute auxv buffer size instead of using magic value
The new buffer is somewhat larger, but there should be no functional
changes.

Reviewed By:	kib, imp
Sponsored By:	EPSRC
Differential Revision:	https://reviews.freebsd.org/D30821
2021-06-21 17:07:07 +01:00
Colin Percival
fe51b5a76d kern_tslog: Include tslog data from loader
The i386 loader (and hopefully others to come) now passes tslog data
as a "preloaded module".  Include this in the data returned by the
debug.tslog sysctl.

Reviewed by:	kevans
2021-06-20 20:09:47 -07:00
Warner Losh
0a99422970 Move mips and arm to 1000Hz by default.
armv6 and armv7 systems already were 1000Hz. The other armv5 were a
mix of 100 and 1000. This changes them to 1000. Should there be
issues, we can add options HZ=100 to the systems that have bad
performance at the drop of a hat.

mips is a lot more complicated. But most of the systems are already
1000HZ. The hardware exceptions are all fast enough to run at
1000Hz. MALTA is our primary emulator, and history has shown emulators
tend to like 100Hz better, so run those systems at 100Hz. As with arm,
any system that shows a huge performance regression can reverted to
100Hz easily.

This was going to be committed well in advance of the 13 branch, but
it was delayed and forgotten til now.

Discussed on:	#bsdmips ages ago
Sponsored by:	Netflix
2021-06-16 20:00:14 -06:00
John Baldwin
faf0224ff2 ktls: Don't mark existing received mbufs notready for TOE TLS.
The TOE driver might receive decrypted TLS records that are enqueued
to the socket buffer after ktls_try_toe() returns and before
ktls_enable_rx() locks the receive buffer to call sb_mark_notready().
In that case, sb_mark_notready() would incorrectly treat the decrypted
TLS record as an encrypted record and schedule it for decryption.
This always resulted in the connection being dropped as the data in
the control message did not look like a valid TLS header.

To fix, don't try to handle software decryption of existing buffers in
the socket buffer for TOE TLS in ktls_enable_rx().  If a TOE TLS
driver needs to decrypt existing data in the socket buffer, the driver
will need to manage that in its tod_alloc_tls_session method.

Sponsored by:	Chelsio Communications
2021-06-15 17:45:21 -07:00
Konstantin Belousov
a12e901a5a Add a knob to disable dequeueing SIGCHLD on waiting for live process
It seems that Linux does not dequeue siginfo for SIGCHLD when wait*(2)
reports status of the running process.  In particular, sigwaitinfo(2)
and other signal querying syscalls can observe the siginfo after wait.

FreeBSD dequeued siginfo from the beginning, so we cannot change the
default ABI to be more compatible.  Still, add a knob to enable to
change to the other behavior for debugging purposes.

Reported by:	dchagin
Reviewed by:	dchagin, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D30675
2021-06-16 02:00:19 +03:00
Konstantin Belousov
bc38762474 Add a knob to not drop signal with default ignored or ignored actions
Traditionally, BSD drops signals with the default action during send,
not even putting them to the destination process queue.  This semantic
is not shared with other operating systems (Linux), which do queue
such signals.  In particular, sigtimedwait(2) and related syscalls can
observe the delivery.

Add a global knob kern.sig_discard_ign which can be set to false to force
enqueuing of the signals with default action.  Also add an ABI flag to
indicate that signals should be queued.

Note that it is not practical to run with the knob turned on, because almost
all software that care about the delivery of such signals, is aware of the
difference, and misbehaves if the signals are actually queued.  The purpose
of the knob as is is to allow for easier diagnostic of the programs that
need the adjustments, to confirm the cause of problem.

Reported by:	dchagin
Reviewed by:	dchagin, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D30675
2021-06-16 02:00:19 +03:00
Konstantin Belousov
acced8b043 sigwait: add comment explaining EINTR/ERESTART details
Reviewed by:	dchagin, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D30675
2021-06-16 02:00:19 +03:00
Konstantin Belousov
afb36e289c sigwait(2) and sigtimedwait(2) must not be restarted.
Reported by:	dchagin
Reviewed by:	dchagin, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D30675
2021-06-16 02:00:18 +03:00
Mark Johnston
a100217489 Consistently use the SOCKBUF_MTX() and SOCK_MTX() macros
This makes it easier to change the socket locking protocols.  No
functional change intended.

MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2021-06-14 17:32:32 -04:00
Mark Johnston
f4bb1869dd Consistently use the SOLISTENING() macro
Some code was using it already, but in many places we were testing
SO_ACCEPTCONN directly.  As a small step towards fixing some bugs
involving synchronization with listen(2), make the kernel consistently
use SOLISTENING().  No functional change intended.

MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2021-06-14 17:32:27 -04:00
Andrew Gallatin
ed5e13cfc2 ktls: Fix interaction with RATELIMIT
uipc_ktls.c was missing opt_ratelimit.h, so it was
never noticing that RATELIMIT was enabled.  Once it was
enabled, it failed to compile as  ktls_modify_txrtlmt()
had accrued a compilation error when it was not being
compiled in.

Sponsored by: Netflix
2021-06-14 10:51:16 -04:00
Dmitry Chagin
e884512ad1 Split kern_poll() on two counterparts.
The kern_poll_kfds() operates on clear kernel data, kfds points to an
array in the kernel, while kern_poll() operates on user supplied pollfd.
Move nfds check to kern_poll_maxfds().

No functional changes, it's for future use in the Linux emulation layer.

Reviewd by:		kib
Differential Revision:	https://reviews.freebsd.org/D30690
MFC after:		2 weeks
2021-06-10 15:11:25 +03:00
Dmitry Chagin
f570a6723e Fix copyright, remove "all rights reserved".
The eventfd code was written by me, rdivacky@ copyrigth applicable only
to epoll part of the Linuxulator code. Roman is ok to retire his copyright
from sys/kern/sys_eventfd.c and 'All rights reserved.' lines from
sys/compat/linux/linux_event.[c|h] and sys/kern/sys_eventfd.c files.

Reviewed by:		kib, emaste
Approved by:		rdivacky
Differential Revision:	https://reviews.freebsd.org/D30677
MFC after:		2 weeks
2021-06-08 08:18:00 +03:00
Mark Johnston
887c753c9f Fix handling of D_GIANTOK
It was meant to suppress only the printf(), not the subsequent injection
of Giant-protected thunks for various file operations.

Fixes:		fbeb4ccac9
Reported by:	pho
Tested by:	pho
MFC after:	6 days
Pointy hat:	markj
2021-06-07 16:45:50 -04:00
Mark Johnston
fbeb4ccac9 Suppress D_NEEDGIANT warnings for some drivers
During boot we warn that the kbd and openfirm drivers are Giant-locked
and may be deleted.  Generally, the warning helps signal that certain
old drivers are not being maintained and are subject to removal, but
this doesn't really apply to certain drivers which are harder to
detangle from Giant.

Add a flag, D_GIANTOK, that devices can specify to suppress the
misleading warning.  Use it in the kbd and openfirm drivers.

Reviewed by:	imp, jhb
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30649
2021-06-06 16:44:46 -04:00
Konstantin Belousov
2d423f7671 sysent: allow ABI to disable setid on exec.
Reviewed by:	dchagin
Tested by:	trasz
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28154
2021-06-06 21:42:52 +03:00
Konstantin Belousov
19e6043a44 kern_exec.c: Add execve_nosetid() helper
Reviewed by:	dchagin
Tested by:	trasz
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D28154
2021-06-06 21:42:41 +03:00
Jason A. Harmening
59409cb90f Add a generic mechanism for preventing forced unmount
This is aimed at preventing stacked filesystems like nullfs and unionfs
from "losing" their lower mounts due to forced unmount.  Otherwise,
VFS operations that are passed through to the lower filesystem(s) may
crash or otherwise cause unpredictable behavior.

Introduce two new functions: vfs_pin_from_vp() and vfs_unpin().
which are intended to be called on the lower mount(s) when the stacked
filesystem is mounted and unmounted, respectively.
Much as registration in the mnt_uppers list previously did, pinning
will prevent even forced unmount of the lower FS and will allow the
stacked FS to freely operate on the lower mount either by direct
use of the struct mount* or indirect use through a properly-referenced
vnode's v_mount field.

vfs_pin_from_vp() is modeled after vfs_ref_from_vp() in that it uses
the mount interlock coupled with re-checking vp->v_mount to ensure
that it will fail in the face of a pending unmount request, even if
the concurrent unmount fully completes.

Adopt these new functions in both nullfs and unionfs.

Reviewed By:	kib, markj
Differential Revision: https://reviews.freebsd.org/D30401
2021-06-05 18:20:36 -07:00
wiklam
43521b46fc Correcting comment about "sched_interact_score".
Reviewed by:	jrtc@, imp@
Pull Request:	https://github.com/freebsd/freebsd-src/pull/431

Sponsored by:		Netflix
2021-06-02 21:50:57 -06:00
Warner Losh
9f3d1a98dd regen after tweaks to getgroups and setgroups
Sponsored by:		Netflix
2021-06-02 13:24:50 -06:00
Moritz Buhl
4bc2174a1b kern: fail getgroup and setgroup with negative int
Found using
https://github.com/NetBSD/src/blob/trunk/tests/lib/libc/sys/t_getgroups.c

getgroups/setgroups want an int and therefore casting it to u_int
resulted in `getgroups(-1, ...)` not returning -1 / errno = EINVAL.

imp@ updated syscall.master and made changes markj@ suggested

PR:			189941
Tested by:		imp@
Reviewed by:		markj@
Pull Request:		https://github.com/freebsd/freebsd-src/pull/407
Differential Revision:	https://reviews.freebsd.org/D30617
2021-06-02 13:22:57 -06:00
Mateusz Guzik
c9f8dcda85 kqueue: replace kq_ncallouts loop with atomic_fetchadd 2021-06-02 15:14:58 +00:00
Rich Ercolani
a19ae1b099 vfs: fix MNT_SYNCHRONOUS check in vn_write
ca1ce50b2b ("vfs: add more safety against concurrent forced
unmount to vn_write") has a side effect of only checking MNT_SYNCHRONOUS
if O_FSYNC is set.

Reviewed By: mjg
Differential Revision: https://reviews.freebsd.org/D30610
2021-06-02 13:42:02 +00:00
Kyle Evans
2d741f33bd kern: ether_gen_addr: randomize on default hostuuid, too
Currently, this will still hash the default (all zero) hostuuid and
potentially arrive at a MAC address that has a high chance of collision
if another interface of the same name appears in the same broadcast
domain on another host without a hostuuid, e.g., some virtual machine
setups.

Instead of using the default hostuuid, just treat it as a failure and
generate a random LA unicast MAC address.

Reviewed by:	bz, gbe, imp, kbowling, kp
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D29788
2021-06-01 22:59:21 -05:00
Mark Johnston
283e60fb31 ktrace: Fix an inverted comparison added in commit f3851b235
Fixes:		f3851b235 ("ktrace: Fix a race with fork()")
Reported by:	dchagin, phk
2021-06-01 09:15:35 -04:00
Konstantin Belousov
d3f7975fcb thread_reap_barrier(): remove unused variable
Noted by:	alc
Sponsored by:	Mellanox Technologies/NVidia Networking
MFC after:	1 week
2021-05-31 23:03:42 +03:00
Konstantin Belousov
f62c7e54e9 Add thread_reap_barrier()
Reviewed by:	hselasky,markj
Sponsored by:	Mellanox Technologies/NVidia Networking
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D30468
2021-05-31 18:09:22 +03:00
Konstantin Belousov
3a68546d23 quisce_cpus(): add special handling for PDROP
Currently passing PDROP to the quisce_cpus() function does not make sense.
Add special meaning for it, by not waiting for the idle thread to schedule.

Also avoid allocating u_int[MAXCPU] on the stack.

Reviewed by:	hselasky, markj
Sponsored by:	Mellanox Technologies/NVidia Networking
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D30468
2021-05-31 18:09:22 +03:00
Konstantin Belousov
845d77974b kern_thread.c: wrap too long lines
Reviewed by:	hselasky, markj
Sponsored by:	Mellanox Technologies/NVidia Networking
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D30468
2021-05-31 18:09:22 +03:00
Konstantin Belousov
e266a0f7f0 kern linker: do not allow more than one kldload and kldunload syscalls simultaneously
kld_sx is dropped e.g. for executing sysinits, which allows user
to initiate kldunload while module is not yet fully initialized.

Reviewed by:	markj
Differential revision:	https://reviews.freebsd.org/D30456
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2021-05-31 18:09:22 +03:00
Konstantin Belousov
27006229f7 vinvalbuf: do not panic if we were unable to flush dirty buffers
Return EBUSY instead and let caller to handle the issue.

For vgone()/vnode reclamation, caller first does vinvalbuf(V_SAVE),
which return EBUSY in case dirty buffers where not flushed. Then caller
calls vinvalbuf(0) due to non-zero return, which gets rid of all dirty
buffers without dependencies.

PR:	238565
Reviewed by:	asomers, mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D30555
2021-05-31 01:20:53 +03:00
Jason A. Harmening
a4b07a2701 VFS_QUOTACTL(9): allow implementation to indicate busy state changes
Instead of requiring all implementations of vfs_quotactl to unbusy
the mount for Q_QUOTAON and Q_QUOTAOFF, add an "mp_busy" in/out param
to VFS_QUOTACTL(9).  The implementation may then indicate to the caller
whether it needed to unbusy the mount.

Also, add stbool.h to libprocstat modules which #define _KERNEL
before including sys/mount.h.  Otherwise they'll pull in sys/types.h
before defining _KERNEL and therefore won't have the bool definition
they need for mp_busy.

Reviewed By:	kib, markj
Differential Revision: https://reviews.freebsd.org/D30556
2021-05-30 14:53:47 -07:00
Jason A. Harmening
271fcf1c28 Revert commits 6d3e78ad6c and 54256e7954
Parts of libprocstat like to pretend they're kernel components for the
sake of including mount.h, and including sys/types.h in the _KERNEL
case doesn't fix the build for some reason.  Revert both the
VFS_QUOTACTL() change and the follow-up "fix" for now.
2021-05-29 17:48:02 -07:00
Mateusz Guzik
3cf75ca220 vfs: retire unused vn_seqc_write_begin_unheld* 2021-05-29 22:04:09 +00:00
Mateusz Guzik
d81aefa8b7 vfs: use the sentinel trick in locked lookup path parsing 2021-05-29 22:04:09 +00:00
Mateusz Guzik
478c52f1e3 vfs: slightly rework vn_rlimit_fsize 2021-05-29 22:04:09 +00:00
Mateusz Guzik
9bfddb3ac4 fd: use PROC_WAIT_UNLOCKED when clearing p_fd/p_pd 2021-05-29 22:04:09 +00:00
Jason A. Harmening
6d3e78ad6c VFS_QUOTACTL(9): allow implementation to indicate busy state changes
Instead of requiring all implementations of vfs_quotactl to unbusy
the mount for Q_QUOTAON and Q_QUOTAOFF, add an "mp_busy" in/out param
to VFS_QUOTACTL(9).  The implementation may then indicate to the caller
whether it needed to unbusy the mount.

Reviewed By:	kib, markj
Differential Revision: https://reviews.freebsd.org/D30218
2021-05-29 14:05:39 -07:00