Commit Graph

18793 Commits

Author SHA1 Message Date
Mateusz Guzik
628c3b307f cache: only let non-dir descriptors through when doing EMPTYPATH lookups
Otherwise things like realpath against a file and '.' end up with an
illegal state of having a regular vnode for the parent.

Reported by:	syzbot+9aa5439dd9c708aeb1a8@syzkaller.appspotmail.com
2021-10-27 18:27:47 +00:00
Mark Johnston
71f31d784e rmslock: Update td_locks during lock and unlock operations
Reviewed by:	mjg
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D32692
2021-10-27 11:18:13 -04:00
Gordon Bergling
70de1003da jail(8): Fix a few common typos in source code comments
- s/phyiscal/physical/

MFC after:	3 days
2021-10-27 06:16:06 +02:00
Kirk McKusick
dfd704b7fb Allow biodone() to be used as a completion routine.
An ordered series of BIO_READ and BIO_WRITE operations are
typically done as:

	while (work to do) {
		setup bp for I/O
		g_io_request(bp, consumer);
		biowait(bp);
	}

Here you need to have biodone() called at the completion of
the I/O to set the BIO_DONE flag and awaken the biowait(). The
obvious way to do this would be to set bio_done = biodone, but
biodone() will only take the desired action if bio_done == NULL.
The relevant code at the end of biodone() is:

	done = bp->bio_done;
	if (done == NULL) {
		mtxp = mtx_pool_find(mtxpool_sleep, bp);
		mtx_lock(mtxp);
		bp->bio_flags |= BIO_DONE;
		wakeup(bp);
		mtx_unlock(mtxp);
	} else
		done(bp);

This code would infinitely recurse if biodone() is specified as the
routine to use at completion. So before this change, a wrapper done
function had to be written:

static void
g_io_done(struct bio *bp)
{

	bp->bio_done = NULL;
	biodone(bp);
	bp->bio_done = g_io_done;
}

This commit changes

	if (done == NULL)

to

	if (done == NULL || done == biodone)

which eliminates the need for the wrapper function.

Reviewed by:  kib
Sponsored by: Netflix
2021-10-23 14:11:57 -07:00
Edward Tomasz Napierala
6e66030c4c linux: implement PTRACE_EVENT_EXEC
This fixes strace(1) from Ubuntu Focal.

Reviewed By:	jhb
Sponsored By:	EPSRC
Differential Revision:	https://reviews.freebsd.org/D32367
2021-10-23 19:46:26 +01:00
Konstantin Belousov
3b5331dd8d uipc_shm: silent warnings about write-only variables in largepage code
In shm_largepage_phys_populate(), the result from vm_page_grab() is only
needed for assertion.

In shm_dotruncate_largepage(), there is a commented-out prototype code
for managed largepages.   The oldobjsz is saved for its sake, so mark
the variable as __unused directly.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2021-10-21 21:40:46 +03:00
Konstantin Belousov
3d2778515a sig_ast_checksusp(): mark the local p as __diagused
It is only used to assert that the (current) process is locked

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2021-10-21 21:40:46 +03:00
Konstantin Belousov
6776747a0e subr_firmware.c::unloadentry(): remote write-only variable
The function ignores result returned by linker_release_module().
The FW_UNLOAD flag on the file is cleared, so even on error it would
not be tried again.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2021-10-21 21:40:46 +03:00
Konstantin Belousov
993446638c alq_open_flags(): mark local td variable as unused
It is passed to the NDINIT() macro which ignores the thread argument
for some time.

Sponsored by:	The FreeBSD Foundation
2021-10-21 21:40:46 +03:00
Konstantin Belousov
bded8fa300 umtxq_requeue: remove write-only variable uh2
umtxq_queue_lookup() does not change state.  It is redone inside
umtxq_insert() later, anyway.

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2021-10-21 21:40:46 +03:00
John Baldwin
96668a81ae ktls: Always create a software backend for receive sessions.
A future change to TOE TLS will require a software fallback for the
first few TLS records received.  Future support for NIC TLS on receive
will also require a software fallback for certain cases.

Reviewed by:	gallatin, hselasky
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D32566
2021-10-21 09:37:17 -07:00
John Baldwin
c57dbec69a ktls: Add a routine to query information in a receive socket buffer.
In particular, ktls_pending_rx_info() determines which TLS record is
at the end of the current receive socket buffer (including
not-yet-decrypted data) along with how much data in that TLS record is
not yet present in the socket buffer.

This is useful for future changes to support NIC TLS receive offload
and enhancements to TOE TLS receive offload.  Those use cases need a
way to synchronize a state machine on the NIC with the TLS record
boundaries in the TCP stream.

Reviewed by:	gallatin, hselasky
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D32564
2021-10-21 09:36:29 -07:00
Mark Johnston
84c3922243 Convert consumers to vm_page_alloc_noobj_contig()
Remove now-unneeded page zeroing.  No functional change intended.

Reviewed by:	alc, hselasky, kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D32006
2021-10-19 21:22:56 -04:00
Mark Johnston
a4667e09e6 Convert vm_page_alloc() callers to use vm_page_alloc_noobj().
Remove page zeroing code from consumers and stop specifying
VM_ALLOC_NOOBJ.  In a few places, also convert an allocation loop to
simply use VM_ALLOC_WAITOK.

Similarly, convert vm_page_alloc_domain() callers.

Note that callers are now responsible for assigning the pindex.

Reviewed by:	alc, hselasky, kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31986
2021-10-19 21:22:56 -04:00
Konstantin Belousov
c7f38a2df1 procctl: stop using SA_*LOCKED, define local enum
Using SA_*LOCKED constants breaks !INVARIANT builds

Reported by:	cy
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2021-10-20 00:25:19 +03:00
Konstantin Belousov
49db81aa05 kern_procctl: skip zombies for process group operations
When iterating over the process group members, skip zombies same as it
is done by pfind() for single-process operation.

Reviewed by:	emaste, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D32513
2021-10-19 23:04:34 +03:00
Konstantin Belousov
3692877a6c kern_procctl.c: use td->td_proc instead of curproc
Suggested by:	markj
Reviewed by:	emaste, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D32513
2021-10-19 23:04:34 +03:00
Konstantin Belousov
f5bb6e5a6d procctl: actually require debug privileges over target
for state control over TRACE, TRAPCAP, ASLR, PROTMAX, STACKGAP,
NO_NEWPRIVS, and WXMAP.

Reported by:	emaste
Reviewed by:	emaste, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D32513
2021-10-19 23:04:34 +03:00
Konstantin Belousov
1c4dbee5dd procctl: make it possible to specify that some operations require debug privilege over the target
Reviewed by:	emaste, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D32513
2021-10-19 23:04:34 +03:00
Konstantin Belousov
32026f5983 sys_procctl(): zero the data buffer once, on syscall entry
and remove zeroing of it from specific functions.  This way it is
guaranteed that we do not leak kernel data.

Suggested by:	markj
Reviewed by:	emaste, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D32513
2021-10-19 23:04:34 +03:00
Konstantin Belousov
56d5323b4d sys_procctl(): use table data to do copyin/copyout
Reviewed by:	emaste, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D32513
2021-10-19 23:04:34 +03:00
Konstantin Belousov
68dc5b381a kern_procctl_single(): convert to use table data
Reviewed by:	emaste, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D32513
2021-10-19 23:04:34 +03:00
Konstantin Belousov
34f39a8c0e procctl: convert PDEATHSIG_CTL/STATUS to regular kern_procctl_single() cases
Reviewed by:	emaste, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D32513
2021-10-19 23:04:34 +03:00
Konstantin Belousov
f833ab9dd1 procctl(2): add consistent shortcut P_ID:0 as curproc
Reported by:	bdrewery, emaste
Reviewed by:	emaste, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D32513
2021-10-19 23:04:34 +03:00
Konstantin Belousov
7ae879b14a kern_procctl(): convert the function to be table-driven
Reviewed by:	emaste, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D32513
2021-10-19 23:04:34 +03:00
Konstantin Belousov
31faa565ed sys_procctl(2): remove sysproto and argused
Reviewed by:	emaste, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D32513
2021-10-19 23:04:33 +03:00
Mark Johnston
621fd9dcb2 timecounter: Lock the timecounter list
Timecounter registration is dynamic, i.e., there is no requirement that
timecounters must be registered during single-threaded boot.  Loadable
drivers may in principle register timecounters (which can be switched to
automatically).  Timecounters cannot be unregistered, though this could
be implemented.

Registered timecounters belong to a global linked list.  Add a mutex to
synchronize insertions and the traversals done by (mpsafe) sysctl
handlers.  No functional change intended.

Reviewed by:	imp, kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D32511
2021-10-18 09:56:59 -04:00
Mark Johnston
81f2e9063d signal: Add SIG_FOREACH and refactor issignal()
Add a SIG_FOREACH macro that can be used to iterate over a signal set.
This is a bit cleaner and more efficient than calling sig_ffs() in a
loop.  The implementation is based on BIT_FOREACH_ISSET(), except
that the bitset limbs are always 32 bits wide, and signal sets are
1-indexed rather than 0-indexed like bitset(9) sets.

issignal() cannot really be modified to use SIG_FOREACH() directly.
Take this opportunity to split the function into two explicit loops.
I've always found this function hard to read and think that this change
is an improvement.

Remove sig_ffs(), nothing uses it now.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D32473
2021-10-18 09:56:58 -04:00
Colin Percival
52e125c2bd TSLOG: Report final execname, not first
In cases such as daemons launched via limits(1), a process may call
exec multiple times; the last name of the last binary executed is
usually (always?) more informative.

Fixes:	46dd801acb Add userland boot profiling to TSLOG
Sponsored by:	https://www.patreon.com/cperciva
2021-10-17 13:36:38 -07:00
Jessica Clarke
682c00a6ce riscv: Implement pmap_mapdev_attr
This is needed for LinuxKPI's _ioremap_attr. This reuses the generic
implementation introduced for aarch64, and itself requires implementing
pmap_kenter, which is trivial to do given riscv currently treats all
mapping attributes the same due to the Svpbmt extension not yet being
ratified and in hardware.

Reviewed by:	markj, mhorne
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D32445
2021-10-17 15:31:35 +01:00
Mateusz Guzik
1045352f15 cache: only assert on flags when dealing with EMPTYPATH
Reported by:	syzbot+bd48ee0843206a09e6b8@syzkaller.appspotmail.com
Fixes:		7dd419cabc ("cache: add empty path support")
2021-10-17 08:42:47 +00:00
Mateusz Guzik
7dd419cabc cache: add empty path support
This avoids spurious drop offs as EMPTY is passed regardless of the
actual path name.

Pushign the work inside the lookup instead of just ignorign the flag
allows avoid checking for empty pathname for all other lookups.
2021-10-16 20:08:37 +00:00
Colin Percival
46dd801acb Add userland boot profiling to TSLOG
On kernels compiled with 'options TSLOG', record for each process ID:
* The timestamp of the fork() which creates it and the parent
process ID,
* The first path passed to execve(), if any,
* The first path resolved by namei, if any, and
* The timestamp of the exit() which terminates the process.

Expose this information via a new sysctl, debug.tslog_user.

On kernels lacking 'options TSLOG' (the default), no information is
recorded and the sysctl does not exist.

Note that recording namei is needed in order to obtain the names of
rc.d scripts being launched, as the rc system sources them in a
subshell rather than execing the scripts.

With this commit it is now possible to generate flamecharts of the
entire boot process from the start of the loader to the end of
/etc/rc.  The code needed to perform this processing is currently
found in github: https://github.com/cperciva/freebsd-boot-profiling

Reviewed by:	mhorne
Sponsored by:	https://www.patreon.com/cperciva
Differential Revision: https://reviews.freebsd.org/D32493
2021-10-16 11:47:34 -07:00
Dawid Gorecki
a97d697122 kern_exec: Add kern.stacktop sysctl.
With stack gap enabled top of the stack is moved down by a random
amount of bytes. Because of that some multithreaded applications
which use kern.usrstack sysctl to calculate address of stacks for
their threads can fail. Add kern.stacktop sysctl, which can be used
to retrieve address of the stack after stack gap is applied to it.
Returns value identical to kern.usrstack for processes which have
no stack gap.

Reviewed by: kib
Obtained from: Semihalf
Sponsored by: Stormshield
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D31897
2021-10-15 10:21:55 +02:00
Dawid Gorecki
889b56c8cd setrlimit: Take stack gap into account.
Calling setrlimit with stack gap enabled and with low values of stack
resource limit often caused the program to abort immediately after
exiting the syscall. This happened due to the fact that the resource
limit was calculated assuming that the stack started at sv_usrstack,
while with stack gap enabled the stack is moved by a random number
of bytes.

Save information about stack size in struct vmspace and adjust the
rlim_cur value. If the rlim_cur and stack gap is bigger than rlim_max,
then the value is truncated to rlim_max.

PR: 253208
Reviewed by: kib
Obtained from: Semihalf
Sponsored by: Stormshield
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D31516
2021-10-15 10:21:47 +02:00
John Baldwin
a72ee35564 ktls: Defer creation of threads and zones until first use.
Run ktls_init() when the first KTLS session is created rather than
unconditionally during boot.  This avoids creating unused threads and
allocating unused resources on systems which do not use KTLS.

Reviewed by:	gallatin, markj
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D32487
2021-10-14 15:48:34 -07:00
Konstantin Belousov
1adebca1fc Style
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2021-10-14 23:07:32 +03:00
Brooks Davis
04c91ac48a selsocket: handle sopoll() errors correctly
Without this change, unmounting smbfs filesystems with an INVARIANTS
kernel would panic after 10e64782ed.

Found by:	markj
Reviewed by:	markj, jhb
Obtained from:	CheriBSD
MFC after:	3 days
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D32492
2021-10-14 00:43:48 +01:00
John Baldwin
9f03d2c001 ktls: Ensure FIFO encryption order for TLS 1.0.
TLS 1.0 records are encrypted as one continuous CBC chain where the
last block of the previous record is used as the IV for the next
record.  As a result, TLS 1.0 records cannot be encrypted out of order
but must be encrypted as a FIFO.

If the later pages of a sendfile(2) request complete before the first
pages, then TLS records can be encrypted out of order.  For TLS 1.1
and later this is fine, but this can break for TLS 1.0.

To cope, add a queue in each TLS session to hold TLS records that
contain valid unencrypted data but are waiting for an earlier TLS
record to be encrypted first.

- In ktls_enqueue(), check if a TLS record being queued is the next
  record expected for a TLS 1.0 session.  If not, it is placed in
  sorted order in the pending_records queue in the TLS session.

  If it is the next expected record, queue it for SW encryption like
  normal.  In addition, check if this new record (really a potential
  batch of records) was holding up any previously queued records in
  the pending_records queue.  Any of those records that are now in
  order are also placed on the queue for SW encryption.

- In ktls_destroy(), free any TLS records on the pending_records
  queue.  These mbufs are marked M_NOTREADY so were not freed when the
  socket buffer was purged in sbdestroy().  Instead, they must be
  freed explicitly.

Reviewed by:	gallatin, markj
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D32381
2021-10-13 12:30:15 -07:00
John Baldwin
a63752cce6 ktls: Reject attempts to enable AES-CBC with TLS 1.3.
AES-CBC cipher suites are not supported in TLS 1.3.

Reported by:	syzbot+ab501c50033ec01d53c6@syzkaller.appspotmail.com
Reviewed by:	tuexen, markj
Differential Revision:	https://reviews.freebsd.org/D32404
2021-10-13 12:12:58 -07:00
Mark Johnston
03d5820f73 mount: Check for !VDIR mount points before handling -o emptydir
To implement -o emptydir, vfs_emptydir() checks that the passed
directory is empty.  This should be done after checking whether the
vnode is of type VDIR, though, or vfs_emptydir() may end up calling
VOP_READDIR on a non-directory.

Reported by:	syzbot+4006732c69fb0f792b2c@syzkaller.appspotmail.com
Reviewed by:	kib, imp
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D32475
2021-10-13 09:33:35 -04:00
John Baldwin
d1b6fef075 Stop creating socket aio kprocs during boot.
Create the initial pool of kprocs on demand when the first socket AIO
request is submitted instead.  The pool of kprocs used for other AIO
requests is similarly created on first use.

Reviewed by:	asomers
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D32468
2021-10-12 14:03:07 -07:00
Kyle Evans
7259ca3104 fifos: delegate unhandled kqueue filters to underlying filesystem
This gives the vfs layer a chance to provide handling for EVFILT_VNODE,
for instance.  Change pipe_specops to use the default vop_kqfilter to
accommodate fifoops that don't specify the method (i.e. all in-tree).

Based on a patch by Jan Kokemüller.

PR:		225934
Reviewed by:	kib, markj (both pre-KASSERT)
Differential Revision:	https://reviews.freebsd.org/D32271
2021-10-12 02:43:07 -05:00
Greg V
98dae405de O_PATH: allow vfs_extattr syscalls
These calls do operate on vnodes only, not file contents.
This is useful for e.g. the xdg-document-portal fuse filesystem.

Reviewed by:	kib, markj
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D32438
2021-10-11 20:09:49 +03:00
Mateusz Guzik
2b68eb8e1d vfs: remove thread argument from VOP_STAT
and fo_stat.
2021-10-11 13:22:32 +00:00
Mateusz Guzik
b4a58fbf64 vfs: remove cn_thread
It is always curthread.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D32453
2021-10-11 13:21:47 +00:00
Andrew Turner
a85ce4ad72 Add pmap_change_prot on arm64
Support changing the protection of preloaded kernel modules by
implementing pmap_change_prot on arm64 and calling it from
preload_protect.

Reviewed by:	alc (previous version)
Sponsored by:	The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D32026
2021-10-11 10:26:45 +01:00
Mateusz Guzik
93e0523499 vfs: add predicts to getvnode and getvnode_path 2021-10-10 18:24:29 +00:00
Mateusz Guzik
a0558fe90d Retire code added to support CloudABI
CloudABI was removed in cf0ee8738e
2021-10-10 18:24:29 +00:00
Konstantin Belousov
5fb54d2fc8 readlinkat(2): allow O_PATH fd
PR:	258856
Reported by:	ashish
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D32390
2021-10-09 22:31:37 +03:00
Mark Johnston
fa9da1f590 timecounter: Let kern.timecounter.stepwarnings be set as a tunable
MFC after:	1 week
2021-10-09 12:34:06 -04:00
Konstantin Belousov
b5cadc643e Make core dump writes interruptible with SIGKILL
This can be disabled by sysctl kern.core_dump_can_intr

Reported and tested by:	pho
Reviewed by:	imp, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D32313
2021-10-08 03:21:43 +03:00
Konstantin Belousov
244ab56611 Add curproc_sigkilled()
Function returns an indicator that the process was killed with SIGKILL

Reviewed by:	imp, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D32313
2021-10-08 03:21:43 +03:00
Konstantin Belousov
dc2d0899bb kern_sig.c: Remove unused SIGPROP_CANTMASK
Reviewed by:	imp, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D32313
2021-10-08 03:21:42 +03:00
Mark Johnston
880b670c6f malloc: Unmark KASAN redzones if the full allocation size was requested
Consumers that want the full allocation size will typically access the
full buffer, so mark the entire allocation as valid to avoid useless
KASAN reports.

Sponsored by:	The FreeBSD Foundation
2021-10-06 16:09:41 -04:00
Konstantin Belousov
9b86d3e5de When queuing ignored signal, only abort target thread' sleep if it is inside sigwait()
Reported and tested by:	trasz
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D32252
2021-10-06 17:05:22 +03:00
Konstantin Belousov
f17eb93d55 When sending ignored signal, arrange for zero return code from sleep
Otherwise consumers get unexpected EINTR errors without seeing
a properly discarded signal.

Reported and tested by:	trasz
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D32252
2021-10-06 17:05:22 +03:00
Konstantin Belousov
b599982b65 Move td_pflags2 TDP2_SIGWAIT to td_flags TDF_SIGWAIT
The flag should be accessible from non-current threads.

Reviewed by:	markj
Tested by:	trasz
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D32252
2021-10-06 17:05:22 +03:00
Alexander Motin
7835b2cb4a sbuf(9): Microoptimize sbuf_put_byte()
This function is actively used by sbuf_vprintf(), so this simple
inlining in half reduces time of kern.geom.confxml generation.

MFC after:	2 weeks
Sponsored by:	iXsystem, Inc.
2021-10-05 14:47:38 -04:00
Alexander Motin
6df1359e55 sleepqueue(9): Remove sbinuptime() from sleepq_timeout().
Callout c_time is always bigger or equal than the scheduled time.  It
is also smaller than sbinuptime() and can't change while the callback
is running.  So we reliably can use it instead of sbinuptime() here.
In case there was a race and the callout was rescheduled to the later
time, the callback will be called again.

According to profiles it saves ~5% of the timer interrupt time even
with fast TSC timecounter.

MFC after:	1 month
2021-10-02 21:08:41 -04:00
Alexander Motin
1c119e173d sched_ule(4): Fix possible significance loss.
Before this change kern.sched.interact sysctl setting above 32 gave
all interactive threads identical priority of PRI_MIN_INTERACT due to
((PRI_MAX_INTERACT - PRI_MIN_INTERACT + 1) / sched_interact) turning
zero.  Setting the sysctl lower reduced the range of used priority
levels up to half, that is not great either.

Change of the operations order should fix the issue, always using full
range of priorities, while overflow is impossible there since both
score and priority values are small.  While there, make the variables
unsigned as they really are.

MFC after:	1 month
2021-10-02 00:09:45 -04:00
Mateusz Guzik
c9536389d7 vfs: hoist cn_thread assert in namei
Making it condtional on whether ktrace happens to be enabled makes no
sense.
2021-10-01 21:56:29 +00:00
Gleb Smirnoff
a37e4fd1ea Re-style dfcef87714 to keep the code and variables related to
listening sockets separated from code for generic sockets.

No objection:	markj
2021-10-01 13:38:24 -07:00
Justin Hibbits
63cb9308a7 Fix segment size in compressing core dumps
A core segment is bounded in size only by memory size.  On 64-bit
architectures this means a segment can be much larger than 4GB.
However, compress_chunk() takes only a u_int, clamping segment size to
4GB-1, resulting in a truncated core.  Everything else, including the
compressor internally, uses size_t, so use size_t at the boundary here.

This dates back to the original refactor back in 2015 (r279801 /
aa14e9b7).

MFC after:	1 week
Sponsored by:	Juniper Networks, Inc.
2021-10-01 14:16:33 -05:00
Kyle Evans
2f4dbe279f kqueue: fix recent assertion
NOTE_ABSTIME may also have a zero timeout, which indicates that we
should still fire immediately as an absolute time in the past.  A test
has been added for this one as well.

Fixes:	9c999a259f ("kqueue: don't arbitrarily restrict long-past...")
Point hat:	kevans
Reported by:	syzbot+1c8d1154f560b3930042@syzkaller.appspotmail.com
2021-10-01 13:17:30 -05:00
Kyle Evans
9c999a259f kqueue: don't arbitrarily restrict long-past values for NOTE_ABSTIME
NOTE_ABSTIME values are converted to values relative to boottime in
filt_timervalidate(), and negative values are currently rejected.  We
don't reject times in the past in general, so clamp this up to 0 as
needed such that the timer fires immediately rather than imposing what
looks like an arbitrary restriction.

Another possible scenario is that the system clock had to be adjusted
by ~minutes or ~hours and we have less than that in terms of uptime,
making a reasonable short-timeout suddenly invalid. Firing it is still
a valid choice in this scenario so that applications can at least
expect a consistent behavior.

Reviewed by:	kib, markj
Discussed with:	allanjude
Differential Revision:	https://reviews.freebsd.org/D32230
2021-09-30 21:31:24 -05:00
Mateusz Guzik
85c855d31b fd: add pwd_hold_proc 2021-09-30 12:49:51 +02:00
Mitchell Horne
ab4ed843a3 minidump: De-duplicate the progress bar
The implementation of the progress bar is simple, but duplicated for
most minidump implementations. Extract the common bits to kern_dump.c.
Ensure that the bar is reset with each subsequent dump; this was only
done on some platforms previously.

Reviewed by:	markj
MFC after:	2 weeks
Sponsored by:	Juniper Networks, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D31885
2021-09-29 16:42:21 -03:00
Jamie Gritton
747a47261e Fix error return of kern.ipc.posix_shm_list, which caused it (and thus
"posixshmcontrol ls") to fail for all jails that didn't happen to own
the last shm object in the list.
2021-09-29 10:20:36 -07:00
Mitchell Horne
800e74955d boot(9): update to match reality
This function was renamed to kern_reboot() in 2010, but the man page has
failed to keep in sync. Bring it up to date on the rename, add the
shutdown hooks to the synopsis, and document the (obvious) fact that
kern_reboot() does not return.

Fix an outdated reference to the old name in kern_reboot(), and leave a
reference to the man page so future readers might find it before any
large changes.

Reviewed by:	imp, markj
MFC after:	3 days
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D32085
2021-09-28 11:36:09 -03:00
Kirk McKusick
1c8d670cb6 Bring the tags and links entries for amd64 up to date.
MFC after:    1 week
Sponsored by: Netflix
2021-09-27 20:04:51 -07:00
Elliott Mitchell
bcddaadbef rman: fix overflow in rman_reserve_resource_bound()
If the default range of [0, ~0] is given, then (~0 - 0) + 1 == 0. This
in turn will cause any allocation of non-zero size to fail. Zero-sized
allocations are prohibited, so add a KASSERT to this effect.

History indicates it is part of the original rman code.  This bug may in
fact be older than some contributors.

Reviewed by:	mhorne
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D30280
2021-09-27 14:38:26 -03:00
Alexander Motin
08063e9f98 sched_ule(4): Fix hang with steal_thresh < 2.
e745d729be caused infinite loop with interrupts disabled in load
stealing code if steal_thresh set below 2.  Such configuration should
not generally be used, but appeared some people are using it to
workaround some problems.

To fix the problem explicitly pass to sched_highest() minimum number
of transferrable threads, supported by the caller, instead of guessing.

MFC after:	25 days
2021-09-26 12:03:05 -04:00
Gordon Bergling
8771ff7538 jail(9): Fix a typo in a comment
- s/erorr/error/

MFC after:	3 days
2021-09-26 15:17:41 +02:00
Yoshihiro Ota
cb17f4a6bd kern_ctf: Use zlib's uncompress function for simpler code.
Reviewed by:	markj, delphij
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D21531
2021-09-25 23:33:00 -07:00
Mateusz Guzik
d71e1a883c fifo: support flock
This evens it up with Linux.

Original patch by:	Greg V <greg@unrelenting.technology>
Differential Revision:	https://reviews.freebsd.org/D24255#565302
2021-09-25 14:58:31 +00:00
Konstantin Belousov
71d31f1cf6 malloc_aligned(9): allow zero size and alignment
For alignment we do not need to do anything to make it operational.
For size, upgrade zero sized request to one byte so that we do not
request insane amount of memory for placeholder.

Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D32127
2021-09-25 15:58:12 +03:00
Gordon Bergling
2aad906266 ubsan: Fix a typo in an error message
- s/asumption/assumption/

Obtained from:	NetBSD
MFC after:	1 week
2021-09-25 11:47:24 +02:00
Alexander Motin
d3a8f98acb Make CPU children explicitly share parent unit numbers.
Before this device unit number match was coincidental and broke if I
disabled some CPU device(s).  Aside of cosmetics, for some drivers
(may be considered broken) it caused talking to wrong CPUs.
2021-09-24 23:31:51 -04:00
Alexander Motin
f73c2bbf81 bus: Cleanup device_probe_child()
When device driver probe method returns 0, i.e. absolute priority, do
not remove its class from the device just to set it back few lines
later, that may change the device unit number, etc. and after which
we'd better call the probe again.

If during search we found some driver with absolute priority, we do
not need to set device driver and class since we haven't removed them
before.

It should not happen, but if second probe method call failed, remove
the driver and possibly the class from the device as it was when we
started.

Reviewed by:	imp, jhb
Differential Revision:	https://reviews.freebsd.org/D32125
2021-09-24 20:34:56 -04:00
Warner Losh
67a9e76da6 bus: Fix LINT / BUS_DEBUG build
Fix 0389e9be63 for LINT built. Removed an arg only from code
under BUS_DEBUG w/o rebuilding LINT...

Sponsored by:		Netflix
Fixes: 0389e9be63
2021-09-24 14:04:39 -06:00
Warner Losh
0389e9be63 bus: retire DF_REBID
I did DF_REBID to allow for 'hoover' drivers that would attach to
otherwise unattached devices in the tree. This notion didn't catch on as
it was tricky to make work well and it was easier to just publish a /dev
node of some flavor by the parent device. It's been nothing but dead
weight for a long time.

Reviewed by:		mav
Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D32056
2021-09-24 12:15:34 -06:00
Nathaniel Wesley Filardo
0321a7990b kqueue: Add EV_KEEPUDATA flag
When this flag is set, operations that update an existing kevent will
not change the udata field.  This can be used to NOTE_TRIGGER or
EV_{EN,DIS}ABLE events without overwriting the stashed pointer.

Reviewed by:	Domagoj Stolfa <domagoj.stolfa@gmail.com>
Obtained from:	CheriBSD
Sponsored by:	Microsoft
Differential Revision:	https://reviews.freebsd.org/D30286
2021-09-23 17:31:39 -07:00
Konstantin Belousov
45c2c7c484 aio_aqueue(): avoid ucred leak on failure path
PR:	258698
Submitted by:	sigsys@gmail.com
MFC after:	1 week
2021-09-24 03:18:34 +03:00
Alexander Motin
ef50d5fbc3 x86: Add NUMA nodes into CPU topology.
Depending on hardware, NUMA nodes may match last level caches, or
they may be above them (AMD Zen 2/3) or below (Intel Xeon w/ SNC).
This information is provided by ACPI instead of CPUID, and it is
provided for each CPU individually instead of mask widths, but
this code should be able to properly handle all the above cases.

This change should immediately allow idle stealing in sched_ule(4)
to prefer load from NUMA-local CPUs to remote ones when the node
does not match LLC.  Later we may think of how to better handle it
on sched_pickcpu() side.

MFC after:	1 month
2021-09-23 14:31:38 -04:00
Wojciech Macek
7bc13692a2 hwpmc: fix performance issues
Differential revision:	https://reviews.freebsd.org/D32025

Avoid using atomics as it_wait is guarded by td_lock.

Report threshold calculation is done only if at least one PMC hook
is installed

Fixes:
* avoid unnecessary branching (if frame != null ...)
  by having PMC_HOOK_INSTALLED_ANY
  condition on the top of them, which should hint
  the core not to execute speculatively anything
  which us underneath;
* access intr_hwpmc_waiting_report_threshold cacheline
  only if at least one hook is loaded;
2021-09-23 07:15:42 +02:00
Alexander Motin
884f38590c Fix false device_set_unit() error.
It should silently succeed if the current unit number is the same as
requested, not fail immediately.

MFC after:	1 week
2021-09-22 08:44:39 -04:00
Alexander Motin
8db1669959 Fix build without SMP.
MFC after:	1 month
2021-09-21 22:13:33 -04:00
Alexander Motin
e745d729be sched_ule(4): Improve long-term load balancer.
Before this change long-term load balancer was unable to migrate
running threads, only ones waiting on run queues.  But with growing
number of CPU cores it is quite typical now for system to not have
many waiting threads.  But same time if due to some coincidence two
long-running CPU-bound threads ended up sharing same physical CPU
core, they could suffer from the SMT penalty indefinitely, and the
load balancer couldn't help.

Improve that by teaching the load balancer to hint running threads
to migrate by marking them with TDF_NEEDRESCHED and new TDF_PICKCPU
flag, making sched_pickcpu() to search for better CPU later, when
it is convenient.

Fix CPU search logic when balancing to limit round-robin migrations
in case of almost equal load to the group of physical cores.  The
previous code bounced threads across all the system, that should be
pretty bad for caches and NUMA affinity, while additional fairness
was almost invisible, diminishing with number of cores in the group.

MFC after:	1 month
2021-09-21 18:19:20 -04:00
Konstantin Belousov
397f188936 Remove SV_CAPSICUM
It was only needed for cloudabi

Reviewed by:	emaste
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D31923
2021-09-22 00:18:44 +03:00
Konstantin Belousov
cf0ee8738e Drop cloudabi
According to https://github.com/NuxiNL/cloudlibc:
CloudABI is no longer being maintained. It was an awesome experiment,
but it never got enough traction to be sustainable.

There is no reason to keep it in FreeBSD.

Approved by:	ed (private mail)
Reviewed by:	emaste
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D31923
2021-09-22 00:18:44 +03:00
Alexander Motin
bd84094a51 sched_ule(4): Fix interactive threads stealing.
In scenarios when first thread in the queue can migrate to specified
CPU, but later ones can't runq_steal_from() incorrectly returned NULL.

MFC after:	2 weeks
2021-09-21 16:03:32 -04:00
Konstantin Belousov
bd9e0f5df6 amd64: eliminate td_md.md_fpu_scratch
For signal send, copyout from the user FPU save area directly.

For sigreturn, we are in sleepable context and can do temporal
allocation of the transient save area.  We cannot copying from userspace
directly to user save area because XSAVE state needs to be validated,
also partial copyins can corrupt it.

Requested by:	jhb
Reviewed by:	jhb, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D31954
2021-09-21 20:20:15 +03:00
Konstantin Belousov
df8dd6025a amd64: stop using top of the thread' kernel stack for FPU user save area
Instead do one more allocation at the thread creation time.  This frees
a lot of space on the stack.

Also do not use alloca() for temporal storage in signal delivery sendsig()
function and signal return syscall sys_sigreturn().  This saves equal
amount of space, again by the cost of one more allocation at the thread
creation time.

A useful experiment now would be to reduce KSTACK_PAGES.

Reviewed by:	jhb, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D31954
2021-09-21 20:20:15 +03:00
Konstantin Belousov
2933a7ca03 aio_fsync_vnode: handle ERELOOKUP after VOP_FSYNC()
Reported by:	tmunro
Reviewed by:	jhb, tmunro
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D32023
2021-09-20 21:40:17 +03:00
Konstantin Belousov
922bee44e4 aio_fsync_vnode: use for(;;) loop instead of label
Reviewed by:	jhb, tmunro
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D32023
2021-09-20 21:39:46 +03:00
Bartlomiej Grzesik
3f9a00e3b5 device: add device_get_property and device_has_property
Generialize bus specific property accessors. Those functions allow driver code
to access device specific information.

Currently there is only support for FDT and ACPI buses.

Reviewed by: manu, mw
Sponsored by: Semihalf
Differential revision: https://reviews.freebsd.org/D31597
2021-09-20 17:17:57 +02:00
Mateusz Guzik
7b2ac8eb9b vfs: add missing VIRF_MOUNTPOINT in vfs_mountroot_shuffle
Reported by:	mav
2021-09-18 21:13:51 +02:00
Mateusz Guzik
0d9e99ce3b vfs: add the missing vnode interlock in vfs_mountroot_shuffle
Around v_mountedhere assignment.
2021-09-18 21:13:51 +02:00
Mark Johnston
50b07c1f71 unix: Fix a use-after-free in unp_drop()
We need to load the socket pointer after locking the PCB, otherwise
the socket may have been detached and freed by the time that unp_drop()
sets so_error.

This previously went unnoticed as the socket zone was _NOFREE.

Reported by:	pho
MFC after:	1 week
2021-09-18 10:38:39 -04:00
Mateusz Guzik
f902e4bb04 lockmgr: fix lock profiling of face adaptive spinning 2021-09-18 10:16:58 +00:00
Mateusz Guzik
a2cb65b8fe cache: count vnodes in cache_purgevfs 2021-09-18 10:16:50 +00:00
Mateusz Guzik
5d8e32a66c vfs: retire VNODE_REFCOUNT_FENCE_* macros
They are unused as of last year.
2021-09-18 10:16:00 +00:00
Mark Johnston
2bd9826995 vfs: Permit unix sockets to be opened with O_PATH
As with FIFOs, a path descriptor for a unix socket cannot be used with
kevent().

In principle connectat(2) and bindat(2) could be modified to support an
AT_EMPTY_PATH-like mode which operates on the socket referenced by an
O_PATH fd referencing a unix socket.  That would eliminate the path
length limit imposed by sockaddr_un.

Update O_PATH tests.

Reviewed by:	kib
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31970
2021-09-17 14:19:06 -04:00
Mark Johnston
ade1daa5c0 socket: Synchronize soshutdown() with listen(2) and AIO
To handle shutdown(SHUT_RD) we flush the receive buffer of the socket.
This may involve searching for control messages of type SCM_RIGHTS,
since we need to close the file references.  Closing arbitrary files
with socket buffer locks held is undesirable, mainly due to lock
ordering issues, so we instead make a copy of the socket buffer and
operate on that without any locks.  Fields in the original buffer are
cleared.

This behaviour clobbered the AIO job queue associated with a receive
buffer.  It could also cause us to leak a KTLS session reference.
Reorder socket buffer fields to address this.

An alternate solution would be to remove the hack in sorflush(), but
this is not quite feasible (yet).  In particular, though sorflush()
flags the sockbuf with SBS_CANTRCVMORE, it is possible for more data to
be queued - the flag just prevents userspace from reading more data.  I
suspect we should fix this; SBS_CANTRCVMORE represents a terminal state
and protocols can likely just drop any data destined for such a buffer.
Many of them already do, but in some cases the check is racy, and some
KPI churn will be needed to fix everything.  This approach is more
straightforward for now.

Reported by:	syzbot+104d8ee3430361cb2795@syzkaller.appspotmail.com
Reported by:	syzbot+5bd2e7d05f84a59d0d1b@syzkaller.appspotmail.com
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31976
2021-09-17 14:19:06 -04:00
Mark Johnston
883761f0a8 socket: Remove NOFREE from the socket zone
This flag was added during the transition away from the legacy zone
allocator, commit c897b81311.  The old
zone allocator effectively provided _NOFREE semantics, but it seems that
they are not required for sockets.  In particular, we use reference
counting to keep sockets live.

One somewhat dangerous case is sonewconn(), which returns a pointer to a
socket with reference count 0.  This socket is still effectively owned
by the listening socket.  Protocols must therefore be careful to
synchronize sonewconn() calls with their pru_close implementations,
since for listening sockets soclose() will abort the child sockets.  For
example, TCP holds the listening socket's PCB read locked across the
sonewconn() call, which blocks tcp_usr_close(), and sofree()
synchronizes with a concurrent soabort() of the nascent socket.
However, _NOFREE semantics are not required here.

Eliminating _NOFREE has several benefits: it enables use-after-free
detection (e.g., by KASAN) and lets the system reclaim memory from the
socket zone under memory pressure.  No functional change intended.

Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31975
2021-09-17 14:19:06 -04:00
Mark Johnston
6b288408ca socket: Add assertions around naked refcount decrements
Sockets in a listen queue hold a reference to the parent listening
socket.  Several code paths release this reference manually when moving
a child socket out of the queue.

Replace comments about the expected post-decrement refcount value with
assertions.  Use refcount_load() instead of a plain load.  No functional
change intended.

MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31974
2021-09-17 14:19:06 -04:00
Mark Johnston
dfcef87714 socket: Fix a use-after-free in soclose()
After releasing the fd reference to a socket "so", we should avoid
testing SOLISTENING(so) since the socket may have been freed.  Instead,
directly test whether the list of unaccepted sockets is empty.

Fixes:		f4bb1869dd ("Consistently use the SOLISTENING() macro")
Pointy hat:	markj
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31973
2021-09-17 14:19:05 -04:00
Mark Johnston
bf25678226 ktls: Fix error/mode confusion in TCP_*TLS_MODE getsockopt handlers
ktls_get_(rx|tx)_mode() can return an errno value or a TLS mode, so
errors are effectively hidden.  Fix this by using a separate output
parameter.  Convert to the new socket buffer locking macros while here.

Note that the socket buffer lock is not needed to synchronize the
SOLISTENING check here, we can rely on the PCB lock.

Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31977
2021-09-17 14:19:05 -04:00
Mark Johnston
40fcdb9366 kcov: Disable address and memory sanitizers in get_kinfo()
get_kinfo() is only called from the coverage sanitizer callbacks, which
are similarly uninstrumented.

Sponsored by:	The FreeBSD Foundation
2021-09-17 14:19:05 -04:00
Konstantin Belousov
197a4f29f3 buffer pager: allow get_blksize method to return error
Reported and reviewed by:	asomers
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D31998
2021-09-17 20:29:55 +03:00
Konstantin Belousov
796a8e1ad1 procctl(2): Add PROC_WXMAP_CTL/STATUS
It allows to override kern.elf{32,64}.allow_wx on per-process basis.
In particular, it makes it possible to run binaries without PT_GNU_STACK
and without elfctl note while allow_wx = 0.

Reviewed by:	brooks, emaste, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D31779
2021-09-17 15:42:01 +03:00
Konstantin Belousov
f575573ca5 Remove PT_GET_SC_ARGS_ALL
Reimplement bdf0f24bb1 by checking for the caller' ABI in
the implementation of PT_GET_SC_ARGS, and copying out everything if
it is Linuxolator.

Also fix a minor information leak: if PT_GET_SC_ARGS_ALL is done on the
thread reused after other process, it allows to read some number of that
thread last syscall arguments. Clear td_sa.args in thread_alloc().

Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D31968
2021-09-16 20:11:27 +03:00
Piotr Pawel Stefaniak
6e8272f317 mount: improve error message for invalid filesystem names
For an invalid filesystem name used like this:
mount -t asdfs /dev/ada1p5 /usr/obj

emit an error message like this:
mount: /dev/ada1p5: Invalid fstype: Invalid argument

instead of:
mount: /dev/ada1p5: Operation not supported by device

Differential Revision:	https://reviews.freebsd.org/D31540
2021-09-15 16:25:31 +02:00
Edward Tomasz Napierala
bdf0f24bb1 linux: implement PTRACE_GET_SYSCALL_INFO
This is one of the pieces required to make modern (ie Focal)
strace(1) work.

Reviewed By:	jhb (earlier version)
Sponsored by:	EPSRC
Differential Revision:	https://reviews.freebsd.org/D28212
2021-09-14 20:19:55 +00:00
John Baldwin
c782ea8bb5 Add a switch structure for send tags.
Move the type and function pointers for operations on existing send
tags (modify, query, next, free) out of 'struct ifnet' and into a new
'struct if_snd_tag_sw'.  A pointer to this structure is added to the
generic part of send tags and is initialized by m_snd_tag_init()
(which now accepts a switch structure as a new argument in place of
the type).

Previously, device driver ifnet methods switched on the type to call
type-specific functions.  Now, those type-specific functions are saved
in the switch structure and invoked directly.  In addition, this more
gracefully permits multiple implementations of the same tag within a
driver.  In particular, NIC TLS for future Chelsio adapters will use a
different implementation than the existing NIC TLS support for T6
adapters.

Reviewed by:	gallatin, hselasky, kib (older version)
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D31572
2021-09-14 11:43:41 -07:00
Mark Johnston
cf4670fe0b kcov: Integrate with KMSAN
- kern_kcov.c needs to be compiled with -fsanitize=kernel-memory when
  KMSAN is configured since it calls into various other subsystems.
- Disable address and memory sanitizers in kcov(4)'s coverage sanitizer
  callbacks, as they do not provide useful checking.  Moreover, with
  KMSAN we may otherwise get false positives since the caller (coverage
  sanitizer runtime) is not instrumented.
- Disable KASAN and KMSAN interceptors in subr_coverage.c, as they do
  not provide any benefit but do introduce overhead when fuzzing.

Sponsored by:	The FreeBSD Foundation
2021-09-14 14:29:27 -04:00
Mark Johnston
fa0463c384 socket: De-duplicate SBLOCKWAIT() definitions
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2021-09-14 09:01:32 -04:00
Wojciech Macek
6fa041d7f1 Measure latency of PMC interruptions
Add HWPMC events to measure latency.
Provide sysctl to choose the number of outstanding events which
trigger HWPMC event.

Obtained from:		Semihalf
Sponsored by:		Stormshield
Differential revision:	https://reviews.freebsd.org/D31283
2021-09-13 06:08:32 +02:00
Mark Johnston
b864b67a0d socket: Do not include control messages in FIONREAD return value
Some system software expects to be able to read at least the number of
bytes returned by FIONREAD.  When control messages are counted in this
return value, this assumption is violated.  Follow Linux and OpenBSD
here (as well as our own kevent(EVFILT_READ)) and only return the number
of data bytes available.

Reported by:	avg
MFC after:	2 weeks
2021-09-12 16:39:44 -04:00
Mark Johnston
2884918c73 aio: Fix up the opcode in aiocb32_copyin()
With lio_listio(2), the opcode is specified by userspace rather than
being hard-coded by the system call (e.g., aio_readv() -> LIO_READV).
kern_lio_listio() calls aio_aqueue() with an opcode of LIO_NOP, which
gets fixed up when the aiocb is copied in.

When copying in a job request for vectored I/O, we need to dynamically
allocate a uio to wrap an iovec.  So aiocb_copyin() needs to get the
opcode from the aiocb and then decide whether an allocation is required.
We failed to do this in the COMPAT_FREEBSD32 case.  Fix it.

Reported by:	syzbot+27eab6f2c2162f2885ee@syzkaller.appspotmail.com
Reviewed by:	kib, asomers
Fixes:	f30a1ae8d5 ("lio_listio(2):  Allow LIO_READV and LIO_WRITEV.")
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31914
2021-09-11 12:58:41 -04:00
Mark Johnston
141fe2dcee aio: Interlock with listen(2)
soo_aio_queue() did not handle the possibility that the provided socket
is a listening socket.  Up until recently, to fix this one would have to
acquire the socket lock first and check, since the socket buffer locks
were destroyed by listen(2).

Now that the socket buffer locks belong to the socket, simply check
SOLISTENING(so) after acquiring them, and make listen(2) return an error
if any AIO jobs are enqueued on the socket.

Add a couple of simple regression test cases.

Note that this fixes things only for the default AIO implementation;
cxgbe(4)'s TCP offload has a separate pru_aio_queue implementation which
requires its own solution.

Reported by:	syzbot+c8aa122fa2c6a4e2a28b@syzkaller.appspotmail.com
Reported by:	syzbot+39af117d43d4f0faf512@syzkaller.appspotmail.com
Reported by:	syzbot+60cceb9569145a0b993b@syzkaller.appspotmail.com
Reported by:	syzbot+2d522c5db87710277ca5@syzkaller.appspotmail.com
Reviewed by:	tuexen, gallatin, jhb
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31901
2021-09-10 17:21:11 -04:00
Mark Johnston
b1e6a792d6 net: Enter a net epoch around protocol if_up/down notifications
When traversing a list of interface addresses, we need to be in a net
epoch section, and protocol ctlinput routines need a stable reference to
the address.

Reported by:	syzbot+3219af764ead146a3a4e@syzkaller.appspotmail.com
Reviewed by:	kp, melifaro
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31889
2021-09-10 09:07:40 -04:00
Mark Johnston
187afc5879 osd: Fix racy assertions
osd_register(9) may reallocate and expand the destructor array for a
given object type if no space is available for a new key.  This happens
with the object lock held.  Thus, when verifying that a given slot in
the array is occupied, we need to hold the object lock to avoid racing
with a reallocation.

Reported by:	syzbot+69ce54c7d7d813315dd3@syzkaller.appspotmail.com
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2021-09-09 10:11:02 -04:00
Rick Macklem
c5128c48df VOP_COPY_FILE_RANGE: Add a COPY_FILE_RANGE_TIMEO1SEC flag
Although it is not specified in the RFCs, the concept that
the NFSv4 server should reply to an RPC request within a
reasonable time is accepted practice within the NFSv4 community.

Without this patch, the NFSv4.2 server attempts to reply to
a Copy operation within 1second by limiting the copy to
vfs.nfs.maxcopyrange bytes (default 10Mbytes). This is crude at
best, given the large variation in I/O subsystem performance.

This patch adds a kernel only flag COPY_FILE_RANGE_TIMEO1SEC
that the NFSv4.2 can specify, which tells VOP_COPY_FILE_RANGE()
to return after approximately 1 second with a partial result and
implements this in vn_generic_copy_file_range(), used by
vop_stdcopyfilerange().

Modifying the NFSv4.2 server to set this flag will be done in
a separate patch.  Also under consideration is exposing the
COPY_FILE_RANGE_TIMEO1SEC to userland for use on the FreeBSD
copy_file_range(2) syscall.

MFC after:	2 weeks
Reviewed by:	khng
Differential Revision:	https://reviews.freebsd.org/D31829
2021-09-07 17:35:26 -07:00
Mark Johnston
a8aa6f1f78 socket: Avoid clearing SS_ISCONNECTING if soconnect() fails
This behaviour appears to date from the 4.4 BSD import.  It has two
problems:

1. The update to so_state is not protected by the socket lock, so
   concurrent updates to so_state may be lost.
2. Suppose two threads race to call connect(2) on a socket, and one
   succeeds while the other fails.  Then the failing thread may
   incorrectly clear SS_ISCONNECTING, confusing the state machine.

Simply remove the update.  It does not appear to be necessary:
pru_connect implementations which call soisconnecting() only do so after
all failure modes have been handled.  For instance, tcp_connect() and
tcp6_connect() will never return an error after calling soisconnected().
However, we cannot correctly assert that SS_ISCONNECTED is not set after
an error from soconnect() since the socket lock is not held across the
pru_connect call, so a concurrent connect(2) may have set the flag.

MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31699
2021-09-07 17:12:09 -04:00
Mark Johnston
523d58aad1 socket: Remove unneeded SOLISTENING checks
Now that SOCK_IO_*_LOCK() checks for listening sockets, we can eliminate
some racy SOLISTENING() checks.  No functional change intended.

Reviewed by:	tuexen
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31660
2021-09-07 17:12:09 -04:00
Mark Johnston
bd4a39cc93 socket: Properly interlock when transitioning to a listening socket
Currently, most protocols implement pru_listen with something like the
following:

	SOCK_LOCK(so);
	error = solisten_proto_check(so);
	if (error) {
		SOCK_UNLOCK(so);
		return (error);
	}
	solisten_proto(so);
	SOCK_UNLOCK(so);

solisten_proto_check() fails if the socket is connected or connecting.
However, the socket lock is not used during I/O, so this pattern is
racy.

The change modifies solisten_proto_check() to additionally acquire
socket buffer locks, and the calling thread holds them until
solisten_proto() or solisten_proto_abort() is called.  Now that the
socket buffer locks are preserved across a listen(2), this change allows
socket I/O paths to properly interlock with listen(2).

This fixes a large number of syzbot reports, only one is listed below
and the rest will be dup'ed to it.

Reported by:	syzbot+9fece8a63c0e27273821@syzkaller.appspotmail.com
Reviewed by:	tuexen, gallatin
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31659
2021-09-07 17:11:43 -04:00
Mark Johnston
f94acf52a4 socket: Rename sb(un)lock() and interlock with listen(2)
In preparation for moving sockbuf locks into the containing socket,
provide alternative macros for the sockbuf I/O locks:
SOCK_IO_SEND_(UN)LOCK() and SOCK_IO_RECV_(UN)LOCK().  These operate on a
socket rather than a socket buffer.  Note that these locks are used only
to prevent concurrent readers and writters from interleaving I/O.

When locking for I/O, return an error if the socket is a listening
socket.  Currently the check is racy since the sockbuf sx locks are
destroyed during the transition to a listening socket, but that will no
longer be true after some follow-up changes.

Modify a few places to check for errors from
sblock()/SOCK_IO_(SEND|RECV)_LOCK() where they were not before.  In
particular, add checks to sendfile() and sorflush().

Reviewed by:	tuexen, gallatin
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31657
2021-09-07 15:06:48 -04:00
Warner Losh
ecfbb2e302 genoffset.sh: Use 10 X's instead of 5 for pick mkdtemp implementations
Linux fails to build now because the mkdtemp in the bootstrapped
environment wants 6 or more X's. Use 10 out of an abundance of caution.

Sponsored by:		Netflix
Reviewed by:		arichards
Differential Revision:	https://reviews.freebsd.org/D31863
2021-09-07 10:08:51 -06:00
Konstantin Belousov
98168a6e6c kqueue: drain kqueue taskqueue if syscall tickled it
Otherwise return from the syscall and next syscall, which could be
kevent(2) on the kqueue that should be notified, races with the kqueue
taskqueue thread, and potentially misses the wakeup.  This is reliably
visible when kevent(2) only peeks into events using zeroed timeout.

PR:	258310
Reported by:	arichardson, Jan Kokemüller <jan.kokemueller@gmail.com>
Reviewed by:	arichardson, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D31858
2021-09-07 02:43:34 +03:00
Colin Percival
bd11e253a9 Add _sleep to TSLOG
Most of the nvme initialization time in my tests is being spent here
(via pause_sbt).
2021-09-05 12:50:15 -07:00
Colin Percival
7347dfce01 Add run_interrupt_driven_config_hooks to TSLOG
The 'intr_config_hooks' SYSINIT is now taking a nontrivial amount of
time in my profiling; run_interrupt_driven_config_hooks is responsible
for most of it, so this adds useful information to the resulting
flamecharts.
2021-09-05 12:45:29 -07:00
Alexander Motin
a264594d4f Unify console output.
Without this change when virtual console enabled depending on buffer
presence and state different parts of output go to different consoles.

MFC after:	1 month
2021-09-03 23:13:42 -04:00
Alexander Motin
bd6085c6ae Re-implement virtual console (constty).
Protect conscallout with tty lock instead of Giant.  In addition to
Giant removal it also closes race on console unset.

Introduce additional lock to protect against concurrent console sets.

Remove consbuf free on console unset as unsafe, making impossible to
change buffer size after first allocation.  Instead increase default
buffer size from 8KB to 64KB and processing rate from 5Hz to 10-15Hz
to make the output more smooth.

MFC after:	1 month
2021-09-03 22:18:51 -04:00
Alexander Motin
4730a8972b callout(9): Allow spin locks use with callout_init_mtx().
Implement lock_spin()/unlock_spin() lock class methods, moving the
assertion to _sleep() instead.  Change assertions in callout(9) to
allow spin locks for both regular and C_DIRECT_EXEC cases. In case of
C_DIRECT_EXEC callouts spin locks are the only locks allowed actually.

As the first use case allow taskqueue_enqueue_timeout() use on fast
task queues.  It actually becomes more efficient due to avoided extra
context switches in callout(9) thanks to C_DIRECT_EXEC.

MFC after:	2 weeks
Reviewed by:	hselasky
Differential Revision:	https://reviews.freebsd.org/D31778
2021-09-02 21:16:46 -04:00
Konstantin Belousov
5cc82c563e cluster_write(): do not access buffer after it is released
The issue was reported by
Alexander Lochmann <alexander.lochmann@tu-dortmund.de>,
who found the problem by performing lock analysis using LockDoc,
see https://doi.org/10.1145/3302424.3303948.

Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D31780
2021-09-02 21:36:33 +03:00
Mateusz Guzik
6352bbf7be vmem: disable debug.vmem_check by default
It has a prohibitive performance impact when running real workloads.

Note this only affects kernels with DIAGNOSTIC.

Reviewed by:	markj
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D31784
2021-09-02 18:28:45 +00:00
Brooks Davis
6bc90e8acf syscalls.master: correct formatting issues
Reviewed by:	kevans, emaste
MFC after:	1 week
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D31351
2021-09-01 21:58:22 +01:00
Brooks Davis
df501bac69 syscalls.master: switch to CAPENABLED flags
Switch the main syscall table to use CAPENABLED flags rather than
capabilities.conf.  This avoid synchronization issues between
syscalls.master and capabilities.conf (e.g. when renaming a syscall
during development).

For now, move capabilities.conf to sys/compat/freebsd32 and use it
there.  Use of sys/compat/freebsd32/syscalls.master should be replaced
by makesyscalls.lua enhancements to allow the main one to be used.

This change results in no changes to generated files after running
`make sysent`.

Reviewed by:	kevans, emaste
MFC after:	1 week
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D31350
2021-09-01 21:58:16 +01:00
Brooks Davis
6945df3fff makesyscalls.lua: add a CAPENABLED flag
The CAPENABLED flag indicates that the syscall can be used in capsicum
capability mode.  It is intended to replace capabilities.conf.

Reviewed by:	kevans, emaste
MFC after:	1 week
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D31349
2021-09-01 21:58:06 +01:00
Mark Johnston
c511383de7 kevent: Fix races between timer detach and kqtimer_proc_continue()
- When detaching a knote, we need to double check the enqueued flag
  after acquiring the process lock, as kqtimer_proc_continue() may have
  toggled it.
- kqtimer_proc_continue() could in principle reschedule a stopped
  callout after filt_timerdetach() drains the callout.  So, we need to
  re-check.

Reported by:	syzbot+4a4cebb3ec07892cb040@syzkaller.appspotmail.com
Reported by:	syzbot+a9c04bc76078a3b7dd8d@syzkaller.appspotmail.com
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31772
2021-09-01 14:18:58 -04:00
Ka Ho Ng
92bb74fd4f vfs: Use file_cred for VOP_DEALLOCATE in vn_deallocate if non-NULL
This changes vn_deallocate() to match the behavior of vn_rdwr() when
picking which ucred to use. That is, vn_deallocate() uses file_cred for
making VOP call if it is non-NULL, or use active_cred otherwise.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D31712
2021-09-01 20:19:08 +08:00
Alexander Motin
706b1a5724 Align taskqueue_enqueue_timeout() to hardclock.
It is done for all other KPIs using HZ, but was missed here.

MFC after:	2 weeks
2021-08-31 23:50:35 -04:00
Mark Johnston
3138392a46 itimer: Serialize access to the p_itimers array
Fix the following race between itimer_proc_continue() and process exit.

itimer_proc_continue() may be called via realitexpire(), the real
interval timer.  Note that exit1() drains this timer _after_ draining
and freeing itimers.  Moreover, itimers_exit() is called without the
process lock held; it only acquires the proc lock when deleting
individual itimers, so once they are drained we free p->p_itimers
without any synchronization.  Thus, itimer_proc_continue() may load a
non-NULL p->p_itimers array and iterate over it after it has been freed.

Fix the problem by using the process lock when clearing p->p_itimers, to
synchronize with itimer_proc_continue().  Formally, accesses to this
field should be protected by the process lock anyway, and since the
array is allocated lazily this will not incur any overhead in the common
case.

Reported by:	syzbot+c40aa8bf54fe333fc50b@syzkaller.appspotmail.com
Reported by:	syzbot+929be2f32503bbc3844f@syzkaller.appspotmail.com
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31759
2021-08-31 16:38:05 -04:00
John Baldwin
470e851c4b ktls: Support asynchronous dispatch of AEAD ciphers.
KTLS OCF support was originally targeted at software backends that
used host CPU cycles to encrypt TLS records.  As a result, each KTLS
worker thread queued a single TLS record at a time and waited for it
to be encrypted before processing another TLS record.  This works well
for software backends but limits throughput on OCF drivers for
coprocessors that support asynchronous operation such as qat(4) or
ccr(4).  This change uses an alternate function (ktls_encrypt_async)
when encrypt TLS records via a coprocessor.  This function queues TLS
records for encryption and returns.  It defers the work done after a
TLS record has been encrypted (such as marking the mbufs ready) to a
callback invoked asynchronously by the coprocessor driver when a
record has been encrypted.

- Add a struct ktls_ocf_state that holds the per-request state stored
  on the stack for synchronous requests.  Asynchronous requests malloc
  this structure while synchronous requests continue to allocate this
  structure on the stack.

- Add a ktls_encrypt_async() variant of ktls_encrypt() which does not
  perform request completion after dispatching a request to OCF.
  Instead, the ktls_ocf backends invoke ktls_encrypt_cb() when a TLS
  record request completes for an asynchronous request.

- Flag AEAD software TLS sessions as async if the backend driver
  selected by OCF is an async driver.

- Pull code to create and dispatch an OCF request out of
  ktls_encrypt() into a new ktls_encrypt_one() function used by both
  ktls_encrypt() and ktls_encrypt_async().

- Pull code to "finish" the VM page shuffling for a file-backed TLS
  record into a helper function ktls_finish_noanon() used by both
  ktls_encrypt() and ktls_encrypt_cb().

Reviewed by:	markj
Tested on:	ccr(4) (jhb), qat(4) (markj)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D31665
2021-08-30 13:11:52 -07:00
Andrew Turner
b792434150 Create sys/reg.h for the common code previously in machine/reg.h
Move the common kernel function signatures from machine/reg.h to a new
sys/reg.h. This is in preperation for adding PT_GETREGSET to ptrace(2).

Reviewed by:	imp, markj
Sponsored by:	DARPA, AFRL (original work)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D19830
2021-08-30 12:50:53 +01:00
Ka Ho Ng
a58e222b3b vfs: yield in vn_deallocate_impl() loop
Yield at the end of each loop iteration if there are remaining works as
indicated by the value of *len updated by VOP_DEALLOCATE. Without this,
when calling vop_stddeallocate to zero a large region, the
implementation only zerofills a relatively small chunk and returns.

Sponsored by:	The FreeBSD Foundation
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D31705
2021-08-29 16:26:00 +08:00
Mark Johnston
7326e8589c fsetown: Avoid process group lock recursion
Restore the pre-1d874ba4f8ba behaviour of disassociating the current
SIGIO recipient before looking up the specified process or process
group.  This avoids a lock recursion in the scenario where a process
group is configured to receive SIGIO for an fd when it has already been
so configured.

Reported by:	pho
Tested by:	pho
Reviewed by:	kib
MFC after:	3 days
2021-08-28 15:50:44 -04:00
Rick Macklem
da779f262c vfs_default: Change vop_stddeallocate() from static to global
A future commit to the NFS client uses vop_stddeallocate() for
cases where the NFS server does not support a Deallocate operation.
Change vop_stddeallocate() from static to global so that it can
be called by the NFS client.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D31640
2021-08-27 18:25:44 -07:00