18931 Commits

Author SHA1 Message Date
Gleb Smirnoff
644ca0846d domains: make domain_init() initialize only global state
Now that each module handles its global and VNET initialization
itself, there is no VNET related stuff left to do in domain_init().

Differential revision:	https://reviews.freebsd.org/D33541
2022-01-03 10:15:22 -08:00
Gleb Smirnoff
24e1c6ae7d domains: init with standard SYSINIT(9) or VNET_SYSINIT()
Only three modules were left that used dom_init(), and netipsec
was the last one to use dom_destroy().

Differential revision:	https://reviews.freebsd.org/D33540
2022-01-03 10:15:22 -08:00
Gleb Smirnoff
340c7343f4 protocols: don't execute protosw_init() for every VNET
The function now modifies pr_usrreqs only, which are always
global.  Rename it to pr_usrreqs_init().

Differential revision:	https://reviews.freebsd.org/D33538
2022-01-03 10:15:21 -08:00
Gleb Smirnoff
89128ff3e4 protocols: init with standard SYSINIT(9) or VNET_SYSINIT
The historical BSD network stack loop that rolls over domains and
over protocols has no advantages over more modern SYSINIT(9).
While doing the sweep, split global and per-VNET initializers.

Getting rid of pr_init achieves several things:
o Get rid of ifdef's that protect against double foo_init() when
  both INET and INET6 are compiled in.
o Isolate initializers statically to the module they init.
o Make the code easier to understand and maintain.

Reviewed by:		melifaro
Differential revision:	https://reviews.freebsd.org/D33537
2022-01-03 10:15:21 -08:00
Jessica Clarke
a3e828c91d intrng: Use less confusing return value for intr_pic_add_handler
Currently intr_pic_add_handler either returns the PIC you gave it (which
is useless and risks causing confusion about whether it's creating
another PIC) or, on error, NULL. Instead, convert it to return an int
error code as one would expect.

Note that the only consumer of this API, arm64's gicv3_its, does not use
the return value, so no uses need updating to work with the revised API.

Reviewed by:	markj, mmel
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D33341
2022-01-03 17:08:44 +00:00
Stefan Eßer
ec3af9d0ca sys/kern/sched_4bsd.c: fix typo introduced in previous commit
2022-01-01 15:33:38 +01:00
Stefan Eßer
a19bd8e30e Restore variable aliasing in the context of cpu set operations
A simplification of set operations removed side-effects of the
previous code, which are restored by this commit.
2022-01-01 11:58:40 +01:00
Mark Johnston
6b95cf5bde callout: Wait for the softclock thread to switch before rescheduling
When a softclock thread prepares to go off-CPU, the following happens in
the context of the thread:

1. callout state is locked
2. thread state is set to IWAIT
3. thread lock is switched from the tdq lock to the callout lock
4. tdq lock is released
5. sched_switch() sets td_lock to &blocked_lock
6. sched_switch() releases old td_lock (callout lock)
7. sched_switch() removes td from its runqueue
8. cpu_switch() sets td_lock back to the callout lock

Suppose a timer interrupt fires while the softclock thread is switching
off, and callout_process() schedules the softclock thread.  Then there
is a window between steps 5 and 8 where callout_process() can call
sched_add() while td_lock is &blocked_lock, but this is not correct
since the thread is not logically locked.

callout_process() thus needs to spin waiting for the softclock thread to
finish switching off (i.e., after step 8 completes) before rescheduling
it, since callout_process() does not acquire the thread lock directly.

Reported by:	syzbot+fb44dbf6734ff492c337@syzkaller.appspotmail.com
Fixes:		74cf7cae4d ("softclock: Use dedicated ithreads for running callouts.")
Reviewed by:	mav, kib, jhb
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D33709
2021-12-31 17:01:39 -05:00
Mark Johnston
f04a096049 exec: Simplify sv_copyout_strings implementations a bit
Simplify control flow around handling of the execpath length and signal
trampoline.  Cache the sysentvec pointer in a local variable.

No functional change intended.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D33703
2021-12-31 12:50:15 -05:00
John Baldwin
74cf7cae4d softclock: Use dedicated ithreads for running callouts.
Rather than using the swi infrastructure, rewrite softclock() as a
thread loop (softclock_thread()) and use it as the main routine of the
softclock threads.  The threads use the CC_LOCK as the thread lock
when idle.

Reviewed by:	mav, imp, kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D33683
2021-12-30 14:55:08 -08:00
Stefan Eßer
e2650af157 Make CPU_SET macros compliant with other implementations
The introduction of <sched.h> improved compatibility with some 3rd
party software, but caused the configure scripts of some ports to
assume that they were run in a GLIBC compatible environment.

Parts of sched.h were made conditional on -D_WITH_CPU_SET_T being
added to ports, but there still were compatibility issues due to
invalid assumptions made in autoconfigure scripts.

The difference between the FreeBSD versions of macros like CPU_AND,
CPU_OR, etc. and the GLIBC versions was in the number of arguments:
FreeBSD used a 2-address scheme (one source argument is also used as
the destination of the operation), while GLIBC uses a 3-address
scheme (2 source operands and a separately passed destination).

The GLIBC scheme provides a super-set of the functionality of the
FreeBSD macros, since it does not prevent passing the same variable
as source and destination arguments. In code that wanted to preserve
both source arguments, the FreeBSD macros required a temporary copy of
one of the source arguments.

This patch set makes it possible to unconditionally provide the
functions and macros expected by 3rd party software written for GLIBC
based systems, but breaks builds of externally maintained sources that
use any of the following macros: CPU_AND, CPU_ANDNOT, CPU_OR, CPU_XOR.

One contributed driver (contrib/ofed/libmlx5) has been patched to
support both the old and the new CPU_OR signatures. If this commit
is merged to -STABLE, the version test will have to be extended to
cover more ranges.

Ports that have added -D_WITH_CPU_SET_T to build on -CURRENT
no longer require that option.

The FreeBSD version has been bumped to 1400046 to reflect this
incompatible change.

Reviewed by:	kib
MFC after:	2 weeks
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D33451
2021-12-30 12:20:32 +01:00
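The two calling conventions described in e2650af157 can be sketched with a toy bit set. This is only an illustration under stated assumptions: `toyset_t`, `TOY_OR2`, and `TOY_OR3` are hypothetical names, not the real CPU_OR definitions, and a single 64-bit mask stands in for the actual cpuset type.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a CPU set: a single 64-bit mask. */
typedef struct {
	uint64_t bits;
} toyset_t;

/* 2-address form (old FreeBSD style): dst is also the first source. */
#define	TOY_OR2(dst, src)		((dst)->bits |= (src)->bits)

/* 3-address form (GLIBC style): the destination is passed separately. */
#define	TOY_OR3(dst, src1, src2)	((dst)->bits = (src1)->bits | (src2)->bits)

/* With the 2-address form, keeping both sources intact needs a temporary. */
static toyset_t
union_two_address(const toyset_t *a, const toyset_t *b)
{
	toyset_t tmp = *a;	/* temporary copy of one source */

	TOY_OR2(&tmp, b);
	return (tmp);
}

/* With the 3-address form, no copy is needed. */
static toyset_t
union_three_address(const toyset_t *a, const toyset_t *b)
{
	toyset_t out;

	TOY_OR3(&out, a, b);
	return (out);
}
```

Passing the same variable as destination and first source, as in `TOY_OR3(&a, &a, &b)`, gives the 2-address behaviour back, which is why the commit message calls the 3-address scheme a super-set.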
Colin Percival
33812d60b9 vfs_mountroot: Check for root dev before waiting
If GEOM is idle but the root device is not yet present when we enter
vfs_mountroot_wait_if_necessary, we call vfs_mountroot_wait to wait
for root holds (e.g. CAM or USB initialization).  Upon returning from
vfs_mountroot_wait, we wait 100 ms at a time until the root device
shows up.

Since the root device most likely appeared during vfs_mountroot_wait
-- waiting for subsystems which may be responsible for the root
device is the whole purpose of that function -- it makes sense to
check if the device is now present rather than printing a warning
and pausing for 100 ms before checking.

Reviewed by:	trasz
Fixes: a3ba3d09c2 Make root mount wait mechanism smarter
Sponsored by:	https://www.patreon.com/cperciva
Differential Revision:	https://reviews.freebsd.org/D33593
2021-12-29 12:41:09 -08:00
Colin Percival
19a172158c vfs_mountroot: Wait for GEOM idle post root holds
In the case of a root hold related to the initialization of a disk
device, a flurry of GEOM tasting is likely to take place as soon as
the device is initialized and the root hold is released.  If we
don't wait for GEOM idle it's easy for vfs_mountroot to "win" the
race and proceed before the root filesystem GEOM is ready.

Reviewed by:	imp
Sponsored by:	https://www.patreon.com/cperciva
Differential Revision:	https://reviews.freebsd.org/D33592
2021-12-29 12:41:08 -08:00
Colin Percival
e6db5eb9ec vfs_mountroot: Skip 'Root mount waiting' < 1 s
While the message is technically correct, it's not particularly
helpful in the case where we're only waiting a few ms; this case
occurs frequently on EC2 arm64 instances with CAM initialization
racing to release its root hold before vfs_mountroot reaches this
point.  Only print the message if we end up waiting for more than
one second.

Sponsored by:	https://www.patreon.com/cperciva
Differential Revision:	https://reviews.freebsd.org/D33591
2021-12-29 12:41:08 -08:00
Edward Tomasz Napierala
626d6992ca Move fork_rfppwait() check into ast()
This will always sleep at least once, so it's a slow path by definition.

Reviewed By:	kib
Sponsored By:	EPSRC
Differential Revision:	https://reviews.freebsd.org/D33387
2021-12-26 17:22:21 +00:00
Roger Pau Monné
60e749da3c mbuf_tags: use explicitly sized type for 'type' parameter
Functions manipulating mbuf tags are using an int type for passing the
'type' parameter, but the internal tag storage is using a 16bit
integer to store it. This leads to the following code:

t = m_tag_alloc(...,0xffffffff,...,...);
m_tag_prepend(m, t);
r = m_tag_locate(m ,...,0xffffffff, NULL);

Returning r == NULL because m_tag_locate doesn't truncate the type
parameter when doing the match. This is unexpected because the type of
the 'type' parameter is int, and the caller doesn't need to know about
the internal truncations.

Fix this by making the 'type' parameter of type uint16_t in order to
match the size of its internal storage and make it obvious to the
caller the actual size of the parameter.

While there, also use uint uniformly, replacing the existing u_int
instances.

Reviewed by: kp, donner, glebius
Differential revision: https://reviews.freebsd.org/D33680
2021-12-29 09:23:52 +01:00
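The mismatch described in 60e749da3c can be reproduced in miniature. The sketch below is an assumption-laden illustration: `struct toy_tag` and the `toy_tag_*` functions are hypothetical and only mimic the shape of the problem (an int parameter compared against 16-bit storage); they are not the m_tag KPI. The value 0x12345 is used instead of 0xffffffff to keep every conversion well defined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tag whose id is stored in 16 bits. */
struct toy_tag {
	uint16_t id;
};

/* Old-style setter: 'type' is an int and is silently truncated on store. */
static void
toy_tag_init_int(struct toy_tag *t, int type)
{
	t->id = (uint16_t)type;
}

/* Old-style lookup: compares the untruncated int with the stored value. */
static struct toy_tag *
toy_tag_match_int(struct toy_tag *t, int type)
{
	return (t->id == type ? t : NULL);
}

/* Fixed lookup: the parameter has the same width as the storage. */
static struct toy_tag *
toy_tag_match_u16(struct toy_tag *t, uint16_t type)
{
	return (t->id == type ? t : NULL);
}
```

With a 16-bit parameter the caller can see the real width of the value in the prototype, so the surprising failed lookup cannot be expressed in the first place.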
John Baldwin
254e4e5b77 Simplify swi for bus_dma.
When a DMA request using bounce pages completes, a swi is triggered to
schedule pending DMA requests using the just-freed bounce pages.  For
a long time this bus_dma swi has been tied to a "virtual memory" swi
(swi_vm).  However, all of the swi_vm implementations are the same and
consist of checking a flag (busdma_swi_pending) which is always true
and if set calling busdma_swi.  I suspect this dates back to the
pre-SMPng days and that the intention was for swi_vm to serve as a
mux.  However, in the current scheme there's no need for the mux.

Instead, remove swi_vm and vm_ih.  Each bus_dma implementation that
uses bounce pages is responsible for creating its own swi (busdma_ih)
which it now schedules directly.  This swi invokes busdma_swi directly
removing the need for busdma_swi_pending.

One consequence is that the swi now works on RISC-V which had previously
failed to invoke busdma_swi from swi_vm.

Reviewed by:	imp, kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D33447
2021-12-28 13:51:25 -08:00
John Baldwin
2cee586189 sys/kern: Use C99 fixed-width integer types.
No functional change.

Reviewed by:	imp, kib
Differential Revision:	https://reviews.freebsd.org/D33630
2021-12-28 09:41:08 -08:00
Konstantin Belousov
23ba59fbfb itimers: strip unused bits from struct itimer and struct itimers
Reviewed by:	imp, markj, mav
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D33670
2021-12-28 03:02:53 +02:00
Konstantin Belousov
3f15708478 itimers_alloc: no need to initialize its_timers array
struct itimers is allocated with M_ZERO, setting all members to NULL
is tautological.

Reviewed by:	imp, markj, mav
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D33670
2021-12-28 03:02:53 +02:00
Alexander Motin
c6c52d8e39 kern: Remove CTLFLAG_NEEDGIANT from some more sysctls.
MFC after:	2 weeks
2021-12-26 23:07:33 -05:00
Gleb Smirnoff
eb8dcdeac2 jail: network epoch protection for IP address lists
Now struct prison has two pointers (IPv4 and IPv6) of struct
prison_ip type.  Each points into epoch context, address count
and variable size array of addresses.  These structures are
freed with network epoch deferred free and are not edited in
place, instead a new structure is allocated and set.

While here, the change also generalizes a lot (but not enough)
of IPv4 and IPv6 processing. E.g. address family agnostic helpers
for kern_jail_set() are provided, that reduce v4-v6 copy-paste.

The fast-path prison_check_ip[46]_locked() is also generalized
into prison_ip_check() that can be executed with network epoch
protection only.

Reviewed by:		jamie
Differential revision:	https://reviews.freebsd.org/D33339
2021-12-26 10:45:50 -08:00
Alexander Motin
fe27f1db5f kern: Remove CTLFLAG_NEEDGIANT from some sysctls.
MFC after:	2 weeks
2021-12-26 12:03:33 -05:00
Jessica Clarke
d2ef377430 Fix buffer overread in preloaded hostuuid parsing
Commit b6be9566d2 stopped prison0_init writing outside of the
preloaded hostuuid's bounds. However, the preloaded data will not
(normally) have a NUL in it, and so validate_uuid will walk off the end
of the buffer in its call to sscanf. Previously if there was any
whitespace in the string we'd at least know there's a NUL one past the
end due to the off-by-one error, but now no such byte is guaranteed.

Fix this by copying to a temporary buffer and explicitly adding a NUL.

Whilst here, change the strlcpy call to use a far less suspicious
argument for dstsize; in practice it's fine, but it's an unusual pattern
and not necessary.

Found by:	CHERI
Reviewed by:	emaste, kevans, jhb
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D33616
2021-12-22 16:47:23 +00:00
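The fix described in d2ef377430 (copy to a temporary buffer and add a NUL before parsing) can be sketched in userland C. The names `toy_parse_uuid_prefix` and `TOY_UUID_LEN` are hypothetical, and the function parses only the first UUID field; it is not the kernel's validate_uuid.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define	TOY_UUID_LEN	36	/* textual UUID length, without a NUL */

/*
 * Parse the first field of a UUID held in 'len' bytes of 'data', which
 * is NOT guaranteed to be NUL-terminated.  Returns 0 on success and -1
 * on failure.
 */
static int
toy_parse_uuid_prefix(const char *data, size_t len, unsigned int *time_low)
{
	char buf[TOY_UUID_LEN + 1];

	if (len > TOY_UUID_LEN)
		return (-1);
	/* Copy into a bounded temporary and terminate it explicitly. */
	memcpy(buf, data, len);
	buf[len] = '\0';
	/* sscanf can no longer walk past the end of the caller's data. */
	return (sscanf(buf, "%8x-", time_low) == 1 ? 0 : -1);
}
```

Calling sscanf directly on `data` would keep reading until it happened to find a NUL, which is the overread the commit message describes.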
Hans Petter Selasky
f9978339d1 Remove dead code.
The variable orig_resid is always set to zero right after the while loop
where it is cleared.

Reviewed by:	gallatin@ and glebius@
Differential Revision:	https://reviews.freebsd.org/D33589
MFC after:	1 week
Sponsored by:	NVIDIA Networking
2021-12-21 18:35:03 +01:00
Mark Johnston
36bd49ac4d fd: Avoid truncating output buffers for KERN_PROC_{CWD,FILEDESC}
These sysctls failed to return an error if the caller had provided too
short an output buffer.  Change them to return ENOMEM instead, to ensure
that callers can detect truncation in the face of a concurrently
changing fd table.

PR:		228432
Discussed with:	cem, jhb
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D15607
2021-12-17 13:10:23 -05:00
Mark Johnston
327060bd77 fd: Initialize more export_fd_buf fields in kern_proc_cwd_out()
In particular, we need to initialize efbuf->flags, since
export_vnode_to_sb() loads that field.  This was mostly harmless since
the flag only determines whether the output kinfo_file is packed, and
KERN_PROC_CWD only ever emits a single kinfo_file anyway.

MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2021-12-17 13:10:22 -05:00
Mark Johnston
d157f2627b unix: Increase the default datagram recv buffer size
syslog(3) was recently changed to support larger messages, up to 8KB.
Our syslogd handles this fine, as it adjusts /dev/log's recv buffer to a
large size.  rsyslog, however, uses the system default of 4KB.  This
leads to problems since our syslog(3) retries indefinitely when a send()
returns ENOBUFS, but if the message is large enough this will never
succeed.

Increase the default recv buffer size for datagram sockets to support
8KB syslog messages without requiring the logging daemon to adjust its
buffers.

PR:		260126
Reviewed by:	asomers
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D33380
2021-12-17 13:09:49 -05:00
Alan Somers
b214fcceac Change VOP_READDIR's cookies argument to a **uint64_t
The cookies argument is only used by the NFS server.  NFSv2 defines the
cookie as 32 bits on the wire, but NFSv3 increased it to 64 bits.  Our
VOP_READDIR, however, has always defined it as u_long, which is 32 bits
on some architectures.  Change it to 64 bits on all architectures.  This
doesn't matter for any in-tree file systems, but it matters for some
FUSE file systems that use 64-bit directory cookies.

PR:             260375
Reviewed by:    rmacklem
Differential Revision: https://reviews.freebsd.org/D33404
2021-12-15 20:54:57 -07:00
Mark Johnston
bfd451210e imgact_elf: Disable the stack gap for now
The integration with RLIMIT_STACK is still causing problems for some
programs such as lang/sdcc and syzkaller's executor.  Until this is
resolved by some work currently in progress, disable the stack gap by
default.

PR:		260303
Reviewed by:	kib, emaste
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D33438
2021-12-14 16:19:40 -05:00
John Baldwin
05a1d0f5d7 ktls: Support for TLS 1.3 receive offload.
Note that support for TLS 1.3 receive offload in OpenSSL is still an
open pull request in active development.  However, potential changes
to that pull request should not affect the kernel interface.

Reviewed by:	hselasky
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D33007
2021-12-14 11:01:05 -08:00
Mateusz Guzik
af40f9bfa6 socket: plug set-but-not-used vars
Sponsored by:	Rubicon Communications, LLC ("Netgate")
2021-12-14 17:12:53 +00:00
Mateusz Guzik
3c39566d65 mbuf: plug set-but-not-used vars
Sponsored by:	Rubicon Communications, LLC ("Netgate")
2021-12-14 14:54:02 +00:00
Mateusz Guzik
b2835fa22a kobj: plug set-but-not-used vars
Sponsored by:	Rubicon Communications, LLC ("Netgate")
2021-12-14 14:52:25 +00:00
Mateusz Guzik
a90b85dd5a ktls: plug set-but-not-used vars
Sponsored by:	Rubicon Communications, LLC ("Netgate")
2021-12-14 14:44:37 +00:00
Konstantin Belousov
e499988f0c exec_elf: use intermediate u_long variable to correct mismatched type
vm_offset_t * vs. u_long *

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2021-12-12 20:07:04 +02:00
Konstantin Belousov
bf83941638 imgact_elf: avoid mapsz overflow
Reported and tested by:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D33359
2021-12-12 16:28:39 +02:00
Konstantin Belousov
36df8f540f imgact_elf: check that the alignment of PT_LOAD segment is power of two
and stop recalculating alignment for PIE base, which was off by one
power of two.

Suggested and reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D33359
2021-12-12 16:28:38 +02:00
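A power-of-two test of the kind 36df8f540f calls for is commonly written with a single bit trick. The helper below is a minimal sketch with a hypothetical name; it is not the check used in imgact_elf.c, and treating zero as invalid is an assumption made here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * A power of two has exactly one bit set, so clearing the lowest set
 * bit (align & (align - 1)) must leave zero.  Zero is rejected here.
 */
static bool
toy_is_power_of_two(uint64_t align)
{
	return (align != 0 && (align & (align - 1)) == 0);
}
```

For example, 4096 passes, while 12288 (three pages) fails because it has two bits set.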
Konstantin Belousov
714d6d09b5 imgact_elf: exclude invalid alignment requests
Only accept at most superpage alignment, or if the arch does not have
superpages supported, artificially limit it to PAGE_SIZE * 1024.
This is somewhat arbitrary, and e.g. could change which binaries
we accept between native i386 vs. amd64 ia32 with superpages disabled,
but I do not believe the difference there affects anybody with
real (useful) binaries.

Reported and tested by:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D33359
2021-12-12 16:28:38 +02:00
Konstantin Belousov
a4007ae10c rnd_elf: add comment explaining the interface
Requested and reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D33359
2021-12-12 16:28:38 +02:00
Konstantin Belousov
9cf78c1cf6 elf image activator: convert asserts into errors
Invalid (artificial) layout of the loadable ELF segments might result in
triggering the assertion.  This means that the file should not be
executed, regardless of the kernel debug mode.  Change calling
conventions for rnd_elf{32,64} helpers to allow returning an error, and
abort activation with ENOEXEC if its invariants are broken.

Reported and tested by:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D33359
2021-12-12 16:28:38 +02:00
Konstantin Belousov
b4b20492cd exec_elf: assert that the image vnode is still locked on return
Suggested and reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D33359
2021-12-12 16:28:38 +02:00
Konstantin Belousov
88dd7a0a39 Style
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D33359
2021-12-12 16:28:38 +02:00
Alexander Motin
81dc00331d Make msgbuf_peekbytes() not return leading zeroes.
Introduce new MSGBUF_WRAP flag, indicating that buffer has wrapped
at least once and does not keep zeroes from the last msgbuf_clear().
It allows msgbuf_peekbytes() to return only real data, not requiring
every consumer to trim the leading zeroes after doing pointless copy.
The most visible effect is that kern.msgbuf sysctl now always returns
proper zero-terminated string, not only after the first buffer wrap.

MFC after:	1 week
Sponsored by:	iXsystems, Inc.
2021-12-10 23:35:53 -05:00
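The idea behind the wrap flag in 81dc00331d can be shown with a toy ring buffer. Everything here is hypothetical (`struct toy_msgbuf`, `toy_addchar`, `toy_peek`, the 8-byte size) and far simpler than the kernel msgbuf; it only illustrates how a "has wrapped" bit lets a reader skip the zeroes left by the last clear.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define	TOY_SIZE	8

struct toy_msgbuf {
	char	data[TOY_SIZE];
	size_t	wpos;		/* next write position */
	bool	wrapped;	/* set once the writer has passed the end */
};

/* Zero the buffer and reset the write position and wrap flag. */
static void
toy_clear(struct toy_msgbuf *mb)
{
	memset(mb, 0, sizeof(*mb));
}

static void
toy_addchar(struct toy_msgbuf *mb, char c)
{
	mb->data[mb->wpos++] = c;
	if (mb->wpos == TOY_SIZE) {
		mb->wpos = 0;
		mb->wrapped = true;
	}
}

/*
 * Copy only real data, oldest first, into 'out' (at least TOY_SIZE + 1
 * bytes) and NUL-terminate it.  Returns the number of bytes copied.
 */
static size_t
toy_peek(const struct toy_msgbuf *mb, char *out)
{
	size_t n = 0;

	if (mb->wrapped) {
		/* Bytes from wpos to the end are older than those before it. */
		n = TOY_SIZE - mb->wpos;
		memcpy(out, mb->data + mb->wpos, n);
	}
	memcpy(out + n, mb->data, mb->wpos);
	n += mb->wpos;
	out[n] = '\0';
	return (n);
}
```

Without the flag, a reader would have to copy the whole buffer every time and trim leading zeroes itself, which is the per-consumer work the commit removes.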
Mateusz Guzik
4dd23ae150 vfs: retire MNTK_NOKNOTE and VV_NOKNOTE
MNTK_NOKNOTE was introduced in 679985d03a (dated 2005),
VV_NOKNOTE in 34cc826ae8 a few months later.

Neither was ever used by anything in the tree.
2021-12-10 19:53:57 +00:00
Florian Walpen
a9545eede4 Add idle priority scheduling privilege group to MAC/priority
Add an idletime user group that allows non-root users to run processes
with idle scheduling priority. Privileges are granted by a MAC policy in
the mac_priority module. For this purpose, the kernel privilege
PRIV_SCHED_IDPRIO was added to sys/priv.h (kernel module ABI change).

Deprecate the system wide sysctl(8) knob
security.bsd.unprivileged_idprio which lets any user run idle priority
processes, regardless of context. While the knob is still working, it is
marked as deprecated in the description and in the man pages.

MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D33338
2021-12-10 04:54:48 +02:00
Florian Walpen
a20a2450cd Add PRIV_SCHED_IDPRIO
The privilege allows the holder to assign idle priority type to thread
or process.

MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D33338
2021-12-10 04:54:48 +02:00
Warner Losh
d14bc7230b newbus: add bus_topo_assert
Add bus_topo_assert() and implement it as GIANT_REQUIRED for the
moment. This will allow us to change more easily to a newbus-specific
lock in the future.

Sponsored by:		Netflix
Reviewed by:		wulf, mav, jhb
Differential Revision:	https://reviews.freebsd.org/D31833
2021-12-09 17:04:57 -07:00
Warner Losh
815a0973b2 bus: mark sysctls MPSAFE
Mark the sysctls MPSAFE and pick up the bus topo lock while processing
them.

Sponsored by:		Netflix
Reviewed by:		mav, jhb
Differential Revision:	https://reviews.freebsd.org/D31832
2021-12-09 17:04:57 -07:00
Warner Losh
c6df6f5322 Create wrapper for Giant taken for newbus
Create a wrapper for newbus to take Giant and for busses to take it too.
bus_topo_lock() should be called before interacting with newbus routines
and unlocked with bus_topo_unlock(). If you need the topology lock for
some reason, bus_topo_mtx() will provide that.

Sponsored by:		Netflix
Reviewed by:		mav
Differential Revision:	https://reviews.freebsd.org/D31831
2021-12-09 17:04:45 -07:00
Bjoern A. Zeeb
df38ada293 modules: increase MAXMODNAME and provide backward compat
With various firmware files used by graphics and wireless drivers
we are exceeding the current 32 character module name (file path
in kldxref) length.
In order to overcome this issue bump it to the maximum path length
for the next version.
To be able to MFC provide backward compat support for another version
of the struct as the offsets for the second half change due to the
array size increase.

MAXMODNAME being defined to MAXPATHLEN needs param.h to be
included first.  With only 7 modules (or LinuxKPI module.h) not
doing that adjust them rather than including param.h in module.h [1].

Reported by:	Greg V (greg unrelenting.technology)
Sponsored by:	The FreeBSD Foundation
Suggested by:	imp [1]
MFC after:	10 days
Reviewed by:	imp (and others to different level)
Differential Revision:	https://reviews.freebsd.org/D32383
2021-12-09 18:09:53 +00:00
Mateusz Guzik
0a048d4a98 mbuf: plug set-but-not-used vars
Sponsored by:	Rubicon Communications, LLC ("Netgate")
2021-12-09 17:59:11 +00:00
Justin Hibbits
d2de68811a Fix assert check for SV_DSO_SIG in exec_sysvec_init_secondary()
The only requirement for SV_DSO_SIG here is that the flags match between
the source and target sysentvec.

The current assertion is too strict and fails on powerpc64, the only
other architecture than amd64 that uses this function, which doesn't
implement sigtramp in a VDSO.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D33355
2021-12-08 22:54:07 -06:00
Konstantin Belousov
b7c55487ff Regen
2021-12-09 02:49:10 +02:00
Konstantin Belousov
5346570276 swapoff: add one more variant of the syscall
Requested and reviewed by:	brooks
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D33343
2021-12-09 02:48:46 +02:00
Konstantin Belousov
c1a8472793 syscalls: add COMPAT13
Reviewed by:	brooks
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D33343
2021-12-09 02:48:32 +02:00
Konstantin Belousov
ecd8245e0d Kernel linkers: add emergency sysctl to restore old behavior
allowing linking to static symbols from other files.  Default the new
settings to true, delaying the change of the kernel linker behavior
to another day.

Suggested by:	emaste
PR:	207898
Reviewed by:	emaste, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D32878
2021-12-08 23:32:30 +02:00
Konstantin Belousov
95c20faf11 kernel linker: do not read debug symbol tables for non-debug symbols
In particular, this prevents resolving locals from other files.
To access debug symbol tables, add LINKER_LOOKUP_DEBUG_SYMBOL and
LINKER_DEBUG_SYMBOL_VALUES kobj methods, which are allowed to use
any types of present symbols in all tables.

PR:	207898
Reviewed by:	emaste, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D32878
2021-12-08 23:32:29 +02:00
Konstantin Belousov
72f6662662 linker_debug_symbol_values(): use proper linker interface to get debug values
Reported by:	markj
Reviewed by:	emaste, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D32878
2021-12-08 23:32:26 +02:00
Konstantin Belousov
c37c6f994f Style
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D32878
2021-12-08 23:32:20 +02:00
Konstantin Belousov
794d3e8e63 fcntl(2): add F_KINFO operation
that returns struct kinfo_file for the given file descriptor.  Among
other data, it also returns kf_path, if file op was able to restore file
path.

Reviewed by:	jhb, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D33277
2021-12-06 22:18:09 +02:00
Konstantin Belousov
6e51d61a96 Add declaration for static export_file_to_kinfo()
Reviewed by:	jhb, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D33277
2021-12-06 22:18:09 +02:00
Konstantin Belousov
eb02958748 Add kern.elf{32,64}.vdso knobs to enable/disable vdso preloading
Reviewed by:	emaste
Discussed with:	jrtc27
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
Differential revision:	https://reviews.freebsd.org/D32960
2021-12-06 20:46:49 +02:00
Konstantin Belousov
98c8b62524 vdso for ia32 on amd64
Reviewed by:	emaste
Discussed with:	jrtc27
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
Differential revision:	https://reviews.freebsd.org/D32960
2021-12-06 20:46:49 +02:00
Konstantin Belousov
290e05dde0 imgact_aout.c: some style
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D32960
2021-12-06 20:46:49 +02:00
Konstantin Belousov
01c77a436e Pass vdso address to userspace
Reviewed by:	emaste
Discussed with:	jrtc27
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
Differential revision:	https://reviews.freebsd.org/D32960
2021-12-06 20:46:49 +02:00
Konstantin Belousov
ab4524b3d7 amd64: wrap 64bit sigtramp into vdso
Reviewed by:	emaste
Discussed with:	jrtc27
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 month
Differential revision:	https://reviews.freebsd.org/D32960
2021-12-06 20:46:49 +02:00
Konstantin Belousov
9da5257e1c imgact_aout.c: We do not expect the aout support to be ported
Specify that the only supported architecture for a.out is ia32 (either
i386 or amd64 host kernel).

Requested by:	emaste
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D32960
2021-12-06 20:46:49 +02:00
Scott Long
95d35d7a0e Fix "set but not used" in kern_cpu.c
Sponsored by: Rubicon Communications, LLC ("Netgate")
2021-12-05 15:33:04 -07:00
Konstantin Belousov
a5c2d59ed3 Expand comment explaining reasons for automatic swapoff on shutdown
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D33167
2021-12-03 10:42:21 +02:00
Cy Schubert
db0ac6ded6 Revert "wpa: Import wpa_supplicant/hostapd commit 14ab4a816"
This reverts commit 266f97b5e9, reversing
changes made to a10253cffe.

A mismerge of a merge to catch up to main resulted in files being
committed which should not have been.
2021-12-02 14:45:04 -08:00
Cy Schubert
266f97b5e9 wpa: Import wpa_supplicant/hostapd commit 14ab4a816
This is the November update to vendor/wpa committed upstream 2021-11-26.

MFC after:      1 month
2021-12-02 13:35:14 -08:00
Gleb Smirnoff
d96fccc505 epoch: with EPOCH_TRACE add epoch_where_report()
which will report where the epoch was entered and also
mark the tracker, so that exit will also be reported.

Helps to understand epoch entrance/exit scenarios in
complex cases, like network stack.  As everything else
under EPOCH_TRACE it is a developer only tool.
2021-12-02 11:02:51 -08:00
Gleb Smirnoff
de2d47842e SMR protection for inpcbs
With introduction of epoch(9) synchronization to network stack the
inpcb database became protected by the network epoch together with
static network data (interfaces, addresses, etc).  However, inpcb
aren't static in nature, they are created and destroyed all the
time, which creates some traffic on the epoch(9) garbage collector.

A fairly new feature of uma(9), Safe Memory Reclamation, allows
memory to be freed safely in page-sized batches, with virtually zero
overhead compared to uma_zfree().  However, unlike epoch(9), it
puts a stricter requirement on access to the protected memory,
needing a critical(9) section to access it.  Details:

- The database is already built on CK lists, thanks to epoch(9).
- For write access nothing is changed.
- For a lookup in the database SMR section is now required.
  Once the desired inpcb is found we need to transition from SMR
  section to r/w lock on the inpcb itself, with a check that inpcb
  isn't yet freed.  This requires some complexity, since SMR section
  itself is a critical(9) section.  The complexity is hidden from
  KPI users in inp_smr_lock().
- For an inpcb list traversal (a pcblist sysctl, or broadcast
  notification) also a new KPI is provided, that hides internals of
  the database - inp_next(struct inp_iterator *).

Reviewed by:		rrs
Differential revision:	https://reviews.freebsd.org/D33022
2021-12-02 10:48:48 -08:00
Gordon Bergling
fe96f62d61 kern: Correct a typo in a sysctl description
- s/osbolete/obsolete/

MFC after:	3 days
2021-12-02 10:54:15 +01:00
Warner Losh
1c7d15b030 Make device_busy/unbusy work w/o Giant held
The vast majority of the busy/unbusy users in the tree don't acquire
Giant before calling device_busy/unbusy. However, if multiple threads
are opening a file, say, that causes the device to busy/unbusy, then we
can race to the root marking things busy. Move to using a reference
count to keep track of how many times a device_t has been made busy. Use
that count to make the same decisions that we'd make with the old device
state.
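
A minimal userspace sketch of the reference-count idea, assuming C11
atomics in place of the kernel's refcount APIs.  The struct toy_device
type and the toy_* functions are hypothetical and are not the newbus
implementation; the point is that only the 0 -> 1 and 1 -> 0
transitions walk up to the parent, so concurrent callers cannot
disagree about who marks the parent busy.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for device_t; not the kernel's structure. */
struct toy_device {
	atomic_uint		busy;	/* outstanding busy references */
	struct toy_device	*parent;
};

void
toy_device_busy(struct toy_device *dev)
{
	/* Only the first reference propagates towards the root. */
	if (atomic_fetch_add(&dev->busy, 1) == 0 && dev->parent != NULL)
		toy_device_busy(dev->parent);
}

void
toy_device_unbusy(struct toy_device *dev)
{
	/* Only dropping the last reference releases the parent. */
	if (atomic_fetch_sub(&dev->busy, 1) == 1 && dev->parent != NULL)
		toy_device_unbusy(dev->parent);
}

bool
toy_device_is_busy(struct toy_device *dev)
{
	return (atomic_load(&dev->busy) > 0);
}
```

In this sketch two opens of the same child leave the parent with a
single reference, and the parent becomes idle only after both closes.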

Note: gpiopps.c uses D_TRACKCLOSE. Others do as well. However, there's a
known race with closes that will be corrected for all the drivers that
do this in a future commit.

Sponsored by:		Netflix
Reviewed by:		hselasky, jhb
Differential Revision:	https://reviews.freebsd.org/D26284
2021-11-30 15:18:01 -07:00
Warner Losh
25c49c426c Revert "Make device_busy/unbusy work w/o Giant held"
This reverts commit 08e7819153.

Commit message was for a very old version of the patch. Will re-commit
with the right one since it's so bad. There's no locked versions of
it...that code was reworked to use refcnt APIs.

Noticed by:	jhb, jtrc27
Sponsored by:	Netflix
2021-11-30 15:17:07 -07:00
Warner Losh
08e7819153 Make device_busy/unbusy work w/o Giant held
The vast majority of the busy/unbusy users in the tree don't acquire Giant
before calling device_busy/unbusy. However, if multiple threads are opening a
file, say, that causes the device to busy/unbusy, then we can race to the root
marking things busy. Create a new device_busy_locked and device_unbusy_locked
that are the current implementations of device_busy and device_unbusy. Make
device_busy and unbusy acquire Giant before calling the _locked versions. Since
we never sleep in the busy/unbusy path, Giant's single threaded semantics
suffice to keep this safe.

Sponsored by:		Netflix
Reviewed by:		hselasky, jhb
Differential Revision:	https://reviews.freebsd.org/D26284
2021-11-30 15:03:26 -07:00
Andriy Gapon
3d9d64aa18 kern_tc: unify timecounter to bintime delta conversion
There are two places where we convert from a timecounter delta to
a bintime delta: tc_windup and bintime_off.
Both functions use the same calculations when the timecounter delta is
small.  But for a large delta (greater than approximately an equivalent
of 1 second) the calculations were different.  Both functions use
approximate calculations based on th_scale that avoid division.  Both
produce values slightly greater than a true value, calculated with
division by tc_frequency, would be.  tc_windup is slightly more
accurate, so its result is closer to the true value and, thus, smaller
than bintime_off result.

As a consequence there can be a jump back in time when time hands are
switched after a long period of time (a large delta).  Just before the
switch the time would be calculated with a large delta from
th_offset_count in bintime_off.  tc_windup does the switch using its own
calculations of a new th_offset using the large delta.  As explained
earlier, the new th_offset may end up being less than the previously
produced binuptime.  So, for a period of time new binuptime values may
be "back in time" compared to values just before the switch.

Such a jump must never happen.  All the code assumes that the uptime is
monotonically nondecreasing and some code works incorrectly when that
assumption is broken.  For example, we have observed sleepq_timeout()
ignoring a timeout when the sbinuptime value obtained by the callout
code was greater than the expiration value, but the sbinuptime obtained
in sleepq_timeout() was less than it.  In that case the target thread
would never get woken up.

The unified calculations should ensure the monotonic property of the
uptime.
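
As a rough illustration of why a single shared conversion helps, the
sketch below converts a tick delta to a seconds/fraction pair with one
full-width multiply.  It is not the kernel's code: struct toy_bintime
and toy_delta_to_bintime() are hypothetical names, and scale is assumed
to be approximately 2^64 / frequency, in the spirit of th_scale.  If
every caller goes through the same helper, the same delta always yields
the same result, and a larger delta never yields a smaller one.

```c
#include <stdint.h>

/* Hypothetical stand-in for struct bintime: seconds plus a 64-bit fraction. */
struct toy_bintime {
	uint64_t	sec;
	uint64_t	frac;
};

/*
 * Compute scale * delta as a 96-bit product without overflow by splitting
 * scale into 32-bit halves.  Bits 64 and up of the product are whole
 * seconds; the low 64 bits are the binary fraction of a second.
 */
struct toy_bintime
toy_delta_to_bintime(uint64_t scale, uint32_t delta)
{
	struct toy_bintime bt;
	uint64_t lo, hi, mid;

	lo = (scale & 0xffffffffULL) * delta;	/* low half times delta */
	hi = (scale >> 32) * delta;		/* high half times delta */
	mid = hi + (lo >> 32);			/* product bits 32..95 */

	bt.sec = mid >> 32;
	bt.frac = (mid << 32) | (lo & 0xffffffffULL);
	return (bt);
}
```

With a 1024 Hz counter and scale = 2^54, a delta of 1536 ticks gives
one second plus a fraction of 2^63, that is, 1.5 seconds.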

The problem is quite rare as normally tc_windup should be called HZ
times per second (typically 1000 or 100).  But it may happen in VMs on
very busy hypervisors where a VM's virtual CPU may not get an execution
time slot for a second or more.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	Panzura LLC
2021-11-30 15:23:23 +02:00
Gordon Bergling
b6f4818a7e vfs: Fix a typo in a sysctl description
- s/dependecies/dependencies/

MFC after:	3 days
2021-11-30 07:28:40 +01:00
Brooks Davis
0e765d9b08 syscalls: regen 2021-11-29 22:04:58 +00:00
Brooks Davis
6d37a1670b syscalls: mprotect does not take a const
The mprotect syscall declaration is not const.  I added this one
incorrectly in a944d28d0e.

Reported by:	kib
Reviewed by:	kib, imp
2021-11-29 22:04:47 +00:00
Brooks Davis
401eec3635 syscalls: regen 2021-11-29 22:04:44 +00:00
Brooks Davis
a8efd4d1b3 syscalls: make syscall and __syscall SYSMUX
Rather than combining the declaration of nosys with the registration
of SYS_syscall, declare syscall(2) and __syscall(2) with the new
SYSMUX type in syscalls.master and declare nosys directly.  This
eliminates the last use of syscall aliases in the tree.

Reviewed by:	kib, imp
2021-11-29 22:04:44 +00:00
Brooks Davis
d7f306c5be makesyscalls: add a new SYSMUX type
This type is for system call multiplexers (syscall(2), __syscall(2))
that don't have a normal handler and instead are handled in the
machine-dependent syscall code.

Reviewed by:	kib, imp
2021-11-29 22:04:43 +00:00
Brooks Davis
5c1835b1d4 syscalls: regen 2021-11-29 22:04:43 +00:00
Brooks Davis
cffb55f0f3 syscalls: normalize exit
Declare the exit system call normally.  This results in the
implementation being named sys_exit rather than sys_sys_exit and
being declared as returning an int.  In fact it does not return
at all because exit1 does not, so add an __unreachable() to let the
compiler know that.

Reviewed by:	kib, imp
2021-11-29 22:04:43 +00:00
Brooks Davis
7fb006e7d6 syscalls: regen 2021-11-29 22:04:42 +00:00
Brooks Davis
638c5fa8df syscalls: normalize (get|set)rlimit
Declare normal <foo>_args structs rather than going out of the way
to declare __<foo>_args.

Reviewed by:	kib, imp
2021-11-29 22:04:42 +00:00
Brooks Davis
c2996f8ad9 syscalls: regen 2021-11-29 22:04:42 +00:00
Brooks Davis
ba4e5253a3 syscalls: normalize orecvfrom and ogetsockname
Declare o<foo>_args rather than reusing the equivalent <foo>_args
structs.  Avoiding the addition of a new type isn't worth the
gratuitous differences.

Reviewed by:	kib, imp
2021-11-29 22:04:42 +00:00
Brooks Davis
28f0471884 uipc: rework recvfrom, getsockname, getpeername
Stop using <foo>_args structs as part of internal kernel APIs.  Add
a kern_recvfrom and adjust getsockname and getpeername's equivalent
functions to take individual arguments rather than a uap pointer.

Adopt a convention from CheriBSD that a function interacting with
userspace pointers and sitting between the sys_<foo> syscall and
kern_<foo> implementation is named user_<foo>.
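
The layering can be sketched with a toy call.  Everything here is
hypothetical (toy_getname is not a real syscall) and memcpy() stands in
for copyin()/copyout(); the sketch only shows which layer touches the
args struct, which one touches user pointers, and which one sees
neither.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical argument struct in the style of <foo>_args. */
struct toy_getname_args {
	int	fd;
	char	*uname;		/* "userspace" output buffer */
	size_t	*ulen;		/* "userspace" in/out length */
};

/* kern_<foo>: works on kernel data only; no args struct, no user pointers. */
int
kern_toy_getname(int fd, char *kbuf, size_t *klen)
{
	static const char name[] = "toy0";

	if (fd < 0)
		return (EBADF);
	if (*klen > sizeof(name))
		*klen = sizeof(name);
	memcpy(kbuf, name, *klen);
	return (0);
}

/* user_<foo>: takes individual arguments and crosses the user boundary. */
int
user_toy_getname(int fd, char *uname, size_t *ulen)
{
	char kbuf[16];
	size_t len;
	int error;

	memcpy(&len, ulen, sizeof(len));	/* stands in for copyin() */
	if (len > sizeof(kbuf))
		len = sizeof(kbuf);
	error = kern_toy_getname(fd, kbuf, &len);
	if (error != 0)
		return (error);
	memcpy(uname, kbuf, len);		/* stands in for copyout() */
	memcpy(ulen, &len, sizeof(len));
	return (0);
}

/* sys_<foo>: the only layer that knows about the args struct. */
int
sys_toy_getname(struct toy_getname_args *uap)
{
	return (user_toy_getname(uap->fd, uap->uname, uap->ulen));
}
```

Other in-kernel callers in this sketch would call kern_toy_getname()
directly and never build a toy_getname_args.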

Reviewed by:	kib, imp
2021-11-29 22:04:41 +00:00
Brooks Davis
3660e76a22 syscalls: correct a couple style issues
Reviewed by:	kib, imp
2021-11-29 22:04:41 +00:00
Brooks Davis
33f9ea209e syscalls: add missing SAL annotations
freebsd7_shmctl was missing an annotation

Reviewed by:	kib, imp
2021-11-29 22:04:41 +00:00
Konstantin Belousov
08bb51f8d6 shutdown: unmount filesystems after swapoff
Swap on file requires operational underlying mount, otherwise
swapoff_all() is guaranteed to panic due to the default strategy VOP for
reclaimed vnodes.

Reported and tested by:	peterj
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D33147
2021-11-29 18:38:02 +02:00
Konstantin Belousov
4f924a786a linker_kldload_busy(): allow recursion
Some drivers recursively load modules by explicit calls to kldload
during initialization, which might occur during kldload.

PR:	259748
Reported and tested by:	thj
Reviewed by:	markj
Sponsored by:	Nvidia networking
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D32972
2021-11-28 10:36:09 +02:00
Mateusz Guzik
e511bd1406 vfs: fully lockless v_writecount adjustment
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D33128
2021-11-27 23:07:26 +00:00
Mateusz Guzik
4dcdf3987c vfs: replace the MNTK_TEXT_REFS flag with VIRF_TEXT_REF
This makes it possible to stop maintaining the VI_TEXT_REF flag and
consequently opens up fully lockless v_writecount adjustment.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D33127
2021-11-27 23:07:25 +00:00
Mateusz Guzik
054f5815c5 vfs: plug a set-but-not-used var in kern_alternate_path
Sponsored by:	Rubicon Communications, LLC ("Netgate")
2021-11-26 12:22:09 +00:00
Mateusz Guzik
3ffcfa599e vfs: add vop_stdadd_writecount_nomsync
This avoids needing to inspect the mount point every time.

Reviewed by:	kib (previous version)
Differential Revision:	https://reviews.freebsd.org/D33125
2021-11-26 12:06:08 +00:00