Commit Graph

19445 Commits

Author SHA1 Message Date
Konstantin Belousov
1b0a4974c5 thread_create(): call cpu_copy_thread() after td_pflags is zeroed
By calling the function too early we might still have the td_pflags
value cached from the previous struct thread use. cpu_copy_thread()
depends on correct value for TDP_KTHREAD at least on x86.

Reported, bisected, and tested by:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D36069
2022-08-08 19:44:17 +03:00
Gordon Bergling
fa1ac9693a vnode(9): Fix a typo in a source code comment
- s/paramater/parameter/

MFC after:	3 days
2022-08-07 16:08:43 +02:00
Ed Maste
f0687f3e0e Clarify code comments on ASLR default settings
Sponsored by:	The FreeBSD Foundation
2022-08-05 10:01:16 -04:00
Mark Johnston
d07675a935 file: Move code to share fdtol structs into kern_descrip.c
This ensures the filedesc-to-leader code is consistently encapsulated in
kern_descrip.c.

No functional change intended.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D35988
2022-08-04 09:39:25 -04:00
Konstantin Belousov
c53fec7603 sig_suspend_threads(): remove 'sending' arg
The TDA_AST flag is set on td2 unconditionally (as TDF_ASTPENDING was
before the AST rework), so the argument has been effectively unused for
some time.

Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D36033
2022-08-03 16:56:23 +03:00
Konstantin Belousov
f2fd7d8bfc ast_sig(): add missed TDAI()
The mask checked was completely wrong.

Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D36033
2022-08-03 16:56:23 +03:00
Mark Johnston
852695416c domain: Use designated constants for timeout periods
No functional change intended.

MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2022-08-02 20:31:29 -04:00
Konstantin Belousov
4a662c9064 ktrace: change AST handler to require AST flag set
When it was inline it made sense to depend on the existing nested check
in KTRUSERRET() rather than adding a new td_flags flag.  However, since
we now have a TDA_KTRACE flag anyway, we might as well check it and
avoid the call.

Suggested by:	jhb
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35888
2022-08-02 21:11:10 +03:00
Konstantin Belousov
c46771a7b7 kern/subr_trap.c: cleanup no longer needed headers
Also bump the Foundation's copyright year.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35888
2022-08-02 21:11:10 +03:00
Konstantin Belousov
cc1ec77231 Adjust g_waitidle() visibility and definition
Explicitly pass the struct thread argument.
Move the function prototype from sys/systm.h to geom/geom.h.  Almost no
kernel source needs to see the prototype: outside geom/geom_event.c,
where the function is defined, it is now used only by
kern/vfs_mountroot.c.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35888
2022-08-02 21:11:10 +03:00
Konstantin Belousov
4fced8642f sigfastblock_setpend() and fastblock_mask can be static now
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35888
2022-08-02 21:11:10 +03:00
Konstantin Belousov
c6d31b8306 AST: rework
Make most AST handlers dynamically registered.  This allows
subsystem-specific handler source to live in the subsystem files,
instead of making subr_trap.c aware of it.  For instance, signal
delivery code on return to userspace is now moved to kern_sig.c.

It also allows some handlers to be designated as the cleanup (kclear)
type, which are called both at AST and on thread/process exit.  For
instance, ast(), exit1(), and the NFS server no longer need to be aware
of UFS softdep processing.

The dynamic registration also allows third-party modules to register AST
handlers if needed.  There is one caveat with loadable modules: the
code makes no effort to ensure that a module is not unloaded before
all threads have been processed through the AST handlers in it.  In fact,
this is already the existing behavior for hwpmc.ko and ufs.ko.  I do not
think it is worth the effort and the runtime overhead to try to fix it.

Reviewed by:	markj
Tested by:	emaste (arm64), pho
Discussed with:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35888
2022-08-02 21:11:09 +03:00
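Note on c6d31b8306: a minimal userland sketch of what "dynamically registered" handlers can look like, assuming a fixed table of slots and a per-thread pending bitmask. Every name here (sk_thread, sk_ast_register, sk_ast_sched, sk_ast_run) is hypothetical and is not the kernel's interface; the point is only that the dispatcher knows nothing about the subsystems that register with it.

```c
#include <assert.h>
#include <stddef.h>

#define SK_AST_MAX 8	/* hypothetical number of handler slots */

struct sk_thread {
	unsigned pending;	/* bitmask of scheduled handler slots */
	int sig_delivered;	/* state touched by the demo handler */
};

typedef void (*sk_ast_fn)(struct sk_thread *);

static sk_ast_fn sk_handlers[SK_AST_MAX];

/* A subsystem installs its handler from its own source file. */
static void
sk_ast_register(int slot, sk_ast_fn fn)
{
	assert(slot >= 0 && slot < SK_AST_MAX);
	sk_handlers[slot] = fn;
}

/* Ask for the handler in 'slot' to run on the next dispatch. */
static void
sk_ast_sched(struct sk_thread *td, int slot)
{
	assert(slot >= 0 && slot < SK_AST_MAX);
	td->pending |= 1u << slot;
}

/* Generic dispatcher: runs every pending handler, returns how many ran. */
static int
sk_ast_run(struct sk_thread *td)
{
	int ran = 0;

	for (int slot = 0; slot < SK_AST_MAX; slot++) {
		if ((td->pending & (1u << slot)) == 0)
			continue;
		td->pending &= ~(1u << slot);
		if (sk_handlers[slot] != NULL) {
			sk_handlers[slot](td);
			ran++;
		}
	}
	return (ran);
}

/* Demo handler standing in for signal delivery code. */
static void
sk_sig_handler(struct sk_thread *td)
{
	td->sig_delivered++;
}
```

A caller would register sk_sig_handler in some slot once, call sk_ast_sched() when work is pending, and let sk_ast_run() pick it up. The unload caveat from the commit message applies to any such table: nothing here stops a handler pointer from outliving its module.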
Alexander V. Chernikov
be1f485d7d sockets: add MSG_TRUNC flag handling for recvfrom()/recvmsg().
Implement the Linux variant of the MSG_TRUNC input flag used in recv(), recvfrom() and recvmsg().
POSIX defines MSG_TRUNC as an output flag, indicating packet/datagram truncation.
Linux extended it a while (~15+ years) ago to act as an input flag,
resulting in returning the full packet size regardless of the input
buffer size.
It's a (relatively) popular pattern to do recvmsg(MSG_PEEK | MSG_TRUNC) to get the
packet size, allocate the buffer and issue another call to fetch the packet.
In particular, it's popular in userland netlink code, which is the primary driving factor of this change.

This commit implements the MSG_TRUNC support for SOCK_DGRAM sockets (udp, unix and all soreceive_generic() users).

PR:		kern/176322
Reviewed by:	pauamma(doc)
Differential Revision: https://reviews.freebsd.org/D35909
MFC after:	1 month
2022-07-30 18:21:51 +00:00
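Note on be1f485d7d: the peek-then-fetch pattern described in the message can be sketched as follows, assuming a datagram socket on a system that honors MSG_TRUNC as an input flag. The helper names recv_whole_datagram() and demo_roundtrip() are made up for this illustration. On a system without the input-flag semantics, the first recv() would return at most the probe buffer size and the sketch would read a truncated datagram.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Read one whole datagram from fd into a freshly allocated buffer.
 * Returns the datagram length or -1 on error; the caller frees *out.
 */
static ssize_t
recv_whole_datagram(int fd, char **out)
{
	char probe;
	ssize_t len, got;

	/* Peek: with MSG_TRUNC as input, the result is the full size. */
	len = recv(fd, &probe, sizeof(probe), MSG_PEEK | MSG_TRUNC);
	if (len < 0)
		return (-1);

	*out = malloc(len > 0 ? (size_t)len : 1);
	if (*out == NULL)
		return (-1);

	/* The second call dequeues the datagram into the sized buffer. */
	got = recv(fd, *out, (size_t)len, 0);
	if (got != len) {
		free(*out);
		*out = NULL;
		return (-1);
	}
	return (got);
}

/* Hypothetical demo: send n zero bytes over a socketpair, read them back. */
static ssize_t
demo_roundtrip(size_t n)
{
	int sv[2];
	char *msg, *buf = NULL;
	ssize_t got = -1;

	assert(n > 0);
	if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) != 0)
		return (-1);
	msg = calloc(1, n);
	if (msg != NULL && send(sv[0], msg, n, 0) == (ssize_t)n)
		got = recv_whole_datagram(sv[1], &buf);
	free(msg);
	free(buf);
	close(sv[0]);
	close(sv[1]);
	return (got);
}
```

Calling demo_roundtrip(300) should report 300 bytes received where the flag is supported, even though the probe buffer is a single byte.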
John Baldwin
ea8f128c7c pmap_mapdev: Consistently use vm_paddr_t for the first argument.
The devmap variants used vm_offset_t for some reason, and a few places
explicitly cast bus addresses to vm_offset_t.  (Probably those casts
along with similar casts for vm_size_t should just be removed and
instead permit the compiler to DTRT.)

Reviewed by:	markj
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D35961
2022-07-28 15:55:10 -07:00
Dimitry Andric
a387bd1b6a Adjust function definition in vfs_bio.c to avoid clang 15 warnings
With clang 15, the following -Werror warning is produced:

    sys/kern/vfs_bio.c:3430:11: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
    buf_daemon()
              ^
               void

This is because buf_daemon() is declared with a (void) argument list,
but defined with an empty argument list. Make the definition match the
declaration.

MFC after:	3 days
2022-07-26 19:59:57 +02:00
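Note on a387bd1b6a and the similar clang 15 commits that follow: a small sketch of the mismatch being fixed. The function name example_daemon_ticks() is invented for illustration and is not from the tree.

```c
/* Declaration with a prototype: the function takes no arguments. */
int example_daemon_ticks(void);

/*
 * Writing the definition as "int example_daemon_ticks() { ... }" would
 * leave it without a prototype in pre-C23 C, which is the form that
 * -Wstrict-prototypes flags.  Spelling out (void) makes the definition
 * match the declaration above.
 */
int
example_daemon_ticks(void)
{
	return (42);
}
```

The change is purely a spelling fix on the definition; callers and generated code are expected to stay the same.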
Dimitry Andric
78cfed2de7 Adjust function definitions in sysv_msg.c to avoid clang 15 warnings
With clang 15, the following -Werror warnings are produced:

    sys/kern/sysv_msg.c:213:8: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
    msginit()
           ^
            void
    sys/kern/sysv_msg.c:316:10: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
    msgunload()
             ^
              void

This is because msginit() and msgunload() are declared with (void)
argument lists, but defined with empty argument lists. Make the
definitions match the declarations.

MFC after:	3 days
2022-07-26 19:59:57 +02:00
Dimitry Andric
b54e962aca Adjust function definition in subr_bus.c to avoid clang 15 warnings
With clang 15, the following -Werror warning is produced:

    sys/kern/subr_bus.c:871:16: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
    bus_topo_assert()
                   ^
                    void

This is because bus_topo_assert() is declared with a (void) argument
list, but defined with an empty argument list. Make the definition match
the declaration.

MFC after:	3 days
2022-07-26 19:59:57 +02:00
Dimitry Andric
3c8f0790dd Adjust function definition in subr_autoconf.c to avoid clang 15 warnings
With clang 15, the following -Werror warning is produced:

    sys/kern/subr_autoconf.c:119:34: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
    run_interrupt_driven_config_hooks()
                                     ^
                                      void

This is because run_interrupt_driven_config_hooks() is declared with a
(void) argument list, but defined with an empty argument list. Make the
definition match the declaration.

MFC after:	3 days
2022-07-26 19:59:57 +02:00
Dimitry Andric
f2eb09b089 Adjust function definitions in kern_resource.c to avoid clang 15 warnings
With clang 15, the following -Werror warnings are produced:

    sys/kern/kern_resource.c:1212:10: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
    lim_alloc()
             ^
              void
    sys/kern/kern_resource.c:1365:11: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
    uihashinit()
              ^
               void

This is because lim_alloc() and uihashinit() are declared with (void)
argument lists, but defined with empty argument lists. Make the
definitions match the declarations.

MFC after:	3 days
2022-07-26 19:59:57 +02:00
Dimitry Andric
db8ea61ae2 Adjust function definitions in kern_dtrace.c to avoid clang 15 warnings
With clang 15, the following -Werror warnings are produced:

    sys/kern/kern_dtrace.c:64:18: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
    kdtrace_proc_size()
                     ^
                      void
    sys/kern/kern_dtrace.c:87:20: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
    kdtrace_thread_size()
                       ^
                        void

This is because kdtrace_proc_size() and kdtrace_thread_size() are
declared with (void) argument lists, but defined with empty argument
lists. Make the definitions match the declarations.

MFC after:	3 days
2022-07-26 19:59:57 +02:00
Dimitry Andric
9806e82a23 Adjust function definitions in kern_cons.c to avoid clang 15 warnings
With clang 15, the following -Werror warnings are produced:

    sys/kern/kern_cons.c:201:14: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
    cninit_finish()
                 ^
                  void
    sys/kern/kern_cons.c:376:7: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
    cngrab()
          ^
           void
    sys/kern/kern_cons.c:389:9: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
    cnungrab()
            ^
             void
    sys/kern/kern_cons.c:402:9: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
    cnresume()
            ^
             void

This is because cninit_finish(), cngrab(), cnungrab(), and cnresume()
are declared with (void) argument lists, but defined with empty argument
lists. Make the definitions match the declarations.

MFC after:	3 days
2022-07-26 19:59:56 +02:00
Ka Ho Ng
8c9aa94b42 Convert runtime param checks to KASSERTs for fo_fspacectl
Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D35880
2022-07-23 15:16:23 -04:00
Colin Percival
84ec7df0d7 Add kern.reboot_wait_time sysctl
Historic FreeBSD behaviour (dating back to 1994-04-02) when rebooting
is to print "Rebooting..." and then
	/* wait 1 sec for printf's to complete and be read */

Prior to April 1994, there was a 100 ms delay (added 1993-11-12).

Since (a) most users will already be aware that the system is rebooting
and do not need to take time to read an additional message to that
effect, and (b) most FreeBSD systems don't have anyone actively looking
at the console anyway, this delay no longer serves much purpose.

This commit adds a kern.reboot_wait_time sysctl which defaults to 0;
historic behaviour can be regained by setting it to 1.

Reviewed by:	imp
Relnotes:	FreeBSD now reboots faster; to restore the traditional
		wait after printing "Rebooting..." to the console, set
		kern.reboot_wait_time=1 (or more).
Sponsored by:	https://www.patreon.com/cperciva
Differential Revision:	https://reviews.freebsd.org/D35796
2022-07-18 17:23:25 -07:00
Mitchell Horne
2449b9e5fe mac: kdb/ddb framework hooks
Add three simple hooks to the debugger allowing for a loaded MAC policy
to intervene if desired:
 1. Before invoking the kdb backend
 2. Before ddb command registration
 3. Before ddb command execution

We extend struct db_command with a private pointer and two flag bits
reserved for policy use.

Reviewed by:	markj
Sponsored by:	Juniper Networks, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D35370
2022-07-18 22:06:13 +00:00
Mitchell Horne
c84c5e00ac ddb: annotate some commands with DB_CMD_MEMSAFE
This is not completely exhaustive, but covers a large majority of
commands in the tree.

Reviewed by:	markj
Sponsored by:	Juniper Networks, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D35583
2022-07-18 22:06:09 +00:00
Mark Johnston
bd980ca847 sched_ule: Ensure we hold the thread lock when modifying td_flags
The load balancer may force a running thread to reschedule and pick a
new CPU.  To do this it sets some flags in the thread running on a
loaded CPU.  But the code assumed that a running thread's lock is the
same as that of the corresponding runqueue, and there are small windows
where this is not true.  In this case, we can end up with non-atomic
modifications to td_flags.

Since this load balancing is best-effort, simply give up if the thread's
lock doesn't match; in this case the thread is about to enter the
scheduler anyway.

Reviewed by:	kib
Reported by:	glebius
Fixes:		e745d729be ("sched_ule(4): Improve long-term load balancer.")
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D35821
2022-07-18 15:52:27 -04:00
Kornel Dulęba
939f0b6323 Implement shared page address randomization
It used to be mapped at the top of the UVA.
If the randomization is enabled any address above .data section will be
randomly chosen and a guard page will be inserted in the shared page
default location.
The shared page is now mapped in exec_map_stack, instead of
exec_new_vmspace. The latter function is called before image activator
has a chance to parse ASLR related flags.
The KERN_PROC_VM_LAYOUT sysctl was extended to provide shared page
address.
The feature is enabled by default for 64 bit applications on all
architectures.
It can be toggled with the kern.elf64.aslr.shared_page sysctl.

Approved by:	mw(mentor)
Sponsored by:	Stormshield
Obtained from:	Semihalf
Reviewed by:	kib
Differential Revision: https://reviews.freebsd.org/D35349
2022-07-18 16:27:37 +02:00
Kornel Dulęba
361971fbca Rework how shared page related data is stored
Store the shared page address in struct vmspace.
Also instead of storing absolute addresses of various shared page
segments save their offsets with respect to the shared page address.
This will be more useful when the shared page address is randomized.

Approved by:	mw(mentor)
Sponsored by:	Stormshield
Obtained from:	Semihalf
Reviewed by:	kib
Differential Revision: https://reviews.freebsd.org/D35393
2022-07-18 16:27:32 +02:00
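Note on 361971fbca: a sketch of why offsets are more convenient than absolute addresses once the base can differ per process. The structures sk_sysent and sk_vmspace and the 4096-byte page size are assumptions made for this illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

#define SK_PAGE_SIZE 4096u	/* assumed page size for the sketch */

/* Per-ABI data, shared by every process using that ABI. */
struct sk_sysent {
	uintptr_t sigcode_offset;	/* offset within the shared page */
};

/* Per-process data: where this process has the shared page mapped. */
struct sk_vmspace {
	uintptr_t shp_base;
};

/* Resolve the sigcode address for one process. */
static uintptr_t
sk_sigcode_addr(const struct sk_vmspace *vm, const struct sk_sysent *sv)
{
	assert(sv->sigcode_offset < SK_PAGE_SIZE);
	return (vm->shp_base + sv->sigcode_offset);
}
```

Two processes with different shp_base values can share one sk_sysent, which would not work if the per-ABI structure stored absolute addresses.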
Kornel Dulęba
f6ac79fb12 Introduce the PROC_SIGCODE() macro
Use a getter macro instead of fetching the sigcode address directly
from a sysent of a given process. It assumes that the sigcode is stored
in the shared page, which is true in all cases, except for a.out
binaries. This will be later useful when the shared page address
randomization is introduced.
No functional change intended.

Approved by:	mw(mentor)
Sponsored by:	Stormshield
Obtained from:	Semihalf
Reviewed by:	kib
Differential Revision: https://reviews.freebsd.org/D35392
2022-07-18 16:27:26 +02:00
Mark Johnston
46eab86035 callout: Simplify the inner loop in callout_process() a bit
- Use LIST_FOREACH_SAFE.
- Simplify control flow.

No functional change intended.

MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2022-07-17 13:58:19 -04:00
Mark Johnston
aac7c7ac54 callout: Remove a redundant parameter to callout_cc_add()
The passed cpuid is always equal to the one stored in the callout
structure.  No functional change intended.

MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2022-07-17 13:58:19 -04:00
Mateusz Guzik
6eeba7dbd6 ule: unbreak UP builds
Sponsored by:	Rubicon Communications, LLC ("Netgate")
2022-07-16 12:45:09 +00:00
Dmitry Chagin
fc90f3a281 ktrace: Increase precision of timestamps.
Replace struct timeval in header with struct timespec.
To differentiate header formats, add a new KTR_VERSIONED flag
set in the header type field similar to the existing KTRDROP flag.

To make it easier to extend ktrace headers in the future,
extend the existing header with a version field (version 0 is
reserved for older records without KTR_VERSIONED) as well as
new fields holding the thread ID and CPU ID.

Reviewed by:		jhb, pauamma
Differential Revision:	https://reviews.freebsd.org/D35774
MFC after:		2 weeks
2022-07-16 12:46:12 +03:00
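Note on fc90f3a281: a sketch of how a reader of trace records might tell the two header formats apart from flag bits in the type field. The constant values and the sk_ names are illustrative assumptions; the real definitions live in sys/ktrace.h and may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative values only, not copied from the kernel headers. */
#define SK_DROP		0x8000u	/* "records were dropped" flag bit */
#define SK_VERSIONED	0x4000u	/* header carries a version field */
#define SK_FLAGMASK	(SK_DROP | SK_VERSIONED)

static_assert((SK_DROP & SK_VERSIONED) == 0, "flag bits must not overlap");

struct sk_decoded {
	unsigned type;	/* record type with the flag bits stripped */
	int versioned;	/* nonzero: new-style header follows */
	int dropped;	/* nonzero: some earlier records were lost */
};

/* Split the raw type field into its record type and flag bits. */
static struct sk_decoded
sk_decode_type(uint16_t raw)
{
	struct sk_decoded d;

	d.type = raw & ~SK_FLAGMASK;
	d.versioned = (raw & SK_VERSIONED) != 0;
	d.dropped = (raw & SK_DROP) != 0;
	return (d);
}
```

A reader would branch on d.versioned to decide whether to parse the old timeval-based header or the newer one with a version, timespec, thread ID and CPU ID.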
John Baldwin
2cf7870864 Collapse interrupt thread priorities.
Allow high priority hardware interrupts to run at PI_REALTIME via
INTR_TYPE_CLK, but collapse all other hardware interrupt threads to
the next priority level (PI_INTR).  Collapse all SWI priorities to
the same priority level (PI_SOFT) just below PI_INTR.

Reviewed by:	kib, markj
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D35646
2022-07-14 13:14:33 -07:00
John Baldwin
40efe74352 4bsd: Simplistic time-sharing for interrupt threads.
If an interrupt thread runs for a full quantum without yielding the
CPU, demote its priority and schedule a preemption to give other
ithreads a turn.

Reviewed by:	kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D35645
2022-07-14 13:14:17 -07:00
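Note on 40efe74352 (and its ULE counterpart 954cffe95d): a sketch of the demote-after-a-full-quantum idea, together with the reset on wakeup that ed998d1c24 describes. The quantum length, the priority ceiling, and all sk_ names are assumptions for illustration; numerically larger priorities are treated as less urgent.

```c
#include <assert.h>
#include <stdbool.h>

#define SK_QUANTUM_TICKS 10	/* hypothetical slice length in ticks */
#define SK_PRI_MAX 20		/* hypothetical least-urgent ithread priority */

struct sk_ithread {
	int base_pri;		/* priority assigned when the ithread was set up */
	int pri;		/* current, possibly demoted, priority */
	int slice_ticks;	/* ticks consumed in the current slice */
	bool need_preempt;	/* set when other ithreads should get a turn */
};

/* Called once per clock tick while the ithread is running. */
static void
sk_ithread_tick(struct sk_ithread *it)
{
	if (++it->slice_ticks < SK_QUANTUM_TICKS)
		return;
	/* A full quantum passed without yielding: demote and preempt. */
	it->slice_ticks = 0;
	if (it->pri < SK_PRI_MAX)
		it->pri++;
	it->need_preempt = true;
}

/* Called when the ithread resumes from idle. */
static void
sk_ithread_wakeup(struct sk_ithread *it)
{
	assert(it->base_pri <= SK_PRI_MAX);
	it->pri = it->base_pri;
	it->slice_ticks = 0;
	it->need_preempt = false;
}
```

An ithread that yields before the quantum expires never reaches the demotion branch, so only handlers that monopolize the CPU lose priority.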
John Baldwin
954cffe95d ule: Simplistic time-sharing for interrupt threads.
If an interrupt thread runs for a full quantum without yielding the
CPU, demote its priority and schedule a preemption to give other
ithreads a turn.

Reviewed by:	kib, markj
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D35644
2022-07-14 13:13:57 -07:00
John Baldwin
ed998d1c24 ithreads: Support priority adjustment by schedulers.
Use sched_wakeup instead of sched_add when marking an ithread
runnable.  This allows schedulers to reset their internal time slice
tracking state and restore the base ithread priority when an ithread
resumes from idle.

Reviewed by:	markj
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D35643
2022-07-14 13:13:35 -07:00
John Baldwin
fea89a2804 Add sched_ithread_prio to set the base priority of an interrupt thread.
Use it instead of sched_prio when setting the priority of an interrupt
thread.

Reviewed by:	kib, markj
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D35642
2022-07-14 13:13:10 -07:00
Mark Johnston
6cbc4ceb7a sched_ule: Use the correct atomic_load variant for tdq_lowpri
Reported by:	tuexen
Fixes:	11484ad8a2 ("sched_ule: Use explicit atomic accesses for tdq fields")
2022-07-14 15:34:02 -04:00
Mark Johnston
11484ad8a2 sched_ule: Use explicit atomic accesses for tdq fields
Different fields in the tdq have different synchronization protocols.
Some are constant, some are accessed only while holding the tdq lock,
some are modified with the lock held but accessed without the lock, some
are accessed only on the tdq's CPU, and some are not synchronized by the
lock at all.

Convert ULE to stop using volatile and instead use atomic_load_* and
atomic_store_* to provide the desired semantics for lockless accesses.
This makes the intent of the code more explicit, gives more freedom to
the compiler when accesses do not need to be qualified, and lets KCSAN
intercept unlocked accesses.

Thus:
- Introduce macros to provide unlocked accessors for certain fields.
- Use atomic_load/store for all accesses of tdq_cpu_idle, which is not
  synchronized by the mutex.
- Use atomic_load/store for accesses of the switch count, which is
  updated by sched_clock().
- Add some comments to fields of struct tdq describing how accesses are
  synchronized.

No functional change intended.

Reviewed by:	mav, kib
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D35737
2022-07-14 10:45:33 -04:00
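Note on 11484ad8a2: a userland analogue of replacing volatile fields with explicit atomic accessors. The kernel uses its own atomic_load_* and atomic_store_* primitives; this sketch substitutes C11 <stdatomic.h> with relaxed ordering, and the structure and accessor names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical per-CPU queue; each field notes how it is synchronized. */
struct sk_tdq {
	int load;		/* accessed only with the queue lock held */
	atomic_int cpu_idle;	/* not synchronized by the lock */
	atomic_uint switchcnt;	/* single writer (clock tick), many readers */
};

/* Unlocked accessors keep every lockless access visible at the call site. */
static inline int
sk_tdq_cpu_idle(struct sk_tdq *q)
{
	return (atomic_load_explicit(&q->cpu_idle, memory_order_relaxed));
}

static inline void
sk_tdq_set_cpu_idle(struct sk_tdq *q, int idle)
{
	atomic_store_explicit(&q->cpu_idle, idle, memory_order_relaxed);
}

static inline unsigned
sk_tdq_switchcnt(struct sk_tdq *q)
{
	return (atomic_load_explicit(&q->switchcnt, memory_order_relaxed));
}

/* Single-writer update: a load followed by a store is sufficient here. */
static inline void
sk_tdq_tick(struct sk_tdq *q)
{
	atomic_store_explicit(&q->switchcnt, sk_tdq_switchcnt(q) + 1,
	    memory_order_relaxed);
}

/* Small demo: returns the switch count after two ticks. */
static unsigned
sk_atomic_demo(void)
{
	struct sk_tdq q;

	q.load = 0;
	atomic_init(&q.cpu_idle, 0);
	atomic_init(&q.switchcnt, 0);
	assert(sk_tdq_cpu_idle(&q) == 0);
	sk_tdq_set_cpu_idle(&q, 1);
	sk_tdq_tick(&q);
	sk_tdq_tick(&q);
	return (sk_tdq_switchcnt(&q));
}
```

Fields such as load stay plain because every access happens under the lock, which leaves the compiler free to optimize them, while the atomic fields document the lockless protocol in their type.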
Mark Johnston
0927ff7814 sched_ule: Enable preemption of curthread in the load balancer
The load balancer executes from statclock and periodically tries to move
threads among CPUs in order to balance load.  It may move a thread to
the current CPU (the load balancer always runs on CPU 0).  When it
does so, it may need to schedule preemption of the interrupted thread.
Use sched_setpreempt() to do so, same as sched_add().

PR:		264867
Reviewed by:	mav, kib, jhb
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D35744
2022-07-14 10:27:58 -04:00
Mark Johnston
6d3f74a14a sched_ule: Fix racy loads of pc_curthread
Thread switching used to be atomic with respect to the current CPU's
tdq lock.  Since commit 686bcb5c14 that is no longer the case.  Now
sched_switch() does this:

1.  lock tdq (might already be locked)
2.  maybe put the current thread in the tdq, choose a new thread to run
2a. update tdq_lowpri
3.  unlock tdq
4.  switch CPU context, update curthread

Some code paths in ULE will load pc_curthread from a remote CPU with
that CPU's tdq lock held, usually to inspect its priority.  But, as of
the aforementioned commit this is racy.

The problem I noticed is in tdq_notify(), which optionally sends an IPI
to a remote CPU when a new thread is added to its runqueue.  If the new
thread's priority is higher (lower) than the currently running thread's
priority, then we deliver an IPI.  But inspecting
pc_curthread->td_priority doesn't work, since pc_curthread might be
between steps 3 and 4 above.  If pc_curthread's priority is higher than
that of the newly added thread, but pc_curthread is switching to a
lower-priority thread, then tdq_notify() might fail to deliver an IPI,
leaving a high priority thread stuck on the runqueue for longer than it
should.  This can cause multi-millisecond stalls in
interactive/ithread/realtime threads.

Fix this problem by modifying tdq_add() and tdq_move() to return the
value of tdq_lowpri before the addition of the new thread.  This ensures
that tdq_notify() has the correct priority value to compare against.

The other two uses of pc_curthread are susceptible to the same race.  To
fix the one in sched_rem()->tdq_setlowpri() we need to have an exact
value for curthread.  Thus, introduce a new tdq_curthread field to the
tdq which gets updated any time a new thread is selected to run on the
CPU.  Because this field is synchronized by the thread lock, its
priority reflects the correct lowpri value for the tdq.

PR:		264867
Fixes:		686bcb5c14 ("schedlock 4/4")
Reviewed by:	mav, kib, jhb
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D35736
2022-07-14 10:27:51 -04:00
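Note on 6d3f74a14a: a sketch of the first fix described above, where the add operation returns the queue's previous lowest priority so that the notify step compares against a value read under the lock instead of peeking at the remote CPU's current thread. The sk_ names and the 0..255 priority range are assumptions; numerically lower means more urgent.

```c
#include <assert.h>
#include <stdbool.h>

#define SK_PRI_IDLE 255	/* hypothetical least-urgent priority */

struct sk_tdq {
	int lowpri;	/* most urgent priority known for this CPU */
};

/*
 * Enqueue a thread with priority 'pri' (queue lock assumed held).
 * Returns lowpri as it was before the addition.
 */
static int
sk_tdq_add(struct sk_tdq *q, int pri)
{
	int old = q->lowpri;

	assert(pri >= 0 && pri <= SK_PRI_IDLE);
	if (pri < q->lowpri)
		q->lowpri = pri;
	return (old);
}

/* Decide whether the remote CPU needs an IPI for the new thread. */
static bool
sk_tdq_notify(int old_lowpri, int new_pri)
{
	return (new_pri < old_lowpri);
}
```

Because old_lowpri is captured while the queue is locked, it does not matter that the remote CPU may be between unlocking its queue and updating its current thread.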
Mark Johnston
ef221ff645 time: Make realitexpire() local to kern_time.c
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2022-07-13 09:57:28 -04:00
Mark Johnston
38e1d32dab callout: Simplify cpuid validation in callout_reset_sbt_on()
- Remove a flag variable.
- Convert a runtime check of the passed cpuid to a KASSERT.
- Remove the cc_inited flag.  An attempt to schedule a callout before
  SI_SUB_CPU will crash anyway since the per-CPU mutexes won't have been
  initialized, and that flag was only checked in the case where a cpuid
  was explicitly specified by the caller.

No functional change intended.

MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2022-07-13 09:47:33 -04:00
Mark Johnston
ece453d5fa eventtimer: Simplify KTR traces
Stop including the current CPU in all event messages, since it's already
saved in KTR log entries and thus is redundant.  All eventtimer traces
occur in a context where CPU migration is not possible.

MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2022-07-11 15:58:43 -04:00
Mark Johnston
a889a65ba3 eventtimer: Fix several races in the timer reload code
In handleevents(), lock the timer state before fetching the time for the
next event.  A concurrent callout_cc_add() call might be changing the
next event time, and the race can cause handleevents() to program an
out-of-date time, causing the callout to run later (by an unbounded
period, up to the idle hardclock period of 1s) than requested.

In cpu_idleclock(), call getnextcpuevent() with the timer state mutex
held, for similar reasons.  In particular, cpu_idleclock() runs with
interrupts enabled, so an untimely timer interrupt can result in a stale
next event time being programmed.  Further, an interrupt can cause
cpu_idleclock() to use a stale value for "now".

In cpu_activeclock(), disable interrupts before loading "now", so as to
avoid going backwards in time when calling handleevents().  It's ok to
leave interrupts enabled when checking "state->idle", since the race at
worst will cause handleevents() to be called unnecessarily.  But use an
atomic load to indicate that the test is racy.

PR:		264867
Reviewed by:	mav, jhb, kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D35735
2022-07-11 15:58:43 -04:00
Mark Johnston
ebb3cb6195 eventtimer: Pass a pcpu state pointer to getnext(cpu)event()
Callers have already loaded the pointer, so these functions don't need
to fetch it again.

No functional change intended.

MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2022-07-11 15:58:43 -04:00
Mark Johnston
ba71333f60 sched_ule: Fix a typo in a comment
PR:		226107
MFC after:	1 week
2022-07-11 15:58:43 -04:00
Mark Johnston
ef80894c9d sched_ule: Purge an obsolete comment
The referenced bitmask was removed in commit 62fa74d95a.

MFC after:	 1 week
Sponsored by:	The FreeBSD Foundation
2022-07-11 15:58:43 -04:00
Mark Johnston
35dd6d6cb5 sched_ule: Eliminate a superfluous local variable in tdq_move()
No functional change intended.

MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2022-07-11 15:58:43 -04:00
Gleb Smirnoff
c261510ef5 sockets: fix setsockopt(SO_RCVTIMEO) on a listening socket
MFC after:	3 weeks
2022-07-08 11:33:24 -07:00
Mitchell Horne
258958b3c7 ddb: use _FLAGS command macros where appropriate
Some command definitions were forced to use DB_FUNC in order to specify
their required flags, CS_OWN or CS_MORE. Use the new macros to simplify
these.

Reviewed by:	markj, jhb
MFC after:	3 days
Sponsored by:	Juniper Networks, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D35582
2022-07-05 11:56:55 -03:00
Gleb Smirnoff
d8596171c5 sockets: use only soref()/sorele() as socket reference count
o Retire SS_FDREF as it is basically a debug flag on top of already
  existing soref()/sorele().
o Convert SS_PROTOREF into soref()/sorele().
o Change reference model for the listen queues, see below.
o Make sofree() private.  The correct KPI to use is only sorele().
o Make soabort() respect the model and sorele() instead of sofree().

Note on listening queues.  Until now the sockets on a queue had zero
reference count, and a reference was given only upon accept(2).  The
assumption was that there is no way to see the queued socket from anywhere
except its head.  This is not true, since queued sockets already have pcbs,
which are linked at least into the global pcb lists.  With this change we
put the reference right in the sonewconn() and on accept(2) path we just
hand the existing reference to the file descriptor.

Differential revision:	https://reviews.freebsd.org/D35679
2022-07-04 12:40:51 -07:00
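Note on d8596171c5: a toy model of the listen-queue reference handoff described above. The sk_ functions are hypothetical stand-ins, not the kernel's sonewconn(), accept path, or sorele(); the sketch only tracks the count and who owns the single reference.

```c
#include <assert.h>
#include <stdbool.h>

struct sk_socket {
	int refs;
	bool on_listen_queue;	/* reference currently owned by the queue */
	bool has_fd;		/* reference currently owned by a descriptor */
};

static void
sk_soref(struct sk_socket *so)
{
	so->refs++;
}

/* Drop one reference; returns true when it was the last one. */
static bool
sk_sorele(struct sk_socket *so)
{
	assert(so->refs > 0);
	return (--so->refs == 0);
}

/* New connection: the reference is taken right away, for the queue. */
static void
sk_sonewconn(struct sk_socket *so)
{
	so->refs = 0;
	so->has_fd = false;
	sk_soref(so);
	so->on_listen_queue = true;
}

/* Accept: hand the queue's reference to the descriptor, count unchanged. */
static void
sk_accept(struct sk_socket *so)
{
	assert(so->on_listen_queue && so->refs > 0);
	so->on_listen_queue = false;
	so->has_fd = true;
}

/* Close: the descriptor's reference is released. */
static bool
sk_close(struct sk_socket *so)
{
	assert(so->has_fd);
	so->has_fd = false;
	return (sk_sorele(so));
}
```

In this model a queued socket is never visible with a zero count, which is the property the commit message argues for given that queued sockets are reachable through the global pcb lists.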
Gleb Smirnoff
bc7605647c sockets: use positive flag for file descriptor socket reference
Rename SS_NOFDREF to SS_FDREF and flip all bitwise operations.
Mark sockets created by socreate() with SS_FDREF.

This change is mostly illustrative. With it we see that SS_FDREF
is a debugging flag, since:
* socreate() takes a reference with soref().
* on accept path solisten_dequeue() takes a reference
  with soref() and then soaccept() sets SS_FDREF.
* soclose() checks SS_FDREF, removes it and does sorele().

Reviewed by:		tuexen
Differential revision:	https://reviews.freebsd.org/D35678
2022-07-04 12:40:51 -07:00
Warner Losh
b69996d1d5 tty: Default to printing kernel stack traceback only on INVARIANT kernels
Change the default from printing brief kernel thread stack information
back to omitting it for non-invariant kernels in response to
SIGINFO/^T. Full and brief stack support can be selected with the
kern.tty_info_kstacks sysctl.

MFC After:		2 weeks
Sponsored by:		Netflix
Reviewed by:		grembo, jhb
Differential Revision:	https://reviews.freebsd.org/D35576
2022-07-02 08:02:12 -06:00
John Baldwin
0bd73da206 busdma_bounce: Use PRI_ITHD scheduling class for worker thread.
Reviewed by:	kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D35641
2022-06-30 10:06:04 -07:00
John Baldwin
0288d4277f Add register sets for NT_THRMISC and NT_PTLWPINFO.
For the kernel this is mostly a non-functional change.  However, this
will be useful for simplifying gcore(1).

Reviewed by:	markj
MFC after:	2 weeks
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D35666
2022-06-30 10:04:56 -07:00
Gleb Smirnoff
66c8e3fccf socket: fix listen(2) on an already listening socket
Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35669
Fixes:			141fe2dcee
2022-06-30 07:50:29 -07:00
Konstantin Belousov
ad175a107b vfs_mount.c: convert explicit panics and KASSERTs to MPASSERT/MPPASS
Reviewed by:	imp, mjg
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35652
2022-06-29 21:31:47 +03:00
Konstantin Belousov
1e54362824 vfs_op_exit(): assert that mnt_vfs_ops stays non-zero for unmount or suspend
Reviewed by:	mjg
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35639
2022-06-29 21:31:47 +03:00
Jamie Gritton
7060da62ff jail: Remove a prison's shared memory when it dies
Add shm_remove_prison(), that removes all POSIX shared memory segments
belonging to a prison.  Call it from prison_cleanup() so a prison
won't be stuck in a dying state due to the resources still held.

PR:		257555
Reported by:	grembo
2022-06-29 10:47:39 -07:00
Jamie Gritton
a9f7455c38 jail: add prison_cleanup() to release resources held by a dying jail
Currently, when a jail starts dying, either by losing its last user
reference or by being explicitly killed,
osd_jail_call(...PR_METHOD_REMOVE...) is called.  Encapsulate this
into a function prison_cleanup() that can then do other cleanup.
2022-06-29 10:33:05 -07:00
Gleb Smirnoff
48a55bbfe9 unix: change error code for recvmsg() failed due to RLIMIT_NOFILE
Instead of returning EMSGSIZE pass the error code from fdallocn() directly
to userland.  That would be EMFILE, which makes much more sense.  This
error code is not listed in the specification[1], but the specification
doesn't cover such edge case at all.  Meanwhile the specification lists
EMSGSIZE as the error code for invalid value of msg_iovlen, and FreeBSD
follows that, see sys_recvmsg().  Differentiating these two cases will make
a developer's or admin's life much easier when debugging.

[1] https://pubs.opengroup.org/onlinepubs/9699919799/functions/recvmsg.html

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35640
2022-06-29 09:42:58 -07:00
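From userland, the change in 48a55bbfe9 means the two failure causes can be told apart by errno alone. The following is a minimal sketch of that distinction; the enum and the function `toy_recvmsg_hint()` are hypothetical names invented for illustration, and only the `EMFILE` and `EMSGSIZE` meanings come from the commit message.

```c
#include <errno.h>

/* Hypothetical classification of a failed recvmsg(2) on a unix socket. */
enum toy_hint {
	TOY_HINT_FD_LIMIT,	/* descriptor limit hit while receiving rights */
	TOY_HINT_BAD_IOVLEN,	/* invalid msg_iovlen, per the specification */
	TOY_HINT_OTHER		/* anything else */
};

/* Map an errno value from recvmsg(2) to a debugging hint. */
enum toy_hint
toy_recvmsg_hint(int err)
{
	switch (err) {
	case EMFILE:
		/* After this commit: fdallocn() failed, e.g. RLIMIT_NOFILE. */
		return (TOY_HINT_FD_LIMIT);
	case EMSGSIZE:
		/* Still reported for an invalid msg_iovlen. */
		return (TOY_HINT_BAD_IOVLEN);
	default:
		return (TOY_HINT_OTHER);
	}
}
```

Before the change both situations would have produced `EMSGSIZE`, so a caller could not have made this distinction.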
Kristof Provost
ab91feabcc ovpn: Introduce OpenVPN DCO support
OpenVPN Data Channel Offload (DCO) moves OpenVPN data plane processing
(i.e. tunneling and cryptography) into the kernel, rather than using tap
devices.
This avoids significant copying and context switching overhead between
kernel and user space and improves OpenVPN throughput.

In my test setup throughput improved from around 660Mbit/s to around
2Gbit/s.

Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D34340
2022-06-28 11:33:10 +02:00
Mateusz Guzik
7388fb714a cache: drop the vfs.cache_rename_add tunable
The functionality has been in use since Jan 2021 -- long enough(tm).
2022-06-27 09:56:20 +02:00
Gleb Smirnoff
458f475df8 unix/dgram: smart socket buffers for one-to-many sockets
A one-to-many unix/dgram socket is a socket that has been bound
with bind(2) and can get multiple connections.  A typical example
is /var/run/log bound by syslogd(8) and receiving multiple
connections from libc syslog(3) API.  Until now all of these
connections shared the same receive socket buffer of the bound
socket.  This made the socket vulnerable to an overflow attack.
See 240d5a9b1c for a historical attempt to work around the problem.

This commit creates a per-connection socket buffer for every single
connected socket and eliminates the problem.  The new behavior will
favor infrequent writers over frequent writers.  See the added test case
scenarios and code comments for a more detailed description of the
new behavior.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35303
2022-06-24 09:09:11 -07:00
Gleb Smirnoff
1093f16487 unix/dgram: reduce mbuf chain traversals in send(2) and recv(2)
o Use m_pkthdr.memlen from m_uiotombuf()
o Modify unp_internalize() to keep track of allocated space and memory
  as well as pointer to the last buffer.
o Modify unp_addsockcred() to keep track of allocated space and memory
  as well as pointer to the last buffer.
o Record the datagram len/memlen/ctllen in the first (from) mbuf of the
  chain in uipc_sosend_dgram() and reuse it in uipc_soreceive_dgram().

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35302
2022-06-24 09:09:11 -07:00
Gleb Smirnoff
9b841b0e23 m_uiotombuf: write total memory length of the allocated chain in pkthdr
Data allocated by m_uiotombuf() usually goes into a socket buffer.
We are interested in the length of useful data to be added to sb_acc,
as well as total memory used by mbufs.  The latter would be added to
sb_mbcnt.  Calculating this value at allocation time saves an extra
traversal of the mbuf chain.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35301
2022-06-24 09:09:11 -07:00
Gleb Smirnoff
a7444f807e unix/dgram: use minimal possible socket buffer for PF_UNIX/SOCK_DGRAM
This change fully splits away PF_UNIX/SOCK_DGRAM from other socket
buffer implementations, without any behavior changes.

Generic socket implementation is reduced down to one STAILQ and very
little code.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35300
2022-06-24 09:09:11 -07:00
Gleb Smirnoff
a4fc41423f sockets: enable protocol specific socket buffers
Split struct sockbuf into common shared fields and protocol specific
union, where protocols are free to implement whatever buffer they
want.  Such protocols should mark themselves with PR_SOCKBUF and are
expected to initialize their buffers in their pr_attach and tear
them down in pr_detach.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35299
2022-06-24 09:09:10 -07:00
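The split described in a4fc41423f can be pictured as a struct holding the shared fields followed by a union of per-protocol layouts. The sketch below is a userland illustration only: every name in it (`toy_sockbuf`, `toy_stream_buf`, `toy_dgram_buf`, `TOY_PR_SOCKBUF` and the `toy_dgram_*` functions) is hypothetical and does not reflect the real `struct sockbuf` layout or the real pr_attach/pr_detach signatures.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the PR_SOCKBUF marker. */
#define	TOY_PR_SOCKBUF	0x1u

struct toy_stream_buf {		/* layout a stream protocol might want */
	size_t	acc;
	size_t	ccc;
};

struct toy_dgram_buf {		/* layout a datagram protocol might want */
	size_t	dgram_count;
	size_t	bytes;
};

struct toy_sockbuf {
	/* Common fields shared by every protocol. */
	unsigned	flags;
	size_t		hiwat;
	/* Protocol specific part: each protocol uses one member only. */
	union {
		struct toy_stream_buf	stream;
		struct toy_dgram_buf	dgram;
	} u;
};

/* Attach: the protocol initializes its own part of the buffer. */
void
toy_dgram_attach(struct toy_sockbuf *sb, size_t hiwat)
{
	memset(sb, 0, sizeof(*sb));
	sb->flags = TOY_PR_SOCKBUF;
	sb->hiwat = hiwat;
}

/* Queue one datagram if it fits under the high-water mark. */
bool
toy_dgram_append(struct toy_sockbuf *sb, size_t len)
{
	if (sb->u.dgram.bytes + len > sb->hiwat)
		return (false);
	sb->u.dgram.dgram_count++;
	sb->u.dgram.bytes += len;
	return (true);
}

/* Detach: the protocol tears down what it set up in attach. */
void
toy_dgram_detach(struct toy_sockbuf *sb)
{
	memset(&sb->u.dgram, 0, sizeof(sb->u.dgram));
	sb->flags &= ~TOY_PR_SOCKBUF;
}
```

The union keeps the structure size bounded by the largest protocol layout while letting generic code keep using the common fields.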
Gleb Smirnoff
315167c0de unix: provide an option to return locked from unp_connectat()
Use this new version in unix/dgram socket when sending to a target
address.  This removes extra lock release/acquisition and possible
counter-intuitive ENOTCONN.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35298
2022-06-24 09:09:10 -07:00
Gleb Smirnoff
5dc8dd5f3a unix/dgram: inline sbappendaddr_locked() into uipc_sosend_dgram()
This allows removing one M_NOWAIT allocation and also makes it
clearer what's going on.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35297
2022-06-24 09:09:10 -07:00
Gleb Smirnoff
e3fbbf965e unix/dgram: add a specific receive method - uipc_soreceive_dgram
With this second step PF_UNIX/SOCK_DGRAM has a protocol specific
implementation.  This opens up the possibility of performance
optimizations.  However, it still operates on the same struct
socket as all other sockets do.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35296
2022-06-24 09:09:10 -07:00
Gleb Smirnoff
f384a97c83 unix/dgram: cleanup uipc_send of PF_UNIX/SOCK_DGRAM, step 2
Just remove one level of indentation as the case clause always matches.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35295
2022-06-24 09:09:10 -07:00
Gleb Smirnoff
7e5b6b391e unix/dgram: cleanup uipc_send of PF_UNIX/SOCK_DGRAM, step 1
Remove the dead code.  The new uipc_sosend_dgram() handles send()
on PF_UNIX/SOCK_DGRAM in full.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35294
2022-06-24 09:09:10 -07:00
Gleb Smirnoff
3464958246 unix/dgram: add a specific send method - uipc_sosend_dgram()
This is the first step towards splitting the classic BSD socket
implementation into separate classes.  The first to be
split is PF_UNIX/SOCK_DGRAM as it has the most differences
from SOCK_STREAM sockets and from PF_INET sockets.

Historically a protocol shall provide two methods for sendmsg(2):
pru_sosend and pru_send.  The former is a generic send method,
e.g. sosend_generic() which would internally call the latter,
uipc_send() in our case.  There is one important exception, though,
the sendfile(2) code will call pru_send directly.  But sendfile
doesn't work on SOCK_DGRAM, so we can do the trick.  We will create
a socket class specific uipc_sosend_dgram() which will carry only
the important bits from sosend_generic() and uipc_send().

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35293
2022-06-24 09:09:10 -07:00
Mitchell Horne
29afffb942 subr_bus: restore bus_null_rescan()
Partially revert the previous change; we need to keep this method as a
specific override for pci_driver subclasses which should not use
pci_rescan_method() -- cardbus and ofw_pcibus. However, change the return
value to ENODEV for the same reasoning given in the original commit, and
use this as the default rescan method in bus_if.m.

Reported by:	jhb
Fixes:		36a8572ee8 ("bus_if: provide a default null rescan method")
MFC with:	36a8572ee8
2022-06-23 16:07:00 -03:00
Mitchell Horne
8701571df9 set_cputicker: use a bool
The third argument to this function indicates whether the supplied
ticker is fixed or variable, i.e. requiring calibration. Give this
argument a type and name that better conveys this purpose.

Reviewed by:	kib, markj
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D35459
2022-06-23 15:15:11 -03:00
Mitchell Horne
36a8572ee8 bus_if: provide a default null rescan method
There is an existing helper method in subr_bus.c, but almost no drivers
know to use it. It also returns the same error as an empty method,
making it not very useful. Move this to bus_if.m and return a more
sensible error code.

This gives a slightly more meaningful error message when attempting
'devctl rescan' on buses and devices alike:
  "Device not configured" --> "Operation not supported by device"

Reviewed by:	imp
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D35501
2022-06-23 15:15:10 -03:00
Chuck Silvers
5bd21cbbd1 vfs: fix vfs_bio_clrbuf() for PAGE_SIZE > block size
Calculate the desired page valid mask using math that will not
overflow the types used.

Sponsored by:	Netflix

Reviewed by:	mckusick, kib, markj
Differential Revision:	https://reviews.freebsd.org/D34837
2022-06-21 17:58:52 -07:00
Mark Johnston
9553bc89db aio: Improve UMA usage
- Remove the AIO proc zone.  This zone gets one allocation per AIO
  daemon process, which isn't enough to warrant a dedicated zone.  Plus,
  unlike other AIO structures, aiops are small (32 bytes with LP64), so
  UMA doesn't provide better space efficiency than malloc(9).  Change
  one of the malloc types in vfs_aio.c to make it more general.

- Don't set the NOFREE flag on the other AIO zones.  This flag means
  that memory allocated to the AIO subsystem is never freed back to the
  VM, so it's always preferable to avoid using it when possible.  NOFREE
  was set without explanation when AIO was converted to use UMA 20 years
  ago, but it does not appear to be required; all of the structures
  allocated from UMA (per-process kaioinfo, kaiocb, and aioliojob) keep
  track of references and get freed only when none exist.  Plus, these
  structures will contain dangling pointers after they're freed (e.g.,
  the "cred", "fd_file" and "uiop" fields of struct kaiocb), so
  use-after-frees are dangerous even when the structures themselves are
  type-stable.

Reviewed by:	asomers
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D35493
2022-06-20 12:48:13 -04:00
Damjan Jovanovic
8c309d48aa struct kinfo_file changes needed for lsof to work using only usermode APIs
Add kf_pipe_buffer_[in/out/size] fields to kf_pipe, and populate them.

Add a kf_kqueue struct to the kf_un union, to allow querying kqueue state,
and populate it.

Populate the kf_sock_rcv_sb_state and kf_sock_snd_sb_state fields in
kf_sock for INET/INET6 sockets, and populate all other fields for all
transport layer protocols, not just TCP.

Bump __FreeBSD_version.

Differential revision:	https://reviews.freebsd.org/D34184
Reviewed by:	jhb, kib, se
MFC after:	1 week
2022-06-18 12:34:25 +03:00
Damjan Jovanovic
8ae7694913 KERN_LOCKF: report kl_file_fsid consistently with stat(2)
PR:	264723
Reviewed by:	kib
Discussed with:	markj
MFC after:	1 week
2022-06-18 12:34:17 +03:00
Mark Johnston
f6379f7fde socket: Fix a race between kevent(2) and listen(2)
When locking the knote list for a socket, we check whether the socket is
a listening socket in order to select the appropriate mutex; a listening
socket uses the socket lock, while data sockets use socket buffer
mutexes.

If SOLISTENING(so) is false and the knote lock routine locks a socket
buffer, then it must re-check whether the socket is a listening socket
since solisten_proto() could have changed the socket's identity while we
were blocked on the socket buffer lock.

Reported by:	syzkaller
Reviewed by:	glebius
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D35492
2022-06-16 10:20:04 -04:00
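The lock-then-recheck pattern described in f6379f7fde can be sketched in userland with two pthread mutexes. This is an illustration under stated assumptions, not the kernel code: `toy_socket`, `toy_listen()` and `toy_knote_lock()` are hypothetical names, and the model assumes that the transition to listening happens with both locks held.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a socket with two candidate knote locks. */
struct toy_socket {
	pthread_mutex_t	so_lock;	/* used while listening */
	pthread_mutex_t	sb_lock;	/* used by data sockets */
	atomic_bool	listening;
};

void
toy_socket_init(struct toy_socket *so)
{
	pthread_mutex_init(&so->so_lock, NULL);
	pthread_mutex_init(&so->sb_lock, NULL);
	atomic_init(&so->listening, false);
}

/* Model of listen(2): change the socket's identity under both locks. */
void
toy_listen(struct toy_socket *so)
{
	pthread_mutex_lock(&so->so_lock);
	pthread_mutex_lock(&so->sb_lock);
	atomic_store(&so->listening, true);
	pthread_mutex_unlock(&so->sb_lock);
	pthread_mutex_unlock(&so->so_lock);
}

/* Lock the knote list and return the mutex that was taken. */
pthread_mutex_t *
toy_knote_lock(struct toy_socket *so)
{
	if (atomic_load(&so->listening)) {
		pthread_mutex_lock(&so->so_lock);
		return (&so->so_lock);
	}
	pthread_mutex_lock(&so->sb_lock);
	/*
	 * Re-check: toy_listen() may have run while we were blocked on
	 * sb_lock, in which case the buffer lock is the wrong one.
	 */
	if (atomic_load(&so->listening)) {
		pthread_mutex_unlock(&so->sb_lock);
		pthread_mutex_lock(&so->so_lock);
		return (&so->so_lock);
	}
	return (&so->sb_lock);
}
```

Without the second check, a caller that blocked on the buffer lock could return holding a lock that no longer protects the knote list.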
Mark Johnston
756bc3adc5 kasan: Create a shadow for the bootstack prior to hammer_time()
When the kernel is compiled with -asan-stack=true, the address sanitizer
will emit inline accesses to the shadow map.  In other words, some
shadow map accesses are not intercepted by the KASAN runtime, so they
cannot be disabled even if the runtime is not yet initialized by
kasan_init() at the end of hammer_time().

This went unnoticed because the loader will initialize all PML4 entries
of the bootstrap page table to point to the same PDP page, so early
shadow map accesses do not raise a page fault, though they are silently
corrupting memory.  In fact, when the loader does not copy the staging
area, we do get a page fault since in that case only the first and last
PML4Es are populated by the loader.  But due to another bug, the loader
always treated KASAN kernels as non-relocatable and thus always copied
the staging area.

It is not really practical to annotate hammer_time() and all callees
with __nosanitizeaddress, so instead add some early initialization which
creates a shadow for the boot stack used by hammer_time().  This is only
needed by KASAN, not by KMSAN, but the shared pmap code handles both.

Reported by:	mhorne
Reviewed by:	kib
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D35449
2022-06-15 11:39:10 -04:00
Doug Ambrisko
ce00b11940 mount: revert the active vnode reporting feature
Revert the computing of active vnode reporting since statfs is used
by a lot of tools.  Only report the vnodes used.

Reported by:	mjg
2022-06-15 07:24:55 -07:00
Mark Johnston
7565431f30 mount: Fix an incorrect assertion in kernel_mount()
The pointer to the mount values may be null if an error occurred while
copying them in, so fix the assertion condition to reflect that
possibility.

While here, move some initialization code into the error == 0 block.  No
functional change intended.

Reported by:	syzkaller
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2022-06-14 12:00:59 -04:00
Mark Johnston
630f633f2a vm_object: Use the vm_object_(set|clear)_flag() helpers
... rather than setting and clearing flags inline.  No functional change
intended.

Reviewed by:	alc, kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D35469
2022-06-14 12:00:59 -04:00
Mark Johnston
e8955bd643 pipe: Use a distinct wait channel for I/O serialization
Suppose a thread tries to read from an empty pipe.  pipe_read() does the
following:

1. pipelock(), possibly sleeping
2. check for buffered data
3. pipeunlock()
4. set PIPE_WANTR and sleep
5. goto 1

pipelock() is an open-coded mutex; if a thread blocks in pipelock(), it
sleeps until the lock holder calls pipeunlock().

Both sleeps use the same wait channel.  So if there are multiple threads
in pipe_read(), a thread T1 in step 3 can wake up a thread T2 sleeping
in step 4.  Then T1 goes to sleep in step 4, and T2 acquires and
releases the pipelock, waking up T1 again.  This can go on indefinitely,
livelocking the process (and potentially starving a would-be writer).

Fix the problem by using a separate wait channel for pipelock().

Reported by:	Paul Floyd <paulf2718@gmail.com>
Reviewed by:	mjg, kib
PR:		264441
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D35415
2022-06-14 12:00:59 -04:00
Cy Schubert
d781401512 kern_thread.c: Fix i386 build
Chase 4493a13e3b by updating static
assertions of struct proc.
2022-06-13 19:35:33 -07:00
Konstantin Belousov
1575804961 reap_kill_proc(): avoid singlethreading any other process if we are exiting
This is racy because the curproc process lock is not used, but allows the
process to exit faster.  It is a userspace issue to create such a race
anyway, and not fulfilling the guarantee that all reaper descendants
are signalled should be fine.

In collaboration with:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D35310
2022-06-13 22:30:03 +03:00
Konstantin Belousov
e0343eacf3 reap_kill_subtree(): hold the reaper when entering it into the queue to handle later
We drop proctree_lock, which allows a process to exit while it is still
memoized in the list.

Reported and tested by:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D35310
2022-06-13 22:30:03 +03:00
Konstantin Belousov
1d4abf2cfa reap_kill_subtree_once(): handle proctree_lock unlock in reap_kill_proc()
A recorded reaper might lose its reaper status, so we should not assert
it, but check and avoid signalling if this happens.

Reported and tested by:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D35310
2022-06-13 22:30:03 +03:00
Konstantin Belousov
addf103ce6 reap_kill_proc: do not retry on thread_single() failure
The failure means that the process does single-threading itself, which
makes our action not needed.

Reported and tested by:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D35310
2022-06-13 22:30:03 +03:00
Konstantin Belousov
008b2e6544 Make stop_all_proc_block interruptible to avoid deadlock with parallel suspension
If we try to single-thread a process whose thread entered
procctl(REAP_KILL_SUBTREE) and is sleeping, waiting for us to unlock
stop_all_proc_blocker, we must be able to finish single-threading.  This
requires the sleep to be interruptible.

Reported and tested by:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D35310
2022-06-13 22:30:03 +03:00
Mark Johnston
2d5ef216b6 thread_single_end(): consistently maintain p_boundary_count for ALLPROC mode
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D35310
2022-06-13 22:30:03 +03:00
Konstantin Belousov
1b4701fe1e thread_unsuspend(): do not unsuspend the suspended leader thread doing SINGLE_ALLPROC
markj wrote:
tdsendsignal() may unsuspend a target thread. I think there is at least
one bug there: suppose thread T is suspended in
thread_single(SINGLE_ALLPROC) when trying to kill another process with
REAP_KILL. Suppose a different thread sends SIGKILL to T->td_proc. Then,
tdsendsignal() calls thread_unsuspend(T, T->td_proc). thread_unsuspend()
incorrectly decrements T->td_proc->p_suspcount to -1.

Later, when T->td_proc exits, it will wait forever in
thread_single(SINGLE_EXIT) since T->td_proc->p_suspcount never reaches 1.

Since the thread suspension is bounded by the time needed to do
thread_single(), skipping the thread_unsuspend_one() call there should
not affect signal delivery if this thread is selected as the target.

Reported by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D35310
2022-06-13 22:30:03 +03:00
Konstantin Belousov
b9009b1789 thread_single(): remove already checked conditional expression
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D35310
2022-06-13 22:30:03 +03:00
Konstantin Belousov
4493a13e3b Do not single-thread itself when the process single-threaded another process
Since both self single-threading and remote single-threading rely on
suspending the thread doing thread_single(), they cannot be mixed: a thread
doing thread_suspend_switch() might be subject to thread_suspend_one()
and vice versa.

In collaboration with:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D35310
2022-06-13 22:30:03 +03:00
Konstantin Belousov
dd883e9a7e weed_inhib(): correct the condition to re-suspend a thread
suspended for SINGLE_ALLPROC mode.  There is no need to check for
boundary state.  It is only required to see that the suspension comes
from the ALLPROC mode.

In collaboration with:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D35310
2022-06-13 22:30:03 +03:00
Konstantin Belousov
b9893b3533 weed_inhib(): do not double-suspend already suspended thread if the loop reiterates
In collaboration with:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D35310
2022-06-13 22:30:02 +03:00
Konstantin Belousov
d7a9e6e740 thread_single: wait for P_STOPPED_SINGLE to pass
so that ALLPROC mode does not try to race with any other single-threading
mode.

In collaboration with:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D35310
2022-06-13 22:30:02 +03:00
Konstantin Belousov
02a2aacbe2 issignal(): ignore signals when process is single-threading for exit
Places that will wait for curproc->p_singlethr to become zero (in the
next commit, the counter of the number of external single-threading
operations is to be introduced) must wait for it interruptibly, otherwise
we deadlock.  On the other hand, a signal delivered during this window,
if directed to the waiting thread, would cause the wait loop to become
a busy loop.

Since we are exiting, it is safe to ignore the signals.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D35310
2022-06-13 22:30:02 +03:00
Konstantin Belousov
d3000939c7 P2_WEXIT: avoid thread_single() for exiting process earlier
before the process itself does thread_single(SINGLE_EXIT).  We cannot
single-thread such a process in ALLPROC (external) mode, and the failure
to do so due to the process becoming a zombie is easier to prevent than
to properly detect and report.

In collaboration with:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D35310
2022-06-13 22:30:02 +03:00
Doug Ambrisko
6468cd8e0e mount: add vnode usage per file system with mount -v
This avoids the need to drop into the ddb to figure out vnode
usage per file system.  It helps to see if they are or are not
being freed.  Suggestion to report active vnode count was from
kib@

Reviewed by:   	kib
Differential Revision: https://reviews.freebsd.org/D35436
2022-06-13 07:56:38 -07:00
Hans Petter Selasky
b8394039dc mbuf(9): Fix size of mbuf for all 32-bit platforms (i386, ARM, PowerPC and RISCV)
Do this by reducing MBUF_PEXT_MAX_PGS, which was causing "struct mbuf" to
be bigger than M_SIZE, and also add a missing padding field to ensure 64-bit
alignment.

Reviewed by:	gallatin@
Reported by:	Elliott Mitchell
Differential revision:	https://reviews.freebsd.org/D35339
MFC after:	1 week
Sponsored by:	NVIDIA Networking
2022-06-07 22:09:10 +02:00
Hans Petter Selasky
fe8c78f0d2 ktls: Add full support for TLS RX offloading via network interface.
Basic TLS RX offloading uses the "csum_flags" field in the mbuf packet
header to figure out if an incoming mbuf has been fully offloaded or
not. This information follows the packet stream via the LRO engine, IP
stack and finally to the TCP stack. The TCP stack preserves the mbuf
packet header also when re-assembling packets after packet loss. When
the mbuf goes into the socket buffer the packet header is demoted and
the offload information is transferred to "m_flags". Later on a
worker thread will analyze the mbuf flags and decide if the mbufs
making up a TLS record indicate a fully-, partially- or not decrypted
TLS record. Based on these three cases the worker thread will either
pass the packet on as-is or recrypt the decrypted bits, if any, or
decrypt the packet as usual.

During packet loss the kernel TLS code will call back into the network
driver using the send tag, informing about the TCP starting sequence
number of every TLS record that is not fully decrypted by the network
interface. The network interface then stores this information in a
compressed table and starts asking the hardware if it has found a
valid TLS header in the TCP data payload. If the hardware has found a
valid TLS header and the referred TLS header is at a valid TCP
sequence number according to the TCP sequence numbers provided by the
kernel TLS code, the network driver then informs the hardware that it
can resume decryption.

Care has been taken to not merge encrypted and decrypted mbuf chains,
in the LRO engine and when appending mbufs to the socket buffer.

The mbuf's leaf network interface pointer is used to figure out from
which network interface the offloading rule should be allocated. Also
this pointer is used to track route changes.

Currently mbuf send tags are used in both transmit and receive
direction, due to convenience, but may get a new name in the future to
better reflect their usage.

Reviewed by:	jhb@ and gallatin@
Differential revision:	https://reviews.freebsd.org/D32356
Sponsored by:	NVIDIA Networking
2022-06-07 12:58:09 +02:00
Hans Petter Selasky
f0fca64618 ktls: Refer send tag pointer once.
So that the asserts and the actual code see the same values.

Differential revision:	https://reviews.freebsd.org/D32356
MFC after:	1 week
Sponsored by:	NVIDIA Networking
2022-06-07 12:57:03 +02:00
Hans Petter Selasky
4d88d81c31 mbuf(9): Implement a leaf network interface field in the mbuf packet header.
When packets are received they may traverse several network interfaces like
vlan(4) and lagg(9). When doing receive side offloads it is important to
know the first network interface entry point, because that is where all
offloading is taking place. This makes it possible to track receive
side route changes for multiport setups, for example when lagg(9) receives
traffic from more than one port. This avoids having to install multiple
offloading rules for the same stream.

This field works similarly to the existing "rcvif" mbuf packet header field.

Submitted by:	jhb@
Reviewed by:	gallatin@ and gnn@
Differential revision:	https://reviews.freebsd.org/D35339
Sponsored by:	NVIDIA Networking
Sponsored by:	Netflix
2022-06-07 12:54:42 +02:00
Gleb Smirnoff
d97922c6c6 unix/*: rewrite unp_internalize() cmsg parsing cycle
Make it a complex, but single, for(;;) statement.  The previous cycle
with some loop logic at the beginning and some loop logic at the end
was confusing.  Both markj@ and I were misled into concluding that
some checks were unnecessary, while they actually were necessary.

While here, handle an edge case found by Mark, when on a 64-bit platform
an incorrect message from userland would underflow the length counter, but
return without any error.  Provide a test case for such a message.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35375
2022-06-06 10:05:28 -07:00
Yuichiro NAITO
8d95f50052 smp: Use local copies of the setup function pointer and argument
No functional change intended.

PR:		264383
Reviewed by:	jhb, markj
MFC after:	1 week
2022-06-06 11:29:51 -04:00
Gleb Smirnoff
2573e6ced9 unix/dgram: rename unpdg_sendspace to unpdg_maxdgram
Matches the meaning of the variable and sysctl node name.
2022-06-03 12:55:44 -07:00
Gleb Smirnoff
a8e286bb5d sockets: use socket buffer mutexes in struct socket directly
Convert more generic socket code to not use sockbuf compat pointer.
Continuation of 4328318445.
2022-06-03 12:55:44 -07:00
Mitchell Horne
35eb9b10c2 Use KERNEL_PANICKED() in more places
This is slightly more optimized than checking panicstr directly. For
most of these instances performance doesn't matter, but let's make
KERNEL_PANICKED() the common idiom.

Reviewed by:	mjg
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D35373
2022-06-02 10:15:43 -03:00
Gleb Smirnoff
f083739350 soo_aio_*: use socket buffer mutexes in struct socket directly
A miss from commit 4328318445.
2022-05-30 20:46:38 -07:00
Dmitry Chagin
d46174cd88 Finish cpuset_getaffinity() after f35093f8
Split cpuset_getaffinity() into two counterparts, where
user_cpuset_getaffinity() is intended to operate on the cpuset_t from
user va, while kern_cpuset_getaffinity() expects the cpuset from kernel
va.
Accordingly, the code that clears the high bits is moved to
user_cpuset_getaffinity(). The Linux sched_getaffinity() syscall returns
the size of the set copied to user space and then the glibc wrapper clears
the high bits.

MFC after:		2 weeks
2022-05-28 20:53:08 +03:00
Dmitry Chagin
31d1b816fe sysent: Get rid of bogus sys/sysent.h include.
Where appropriate hide sysent.h under proper condition.

MFC after:	2 weeks
2022-05-28 20:52:17 +03:00
Gleb Smirnoff
d64f2f42c1 unix: unp_externalize() can M_WAITOK
Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35318
2022-05-27 20:48:38 -07:00
Gleb Smirnoff
d59bc188d6 sockbuf: remove unused mbuf counter and cluster counter
With M_EXTPG mbufs these two counters already do not represent the
reality.  As we are moving towards protocol independent socket buffers,
which may not even use mbufs at all, the counters become less and less
relevant.  The only userland seeing them was 'netstat -x'.

PR:			264181 (exp-run)
Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35334
2022-05-27 08:20:17 -07:00
Gleb Smirnoff
75e7e3ce34 unix: fix incorrect assertion in 4682ac697c
Pointy hat to:	glebius
Fixes:		4682ac697c
2022-05-26 11:35:05 -07:00
Gleb Smirnoff
4682ac697c unix: turn check in unp_externalize() into assertion
In this function we always work with mbufs that we previously
created ourselves in unp_internalize().  They must be valid.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35319
2022-05-25 13:29:20 -07:00
Gleb Smirnoff
579b45e203 unix/*: check new control size in unp_internalize()
Now that we call sbcreatecontrol() with M_WAITOK, we are expected to
pass a valid size.  Return the same error code that we return for an
oversized control from sockargs().

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35317
2022-05-25 13:29:13 -07:00
Gleb Smirnoff
d60ea9a10a sockets: return EMSGSIZE if control part of message is too large
The specification doesn't list an explicit error code for the control
size specified by msg_control being too large.  But it does list
EMSGSIZE as the error code for "message is too large to be sent all at
once (as the socket requires)".  It also lists EINVAL as the code for
"The sum of the iov_len values overflows an ssize_t."  Given
how generic and uninformative EINVAL is, EMSGSIZE is more
appropriate.

https://pubs.opengroup.org/onlinepubs/9699919799/functions/sendmsg.html

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35316
2022-05-25 13:29:04 -07:00
Gleb Smirnoff
ad51c47fb4 sockbuf: fix assertion in sbcreatecontrol()
Fixes:	6890b58814
2022-05-25 00:19:41 -07:00
Mark Johnston
524dadf7a8 kevent: Fix an off-by-one in filt_timerexpire_l()
Suppose a periodic kevent timer fires close to its deadline, so that
now - kc->next is small.  Then delta ends up being 1, and the next timer
deadline is set to (delta + 1) * kc->to, where kc->to is the timer
period.  This means that the timer fires at half of the requested rate,
and the value returned in kn_data is similarly inaccurate.

PR:		264131
Fixes:		7cb40543e9 ("filt_timerexpire: do not iterate over the interval")
Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D35313
2022-05-24 20:14:33 -04:00
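The arithmetic behind 524dadf7a8 can be checked with a small model. The sketch below is not the kernel code: `toy_timer`, `toy_expire()` and `toy_count_fires()` are hypothetical, time is in arbitrary integer ticks, and the "fixed" branch simply assumes the deadline should advance by `delta` periods rather than `delta + 1`, which is one plausible reading of the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a periodic timer. */
struct toy_timer {
	int64_t	next;	/* next deadline */
	int64_t	to;	/* period */
	int64_t	data;	/* accumulated expirations, like kn_data */
};

/* Handle one expiration observed at time "now". */
void
toy_expire(struct toy_timer *t, int64_t now, bool buggy)
{
	int64_t delta;

	/* Whole periods elapsed since the deadline, at least one. */
	delta = (now - t->next) / t->to;
	if (delta == 0)
		delta = 1;
	t->data += delta;
	if (buggy)
		t->next += (delta + 1) * t->to;	/* skips a period */
	else
		t->next += delta * t->to;	/* assumed correction */
}

/*
 * Count expirations up to "horizon" when every expiration is handled
 * "latency" ticks after its deadline (latency < period).
 */
int
toy_count_fires(bool buggy, int64_t period, int64_t horizon, int64_t latency)
{
	struct toy_timer t = { .next = period, .to = period, .data = 0 };
	int fires = 0;

	while (t.next + latency <= horizon) {
		toy_expire(&t, t.next + latency, buggy);
		fires++;
	}
	return (fires);
}
```

With a period of 10 ticks and a latency of 1 tick, the buggy variant advances the deadline by 20 ticks on every expiration, so the model fires at roughly half the requested rate.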
Mateusz Guzik
cdb337b097 vfs: fix copy-pasto in previous
Reported by:	dchagin
2022-05-20 20:58:11 +00:00
Mateusz Guzik
ec3c225711 vfs: call vn_truncate_locked from kern_truncate
This fixes a bug where the syscall would not bump writecount.

PR:	263999
2022-05-20 17:25:51 +00:00
Mateusz Guzik
6b715687bd vfs: make sure truncate always calls NDFREE_*
While here convert it to NDFREE_NOTHING.
2022-05-20 17:25:51 +00:00
Mark Johnston
4a3e51335e cpuset: Fix the KASAN and KMSAN builds
Rename the "copyin" and "copyout" fields of struct cpuset_copy_cb to
something less generic, since sanitizers define interceptors for
copyin() and copyout() using #define.

Reported by:	syzbot+2db5d644097fc698fb6f@syzkaller.appspotmail.com
Fixes:	47a57144af ("cpuset: Byte swap cpuset for compat32 on big endian architectures")
Sponsored by:	The FreeBSD Foundation
2022-05-20 10:34:25 -04:00
Dmitry Chagin
eca368ecb6 Retire sv_transtrap
Call translate_traps directly from sendsig().

MFC after:		2 weeks
2022-05-20 14:54:03 +03:00
Dmitry Chagin
2479e381cd kqueue: Trim trailing whitespace
MFC after:		1 week
2022-05-19 19:52:02 +03:00
Justin Hibbits
47a57144af cpuset: Byte swap cpuset for compat32 on big endian architectures
Summary:
BITSET uses long as its basic underlying type, which is dependent on the
compile type, meaning on 32-bit builds the basic type is 32 bits, but on
64-bit builds it's 64 bits.  On little endian architectures this doesn't
matter, because the LSB is always at the low bit, so the words get
effectively concatenated moving between 32-bit and 64-bit, but on
big-endian architectures it throws a wrench in, as setting bit 0 in
32-bit mode is equivalent to setting bit 32 in 64-bit mode.  To
demonstrate:

32-bit mode:

BIT_SET(foo, 0):        0x00000001

64-bit sees: 0x0000000100000000

cpuset is the only system interface that uses bitsets, so solve this
by swapping the integer sub-components at the copyin/copyout points.
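
A minimal sketch of the swap, assuming a 64-bit kernel word that holds two 32-bit words written by a compat32 process. The helper name is hypothetical and this is not the kernel's actual routine.

```c
#include <stdint.h>

/*
 * Hypothetical helper: exchange the two 32-bit halves of one 64-bit
 * bitset word, so that bit 0 set by a 32-bit big-endian process ends up
 * as bit 0 of the 64-bit word.
 */
uint64_t
swap_word_halves(uint64_t word)
{
	return ((word >> 32) | (word << 32));
}
```

Applied to the value from the example above, 0x0000000100000000 becomes 0x0000000000000001. Applying the helper twice restores the original word, so the same operation can serve both the copyin and the copyout direction.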

Reviewed by:	kib
MFC after:	3 days
Sponsored by:	Juniper Networks, Inc.
Differential Revision:	https://reviews.freebsd.org/D35225
2022-05-19 10:49:55 -05:00
Andrew Turner
11a6ecd425 Handle cas failure when the compare succeeds
When locking a priority inherit mutex we perform a compare and swap
operation to try and acquire the mutex. This may fail even when the
compare succeeds.

Check and handle this case.
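
The general pattern can be sketched in userspace with C11 atomics, whose weak compare-and-swap is allowed to fail spuriously. The lock-word layout (0 meaning unowned) and the function name are assumptions for illustration, not the umtx code.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical lock word: 0 means unowned, otherwise the owner's id.
 * atomic_compare_exchange_weak() may fail even though the comparison
 * matched; in that case `expected` still reads 0 afterwards.
 */
bool
try_acquire(atomic_uint *owner, unsigned int self)
{
	unsigned int expected;

	for (;;) {
		expected = 0;
		if (atomic_compare_exchange_weak(owner, &expected, self))
			return (true);		/* swapped in our id */
		if (expected != 0)
			return (false);		/* genuinely owned by someone */
		/* Compare matched but the store failed: retry. */
	}
}
```

Treating every failure as "the mutex is owned" would send the caller down the contended path even though the lock was free, which is the case the commit addresses.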

PR:		263825
Reviewed by:	kib, markj
Sponsored by:	The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35150
2022-05-19 11:30:21 +01:00
Gleb Smirnoff
6890b58814 sockbuf: improve sbcreatecontrol()
o Constify memory pointer.  Make length unsigned.
o Make it never fail with M_WAITOK and assert that length is sane.
2022-05-17 10:10:42 -07:00
Gleb Smirnoff
b46667c63e sockbuf: merge two versions of sbcreatecontrol() into one
No functional change.
2022-05-17 10:10:42 -07:00
Gleb Smirnoff
eac7f0798b unix: garbage collect unp_dispose_mbuf() for brevity
2022-05-17 10:10:41 -07:00
Gleb Smirnoff
2e5bf7c49f unix: fix mbuf leak on close of socket with data
Fixes:	1f32cef471
2022-05-17 10:10:41 -07:00
Vladimir Kondratyev
b6f87b78b5 LinuxKPI: Implement kthread_worker related functions
Kthread worker is a single thread workqueue which can be used in cases
where specific kthread association is necessary, for example, when it
should have RT priority or be assigned to certain cgroup.

This change implements Linux v4.9 interface which mostly hides kthread
internals from users thus allowing to use ordinary taskqueue(9) KPI.
As kthread worker prohibits enqueueing of already pending or canceling
tasks some minimal changes to taskqueue(9) were done.
taskqueue_enqueue_flags() was added to taskqueue KPI which accepts extra
flags parameter. It contains one or more of the following flags:

TASKQUEUE_FAIL_IF_PENDING - taskqueue_enqueue_flags() fails if the task
    is already scheduled for execution. EEXIST is returned and the
    ta_pending counter value remains unchanged.
TASKQUEUE_FAIL_IF_CANCELING - taskqueue_enqueue_flags() fails if the
    task is in the canceling state and ECANCELED is returned.

Required by:	drm-kmod 5.10

MFC after:	1 week
Reviewed by:	hselasky, Pau Amma (docs)
Differential Revision:	https://reviews.freebsd.org/D35051
2022-05-17 15:10:20 +03:00
Rick Macklem
373511338d uipc_socket.c: Modify MSG_TLSAPPDATA to only do Alert Records
Without this patch, the MSG_TLSAPPDATA flag would cause
soreceive_generic() to return ENXIO for any non-application
data record in a TLS receive stream.

This works ok for TLS1.2, since Alert records appear to be
the only non-application data records received.
However, for TLS1.3, there can be post-handshake handshake
records, such as NewSessionKey sent to the client from the
server. These handshake records cannot be handled by the
upcall which does an SSL_read() with length == 0.

It appears that the client can simply throw away these
NewSessionKey records, but to do so, it needs to receive
them within the kernel.

This patch modifies the semantics of MSG_TLSAPPDATA slightly,
so that it only applies to Alert records and not Handshake
records. It is needed to allow the krpc to work with KTLS1.3.

Reviewed by:	hselasky
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D35170
2022-05-14 12:56:50 -07:00
Dmitry Chagin
cb2ae61631 sysvsem: Fix a typo
Per jamie@ rpr can be NULL if the jail is created with sysvsem=disable.
But at least it doesn't appear to be fatal, since rpr is never dereferenced
but is only compared to other prison pointers.

Reviewed by:		jamie
Differential revision:	https://reviews.freebsd.org/D35198
MFC after:		2 weeks
2022-05-14 14:07:20 +03:00
Dmitry Chagin
b6c8f461f0 sysvsem: Style(9)
MFC after:	2 weeks
2022-05-14 14:06:58 +03:00
Dmitry Chagin
f0b0fdf15e sysvsem: Trim trailing whitespace
MFC after:	2 weeks
2022-05-14 14:06:40 +03:00
Mitchell Horne
db71383b88 kerneldump: remove physical from dump routines
It is unused, especially now that the underlying d_dumper methods do not
accept the argument.

Reviewed by:	markj
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D35174
2022-05-13 10:43:19 -03:00
Mitchell Horne
489ba22236 kerneldump: remove physical argument from d_dumper
The physical address argument is essentially ignored by every dumper
method. In addition, the dump routines don't actually pass a real
address; every call to dump_append() passes a value of zero for
physical.

Reviewed by:	markj
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D35173
2022-05-13 10:42:48 -03:00
Mitchell Horne
0f50da2e09 Drop d_dump from struct cdevsw
It appears to be unused. These days struct disk has a d_dump member,
which is what gets passed to the kernel dump framework.

Reviewed by:	markj
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D35172
2022-05-13 10:42:17 -03:00
Gleb Smirnoff
bb35a4e11d unix: microoptimize unp_connectat() - one less lock on success
This change is also a preparation for further optimization to
allow locked return on success.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35182
2022-05-12 13:22:39 -07:00
Gleb Smirnoff
08f17d1432 unix: make unp_connect2() void
Assert that sockets are of the same type.  unp_connectat() already did
this check.  Add the check to uipc_connect2().

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35181
2022-05-12 13:22:39 -07:00
Gleb Smirnoff
4328318445 sockets: use socket buffer mutexes in struct socket directly
Since c67f3b8b78 the sockbuf mutexes belong to the containing socket,
and socket buffers just point to it.  In 74a68313b5 macros that access
this mutex directly were added.  Go over the core socket code and
eliminate code that reaches the mutex by dereferencing the sockbuf
compatibility pointer.

This change requires a KPI change, as some functions were given the
sockbuf pointer only without any hint if it is a receive or send buffer.

This change doesn't cover the whole kernel, many protocols still use
compatibility pointers internally.  However, it allows operation of a
protocol that doesn't use them.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35152
2022-05-12 13:22:12 -07:00
Gleb Smirnoff
01235012e5 unix/dgram: uipc_listen() is specific for SOCK_STREAM and SOCK_SEQPACKET
Rely on pr_usrreqs_init() to init SOCK_DGRAM to pru_listen_notsupp().
2022-05-12 11:04:40 -07:00
Gleb Smirnoff
3c87ba3c3b unix/dgram: pru_rcvd never called since PR_WANTRCVD not set
2022-05-12 11:04:40 -07:00
Gleb Smirnoff
2e4e5ee23f sockets: delete stale comment from sofree()
The first paragraph refers to the old past ("we used to") and is no
longer important today.  The second paragraph wrongly states that the
socket buffer is destroyed before pru_detach.
2022-05-12 11:02:50 -07:00
Gleb Smirnoff
1f32cef471 unix: don't call sbrelease() in uipc_detach()
Since a982ce0442 the socket buffer is already cleared and released in
unp_dispose() that is called just before uipc_detach().
2022-05-12 11:02:50 -07:00
Dmitry Chagin
586ed32106 kdump: Decode cpuset_t.
Reviewed by:		jhb
Differential revision:	https://reviews.freebsd.org/D34982
MFC after:		2 weeks
2022-05-11 10:40:39 +03:00
Dmitry Chagin
f35093f8d6 Use Linux semantics for the thread affinity syscalls.
Linux has more tolerant checks of the user supplied cpuset_t's.

Minimum cpuset_t size that the Linux kernel permits in case of
getaffinity() is the maximum CPU id, present in the system / NBBY,
the maximum size is not limited.
For setaffinity(), Linux does not limit the size of the user-provided
cpuset_t, internally using only the meaningful part of the set, where
the upper bound is the maximum CPU id, present in the system, no larger
than the size of the kernel cpuset_t.
Unlike FreeBSD, Linux ignores high bits if set in the setaffinity(),
so clear it in the sched_setaffinity() and Linuxulator itself.

Reviewed by:		Pau Amma (man pages)
In collaboration with:	jhb
Differential revision:	https://reviews.freebsd.org/D34849
MFC after:		2 weeks
2022-05-11 10:36:01 +03:00
Gleb Smirnoff
7db54446c6 sockbufs: make sbrelease_internal() private
2022-05-09 10:43:01 -07:00
Gleb Smirnoff
a982ce0442 sockets: remove the socket-on-stack hack from sorflush()
The hack can be tracked down to 4.4BSD, where copy was performed
under splimp() and then after splx() dom_dispose was called.
Stevens has a chapter on this function, but he doesn't answer why
this trick is necessary.  Why can't we call into dom_dispose under
splimp()?  Anyway, with multithreaded kernel the hack seems to be
necessary to avoid LORs between socket buffer lock and different
filesystem locks, especially network file systems.

The new socket buffers KPI sbcut() from 1d2df300e9 allow us to get
rid of the hack.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35125
2022-05-09 10:43:01 -07:00
Gleb Smirnoff
42f2fa9953 sockets: don't call dom_dispose() on a listening socket
sorflush() already did the right thing, so only sofree() needed
a fix.  Turn check into assertion in our only dom_dispose method.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35124
2022-05-09 10:42:57 -07:00
Gleb Smirnoff
c17418a0ba sockets: assert that any protocol with PR_RIGHTS has dom_dispose()
Through the entire history only PF_UNIX has this feature.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35123
2022-05-09 10:42:48 -07:00
Gleb Smirnoff
24df85d29a unix/*: unp_internalize() can sleep, so allocate mbufs with M_WAITOK
2022-05-09 10:42:48 -07:00
Gleb Smirnoff
97f8198e95 sockets: make SO_SND/SO_RCV an enum
Not a functional change now. The enum will also be used for other socket
buffer related KPIs.
2022-05-09 10:42:47 -07:00
Warner Losh
45ae223ac6 msgbuf: Allow microsecond granularity timestamps
Today, kern.msgbuf_show_timestamp=1 will give 1 second granularity
timestamps on dmesg lines. When kern.msgbuf_show_timestamp=2, we'll
produce microsecond level granularity.
For example:
old (== 1):
[13] Dual Console: Video Primary, Serial Secondary
[14] lo0: link state changed to UP
[15] bxe0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
[15] bxe0: link state changed to UP
new (== 2):
[13.807015] Dual Console: Video Primary, Serial Secondary
[14.544150] lo0: link state changed to UP
[15.272044] bxe0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
[15.272052] bxe0: link state changed to UP

Sponsored by:		Netflix
2022-05-07 09:32:22 -06:00
Alan Somers
1d2421ad8b Correctly measure system load averages > 1024
The old fixed-point arithmetic used for calculating load averages had an
overflow at 1024.  So on systems with extremely high load, the observed
load average would actually fall back to 0 and shoot up again, creating
a kind of sawtooth graph.

Fix this by using 64-bit math internally, while still reporting the load
average to userspace as a 32-bit number.
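
The overflow can be reproduced with a simplified model of one decay step. The FSHIFT value of 11, the decay constant, and both function names are assumptions chosen for illustration, not the kernel's actual code.

```c
#include <stdint.h>

#define	FSHIFT	11			/* assumed fixed-point fraction bits */
#define	FSCALE	(1u << FSHIFT)

/* One decay step computed entirely in 32 bits; wraps for large loads. */
uint32_t
loadav_step32(uint32_t ldavg, uint32_t cexp, uint32_t nrun)
{
	return ((cexp * ldavg + nrun * FSCALE * (FSCALE - cexp)) >> FSHIFT);
}

/* Same step with 64-bit intermediates, narrowed only when reported. */
uint32_t
loadav_step64(uint32_t ldavg, uint32_t cexp, uint32_t nrun)
{
	uint64_t sum;

	sum = (uint64_t)cexp * ldavg +
	    (uint64_t)nrun * FSCALE * (FSCALE - cexp);
	return ((uint32_t)(sum >> FSHIFT));
}
```

In this model a load of 1024 is stored as 1024 * 2^11 = 2^21, and scaling that by another 2^11 reaches 2^32, the first value that no longer fits in 32 bits. That is consistent with the wrap to 0 described above.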

Sponsored by:	Axcient
Reviewed by:	imp
Differential Revision: https://reviews.freebsd.org/D35134
2022-05-06 17:25:43 -06:00
John Baldwin
2fdcc2ef6f cpufreq: Remove unused devclass argument to DRIVER_MODULE.
2022-05-06 15:46:58 -07:00
Dmitry Chagin
f04534f5c8 sysvsem: Add a timeout argument to the semop.
For future use in the Linux emulation layer for the semtimedop syscall
split the sys_semop syscall into two counterparts and add
struct timespec *timeout argument to the last one.

Reviewed by:		jhb, kib
Differential revision:	https://reviews.freebsd.org/D35121
MFC after:		2 weeks
2022-05-06 19:51:48 +03:00
Kristof Provost
613acc6483 mbuf: do not restore dying interfaces
When we remove an interface it is first removed from the interface list
V_ifnet (by if_unlink_ifnet()) and marked as IFF_DYING. We then wait for
any possible references to stop being used (i.e.
epoch_wait/epoch_drain_callbacks) before we tear it fully down.

However, the index in ifindex_table is not removed, so m_rcvif_restore()
can still find the (now dying) interface.

This results in panics, for example when dummynet restores the rcvif
pointer and passes a packet to ip6_input() we can panic because the
AF_INET6 domain has already been removed (so we end up dereferencing a
NULL pointer there).

Check that the interface is not dying before we restore it, which is
equivalent to checking its presence in V_ifnet, and thus ensures that
future accesses (while in NET_EPOCH) are safe.

Reviewed by:	glebius
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D34076

(cherry picked from commit 703e533da5)
2022-05-05 14:38:08 -04:00
Gleb Smirnoff
4d7a1361ef ifnet/mbuf: provide KPI to serialize/restore m->m_pkthdr.rcvif
Supplement ifindex table with generation count and use it to
serialize & restore an ifnet pointer.

Reviewed by:		kp
Differential revision:	https://reviews.freebsd.org/D33266
Fun note:		git show e6abef0918

(cherry picked from commit e1882428dc)
2022-05-05 14:38:07 -04:00
Marko Zec
6c741ffbfa Revert "mbuf: do not restore dying interfaces"
This reverts commit 703e533da5.

Revert "ifnet/mbuf: provide KPI to serialize/restore m->m_pkthdr.rcvif"

This reverts commit e1882428dc.

Obtained from: github.com/glebius/FreeBSD/commits/backout-ifindex
2022-05-03 19:11:40 +02:00
Konstantin Belousov
6fe78ad434 subr_unit.c: make userspace tests buildable
by defining a placeholder for UNR_NO_MTX

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2022-04-28 03:00:14 +03:00
Konstantin Belousov
709783373e Fix another race between fork(2) and PROC_REAP_KILL subtree
where we might not yet see a new child when signalling a process.
Ensure that this cannot happen by stopping the entire reaper subtree,
which ensures that the child is not inside a syscall, in particular
fork(2).

Reported and tested by:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35014
2022-04-28 02:27:35 +03:00
Konstantin Belousov
39794d80ad Fix a race between fork(2) and PROC_REAP_KILL subtree
by repeating iteration over the subtree until there are no new processes
to signal.

Reported and tested by:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35014
2022-04-28 02:27:35 +03:00
Konstantin Belousov
d1df347368 kern_procctl: add possibility to take stop_all_proc_block() around exec
stop_all_proc_block() must be taken before proctree_lock.

Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35014
2022-04-28 02:27:35 +03:00
Konstantin Belousov
2e7595ef2f Add stop_all_proc_block(9)
It allows more than one consumer of thread_single(SINGLE_ALLPROC) by
serializing them.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35014
2022-04-28 02:27:35 +03:00
Konstantin Belousov
54a11adbd9 reap_kill(): split children and subtree killers into helpers
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35014
2022-04-28 02:27:34 +03:00
Konstantin Belousov
134529b11b reap_kill(): rename the reap variable to reaper
Suggested and reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35014
2022-04-28 02:27:34 +03:00
Konstantin Belousov
e4ce431e2a reap_kill(): de-inline LIST_FOREACH(), twice
Suggested and reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35014
2022-04-28 02:27:34 +03:00
Konstantin Belousov
b9294a3e15 reaper_abandon_children(): upgrade proctree_lock assert to exclusive
p_reapsibling linkage is protected by proctree_lock, and it is modified
there.

Suggested and reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35014
2022-04-28 02:27:34 +03:00
Konstantin Belousov
e59b940dcb unr(9): allow to avoid internal locking
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35014
2022-04-28 02:27:34 +03:00
Konstantin Belousov
c4be460e84 init_unrhdr(): make it usable by initializing everything
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35014
2022-04-28 02:27:34 +03:00
John Baldwin
1431239494 Add a __witness_used for variables only used under #ifdef WITNESS.
__diagused is now solely used for variables only used under INVARIANTS.

Reviewed by:	mjg
Differential Revision:	https://reviews.freebsd.org/D35085
2022-04-27 11:46:16 -07:00
Dmitry Chagin
4a700f3c32 sigtimedwait: Prevent timeout math overflows.
Our kern_sigtimedwait() calculates absolute sleep timo value as 'uptime+timeout'.
So, when the user specifies a big timeout value (LONG_MAX), the calculated
timo can be less than the current uptime value.
In that case kern_sigtimedwait() returns EAGAIN instead of EINTR, if
unblocked signal was caught.
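
One common way to avoid this kind of wraparound is a saturating addition, sketched below. The helper name and the use of plain 64-bit integers instead of the kernel's time types are assumptions, and the commit may solve the problem differently.

```c
#include <stdint.h>

/*
 * Hypothetical helper: compute an absolute deadline from non-negative
 * uptime and timeout values, clamping instead of wrapping.
 */
int64_t
deadline_saturating(int64_t uptime, int64_t timeout)
{
	if (timeout > INT64_MAX - uptime)
		return (INT64_MAX);	/* treat as "wait forever" */
	return (uptime + timeout);
}
```

The check is done before the addition, so the overflowing sum is never computed.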

While here switch to a high-precision sleep method.

Reviewed by:		mav, kib
In collaboration with:	mav
Differential revision:	https://reviews.freebsd.org/D34981
MFC after:		2 weeks
2022-04-25 10:23:15 +03:00
Dmitry Chagin
91e7bdcdcf Add timespecvalid_interval macro and use it.
Reviewed by:		jhb, imp (early rev)
Differential revision:	https://reviews.freebsd.org/D34848
MFC after:		2 weeks
2022-04-25 10:20:54 +03:00
John Baldwin
a4c5d490f6 KTLS: Move OCF function pointers out of ktls_session.
Instead, create a switch structure private to ktls_ocf.c and store a
pointer to the switch in the ocf_session.  This will permit adding an
additional function pointer needed for NIC TLS RX without further
bloating ktls_session.

Reviewed by:	hselasky
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D35011
2022-04-22 15:52:12 -07:00
John Baldwin
92e40a9b92 busdma_bounce: Batch bounce page free operations when possible.
Reviewed by:	imp
Differential Revision:	https://reviews.freebsd.org/D34968
2022-04-21 12:01:55 -07:00
John Baldwin
d4ab3a8d4f busdma_bounce: Add free_bounce_pages helper function.
Deduplicate code to iterate over the bpages list in a bus_dmamap_t
freeing bounce pages during bus_dmamap_unload.

Reviewed by:	imp
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D34967
2022-04-21 10:42:14 -07:00
John Baldwin
10fe9a1fb4 busdma_bounce: Make the map waiting list per-bounce-zone.
When pages are freed to a bounce zone, only maps waiting for pages for
that zone can make forward progress.  If a map for a different bounce
zone is at the head of the global list, then requests that could
otherwise make forward progress will be stalled waiting on the other
bounce zone.  If bounce zones shared bounce pages then a global list
would still make sense to prevent "later" requests from starving an
earlier request but that is not a concern with per-zone bounce page
pools.

Reviewed by:	imp
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D34966
2022-04-21 10:41:09 -07:00
John Baldwin
d11f5d4762 busdma_bounce: Use a simple kproc to invoke deferred requests.
Rather than using a software interrupt with a single handler, just
create a dedicated kernel process woken up with a simple wakeup().

Reviewed by:	imp
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D34965
2022-04-21 10:40:35 -07:00
John Baldwin
c7aa0304d5 Run softclock threads at a hardware ithread priority.
Add a new PI_SOFTCLOCK for use by softclock threads.  Currently this
maps to PI_AV which is the second-highest ithread priority.

Reviewed by:	mav, kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D33693
2022-04-21 10:40:01 -07:00
John Baldwin
3d7e90fc20 cpufreq_curr_sysctl: Use devclass_find to lookup cpufreq devclass.
Reviewed by:	imp
Differential Revision:	https://reviews.freebsd.org/D35002
2022-04-21 10:29:14 -07:00
Kristof Provost
a879e40ca2 callout: fix using shared rmlocks
15b1eb142c changed the callout code to store the CALLOUT_SHAREDLOCK flag
in c_iflags (where it used to be c_flags), but failed to update the
check in softclock_call_cc(). This resulted in the callout code always
taking the write lock, even if a read lock had been requested (with
the CALLOUT_SHAREDLOCK flag in callout_init_rm()).

Reviewed by:	markj
MFC after:	1 week
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D34959
2022-04-20 13:06:50 +02:00
John Baldwin
5bdea8826b devclass_add_driver: Permit NULL to be passed in dcp.
This permits a driver module structure that doesn't want to store a
pointer to the new driver's devclass.

Reviewed by:	imp
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D34962
2022-04-19 10:43:50 -07:00
Mateusz Guzik
c5c981d443 signals: plug a set-but-not-used var
Sponsored by:	Rubicon Communications, LLC ("Netgate")
2022-04-19 12:45:57 +00:00
John Baldwin
d139909d6e destroy_dev_sched*: Don't hold Giant for all deferred destroy_dev.
Rather than using taskqueue_swi_giant which holds Giant for all
deferred destroy_dev calls, create a separate queue for destroyed
devices with D_NEEDGIANT set in the corresponding cdevsw.  The task
for this queue holds Giant while destroying deferred devices while the
task for the default queue does not hold Giant.

In addition, switch to taskqueue_thread for destroy_dev_sched.
Deferred destroy_dev requests don't need to run at an SWI priority.

Reviewed by:	imp, markj
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D34915
2022-04-18 12:04:30 -07:00
Konstantin Belousov
362ff9867e Revert rest of a5970a529c: use vrefact() when working on fp->f_vnode
Now, since O_PATH-opened file descriptors use use references instead
of the hold references, vrefact() changes from that revision can be
reverted.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D34906
2022-04-15 16:56:20 +03:00
Ed Maste
f99cc5a389 sysent: regen after 52a1d90c8b, posix_fadvise in capmode
2022-04-14 15:17:36 -04:00
Ed Maste
52a1d90c8b Allow posix_fadvise in capability mode
posix_fadvise operates only on a provided fd.  Noted by
Mathieu <sigsys@gmail.com> in review D34761.

No new CAP_ rights are added for posix_fadvise(), as 'advice' in
general only influences when I/O happens; the fd must have existing
CAP_ rights for actual data access.

Reviewed by:	markj
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D34903
2022-04-14 15:11:21 -04:00
Konstantin Belousov
bf13db086b Mostly revert a5970a529c: Make files opened with O_PATH to not block non-forced unmount
Problem is that open(O_PATH) on nullfs -o nocache is broken then,
because there is no reference on the vnode after the open syscall exits.

Reported and tested by:	ambrisko
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2022-04-14 02:47:04 +03:00
John Baldwin
36fb372264 kern: Move variables only used for MAC under #ifdef MAC.
2022-04-13 16:08:23 -07:00
John Baldwin
4aec198420 sched_ule: Inline value of ts in sched_thread_priority.
This avoids a set but unused warning in kernels without SMP where
TDQ_CPU() doesn't use its argument.
2022-04-13 16:08:23 -07:00
John Baldwin
8758ac757f sched_4bsd: ts is only used in sched_bind for SMP.
2022-04-13 16:08:22 -07:00
John Baldwin
72ff256c51 sched_4bsd: Remove unused variables.
2022-04-12 14:58:59 -07:00
John Baldwin
dbd51c416a realloc(9): Move slab and zone under #ifndef DEBUG_REDZONE.
2022-04-12 14:58:59 -07:00
Mark Johnston
d769609620 tty: Remove an incorrect assertion from ttyinq_line_iterate()
We may legitimately have tib == NULL if we're at the very end of the
queue.

PR:		215373
Reported by:	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2022-04-12 17:30:04 -04:00
Tom Jones
1ea833a572 kdb: set kdb_why when entered via reboot and panic
Reviewed by:	jhb
Sponsored by:   NetApp, Inc.
Sponsored by:   Klara, Inc.
X-NetApp-PR:    #74
Differential Revision:	https://reviews.freebsd.org/D34551
2022-04-12 10:34:40 +01:00
Dmitry Chagin
c6487446d7 getdirentries: return ENOENT for unlinked but still open directory.
To be more compatible with IEEE Std 1003.1-2008 (“POSIX.1”).

Reviewed by:		mjg, Pau Amma (doc)
Differential revision:  https://reviews.freebsd.org/D34680
MFC after:		2 weeks
2022-04-11 23:30:16 +03:00
Konstantin Belousov
eca39864f7 Add sysctl KERN_LOCKF
reporting the snapshot of the active advisory locks.

A new VFS ops method vfs_report_lockf is provided in the mount point
op table.  If it is NULL, as it is currently for all existing
filesystems, vfs_report_lockf() function is used, which gathers
information from the standard implementation inside kern/kern_lockf.c.

Filesystems implementing their own locking (NFSv4 as example) can provide
a custom implementation.

Reviewed by:	markj, rmacklem
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D34756
2022-04-10 00:43:53 +03:00
Konstantin Belousov
147e4fe3f1 kern_lockf.c: remove no longer needed UFS headers
Reviewed by:	markj, rmacklem
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D34756
2022-04-10 00:43:53 +03:00
Konstantin Belousov
59e85819be lockf: remove lf_inode from struct lockf_entry
The UFS-specific struct inode cannot be used in generic advisory lock
code.  It was probably used as a shortcut for the debugging, as the
remnants of the code around it indicates.

Use somewhat more verbose and less concentrated, but universal,
VOP_PRINT(), where needed.

Reviewed by:	markj, rmacklem
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D34756
2022-04-10 00:43:53 +03:00
Gordon Bergling
f171938cd6 jail: Remove a double word in a source code comment
- s/a a/a/

MFC after:	3 days
2022-04-09 14:19:17 +02:00
Gordon Bergling
c3721292e3 kern: Remove a double word in a source code comment
- s/for for/for/

MFC after:	3 days
2022-04-09 10:50:04 +02:00
Gordon Bergling
768f9b8b8b kern: Fix a typo in a source code comment
- s/is is/is/

MFC after:	3 days
2022-04-09 09:14:14 +02:00
Andrew Turner
41e6d2091c Enable subr_physmem_test on supported architectures
Only build where it's supported.

While here add support for amd64 to help with testing.

Sponsored by:	The FreeBSD Foundation
2022-04-07 14:31:51 +01:00
Andrew Turner
d8bff5b67c Handle non-page aligned/sized memory in physmem
In some configurations the firmware may pass memory regions that are
not page sized or aligned, e.g. when using 16k pages on arm64. If this
is the case we will calculate many small regions because the alignment
is applied before being inserted. As we round the start up and end down
this will leave a 1 page hole between what should have been a single
region.

Fix by keeping the original alignment until we are just about to insert
the region into the avail array.
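
The effect of the ordering can be shown with a small sketch. The 16k page size, the region struct, and the function names are hypothetical, and the merge helper assumes the two regions touch (a.end == b.start).

```c
#include <stdint.h>

#define	PAGE_SZ	0x4000u			/* assumed 16k page size */

struct region {				/* hypothetical half-open range */
	uint64_t start, end;
};

/* Usable bytes left after rounding start up and end down to pages. */
uint64_t
usable_bytes(struct region r)
{
	uint64_t s = (r.start + PAGE_SZ - 1) & ~(uint64_t)(PAGE_SZ - 1);
	uint64_t e = r.end & ~(uint64_t)(PAGE_SZ - 1);

	return (e > s ? e - s : 0);
}

/* Align each firmware region on its own, then add up what survives. */
uint64_t
align_then_sum(struct region a, struct region b)
{
	return (usable_bytes(a) + usable_bytes(b));
}

/* Join two touching regions first and align only at the end. */
uint64_t
merge_then_align(struct region a, struct region b)
{
	struct region m = { a.start, b.end };

	return (usable_bytes(m));
}
```

For regions [0x0, 0x5000) and [0x5000, 0xc000), aligning first keeps only 0x8000 bytes and loses the page at 0x4000, while merging first keeps all 0xc000 bytes.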

Sponsored by:	The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34694
2022-04-06 14:13:29 +01:00
Andrew Turner
8c99dfed54 Port subr_physmem to userspace and add tests
These give us some confidence we haven't broken anything in early
boot code that may be running before the console.

Reviewed by:	emaste
Sponsored by:	The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34691
2022-04-06 14:13:05 +01:00
Mitchell Horne
eb9d205fa6 livedump: add event handler hooks
Add three hooks to the livedump process: before, after, and for each
block of dumped data. This allows, for example, quiescing the system
before the dump begins or protecting data of interest to ensure its
consistency in the final output.

Reviewed by:	markj, kib (previous version)
Reviewed by:	debdrup (manpages)
Reviewed by:	Pau Amma <pauamma@gundo.com> (manpages)
MFC after:	3 weeks
Sponsored by:	Juniper Networks, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D34067
2022-04-05 15:35:05 -03:00
Mitchell Horne
c9114f9f86 Add new vnode dumper to support live minidumps
This dumper can instantiate and write the dump's contents to a
file-backed vnode.

Unlike existing disk or network dumpers, the vnode dumper should not be
invoked during a system panic, and therefore is not added to the global
dumper_configs list. Instead, the vnode dumper is constructed ad-hoc
when a live dump is requested using the new ioctl on /dev/mem. This is
similar in spirit to a kgdb session against the live system via
/dev/mem.

As described briefly in the mem(4) man page, live dumps are not
guaranteed to result in a usable output file, but offer some debugging
value where forcefully panicking a system to dump its memory is not
desirable/feasible.

A future change to savecore(8) will add an option to save a live dump.

Reviewed by:	markj, Pau Amma <pauamma@gundo.com> (manpages)
Discussed with:	kib
MFC after:	3 weeks
Sponsored by:	Juniper Networks, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D33813
2022-04-05 15:35:05 -03:00
Mitchell Horne
59c27ea18c Split out dumper allocation from list insertion
Add a new function, dumper_create(), to allocate a dumper.
dumper_insert() will call this function and retains the existing
behaviour.

This is desirable for performing live dumps of the system. Here, there
is a need to allocate and configure a dumper structure that is invoked
outside of the typical debugger context. Therefore, it should be
excluded from the list of panic-time dumpers.

free_single_dumper() is made public and renamed to dumper_destroy().

Reviewed by:	kib, markj
MFC after:	1 week
Sponsored by:	Juniper Networks, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D34068
2022-04-05 15:35:05 -03:00
Mateusz Guzik
b7262756e2 vfs: fixup WANTIOCTLCAPS on open
In some cases vn_open_cred overwrites cn_flags, effectively nullifying
initialisation done in NDINIT. This will have to be fixed.

In the meantime make sure the flag is passed.

Reported by:	jenkins
Noted by:	Mathieu <sigsys@gmail.com>
2022-04-02 20:49:01 +02:00
Gordon Bergling
c9b04ee4f8 kern: Fix two typos in source code comments
- s/accomodate/accommodate/

MFC after:	3 days
2022-04-02 14:52:49 +02:00
Gordon Bergling
7181887e82 kern: Fix two typos in source code comments
- s/measurment/measurement/

MFC after:	3 days
2022-04-02 14:15:27 +02:00
Mateusz Guzik
0c805718cb vfs: fix memory leak on lookup with fds with ioctl caps
Reviewed by:	markj
PR:		262515
Noted by:	firk@cantconnect.ru
Differential Revision:	https://reviews.freebsd.org/D34667
2022-04-02 12:09:07 +00:00
Gordon Bergling
669d5ea4e3 kern: Fix a typo in a source code comment
- s/paniced/panicked/

MFC after:	3 days
2022-04-02 10:15:02 +02:00
Ed Maste
e5821a2156 syscalls.master: remove obsolete comment about compatibility tables
Compatibility ABIs no longer use a separate syscalls.master.

Fixes:		be67ea40c5 ("freebsd32: generate from ...")
Sponsored by:	The FreeBSD Foundation
2022-03-30 11:07:00 -04:00
Brooks Davis
8601fca789 sysent: regen for syscallarg_t 2022-03-28 19:43:03 +01:00
Brooks Davis
b1ad6a9000 syscallarg_t: Add a type for system call arguments
This more clearly differentiates system call arguments from integer
registers and return values. On current architectures it has no effect,
but on architectures where pointers are not integers (CHERI) and may
not even share registers (CHERI-MIPS) it is necessary to differentiate
between system call arguments (syscallarg_t) and integer register values
(register_t).

Obtained from:	CheriBSD

Reviewed by:	imp, kib
Differential Revision:	https://reviews.freebsd.org/D33780
2022-03-28 19:43:03 +01:00
Andrew Turner
f461b95561 Fix a sign mismatch warning in the physmem code
Make sure both sides of a comparison are unsigned. As the values being
compared are size_t, make the value in the for loop size_t too.

Sponsored by:	The FreeBSD Foundation
2022-03-28 11:51:09 +01:00
Mateusz Guzik
2533b5dc82 vfs: add missing bits to vdropl_impl
This completes the patch which was originally meant to go in.

Spotted by:	mhorne
Fixes: c35ec1efdc ("vfs: [1/2] fix stalls in vnode reclaim by not
requeieing from vnlru")
2022-03-27 14:35:37 +00:00
Mateusz Guzik
a4032e2a69 vfs: assorted tidy ups to lookup
No functional changes.
2022-03-26 17:06:09 +00:00
Alexander Leidinger
aeb91e95cf Log euid, rgid and jail on listen queue overflow
If you have numerous jails with multiple similar services running,
this helps to narrow down which services this log is referring to.
2022-03-26 11:17:55 +01:00
Eric van Gyzen
aca2a7faca stack_zero is not needed before stack_save
The man page was recently clarified to commit to this contract.

MFC after:	1 week
Sponsored by:	Dell EMC Isilon
2022-03-25 20:10:38 -05:00
Eric van Gyzen
863070bbf6 ksiginfo_alloc: pass M_WAITOK or M_NOWAIT to uma_zalloc
It expects exactly one of those flags.  A future commit will assert this.

Reviewed by:	rstone
MFC after:	1 month
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D34451
2022-03-25 20:10:37 -05:00
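The "exactly one of those flags" contract from the ksiginfo_alloc commit above can be sketched as a small predicate. This is only an illustration: EX_NOWAIT, EX_WAITOK and ex_wait_flags_valid() are hypothetical names and values, not the kernel's M_* definitions or the assertion the commit message anticipates.

```c
#include <stdbool.h>

/* Hypothetical flag bits, chosen for illustration only. */
#define EX_NOWAIT 0x0001
#define EX_WAITOK 0x0002

/*
 * Return true when exactly one of the two wait flags is set.
 * Other, unrelated bits in `flags` are ignored.
 */
static bool
ex_wait_flags_valid(int flags)
{
	int wait = flags & (EX_NOWAIT | EX_WAITOK);

	return (wait == EX_NOWAIT || wait == EX_WAITOK);
}
```

A caller that passes neither flag, or both, fails the check; a caller that passes one of them plus unrelated bits passes it.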
Mateusz Guzik
0f60088399 vfs: set cn_namelen when handling degenerate lookups
Turns out execve looks at it to store binary name, but in order to
trigger the problem one has to be trying to exec '/'. As is, the value
would be left uninitialized (or rather set to -1 on debug kernels).

Fixes:	56244d3574 ("vfs: hoist degenerate path lookups out of the
loop")
2022-03-25 18:19:36 +00:00
Mateusz Guzik
4ef6e56ae8 vfs: hoist trailing slash handling out of the loop 2022-03-24 14:36:31 +00:00
Mateusz Guzik
3b6792d28a vfs: factor symlink traversal out of namei
The intent down the road is to eliminate the loop to begin with,
pushing traversal down to vfs_lookup, all while not allocating the
extra buffer.
2022-03-24 13:11:22 +00:00
Mateusz Guzik
d9ea7e2b1e vfs: factor FAILIFEXISTS handling out of vfs_lookup 2022-03-24 11:22:20 +00:00
Mateusz Guzik
56244d3574 vfs: hoist degenerate path lookups out of the loop 2022-03-24 11:22:12 +00:00
Mateusz Guzik
bb92cd7bcd vfs: NDFREE(&nd, NDF_ONLY_PNBUF) -> NDFREE_PNBUF(&nd) 2022-03-24 10:20:51 +00:00
Mark Johnston
1babcad6bc elf: Avoid dumping uninitialized bytes in PRSTATUS core dump notes
elf_prstatus_t contains pad space.

Reported by:	KMSAN
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D34606
2022-03-23 12:53:49 -04:00
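The pad-space problem named in the PRSTATUS commit above can be illustrated in userland. This is a sketch under assumptions: struct ex_note and both helpers are hypothetical and unrelated to the real elf_prstatus_t layout, and the sketch relies on common compiler behaviour, where member stores leave already-zeroed padding bytes untouched.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record; on typical ABIs padding follows `tag`. */
struct ex_note {
	char tag;
	int  value;
};

/* Zero the whole object first so padding bytes are not stale memory. */
static void
ex_note_fill(struct ex_note *n, char tag, int value)
{
	assert(n != NULL);
	memset(n, 0, sizeof(*n));
	n->tag = tag;
	n->value = value;
}

/* Return 1 if every byte between `tag` and `value` is zero. */
static int
ex_note_padding_is_zero(const struct ex_note *n)
{
	const unsigned char *p = (const unsigned char *)n;
	size_t i;

	for (i = sizeof(n->tag); i < offsetof(struct ex_note, value); i++) {
		if (p[i] != 0)
			return (0);
	}
	return (1);
}
```

Without the memset() call, copying the whole structure out would also copy whatever bytes happened to occupy the padding.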
Mark Johnston
7524994da0 callout: Remove the CS_EXECUTING flag
It is now unused.

MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D34626
2022-03-23 12:37:02 -04:00
Mark Johnston
b319171861 setitimer: Fix exit race
We use the p_itcallout callout, interlocked by the proc lock, to
schedule timeouts for the setitimer(2) system call.  When a process
exits, the callout must be stopped before the process struct is
recycled.

Currently we attempt to stop the callout in exit1() with the call
_callout_stop_safe(&p->p_itcallout, CS_EXECUTING).  If this call returns
0, then we sleep in order to drain the callout.  However, this happens
only if the callout is not scheduled at all.  If the callout thread is
blocked on the proc lock, then exit1() will not block and the callout
may execute after the process has fully exited, typically resulting in a
panic.

I cannot see a reason to use the CS_EXECUTING flag here.  Instead, use
the regular callout_stop()/callout_drain() dance to halt the callout.

Reported by:	ler
Tested by:	ler, pho
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D34625
2022-03-23 12:36:12 -04:00
Alexander Motin
fd6ca665d2 Fix umtxq_sleep() regression caused by 56070dd2e4.
umtxq_requeue() moves the queue to a different hash chain and different
lock, so we can't rely on msleep_sbt() reacquiring the same old lock.
We have to use PDROP and update the queue chain, and thus the lock pointer.

PR:		262587
MFC after:	2 weeks
2022-03-21 19:55:55 -04:00
firk
bb53dd56c3 kern_tc.c/cputick2usec() (which is used to calculate cputime from
cpu ticks) has some imprecision and, worse, a huge timestep (about
20 minutes on a 4GHz CPU) near 53.4 days of elapsed time.

kern_time.c/cputick2timespec() (used by clock_gettime() to query the
cpu time consumed by a process or thread) uses cputick2usec()
and then needlessly converts usec to nsec, obviously losing
precision even with a fixed cputick2usec().

kern_time.c/kern_clock_getres() uses a weird (and in any case wrong)
formula to get the cputick resolution.

PR:		262215
Reviewed by:	gnn
Differential Revision:	https://reviews.freebsd.org/D34558
2022-03-21 09:33:46 -04:00
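The figures in the cputick2usec() commit above can be reproduced with plain integer arithmetic. The sketch below is an assumption-laden illustration, not the kernel code: both helper functions, the tick count EX_TICKS (about 2^64 / 1000) and the rate EX_RATE (just under 4 GHz) are hypothetical. It shows how dividing by a tick rate that has been truncated to whole ticks per microsecond can be off by roughly 19 minutes after about 53 days, while splitting the count into whole seconds and a remainder avoids the error.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical inputs: about 2^64 / 1000 ticks, at just under 4 GHz. */
#define EX_TICKS 18446744073709552ULL
#define EX_RATE  3999999999ULL

/*
 * Lossy conversion: the rate is first truncated to whole ticks per
 * microsecond (3999 for EX_RATE), so the result is too large.
 */
static uint64_t
usec_truncated_rate(uint64_t ticks, uint64_t rate)
{
	assert(rate >= 1000000);	/* avoid division by zero */
	return (ticks / (rate / 1000000));
}

/*
 * Split conversion: whole seconds plus the sub-second remainder.
 * rem < rate, so rem * 1000000 fits in 64 bits for realistic rates.
 */
static uint64_t
usec_split(uint64_t ticks, uint64_t rate)
{
	uint64_t sec, rem;

	assert(rate != 0);
	sec = ticks / rate;
	rem = ticks % rate;
	return (sec * 1000000 + rem * 1000000 / rate);
}
```

With these inputs, usec_split(EX_TICKS, 4000000000ULL) corresponds to about 53.4 days, and the two functions disagree by roughly 1153 seconds when called with EX_RATE.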
Andrew Turner
cab496e16c Make SHMMAXPGS an unsigned long
This is used to calculate sizes that are then stored in unsigned long
fields. Make this unsigned long so the calculations use this type
rather than an int, which can overflow with a large PAGE_SIZE.

This allows building on arm64 with a PAGE_SIZE of 16k. Further work
will be needed if a 32-bit architecture tries to use a similarly sized
page.

Sponsored by:	The FreeBSD Foundation
2022-03-21 10:27:35 +00:00
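The overflow described in the SHMMAXPGS commit above can be sketched as follows. The constants are illustrative assumptions rather than the kernel's definitions, the helper names are hypothetical, and the sketch assumes a 32-bit int.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Illustrative values only; not taken from the kernel headers. */
#define EX_PAGE_SIZE  16384		/* 16k pages, as in the arm64 case */
#define EX_MAXPGS_INT 131072		/* page count typed as int */
#define EX_MAXPGS_UL  131072UL		/* page count typed as unsigned long */

/* True if pages * page_size would not fit in an int. */
static bool
int_product_overflows(int pages, int page_size)
{
	assert(pages >= 0 && page_size >= 0);
	return (pages != 0 && page_size > INT_MAX / pages);
}

/* With an unsigned long operand the multiplication is done in that type. */
static unsigned long
shm_bytes_ul(unsigned long pages, unsigned long page_size)
{
	return (pages * page_size);
}
```

Here 131072 pages of 16k each come to 2^31 bytes, one more than a 32-bit INT_MAX, whereas the same page count with 4k pages still fits in an int.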
Colin Percival
2406867f5b tslog: Add CTLFLAG_SKIP to sysctls
The timestamp logs are quite large (often much larger than all the
other sysctls combined) so it's unlikely anyone will want to have
them displayed by `sysctl -a`.

MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D34616
2022-03-20 11:31:16 -07:00
Mateusz Guzik
6ff3e8a316 cache: add a comment about a realpath bug 2022-03-19 15:11:25 +00:00
Mateusz Guzik
eb574ba0b6 vfs: replace VFS_NOTIFY_UPPER_* macros with an enum 2022-03-19 13:15:55 +00:00
Mateusz Guzik
cceb91b025 vfs: add missing flags to db show mount 2022-03-19 12:04:44 +00:00
Mateusz Guzik
93a0ba8f49 vfs: retire the no longer used MNTK_LOOKUP_EXCL_DOTDOT flag
Reviewed by:	markj
Tested by:	pho (previous version)
Differential Revision:	https://reviews.freebsd.org/D34466
2022-03-19 10:47:29 +00:00
Mateusz Guzik
1cb0045c97 vfs: add MNTK_UNLOCKED_INSMNTQUE
Can be used when the fs at hand can synchronize insmntque by means
other than the vnode lock.

Reviewed by:	markj
Tested by:	pho (previous version)
Differential Revision:	https://reviews.freebsd.org/D34466
2022-03-19 10:46:40 +00:00
firk
28d08dc7d0 clock_gettime: Fix CLOCK_THREAD_CPUTIME_ID race
Use a spinlock section instead of a critical section to synchronize with
statclock().  Otherwise the CLOCK_THREAD_CPUTIME_ID clock can appear to
go backwards.

PR:		262273
Reviewed by:	markj
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D34568
2022-03-17 15:39:00 -04:00
Mark Johnston
fc7e121d88 file: Move FILEDESC_FOREACH macros to kern_descrip.c
They are only used in kern_descrip.c, so make them private.  No
functional change intended.

Discussed with:	mjg
Sponsored by:	The FreeBSD Foundation
2022-03-17 15:39:00 -04:00