By calling the function too early we might still have the td_pflags
value cached from the previous struct thread use. cpu_copy_thread()
depends on correct value for TDP_KTHREAD at least on x86.
Reported, bisected, and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D36069
This ensures the filedesc-to-leader code is consistently encapsulated in
kern_descrip.c.
No functional change intended.
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35988
The TDA_AST flag is set on td2 unconditionally (as it was TDF_ASTPENDING
before AST rework), so it is not used practically for some time.
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D36033
When it was inline it made sense to depend on the existing nested check
in KTRUSERRET() rather than adding a new td_flags flag. However, since
we now have a TDA_KTRACE flag anyway, we might as well check it and
avoid the call.
Suggested by: jhb
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D35888
Explicitly pass the struct thread argument.
Move the function prototype from sys/systm.h to geom/geom.h, we do not
need almost each kernel source to see the prototype, it is now used
only by kern/vfs_mountroot.c outside geom/geom_event.c, where the
function is defined.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D35888
Make most AST handlers dynamically registered. This allows to have
subsystem-specific handler source located in the subsystem files,
instead of making subr_trap.c aware of it. For instance, signal
delivery code on return to userspace is now moved to kern_sig.c.
Also, it allows to have some handlers designated as the cleanup (kclear)
type, which are called both at AST and on thread/process exit. For
instance, ast(), exit1(), and NFS server no longer need to be aware
about UFS softdep processing.
The dynamic registration also allows third-party modules to register AST
handlers if needed. There is one caveat with loadable modules: the
code does not make any effort to ensure that the module is not unloaded
before all threads processed through AST handler in it. In fact, this
is already present behavior for hwpmc.ko and ufs.ko. I do not think it
is worth the efforts and the runtime overhead to try to fix it.
Reviewed by: markj
Tested by: emaste (arm64), pho
Discussed with: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D35888
Implement Linux-variant of MSG_TRUNC input flag used in recv(), recvfrom() and recvmsg().
Posix defines MSG_TRUNC as an output flag, indicating packet/datagram truncation.
Linux extended it a while (~15+ years) ago to act as input flag,
resulting in returning the full packet size regarless of the input
buffer size.
It's a (relatively) popular pattern to do recvmsg( MSG_PEEK | MSG_TRUNC) to get the
packet size, allocate the buffer and issue another call to fetch the packet.
In particular, it's popular in userland netlink code, which is the primary driving factor of this change.
This commit implements the MSG_TRUNC support for SOCK_DGRAM sockets (udp, unix and all soreceive_generic() users).
PR: kern/176322
Reviewed by: pauamma(doc)
Differential Revision: https://reviews.freebsd.org/D35909
MFC after: 1 month
The devmap variants used vm_offset_t for some reason, and a few places
explicitly cast bus addresses to vm_offset_t. (Probably those casts
along with similar casts for vm_size_t should just be removed and
instead permit the compiler to DTRT.)
Reviewed by: markj
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D35961
With clang 15, the following -Werror warning is produced:
sys/kern/vfs_bio.c:3430:11: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
buf_daemon()
^
void
This is because buf_daemon() is declared with a (void) argument list,
but defined with an empty argument list. Make the definition match the
declaration.
MFC after: 3 days
With clang 15, the following -Werror warnings are produced:
sys/kern/sysv_msg.c:213:8: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
msginit()
^
void
sys/kern/sysv_msg.c:316:10: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
msgunload()
^
void
This is because msginit() and msgunload() are declared with (void)
argument lists, but defined with empty argument lists. Make the
definitions match the declarations.
MFC after: 3 days
With clang 15, the following -Werror warning is produced:
sys/kern/subr_bus.c:871:16: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
bus_topo_assert()
^
void
This is because bus_topo_assert() is declared with a (void) argument
list, but defined with an empty argument list. Make the definition match
the declaration.
MFC after: 3 days
With clang 15, the following -Werror warning is produced:
sys/kern/subr_autoconf.c:119:34: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
run_interrupt_driven_config_hooks()
^
void
This is because run_interrupt_driven_config_hooks() is declared with a
(void) argument list, but defined with an empty argument list. Make the
definition match the declaration.
MFC after: 3 days
With clang 15, the following -Werror warnings are produced:
sys/kern/kern_resource.c:1212:10: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
lim_alloc()
^
void
sys/kern/kern_resource.c:1365:11: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
uihashinit()
^
void
This is because lim_alloc() and uihashinit() are declared with (void)
argument lists, but defined with empty argument lists. Make the
definitions match the declarations.
MFC after: 3 days
With clang 15, the following -Werror warnings are produced:
sys/kern/kern_dtrace.c:64:18: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
kdtrace_proc_size()
^
void
sys/kern/kern_dtrace.c:87:20: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
kdtrace_thread_size()
^
void
This is because kdtrace_proc_size() and kdtrace_thread_size() are
declared with (void) argument lists, but defined with empty argument
lists. Make the definitions match the declarations.
MFC after: 3 days
With clang 15, the following -Werror warnings are produced:
sys/kern/kern_cons.c:201:14: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
cninit_finish()
^
void
sys/kern/kern_cons.c:376:7: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
cngrab()
^
void
sys/kern/kern_cons.c:389:9: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
cnungrab()
^
void
sys/kern/kern_cons.c:402:9: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
cnresume()
^
void
This is because cninit_finish(), cngrab(), cnungrab(), and cnresume()
are declared with (void) argument lists, but defined with empty argument
lists. Make the definitions match the declarations.
MFC after: 3 days
Historic FreeBSD behaviour (dating back to 1994-04-02) when rebooting
is to print "Rebooting..." and then
/* wait 1 sec for printf's to complete and be read */
Prior to April 1994, there was a 100 ms delay (added 1993-11-12).
Since (a) most users will already be aware that the system is rebooting
and do not need to take time to read an additional message to that
effect, and (b) most FreeBSD systems don't have anyone actively looking
at the console anyway, this delay no longer serves much purpose.
This commit adds a kern.reboot_wait_time sysctl which defaults to 0;
historic behaviour can be regained by setting it to 1.
Reviewed by: imp
Relnotes: FreeBSD now reboots faster; to restore the traditional
wait after printing "Rebooting..." to the console, set
kern.reboot_wait_time=1 (or more).
Sponsored by: https://www.patreon.com/cperciva
Differential Revision: https://reviews.freebsd.org/D35796
Add three simple hooks to the debugger allowing for a loaded MAC policy
to intervene if desired:
1. Before invoking the kdb backend
2. Before ddb command registration
3. Before ddb command execution
We extend struct db_command with a private pointer and two flag bits
reserved for policy use.
Reviewed by: markj
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D35370
This is not completely exhaustive, but covers a large majority of
commands in the tree.
Reviewed by: markj
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D35583
The load balancer may force a running thread to reschedule and pick a
new CPU. To do this it sets some flags in the thread running on a
loaded CPU. But the code assumed that a running thread's lock is the
same as that of the corresponding runqueue, and there are small windows
where this is not true. In this case, we can end up with non-atomic
modifications to td_flags.
Since this load balancing is best-effort, simply give up if the thread's
lock doesn't match; in this case the thread is about to enter the
scheduler anyway.
Reviewed by: kib
Reported by: glebius
Fixes: e745d729be ("sched_ule(4): Improve long-term load balancer.")
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35821
It used to be mapped at the top of the UVA.
If the randomization is enabled any address above .data section will be
randomly chosen and a guard page will be inserted in the shared page
default location.
The shared page is now mapped in exec_map_stack, instead of
exec_new_vmspace. The latter function is called before image activator
has a chance to parse ASLR related flags.
The KERN_PROC_VM_LAYOUT sysctl was extended to provide shared page
address.
The feature is enabled by default for 64 bit applications on all
architectures.
It can be toggled kern.elf64.aslr.shared_page sysctl.
Approved by: mw(mentor)
Sponsored by: Stormshield
Obtained from: Semihalf
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D35349
Store the shared page address in struct vmspace.
Also instead of storing absolute addresses of various shared page
segments save their offsets with respect to the shared page address.
This will be more useful when the shared page address is randomized.
Approved by: mw(mentor)
Sponsored by: Stormshield
Obtained from: Semihalf
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D35393
Use a getter macro instead of fetching the sigcode address directly
from a sysent of a given process. It assumes that the sigcode is stored
in the shared page, which is true in all cases, except for a.out
binaries. This will be later useful when the shared page address
randomization is introduced.
No functional change intended.
Approved by: mw(mentor)
Sponsored by: Stormshield
Obtained from: Semihalf
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D35392
The passed cpuid is always equal to the one stored in the callout
structure. No functional change intended.
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Replace struct timeval in header with struct timespec.
To differentiate header formats, add a new KTR_VERSIONED flag
set in the header type field similar to the existing KTRDROP flag.
To make it easier to extend ktrace headers in the future,
extend the existing header with a version field (version 0 is
reserved for older records without KTR_VERSIONED) as well as
new fields holding the thread ID and CPU ID.
Reviewed by: jhb, pauamma
Differential Revision: https://reviews.freebsd.org/D35774
MFC after: 2 weeks
Allow high priority hardware interrupts to run at PI_REALTIME via
INTR_TYPE_CLK, but collapse all other hardware interrupt threads to
the next priority level (PI_INTR). Collapse all SWI priorities to
the same priority level (PI_SOFT) just below PI_INTR.
Reviewed by: kib, markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D35646
If an interrupt thread runs for a full quantum without yielding the
CPU, demote its priority and schedule a preemption to give other
ithreads a turn.
Reviewed by: kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D35645
If an interrupt thread runs for a full quantum without yielding the
CPU, demote its priority and schedule a preemption to give other
ithreads a turn.
Reviewed by: kib, markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D35644
Use sched_wakeup instead of sched_add when marking an ithread
runnable. This allows schedulers to reset their internal time slice
tracking state and restore the base ithread priority when an ithread
resumes from idle.
Reviewed by: markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D35643
Use it instead of sched_prio when setting the priority of an interrupt
thread.
Reviewed by: kib, markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D35642
Different fields in the tdq have different synchronization protocols.
Some are constant, some are accessed only while holding the tdq lock,
some are modified with the lock held but accessed without the lock, some
are accessed only on the tdq's CPU, and some are not synchronized by the
lock at all.
Convert ULE to stop using volatile and instead use atomic_load_* and
atomic_store_* to provide the desired semantics for lockless accesses.
This makes the intent of the code more explicit, gives more freedom to
the compiler when accesses do not need to be qualified, and lets KCSAN
intercept unlocked accesses.
Thus:
- Introduce macros to provide unlocked accessors for certain fields.
- Use atomic_load/store for all accesses of tdq_cpu_idle, which is not
synchronized by the mutex.
- Use atomic_load/store for accesses of the switch count, which is
updated by sched_clock().
- Add some comments to fields of struct tdq describing how accesses are
synchronized.
No functional change intended.
Reviewed by: mav, kib
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35737
The load balancer executes from statclock and periodically tries to move
threads among CPUs in order to balance load. It may move a thread to
the current CPU (the loader balancer always runs on CPU 0). When it
does so, it may need to schedule preemption of the interrupted thread.
Use sched_setpreempt() to do so, same as sched_add().
PR: 264867
Reviewed by: mav, kib, jhb
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35744
Thread switching used to be atomic with respect to the current CPU's
tdq lock. Since commit 686bcb5c14 that is no longer the case. Now
sched_switch() does this:
1. lock tdq (might already be locked)
2. maybe put the current thread in the tdq, choose a new thread to run
2a. update tdq_lowpri
3. unlock tdq
4. switch CPU context, update curthread
Some code paths in ULE will load pc_curthread from a remote CPU with
that CPU's tdq lock held, usually to inspect its priority. But, as of
the aforementioned commit this is racy.
The problem I noticed is in tdq_notify(), which optionally sends an IPI
to a remote CPU when a new thread is added to its runqueue. If the new
thread's priority is higher (lower) than the currently running thread's
priority, then we deliver an IPI. But inspecting
pc_curthread->td_priority doesn't work, since pc_curthread might be
between steps 3 and 4 above. If pc_curthread's priority is higher than
that of the newly added thread, but pc_curthread is switching to a
lower-priority thread, then tdq_notify() might fail to deliever an IPI,
leaving a high priority thread stuck on the runqueue for longer than it
should. This can cause multi-millisecond stalls in
interactive/ithread/realtime threads.
Fix this problem by modifying tdq_add() and tdq_move() to return the
value of tdq_lowpri before the addition of the new thread. This ensures
that tdq_notify() has the correct priority value to compare against.
The other two uses of pc_curthread are susceptible to the same race. To
fix the one in sched_rem()->tdq_setlowpri() we need to have an exact
value for curthread. Thus, introduce a new tdq_curthread field to the
tdq which gets updated any time a new thread is selected to run on the
CPU. Because this field is synchronized by the thread lock, its
priority reflects the correct lowpri value for the tdq.
PR: 264867
Fixes: 686bcb5c14 ("schedlock 4/4")
Reviewed by: mav, kib, jhb
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35736
- Remove a flag variable.
- Convert a runtime check of the passed cpuid to a KASSERT.
- Remove the cc_inited flag. An attempt to schedule a callout before
SI_SUB_CPU will crash anyway since the per-CPU mutexes won't have been
initialized, and that flag was only checked in the case where a cpuid
was explicitly specified by the caller.
No functional change intended.
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Stop including the current CPU in all event messages, since it's already
saved in KTR log entries and thus is redundant. All eventtimer traces
occur in a context where CPU migration is not possible.
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
In handleevents(), lock the timer state before fetching the time for the
next event. A concurrent callout_cc_add() call might be changing the
next event time, and the race can cause handleevents() to program an
out-of-date time, causing the callout to run later (by an unbounded
period, up to the idle hardclock period of 1s) than requested.
In cpu_idleclock(), call getnextcpuevent() with the timer state mutex
held, for similar reasons. In particular, cpu_idleclock() runs with
interrupts enabled, so an untimely timer interrupt can result in a stale
next event time being programmed. Further, an interrupt can cause
cpu_idleclock() to use a stale value for "now".
In cpu_activeclock(), disable interrupts before loading "now", so as to
avoid going backwards in time when calling handleevents(). It's ok to
leave interrupts enabled when checking "state->idle", since the race at
worst will cause handleevents() to be called unnecessarily. But use an
atomic load to indicate that the test is racy.
PR: 264867
Reviewed by: mav, jhb, kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35735
Callers have already loaded the pointer, so these functions don't need
to fetch it again.
No functional change intended.
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Some command definitions were forced to use DB_FUNC in order to specify
their required flags, CS_OWN or CS_MORE. Use the new macros to simplify
these.
Reviewed by: markj, jhb
MFC after: 3 days
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D35582
o Retire SS_FDREF as it is basically a debug flag on top of already
existing soref()/sorele().
o Convert SS_PROTOREF into soref()/sorele().
o Change reference model for the listen queues, see below.
o Make sofree() private. The correct KPI to use is only sorele().
o Make soabort() respect the model and sorele() instead of sofree().
Note on listening queues. Until now the sockets on a queue had zero
reference count. And the reference were given only upon accept(2). The
assumption was that there is no way to see the queued socket from anywhere
except its head. This is not true, since queued sockets already have pcbs,
which are linked at least into the global pcb lists. With this change we
put the reference right in the sonewconn() and on accept(2) path we just
hand the existing reference to the file descriptor.
Differential revision: https://reviews.freebsd.org/D35679
Rename SS_NOFDREF to SS_FDREF and flip all bitwise operations.
Mark sockets created by socreate() with SS_FDREF.
This change is mostly illustrative. With it we see that SS_FDREF
is a debugging flag, since:
* socreate() takes a reference with soref().
* on accept path solisten_dequeue() takes a reference
with soref() and then soaccept() sets SS_FDREF.
* soclose() checks SS_FDREF, removes it and does sorele().
Reviewed by: tuexen
Differential revision: https://reviews.freebsd.org/D35678
Change the default from printing a breif kernel thread stack informaton
back to omitting it for non-invariant kernels in response to
SIGINFO/^T. Full and brief stack support can be selected with the
kern.tty_info_kstacks sysctl.
MFC After: 2 weeks
Sponsored by: Netflix
Reviewed by: grembo, jhb
Differential Revision: https://reviews.freebsd.org/D35576
For the kernel this is mostly a non-functional change. However, this
will be useful for simplifying gcore(1).
Reviewed by: markj
MFC after: 2 weeks
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D35666
Add shm_remove_prison(), that removes all POSIX shared memory segments
belonging to a prison. Call it from prison_cleanup() so a prison
won't be stuck in a dying state due to the resources still held.
PR: 257555
Reported by: grembo
Currently, when a jail starts dying, either by losing its last user
reference or by being explicitly killed,
osd_jail_call(...PR_METHOD_REMOVE...) is called. Encapsulate this
into a function prison_cleanup() that can then do other cleanup.
Instead of returning EMSGSIZE pass the error code from fdallocn() directly
to userland. That would be EMFILE, which makes much more sense. This
error code is not listed in the specification[1], but the specification
doesn't cover such edge case at all. Meanwhile the specification lists
EMSGSIZE as the error code for invalid value of msg_iovlen, and FreeBSD
follows that, see sys_recmsg(). Differentiating these two cases will make
a developer/admin life much easier when debugging.
[1] https://pubs.opengroup.org/onlinepubs/9699919799/functions/recvmsg.html
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35640
OpenVPN Data Channel Offload (DCO) moves OpenVPN data plane processing
(i.e. tunneling and cryptography) into the kernel, rather than using tap
devices.
This avoids significant copying and context switching overhead between
kernel and user space and improves OpenVPN throughput.
In my test setup throughput improved from around 660Mbit/s to around
2Gbit/s.
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D34340
A one-to-many unix/dgram socket is a socket that has been bound
with bind(2) and can get multiple connections. A typical example
is /var/run/log bound by syslogd(8) and receiving multiple
connections from libc syslog(3) API. Until now all of these
connections shared the same receive socket buffer of the bound
socket. This made the socket vulnerable to overflow attack.
See 240d5a9b1c for a historical attempt to workaround the problem.
This commit creates a per-connection socket buffer for every single
connected socket and eliminates the problem. The new behavior will
optimize seldom writers over frequent writers. See added test case
scenarios and code comments for more detailed description of the
new behavior.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35303
o Use m_pkthdr.memlen from m_uiotombuf()
o Modify unp_internalize() to keep track of allocated space and memory
as well as pointer to the last buffer.
o Modify unp_addsockcred() to keep track of allocated space and memory
as well as pointer to the last buffer.
o Record the datagram len/memlen/ctllen in the first (from) mbuf of the
chain in uipc_sosend_dgram() and reuse it in uipc_soreceive_dgram().
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35302
Data allocated by m_uiotombuf() usually goes into a socket buffer.
We are interested in the length of useful data to be added to sb_acc,
as well as total memory used by mbufs. The later would be added to
sb_mbcnt. Calculating this value at allocation time allows to save
on extra traversal of the mbuf chain.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35301
This change fully splits away PF_UNIX/SOCK_DGRAM from other socket
buffer implementations, without any behavior changes.
Generic socket implementation is reduced down to one STAILQ and very
little code.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35300
Split struct sockbuf into common shared fields and protocol specific
union, where protocols are free to implement whatever buffer they
want. Such protocols should mark themselves with PR_SOCKBUF and are
expected to initialize their buffers in their pr_attach and tear
them down in pr_detach.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35299
Use this new version in unix/dgram socket when sending to a target
address. This removes extra lock release/acquisition and possible
counter-intuitive ENOTCONN.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35298
This allows to remove one M_NOWAIT allocation and also makes it
more clear what's going on.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35297
With this second step PF_UNIX/SOCK_DGRAM has protocol specific
implementation. This gives some possibility performance
optimizations. However, it still operates on the same struct
socket as all other sockets do.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35296
Remove the dead code. The new uipc_sosend_dgram() handles send()
on PF_UNIX/SOCK_DGRAM in full.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35294
This is first step towards splitting classic BSD socket
implementation into separate classes. The first to be
split is PF_UNIX/SOCK_DGRAM as it has most differencies
to SOCK_STREAM sockets and to PF_INET sockets.
Historically a protocol shall provide two methods for sendmsg(2):
pru_sosend and pru_send. The former is a generic send method,
e.g. sosend_generic() which would internally call the latter,
uipc_send() in our case. There is one important exception, though,
the sendfile(2) code will call pru_send directly. But sendfile
doesn't work on SOCK_DGRAM, so we can do the trick. We will create
socket class specific uipc_sosend_dgram() which will carry only
important bits from sosend_generic() and uipc_send().
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35293
Partially revert the previous change; we need to keep this method as a
specific override for pci_driver subclasses which should not use
pci_rescan_method() -- cardbus and ofw_pcibus. However, change the return
value to ENODEV for the same reasoning given in the original commit, and
use this as the default rescan method in bus_if.m.
Reported by: jhb
Fixes: 36a8572ee8 ("bus_if: provide a default null rescan method")
MFC with: 36a8572ee8
The third argument to this function indicates whether the supplied
ticker is fixed or variable, i.e. requiring calibration. Give this
argument a type and name that better conveys this purpose.
Reviewed by: kib, markj
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35459
There is an existing helper method in subr_bus.c, but almost no drivers
know to use it. It also returns the same error as an empty method,
making it not very useful. Move this to bus_if.m and return a more
sensible error code.
This gives a slightly more meaningful error message when attempting
'devctl rescan' on buses and devices alike:
"Device not configured" --> "Operation not supported by device"
Reviewed by: imp
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35501
Calculate the desired page valid mask using math that will not
overflow the types used.
Sponsored by: Netflix
Reviewed by: mckusick, kib, markj
Differential Revision: https://reviews.freebsd.org/D34837
- Remove the AIO proc zone. This zone gets one allocation per AIO
daemon process, which isn't enough to warrant a dedicated zone. Plus,
unlike other AIO structures, aiops are small (32 bytes with LP64), so
UMA doesn't provide better space efficiency than malloc(9). Change
one of the malloc types in vfs_aio.c to make it more general.
- Don't set the NOFREE flag on the other AIO zones. This flag means
that memory allocated to the AIO subsystem is never freed back to the
VM, so it's always preferable to avoid using it when possible. NOFREE
was set without explanation when AIO was converted to use UMA 20 years
ago, but it does not appear to be required; all of the structures
allocated from UMA (per-process kaioinfo, kaiocb, and aioliojob) keep
track of references and get freed only when none exist. Plus, these
structures will contain dangling pointer after they're freed (e.g.,
the "cred", "fd_file" and "uiop" fields of struct kaiocb), so
use-after-frees are dangerous even when the structures themselves are
type-stable.
Reviewed by: asomers
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35493
Add kf_pipe_buffer_[in/out/size] fields to kf_pipe, and populate them.
Add a kf_kqueue struct to the kf_un union, to allow querying kqueue state,
and populate it.
Populate the kf_sock_rcv_sb_state and kf_sock_snd_sb_state fields in
kf_sock for INET/INET6 sockets, and populate all other fields for all
transport layer protocols, not just TCP.
Bump __FreeBSD_version.
Differential revision: https://reviews.freebsd.org/D34184
Reviewed by: jhb, kib, se
MFC after: 1 week
When locking the knote list for a socket, we check whether the socket is
a listening socket in order to select the appropriate mutex; a listening
socket uses the socket lock, while data sockets use socket buffer
mutexes.
If SOLISTENING(so) is false and the knote lock routine locks a socket
buffer, then it must re-check whether the socket is a listening socket
since solisten_proto() could have changed the socket's identity while we
were blocked on the socket buffer lock.
Reported by: syzkaller
Reviewed by: glebius
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35492
When the kernel is compiled with -asan-stack=true, the address sanitizer
will emit inline accesses to the shadow map. In other words, some
shadow map accesses are not intercepted by the KASAN runtime, so they
cannot be disabled even if the runtime is not yet initialized by
kasan_init() at the end of hammer_time().
This went unnoticed because the loader will initialize all PML4 entries
of the bootstrap page table to point to the same PDP page, so early
shadow map accesses do not raise a page fault, though they are silently
corrupting memory. In fact, when the loader does not copy the staging
area, we do get a page fault since in that case only the first and last
PML4Es are populated by the loader. But due to another bug, the loader
always treated KASAN kernels as non-relocatable and thus always copied
the staging area.
It is not really practical to annotate hammer_time() and all callees
with __nosanitizeaddress, so instead add some early initialization which
creates a shadow for the boot stack used by hammer_time(). This is only
needed by KASAN, not by KMSAN, but the shared pmap code handles both.
Reported by: mhorne
Reviewed by: kib
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35449
The pointer to the mount values may be null if an error occurred while
copying them in, so fix the assertion condition to reflect that
possibility.
While here, move some initialization code into the error == 0 block. No
functional change intended.
Reported by: syzkaller
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
... rather than setting and clearing flags inline. No functional change
intended.
Reviewed by: alc, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35469
Suppose a thread tries to read from an empty pipe. pipe_read() does the
following:
1. pipelock(), possibly sleeping
2. check for buffered data
3. pipeunlock()
4. set PIPE_WANTR and sleep
5. goto 1
pipelock() is an open-coded mutex; if a thread blocks in pipelock(), it
sleeps until the lock holder calls pipeunlock().
Both sleeps use the same wait channel. So if there are multiple threads
in pipe_read(), a thread T1 in step 3 can wake up a thread T2 sleeping
in step 4. Then T1 goes to sleep in step 4, and T2 acquires and
releases the pipelock, waking up T1 again. This can go on indefinitely,
livelocking the process (and potentially starving a would-be writer).
Fix the problem by using a separate wait channel for pipelock().
Reported by: Paul Floyd <paulf2718@gmail.com>
Reviewed by: mjg, kib
PR: 264441
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35415
This is racy because curproc process lock is not used, but allows the
process to exit faster. It is userspace issue to create such race
anyway, and not fullfilling the guarantee that all reaper descendants
are signalled should be fine.
In collaboration with: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D35310
We drop proctree_lock, which allows the process to exit while memoized
in the list to proceed.
Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D35310
Recorded reaper might loose its reaper status, so we should not assert
it, but check and avoid signalling if this happens.
Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 week
Differential revision: https://reviews.freebsd.org/D35310
The failure means that the process does single-threading itself, which
makes our action not needed.
Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D35310
If we try to single-thread a process which thread entered
procctl(REAP_KILL_SUBTREE), and sleeping waiting for us unlocking
stop_all_proc_blocker, we must be able to finish single-threading. This
requires the sleep to be interruptible.
Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D35310
markj wrote:
tdsendsignal() may unsuspend a target thread. I think there is at least
one bug there: suppose thread T is suspended in
thread_single(SINGLE_ALLPROC) when trying to kill another process with
REAP_KILL. Suppose a different thread sends SIGKILL to T->td_proc. Then,
tdsendsignal() calls thread_unsuspend(T, T->td_proc). thread_unsuspend()
incorrectly decrements T->td_proc->p_suspcount to -1.
Later, when T->td_proc exits, it will wait forever in
thread_single(SINGLE_EXIT) since T->td_proc->p_suspcount never reaches 1.
Since the thread suspension is bounded by time needed to do
thread_single(), skipping the thread_unsuspend_one() call there should
not affect signal delivery if this thread is selected as target.
Reported by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D35310
Since both self single-threading and remote single-threading rely on
suspending the thread doing thread_single(), it cannot be mixed: thread
doing thread_suspend_switch() might be subject to thread_suspend_one()
and vice versa.
In collaboration with: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D35310
suspended for SINGLE_ALLPROC mode. There is no need to check for
boundary state. It is only required to see that the suspension comes
from the ALLPROC mode.
In collaboration with: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D35310
to avoid ALLPROC mode to try to race with any other single-threading
mode.
In collaboration with: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D35310
Places that will wait for curproc->p_singlethr to become zero (in the
next commit, the counter of number of external single-threading is
to be introduced), must wait for it interruptible, otherwise we
deadlock. On the other hand, a signal delivered during this window,
if directed to the waiting thread, would cause the wait loop to become
a busy loop.
Since we are exiting, it is safe to ignore the signals.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D35310
before the process itself does thread_single(SINGLE_EXIT). We cannot
single-thread such process in ALLPROC (external) mode, and properly
detect and report the failure to do so due to the process becoming
zombie is easier to prevent than handle.
In collaboration with: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D35310
This avoids the need to drop into the ddb to figure out vnode
usage per file system. It helps to see if they are or are not
being freed. Suggestion to report active vnode count was from
kib@
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D35436
Do this by reducing the size of the MBUF_PEXT_MAX_PGS, causing "struct mbuf" to
be bigger than M_SIZE, and also add a missing padding field to ensure 64-bit
alignment.
Reviewed by: gallatin@
Reported by: Elliott Mitchell
Differential revision: https://reviews.freebsd.org/D35339
MFC after: 1 week
Sponsored by: NVIDIA Networking
Basic TLS RX offloading uses the "csum_flags" field in the mbuf packet
header to figure out if an incoming mbuf has been fully offloaded or
not. This information follows the packet stream via the LRO engine, IP
stack and finally to the TCP stack. The TCP stack preserves the mbuf
packet header also when re-assembling packets after packet loss. When
the mbuf goes into the socket buffer the packet header is demoted and
the offload information is transferred to "m_flags" . Later on a
worker thread will analyze the mbuf flags and decide if the mbufs
making up a TLS record indicate a fully-, partially- or not decrypted
TLS record. Based on these three cases the worker thread will either
pass the packet on as-is or recrypt the decrypted bits, if any, or
decrypt the packet as usual.
During packet loss the kernel TLS code will call back into the network
driver using the send tag, informing about the TCP starting sequence
number of every TLS record that is not fully decrypted by the network
interface. The network interface then stores this information in a
compressed table and starts asking the hardware if it has found a
valid TLS header in the TCP data payload. If the hardware has found a
valid TLS header and the referred TLS header is at a valid TCP
sequence number according to the TCP sequence numbers provided by the
kernel TLS code, the network driver then informs the hardware that it
can resume decryption.
Care has been taken to not merge encrypted and decrypted mbuf chains,
in the LRO engine and when appending mbufs to the socket buffer.
The mbuf's leaf network interface pointer is used to figure out from
which network interface the offloading rule should be allocated. Also
this pointer is used to track route changes.
Currently mbuf send tags are used in both transmit and receive
direction, due to convenience, but may get a new name in the future to
better reflect their usage.
Reviewed by: jhb@ and gallatin@
Differential revision: https://reviews.freebsd.org/D32356
Sponsored by: NVIDIA Networking
So that the asserts and the actual code see the same values.
Differential revision: https://reviews.freebsd.org/D32356
MFC after: 1 week
Sponsored by: NVIDIA Networking
When packets are received they may traverse several network interfaces like
vlan(4) and lagg(9). When doing receive side offloads it is important to
know the first network interface entry point, because that is where all
offloading is taking place. This makes it possible to track receive
side route changes for multiport setups, for example when lagg(9) receives
traffic from more than one port. This avoids having to install multiple
offloading rules for the same stream.
This field works similar to the existing "rcvif" mbuf packet header field.
Submitted by: jhb@
Reviewed by: gallatin@ and gnn@
Differential revision: https://reviews.freebsd.org/D35339
Sponsored by: NVIDIA Networking
Sponsored by: Netflix
Make it a complex, but a single for(;;) statement. The previous cycle
with some loop logic in the beginning and some loop logic at the end
was confusing. Both me and markj@ were misleaded to a conclusion that
some checks are unnecessary, while they actually were necessary.
While here, handle an edge case found by Mark, when on 64-bit platform
an incorrect message from userland would underflow length counter, but
return without any error. Provide a test case for such message.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35375
This is slightly more optimized than checking panicstr directly. For
most of these instances performance doesn't matter, but let's make
KERNEL_PANICKED() the common idiom.
Reviewed by: mjg
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D35373
Split cpuset_getaffinity() into a two counterparts, where the
user_cpuset_getaffinity() is intended to operate on the cpuset_t from
user va, while kern_cpuset_getaffinity() expects the cpuset from kernel
va.
Accordingly, the code that clears the high bits is moved to the
user_cpuset_getaffinity(). Linux sched_getaffinity() syscall returns
the size of set copied to the user-space and then glibc wrapper clears
the high bits.
MFC after: 2 weeks
With M_EXTPG mbufs these two counters already do not represent the
reality. As we are moving towards protocol independent socket buffers,
which may not even use mbufs at all, the counters become less and less
relevant. The only userland seeing them was 'netstat -x'.
PR: 264181 (exp-run)
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35334
In this function we always work with mbufs that we previously
created ourselves in unp_internalize(). They must be valid.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35319
Now that we call sbcreatecontrol() with M_WAITOK, we are expected to
pass a valid size. Return same error code, we are returning for an
oversized control from sockargs().
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35317
Specification doesn't list an explicit error code for the control
size specified by msg_control being too large. But it does list
EMSGSIZE as error code for "message is too large to be sent all at
once (as the socket requires)". It also lists EINVAL as code for
the "The sum of the iov_len values overflows an ssize_t." Given
how generic and uninformative EINVAL is, the EMSGSIZE is more
appropriate.
https://pubs.opengroup.org/onlinepubs/9699919799/functions/sendmsg.html
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35316
Suppose a periodic kevent timer fires close to its deadline, so that
now - kc->next is small. Then delta ends up being 1, and the next timer
deadline is set to (delta + 1) * kc->to, where kc->to is the timer
period. This means that the timer fires at half of the requested rate,
and the value returned in kn_data is similarly inaccurate.
PR: 264131
Fixes: 7cb40543e9 ("filt_timerexpire: do not iterate over the interval")
Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35313
Rename the "copyin" and "copyout" fields of struct cpuset_copy_cb to
something less generic, since sanitizers define interceptors for
copyin() and copyout() using #define.
Reported by: syzbot+2db5d644097fc698fb6f@syzkaller.appspotmail.com
Fixes: 47a57144af ("cpuset: Byte swap cpuset for compat32 on big endian architectures")
Sponsored by: The FreeBSD Foundation
Summary:
BITSET uses long as its basic underlying type, which is dependent on the
compile type, meaning on 32-bit builds the basic type is 32 bits, but on
64-bit builds it's 64 bits. On little endian architectures this doesn't
matter, because the LSB is always at the low bit, so the words get
effectively concatenated moving between 32-bit and 64-bit, but on
big-endian architectures it throws a wrench in, as setting bit 0 in
32-bit mode is equivalent to setting bit 32 in 64-bit mode. To
demonstrate:
32-bit mode:
BIT_SET(foo, 0): 0x00000001
64-bit sees: 0x0000000100000000
cpuset is the only system interface that uses bitsets, so solve this
by swapping the integer sub-components at the copyin/copyout points.
Reviewed by: kib
MFC after: 3 days
Sponsored by: Juniper Networks, Inc.
Differential Revision: https://reviews.freebsd.org/D35225
When locking a priority inherit mutex we perform a compare and swap
operation to try and acquire the mutex. This may fail even when the
compare succeeds.
Check and handle this case.
PR: 263825
Reviewed by: kib, markj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35150
Kthread worker is a single thread workqueue which can be used in cases
where specific kthread association is necessary, for example, when it
should have RT priority or be assigned to certain cgroup.
This change implements Linux v4.9 interface which mostly hides kthread
internals from users thus allowing to use ordinary taskqueue(9) KPI.
As kthread worker prohibits enqueueing of already pending or canceling
tasks some minimal changes to taskqueue(9) were done.
taskqueue_enqueue_flags() was added to taskqueue KPI which accepts extra
flags parameter. It contains one or more of the following flags:
TASKQUEUE_FAIL_IF_PENDING - taskqueue_enqueue_flags() fails if the task
is already scheduled to execution. EEXIST is returned and the
ta_pending counter value remains unchanged.
TASKQUEUE_FAIL_IF_CANCELING - taskqueue_enqueue_flags() fails if the
task is in the canceling state and ECANCELED is returned.
Required by: drm-kmod 5.10
MFC after: 1 week
Reviewed by: hselasky, Pau Amma (docs)
Differential Revision: https://reviews.freebsd.org/D35051
Without this patch, the MSG_TLSAPPDATA flag would cause
soreceive_generic() to return ENXIO for any non-application
data record in a TLS receive stream.
This works ok for TLS1.2, since Alert records appear to be
the only non-application data records received.
However, for TLS1.3, there can be post-handshake handshake
records, such as NewSessionKey sent to the client from the
server. These handshake records cannot be handled by the
upcall which does an SSL_read() with length == 0.
It appears that the client can simply throw away these
NewSessionKey records, but to do so, it needs to receive
them within the kernel.
This patch modifies the semantics of MSG_TLSAPPDATA slightly,
so that it only applies to Alert records and not Handshake
records. It is needed to allow the krpc to work with KTLS1.3.
Reviewed by: hselasky
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D35170
Per jamie@ rpr can be NULL if the jail is created with sysvsem=disable.
But at least it doesn't appear to be fatal, since rpr is never dereferenced
but is only compared to other prison pointers.
Reviewed by: jamie
Differential revision: https://reviews.freebsd.org/D35198
MFC after: 2 weeks
It is unused, especially now that the underlying d_dumper methods do not
accept the argument.
Reviewed by: markj
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D35174
The physical address argument is essentially ignored by every dumper
method. In addition, the dump routines don't actually pass a real
address; every call to dump_append() passes a value of zero for
physical.
Reviewed by: markj
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D35173
It appears to be unused. These days struct disk has a d_dump member,
which is what gets passed to the kernel dump framework.
Reviewed by: markj
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D35172
This change is also a preparation for further optimization to
allow locked return on success.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35182
Assert that sockets are of the same type. unp_connectat() already did
this check. Add the check to uipc_connect2().
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35181
Since c67f3b8b78 the sockbuf mutexes belong to the containing socket,
and socket buffers just point to it. In 74a68313b5 macros that access
this mutex directly were added. Go over the core socket code and
eliminate code that reaches the mutex by dereferencing the sockbuf
compatibility pointer.
This change requires a KPI change, as some functions were given the
sockbuf pointer only without any hint if it is a receive or send buffer.
This change doesn't cover the whole kernel, many protocols still use
compatibility pointers internally. However, it allows operation of a
protocol that doesn't use them.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35152
First paragraph refers to old past "we used to" and is no longer
important today. Second paragraph has just a wrong statement that
socket buffer is destroyed before pru_detach.
Linux has more tolerant checks of the user supplied cpuset_t's.
Minimum cpuset_t size that the Linux kernel permits in case of
getaffinity() is the maximum CPU id, present in the system / NBBY,
the maximum size is not limited.
For setaffinity(), Linux does not limit the size of the user-provided
cpuset_t, internally using only the meaningful part of the set, where
the upper bound is the maximum CPU id, present in the system, no larger
than the size of the kernel cpuset_t.
Unlike FreeBSD, Linux ignores high bits if set in the setaffinity(),
so clear it in the sched_setaffinity() and Linuxulator itself.
Reviewed by: Pau Amma (man pages)
In collaboration with: jhb
Differential revision: https://reviews.freebsd.org/D34849
MFC after: 2 weeks
The hack can be tracked down to 4.4BSD, where copy was performed
under splimp() and then after splx() dom_dispose was called.
Stevens has a chapter on this function, but he doesn't answer why
this trick is necessary. Why can't we call into dom_dispose under
splimp()? Anyway, with multithreaded kernel the hack seems to be
necessary to avoid LORs between socket buffer lock and different
filesystem locks, especially network file systems.
The new socket buffers KPI sbcut() from 1d2df300e9 allow us to get
rid of the hack.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35125
sorflush() already did the right thing, so only sofree() needed
a fix. Turn check into assertion in our only dom_dispose method.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35124
Today, kern.msgbuf_show_timestamp=1 will give 1 second granularity
timestamps on dmesg lines. When kern.msgbuf_show_timestamp=2, we'll
produce microsecond level graunlarity.
For example:
old (== 1):
[13] Dual Console: Video Primary, Serial Secondary
[14] lo0: link state changed to UP
[15] bxe0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
[15] bxe0: link state changed to UP
new (== 2):
[13.807015] Dual Console: Video Primary, Serial Secondary
[14.544150] lo0: link state changed to UP
[15.272044] bxe0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
[15.272052] bxe0: link state changed to UP
Sponsored by: Netflix
The old fixed-point arithmetic used for calculating load averages had an
overflow at 1024. So on systems with extremely high load, the observed
load average would actually fall back to 0 and shoot up again, creating
a kind of sawtooth graph.
Fix this by using 64-bit math internally, while still reporting the load
average to userspace as a 32-bit number.
Sponsored by: Axcient
Reviewed by: imp
Differential Revision: https://reviews.freebsd.org/D35134
For future use in the Linux emulation layer for the semtimedop syscall
split the sys_semop syscall into two counterparts and add
struct timespec *timeout argument to the last one.
Reviewed by: jhb, kib
Differential revision: https://reviews.freebsd.org/D35121
MFC after: 2 weeks
When we remove an interface it is first removed from the interface list
V_ifnet (by if_unlink_ifnet()) and marked as IFF_DYING. We then wait for
any possible references to stop being used (i.e.
epoch_wait/epoch_drain_callbacks) before we tear it fully down.
However, the index in ifindex_table is not removed, so m_rcvif_restore()
can still find the (now dying) interface.
This results in panics, for example when dummynet restores the rcvif
pointer and passes a packet to ip6_input() we can panic because the
AF_INET6 domain has already been removed (so we end up dereferencing a
NULL pointer there).
Check that the interface is not dying before we restore it, which is
equivalent to checking its presence in V_ifnet, and thus ensures that
future accesses (while in NET_EPOCH) are safe.
Reviewed by: glebius
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D34076
(cherry picked from commit 703e533da5)
Supplement ifindex table with generation count and use it to
serialize & restore an ifnet pointer.
Reviewed by: kp
Differential revision: https://reviews.freebsd.org/D33266
Fun note: git show e6abef0918
(cherry picked from commit e1882428dc)
This reverts commit 703e533da5.
Revert "ifnet/mbuf: provide KPI to serialize/restore m->m_pkthdr.rcvif"
This reverts commit e1882428dc.
Obtained from: github.com/glebius/FreeBSD/commits/backout-ifindex
where we might not yet see a new child when signalling a process.
Ensure that this cannot happen by stopping all reapping subtree,
which ensures that the child is not inside a syscall, in particular
fork(2).
Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D35014
by repeating iteration over the subtree until there are no new processes
to signal.
Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D35014
stop_allo_proc_block() must be taken before proctree_lock.
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D35014
It allows to have more than one consumer of thread_signle(SIGNLE_ALLPROC) by
serializing them.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D35014
p_reapsibling linkage is protected by proctree_lock, and it is modified
there.
Suggested and reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D35014
Our kern_sigtimedwait() calculates absolute sleep timo value as 'uptime+timeout'.
So, when the user specifies a big timeout value (LONG_MAX), the calculated
timo can be less the the current uptime value.
In that case kern_sigtimedwait() returns EAGAIN instead of EINTR, if
unblocked signal was caught.
While here switch to a high-precision sleep method.
Reviewed by: mav, kib
In collaboration with: mav
Differential revision: https://reviews.freebsd.org/D34981
MFC after: 2 weeks
Instead, create a switch structure private to ktls_ocf.c and store a
pointer to the switch in the ocf_session. This will permit adding an
additional function pointer needed for NIC TLS RX without further
bloating ktls_session.
Reviewed by: hselasky
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D35011
Deduplicate code to iterate over the bpages list in a bus_dmamap_t
freeing bounce pages during bus_dmamap_unload.
Reviewed by: imp
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D34967
When pages are freed to a bounce zone, only maps waiting for pages for
that zone can make forward progress. If a map for a different bounce
zone is at the head of the global list, then requests that could
otherwise make forward progress will be stalled waiting on the other
bounce zone. If bounce zones shared bounce pages then a global list
would still make sense to prevent "later" requests from starving an
earlier request but that is not a concern with per-zone bounce page
pools.
Reviewed by: imp
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D34966
Rather than using a software interrupt with a single handler, just
create a dedicated kernel process woken up with a simple wakeup().
Reviewed by: imp
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D34965
Add a new PI_SOFTCLOCK for use by softclock threads. Currently this
maps to PI_AV which is the second-highest ithread priority.
Reviewed by: mav, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D33693
15b1eb142c changed the callout code to store the CALLOUT_SHAREDLOCK flag
in c_iflags (where it used to be c_flags), but failed to update the
check in softclock_call_cc(). This resulted in the callout code always
taking the write lock, even if a read lock had been requested (with
the CALLOUT_SHAREDLOCK flag in callout_init_rm()).
Reviewed by: markj
MFC after: 1 week
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D34959
This permits a driver module structure that doesn't want to store a
pointer to the new driver's devclass.
Reviewed by: imp
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D34962
Rather than using taskqueue_swi_giant which holds Giant for all
deferred destroy_dev calls, create a separate queue for destroyed
devices with D_NEEDGIANT set in the corresponding cdevsw. The task
for this queue holds Giant whild destroying deferred devices while the
task for the default queue does not hold Giant.
In addition, switch to taskqueue_thread for destroy_dev_sched.
Deferred destroy_dev requests don't need to run at an SWI priority.
Reviewed by: imp, markj
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D34915
Now, since O_PATH-opened file descriptors use use references instead
of the hold references, vrefact() chahges from that revision can be
reverted.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D34906
posix_fadvise operates only on a provided fd. Noted by
Mathieu <sigsys@gmail.com> in review D34761.
No new CAP_ rights are added for posix_fadvise(), as 'advice' in
general only influences when I/O happens; the fd must have existing
CAP_ rights for actual data access.
Reviewed by: markj
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34903
Problem is that open(O_PATH) on nullfs -o nocache is broken then,
because there is no reference on the vnode after the open syscall exits.
Reported and tested by: ambrisko
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
We may legitimately have tib == NULL if we're at the very end of the
queue.
PR: 215373
Reported by: pho
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
To be more compatible to IEEE Std 1003.1-2008 (“POSIX.1”).
Reviewed by: mjg, Pau Amma (doc)
Differential revision: https://reviews.freebsd.org/D34680
MFC after: 2 weeks
reporting the shapshot of the active advisory locks.
A new VFS ops method vfs_report_lockf if provided in the mount point
op table. If it is NULL, as it is currently for all existing
filesystems, vfs_report_lockf() function is used, which gathers
information from the standard implementation inside kern/kern_lockf.c.
Filesystems implementing its own locking (NFSv4 as example) can provide
a custom implementation.
Reviewed by: markj, rmacklem
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D34756
The UFS-specific struct inode cannot be used in generic advisory lock
code. It was probably used as a shortcut for the debugging, as the
remnants of the code around it indicates.
Use somewhat more verbose and less concentrated, but universal,
VOP_PRINT(), where needed.
Reviewed by: markj, rmacklem
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D34756
In some configurations the firmware may pass memory regions that are
not page sized or aligned, e.g. when using 16k pages on arm64. If this
is the case we will calculate many small regions because the alignment
is applied before being inserted. As we round the start up and end down
this will leave a 1 page hole between what should have been a single
region.
Fix by keeping the original alignment until we are just about to insert
the region into the avail array.
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34694
These give us some confidience we haven't broken anything in early
boot code that may be running before the console.
Reviewed by: emaste
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34691
Add three hooks to the livedump process: before, after, and for each
block of dumped data. This allows, for example, quiescing the system
before the dump begins or protecting data of interest to ensure its
consistency in the final output.
Reviewed by: markj, kib (previous version)
Reviewed by: debdrup (manpages)
Reviewed by: Pau Amma <pauamma@gundo.com> (manpages)
MFC after: 3 weeks
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D34067
This dumper can instantiate and write the dump's contents to a
file-backed vnode.
Unlike existing disk or network dumpers, the vnode dumper should not be
invoked during a system panic, and therefore is not added to the global
dumper_configs list. Instead, the vnode dumper is constructed ad-hoc
when a live dump is requested using the new ioctl on /dev/mem. This is
similar in spirit to a kgdb session against the live system via
/dev/mem.
As described briefly in the mem(4) man page, live dumps are not
guaranteed to result in a usuable output file, but offer some debugging
value where forcefully panicing a system to dump its memory is not
desirable/feasible.
A future change to savecore(8) will add an option to save a live dump.
Reviewed by: markj, Pau Amma <pauamma@gundo.com> (manpages)
Discussed with: kib
MFC after: 3 weeks
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D33813
Add a new function, dumper_create(), to allocate a dumper.
dumper_insert() will call this function and retains the existing
behaviour.
This is desirable for performing live dumps of the system. Here, there
is a need to allocate and configure a dumper structure that is invoked
outside of the typical debugger context. Therefore, it should be
excluded from the list of panic-time dumpers.
free_single_dumper() is made public and renamed to dumper_destroy().
Reviewed by: kib, markj
MFC after: 1 week
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D34068
In some cases vn_open_cred overwrites cn_flags, effectively nullifying
initialisation done in NDINIT. This will have to be fixed.
In the meantime make sure the flag is passed.
Reported by: jenkins
Noted by: Mathieu <sigsys@gmail.com>
This more clearly differentiates system call arguments from integer
registers and return values. On current architectures it has no effect,
but on architectures where pointers are not integers (CHERI) and may
not even share registers (CHERI-MIPS) it is necessiary to differentiate
between system call arguments (syscallarg_t) and integer register values
(register_t).
Obtained from: CheriBSD
Reviewed by: imp, kib
Differential Revision: https://reviews.freebsd.org/D33780
Make sure both sides of a comparison are unsigned. As the values being
compared are size_t make the the value in the for loop size_t too.
Sponsored by: The FreeBSD Foundation
This completes the patch which was originally meant to go in.
Spotted by: mhorne
Fixes: c35ec1efdc ("vfs: [1/2] fix stalls in vnode reclaim by not
requeieing from vnlru")
It expects exactly one of those flags. A future commit will assert this.
Reviewed by: rstone
MFC after: 1 month
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D34451
Turns out execve looks at it to store binary name, but in order to
trigger the problem one has to be trying to exec '/'. As is the value
would be left uninitialized (or rather set to -1 on debug kernels).
Fixes: 56244d3574 ("vfs: hoist degenerate path lookups out of the
loop")
We use the p_itcallout callout, interlocked by the proc lock, to
schedule timeouts for the setitimer(2) system call. When a process
exits, the callout must be stopped before the process struct is
recycled.
Currently we attempt to stop the callout in exit1() with the call
_callout_stop_safe(&p->p_itcallout, CS_EXECUTING). If this call returns
0, then we sleep in order to drain the callout. However, this happens
only if the callout is not scheduled at all. If the callout thread is
blocked on the proc lock, then exit1() will not block and the callout
may execute after the process has fully exited, typically resulting in a
panic.
I cannot see a reason to use the CS_EXECUTING flag here. Instead, use
the regular callout_stop()/callout_drain() dance to halt the callout.
Reported by: ler
Tested by: ler, pho
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34625
umtxq_requeue() moves the queue to a different hash chain and different
lock, so we can't rely on msleep_sbt() reacquiring the same old lock.
We have to use PDROP and update the queue chain and so lock pointer.
PR: 262587
MFC after: 2 weeks
cpu ticks) has some imprecision and, worse, huge timestep (about
20 minutes on 4GHz CPU) near 53.4 days of elapsed time.
kern_time.c/cputick2timespec() (it is used for clock_gettime() for
querying process or thread consumed cpu time) Uses cputick2usec()
and then needlessly converting usec to nsec, obviously losing
precision even with fixed cputick2usec().
kern_time.c/kern_clock_getres() uses some weird (anyway wrong)
formula for getting cputick resolution.
PR: 262215
Reviewed by: gnn
Differential Revision: https://reviews.freebsd.org/D34558
This is used to calculate sizes that are then stored in unsigned long
fields. Make this unsigned long so the calculations use this type and
not an int that can lead to an integer overflow with a large PAGE_SIZE.
This allows building this on arm64 with PAGE_SIZE of 16k. Further work
will be needed if a 32-bit architecture tries to use a similar sized
page.
Sponsored by: The FreeBSD Foundation
The timestamp logs are quite large (often much larger than all the
other sysctls combined) so it's unlikely anyone will want to have
them displayed by `sysctl -a`.
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D34616
Can be used when the fs at hand can synchronize insmntque with other
means than the vnode lock.
Reviewed by: markj
Tested by: pho (previous version)
Differential Revision: https://reviews.freebsd.org/D34466
Use a spinlock section instead of a critical section to synchronize with
statclock(). Otherwise the CLOCK_THREAD_CPUTIME_ID clock can appear to
go backwards.
PR: 262273
Reviewed by: markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D34568