The break-before-make requirement poses a problem when promoting or
demoting mappings containing thread structures: a CPU may raise a
translation fault while accessing curthread, and data_abort() accesses
the thread again before pmap_fault() can translate the address and
return.
Normally this isn't a problem because we have a hack to ensure that
slabs used by the thread zone are always accessed via the direct map,
where promotions and demotions are rare. However, this hack doesn't
work properly with UMA_MD_SMALL_ALLOC disabled, as is the case with
KASAN configured (since our KASAN implementation does not shadow the
direct map and so tries to force the use of the kernel map wherever
possible).
Fix the problem by modifying data_abort() to handle translation faults
in the kernel map without dereferencing "td", i.e., curthread, and
without enabling interrupts. pmap_klookup() has special handling for
translation faults which makes it safe to call in this context. Then,
revert the aforementioned hack.
Reviewed by: kevans, alc, kib, andrew
MFC after: 1 month
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D37231
First, an sbuf_new() in device_get_path() shadows the sb
passed in by dev_wired_cache_add(), leaving its sb in an
unfinished state, leading to a failed KASSERT(). Fixing this
is as simple as removing the sbuf_new() from device_get_path()
Second, we cannot simply take a pointer to the sbuf memory and
store it in the device location cache, because that sbuf
is freed immediately after we add data to the cache, leading
to a use-after-free and eventually a double-free. Fixing this
requires allocating memory for the path.
After a discussion with jhb, we decided that one malloc was
better than two in dev_wired_cache_add, which is why it changed
so much.
Reviewed by: jhb
Sponsored by: Netflix
MFC after: 14 days
This commit brings back the driver from FreeBSD commit
f187d6dfbf plus subsequent fixes from
upstream.
Relative to upstream this commit includes a few other small fixes such
as additional INET and INET6 #ifdef's, #include cleanups, and updates
for recent API changes in main.
Reviewed by: pauamma, gbe, kevans, emaste
Obtained from: git@git.zx2c4.com:wireguard-freebsd @ 3cc22b2
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D36909
Refactor the symlink and mountpoint traversal logic to avoid
repeatedly checking the vnode type; a symlink cannot be a mountpoint
and vice versa. Avoid repeatedly checking cn_flags for NOCROSSMOUNT
and simplify the check which determines whether the vnode is a
mountpoint.
Suggested by: mjg
Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D35054
These are of limited use since the crossmp vnode locking ops have not
actually used a lock since commit
a2d3554542. We in fact require that
these operations are always issued with LK_SHARED. Additionally,
these directives can produce a false positive in certain VV_CROSSLOCK
cases which require upgrading of the covered vnode lock from shared
to exclusive.
While here, replace the runtime check of LK_SHARED with a KASSERT and
expand the check to include LK_NOWAIT, which all callers pass.
Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D35054
When a lookup operation crosses into a new mountpoint, the mountpoint
must first be busied before the root vnode can be locked. When a
filesystem is unmounted, the vnode covered by the mountpoint must
first be locked, and then the busy count for the mountpoint drained.
Ordinarily, these two operations work fine if executed concurrently,
but with a stacked filesystem the root vnode may in fact use the
same lock as the covered vnode. By design, this will always be
the case for unionfs (with either the upper or lower root vnode
depending on mount options), and can also be the case for nullfs
if the target and mount point are the same (which admittedly is
very unlikely in practice).
In this case, we have LOR. The lookup path holds the mountpoint
busy while waiting on what is effectively the covered vnode lock,
while a concurrent unmount holds the covered vnode lock and waits
for the mountpoint's busy count to drain.
Attempt to resolve this LOR by allowing the stacked filesystem
to specify a new flag, VV_CROSSLOCK, on a covered vnode as necessary.
Upon observing this flag, the vfs_lookup() will leave the covered
vnode lock held while crossing into the mountpoint. Employ this flag
for unionfs with the caveat that it can't be used for '-o below' mounts
until other unionfs locking issues are resolved.
Reported by: pho
Tested by: pho
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D35054
In order to safely reuse excluded memory when it's reserved for special
purpose, we need to test whether or not the memory has been reserved
early in boot. physmem_excluded will return true when the entire range
is excluded, false otherwise.
Sponsored by: Netflix
List of changes:
- Use integer multiplication instead of long multiplication, because the result is an integer.
- Remove multiple if-statements and predict new if-statements.
- Rename local variable name, "ticks" into "retval" to avoid shadowing
the system "ticks" global variable.
Reviewed by: kib@ and imp@
MFC after: 1 week
Sponsored by: NVIDIA Networking
Differential Revision: https://reviews.freebsd.org/D36859
This allows to fix a bug where sbuf allocation done in the context of
dev_wired_cache_match() must use non-sleepable allocations.
Suggested by: jhb
Reviewed by: jhb, takawata
Discussed with: imp
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D36899
Later it would silently converted to ENOMEM always, because any error
was reported as NULL return path.
Reviewed by: jhb, takawata
Discussed with: imp
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D36899
Some virtual machines pass virtio MMIO device parameters via the kernel
command line as a series of virtio_mmio.device=<parameters> options.
These get translated into FreeBSD kernel environment variables; but
unfortunately they all use the same variable name, which resulted in
all but the first such parameter being ignored when the dynamic kernel
environment is set up from the initial environment buffers.
With this commit, duplicate environment settings will instead be stored
as ${name}_1, ${name}_2... ${name}_9999. In the unlikely event that
the same variable is set over 10000 times before the dynamic kernel
environment is set up, we panic.
Variable settings after the dynamic environment is initialized continue
to override the previously-set value; the change is limited to the very
early kernel boot (prior to SI_SUB_KMEM + 1) and changes behaviour from
"ignore" to "store with a different name" only.
Reviewed by: imp
Feedback from: kevans
Sponsored by: https://patreon.com/cperciva
Differential Revision: https://reviews.freebsd.org/D36187
By convention, EINVAL is returned when validating arguments, not EPERM.
This matches the documented behaviour of sched_setscheduler(3), and that
of SCHED_OTHER.
PR: 227735
MFC after: 1 week
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D37021
It likely won't happen, but is consistent with the other functions of
this KPI.
Reviewed by: imp, jhb
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D33479
It can be static within uart_tty.c. It is an open question whether there
remains any real benefit to having uart instances share a swi thread.
Reviewed by: imp, markj, jhb
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D36938
This reverts commit 76e6e4d72f.
Several programs in the tree use -1 instead of INT_MAX to use
the maximum value. Thanks to Eugene Grosbein for pointing this
out.
Ensure that a negative backlog argument is handled as it if was 0.
Reviewed by: markj@, glebius@
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D31821
This will resolve a reference and return the appropriate handle, a node
on the simplebus or an ACPI_HANDLE for ACPI. For now we do not try to
further abstract the return type.
MFC after: 2 weeks
Reviewed by: mw
Differential Revision: https://reviews.freebsd.org/D36793
Mechanically cleanup INP_TIMEWAIT from the kernel sources. After
0d7445193a, this commit shall not cause any functional changes.
Note: this flag was very often checked together with INP_DROPPED.
If we modify in_pcblookup*() not to return INP_DROPPED pcbs, we
will be able to remove most of this checks and turn them to
assertions. Some of them can be turned into assertions right now,
but that should be carefully done on a case by case basis.
Differential revision: https://reviews.freebsd.org/D36400
In non-periodic mode absolute timers fire at exactly the time given.
When specifying a fast clock, align the firing time so that less
timer interrupt events are needed.
Reviewed by: rrs @
Differential Revision: https://reviews.freebsd.org/D36858
MFC after: 1 week
Sponsored by: NVIDIA Networking
Add local functions to workaround an instruction segment trap (panic)
when the indirect functions copyin and copyout are called by an external
loadable kernel module (i.e. pfsync, zfs and linuxulator). The crash
was triggered by change 47a57144af, but
kernel binary linked with LLD 9 works fine. LLVM bisect points that LLD
behavior chaged after dc06b0bc9ad055d06535462d91bfc2a744b2f589.
This is know to affect powerpc targets only and the final fix is still
being discussed with the LLVM community.
PR: 266730
Reviewed by: luporl, jhibbits (on IRC, previous version)
MFC after: 2 days
Sponsored by: Instituto de Pesquisas Eldorado (eldorado.org.br)
Differential Revision: https://reviews.freebsd.org/D36234
Modify db_show_sysctl_all so that it does not copy more than once the
data of the input oid, and so that what it passes to db_show_oid does
not alarm coverity.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D36847
Protocols such as netlink may need a large socket receive buffer,
measured in tens of megabytes. This change allows netlink to
set larger socket buffers (given the privs are in place), without
requiring user to manuall bump maxsockbuf.
Reviewed by: glebius
Differential Revision: https://reviews.freebsd.org/D36747
Allow to set custom per-protocol handlers for the socket buffers
ioctls by introducing pr_setsbopt callback with the default value
set to the currently-used sbsetopt().
Reviewed by: glebius
Differential Revision: https://reviews.freebsd.org/D36746
The implementation of sysctl_search_oid no longer relies on the
initial value of nodes to be all NULL, so remove the comment that
demands it and let the caller stop enforcing it.
Reviewed by: hselasky
Differential Revision: https://reviews.freebsd.org/D36768
sysctl_search_old makes several tests in a loop that can be removed.
The first test in the loop is only ever true on the first loop
iteration, and is always true on that iteration, so its work can be
done before the loop begins.
The upper and lower bounds on the loop variable 'indx' are each tested
on each iteration, but 'indx' is changed in one direction or the other
only once within the loop, so only one bound needs to be checked.
Two ways remain in the loop that nodes[indx] can change (after one of
them is put before the loop start), and one of them applies exactly
when indx has been incremented, so no separate test for that case
requires testing.
Restructure and add comments that makes clearer that this is a basic
depth-first search.
Reviewed by: hselasky
Differential Revision: https://reviews.freebsd.org/D36741
sysctl_register_oid must check the uniqueness of any newly computed
oid_number in sysctl_register_oid.
Reviewed by: asomers
MFC with: d3f96f6610
Differential Revision: https://reviews.freebsd.org/D36743
To avoid using the sysctl list macros directly in external kernel modules.
Reviewed by: asomers, manu and asiciliano
Differential Revision: https://reviews.freebsd.org/D36748
MFC after: 1 week
Sponsored by: NVIDIA Networking
Specifically, when we receive a violation and we're configured to panic,
kasan_enabled gets unset before we descend into panic(). At this point,
there's no longer any reason to allow marking as kasan_shadow_check() is
disabled -- we have some inherent risk of faulting or panicking if the
system's in a bad enough state with no benefit.
Reviewed by: markj
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D36742
Sysctl OIDs were internally stored in linked lists, triggering O(n^2)
behavior when userland iterates over many of them. The slowdown is
noticeable for MIBs that have > 100 children (for example, vm.uma). But
it's unignorable for kstat.zfs when a pool has > 1000 datasets.
Convert the linked lists into RB trees. This produces a ~25x speedup
for listing kstat.zfs with 4100 datasets, and no measurable penalty for
small dataset counts.
Bump __FreeBSD_version for the KPI change.
Sponsored by: Axcient
Reviewed by: mjg
Differential Revision: https://reviews.freebsd.org/D36500
The vn_rlimit_fsizex() function:
- checks that the write does not exceed RLIMIT_FSIZE limit and fs
maximum supported file size
- truncates write length if it exceeds the RLIMIT_FSIZE or max file size,
but there are some bytes to write
- sends SIGXFSZ if RLIMIT_FSIZE would be exceed otherwise
POSIX mandates the truncated write in case when some bytes can be
written but whole write request fails the RLIMIT_FSIZE check.
The function is supposed to be used from VOP_WRITE()s. Due to
pecularity in the VFS generic write syscall layer, uio_resid must
correctly reflect the written amount (noted by markj). Provide the dual
vn_rlimit_fsizex_res() function to correct uio_resid after the clamp
done in vn_rlimit_fsizex() on VOP_WRITE() return.
PR: 164793
Reviewed by: asomers, jah, markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D36625
When a thread switching off-CPU is migrating to a remote CPU,
sched_switch() may trigger a rescheduling of the thread currently
running on that CPU. When doing so, it must ensure that that thread is
locked before modifying thread state. If the thread's lock is not the
scheduler lock, then the thread is in the process of switching off-CPU
and no extra effort is needed, and the initiator does not hold the
thread's lock and thus should not modify any thread state.
Reported and tested by: Steve Kargl
MFC after: 1 week
This matches the return type of pmap_mapdev/bios.
Reviewed by: kib, markj
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D36548
This removes some of the complexity needed to maintain HASBUF and
allows for removing injecting SAVENAME by filesystems.
Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D36542
This is not BSD-4-Clause. It's closer to a modified BSD-2-Clause with 2
added clauses (and the first one has added clauses). Remove
SPDX-License-Idnetifier since this license doesn't match anything in
SPDX.
This was never enabled and only pollutes the code. The issue will
be addressed later in a different manner.
Sponsored by: Rubicon Communications, LLC ("Netgate")
This license has 4 clauses, and shares some text with the BSD-4-Clause
license. However, it omits the standard disclaimer and has 2 clauses all
its own. Remove this tag, since it was made in error and this doesn't
match the SPDX copy of the BSD-4-Clause license.
Sponsored by: Netflix
The interlock was already taken and released when dooming, thus by
API contract locking state cannot be legally installed.
At the same time the state is almost never there to begin with.
This call existed since pre-FreeBSD times, and it is hard to understand
why it was there in the first place. After 6f3caa6d81 it definitely
became necessary always and commit message from f1ee30ccd6 confirms that.
Now that 6f3caa6d81 is effectively backed out by 07285bb4c2, the call
appears to be useful only for sockets that landed on the incomplete queue,
e.g. sockets that have accept_filter(9) enabled on them.
Provide a new TCP flag to mark connections that are known to be on the
incomplete queue, and call soisconnected() only for those connections.
Reviewed by: rrs, tuexen
Differential revision: https://reviews.freebsd.org/D36488
Change RB_INSERT_COLOR and RB_REMOVE_COLOR so that the blocks of code
that are identical except for left and right being exchanged are made
only one block with a variable to indicate left- or right-handedness.
Rename RB macros so that those not intended for external use begin
with an underscore.
Add comments to the balancing code so that another might understand it.
Reviewed by: alc, kib
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D36393
The send tag pointer may be NULL when the ktls_reset_receive_tag()
function is invoked. Add check for this.
Reviewed by: gallatin @
Sponsored by: NVIDIA Networking
Most notably poudriere performs kill -9 -1 in jails for each port
being built. This reduces the scan from hundrends of processes to
literally 1.
Reviewed by: jamie, markj
Differential Revision: https://reviews.freebsd.org/D34522
It allows iteration over processes belonging to given jail instead of
having to walk the entire allproc list.
Note the iteration can miss processes which remains bug-compatible
with previous code.
Reviewed by: jamie (previous version), markj (previous version)
Differential Revision: https://reviews.freebsd.org/D34522
Since 4.4BSD the protosw was used to implement socket types created
by socket(2) syscall and at the same to demultiplex incoming IPv4
datagrams (later copied to IPv6). This story ended with 78b1fc05b2.
These entries (e.g. IPPROTO_ICMP) in inetsw that were added to catch
packets in ip_input(), they would also be returned by pffindproto()
if user says socket(AF_INET, SOCK_RAW, IPPROTO_ICMP). Thus, for raw
sockets to work correctly, all the entries were pointing at raw_usrreq
differentiating only in the value of pr_protocol.
With 78b1fc05b2 all these entries are no longer needed, as ip_protox
is independent of protosw. Any socket syscall requesting SOCK_RAW type
would end up with rip_protosw. And this protosw has its pr_protocol
set to 0, allowing to mark socket with any protocol.
For IPv6 raw socket the change required two small fixes:
o Validate user provided protocol value
o Always use protocol number stored in inp in rip6_attach, instead
of protosw value, which is now always 0.
Differential revision: https://reviews.freebsd.org/D36380
The divert(4) is not a protocol of IPv4. It is a socket to
intercept packets from ipfw(4) to userland and re-inject them
back. It can divert and re-inject IPv4 and IPv6 packets today,
but potentially it is not limited to these two protocols. The
IPPROTO_DIVERT does not belong to known IP protocols, it
doesn't even fit into u_char. I guess, the implementation of
divert(4) was done the way it is done basically because it was
easier to do it this way, back when protocols for sockets were
intertwined with IP protocols and domains were statically
compiled in.
Moving divert(4) out of inetsw accomplished two important things:
1) IPDIVERT is getting much closer to be not dependent on INET.
This will be finalized in following changes.
2) Now divert socket no longer aliases with raw IPv4 socket.
Domain/proto selection code won't need a hack for SOCK_RAW and
multiple entries in inetsw implementing different flavors of
raw socket can merge into one without requirement of raw IPv4
being the last member of dom_protosw.
Differential revision: https://reviews.freebsd.org/D36379
domain_init() called at SI_SUB_PROTO_DOMAIN/SI_ORDER_SECOND is always
called right after domain_add(), that had been called at SI_ORDER_FIRST.
Note that protocols aren't initialized yet at this point, since they are
usually scheduled to initialize at SI_ORDER_THIRD.
After this merge it becomes clear that DOMF_SUPPORTED / DOMF_INITED
can be garbage collected as they are set & checked in the same function.
For initialization of the domain system itself it is now clear that
domaininit() can be garbage collected and static initializer is enough.
o Statically initialize max_linkhdr to default value without relying
on domain(9) code doing that.
o Statically initialize max_protohdr to a sane value, without relying
on TCP being always compiled in.
o Retire max_datalen. Set, but not used.
o Don't make the domain(9) system responsible in validating these
values and updating max_hdr. Instead provide KPI max_linkhdr_grow()
and max_protohdr_grow().
o Call max_linkhdr_grow() from IEEE802.11 and max_protohdr_grow() from
TCP. Those are the only protocols today that may want to grow.
Reviewed by: tuexen
Differential revision: https://reviews.freebsd.org/D36376
The io:::start and end probes trace individual I/O requests.
Also remove the unimplemented wait-start and wait-done probes.
PR: 266098
MFC after: 1 week
In RB_INSERT_COLOR and RB_REMOVE_COLOR, avoid reading a parent pointer
from memory, and then reading the left-color bit from memory, and then
reading the right-color bit from memory, since they're all in the same
field. The compiler can't infer that only the first read is really
necessary, so write the code in a way so that it doesn't have to.
Drop RB_RED_LEFT and RB_RED_RIGHT macros that reach into memory to get
those bits. Drop RB_COLOR, the only thing left using RB_RED_LEFT and
RB_RED_RIGHT after the other changes, and go straight to DIAGNOSTIC
code in subr_stats to implement RB_COLOR for its single, dubious use
there.
Reviewed by: alc
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D36353
In kernels without MAC, error is not set for sockets whose protocol
layer does not implement the pr_sense hook.
Reported by: Jenkins (powerpc kernel builds)
Fixes: 7c04ca1fad sockets: for stat(2) on a socket don't report hiwat as block size
The Vax supported such things, but FreeBSD does not. This further
implies that MJUMPAGESIZE > MCLBYTES so assert this and remove code
handling them being equal.
Reviewed by: kp, imp, jhb
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D36320
We cannot see a thread with the flag set in unsuspend, after we stopped
doing SINGLE_ALLPROC from user processes.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D36207
It does not serve any purpose after we stopped doing
thread_single(SINGLE_ALLPROC) from stoppable user processes.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D36207
There is a problem still left after the fixes to REAP_KILL_PROC. The
handling of the stopping signals by sig_suspend_threads() can occur
outside the stopping process context by tdsendsignal(), and it uses
mostly the same mechanism of aborting sleeps as suspension. In other
words, it badly interacts with thread_single(SINGLE_ALLPROC).
But unlike single threading from the process context, we cannot wait by
sleep for other single threading requests to pass, because we own
spinlock(s).
Fix this by moving both the thread_single(p2, SINGLE_ALLPROC), and the
signalling, to the threaded taskqueue which cannot be single-threaded
itself.
Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D36207
The function result is already used as bool.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D36207
We do not check single-threading conditions in trap, or when sleeping
uninterruptible.
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
Differential revision: https://reviews.freebsd.org/D36207
It is sbt now. Also, explain what flags are.
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
Differential revision: https://reviews.freebsd.org/D36207
The relock order is important not only for a signal delivery, but also
for the suspension requests.
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
Differential revision: https://reviews.freebsd.org/D36207
It is only ever xlocked in drain_dev_clone_events and the only consumer of
that routine does not need it -- eventhandler code already makes sure the
relevant callback is no longer running.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D36268
o Assert that every protosw has pr_attach. Now this structure is
only for socket protocols declarations and nothing else.
o Merge struct pr_usrreqs into struct protosw. This was suggested
in 1996 by wollman@ (see 7b187005d1), and later reiterated
in 2006 by rwatson@ (see 6fbb9cf860).
o Make struct domain hold a variable sized array of protosw pointers.
For most protocols these pointers are initialized statically.
Those domains that may have loadable protocols have spacers. IPv4
and IPv6 have 8 spacers each (andre@ dff3237ee5).
o For inetsw and inet6sw leave a comment noting that many protosw
entries very likely are dead code.
o Refactor pf_proto_[un]register() into protosw_[un]register().
o Isolate pr_*_notsupp() methods into uipc_domain.c
Reviewed by: melifaro
Differential revision: https://reviews.freebsd.org/D36232
The method was called for two different conditions: 1) the VM layer is
low on pages or 2) one of UMA zones of mbuf allocator exhausted.
This change 2) into a new event handler, but all affected network
subsystems modified to subscribe to both, so this change shall not
bring functional changes under different low memory situations.
There were three subsystems still using pr_drain: TCP, SCTP and frag6.
The latter had its protosw entry for the only reason to register its
pr_drain method.
Reviewed by: tuexen, melifaro
Differential revision: https://reviews.freebsd.org/D36164
They were useful many years ago, when the callwheel was not efficient,
and the kernel tried to have as little callout entries scheduled as
possible.
Reviewed by: tuexen, melifaro
Differential revision: https://reviews.freebsd.org/D36163
The protosw KPI historically has implemented two quite orthogonal
things: protocols that implement a certain kind of socket, and
protocols that are IPv4/IPv6 protocol. These two things do not
make one-to-one correspondence. The pr_input and pr_ctlinput methods
were utilized only in IP protocols. This strange duality required
IP protocols that doesn't have a socket to declare protosw, e.g.
carp(4). On the other hand developers of socket protocols thought
that they need to define pr_input/pr_ctlinput always, which lead to
strange dead code, e.g. div_input() or sdp_ctlinput().
With this change pr_input and pr_ctlinput as part of protosw disappear
and IPv4/IPv6 get their private single level protocol switch table
ip_protox[] and ip6_protox[] respectively, pointing at array of
ipproto_input_t functions. The pr_ctlinput that was used for
control input coming from the network (ICMP, ICMPv6) is now represented
by ip_ctlprotox[] and ip6_ctlprotox[].
ipproto_register() becomes the only official way to register in the
table. Those protocols that were always static and unlikely anybody
is interested in making them loadable, are now registered by ip_init(),
ip6_init(). An IP protocol that considers itself unloadable shall
register itself within its own private SYSINIT().
Reviewed by: tuexen, melifaro
Differential revision: https://reviews.freebsd.org/D36157
Stack must be at least readable and writable.
PR: 242570
Reviewed by: kib, markj
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35867
With clang 15, the following -Werror warning is produced:
sys/kern/kern_poll.c:374:16: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
netisr_pollmore()
^
void
This is because netisr_pollmore() is declared with a (void) argument
list, but defined with an empty argument list. Make the definition match
the declaration.
MFC after: 3 days
Add domain_remove() SYSUNINT callback that removes the domain
from the domain list if it has DOMF_UNLOADABLE flag set.
This change is required to support netlink ( D36002 ).
Reviewed by: glebius
Differential Revision: https://reviews.freebsd.org/D36173
For some reason protosw.h is used during world complation and userland
is not aware of caddr_t, a relic from the first version of C. Broken
buildworld is good reason to get rid of yet another caddr_t in kernel.
Fixes: 886fc1e804
This is expensive and useless call. It has been useless since Alexander
melifaro@ moved the forwarding table to nexthops with passive invalidation.
What happens now is that cached route in a inpcb would get invalidated
on next ip_output().
These were the last users of pfctlinput(), so garbage collect it.
Reviewed by: melifaro
Differential revision: https://reviews.freebsd.org/D36156
The only place to execute this method was raw_usend(). Only those
protocols that used raw socket were able to actually enter that method.
All pr_output assignments being deleted by this commit were a dead code
for many years.
Reviewed by: melifaro
Differential revision: https://reviews.freebsd.org/D36126
With clang 15, the following -Werror warning is produced:
sys/kern/subr_devmap.c:87:19: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
devmap_print_table()
^
void
This is because devmap_print_table() and devmap_lastaddr() are declared
with a (void) argument list, but defined with an empty argument list.
Make the definition match the declaration.
Sponsored by: The FreeBSD Foundation
Currently, subr_bus.c shares logic for (a) maintaining all HW devices
(e.g. discovery/attach/detach logic) and (b) generic devctl notification
layer for devices/PMU/GEOM/interfaces/etc).
These two subsystems share really tiny interaction interface, composed of 3
notification functions. With that in mind, move devctl layer to a
separate file, establishing a clear notification interface between the
sub.c bus layer and the provider (devctl).
The primary driver of this change is netlink implementation (D36002).
The idea is to propagate device-level events to netlink as well, so all
netlink customers can subscribe to these changes.
The long-term goal is to deprecate devctl and to use netlink as the
kernel<> userland transport provided netlink gets enough traction.
Reviewed by: imp, markj
Differential Revision: https://reviews.freebsd.org/D36091
MFC after: 1 month
This streamlines cloning of a socket from a listener. Now we do not
drop the inpcb lock during creation of a new socket, do not do useless
state transitions, and put a fully initialized socket+inpcb+tcpcb into
the listen queue.
Before this change, first we would allocate the socket and inpcb+tcpcb via
tcp_usr_attach() as TCPS_CLOSED, link them into global list of pcbs, unlock
pcb and put this onto incomplete queue (see 6f3caa6d81). Then, after
sonewconn() we would lock it again, transition into TCPS_SYN_RECEIVED,
insert into inpcb hash, finalize initialization of tcpcb. And then, in
call into tcp_do_segment() and upon transition to TCPS_ESTABLISHED call
soisconnected(). This call would lock the listening socket once again
with a LOR protection sequence and then we would relocate the socket onto
the complete queue and only now it is ready for accept(2).
Reviewed by: rrs, tuexen
Differential revision: https://reviews.freebsd.org/D36064
as alternative KPI to sonewconn(). The latter has three stages:
- check the listening socket queue limits
- allocate a new socket
- call into protocol attach method
- link the new socket into the listen queue of the listening socket
The attach method, originally designed for a creation of socket by the
socket(2) syscall has slightly different semantics than attach of a socket
cloned by listener. Make it possible for protocols to call into the
first stage, then perform a different attach, and then call into the
final stage. The first stage, that checks limits and clones a socket
is called solisten_clone(), and the function that enqueues the socket
is solisten_enqueue().
Reviewed by: tuexen
Differential revision: https://reviews.freebsd.org/D36063
Resulting sbuf_len() from proc_getargv() might return 0 if user mangled
ps_strings enough. Also, sbuf_len() API contract is to return -1 if the
buffer overflowed. The later should not occur because get_ps_strings()
checks for catenated length, but check for this subtle detail explicitly
as well to be more resilent.
The end result is that p_comm is used in this situations.
Approved by: so
Security: FreeBSD-SA-22:09.elf
Reported by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Reviewed by: delphij, markj
admbugs: 988
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35391
By calling the function too early we might still have the td_pflags
value cached from the previous struct thread use. cpu_copy_thread()
depends on correct value for TDP_KTHREAD at least on x86.
Reported, bisected, and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D36069
This ensures the filedesc-to-leader code is consistently encapsulated in
kern_descrip.c.
No functional change intended.
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35988
The TDA_AST flag is set on td2 unconditionally (as it was TDF_ASTPENDING
before AST rework), so it is not used practically for some time.
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D36033
When it was inline it made sense to depend on the existing nested check
in KTRUSERRET() rather than adding a new td_flags flag. However, since
we now have a TDA_KTRACE flag anyway, we might as well check it and
avoid the call.
Suggested by: jhb
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D35888
Explicitly pass the struct thread argument.
Move the function prototype from sys/systm.h to geom/geom.h, we do not
need almost each kernel source to see the prototype, it is now used
only by kern/vfs_mountroot.c outside geom/geom_event.c, where the
function is defined.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D35888
Make most AST handlers dynamically registered. This allows to have
subsystem-specific handler source located in the subsystem files,
instead of making subr_trap.c aware of it. For instance, signal
delivery code on return to userspace is now moved to kern_sig.c.
Also, it allows to have some handlers designated as the cleanup (kclear)
type, which are called both at AST and on thread/process exit. For
instance, ast(), exit1(), and NFS server no longer need to be aware
about UFS softdep processing.
The dynamic registration also allows third-party modules to register AST
handlers if needed. There is one caveat with loadable modules: the
code does not make any effort to ensure that the module is not unloaded
before all threads processed through AST handler in it. In fact, this
is already present behavior for hwpmc.ko and ufs.ko. I do not think it
is worth the efforts and the runtime overhead to try to fix it.
Reviewed by: markj
Tested by: emaste (arm64), pho
Discussed with: jhb
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D35888
Implement Linux-variant of MSG_TRUNC input flag used in recv(), recvfrom() and recvmsg().
Posix defines MSG_TRUNC as an output flag, indicating packet/datagram truncation.
Linux extended it a while (~15+ years) ago to act as input flag,
resulting in returning the full packet size regarless of the input
buffer size.
It's a (relatively) popular pattern to do recvmsg( MSG_PEEK | MSG_TRUNC) to get the
packet size, allocate the buffer and issue another call to fetch the packet.
In particular, it's popular in userland netlink code, which is the primary driving factor of this change.
This commit implements the MSG_TRUNC support for SOCK_DGRAM sockets (udp, unix and all soreceive_generic() users).
PR: kern/176322
Reviewed by: pauamma(doc)
Differential Revision: https://reviews.freebsd.org/D35909
MFC after: 1 month
The devmap variants used vm_offset_t for some reason, and a few places
explicitly cast bus addresses to vm_offset_t. (Probably those casts
along with similar casts for vm_size_t should just be removed and
instead permit the compiler to DTRT.)
Reviewed by: markj
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D35961
With clang 15, the following -Werror warning is produced:
sys/kern/vfs_bio.c:3430:11: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
buf_daemon()
^
void
This is because buf_daemon() is declared with a (void) argument list,
but defined with an empty argument list. Make the definition match the
declaration.
MFC after: 3 days
With clang 15, the following -Werror warnings are produced:
sys/kern/sysv_msg.c:213:8: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
msginit()
^
void
sys/kern/sysv_msg.c:316:10: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
msgunload()
^
void
This is because msginit() and msgunload() are declared with (void)
argument lists, but defined with empty argument lists. Make the
definitions match the declarations.
MFC after: 3 days
With clang 15, the following -Werror warning is produced:
sys/kern/subr_bus.c:871:16: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
bus_topo_assert()
^
void
This is because bus_topo_assert() is declared with a (void) argument
list, but defined with an empty argument list. Make the definition match
the declaration.
MFC after: 3 days
With clang 15, the following -Werror warning is produced:
sys/kern/subr_autoconf.c:119:34: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
run_interrupt_driven_config_hooks()
^
void
This is because run_interrupt_driven_config_hooks() is declared with a
(void) argument list, but defined with an empty argument list. Make the
definition match the declaration.
MFC after: 3 days
With clang 15, the following -Werror warnings are produced:
sys/kern/kern_resource.c:1212:10: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
lim_alloc()
^
void
sys/kern/kern_resource.c:1365:11: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
uihashinit()
^
void
This is because lim_alloc() and uihashinit() are declared with (void)
argument lists, but defined with empty argument lists. Make the
definitions match the declarations.
MFC after: 3 days
With clang 15, the following -Werror warnings are produced:
sys/kern/kern_dtrace.c:64:18: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
kdtrace_proc_size()
^
void
sys/kern/kern_dtrace.c:87:20: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
kdtrace_thread_size()
^
void
This is because kdtrace_proc_size() and kdtrace_thread_size() are
declared with (void) argument lists, but defined with empty argument
lists. Make the definitions match the declarations.
MFC after: 3 days
With clang 15, the following -Werror warnings are produced:
sys/kern/kern_cons.c:201:14: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
cninit_finish()
^
void
sys/kern/kern_cons.c:376:7: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
cngrab()
^
void
sys/kern/kern_cons.c:389:9: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
cnungrab()
^
void
sys/kern/kern_cons.c:402:9: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
cnresume()
^
void
This is because cninit_finish(), cngrab(), cnungrab(), and cnresume()
are declared with (void) argument lists, but defined with empty argument
lists. Make the definitions match the declarations.
MFC after: 3 days
Historic FreeBSD behaviour (dating back to 1994-04-02) when rebooting
is to print "Rebooting..." and then
/* wait 1 sec for printf's to complete and be read */
Prior to April 1994, there was a 100 ms delay (added 1993-11-12).
Since (a) most users will already be aware that the system is rebooting
and do not need to take time to read an additional message to that
effect, and (b) most FreeBSD systems don't have anyone actively looking
at the console anyway, this delay no longer serves much purpose.
This commit adds a kern.reboot_wait_time sysctl which defaults to 0;
historic behaviour can be regained by setting it to 1.
Reviewed by: imp
Relnotes: FreeBSD now reboots faster; to restore the traditional
wait after printing "Rebooting..." to the console, set
kern.reboot_wait_time=1 (or more).
Sponsored by: https://www.patreon.com/cperciva
Differential Revision: https://reviews.freebsd.org/D35796
Add three simple hooks to the debugger allowing for a loaded MAC policy
to intervene if desired:
1. Before invoking the kdb backend
2. Before ddb command registration
3. Before ddb command execution
We extend struct db_command with a private pointer and two flag bits
reserved for policy use.
Reviewed by: markj
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D35370
This is not completely exhaustive, but covers a large majority of
commands in the tree.
Reviewed by: markj
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D35583
The load balancer may force a running thread to reschedule and pick a
new CPU. To do this it sets some flags in the thread running on a
loaded CPU. But the code assumed that a running thread's lock is the
same as that of the corresponding runqueue, and there are small windows
where this is not true. In this case, we can end up with non-atomic
modifications to td_flags.
Since this load balancing is best-effort, simply give up if the thread's
lock doesn't match; in this case the thread is about to enter the
scheduler anyway.
Reviewed by: kib
Reported by: glebius
Fixes: e745d729be ("sched_ule(4): Improve long-term load balancer.")
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35821
It used to be mapped at the top of the UVA.
If the randomization is enabled any address above .data section will be
randomly chosen and a guard page will be inserted in the shared page
default location.
The shared page is now mapped in exec_map_stack, instead of
exec_new_vmspace. The latter function is called before image activator
has a chance to parse ASLR related flags.
The KERN_PROC_VM_LAYOUT sysctl was extended to provide shared page
address.
The feature is enabled by default for 64 bit applications on all
architectures.
It can be toggled kern.elf64.aslr.shared_page sysctl.
Approved by: mw(mentor)
Sponsored by: Stormshield
Obtained from: Semihalf
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D35349
Store the shared page address in struct vmspace.
Also instead of storing absolute addresses of various shared page
segments save their offsets with respect to the shared page address.
This will be more useful when the shared page address is randomized.
Approved by: mw(mentor)
Sponsored by: Stormshield
Obtained from: Semihalf
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D35393
Use a getter macro instead of fetching the sigcode address directly
from a sysent of a given process. It assumes that the sigcode is stored
in the shared page, which is true in all cases, except for a.out
binaries. This will be later useful when the shared page address
randomization is introduced.
No functional change intended.
Approved by: mw(mentor)
Sponsored by: Stormshield
Obtained from: Semihalf
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D35392
The passed cpuid is always equal to the one stored in the callout
structure. No functional change intended.
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Replace struct timeval in header with struct timespec.
To differentiate header formats, add a new KTR_VERSIONED flag
set in the header type field similar to the existing KTRDROP flag.
To make it easier to extend ktrace headers in the future,
extend the existing header with a version field (version 0 is
reserved for older records without KTR_VERSIONED) as well as
new fields holding the thread ID and CPU ID.
Reviewed by: jhb, pauamma
Differential Revision: https://reviews.freebsd.org/D35774
MFC after: 2 weeks
Allow high priority hardware interrupts to run at PI_REALTIME via
INTR_TYPE_CLK, but collapse all other hardware interrupt threads to
the next priority level (PI_INTR). Collapse all SWI priorities to
the same priority level (PI_SOFT) just below PI_INTR.
Reviewed by: kib, markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D35646
If an interrupt thread runs for a full quantum without yielding the
CPU, demote its priority and schedule a preemption to give other
ithreads a turn.
Reviewed by: kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D35645
If an interrupt thread runs for a full quantum without yielding the
CPU, demote its priority and schedule a preemption to give other
ithreads a turn.
Reviewed by: kib, markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D35644
Use sched_wakeup instead of sched_add when marking an ithread
runnable. This allows schedulers to reset their internal time slice
tracking state and restore the base ithread priority when an ithread
resumes from idle.
Reviewed by: markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D35643
Use it instead of sched_prio when setting the priority of an interrupt
thread.
Reviewed by: kib, markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D35642
Different fields in the tdq have different synchronization protocols.
Some are constant, some are accessed only while holding the tdq lock,
some are modified with the lock held but accessed without the lock, some
are accessed only on the tdq's CPU, and some are not synchronized by the
lock at all.
Convert ULE to stop using volatile and instead use atomic_load_* and
atomic_store_* to provide the desired semantics for lockless accesses.
This makes the intent of the code more explicit, gives more freedom to
the compiler when accesses do not need to be qualified, and lets KCSAN
intercept unlocked accesses.
Thus:
- Introduce macros to provide unlocked accessors for certain fields.
- Use atomic_load/store for all accesses of tdq_cpu_idle, which is not
synchronized by the mutex.
- Use atomic_load/store for accesses of the switch count, which is
updated by sched_clock().
- Add some comments to fields of struct tdq describing how accesses are
synchronized.
No functional change intended.
Reviewed by: mav, kib
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35737
The load balancer executes from statclock and periodically tries to move
threads among CPUs in order to balance load. It may move a thread to
the current CPU (the loader balancer always runs on CPU 0). When it
does so, it may need to schedule preemption of the interrupted thread.
Use sched_setpreempt() to do so, same as sched_add().
PR: 264867
Reviewed by: mav, kib, jhb
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35744
Thread switching used to be atomic with respect to the current CPU's
tdq lock. Since commit 686bcb5c14 that is no longer the case. Now
sched_switch() does this:
1. lock tdq (might already be locked)
2. maybe put the current thread in the tdq, choose a new thread to run
2a. update tdq_lowpri
3. unlock tdq
4. switch CPU context, update curthread
Some code paths in ULE will load pc_curthread from a remote CPU with
that CPU's tdq lock held, usually to inspect its priority. But, as of
the aforementioned commit this is racy.
The problem I noticed is in tdq_notify(), which optionally sends an IPI
to a remote CPU when a new thread is added to its runqueue. If the new
thread's priority is higher (lower) than the currently running thread's
priority, then we deliver an IPI. But inspecting
pc_curthread->td_priority doesn't work, since pc_curthread might be
between steps 3 and 4 above. If pc_curthread's priority is higher than
that of the newly added thread, but pc_curthread is switching to a
lower-priority thread, then tdq_notify() might fail to deliever an IPI,
leaving a high priority thread stuck on the runqueue for longer than it
should. This can cause multi-millisecond stalls in
interactive/ithread/realtime threads.
Fix this problem by modifying tdq_add() and tdq_move() to return the
value of tdq_lowpri before the addition of the new thread. This ensures
that tdq_notify() has the correct priority value to compare against.
The other two uses of pc_curthread are susceptible to the same race. To
fix the one in sched_rem()->tdq_setlowpri() we need to have an exact
value for curthread. Thus, introduce a new tdq_curthread field to the
tdq which gets updated any time a new thread is selected to run on the
CPU. Because this field is synchronized by the thread lock, its
priority reflects the correct lowpri value for the tdq.
PR: 264867
Fixes: 686bcb5c14 ("schedlock 4/4")
Reviewed by: mav, kib, jhb
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35736
- Remove a flag variable.
- Convert a runtime check of the passed cpuid to a KASSERT.
- Remove the cc_inited flag. An attempt to schedule a callout before
SI_SUB_CPU will crash anyway since the per-CPU mutexes won't have been
initialized, and that flag was only checked in the case where a cpuid
was explicitly specified by the caller.
No functional change intended.
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Stop including the current CPU in all event messages, since it's already
saved in KTR log entries and thus is redundant. All eventtimer traces
occur in a context where CPU migration is not possible.
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
In handleevents(), lock the timer state before fetching the time for the
next event. A concurrent callout_cc_add() call might be changing the
next event time, and the race can cause handleevents() to program an
out-of-date time, causing the callout to run later (by an unbounded
period, up to the idle hardclock period of 1s) than requested.
In cpu_idleclock(), call getnextcpuevent() with the timer state mutex
held, for similar reasons. In particular, cpu_idleclock() runs with
interrupts enabled, so an untimely timer interrupt can result in a stale
next event time being programmed. Further, an interrupt can cause
cpu_idleclock() to use a stale value for "now".
In cpu_activeclock(), disable interrupts before loading "now", so as to
avoid going backwards in time when calling handleevents(). It's ok to
leave interrupts enabled when checking "state->idle", since the race at
worst will cause handleevents() to be called unnecessarily. But use an
atomic load to indicate that the test is racy.
PR: 264867
Reviewed by: mav, jhb, kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35735
Callers have already loaded the pointer, so these functions don't need
to fetch it again.
No functional change intended.
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Some command definitions were forced to use DB_FUNC in order to specify
their required flags, CS_OWN or CS_MORE. Use the new macros to simplify
these.
Reviewed by: markj, jhb
MFC after: 3 days
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D35582
o Retire SS_FDREF as it is basically a debug flag on top of already
existing soref()/sorele().
o Convert SS_PROTOREF into soref()/sorele().
o Change reference model for the listen queues, see below.
o Make sofree() private. The correct KPI to use is only sorele().
o Make soabort() respect the model and sorele() instead of sofree().
Note on listening queues. Until now the sockets on a queue had zero
reference count. And the reference were given only upon accept(2). The
assumption was that there is no way to see the queued socket from anywhere
except its head. This is not true, since queued sockets already have pcbs,
which are linked at least into the global pcb lists. With this change we
put the reference right in the sonewconn() and on accept(2) path we just
hand the existing reference to the file descriptor.
Differential revision: https://reviews.freebsd.org/D35679
Rename SS_NOFDREF to SS_FDREF and flip all bitwise operations.
Mark sockets created by socreate() with SS_FDREF.
This change is mostly illustrative. With it we see that SS_FDREF
is a debugging flag, since:
* socreate() takes a reference with soref().
* on accept path solisten_dequeue() takes a reference
with soref() and then soaccept() sets SS_FDREF.
* soclose() checks SS_FDREF, removes it and does sorele().
Reviewed by: tuexen
Differential revision: https://reviews.freebsd.org/D35678
Change the default from printing a breif kernel thread stack informaton
back to omitting it for non-invariant kernels in response to
SIGINFO/^T. Full and brief stack support can be selected with the
kern.tty_info_kstacks sysctl.
MFC After: 2 weeks
Sponsored by: Netflix
Reviewed by: grembo, jhb
Differential Revision: https://reviews.freebsd.org/D35576
For the kernel this is mostly a non-functional change. However, this
will be useful for simplifying gcore(1).
Reviewed by: markj
MFC after: 2 weeks
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D35666
Add shm_remove_prison(), that removes all POSIX shared memory segments
belonging to a prison. Call it from prison_cleanup() so a prison
won't be stuck in a dying state due to the resources still held.
PR: 257555
Reported by: grembo
Currently, when a jail starts dying, either by losing its last user
reference or by being explicitly killed,
osd_jail_call(...PR_METHOD_REMOVE...) is called. Encapsulate this
into a function prison_cleanup() that can then do other cleanup.
Instead of returning EMSGSIZE pass the error code from fdallocn() directly
to userland. That would be EMFILE, which makes much more sense. This
error code is not listed in the specification[1], but the specification
doesn't cover such edge case at all. Meanwhile the specification lists
EMSGSIZE as the error code for invalid value of msg_iovlen, and FreeBSD
follows that, see sys_recmsg(). Differentiating these two cases will make
a developer/admin life much easier when debugging.
[1] https://pubs.opengroup.org/onlinepubs/9699919799/functions/recvmsg.html
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35640
OpenVPN Data Channel Offload (DCO) moves OpenVPN data plane processing
(i.e. tunneling and cryptography) into the kernel, rather than using tap
devices.
This avoids significant copying and context switching overhead between
kernel and user space and improves OpenVPN throughput.
In my test setup throughput improved from around 660Mbit/s to around
2Gbit/s.
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D34340
A one-to-many unix/dgram socket is a socket that has been bound
with bind(2) and can get multiple connections. A typical example
is /var/run/log bound by syslogd(8) and receiving multiple
connections from libc syslog(3) API. Until now all of these
connections shared the same receive socket buffer of the bound
socket. This made the socket vulnerable to overflow attack.
See 240d5a9b1c for a historical attempt to workaround the problem.
This commit creates a per-connection socket buffer for every single
connected socket and eliminates the problem. The new behavior will
optimize seldom writers over frequent writers. See added test case
scenarios and code comments for more detailed description of the
new behavior.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35303
o Use m_pkthdr.memlen from m_uiotombuf()
o Modify unp_internalize() to keep track of allocated space and memory
as well as pointer to the last buffer.
o Modify unp_addsockcred() to keep track of allocated space and memory
as well as pointer to the last buffer.
o Record the datagram len/memlen/ctllen in the first (from) mbuf of the
chain in uipc_sosend_dgram() and reuse it in uipc_soreceive_dgram().
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35302
Data allocated by m_uiotombuf() usually goes into a socket buffer.
We are interested in the length of useful data to be added to sb_acc,
as well as total memory used by mbufs. The later would be added to
sb_mbcnt. Calculating this value at allocation time allows to save
on extra traversal of the mbuf chain.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35301
This change fully splits away PF_UNIX/SOCK_DGRAM from other socket
buffer implementations, without any behavior changes.
Generic socket implementation is reduced down to one STAILQ and very
little code.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35300
Split struct sockbuf into common shared fields and protocol specific
union, where protocols are free to implement whatever buffer they
want. Such protocols should mark themselves with PR_SOCKBUF and are
expected to initialize their buffers in their pr_attach and tear
them down in pr_detach.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35299
Use this new version in unix/dgram socket when sending to a target
address. This removes extra lock release/acquisition and possible
counter-intuitive ENOTCONN.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35298
This allows to remove one M_NOWAIT allocation and also makes it
more clear what's going on.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35297
With this second step PF_UNIX/SOCK_DGRAM has protocol specific
implementation. This gives some possibility performance
optimizations. However, it still operates on the same struct
socket as all other sockets do.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35296
Remove the dead code. The new uipc_sosend_dgram() handles send()
on PF_UNIX/SOCK_DGRAM in full.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35294
This is first step towards splitting classic BSD socket
implementation into separate classes. The first to be
split is PF_UNIX/SOCK_DGRAM as it has most differencies
to SOCK_STREAM sockets and to PF_INET sockets.
Historically a protocol shall provide two methods for sendmsg(2):
pru_sosend and pru_send. The former is a generic send method,
e.g. sosend_generic() which would internally call the latter,
uipc_send() in our case. There is one important exception, though,
the sendfile(2) code will call pru_send directly. But sendfile
doesn't work on SOCK_DGRAM, so we can do the trick. We will create
socket class specific uipc_sosend_dgram() which will carry only
important bits from sosend_generic() and uipc_send().
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35293
Partially revert the previous change; we need to keep this method as a
specific override for pci_driver subclasses which should not use
pci_rescan_method() -- cardbus and ofw_pcibus. However, change the return
value to ENODEV for the same reasoning given in the original commit, and
use this as the default rescan method in bus_if.m.
Reported by: jhb
Fixes: 36a8572ee8 ("bus_if: provide a default null rescan method")
MFC with: 36a8572ee8
The third argument to this function indicates whether the supplied
ticker is fixed or variable, i.e. requiring calibration. Give this
argument a type and name that better conveys this purpose.
Reviewed by: kib, markj
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35459
There is an existing helper method in subr_bus.c, but almost no drivers
know to use it. It also returns the same error as an empty method,
making it not very useful. Move this to bus_if.m and return a more
sensible error code.
This gives a slightly more meaningful error message when attempting
'devctl rescan' on buses and devices alike:
"Device not configured" --> "Operation not supported by device"
Reviewed by: imp
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35501
Calculate the desired page valid mask using math that will not
overflow the types used.
Sponsored by: Netflix
Reviewed by: mckusick, kib, markj
Differential Revision: https://reviews.freebsd.org/D34837
- Remove the AIO proc zone. This zone gets one allocation per AIO
daemon process, which isn't enough to warrant a dedicated zone. Plus,
unlike other AIO structures, aiops are small (32 bytes with LP64), so
UMA doesn't provide better space efficiency than malloc(9). Change
one of the malloc types in vfs_aio.c to make it more general.
- Don't set the NOFREE flag on the other AIO zones. This flag means
that memory allocated to the AIO subsystem is never freed back to the
VM, so it's always preferable to avoid using it when possible. NOFREE
was set without explanation when AIO was converted to use UMA 20 years
ago, but it does not appear to be required; all of the structures
allocated from UMA (per-process kaioinfo, kaiocb, and aioliojob) keep
track of references and get freed only when none exist. Plus, these
structures will contain dangling pointer after they're freed (e.g.,
the "cred", "fd_file" and "uiop" fields of struct kaiocb), so
use-after-frees are dangerous even when the structures themselves are
type-stable.
Reviewed by: asomers
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35493
Add kf_pipe_buffer_[in/out/size] fields to kf_pipe, and populate them.
Add a kf_kqueue struct to the kf_un union, to allow querying kqueue state,
and populate it.
Populate the kf_sock_rcv_sb_state and kf_sock_snd_sb_state fields in
kf_sock for INET/INET6 sockets, and populate all other fields for all
transport layer protocols, not just TCP.
Bump __FreeBSD_version.
Differential revision: https://reviews.freebsd.org/D34184
Reviewed by: jhb, kib, se
MFC after: 1 week
When locking the knote list for a socket, we check whether the socket is
a listening socket in order to select the appropriate mutex; a listening
socket uses the socket lock, while data sockets use socket buffer
mutexes.
If SOLISTENING(so) is false and the knote lock routine locks a socket
buffer, then it must re-check whether the socket is a listening socket
since solisten_proto() could have changed the socket's identity while we
were blocked on the socket buffer lock.
Reported by: syzkaller
Reviewed by: glebius
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35492
When the kernel is compiled with -asan-stack=true, the address sanitizer
will emit inline accesses to the shadow map. In other words, some
shadow map accesses are not intercepted by the KASAN runtime, so they
cannot be disabled even if the runtime is not yet initialized by
kasan_init() at the end of hammer_time().
This went unnoticed because the loader will initialize all PML4 entries
of the bootstrap page table to point to the same PDP page, so early
shadow map accesses do not raise a page fault, though they are silently
corrupting memory. In fact, when the loader does not copy the staging
area, we do get a page fault since in that case only the first and last
PML4Es are populated by the loader. But due to another bug, the loader
always treated KASAN kernels as non-relocatable and thus always copied
the staging area.
It is not really practical to annotate hammer_time() and all callees
with __nosanitizeaddress, so instead add some early initialization which
creates a shadow for the boot stack used by hammer_time(). This is only
needed by KASAN, not by KMSAN, but the shared pmap code handles both.
Reported by: mhorne
Reviewed by: kib
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35449
The pointer to the mount values may be null if an error occurred while
copying them in, so fix the assertion condition to reflect that
possibility.
While here, move some initialization code into the error == 0 block. No
functional change intended.
Reported by: syzkaller
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
... rather than setting and clearing flags inline. No functional change
intended.
Reviewed by: alc, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35469
Suppose a thread tries to read from an empty pipe. pipe_read() does the
following:
1. pipelock(), possibly sleeping
2. check for buffered data
3. pipeunlock()
4. set PIPE_WANTR and sleep
5. goto 1
pipelock() is an open-coded mutex; if a thread blocks in pipelock(), it
sleeps until the lock holder calls pipeunlock().
Both sleeps use the same wait channel. So if there are multiple threads
in pipe_read(), a thread T1 in step 3 can wake up a thread T2 sleeping
in step 4. Then T1 goes to sleep in step 4, and T2 acquires and
releases the pipelock, waking up T1 again. This can go on indefinitely,
livelocking the process (and potentially starving a would-be writer).
Fix the problem by using a separate wait channel for pipelock().
Reported by: Paul Floyd <paulf2718@gmail.com>
Reviewed by: mjg, kib
PR: 264441
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35415
This is racy because curproc process lock is not used, but allows the
process to exit faster. It is userspace issue to create such race
anyway, and not fullfilling the guarantee that all reaper descendants
are signalled should be fine.
In collaboration with: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D35310
We drop proctree_lock, which allows the process to exit while memoized
in the list to proceed.
Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D35310
Recorded reaper might loose its reaper status, so we should not assert
it, but check and avoid signalling if this happens.
Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 week
Differential revision: https://reviews.freebsd.org/D35310
The failure means that the process does single-threading itself, which
makes our action not needed.
Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D35310
If we try to single-thread a process which thread entered
procctl(REAP_KILL_SUBTREE), and sleeping waiting for us unlocking
stop_all_proc_blocker, we must be able to finish single-threading. This
requires the sleep to be interruptible.
Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D35310
markj wrote:
tdsendsignal() may unsuspend a target thread. I think there is at least
one bug there: suppose thread T is suspended in
thread_single(SINGLE_ALLPROC) when trying to kill another process with
REAP_KILL. Suppose a different thread sends SIGKILL to T->td_proc. Then,
tdsendsignal() calls thread_unsuspend(T, T->td_proc). thread_unsuspend()
incorrectly decrements T->td_proc->p_suspcount to -1.
Later, when T->td_proc exits, it will wait forever in
thread_single(SINGLE_EXIT) since T->td_proc->p_suspcount never reaches 1.
Since the thread suspension is bounded by time needed to do
thread_single(), skipping the thread_unsuspend_one() call there should
not affect signal delivery if this thread is selected as target.
Reported by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D35310
Since both self single-threading and remote single-threading rely on
suspending the thread doing thread_single(), it cannot be mixed: thread
doing thread_suspend_switch() might be subject to thread_suspend_one()
and vice versa.
In collaboration with: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D35310
suspended for SINGLE_ALLPROC mode. There is no need to check for
boundary state. It is only required to see that the suspension comes
from the ALLPROC mode.
In collaboration with: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D35310
to avoid ALLPROC mode to try to race with any other single-threading
mode.
In collaboration with: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D35310
Places that will wait for curproc->p_singlethr to become zero (in the
next commit, the counter of number of external single-threading is
to be introduced), must wait for it interruptible, otherwise we
deadlock. On the other hand, a signal delivered during this window,
if directed to the waiting thread, would cause the wait loop to become
a busy loop.
Since we are exiting, it is safe to ignore the signals.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D35310
before the process itself does thread_single(SINGLE_EXIT). We cannot
single-thread such process in ALLPROC (external) mode, and properly
detect and report the failure to do so due to the process becoming
zombie is easier to prevent than handle.
In collaboration with: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D35310
This avoids the need to drop into the ddb to figure out vnode
usage per file system. It helps to see if they are or are not
being freed. Suggestion to report active vnode count was from
kib@
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D35436
Do this by reducing the size of the MBUF_PEXT_MAX_PGS, causing "struct mbuf" to
be bigger than M_SIZE, and also add a missing padding field to ensure 64-bit
alignment.
Reviewed by: gallatin@
Reported by: Elliott Mitchell
Differential revision: https://reviews.freebsd.org/D35339
MFC after: 1 week
Sponsored by: NVIDIA Networking
Basic TLS RX offloading uses the "csum_flags" field in the mbuf packet
header to figure out if an incoming mbuf has been fully offloaded or
not. This information follows the packet stream via the LRO engine, IP
stack and finally to the TCP stack. The TCP stack preserves the mbuf
packet header also when re-assembling packets after packet loss. When
the mbuf goes into the socket buffer the packet header is demoted and
the offload information is transferred to "m_flags" . Later on a
worker thread will analyze the mbuf flags and decide if the mbufs
making up a TLS record indicate a fully-, partially- or not decrypted
TLS record. Based on these three cases the worker thread will either
pass the packet on as-is or recrypt the decrypted bits, if any, or
decrypt the packet as usual.
During packet loss the kernel TLS code will call back into the network
driver using the send tag, informing about the TCP starting sequence
number of every TLS record that is not fully decrypted by the network
interface. The network interface then stores this information in a
compressed table and starts asking the hardware if it has found a
valid TLS header in the TCP data payload. If the hardware has found a
valid TLS header and the referred TLS header is at a valid TCP
sequence number according to the TCP sequence numbers provided by the
kernel TLS code, the network driver then informs the hardware that it
can resume decryption.
Care has been taken to not merge encrypted and decrypted mbuf chains,
in the LRO engine and when appending mbufs to the socket buffer.
The mbuf's leaf network interface pointer is used to figure out from
which network interface the offloading rule should be allocated. Also
this pointer is used to track route changes.
Currently mbuf send tags are used in both transmit and receive
direction, due to convenience, but may get a new name in the future to
better reflect their usage.
Reviewed by: jhb@ and gallatin@
Differential revision: https://reviews.freebsd.org/D32356
Sponsored by: NVIDIA Networking