Commit Graph

19592 Commits

Author SHA1 Message Date
Alan Somers
d3f96f6610 Fix O(n^2) behavior in sysctl
Sysctl OIDs were internally stored in linked lists, triggering O(n^2)
behavior when userland iterates over many of them.  The slowdown is
noticeable for MIBs that have > 100 children (for example, vm.uma).  But
it's unignorable for kstat.zfs when a pool has > 1000 datasets.

Convert the linked lists into RB trees.  This produces a ~25x speedup
for listing kstat.zfs with 4100 datasets, and no measurable penalty for
small dataset counts.

Bump __FreeBSD_version for the KPI change.

Sponsored by:	Axcient
Reviewed by:	mjg
Differential Revision: https://reviews.freebsd.org/D36500
2022-09-26 18:03:34 -06:00
Alan Somers
52360ca32f copy_file_range: truncate write if it would exceed RLIMIT_FSIZE
PR:		266611
MFC after:	2 weeks
Reviewed by:	kib
Differential Revision: https://reviews.freebsd.org/D36706
2022-09-26 15:22:29 -06:00
Mitchell Horne
818cae0ff7 kasan: provide bus peek/poke definitions
Reviewed by:	andrew, markj
Sponsored by:	Juniper Networks, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D36700
2022-09-26 14:25:05 -05:00
Konstantin Belousov
1b4b75171e Add vn_rlimit_fsizex() and vn_rlimit_fsizex_res()
The vn_rlimit_fsizex() function:
- checks that the write does not exceed RLIMIT_FSIZE limit and fs
  maximum supported file size
- truncates write length if it exceeds the RLIMIT_FSIZE or max file size,
  but there are some bytes to write
- sends SIGXFSZ if RLIMIT_FSIZE would be exceed otherwise

POSIX mandates the truncated write in case when some bytes can be
written but whole write request fails the RLIMIT_FSIZE check.

The function is supposed to be used from VOP_WRITE()s. Due to
pecularity in the VFS generic write syscall layer, uio_resid must
correctly reflect the written amount (noted by markj). Provide the dual
vn_rlimit_fsizex_res() function to correct uio_resid after the clamp
done in vn_rlimit_fsizex() on VOP_WRITE() return.

PR:	164793
Reviewed by:	asomers, jah, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D36625
2022-09-24 19:41:33 +03:00
Konstantin Belousov
2ac083f60f Add vn_rlimit_trunc()
Reviewed by:	asomers, jah, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D36625
2022-09-24 19:41:18 +03:00
Konstantin Belousov
cc65a412ae filesystems: return error from vn_rlimit_fsize() instead of EFBIG
Reviewed by:	asomers, jah, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D36625
2022-09-24 19:41:14 +03:00
Mark Johnston
c2d27b0ec7 sched_4bsd: Fix a racy thread state modification
When a thread switching off-CPU is migrating to a remote CPU,
sched_switch() may trigger a rescheduling of the thread currently
running on that CPU.  When doing so, it must ensure that that thread is
locked before modifying thread state.  If the thread's lock is not the
scheduler lock, then the thread is in the process of switching off-CPU
and no extra effort is needed, and the initiator does not hold the
thread's lock and thus should not modify any thread state.

Reported and tested by:	Steve Kargl
MFC after:	1 week
2022-09-23 20:09:06 -04:00
John Baldwin
f49fd63a6a kmem_malloc/free: Use void * instead of vm_offset_t for kernel pointers.
Reviewed by:	kib, markj
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D36549
2022-09-22 15:09:19 -07:00
John Baldwin
7ae99f80b6 pmap_unmapdev/bios: Accept a pointer instead of a vm_offset_t.
This matches the return type of pmap_mapdev/bios.

Reviewed by:	kib, markj
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D36548
2022-09-22 15:08:52 -07:00
Zhenlei Huang
440217b0af debugnet: Fix parameter order in the calls to m_get()
Reviewed by:	markj
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D36643
2022-09-21 06:55:20 -04:00
Mateusz Guzik
a3ab1102e3 vfs: silence a bogus LOR in freevnode
Reported by:	imp
2022-09-19 02:14:50 +00:00
Mateusz Guzik
a75d1ddd74 vfs: introduce V_PCATCH to stop abusing PCATCH 2022-09-17 15:41:37 +00:00
Mateusz Guzik
9e4f35ac25 vfs: deperl msleep flag calculation in vn_start_*write 2022-09-17 17:17:20 +02:00
Mateusz Guzik
1c7084fe56 vfs: clean up parse_mount_dev_present 2022-09-17 12:42:46 +00:00
Mateusz Guzik
b77bdfdb67 vfs: fix non-INVARIANTS build after 5b5b7e2ca2
Reported by:	gj
2022-09-17 10:45:12 +00:00
Mateusz Guzik
aede6a9670 vfs: fixup parse_mount_dev_present after 5b5b7e2ca2
Reported by:	kib
2022-09-17 10:35:00 +00:00
Mateusz Guzik
5b5b7e2ca2 vfs: always retain path buffer after lookup
This removes some of the complexity needed to maintain HASBUF and
allows for removing injecting SAVENAME by filesystems.

Reviewed by:	kib (previous version)
Differential Revision:	https://reviews.freebsd.org/D36542
2022-09-17 09:10:38 +00:00
Mateusz Guzik
3df3d88cc5 vfs: move cn_nameptr assignment out of namei_getpath 2022-09-17 09:08:34 +00:00
Mateusz Guzik
41a0a99f85 vfs: slightly reorganize error handling in chroot
This avoids duplicated NDFREE_NOTHING which will be of importance
later.
2022-09-17 09:08:34 +00:00
Warner Losh
7cd4984e67 SPDX: Not BSD-4-Clause
This is not BSD-4-Clause. It's closer to a modified BSD-2-Clause with 2
added clauses (and the first one has added clauses). Remove
SPDX-License-Idnetifier since this license doesn't match anything in
SPDX.
2022-09-16 21:49:16 -06:00
Konstantin Belousov
ff41239f58 Add AT_USRSTACK{BASE, LIM} AT vectors, and ELF_BSDF_VMNOOVERCOMMIT flag
Reviewed by:	brooks, imp (previous version)
Discussed with:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D36540
2022-09-16 23:23:26 +03:00
Mateusz Guzik
50176b0296 locks: whack a failed experiment in form of restrict_starvation
This was never enabled and only pollutes the code. The issue will
be addressed later in a different manner.

Sponsored by:	Rubicon Communications, LLC ("Netgate")
2022-09-16 17:29:37 +00:00
Gordon Bergling
4771011b8f kern_jail: Fix a typo in a source code comment
- s/paramter/parameter/

MFC after:	3 days
2022-09-15 10:25:19 +02:00
Warner Losh
9baa0817ec SPDX: Not BSD-4-Clause
This license has 4 clauses, and shares some text with the BSD-4-Clause
license. However, it omits the standard disclaimer and has 2 clauses all
its own. Remove this tag, since it was made in error and this doesn't
match the SPDX copy of the BSD-4-Clause license.

Sponsored by:		Netflix
2022-09-14 21:29:31 -06:00
Mateusz Guzik
d04c7f10d4 vfs: make delmntque return with the interlock held
saves on relocking dance -- the lock is taken immediately
afterwards anyway.
2022-09-14 23:30:19 +00:00
Mateusz Guzik
43fbd0e7a7 lockf: elide vnode interlock in the common case in lf_purgelocks
The interlock was already taken and released when dooming, thus by
API contract locking state cannot be legally installed.

At the same time the state is almost never there to begin with.
2022-09-14 23:04:22 +00:00
Mateusz Guzik
a755fb921e vfs: retire the V_MNTREF flag
Reviewed by:	kib, mckusick
Differential Revision:	https://reviews.freebsd.org/D36521
2022-09-14 18:16:36 +00:00
Mateusz Guzik
61a1d5dde2 vfs: stop using the V_MNTREF flag
Reviewed by:	kib, mckusick
Differential Revision:	https://reviews.freebsd.org/D36521
2022-09-14 18:16:23 +00:00
Mateusz Guzik
f7dc4a71da vfs: plug spurious error checks in namei
error is guaranteed 0 at that point
2022-09-13 23:18:30 +00:00
Mateusz Guzik
b4137c9ed1 vfs: make NDVALIDATE private to vfs_lookup.c
it is not used elsewhere.
2022-09-12 22:50:48 +00:00
Allan Jude
b20ec58669 vfs.typenumhash: fix sysctl description
a string continuation was missing a space, resulting in two works
being smushed together.

Sponsored by:	Klara, Inc.
2022-09-10 22:47:51 +00:00
Mateusz Guzik
1760a6950a Fixup build after recent getsock changes 2022-09-10 20:40:43 +00:00
Mateusz Guzik
3be2225fc8 Remove fflag argument from getsock_cap
Interested callers can obtain in other own easily enough
and there is no reason to branch on it.
2022-09-10 19:47:47 +00:00
Mateusz Guzik
3212ad15ab Add getsock
All but one consumers of getsock_cap only pass 4 arguments.
Take advantage of it.
2022-09-10 19:47:47 +00:00
Mateusz Guzik
a2ad70923f Add branch prediction hints to getsock_cap 2022-09-10 19:41:52 +00:00
Gleb Smirnoff
e80062a2d4 tcp: avoid call to soisconnected() on transition to ESTABLISHED
This call existed since pre-FreeBSD times, and it is hard to understand
why it was there in the first place.  After 6f3caa6d81 it definitely
became necessary always and commit message from f1ee30ccd6 confirms that.
Now that 6f3caa6d81 is effectively backed out by 07285bb4c2, the call
appears to be useful only for sockets that landed on the incomplete queue,
e.g. sockets that have accept_filter(9) enabled on them.

Provide a new TCP flag to mark connections that are known to be on the
incomplete queue, and call soisconnected() only for those connections.

Reviewed by:		rrs, tuexen
Differential revision:	https://reviews.freebsd.org/D36488
2022-09-08 09:16:04 -07:00
Doug Moore
d0354fa7b6 rb_tree: reduce duplication in balancing code
Change RB_INSERT_COLOR and RB_REMOVE_COLOR so that the blocks of code
that are identical except for left and right being exchanged are made
only one block with a variable to indicate left- or right-handedness.

Rename RB macros so that those not intended for external use begin
with an underscore.

Add comments to the balancing code so that another might understand it.

Reviewed by:	alc, kib
MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D36393
2022-09-07 23:46:19 -05:00
Mateusz Guzik
3e0b486886 vfs: flip a condition around in kern_statat
error tends to be 0.
2022-09-07 20:06:24 +00:00
Hans Petter Selasky
0e391a3197 ktls: Add missing NULL pointer check for TLS RX hardware offload.
The send tag pointer may be NULL when the ktls_reset_receive_tag()
function is invoked. Add check for this.

Reviewed by:	gallatin @
Sponsored by:	NVIDIA Networking
2022-09-06 13:49:23 +02:00
Mateusz Guzik
69413598d2 signal: use proc_iterate to save on work
Most notably poudriere performs kill -9 -1 in jails for each port
being built. This reduces the scan from hundrends of processes to
literally 1.

Reviewed by:	jamie, markj
Differential Revision:	https://reviews.freebsd.org/D34522
2022-09-05 11:54:47 +00:00
Mateusz Guzik
5ecb5444aa jail: add process linkage
It allows iteration over processes belonging to given jail instead of
having to walk the entire allproc list.

Note the iteration can miss processes which remains bug-compatible
with previous code.

Reviewed by:	jamie (previous version), markj (previous version)
Differential Revision:	https://reviews.freebsd.org/D34522
2022-09-05 11:54:47 +00:00
Gordon Bergling
d744e271eb kern: Remove a double word in a source code comment
- s/that that/that/

MFC after:	3 days
2022-09-04 17:32:10 +02:00
Gordon Bergling
49a033d8cf kern: Correct some typos in source code comments
- s/occured/occurred/
- s/the the/the/

MFC after:	3 days
2022-09-04 13:00:01 +02:00
Gordon Bergling
2b7d656f17 kern: Fix a typo in asource code comment
- s/overriden/overridden/

MFC after:	3 days
2022-09-03 15:26:55 +02:00
Gleb Smirnoff
24af7808fa protosw: repair protocol selection logic in socket(2)
Pointy hat to:	glebius
Fixes:		61f7427f02
2022-08-30 21:19:46 -07:00
Gleb Smirnoff
61f7427f02 protosw: cleanup protocols that existed merely to provide pr_input
Since 4.4BSD the protosw was used to implement socket types created
by socket(2) syscall and at the same to demultiplex incoming IPv4
datagrams (later copied to IPv6).  This story ended with 78b1fc05b2.

These entries (e.g. IPPROTO_ICMP) in inetsw that were added to catch
packets in ip_input(), they would also be returned by pffindproto()
if user says socket(AF_INET, SOCK_RAW, IPPROTO_ICMP).  Thus, for raw
sockets to work correctly, all the entries were pointing at raw_usrreq
differentiating only in the value of pr_protocol.

With 78b1fc05b2 all these entries are no longer needed, as ip_protox
is independent of protosw.  Any socket syscall requesting SOCK_RAW type
would end up with rip_protosw.  And this protosw has its pr_protocol
set to 0, allowing to mark socket with any protocol.

For IPv6 raw socket the change required two small fixes:
o Validate user provided protocol value
o Always use protocol number stored in inp in rip6_attach, instead
  of protosw value, which is now always 0.

Differential revision:	https://reviews.freebsd.org/D36380
2022-08-30 15:09:21 -07:00
Gleb Smirnoff
8624f4347e divert: declare PF_DIVERT domain and stop abusing PF_INET
The divert(4) is not a protocol of IPv4.  It is a socket to
intercept packets from ipfw(4) to userland and re-inject them
back.  It can divert and re-inject IPv4 and IPv6 packets today,
but potentially it is not limited to these two protocols.  The
IPPROTO_DIVERT does not belong to known IP protocols, it
doesn't even fit into u_char.  I guess, the implementation of
divert(4) was done the way it is done basically because it was
easier to do it this way, back when protocols for sockets were
intertwined with IP protocols and domains were statically
compiled in.

Moving divert(4) out of inetsw accomplished two important things:

1) IPDIVERT is getting much closer to be not dependent on INET.
   This will be finalized in following changes.
2) Now divert socket no longer aliases with raw IPv4 socket.
   Domain/proto selection code won't need a hack for SOCK_RAW and
   multiple entries in inetsw implementing different flavors of
   raw socket can merge into one without requirement of raw IPv4
   being the last member of dom_protosw.

Differential revision:	https://reviews.freebsd.org/D36379
2022-08-30 15:09:21 -07:00
Gleb Smirnoff
244e1aeaec domains: merge domain_init() into domain_add()
domain_init() called at SI_SUB_PROTO_DOMAIN/SI_ORDER_SECOND is always
called right after domain_add(), that had been called at SI_ORDER_FIRST.
Note that protocols aren't initialized yet at this point, since they are
usually scheduled to initialize at SI_ORDER_THIRD.

After this merge it becomes clear that DOMF_SUPPORTED / DOMF_INITED
can be garbage collected as they are set & checked in the same function.

For initialization of the domain system itself it is now clear that
domaininit() can be garbage collected and static initializer is enough.
2022-08-29 19:15:01 -07:00
Gleb Smirnoff
e18c5816ea domains: use queue(9) SLIST for linked list of domains 2022-08-29 19:15:01 -07:00
Gleb Smirnoff
d7574c7432 domains: init pr_domain in pr_init() 2022-08-29 19:15:01 -07:00
Gleb Smirnoff
c414347bc5 mbufs: isolate max_linkhdr and max_protohdr handling in the mbuf code
o Statically initialize max_linkhdr to default value without relying
  on domain(9) code doing that.
o Statically initialize max_protohdr to a sane value, without relying
  on TCP being always compiled in.
o Retire max_datalen. Set, but not used.
o Don't make the domain(9) system responsible in validating these
  values and updating max_hdr.  Instead provide KPI max_linkhdr_grow()
  and max_protohdr_grow().
o Call max_linkhdr_grow() from IEEE802.11 and max_protohdr_grow() from
  TCP.  Those are the only protocols today that may want to grow.

Reviewed by:		tuexen
Differential revision:	https://reviews.freebsd.org/D36376
2022-08-29 19:14:25 -07:00
Mark Johnston
32faf071bd devstat: Remove DTrace io probes lacking a BIO reference
The io:::start and end probes trace individual I/O requests.

Also remove the unimplemented wait-start and wait-done probes.

PR:		266098
MFC after:	1 week
2022-08-29 13:22:36 -04:00
Doug Moore
5d91386826 rb_tree: avoid extra reads in rebalancing
In RB_INSERT_COLOR and RB_REMOVE_COLOR, avoid reading a parent pointer
from memory, and then reading the left-color bit from memory, and then
reading the right-color bit from memory, since they're all in the same
field. The compiler can't infer that only the first read is really
necessary, so write the code in a way so that it doesn't have to.

Drop RB_RED_LEFT and RB_RED_RIGHT macros that reach into memory to get
those bits.  Drop RB_COLOR, the only thing left using RB_RED_LEFT and
RB_RED_RIGHT after the other changes, and go straight to DIAGNOSTIC
code in subr_stats to implement RB_COLOR for its single, dubious use
there.

Reviewed by:	alc
MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D36353
2022-08-29 11:11:31 -05:00
John Baldwin
e3885a7893 soo_stat: Ensure error is always initialized.
In kernels without MAC, error is not set for sockets whose protocol
layer does not implement the pr_sense hook.

Reported by:	Jenkins (powerpc kernel builds)
Fixes:		7c04ca1fad sockets: for stat(2) on a socket don't report hiwat as block size
2022-08-26 11:17:55 -07:00
Gleb Smirnoff
837b7203f0 domains: use struct domain as argument 2022-08-26 10:35:35 -07:00
firk
768f6373eb Fix compat10 semaphore interface race
Wrong has-waiters and missing unconditional _count==0 check may cause
infinite waiting with already non-zero count.
1) properly clear _has_waiters flag when waiting failed to start
2) always check _count before start waiting

PR:	265997
Reviewed by:	kib
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D36272
2022-08-26 20:34:29 +03:00
Gleb Smirnoff
7c04ca1fad sockets: for stat(2) on a socket don't report hiwat as block size
The code appeared in d8392c6c39 with not good explanation.  It is
very unlikely any software in the world needs that.

Differential revision:	https://reviews.freebsd.org/D36283
2022-08-26 08:16:15 -07:00
Mateusz Guzik
49afea1059 proc: read the pid prior to unlocking in report_alive_proc1
In principle another thread could have reaped the process by that time.
2022-08-25 17:26:49 +00:00
Konstantin Belousov
fce3b1c327 fork_exit(): style comment
Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D36302
2022-08-24 22:12:53 +03:00
Brooks Davis
840327e5dd mbuf: Don't support PAGE_SIZE < 4K
The Vax supported such things, but FreeBSD does not.  This further
implies that MJUMPAGESIZE > MCLBYTES so assert this and remove code
handling them being equal.

Reviewed by:	kp, imp, jhb
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D36320
2022-08-24 18:34:07 +01:00
Mateusz Guzik
9488262679 rms: add rms_assert_rlock_ok
So that callers which opportunistically elide the lock can still
assert that they can take it.

Reviewed by:
Differential Revision:
2022-08-23 19:15:48 +00:00
Robert Wing
3454a7caa0 kqueue: retire knlist_init_rw_reader()
Last usage was removed in afa85850e7.

Reviewed by:	pauamma, melifaro, kib
Differential Revision:	https://reviews.freebsd.org/D36205
2022-08-20 21:17:39 -08:00
Konstantin Belousov
f829268bcc Remove TDF_DOING_SA
We cannot see a thread with the flag set in unsuspend, after we stopped
doing SINGLE_ALLPROC from user processes.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D36207
2022-08-20 20:34:30 +03:00
Konstantin Belousov
5e5675cb4b Remove struct proc p_singlethr member
It does not serve any purpose after we stopped doing
thread_single(SINGLE_ALLPROC) from stoppable user processes.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D36207
2022-08-20 20:34:30 +03:00
Konstantin Belousov
2842ec6d99 REAP_KILL_PROC: kill processes in the threaded taskqueue context
There is a problem still left after the fixes to REAP_KILL_PROC.  The
handling of the stopping signals by sig_suspend_threads() can occur
outside the stopping process context by tdsendsignal(), and it uses
mostly the same mechanism of aborting sleeps as suspension.  In other
words, it badly interacts with thread_single(SINGLE_ALLPROC).

But unlike single threading from the process context, we cannot wait by
sleep for other single threading requests to pass, because we own
spinlock(s).

Fix this by moving both the thread_single(p2, SINGLE_ALLPROC), and the
signalling, to the threaded taskqueue which cannot be single-threaded
itself.

Reported and tested by:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D36207
2022-08-20 20:34:11 +03:00
Konstantin Belousov
5e9bba94bd fork_norfproc(): unlock p1 before retrying
Reported and reviewed by:	markj
Tested by:	pho
Syzkaller:	647212368c3f32c6f13f
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D36207
2022-08-20 20:33:18 +03:00
Konstantin Belousov
0a4f2ac3b7 kern_sig.c: style
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D36207
2022-08-20 20:33:18 +03:00
Konstantin Belousov
cdb58f9d04 ksiginfo_tryfree(): change return type to bool
The function result is already used as bool.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D36207
2022-08-20 20:33:18 +03:00
Konstantin Belousov
cc29f221aa ksiginfo_alloc(): change to directly take M_WAITOK/NOWAIT flags
Also style, and remove unneeded cast.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D36207
2022-08-20 20:33:17 +03:00
Konstantin Belousov
5c78797e42 reap_kill_proc_locked(): remove outdated part of the comment
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
Differential revision:	https://reviews.freebsd.org/D36207
2022-08-20 20:33:17 +03:00
Konstantin Belousov
30b16a6bcf exit1(): update comment about thread_single()
We do not check single-threading conditions in trap, or when sleeping
uninterruptible.

Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
Differential revision:	https://reviews.freebsd.org/D36207
2022-08-20 20:33:17 +03:00
Konstantin Belousov
f835be5822 sleepq_set_timeout_sbt(): correct comment to not talk about ticks
It is sbt now.  Also, explain what flags are.

Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
Differential revision:	https://reviews.freebsd.org/D36207
2022-08-20 20:33:17 +03:00
Konstantin Belousov
da39a100db sleepq_check_ast_sc_locked(): update comment
The relock order is important not only for a signal delivery, but also
for the suspension requests.

Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
Differential revision:	https://reviews.freebsd.org/D36207
2022-08-20 20:33:17 +03:00
Konstantin Belousov
bd76586bb7 fork_norfproc(): style
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
Differential revision:	https://reviews.freebsd.org/D36207
2022-08-20 20:33:17 +03:00
Mateusz Guzik
497240def8 Retire clone_drain_lock
It is only ever xlocked in drain_dev_clone_events and the only consumer of
that routine does not need it -- eventhandler code already makes sure the
relevant callback is no longer running.

Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D36268
2022-08-20 09:44:05 +00:00
Gleb Smirnoff
820bafd0bc unix/dgram: don't panic if socket buffer has negative space
That's a legitimate scenario, although unlikely.

Reported by:	https://syzkaller.appspot.com/bug?extid=6e8be1ec8d77578a3df4
2022-08-19 12:15:38 -07:00
Mateusz Guzik
545db925c3 pipe: fix EOF case for non-blocking fds
In particular unbreaks 'go build'.
2022-08-18 21:23:53 +00:00
Gleb Smirnoff
e7d02be19d protosw: refactor protosw and domain static declaration and load
o Assert that every protosw has pr_attach.  Now this structure is
  only for socket protocols declarations and nothing else.
o Merge struct pr_usrreqs into struct protosw.  This was suggested
  in 1996 by wollman@ (see 7b187005d1), and later reiterated
  in 2006 by rwatson@ (see 6fbb9cf860).
o Make struct domain hold a variable sized array of protosw pointers.
  For most protocols these pointers are initialized statically.
  Those domains that may have loadable protocols have spacers. IPv4
  and IPv6 have 8 spacers each (andre@ dff3237ee5).
o For inetsw and inet6sw leave a comment noting that many protosw
  entries very likely are dead code.
o Refactor pf_proto_[un]register() into protosw_[un]register().
o Isolate pr_*_notsupp() methods into uipc_domain.c

Reviewed by:		melifaro
Differential revision:	https://reviews.freebsd.org/D36232
2022-08-17 11:50:32 -07:00
Gleb Smirnoff
f6dc5aa342 unix: use private enum as argument for unp_connect2()
instead of using historic PRU_ flags that are now not used by anything
rather than TCP debugging.
2022-08-17 11:50:31 -07:00
Gleb Smirnoff
81a34d374e protosw: retire pr_drain and use EVENTHANDLER(9) directly
The method was called for two different conditions: 1) the VM layer is
low on pages or 2) one of UMA zones of mbuf allocator exhausted.
This change 2) into a new event handler, but all affected network
subsystems modified to subscribe to both, so this change shall not
bring functional changes under different low memory situations.

There were three subsystems still using pr_drain: TCP, SCTP and frag6.
The latter had its protosw entry for the only reason to register its
pr_drain method.

Reviewed by:		tuexen, melifaro
Differential revision:	https://reviews.freebsd.org/D36164
2022-08-17 11:50:31 -07:00
Gleb Smirnoff
1922eb3e9c protosw: retire pr_slowtimo and pr_fasttimo
They were useful many years ago, when the callwheel was not efficient,
and the kernel tried to have as little callout entries scheduled as
possible.

Reviewed by:		tuexen, melifaro
Differential revision:	https://reviews.freebsd.org/D36163
2022-08-17 11:50:31 -07:00
Gleb Smirnoff
78b1fc05b2 protosw: separate pr_input and pr_ctlinput out of protosw
The protosw KPI historically has implemented two quite orthogonal
things: protocols that implement a certain kind of socket, and
protocols that are IPv4/IPv6 protocol.  These two things do not
make one-to-one correspondence. The pr_input and pr_ctlinput methods
were utilized only in IP protocols.  This strange duality required
IP protocols that doesn't have a socket to declare protosw, e.g.
carp(4).  On the other hand developers of socket protocols thought
that they need to define pr_input/pr_ctlinput always, which lead to
strange dead code, e.g. div_input() or sdp_ctlinput().

With this change pr_input and pr_ctlinput as part of protosw disappear
and IPv4/IPv6 get their private single level protocol switch table
ip_protox[] and ip6_protox[] respectively, pointing at array of
ipproto_input_t functions.  The pr_ctlinput that was used for
control input coming from the network (ICMP, ICMPv6) is now represented
by ip_ctlprotox[] and ip6_ctlprotox[].

ipproto_register() becomes the only official way to register in the
table.  Those protocols that were always static and unlikely anybody
is interested in making them loadable, are now registered by ip_init(),
ip6_init().  An IP protocol that considers itself unloadable shall
register itself within its own private SYSINIT().

Reviewed by:		tuexen, melifaro
Differential revision:	https://reviews.freebsd.org/D36157
2022-08-17 11:50:31 -07:00
Gleb Smirnoff
489482e276 ipsec: isolate knowledge about protocols that are last header
Retire PR_LASTHDR protosw flag.

Reviewed by:		ae
Differential revision:	https://reviews.freebsd.org/D36155
2022-08-17 08:24:28 -07:00
Mateusz Guzik
9ac6eda6c6 pipe: try to skip locking the pipe if a non-blocking fd is used
Reviewed by:	markj (previous version)
Differential Revision:	https://reviews.freebsd.org/D36082
2022-08-17 14:23:34 +00:00
Ed Maste
fbafa98a94 Disallow invalid PT_GNU_STACK
Stack must be at least readable and writable.

PR:		242570
Reviewed by:	kib, markj
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D35867
2022-08-16 15:52:21 -04:00
Mateusz Guzik
84a0be4a23 vfs: plug a dead store in kern_linkat_vp
Reported by:	clang --analyze
2022-08-16 10:46:29 +00:00
Mateusz Guzik
eb9a1f9c68 vfs: plug dead store in vn_io_fault1
Reported by:	clang --analyze
2022-08-16 10:46:20 +00:00
Dimitry Andric
9762d48b7f Adjust function definition in kern_poll.c to avoid clang 15 warning
With clang 15, the following -Werror warning is produced:

    sys/kern/kern_poll.c:374:16: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
    netisr_pollmore()
                   ^
                    void

This is because netisr_pollmore() is declared with a (void) argument
list, but defined with an empty argument list. Make the definition match
the declaration.

MFC after:	3 days
2022-08-14 21:27:34 +02:00
Alexander V. Chernikov
9b967bd65d domains: allow domains to be unloaded
Add domain_remove() SYSUNINT callback that removes the domain
 from the domain list if it has DOMF_UNLOADABLE flag set.
This change is required to support netlink ( D36002 ).

Reviewed by:	glebius
Differential Revision: https://reviews.freebsd.org/D36173
2022-08-14 09:22:33 +00:00
Gleb Smirnoff
f277746e13 protosw: change prototype for pr_control
For some reason protosw.h is used during world complation and userland
is not aware of caddr_t, a relic from the first version of C.  Broken
buildworld is good reason to get rid of yet another caddr_t in kernel.

Fixes:	886fc1e804
2022-08-12 12:08:18 -07:00
Gleb Smirnoff
948f31d7b0 netinet: do not broadcast PRC_REDIRECT_HOST on ICMP redirect
This is expensive and useless call.  It has been useless since Alexander
melifaro@ moved the forwarding table to nexthops with passive invalidation.
What happens now is that cached route in a inpcb would get invalidated
on next ip_output().

These were the last users of pfctlinput(), so garbage collect it.

Reviewed by:		melifaro
Differential revision:	https://reviews.freebsd.org/D36156
2022-08-12 08:31:29 -07:00
Gleb Smirnoff
8c77967ecc protosw: retire pr_output method
The only place to execute this method was raw_usend(). Only those
protocols that used raw socket were able to actually enter that method.
All pr_output assignments being deleted by this commit were a dead code
for many years.

Reviewed by:		melifaro
Differential revision:	https://reviews.freebsd.org/D36126
2022-08-11 09:19:37 -07:00
Andrew Turner
8e9ca1379e Adjust function definition in subr_devmap.c to avoid clang 15 warning
With clang 15, the following -Werror warning is produced:

    sys/kern/subr_devmap.c:87:19: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
    devmap_print_table()
                      ^
                       void

This is because devmap_print_table() and devmap_lastaddr() are declared
with a (void) argument list, but defined with an empty argument list.
Make the definition match the declaration.

Sponsored by:	The FreeBSD Foundation
2022-08-11 14:30:32 +01:00
Alexander V. Chernikov
102e6817f0 devd: move all devd notification logic to a separate file.
Currently, subr_bus.c shares logic for (a) maintaining all HW devices
 (e.g. discovery/attach/detach logic) and (b) generic devctl notification
 layer for devices/PMU/GEOM/interfaces/etc).
These two subsystems share really tiny interaction interface, composed of 3
 notification functions. With that in mind, move devctl layer to a
 separate file, establishing a clear notification interface between the
 sub.c bus layer and the provider (devctl).

The primary driver of this change is netlink implementation (D36002).
The idea is to propagate device-level events to netlink as well, so all
 netlink customers can subscribe to these changes.
The long-term goal is to deprecate devctl and to use netlink as the
 kernel<> userland transport provided netlink gets enough traction.

Reviewed by:	imp, markj
Differential Revision: https://reviews.freebsd.org/D36091
MFC after:	1 month
2022-08-10 18:56:01 +00:00
Gleb Smirnoff
07285bb4c2 tcp: utilize new solisten_clone() and solisten_enqueue()
This streamlines cloning of a socket from a listener.  Now we do not
drop the inpcb lock during creation of a new socket, do not do useless
state transitions, and put a fully initialized socket+inpcb+tcpcb into
the listen queue.

Before this change, first we would allocate the socket and inpcb+tcpcb via
tcp_usr_attach() as TCPS_CLOSED, link them into global list of pcbs, unlock
pcb and put this onto incomplete queue (see 6f3caa6d81).  Then, after
sonewconn() we would lock it again, transition into TCPS_SYN_RECEIVED,
insert into inpcb hash, finalize initialization of tcpcb.  And then, in
call into tcp_do_segment() and upon transition to TCPS_ESTABLISHED call
soisconnected().  This call would lock the listening socket once again
with a LOR protection sequence and then we would relocate the socket onto
the complete queue and only now it is ready for accept(2).

Reviewed by:		rrs, tuexen
Differential revision:	https://reviews.freebsd.org/D36064
2022-08-10 11:09:34 -07:00
Gleb Smirnoff
8f5a0a2e4f sockets: provide solisten_clone(), solisten_enqueue()
as alternative KPI to sonewconn().  The latter has three stages:
- check the listening socket queue limits
- allocate a new socket
- call into protocol attach method
- link the new socket into the listen queue of the listening socket

The attach method, originally designed for a creation of socket by the
socket(2) syscall has slightly different semantics than attach of a socket
cloned by listener.  Make it possible for protocols to call into the
first stage, then perform a different attach, and then call into the
final stage.  The first stage, that checks limits and clones a socket
is called solisten_clone(), and the function that enqueues the socket
is solisten_enqueue().

Reviewed by:		tuexen
Differential revision:	https://reviews.freebsd.org/D36063
2022-08-10 11:09:34 -07:00
Konstantin Belousov
00d17cf342 elf_note_prpsinfo: handle more failures from proc_getargv()
Resulting sbuf_len() from proc_getargv() might return 0 if user mangled
ps_strings enough. Also, sbuf_len() API contract is to return -1 if the
buffer overflowed. The later should not occur because get_ps_strings()
checks for catenated length, but check for this subtle detail explicitly
as well to be more resilent.

The end result is that p_comm is used in this situations.

Approved by:	so
Security:	FreeBSD-SA-22:09.elf
Reported by:	Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Reviewed by:	delphij, markj
admbugs:	988
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D35391
2022-08-09 15:44:45 -04:00
Konstantin Belousov
1b0a4974c5 thread_create(): call cpu_copy_thread() after td_pflags is zeroed
By calling the function too early we might still have the td_pflags
value cached from the previous struct thread use. cpu_copy_thread()
depends on correct value for TDP_KTHREAD at least on x86.

Reported, bisected, and tested by:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D36069
2022-08-08 19:44:17 +03:00
Gordon Bergling
fa1ac9693a vnode(9): Fix a typo in a source code comment
- s/paramater/parameter/

MFC after:	3 days
2022-08-07 16:08:43 +02:00
Ed Maste
f0687f3e0e Clarify code comments on ASLR default settings
Sponsored by:	The FreeBSD Foundation
2022-08-05 10:01:16 -04:00
Mark Johnston
d07675a935 file: Move code to share fdtol structs into kern_descrip.c
This ensures the filedesc-to-leader code is consistently encapsulated in
kern_descrip.c.

No functional change intended.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D35988
2022-08-04 09:39:25 -04:00
Konstantin Belousov
c53fec7603 sig_suspend_threads(): remove 'sending' arg
The TDA_AST flag is set on td2 unconditionally (as it was TDF_ASTPENDING
before AST rework), so it is not used practically for some time.

Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D36033
2022-08-03 16:56:23 +03:00
Konstantin Belousov
f2fd7d8bfc ast_sig(): add missed TDAI()
Mask checked was completely wrong

Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D36033
2022-08-03 16:56:23 +03:00
Mark Johnston
852695416c domain: Use designated constants for timeout periods
No functional change intended.

MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2022-08-02 20:31:29 -04:00
Konstantin Belousov
4a662c9064 ktrace: change AST handler to require AST flag set
When it was inline it made sense to depend on the existing nested check
in KTRUSERRET() rather than adding a new td_flags flag.  However, since
we now have a TDA_KTRACE flag anyway, we might as well check it and
avoid the call.

Suggested by:	jhb
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35888
2022-08-02 21:11:10 +03:00
Konstantin Belousov
c46771a7b7 kern/subr_trap.c: cleanup no longer needed headers
Also bump Foundation' copyright year

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35888
2022-08-02 21:11:10 +03:00
Konstantin Belousov
cc1ec77231 Adjust g_waitidle() visibility and definition
Explicitly pass the struct thread argument.
Move the function prototype from sys/systm.h to geom/geom.h, we do not
need almost each kernel source to see the prototype, it is now used
only by kern/vfs_mountroot.c outside geom/geom_event.c, where the
function is defined.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35888
2022-08-02 21:11:10 +03:00
Konstantin Belousov
4fced8642f sigfastblock_setpend() and fastblock_mask can be static now
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35888
2022-08-02 21:11:10 +03:00
Konstantin Belousov
c6d31b8306 AST: rework
Make most AST handlers dynamically registered.  This allows to have
subsystem-specific handler source located in the subsystem files,
instead of making subr_trap.c aware of it.  For instance, signal
delivery code on return to userspace is now moved to kern_sig.c.

Also, it allows to have some handlers designated as the cleanup (kclear)
type, which are called both at AST and on thread/process exit.  For
instance, ast(), exit1(), and NFS server no longer need to be aware
about UFS softdep processing.

The dynamic registration also allows third-party modules to register AST
handlers if needed.  There is one caveat with loadable modules: the
code does not make any effort to ensure that the module is not unloaded
before all threads processed through AST handler in it.  In fact, this
is already present behavior for hwpmc.ko and ufs.ko.  I do not think it
is worth the efforts and the runtime overhead to try to fix it.

Reviewed by:	markj
Tested by:	emaste (arm64), pho
Discussed with:	jhb
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35888
2022-08-02 21:11:09 +03:00
Alexander V. Chernikov
be1f485d7d sockets: add MSG_TRUNC flag handling for recvfrom()/recvmsg().
Implement Linux-variant of MSG_TRUNC input flag used in recv(), recvfrom() and recvmsg().
Posix defines MSG_TRUNC as an output flag, indicating packet/datagram truncation.
Linux extended it a while (~15+ years) ago to act as input flag,
resulting in returning the full packet size regarless of the input
buffer size.
It's a (relatively) popular pattern to do recvmsg( MSG_PEEK | MSG_TRUNC) to get the
packet size, allocate the buffer and issue another call to fetch the packet.
In particular, it's popular in userland netlink code, which is the primary driving factor of this change.

This commit implements the MSG_TRUNC support for SOCK_DGRAM sockets (udp, unix and all soreceive_generic() users).

PR:		kern/176322
Reviewed by:	pauamma(doc)
Differential Revision: https://reviews.freebsd.org/D35909
MFC after:	1 month
2022-07-30 18:21:51 +00:00
John Baldwin
ea8f128c7c pmap_mapdev: Consistently use vm_paddr_t for the first argument.
The devmap variants used vm_offset_t for some reason, and a few places
explicitly cast bus addresses to vm_offset_t.  (Probably those casts
along with similar casts for vm_size_t should just be removed and
instead permit the compiler to DTRT.)

Reviewed by:	markj
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D35961
2022-07-28 15:55:10 -07:00
Dimitry Andric
a387bd1b6a Adjust function definition in vfs_bio.c to avoid clang 15 warnings
With clang 15, the following -Werror warning is produced:

    sys/kern/vfs_bio.c:3430:11: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
    buf_daemon()
              ^
               void

This is because buf_daemon() is declared with a (void) argument list,
but defined with an empty argument list. Make the definition match the
declaration.

MFC after:	3 days
2022-07-26 19:59:57 +02:00
Dimitry Andric
78cfed2de7 Adjust function definitions in sysv_msg.c to avoid clang 15 warnings
With clang 15, the following -Werror warnings are produced:

    sys/kern/sysv_msg.c:213:8: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
    msginit()
           ^
            void
    sys/kern/sysv_msg.c:316:10: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
    msgunload()
             ^
              void

This is because msginit() and msgunload() are declared with (void)
argument lists, but defined with empty argument lists. Make the
definitions match the declarations.

MFC after:	3 days
2022-07-26 19:59:57 +02:00
Dimitry Andric
b54e962aca Adjust function definition in subr_bus.c to avoid clang 15 warnings
With clang 15, the following -Werror warning is produced:

    sys/kern/subr_bus.c:871:16: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
    bus_topo_assert()
                   ^
                    void

This is because bus_topo_assert() is declared with a (void) argument
list, but defined with an empty argument list. Make the definition match
the declaration.

MFC after:	3 days
2022-07-26 19:59:57 +02:00
Dimitry Andric
3c8f0790dd Adjust function definition in subr_autoconf.c to avoid clang 15 warnings
With clang 15, the following -Werror warning is produced:

    sys/kern/subr_autoconf.c:119:34: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
    run_interrupt_driven_config_hooks()
                                     ^
                                      void

This is because run_interrupt_driven_config_hooks() is declared with a
(void) argument list, but defined with an empty argument list. Make the
definition match the declaration.

MFC after:	3 days
2022-07-26 19:59:57 +02:00
Dimitry Andric
f2eb09b089 Adjust function definitions in kern_resource.c to avoid clang 15 warnings
With clang 15, the following -Werror warnings are produced:

    sys/kern/kern_resource.c:1212:10: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
    lim_alloc()
             ^
              void
    sys/kern/kern_resource.c:1365:11: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
    uihashinit()
              ^
               void

This is because lim_alloc() and uihashinit() are declared with (void)
argument lists, but defined with empty argument lists. Make the
definitions match the declarations.

MFC after:	3 days
2022-07-26 19:59:57 +02:00
Dimitry Andric
db8ea61ae2 Adjust function definitions in kern_dtrace.c to avoid clang 15 warnings
With clang 15, the following -Werror warnings are produced:

    sys/kern/kern_dtrace.c:64:18: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
    kdtrace_proc_size()
                     ^
                      void
    sys/kern/kern_dtrace.c:87:20: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
    kdtrace_thread_size()
                       ^
                        void

This is because kdtrace_proc_size() and kdtrace_thread_size() are
declared with (void) argument lists, but defined with empty argument
lists. Make the definitions match the declarations.

MFC after:	3 days
2022-07-26 19:59:57 +02:00
Dimitry Andric
9806e82a23 Adjust function definitions in kern_cons.c to avoid clang 15 warnings
With clang 15, the following -Werror warnings are produced:

    sys/kern/kern_cons.c:201:14: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
    cninit_finish()
                 ^
                  void
    sys/kern/kern_cons.c:376:7: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
    cngrab()
          ^
           void
    sys/kern/kern_cons.c:389:9: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
    cnungrab()
            ^
             void
    sys/kern/kern_cons.c:402:9: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
    cnresume()
            ^
             void

This is because cninit_finish(), cngrab(), cnungrab(), and cnresume()
are declared with (void) argument lists, but defined with empty argument
lists. Make the definitions match the declarations.

MFC after:	3 days
2022-07-26 19:59:56 +02:00
Ka Ho Ng
8c9aa94b42 Convert runtime param checks to KASSERTs for fo_fspacectl
Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D35880
2022-07-23 15:16:23 -04:00
Colin Percival
84ec7df0d7 Add kern.reboot_wait_time sysctl
Historic FreeBSD behaviour (dating back to 1994-04-02) when rebooting
is to print "Rebooting..." and then
	/* wait 1 sec for printf's to complete and be read */

Prior to April 1994, there was a 100 ms delay (added 1993-11-12).

Since (a) most users will already be aware that the system is rebooting
and do not need to take time to read an additional message to that
effect, and (b) most FreeBSD systems don't have anyone actively looking
at the console anyway, this delay no longer serves much purpose.

This commit adds a kern.reboot_wait_time sysctl which defaults to 0;
historic behaviour can be regained by setting it to 1.

Reviewed by:	imp
Relnotes:	FreeBSD now reboots faster; to restore the traditional
		wait after printing "Rebooting..." to the console, set
		kern.reboot_wait_time=1 (or more).
Sponsored by:	https://www.patreon.com/cperciva
Differential Revision:	https://reviews.freebsd.org/D35796
2022-07-18 17:23:25 -07:00
Mitchell Horne
2449b9e5fe mac: kdb/ddb framework hooks
Add three simple hooks to the debugger allowing for a loaded MAC policy
to intervene if desired:
 1. Before invoking the kdb backend
 2. Before ddb command registration
 3. Before ddb command execution

We extend struct db_command with a private pointer and two flag bits
reserved for policy use.

Reviewed by:	markj
Sponsored by:	Juniper Networks, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D35370
2022-07-18 22:06:13 +00:00
Mitchell Horne
c84c5e00ac ddb: annotate some commands with DB_CMD_MEMSAFE
This is not completely exhaustive, but covers a large majority of
commands in the tree.

Reviewed by:	markj
Sponsored by:	Juniper Networks, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D35583
2022-07-18 22:06:09 +00:00
Mark Johnston
bd980ca847 sched_ule: Ensure we hold the thread lock when modifying td_flags
The load balancer may force a running thread to reschedule and pick a
new CPU.  To do this it sets some flags in the thread running on a
loaded CPU.  But the code assumed that a running thread's lock is the
same as that of the corresponding runqueue, and there are small windows
where this is not true.  In this case, we can end up with non-atomic
modifications to td_flags.

Since this load balancing is best-effort, simply give up if the thread's
lock doesn't match; in this case the thread is about to enter the
scheduler anyway.

Reviewed by:	kib
Reported by:	glebius
Fixes:		e745d729be ("sched_ule(4): Improve long-term load balancer.")
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D35821
2022-07-18 15:52:27 -04:00
Kornel Dulęba
939f0b6323 Implement shared page address randomization
It used to be mapped at the top of the UVA.
If the randomization is enabled any address above .data section will be
randomly chosen and a guard page will be inserted in the shared page
default location.
The shared page is now mapped in exec_map_stack, instead of
exec_new_vmspace. The latter function is called before image activator
has a chance to parse ASLR related flags.
The KERN_PROC_VM_LAYOUT sysctl was extended to provide shared page
address.
The feature is enabled by default for 64 bit applications on all
architectures.
It can be toggled kern.elf64.aslr.shared_page sysctl.

Approved by:	mw(mentor)
Sponsored by:	Stormshield
Obtained from:	Semihalf
Reviewed by:	kib
Differential Revision: https://reviews.freebsd.org/D35349
2022-07-18 16:27:37 +02:00
Kornel Dulęba
361971fbca Rework how shared page related data is stored
Store the shared page address in struct vmspace.
Also instead of storing absolute addresses of various shared page
segments save their offsets with respect to the shared page address.
This will be more useful when the shared page address is randomized.

Approved by:	mw(mentor)
Sponsored by:	Stormshield
Obtained from:	Semihalf
Reviewed by:	kib
Differential Revision: https://reviews.freebsd.org/D35393
2022-07-18 16:27:32 +02:00
Kornel Dulęba
f6ac79fb12 Introduce the PROC_SIGCODE() macro
Use a getter macro instead of fetching the sigcode address directly
from a sysent of a given process. It assumes that the sigcode is stored
in the shared page, which is true in all cases, except for a.out
binaries. This will be later useful when the shared page address
randomization is introduced.
No functional change intended.

Approved by:	mw(mentor)
Sponsored by:	Stormshield
Obtained from:	Semihalf
Reviewed by:	kib
Differential Revision: https://reviews.freebsd.org/D35392
2022-07-18 16:27:26 +02:00
Mark Johnston
46eab86035 callout: Simplify the inner loop in callout_process() a bit
- Use LIST_FOREACH_SAFE.
- Simplify control flow.

No functional change intended.

MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2022-07-17 13:58:19 -04:00
Mark Johnston
aac7c7ac54 callout: Remove a redundant parameter to callout_cc_add()
The passed cpuid is always equal to the one stored in the callout
structure.  No functional change intended.

MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2022-07-17 13:58:19 -04:00
Mateusz Guzik
6eeba7dbd6 ule: unbreak UP builds
Sponsored by:	Rubicon Communications, LLC ("Netgate")
2022-07-16 12:45:09 +00:00
Dmitry Chagin
fc90f3a281 ktrace: Increase precision of timestamps.
Replace struct timeval in header with struct timespec.
To differentiate header formats, add a new KTR_VERSIONED flag
set in the header type field similar to the existing KTRDROP flag.

To make it easier to extend ktrace headers in the future,
extend the existing header with a version field (version 0 is
reserved for older records without KTR_VERSIONED) as well as
new fields holding the thread ID and CPU ID.

Reviewed by:		jhb, pauamma
Differential Revision:	https://reviews.freebsd.org/D35774
MFC after:		2 weeks
2022-07-16 12:46:12 +03:00
John Baldwin
2cf7870864 Collapse interrupt thread priorities.
Allow high priority hardware interrupts to run at PI_REALTIME via
INTR_TYPE_CLK, but collapse all other hardware interrupt threads to
the next priority level (PI_INTR).  Collapse all SWI priorities to
the same priority level (PI_SOFT) just below PI_INTR.

Reviewed by:	kib, markj
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D35646
2022-07-14 13:14:33 -07:00
John Baldwin
40efe74352 4bsd: Simplistic time-sharing for interrupt threads.
If an interrupt thread runs for a full quantum without yielding the
CPU, demote its priority and schedule a preemption to give other
ithreads a turn.

Reviewed by:	kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D35645
2022-07-14 13:14:17 -07:00
John Baldwin
954cffe95d ule: Simplistic time-sharing for interrupt threads.
If an interrupt thread runs for a full quantum without yielding the
CPU, demote its priority and schedule a preemption to give other
ithreads a turn.

Reviewed by:	kib, markj
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D35644
2022-07-14 13:13:57 -07:00
John Baldwin
ed998d1c24 ithreads: Support priority adjustment by schedulers.
Use sched_wakeup instead of sched_add when marking an ithread
runnable.  This allows schedulers to reset their internal time slice
tracking state and restore the base ithread priority when an ithread
resumes from idle.

Reviewed by:	markj
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D35643
2022-07-14 13:13:35 -07:00
John Baldwin
fea89a2804 Add sched_ithread_prio to set the base priority of an interrupt thread.
Use it instead of sched_prio when setting the priority of an interrupt
thread.

Reviewed by:	kib, markj
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D35642
2022-07-14 13:13:10 -07:00
Mark Johnston
6cbc4ceb7a sched_ule: Use the correct atomic_load variant for tdq_lowpri
Reported by:	tuexen
Fixes:	11484ad8a2 ("sched_ule: Use explicit atomic accesses for tdq fields")
2022-07-14 15:34:02 -04:00
Mark Johnston
11484ad8a2 sched_ule: Use explicit atomic accesses for tdq fields
Different fields in the tdq have different synchronization protocols.
Some are constant, some are accessed only while holding the tdq lock,
some are modified with the lock held but accessed without the lock, some
are accessed only on the tdq's CPU, and some are not synchronized by the
lock at all.

Convert ULE to stop using volatile and instead use atomic_load_* and
atomic_store_* to provide the desired semantics for lockless accesses.
This makes the intent of the code more explicit, gives more freedom to
the compiler when accesses do not need to be qualified, and lets KCSAN
intercept unlocked accesses.

Thus:
- Introduce macros to provide unlocked accessors for certain fields.
- Use atomic_load/store for all accesses of tdq_cpu_idle, which is not
  synchronized by the mutex.
- Use atomic_load/store for accesses of the switch count, which is
  updated by sched_clock().
- Add some comments to fields of struct tdq describing how accesses are
  synchronized.

No functional change intended.

Reviewed by:	mav, kib
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D35737
2022-07-14 10:45:33 -04:00
Mark Johnston
0927ff7814 sched_ule: Enable preemption of curthread in the load balancer
The load balancer executes from statclock and periodically tries to move
threads among CPUs in order to balance load.  It may move a thread to
the current CPU (the loader balancer always runs on CPU 0).  When it
does so, it may need to schedule preemption of the interrupted thread.
Use sched_setpreempt() to do so, same as sched_add().

PR:		264867
Reviewed by:	mav, kib, jhb
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D35744
2022-07-14 10:27:58 -04:00
Mark Johnston
6d3f74a14a sched_ule: Fix racy loads of pc_curthread
Thread switching used to be atomic with respect to the current CPU's
tdq lock.  Since commit 686bcb5c14 that is no longer the case.  Now
sched_switch() does this:

1.  lock tdq (might already be locked)
2.  maybe put the current thread in the tdq, choose a new thread to run
2a. update tdq_lowpri
3.  unlock tdq
4.  switch CPU context, update curthread

Some code paths in ULE will load pc_curthread from a remote CPU with
that CPU's tdq lock held, usually to inspect its priority.  But, as of
the aforementioned commit this is racy.

The problem I noticed is in tdq_notify(), which optionally sends an IPI
to a remote CPU when a new thread is added to its runqueue.  If the new
thread's priority is higher (lower) than the currently running thread's
priority, then we deliver an IPI.  But inspecting
pc_curthread->td_priority doesn't work, since pc_curthread might be
between steps 3 and 4 above.  If pc_curthread's priority is higher than
that of the newly added thread, but pc_curthread is switching to a
lower-priority thread, then tdq_notify() might fail to deliever an IPI,
leaving a high priority thread stuck on the runqueue for longer than it
should.  This can cause multi-millisecond stalls in
interactive/ithread/realtime threads.

Fix this problem by modifying tdq_add() and tdq_move() to return the
value of tdq_lowpri before the addition of the new thread.  This ensures
that tdq_notify() has the correct priority value to compare against.

The other two uses of pc_curthread are susceptible to the same race.  To
fix the one in sched_rem()->tdq_setlowpri() we need to have an exact
value for curthread.  Thus, introduce a new tdq_curthread field to the
tdq which gets updated any time a new thread is selected to run on the
CPU.  Because this field is synchronized by the thread lock, its
priority reflects the correct lowpri value for the tdq.

PR:		264867
Fixes:		686bcb5c14 ("schedlock 4/4")
Reviewed by:	mav, kib, jhb
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D35736
2022-07-14 10:27:51 -04:00
Mark Johnston
ef221ff645 time: Make realitexpire() local to kern_time.c
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2022-07-13 09:57:28 -04:00
Mark Johnston
38e1d32dab callout: Simplify cpuid validation in callout_reset_sbt_on()
- Remove a flag variable.
- Convert a runtime check of the passed cpuid to a KASSERT.
- Remove the cc_inited flag.  An attempt to schedule a callout before
  SI_SUB_CPU will crash anyway since the per-CPU mutexes won't have been
  initialized, and that flag was only checked in the case where a cpuid
  was explicitly specified by the caller.

No functional change intended.

MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2022-07-13 09:47:33 -04:00
Mark Johnston
ece453d5fa eventtimer: Simplify KTR traces
Stop including the current CPU in all event messages, since it's already
saved in KTR log entries and thus is redundant.  All eventtimer traces
occur in a context where CPU migration is not possible.

MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2022-07-11 15:58:43 -04:00
Mark Johnston
a889a65ba3 eventtimer: Fix several races in the timer reload code
In handleevents(), lock the timer state before fetching the time for the
next event.  A concurrent callout_cc_add() call might be changing the
next event time, and the race can cause handleevents() to program an
out-of-date time, causing the callout to run later (by an unbounded
period, up to the idle hardclock period of 1s) than requested.

In cpu_idleclock(), call getnextcpuevent() with the timer state mutex
held, for similar reasons.  In particular, cpu_idleclock() runs with
interrupts enabled, so an untimely timer interrupt can result in a stale
next event time being programmed.  Further, an interrupt can cause
cpu_idleclock() to use a stale value for "now".

In cpu_activeclock(), disable interrupts before loading "now", so as to
avoid going backwards in time when calling handleevents().  It's ok to
leave interrupts enabled when checking "state->idle", since the race at
worst will cause handleevents() to be called unnecessarily.  But use an
atomic load to indicate that the test is racy.

PR:		264867
Reviewed by:	mav, jhb, kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D35735
2022-07-11 15:58:43 -04:00
Mark Johnston
ebb3cb6195 eventtimer: Pass a pcpu state pointer to getnext(cpu)event()
Callers have already loaded the pointer, so these functions don't need
to fetch it again.

No functional change intended.

MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2022-07-11 15:58:43 -04:00
Mark Johnston
ba71333f60 sched_ule: Fix a typo in a comment
PR:		226107
MFC after:	1 week
2022-07-11 15:58:43 -04:00
Mark Johnston
ef80894c9d sched_ule: Purge an obsolete comment
The referenced bitmask was removed in commit 62fa74d95a.

MFC after:	 1 week
Sponsored by:	The FreeBSD Foundation
2022-07-11 15:58:43 -04:00
Mark Johnston
35dd6d6cb5 sched_ule: Eliminate a superfluous local variable in tdq_move()
No functional change intended.

MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2022-07-11 15:58:43 -04:00
Gleb Smirnoff
c261510ef5 sockets: fix setsockopt(SO_RCVTIMEO) on a listening socket
MFC after:	3 weeks
2022-07-08 11:33:24 -07:00
Mitchell Horne
258958b3c7 ddb: use _FLAGS command macros where appropriate
Some command definitions were forced to use DB_FUNC in order to specify
their required flags, CS_OWN or CS_MORE. Use the new macros to simplify
these.

Reviewed by:	markj, jhb
MFC after:	3 days
Sponsored by:	Juniper Networks, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D35582
2022-07-05 11:56:55 -03:00
Gleb Smirnoff
d8596171c5 sockets: use only soref()/sorele() as socket reference count
o Retire SS_FDREF as it is basically a debug flag on top of already
  existing soref()/sorele().
o Convert SS_PROTOREF into soref()/sorele().
o Change reference model for the listen queues, see below.
o Make sofree() private.  The correct KPI to use is only sorele().
o Make soabort() respect the model and sorele() instead of sofree().

Note on listening queues.  Until now the sockets on a queue had zero
reference count.  And the reference were given only upon accept(2).  The
assumption was that there is no way to see the queued socket from anywhere
except its head.  This is not true, since queued sockets already have pcbs,
which are linked at least into the global pcb lists.  With this change we
put the reference right in the sonewconn() and on accept(2) path we just
hand the existing reference to the file descriptor.

Differential revision:	https://reviews.freebsd.org/D35679
2022-07-04 12:40:51 -07:00
Gleb Smirnoff
bc7605647c sockets: use positive flag for file descriptor socket reference
Rename SS_NOFDREF to SS_FDREF and flip all bitwise operations.
Mark sockets created by socreate() with SS_FDREF.

This change is mostly illustrative. With it we see that SS_FDREF
is a debugging flag, since:
* socreate() takes a reference with soref().
* on accept path solisten_dequeue() takes a reference
  with soref() and then soaccept() sets SS_FDREF.
* soclose() checks SS_FDREF, removes it and does sorele().

Reviewed by:		tuexen
Differential revision:	https://reviews.freebsd.org/D35678
2022-07-04 12:40:51 -07:00
Warner Losh
b69996d1d5 tty: Default to printing kernel stack traceback only on INVARIANT kernels
Change the default from printing a breif kernel thread stack informaton
back to omitting it for non-invariant kernels in response to
SIGINFO/^T. Full and brief stack support can be selected with the
kern.tty_info_kstacks sysctl.

MFC After:		2 weeks
Sponsored by:		Netflix
Reviewed by:		grembo, jhb
Differential Revision:	https://reviews.freebsd.org/D35576
2022-07-02 08:02:12 -06:00
John Baldwin
0bd73da206 busdma_bounce: Use PRI_ITHD scheduling class for worker thread.
Reviewed by:	kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D35641
2022-06-30 10:06:04 -07:00
John Baldwin
0288d4277f Add register sets for NT_THRMISC and NT_PTLWPINFO.
For the kernel this is mostly a non-functional change.  However, this
will be useful for simplifying gcore(1).

Reviewed by:	markj
MFC after:	2 weeks
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D35666
2022-06-30 10:04:56 -07:00
Gleb Smirnoff
66c8e3fccf socket: fix listen(2) on an already listening socket
Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35669
Fixes:			141fe2dcee
2022-06-30 07:50:29 -07:00
Konstantin Belousov
ad175a107b vfs_mount.c: convert explicit panics and KASSERTs to MPASSERT/MPPASS
Reviewed by:	imp, mjg
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35652
2022-06-29 21:31:47 +03:00
Konstantin Belousov
1e54362824 vfs_op_exit(): assert that mnt_vfs_ops stays non-zero for unmount or suspend
Reviewed by:	mjg
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35639
2022-06-29 21:31:47 +03:00
Jamie Gritton
7060da62ff jail: Remove a prison's shared memory when it dies
Add shm_remove_prison(), that removes all POSIX shared memory segments
belonging to a prison.  Call it from prison_cleanup() so a prison
won't be stuck in a dying state due to the resources still held.

PR:		257555
Reported by:	grembo
2022-06-29 10:47:39 -07:00
Jamie Gritton
a9f7455c38 jail: add prison_cleanup() to release resources held by a dying jail
Currently, when a jail starts dying, either by losing its last user
reference or by being explicitly killed,
osd_jail_call(...PR_METHOD_REMOVE...) is called.  Encapsulate this
into a function prison_cleanup() that can then do other cleanup.
2022-06-29 10:33:05 -07:00
Gleb Smirnoff
48a55bbfe9 unix: change error code for recvmsg() failed due to RLIMIT_NOFILE
Instead of returning EMSGSIZE pass the error code from fdallocn() directly
to userland.  That would be EMFILE, which makes much more sense.  This
error code is not listed in the specification[1], but the specification
doesn't cover such edge case at all.  Meanwhile the specification lists
EMSGSIZE as the error code for invalid value of msg_iovlen, and FreeBSD
follows that, see sys_recmsg().  Differentiating these two cases will make
a developer/admin life much easier when debugging.

[1] https://pubs.opengroup.org/onlinepubs/9699919799/functions/recvmsg.html

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35640
2022-06-29 09:42:58 -07:00
Kristof Provost
ab91feabcc ovpn: Introduce OpenVPN DCO support
OpenVPN Data Channel Offload (DCO) moves OpenVPN data plane processing
(i.e. tunneling and cryptography) into the kernel, rather than using tap
devices.
This avoids significant copying and context switching overhead between
kernel and user space and improves OpenVPN throughput.

In my test setup throughput improved from around 660Mbit/s to around
2Gbit/s.

Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D34340
2022-06-28 11:33:10 +02:00
Mateusz Guzik
7388fb714a cache: drop the vfs.cache_rename_add tunable
The functionality has been in use since Jan 2021 -- long enough(tm).
2022-06-27 09:56:20 +02:00
Gleb Smirnoff
458f475df8 unix/dgram: smart socket buffers for one-to-many sockets
A one-to-many unix/dgram socket is a socket that has been bound
with bind(2) and can get multiple connections.  A typical example
is /var/run/log bound by syslogd(8) and receiving multiple
connections from libc syslog(3) API.  Until now all of these
connections shared the same receive socket buffer of the bound
socket.  This made the socket vulnerable to overflow attack.
See 240d5a9b1c for a historical attempt to workaround the problem.

This commit creates a per-connection socket buffer for every single
connected socket and eliminates the problem.  The new behavior will
optimize seldom writers over frequent writers.  See added test case
scenarios and code comments for more detailed description of the
new behavior.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35303
2022-06-24 09:09:11 -07:00
Gleb Smirnoff
1093f16487 unix/dgram: reduce mbuf chain traversals in send(2) and recv(2)
o Use m_pkthdr.memlen from m_uiotombuf()
o Modify unp_internalize() to keep track of allocated space and memory
  as well as pointer to the last buffer.
o Modify unp_addsockcred() to keep track of allocated space and memory
  as well as pointer to the last buffer.
o Record the datagram len/memlen/ctllen in the first (from) mbuf of the
  chain in uipc_sosend_dgram() and reuse it in uipc_soreceive_dgram().

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35302
2022-06-24 09:09:11 -07:00
Gleb Smirnoff
9b841b0e23 m_uiotombuf: write total memory length of the allocated chain in pkthdr
Data allocated by m_uiotombuf() usually goes into a socket buffer.
We are interested in the length of useful data to be added to sb_acc,
as well as total memory used by mbufs.  The later would be added to
sb_mbcnt.  Calculating this value at allocation time allows to save
on extra traversal of the mbuf chain.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35301
2022-06-24 09:09:11 -07:00
Gleb Smirnoff
a7444f807e unix/dgram: use minimal possible socket buffer for PF_UNIX/SOCK_DGRAM
This change fully splits away PF_UNIX/SOCK_DGRAM from other socket
buffer implementations, without any behavior changes.

Generic socket implementation is reduced down to one STAILQ and very
little code.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35300
2022-06-24 09:09:11 -07:00
Gleb Smirnoff
a4fc41423f sockets: enable protocol specific socket buffers
Split struct sockbuf into common shared fields and protocol specific
union, where protocols are free to implement whatever buffer they
want.  Such protocols should mark themselves with PR_SOCKBUF and are
expected to initialize their buffers in their pr_attach and tear
them down in pr_detach.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35299
2022-06-24 09:09:10 -07:00
Gleb Smirnoff
315167c0de unix: provide an option to return locked from unp_connectat()
Use this new version in unix/dgram socket when sending to a target
address.  This removes extra lock release/acquisition and possible
counter-intuitive ENOTCONN.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35298
2022-06-24 09:09:10 -07:00
Gleb Smirnoff
5dc8dd5f3a unix/dgram: inline sbappendaddr_locked() into uipc_sosend_dgram()
This allows to remove one M_NOWAIT allocation and also makes it
more clear what's going on.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35297
2022-06-24 09:09:10 -07:00
Gleb Smirnoff
e3fbbf965e unix/dgram: add a specific receive method - uipc_soreceive_dgram
With this second step PF_UNIX/SOCK_DGRAM has protocol specific
implementation.  This gives some possibility performance
optimizations.  However, it still operates on the same struct
socket as all other sockets do.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35296
2022-06-24 09:09:10 -07:00
Gleb Smirnoff
f384a97c83 unix/dgram: cleanup uipc_send of PF_UNIX/SOCK_DGRAM, step 2
Just remove one level of indentation as the case clause always match.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35295
2022-06-24 09:09:10 -07:00
Gleb Smirnoff
7e5b6b391e unix/dgram: cleanup uipc_send of PF_UNIX/SOCK_DGRAM, step 1
Remove the dead code.  The new uipc_sosend_dgram() handles send()
on PF_UNIX/SOCK_DGRAM in full.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35294
2022-06-24 09:09:10 -07:00
Gleb Smirnoff
3464958246 unix/dgram: add a specific send method - uipc_sosend_dgram()
This is first step towards splitting classic BSD socket
implementation into separate classes.  The first to be
split is PF_UNIX/SOCK_DGRAM as it has most differencies
to SOCK_STREAM sockets and to PF_INET sockets.

Historically a protocol shall provide two methods for sendmsg(2):
pru_sosend and pru_send.  The former is a generic send method,
e.g. sosend_generic() which would internally call the latter,
uipc_send() in our case.  There is one important exception, though,
the sendfile(2) code will call pru_send directly.  But sendfile
doesn't work on SOCK_DGRAM, so we can do the trick.  We will create
socket class specific uipc_sosend_dgram() which will carry only
important bits from sosend_generic() and uipc_send().

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35293
2022-06-24 09:09:10 -07:00
Mitchell Horne
29afffb942 subr_bus: restore bus_null_rescan()
Partially revert the previous change; we need to keep this method as a
specific override for pci_driver subclasses which should not use
pci_rescan_method() -- cardbus and ofw_pcibus. However, change the return
value to ENODEV for the same reasoning given in the original commit, and
use this as the default rescan method in bus_if.m.

Reported by:	jhb
Fixes:		36a8572ee8 ("bus_if: provide a default null rescan method")
MFC with:	36a8572ee8
2022-06-23 16:07:00 -03:00
Mitchell Horne
8701571df9 set_cputicker: use a bool
The third argument to this function indicates whether the supplied
ticker is fixed or variable, i.e. requiring calibration. Give this
argument a type and name that better conveys this purpose.

Reviewed by:	kib, markj
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D35459
2022-06-23 15:15:11 -03:00
Mitchell Horne
36a8572ee8 bus_if: provide a default null rescan method
There is an existing helper method in subr_bus.c, but almost no drivers
know to use it. It also returns the same error as an empty method,
making it not very useful. Move this to bus_if.m and return a more
sensible error code.

This gives a slightly more meaningful error message when attempting
'devctl rescan' on buses and devices alike:
  "Device not configured" --> "Operation not supported by device"

Reviewed by:	imp
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D35501
2022-06-23 15:15:10 -03:00
Chuck Silvers
5bd21cbbd1 vfs: fix vfs_bio_clrbuf() for PAGE_SIZE > block size
Calculate the desired page valid mask using math that will not
overflow the types used.

Sponsored by:	Netflix

Reviewed by:	mckusick, kib, markj
Differential Revision:	https://reviews.freebsd.org/D34837
2022-06-21 17:58:52 -07:00
Mark Johnston
9553bc89db aio: Improve UMA usage
- Remove the AIO proc zone.  This zone gets one allocation per AIO
  daemon process, which isn't enough to warrant a dedicated zone.  Plus,
  unlike other AIO structures, aiops are small (32 bytes with LP64), so
  UMA doesn't provide better space efficiency than malloc(9).  Change
  one of the malloc types in vfs_aio.c to make it more general.

- Don't set the NOFREE flag on the other AIO zones.  This flag means
  that memory allocated to the AIO subsystem is never freed back to the
  VM, so it's always preferable to avoid using it when possible.  NOFREE
  was set without explanation when AIO was converted to use UMA 20 years
  ago, but it does not appear to be required; all of the structures
  allocated from UMA (per-process kaioinfo, kaiocb, and aioliojob) keep
  track of references and get freed only when none exist.  Plus, these
  structures will contain dangling pointer after they're freed (e.g.,
  the "cred", "fd_file" and "uiop" fields of struct kaiocb), so
  use-after-frees are dangerous even when the structures themselves are
  type-stable.

Reviewed by:	asomers
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D35493
2022-06-20 12:48:13 -04:00
Damjan Jovanovic
8c309d48aa struct kinfo_file changes needed for lsof to work using only usermode APIs`
Add kf_pipe_buffer_[in/out/size] fields to kf_pipe, and populate them.

Add a kf_kqueue struct to the kf_un union, to allow querying kqueue state,
and populate it.

Populate the kf_sock_rcv_sb_state and kf_sock_snd_sb_state fields in
kf_sock for INET/INET6 sockets, and populate all other fields for all
transport layer protocols, not just TCP.

Bump __FreeBSD_version.

Differential revision:	https://reviews.freebsd.org/D34184
Reviewed by:	jhb, kib, se
MFC after:	1 week
2022-06-18 12:34:25 +03:00
Damjan Jovanovic
8ae7694913 KERN_LOCKF: report kl_file_fsid consistently with stat(2)
PR:	264723
Reviewed by:	kib
Discussed with:	markj
MFC after:	1 week
2022-06-18 12:34:17 +03:00
Mark Johnston
f6379f7fde socket: Fix a race between kevent(2) and listen(2)
When locking the knote list for a socket, we check whether the socket is
a listening socket in order to select the appropriate mutex; a listening
socket uses the socket lock, while data sockets use socket buffer
mutexes.

If SOLISTENING(so) is false and the knote lock routine locks a socket
buffer, then it must re-check whether the socket is a listening socket
since solisten_proto() could have changed the socket's identity while we
were blocked on the socket buffer lock.

Reported by:	syzkaller
Reviewed by:	glebius
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D35492
2022-06-16 10:20:04 -04:00
Mark Johnston
756bc3adc5 kasan: Create a shadow for the bootstack prior to hammer_time()
When the kernel is compiled with -asan-stack=true, the address sanitizer
will emit inline accesses to the shadow map.  In other words, some
shadow map accesses are not intercepted by the KASAN runtime, so they
cannot be disabled even if the runtime is not yet initialized by
kasan_init() at the end of hammer_time().

This went unnoticed because the loader will initialize all PML4 entries
of the bootstrap page table to point to the same PDP page, so early
shadow map accesses do not raise a page fault, though they are silently
corrupting memory.  In fact, when the loader does not copy the staging
area, we do get a page fault since in that case only the first and last
PML4Es are populated by the loader.  But due to another bug, the loader
always treated KASAN kernels as non-relocatable and thus always copied
the staging area.

It is not really practical to annotate hammer_time() and all callees
with __nosanitizeaddress, so instead add some early initialization which
creates a shadow for the boot stack used by hammer_time().  This is only
needed by KASAN, not by KMSAN, but the shared pmap code handles both.

Reported by:	mhorne
Reviewed by:	kib
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D35449
2022-06-15 11:39:10 -04:00
Doug Ambrisko
ce00b11940 mount: revert the active vnode reporting feature
Revert the computing of active vnode reporting since statfs is used
by a lot of tools.  Only report the vnodes used.

Reported by:	mjg
2022-06-15 07:24:55 -07:00
Mark Johnston
7565431f30 mount: Fix an incorrect assertion in kernel_mount()
The pointer to the mount values may be null if an error occurred while
copying them in, so fix the assertion condition to reflect that
possibility.

While here, move some initialization code into the error == 0 block.  No
functional change intended.

Reported by:	syzkaller
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2022-06-14 12:00:59 -04:00
Mark Johnston
630f633f2a vm_object: Use the vm_object_(set|clear)_flag() helpers
... rather than setting and clearing flags inline.  No functional change
intended.

Reviewed by:	alc, kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D35469
2022-06-14 12:00:59 -04:00
Mark Johnston
e8955bd643 pipe: Use a distinct wait channel for I/O serialization
Suppose a thread tries to read from an empty pipe.  pipe_read() does the
following:

1. pipelock(), possibly sleeping
2. check for buffered data
3. pipeunlock()
4. set PIPE_WANTR and sleep
5. goto 1

pipelock() is an open-coded mutex; if a thread blocks in pipelock(), it
sleeps until the lock holder calls pipeunlock().

Both sleeps use the same wait channel.  So if there are multiple threads
in pipe_read(), a thread T1 in step 3 can wake up a thread T2 sleeping
in step 4.  Then T1 goes to sleep in step 4, and T2 acquires and
releases the pipelock, waking up T1 again.  This can go on indefinitely,
livelocking the process (and potentially starving a would-be writer).

Fix the problem by using a separate wait channel for pipelock().

Reported by:	Paul Floyd <paulf2718@gmail.com>
Reviewed by:	mjg, kib
PR:		264441
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D35415
2022-06-14 12:00:59 -04:00
Cy Schubert
d781401512 kern_thread.c: Fix i386 build
Chase 4493a13e3b by updating static
assertions of struct proc.
2022-06-13 19:35:33 -07:00
Konstantin Belousov
1575804961 reap_kill_proc(): avoid singlethreading any other process if we are exiting
This is racy because curproc process lock is not used, but allows the
process to exit faster.  It is userspace issue to create such race
anyway, and not fullfilling the guarantee that all reaper descendants
are signalled should be fine.

In collaboration with:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D35310
2022-06-13 22:30:03 +03:00
Konstantin Belousov
e0343eacf3 reap_kill_subtree(): hold the reaper when entering it into the queue to handle later
We drop proctree_lock, which allows the process to exit while memoized
in the list to proceed.

Reported and tested by:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D35310
2022-06-13 22:30:03 +03:00
Konstantin Belousov
1d4abf2cfa reap_kill_subtree_once(): handle proctree_lock unlock in reap_kill_proc()
Recorded reaper might loose its reaper status, so we should not assert
it, but check and avoid signalling if this happens.

Reported and tested by:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 week
Differential revision:	https://reviews.freebsd.org/D35310
2022-06-13 22:30:03 +03:00
Konstantin Belousov
addf103ce6 reap_kill_proc: do not retry on thread_single() failure
The failure means that the process does single-threading itself, which
makes our action not needed.

Reported and tested by:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D35310
2022-06-13 22:30:03 +03:00
Konstantin Belousov
008b2e6544 Make stop_all_proc_block interruptible to avoid deadlock with parallel suspension
If we try to single-thread a process which thread entered
procctl(REAP_KILL_SUBTREE), and sleeping waiting for us unlocking
stop_all_proc_blocker, we must be able to finish single-threading.  This
requires the sleep to be interruptible.

Reported and tested by:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D35310
2022-06-13 22:30:03 +03:00
Mark Johnston
2d5ef216b6 thread_single_end(): consistently maintain p_boundary_count for ALLPROC mode
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 week
Differential revision:	https://reviews.freebsd.org/D35310
2022-06-13 22:30:03 +03:00
Konstantin Belousov
1b4701fe1e thread_unsuspend(): do not unuspend the suspended leader thread doing SINGLE_ALLPROC
markj wrote:
tdsendsignal() may unsuspend a target thread. I think there is at least
one bug there: suppose thread T is suspended in
thread_single(SINGLE_ALLPROC) when trying to kill another process with
REAP_KILL. Suppose a different thread sends SIGKILL to T->td_proc. Then,
tdsendsignal() calls thread_unsuspend(T, T->td_proc). thread_unsuspend()
incorrectly decrements T->td_proc->p_suspcount to -1.

Later, when T->td_proc exits, it will wait forever in
thread_single(SINGLE_EXIT) since T->td_proc->p_suspcount never reaches 1.

Since the thread suspension is bounded by time needed to do
thread_single(), skipping the thread_unsuspend_one() call there should
not affect signal delivery if this thread is selected as target.

Reported by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D35310
2022-06-13 22:30:03 +03:00
Konstantin Belousov
b9009b1789 thread_single(): remove already checked conditional expression
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D35310
2022-06-13 22:30:03 +03:00
Konstantin Belousov
4493a13e3b Do not single-thread itself when the process single-threaded some another process
Since both self single-threading and remote single-threading rely on
suspending the thread doing thread_single(), it cannot be mixed: thread
doing thread_suspend_switch() might be subject to thread_suspend_one()
and vice versa.

In collaboration with:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D35310
2022-06-13 22:30:03 +03:00
Konstantin Belousov
dd883e9a7e weed_inhib(): correct the condition to re-suspend a thread
suspended for SINGLE_ALLPROC mode.  There is no need to check for
boundary state.  It is only required to see that the suspension comes
from the ALLPROC mode.

In collaboration with:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D35310
2022-06-13 22:30:03 +03:00
Konstantin Belousov
b9893b3533 weed_inhib(): do not double-suspend already suspended thread if the loop reiterates
In collaboration with:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D35310
2022-06-13 22:30:02 +03:00
Konstantin Belousov
d7a9e6e740 thread_single: wait for P_STOPPED_SINGLE to pass
to avoid ALLPROC mode to try to race with any other single-threading
mode.

In collaboration with:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D35310
2022-06-13 22:30:02 +03:00
Konstantin Belousov
02a2aacbe2 issignal(): ignore signals when process is single-threading for exit
Places that will wait for curproc->p_singlethr to become zero (in the
next commit, the counter of number of external single-threading is
to be introduced), must wait for it interruptible, otherwise we
deadlock.  On the other hand, a signal delivered during this window,
if directed to the waiting thread, would cause the wait loop to become
a busy loop.

Since we are exiting, it is safe to ignore the signals.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D35310
2022-06-13 22:30:02 +03:00
Konstantin Belousov
d3000939c7 P2_WEXIT: avoid thread_single() for exiting process earlier
before the process itself does thread_single(SINGLE_EXIT).  We cannot
single-thread such process in ALLPROC (external) mode, and properly
detect and report the failure to do so due to the process becoming
zombie is easier to prevent than handle.

In collaboration with:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D35310
2022-06-13 22:30:02 +03:00
Doug Ambrisko
6468cd8e0e mount: add vnode usage per file system with mount -v
This avoids the need to drop into the ddb to figure out vnode
usage per file system.  It helps to see if they are or are not
being freed.  Suggestion to report active vnode count was from
kib@

Reviewed by:   	kib
Differential Revision: https://reviews.freebsd.org/D35436
2022-06-13 07:56:38 -07:00
Hans Petter Selasky
b8394039dc mbuf(9): Fix size of mbuf for all 32-bit platforms (i386, ARM, PowerPC and RISCV)
Do this by reducing the size of the MBUF_PEXT_MAX_PGS, causing "struct mbuf" to
be bigger than M_SIZE, and also add a missing padding field to ensure 64-bit
alignment.

Reviewed by:	gallatin@
Reported by:	Elliott Mitchell
Differential revision:	https://reviews.freebsd.org/D35339
MFC after:	1 week
Sponsored by:	NVIDIA Networking
2022-06-07 22:09:10 +02:00
Hans Petter Selasky
fe8c78f0d2 ktls: Add full support for TLS RX offloading via network interface.
Basic TLS RX offloading uses the "csum_flags" field in the mbuf packet
header to figure out if an incoming mbuf has been fully offloaded or
not. This information follows the packet stream via the LRO engine, IP
stack and finally to the TCP stack. The TCP stack preserves the mbuf
packet header also when re-assembling packets after packet loss. When
the mbuf goes into the socket buffer the packet header is demoted and
the offload information is transferred to "m_flags" . Later on a
worker thread will analyze the mbuf flags and decide if the mbufs
making up a TLS record indicate a fully-, partially- or not decrypted
TLS record. Based on these three cases the worker thread will either
pass the packet on as-is or recrypt the decrypted bits, if any, or
decrypt the packet as usual.

During packet loss the kernel TLS code will call back into the network
driver using the send tag, informing about the TCP starting sequence
number of every TLS record that is not fully decrypted by the network
interface. The network interface then stores this information in a
compressed table and starts asking the hardware if it has found a
valid TLS header in the TCP data payload. If the hardware has found a
valid TLS header and the referred TLS header is at a valid TCP
sequence number according to the TCP sequence numbers provided by the
kernel TLS code, the network driver then informs the hardware that it
can resume decryption.

Care has been taken to not merge encrypted and decrypted mbuf chains,
in the LRO engine and when appending mbufs to the socket buffer.

The mbuf's leaf network interface pointer is used to figure out from
which network interface the offloading rule should be allocated. Also
this pointer is used to track route changes.

Currently mbuf send tags are used in both transmit and receive
direction, due to convenience, but may get a new name in the future to
better reflect their usage.

Reviewed by:	jhb@ and gallatin@
Differential revision:	https://reviews.freebsd.org/D32356
Sponsored by:	NVIDIA Networking
2022-06-07 12:58:09 +02:00
Hans Petter Selasky
f0fca64618 ktls: Refer send tag pointer once.
So that the asserts and the actual code see the same values.

Differential revision:	https://reviews.freebsd.org/D32356
MFC after:	1 week
Sponsored by:	NVIDIA Networking
2022-06-07 12:57:03 +02:00
Hans Petter Selasky
4d88d81c31 mbuf(9): Implement a leaf network interface field in the mbuf packet header.
When packets are received they may traverse several network interfaces like
vlan(4) and lagg(9). When doing receive side offloads it is important to
know the first network interface entry point, because that is where all
offloading is taking place. This makes it possible to track receive
side route changes for multiport setups, for example when lagg(9) receives
traffic from more than one port. This avoids having to install multiple
offloading rules for the same stream.

This field works similar to the existing "rcvif" mbuf packet header field.

Submitted by:	jhb@
Reviewed by:	gallatin@ and gnn@
Differential revision:	https://reviews.freebsd.org/D35339
Sponsored by:	NVIDIA Networking
Sponsored by:	Netflix
2022-06-07 12:54:42 +02:00
Gleb Smirnoff
d97922c6c6 unix/*: rewrite unp_internalize() cmsg parsing cycle
Make it a complex, but a single for(;;) statement.  The previous cycle
with some loop logic in the beginning and some loop logic at the end
was confusing.  Both me and markj@ were misleaded to a conclusion that
some checks are unnecessary, while they actually were necessary.

While here, handle an edge case found by Mark, when on 64-bit platform
an incorrect message from userland would underflow length counter, but
return without any error.  Provide a test case for such message.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35375
2022-06-06 10:05:28 -07:00
Yuichiro NAITO
8d95f50052 smp: Use local copies of the setup function pointer and argument
No functional change intended.

PR:		264383
Reviewed by:	jhb, markj
MFC after:	1 week
2022-06-06 11:29:51 -04:00
Gleb Smirnoff
2573e6ced9 unix/dgram: rename unpdg_sendspace to unpdg_maxdgram
Matches the meaning of the variable and sysctl node name.
2022-06-03 12:55:44 -07:00
Gleb Smirnoff
a8e286bb5d sockets: use socket buffer mutexes in struct socket directly
Convert more generic socket code to not use sockbuf compat pointer.
Continuation of 4328318445.
2022-06-03 12:55:44 -07:00
Mitchell Horne
35eb9b10c2 Use KERNEL_PANICKED() in more places
This is slightly more optimized than checking panicstr directly. For
most of these instances performance doesn't matter, but let's make
KERNEL_PANICKED() the common idiom.

Reviewed by:	mjg
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D35373
2022-06-02 10:15:43 -03:00
Gleb Smirnoff
f083739350 soo_aio_*: use socket buffer mutexes in struct socket directly
A miss from commit 4328318445.
2022-05-30 20:46:38 -07:00
Dmitry Chagin
d46174cd88 Finish cpuset_getaffinity() after f35093f8
Split cpuset_getaffinity() into a two counterparts, where the
user_cpuset_getaffinity() is intended to operate on the cpuset_t from
user va, while kern_cpuset_getaffinity() expects the cpuset from kernel
va.
Accordingly, the code that clears the high bits is moved to the
user_cpuset_getaffinity(). Linux sched_getaffinity() syscall returns
the size of set copied to the user-space and then glibc wrapper clears
the high bits.

MFC after:		2 weeks
2022-05-28 20:53:08 +03:00
Dmitry Chagin
31d1b816fe sysent: Get rid of bogus sys/sysent.h include.
Where appropriate hide sysent.h under proper condition.

MFC after:	2 weeks
2022-05-28 20:52:17 +03:00
Gleb Smirnoff
d64f2f42c1 unix: unp_externalize() can M_WAITOK
Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35318
2022-05-27 20:48:38 -07:00
Gleb Smirnoff
d59bc188d6 sockbuf: remove unused mbuf counter and cluster counter
With M_EXTPG mbufs these two counters already do not represent the
reality.  As we are moving towards protocol independent socket buffers,
which may not even use mbufs at all, the counters become less and less
relevant.  The only userland seeing them was 'netstat -x'.

PR:			264181 (exp-run)
Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35334
2022-05-27 08:20:17 -07:00
Gleb Smirnoff
75e7e3ce34 unix: fix incorrect assertion in 4682ac697c
Pointy hat to:	glebius
Fixes:		4682ac697c
2022-05-26 11:35:05 -07:00
Gleb Smirnoff
4682ac697c unix: turn check in unp_externalize() into assertion
In this function we always work with mbufs that we previously
created ourselves in unp_internalize().  They must be valid.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35319
2022-05-25 13:29:20 -07:00
Gleb Smirnoff
579b45e203 unix/*: check new control size in unp_internalize()
Now that we call sbcreatecontrol() with M_WAITOK, we are expected to
pass a valid size.  Return same error code, we are returning for an
oversized control from sockargs().

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35317
2022-05-25 13:29:13 -07:00
Gleb Smirnoff
d60ea9a10a sockets: return EMSGSIZE if control part of message is too large
Specification doesn't list an explicit error code for the control
size specified by msg_control being too large.  But it does list
EMSGSIZE as error code for "message is too large to be sent all at
once (as the socket requires)".  It also lists EINVAL as code for
the "The sum of the iov_len values overflows an ssize_t."  Given
how generic and uninformative EINVAL is, the EMSGSIZE is more
appropriate.

https://pubs.opengroup.org/onlinepubs/9699919799/functions/sendmsg.html

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35316
2022-05-25 13:29:04 -07:00
Gleb Smirnoff
ad51c47fb4 sockbuf: fix assertion in sbcreatecontrol()
Fixes:	6890b58814
2022-05-25 00:19:41 -07:00
Mark Johnston
524dadf7a8 kevent: Fix an off-by-one in filt_timerexpire_l()
Suppose a periodic kevent timer fires close to its deadline, so that
now - kc->next is small.  Then delta ends up being 1, and the next timer
deadline is set to (delta + 1) * kc->to, where kc->to is the timer
period.  This means that the timer fires at half of the requested rate,
and the value returned in kn_data is similarly inaccurate.

PR:		264131
Fixes:		7cb40543e9 ("filt_timerexpire: do not iterate over the interval")
Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D35313
2022-05-24 20:14:33 -04:00
Mateusz Guzik
cdb337b097 vfs: fix copy-pasto in previous
Reported by:	dchagin
2022-05-20 20:58:11 +00:00
Mateusz Guzik
ec3c225711 vfs: call vn_truncate_locked from kern_truncate
This fixes a bug where the syscall would not bump writecount.

PR:	263999
2022-05-20 17:25:51 +00:00
Mateusz Guzik
6b715687bd vfs: make sure truncate always calls NDFREE_*
While here convert it to NDFREE_NOTHING.
2022-05-20 17:25:51 +00:00
Mark Johnston
4a3e51335e cpuset: Fix the KASAN and KMSAN builds
Rename the "copyin" and "copyout" fields of struct cpuset_copy_cb to
something less generic, since sanitizers define interceptors for
copyin() and copyout() using #define.

Reported by:	syzbot+2db5d644097fc698fb6f@syzkaller.appspotmail.com
Fixes:	47a57144af ("cpuset: Byte swap cpuset for compat32 on big endian architectures")
Sponsored by:	The FreeBSD Foundation
2022-05-20 10:34:25 -04:00
Dmitry Chagin
eca368ecb6 Retire sv_transtrap
Call translate_traps directly from sendsig().

MFC after:		2 weeks
2022-05-20 14:54:03 +03:00
Dmitry Chagin
2479e381cd kqueue: Trim trailing whitespace
MFC after:		1 week
2022-05-19 19:52:02 +03:00
Justin Hibbits
47a57144af cpuset: Byte swap cpuset for compat32 on big endian architectures
Summary:
BITSET uses long as its basic underlying type, which is dependent on the
compile type, meaning on 32-bit builds the basic type is 32 bits, but on
64-bit builds it's 64 bits.  On little endian architectures this doesn't
matter, because the LSB is always at the low bit, so the words get
effectively concatenated moving between 32-bit and 64-bit, but on
big-endian architectures it throws a wrench in, as setting bit 0 in
32-bit mode is equivalent to setting bit 32 in 64-bit mode.  To
demonstrate:

32-bit mode:

BIT_SET(foo, 0):        0x00000001

64-bit sees: 0x0000000100000000

cpuset is the only system interface that uses bitsets, so solve this
by swapping the integer sub-components at the copyin/copyout points.

Reviewed by:	kib
MFC after:	3 days
Sponsored by:	Juniper Networks, Inc.
Differential Revision:	https://reviews.freebsd.org/D35225
2022-05-19 10:49:55 -05:00
Andrew Turner
11a6ecd425 Handle cas failure when the compare succeeds
When locking a priority inherit mutex we perform a compare and swap
operation to try and acquire the mutex. This may fail even when the
compare succeeds.

Check and handle this case.

PR:		263825
Reviewed by:	kib, markj
Sponsored by:	The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35150
2022-05-19 11:30:21 +01:00
Gleb Smirnoff
6890b58814 sockbuf: improve sbcreatecontrol()
o Constify memory pointer.  Make length unsigned.
o Make it never fail with M_WAITOK and assert that length is sane.
2022-05-17 10:10:42 -07:00
Gleb Smirnoff
b46667c63e sockbuf: merge two versions of sbcreatecontrol() into one
No functional change.
2022-05-17 10:10:42 -07:00
Gleb Smirnoff
eac7f0798b unix: garbage collect unp_dispose_mbuf() for brevity 2022-05-17 10:10:41 -07:00
Gleb Smirnoff
2e5bf7c49f unix: fix mbuf leak on close of socket with data
Fixes:	1f32cef471
2022-05-17 10:10:41 -07:00
Vladimir Kondratyev
b6f87b78b5 LinuxKPI: Implement kthread_worker related functions
Kthread worker is a single thread workqueue which can be used in cases
where specific kthread association is necessary, for example, when it
should have RT priority or be assigned to certain cgroup.

This change implements Linux v4.9 interface which mostly hides kthread
internals from users thus allowing to use ordinary taskqueue(9) KPI.
As kthread worker prohibits enqueueing of already pending or canceling
tasks some minimal changes to taskqueue(9) were done.
taskqueue_enqueue_flags() was added to taskqueue KPI which accepts extra
flags parameter. It contains one or more of the following flags:

TASKQUEUE_FAIL_IF_PENDING - taskqueue_enqueue_flags() fails if the task
    is already scheduled to execution. EEXIST is returned and the
    ta_pending counter value remains unchanged.
TASKQUEUE_FAIL_IF_CANCELING - taskqueue_enqueue_flags() fails if the
    task is in the canceling state and ECANCELED is returned.

Required by:	drm-kmod 5.10

MFC after:	1 week
Reviewed by:	hselasky, Pau Amma (docs)
Differential Revision:	https://reviews.freebsd.org/D35051
2022-05-17 15:10:20 +03:00
Rick Macklem
373511338d uipc_socket.c: Modify MSG_TLSAPPDATA to only do Alert Records
Without this patch, the MSG_TLSAPPDATA flag would cause
soreceive_generic() to return ENXIO for any non-application
data record in a TLS receive stream.

This works ok for TLS1.2, since Alert records appear to be
the only non-application data records received.
However, for TLS1.3, there can be post-handshake handshake
records, such as NewSessionKey sent to the client from the
server. These handshake records cannot be handled by the
upcall which does an SSL_read() with length == 0.

It appears that the client can simply throw away these
NewSessionKey records, but to do so, it needs to receive
them within the kernel.

This patch modifies the semantics of MSG_TLSAPPDATA slightly,
so that it only applies to Alert records and not Handshake
records. It is needed to allow the krpc to work with KTLS1.3.

Reviewed by:	hselasky
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D35170
2022-05-14 12:56:50 -07:00
Dmitry Chagin
cb2ae61631 sysvsem: Fix a typo
Per jamie@ rpr can be NULL if the jail is created with sysvsem=disable.
But at least it doesn't appear to be fatal, since rpr is never dereferenced
but is only compared to other prison pointers.

Reviewed by:		jamie
Differential revision:	https://reviews.freebsd.org/D35198
MFC after:		2 weeks
2022-05-14 14:07:20 +03:00
Dmitry Chagin
b6c8f461f0 sysvsem: Style(9)
MFC after:	2 weeks
2022-05-14 14:06:58 +03:00
Dmitry Chagin
f0b0fdf15e sysvsem: Trim traiing whitespace
MFC after:	2 weeks
2022-05-14 14:06:40 +03:00
Mitchell Horne
db71383b88 kerneldump: remove physical from dump routines
It is unused, especially now that the underlying d_dumper methods do not
accept the argument.

Reviewed by:	markj
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D35174
2022-05-13 10:43:19 -03:00
Mitchell Horne
489ba22236 kerneldump: remove physical argument from d_dumper
The physical address argument is essentially ignored by every dumper
method. In addition, the dump routines don't actually pass a real
address; every call to dump_append() passes a value of zero for
physical.

Reviewed by:	markj
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D35173
2022-05-13 10:42:48 -03:00
Mitchell Horne
0f50da2e09 Drop d_dump from struct cdevsw
It appears to be unused. These days struct disk has a d_dump member,
which is what gets passed to the kernel dump framework.

Reviewed by:	markj
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D35172
2022-05-13 10:42:17 -03:00
Gleb Smirnoff
bb35a4e11d unix: microoptimize unp_connectat() - one less lock on success
This change is also a preparation for further optimization to
allow locked return on success.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35182
2022-05-12 13:22:39 -07:00
Gleb Smirnoff
08f17d1432 unix: make unp_connect2() void
Assert that sockets are of the same type.  unp_connectat() already did
this check.  Add the check to uipc_connect2().

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35181
2022-05-12 13:22:39 -07:00
Gleb Smirnoff
4328318445 sockets: use socket buffer mutexes in struct socket directly
Since c67f3b8b78 the sockbuf mutexes belong to the containing socket,
and socket buffers just point to it.  In 74a68313b5 macros that access
this mutex directly were added.  Go over the core socket code and
eliminate code that reaches the mutex by dereferencing the sockbuf
compatibility pointer.

This change requires a KPI change, as some functions were given the
sockbuf pointer only without any hint if it is a receive or send buffer.

This change doesn't cover the whole kernel, many protocols still use
compatibility pointers internally.  However, it allows operation of a
protocol that doesn't use them.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35152
2022-05-12 13:22:12 -07:00
Gleb Smirnoff
01235012e5 unix/dgram: uipc_listen() is specific for SOCK_STREAM and SOCK_SEQPACKET
Rely on pr_usrreqs_init() to init SOCK_DGRAM to pru_listen_notsupp().
2022-05-12 11:04:40 -07:00
Gleb Smirnoff
3c87ba3c3b unix/dgram: pru_rcvd never called since PR_WANTRCVD not set 2022-05-12 11:04:40 -07:00
Gleb Smirnoff
2e4e5ee23f sockets: delete stale comment from sofree()
First  paragraph refers to old past "we used to" and is no longer
important today.  Second paragraph has just a wrong statement that
socket buffer is destroyed before pru_detach.
2022-05-12 11:02:50 -07:00
Gleb Smirnoff
1f32cef471 unix: don't call sbrelease() in uipc_detach()
Since a982ce0442 the socket buffer is already cleared and released in
unp_dispose() that is called just before uipc_detach().
2022-05-12 11:02:50 -07:00
Dmitry Chagin
586ed32106 kdump: Decode cpuset_t.
Reviewed by:		jhb
Differential revision:	https://reviews.freebsd.org/D34982
MFC after:		2 weeks
2022-05-11 10:40:39 +03:00
Dmitry Chagin
f35093f8d6 Use Linux semantics for the thread affinity syscalls.
Linux has more tolerant checks of the user supplied cpuset_t's.

Minimum cpuset_t size that the Linux kernel permits in case of
getaffinity() is the maximum CPU id, present in the system / NBBY,
the maximum size is not limited.
For setaffinity(), Linux does not limit the size of the user-provided
cpuset_t, internally using only the meaningful part of the set, where
the upper bound is the maximum CPU id, present in the system, no larger
than the size of the kernel cpuset_t.
Unlike FreeBSD, Linux ignores high bits if set in the setaffinity(),
so clear it in the sched_setaffinity() and Linuxulator itself.

Reviewed by:		Pau Amma (man pages)
In collaboration with:	jhb
Differential revision:	https://reviews.freebsd.org/D34849
MFC after:		2 weeks
2022-05-11 10:36:01 +03:00
Gleb Smirnoff
7db54446c6 sockbufs: make sbrelease_internal() private 2022-05-09 10:43:01 -07:00
Gleb Smirnoff
a982ce0442 sockets: remove the socket-on-stack hack from sorflush()
The hack can be tracked down to 4.4BSD, where copy was performed
under splimp() and then after splx() dom_dispose was called.
Stevens has a chapter on this function, but he doesn't answer why
this trick is necessary.  Why can't we call into dom_dispose under
splimp()?  Anyway, with multithreaded kernel the hack seems to be
necessary to avoid LORs between socket buffer lock and different
filesystem locks, especially network file systems.

The new socket buffers KPI sbcut() from 1d2df300e9 allow us to get
rid of the hack.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35125
2022-05-09 10:43:01 -07:00
Gleb Smirnoff
42f2fa9953 sockets: don't call dom_dispose() on a listening socket
sorflush() already did the right thing, so only sofree() needed
a fix.  Turn check into assertion in our only dom_dispose method.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35124
2022-05-09 10:42:57 -07:00
Gleb Smirnoff
c17418a0ba sockets: assert that any protocol with PR_RIGHTS has dom_dispose()
Through the entire history only PF_UNIX has this feature.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35123
2022-05-09 10:42:48 -07:00
Gleb Smirnoff
24df85d29a unix/*: unp_internalize() can sleep, so allocate mbufs with M_WAITOK 2022-05-09 10:42:48 -07:00
Gleb Smirnoff
97f8198e95 sockets: make SO_SND/SO_RCV a enum
Not a functional change now. The enum will also be used for other socket
buffer related KPIs.
2022-05-09 10:42:47 -07:00
Warner Losh
45ae223ac6 msgbuf: Allow microsecond granularity timestamps
Today, kern.msgbuf_show_timestamp=1 will give 1 second granularity
timestamps on dmesg lines. When kern.msgbuf_show_timestamp=2, we'll
produce microsecond level graunlarity.
For example:
old (== 1):
[13] Dual Console: Video Primary, Serial Secondary
[14] lo0: link state changed to UP
[15] bxe0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
[15] bxe0: link state changed to UP
new (== 2):
[13.807015] Dual Console: Video Primary, Serial Secondary
[14.544150] lo0: link state changed to UP
[15.272044] bxe0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
[15.272052] bxe0: link state changed to UP

Sponsored by:		Netflix
2022-05-07 09:32:22 -06:00
Alan Somers
1d2421ad8b Correctly measure system load averages > 1024
The old fixed-point arithmetic used for calculating load averages had an
overflow at 1024.  So on systems with extremely high load, the observed
load average would actually fall back to 0 and shoot up again, creating
a kind of sawtooth graph.

Fix this by using 64-bit math internally, while still reporting the load
average to userspace as a 32-bit number.

Sponsored by:	Axcient
Reviewed by:	imp
Differential Revision: https://reviews.freebsd.org/D35134
2022-05-06 17:25:43 -06:00
John Baldwin
2fdcc2ef6f cpufreq: Remove unused devclass argument to DRIVER_MODULE. 2022-05-06 15:46:58 -07:00
Dmitry Chagin
f04534f5c8 sysvsem: Add a timeout argument to the semop.
For future use in the Linux emulation layer for the semtimedop syscall
split the sys_semop syscall into two counterparts and add
struct timespec *timeout argument to the last one.

Reviewed by:		jhb, kib
Differential revision:	https://reviews.freebsd.org/D35121
MFC after:		2 weeks
2022-05-06 19:51:48 +03:00
Kristof Provost
613acc6483 mbuf: do not restore dying interfaces
When we remove an interface it is first removed from the interface list
V_ifnet (by if_unlink_ifnet()) and marked as IFF_DYING. We then wait for
any possible references to stop being used (i.e.
epoch_wait/epoch_drain_callbacks) before we tear it fully down.

However, the index in ifindex_table is not removed, so m_rcvif_restore()
can still find the (now dying) interface.

This results in panics, for example when dummynet restores the rcvif
pointer and passes a packet to ip6_input() we can panic because the
AF_INET6 domain has already been removed (so we end up dereferencing a
NULL pointer there).

Check that the interface is not dying before we restore it, which is
equivalent to checking its presence in V_ifnet, and thus ensures that
future accesses (while in NET_EPOCH) are safe.

Reviewed by:	glebius
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D34076

(cherry picked from commit 703e533da5)
2022-05-05 14:38:08 -04:00
Gleb Smirnoff
4d7a1361ef ifnet/mbuf: provide KPI to serialize/restore m->m_pkthdr.rcvif
Supplement ifindex table with generation count and use it to
serialize & restore an ifnet pointer.

Reviewed by:		kp
Differential revision:	https://reviews.freebsd.org/D33266
Fun note:		git show e6abef0918

(cherry picked from commit e1882428dc)
2022-05-05 14:38:07 -04:00
Marko Zec
6c741ffbfa Revert "mbuf: do not restore dying interfaces"
This reverts commit 703e533da5.

Revert "ifnet/mbuf: provide KPI to serialize/restore m->m_pkthdr.rcvif"

This reverts commit e1882428dc.

Obtained from: github.com/glebius/FreeBSD/commits/backout-ifindex
2022-05-03 19:11:40 +02:00
Konstantin Belousov
6fe78ad434 subr_unit.c: make userspace tests buildable
by defining a placeholder for UNR_NO_MTX

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2022-04-28 03:00:14 +03:00
Konstantin Belousov
709783373e Fix another race between fork(2) and PROC_REAP_KILL subtree
where we might not yet see a new child when signalling a process.
Ensure that this cannot happen by stopping all reapping subtree,
which ensures that the child is not inside a syscall, in particular
fork(2).

Reported and tested by:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35014
2022-04-28 02:27:35 +03:00
Konstantin Belousov
39794d80ad Fix a race between fork(2) and PROC_REAP_KILL subtree
by repeating iteration over the subtree until there are no new processes
to signal.

Reported and tested by:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35014
2022-04-28 02:27:35 +03:00
Konstantin Belousov
d1df347368 kern_procctl: add possibility to take stop_all_proc_block() around exec
stop_allo_proc_block() must be taken before proctree_lock.

Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35014
2022-04-28 02:27:35 +03:00
Konstantin Belousov
2e7595ef2f Add stop_all_proc_block(9)
It allows to have more than one consumer of thread_signle(SIGNLE_ALLPROC) by
serializing them.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35014
2022-04-28 02:27:35 +03:00
Konstantin Belousov
54a11adbd9 reap_kill(): split children and subtree killers into helpers
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35014
2022-04-28 02:27:34 +03:00
Konstantin Belousov
134529b11b reap_kill(): rename the reap variable to reaper
Suggested and reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35014
2022-04-28 02:27:34 +03:00
Konstantin Belousov
e4ce431e2a reap_kill(): de-inline LIST_FOREACH(), twice
Suggested and reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35014
2022-04-28 02:27:34 +03:00
Konstantin Belousov
b9294a3e15 reaper_abandon_children(): upgrade proctree_lock assert to exclusive
p_reapsibling linkage is protected by proctree_lock, and it is modified
there.

Suggested and reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35014
2022-04-28 02:27:34 +03:00
Konstantin Belousov
e59b940dcb unr(9): allow to avoid internal locking
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35014
2022-04-28 02:27:34 +03:00
Konstantin Belousov
c4be460e84 init_unrhdr(): make it usable by initializing everything
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35014
2022-04-28 02:27:34 +03:00
John Baldwin
1431239494 Add a __witness_used for variables only used under #ifdef WITNESS.
__diagused is now solely used for variables only used under INVARIANTS.

Reviewed by:	mjg
Differential Revision:	https://reviews.freebsd.org/D35085
2022-04-27 11:46:16 -07:00
Dmitry Chagin
4a700f3c32 sigtimedwait: Prevent timeout math overflows.
Our kern_sigtimedwait() calculates absolute sleep timo value as 'uptime+timeout'.
So, when the user specifies a big timeout value (LONG_MAX), the calculated
timo can be less the the current uptime value.
In that case kern_sigtimedwait() returns EAGAIN instead of EINTR, if
unblocked signal was caught.

While here switch to a high-precision sleep method.

Reviewed by:		mav, kib
In collaboration with:	mav
Differential revision:	https://reviews.freebsd.org/D34981
MFC after:		2 weeks
2022-04-25 10:23:15 +03:00
Dmitry Chagin
91e7bdcdcf Add timespecvalid_interval macro and use it.
Reviewed by:		jhb, imp (early rev)
Differential revision:	https://reviews.freebsd.org/D34848
MFC after:		2 weeks
2022-04-25 10:20:54 +03:00
John Baldwin
a4c5d490f6 KTLS: Move OCF function pointers out of ktls_session.
Instead, create a switch structure private to ktls_ocf.c and store a
pointer to the switch in the ocf_session.  This will permit adding an
additional function pointer needed for NIC TLS RX without further
bloating ktls_session.

Reviewed by:	hselasky
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D35011
2022-04-22 15:52:12 -07:00
John Baldwin
92e40a9b92 busdma_bounce: Batch bounce page free operations when possible.
Reviewed by:	imp
Differential Revision:	https://reviews.freebsd.org/D34968
2022-04-21 12:01:55 -07:00
John Baldwin
d4ab3a8d4f busdma_bounce: Add free_bounce_pages helper function.
Deduplicate code to iterate over the bpages list in a bus_dmamap_t
freeing bounce pages during bus_dmamap_unload.

Reviewed by:	imp
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D34967
2022-04-21 10:42:14 -07:00
John Baldwin
10fe9a1fb4 busdma_bounce: Make the map waiting list per-bounce-zone.
When pages are freed to a bounce zone, only maps waiting for pages for
that zone can make forward progress.  If a map for a different bounce
zone is at the head of the global list, then requests that could
otherwise make forward progress will be stalled waiting on the other
bounce zone.  If bounce zones shared bounce pages then a global list
would still make sense to prevent "later" requests from starving an
earlier request but that is not a concern with per-zone bounce page
pools.

Reviewed by:	imp
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D34966
2022-04-21 10:41:09 -07:00
John Baldwin
d11f5d4762 busdma_bounce: Use a simple kproc to invoke deferred requests.
Rather than using a software interrupt with a single handler, just
create a dedicated kernel process woken up with a simple wakeup().

Reviewed by:	imp
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D34965
2022-04-21 10:40:35 -07:00
John Baldwin
c7aa0304d5 Run softclock threads at a hardware ithread priority.
Add a new PI_SOFTCLOCK for use by softclock threads.  Currently this
maps to PI_AV which is the second-highest ithread priority.

Reviewed by:	mav, kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D33693
2022-04-21 10:40:01 -07:00
John Baldwin
3d7e90fc20 cpufreq_curr_sysctl: Use devclass_find to lookup cpufreq devclass.
Reviewed by:	imp
Differential Revision:	https://reviews.freebsd.org/D35002
2022-04-21 10:29:14 -07:00
Kristof Provost
a879e40ca2 callout: fix using shared rmlocks
15b1eb142c changed the callout code to store the CALLOUT_SHAREDLOCK flag
in c_iflags (where it used to be c_flags), but failed to update the
check in softclock_call_cc(). This resulted in the callout code always
taking the write lock, even if a read lock had been requested (with
the CALLOUT_SHAREDLOCK flag in callout_init_rm()).

Reviewed by:	markj
MFC after:	1 week
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D34959
2022-04-20 13:06:50 +02:00
John Baldwin
5bdea8826b devclass_add_driver: Permit NULL to be passed in dcp.
This permits a driver module structure that doesn't want to store a
pointer to the new driver's devclass.

Reviewed by:	imp
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D34962
2022-04-19 10:43:50 -07:00
Mateusz Guzik
c5c981d443 signals: plug a set-but-not-used var
Sponsored by:	Rubicon Communications, LLC ("Netgate")
2022-04-19 12:45:57 +00:00
John Baldwin
d139909d6e destroy_dev_sched*: Don't hold Giant for all deferred destroy_dev.
Rather than using taskqueue_swi_giant which holds Giant for all
deferred destroy_dev calls, create a separate queue for destroyed
devices with D_NEEDGIANT set in the corresponding cdevsw.  The task
for this queue holds Giant whild destroying deferred devices while the
task for the default queue does not hold Giant.

In addition, switch to taskqueue_thread for destroy_dev_sched.
Deferred destroy_dev requests don't need to run at an SWI priority.

Reviewed by:	imp, markj
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D34915
2022-04-18 12:04:30 -07:00
Konstantin Belousov
362ff9867e Revert rest of a5970a529c: use vrefact() when working on fp->f_vnode
Now, since O_PATH-opened file descriptors use use references instead
of the hold references, vrefact() chahges from that revision can be
reverted.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D34906
2022-04-15 16:56:20 +03:00
Ed Maste
f99cc5a389 sysent: regen after 52a1d90c8b, posix_fadvise in capmode 2022-04-14 15:17:36 -04:00
Ed Maste
52a1d90c8b Allow posix_fadvise in capability mode
posix_fadvise operates only on a provided fd.  Noted by
Mathieu <sigsys@gmail.com> in review D34761.

No new CAP_ rights are added for posix_fadvise(), as 'advice' in
general only influences when I/O happens; the fd must have existing
CAP_ rights for actual data access.

Reviewed by:	markj
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D34903
2022-04-14 15:11:21 -04:00
Konstantin Belousov
bf13db086b Mostly revert a5970a529c: Make files opened with O_PATH to not block non-forced unmount
Problem is that open(O_PATH) on nullfs -o nocache is broken then,
because there is no reference on the vnode after the open syscall exits.

Reported and tested by:	ambrisko
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2022-04-14 02:47:04 +03:00
John Baldwin
36fb372264 kern: Move variables only used for MAC under #ifdef MAC. 2022-04-13 16:08:23 -07:00
John Baldwin
4aec198420 sched_ule: Inline value of ts in sched_thread_priority.
This avoids a set but unused warning in kernels without SMP where
TDQ_CPU() doesn't use its argument.
2022-04-13 16:08:23 -07:00
John Baldwin
8758ac757f sched_4bsd: ts is only used in sched_bind for SMP. 2022-04-13 16:08:22 -07:00
John Baldwin
72ff256c51 sched_4bsd: Remove unused variables. 2022-04-12 14:58:59 -07:00
John Baldwin
dbd51c416a realloc(9): Move slab and zone under #ifndef DEBUG_REDZONE. 2022-04-12 14:58:59 -07:00
Mark Johnston
d769609620 tty: Remove an incorrect assertion from ttyinq_line_iterate()
We may legitimately have tib == NULL if we're at the very end of the
queue.

PR:		215373
Reported by:	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2022-04-12 17:30:04 -04:00
Tom Jones
1ea833a572 kdb: set kdb_why when entered via reboot and panic
Reviewed by:	jhb
Sponsored by:   NetApp, Inc.
Sponsored by:   Klara, Inc.
X-NetApp-PR:    #74
Differential Revision:	https://reviews.freebsd.org/D34551
2022-04-12 10:34:40 +01:00