Commit Graph

19325 Commits

Author SHA1 Message Date
Gleb Smirnoff
4682ac697c unix: turn check in unp_externalize() into assertion
In this function we always work with mbufs that we previously
created ourselves in unp_internalize().  They must be valid.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35319
2022-05-25 13:29:20 -07:00
Gleb Smirnoff
579b45e203 unix/*: check new control size in unp_internalize()
Now that we call sbcreatecontrol() with M_WAITOK, we are expected to
pass a valid size.  Return same error code, we are returning for an
oversized control from sockargs().

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35317
2022-05-25 13:29:13 -07:00
Gleb Smirnoff
d60ea9a10a sockets: return EMSGSIZE if control part of message is too large
Specification doesn't list an explicit error code for the control
size specified by msg_control being too large.  But it does list
EMSGSIZE as error code for "message is too large to be sent all at
once (as the socket requires)".  It also lists EINVAL as code for
the "The sum of the iov_len values overflows an ssize_t."  Given
how generic and uninformative EINVAL is, the EMSGSIZE is more
appropriate.

https://pubs.opengroup.org/onlinepubs/9699919799/functions/sendmsg.html

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35316
2022-05-25 13:29:04 -07:00
Gleb Smirnoff
ad51c47fb4 sockbuf: fix assertion in sbcreatecontrol()
Fixes:	6890b58814
2022-05-25 00:19:41 -07:00
Mark Johnston
524dadf7a8 kevent: Fix an off-by-one in filt_timerexpire_l()
Suppose a periodic kevent timer fires close to its deadline, so that
now - kc->next is small.  Then delta ends up being 1, and the next timer
deadline is set to (delta + 1) * kc->to, where kc->to is the timer
period.  This means that the timer fires at half of the requested rate,
and the value returned in kn_data is similarly inaccurate.

PR:		264131
Fixes:		7cb40543e9 ("filt_timerexpire: do not iterate over the interval")
Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D35313
2022-05-24 20:14:33 -04:00
Mateusz Guzik
cdb337b097 vfs: fix copy-pasto in previous
Reported by:	dchagin
2022-05-20 20:58:11 +00:00
Mateusz Guzik
ec3c225711 vfs: call vn_truncate_locked from kern_truncate
This fixes a bug where the syscall would not bump writecount.

PR:	263999
2022-05-20 17:25:51 +00:00
Mateusz Guzik
6b715687bd vfs: make sure truncate always calls NDFREE_*
While here convert it to NDFREE_NOTHING.
2022-05-20 17:25:51 +00:00
Mark Johnston
4a3e51335e cpuset: Fix the KASAN and KMSAN builds
Rename the "copyin" and "copyout" fields of struct cpuset_copy_cb to
something less generic, since sanitizers define interceptors for
copyin() and copyout() using #define.

Reported by:	syzbot+2db5d644097fc698fb6f@syzkaller.appspotmail.com
Fixes:	47a57144af ("cpuset: Byte swap cpuset for compat32 on big endian architectures")
Sponsored by:	The FreeBSD Foundation
2022-05-20 10:34:25 -04:00
Dmitry Chagin
eca368ecb6 Retire sv_transtrap
Call translate_traps directly from sendsig().

MFC after:		2 weeks
2022-05-20 14:54:03 +03:00
Dmitry Chagin
2479e381cd kqueue: Trim trailing whitespace
MFC after:		1 week
2022-05-19 19:52:02 +03:00
Justin Hibbits
47a57144af cpuset: Byte swap cpuset for compat32 on big endian architectures
Summary:
BITSET uses long as its basic underlying type, which is dependent on the
compile type, meaning on 32-bit builds the basic type is 32 bits, but on
64-bit builds it's 64 bits.  On little endian architectures this doesn't
matter, because the LSB is always at the low bit, so the words get
effectively concatenated moving between 32-bit and 64-bit, but on
big-endian architectures it throws a wrench in, as setting bit 0 in
32-bit mode is equivalent to setting bit 32 in 64-bit mode.  To
demonstrate:

32-bit mode:

BIT_SET(foo, 0):        0x00000001

64-bit sees: 0x0000000100000000

cpuset is the only system interface that uses bitsets, so solve this
by swapping the integer sub-components at the copyin/copyout points.

Reviewed by:	kib
MFC after:	3 days
Sponsored by:	Juniper Networks, Inc.
Differential Revision:	https://reviews.freebsd.org/D35225
2022-05-19 10:49:55 -05:00
Andrew Turner
11a6ecd425 Handle cas failure when the compare succeeds
When locking a priority inherit mutex we perform a compare and swap
operation to try and acquire the mutex. This may fail even when the
compare succeeds.

Check and handle this case.

PR:		263825
Reviewed by:	kib, markj
Sponsored by:	The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35150
2022-05-19 11:30:21 +01:00
Gleb Smirnoff
6890b58814 sockbuf: improve sbcreatecontrol()
o Constify memory pointer.  Make length unsigned.
o Make it never fail with M_WAITOK and assert that length is sane.
2022-05-17 10:10:42 -07:00
Gleb Smirnoff
b46667c63e sockbuf: merge two versions of sbcreatecontrol() into one
No functional change.
2022-05-17 10:10:42 -07:00
Gleb Smirnoff
eac7f0798b unix: garbage collect unp_dispose_mbuf() for brevity 2022-05-17 10:10:41 -07:00
Gleb Smirnoff
2e5bf7c49f unix: fix mbuf leak on close of socket with data
Fixes:	1f32cef471
2022-05-17 10:10:41 -07:00
Vladimir Kondratyev
b6f87b78b5 LinuxKPI: Implement kthread_worker related functions
Kthread worker is a single thread workqueue which can be used in cases
where specific kthread association is necessary, for example, when it
should have RT priority or be assigned to certain cgroup.

This change implements Linux v4.9 interface which mostly hides kthread
internals from users thus allowing to use ordinary taskqueue(9) KPI.
As kthread worker prohibits enqueueing of already pending or canceling
tasks some minimal changes to taskqueue(9) were done.
taskqueue_enqueue_flags() was added to taskqueue KPI which accepts extra
flags parameter. It contains one or more of the following flags:

TASKQUEUE_FAIL_IF_PENDING - taskqueue_enqueue_flags() fails if the task
    is already scheduled to execution. EEXIST is returned and the
    ta_pending counter value remains unchanged.
TASKQUEUE_FAIL_IF_CANCELING - taskqueue_enqueue_flags() fails if the
    task is in the canceling state and ECANCELED is returned.

Required by:	drm-kmod 5.10

MFC after:	1 week
Reviewed by:	hselasky, Pau Amma (docs)
Differential Revision:	https://reviews.freebsd.org/D35051
2022-05-17 15:10:20 +03:00
Rick Macklem
373511338d uipc_socket.c: Modify MSG_TLSAPPDATA to only do Alert Records
Without this patch, the MSG_TLSAPPDATA flag would cause
soreceive_generic() to return ENXIO for any non-application
data record in a TLS receive stream.

This works ok for TLS1.2, since Alert records appear to be
the only non-application data records received.
However, for TLS1.3, there can be post-handshake handshake
records, such as NewSessionKey sent to the client from the
server. These handshake records cannot be handled by the
upcall which does an SSL_read() with length == 0.

It appears that the client can simply throw away these
NewSessionKey records, but to do so, it needs to receive
them within the kernel.

This patch modifies the semantics of MSG_TLSAPPDATA slightly,
so that it only applies to Alert records and not Handshake
records. It is needed to allow the krpc to work with KTLS1.3.

Reviewed by:	hselasky
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D35170
2022-05-14 12:56:50 -07:00
Dmitry Chagin
cb2ae61631 sysvsem: Fix a typo
Per jamie@ rpr can be NULL if the jail is created with sysvsem=disable.
But at least it doesn't appear to be fatal, since rpr is never dereferenced
but is only compared to other prison pointers.

Reviewed by:		jamie
Differential revision:	https://reviews.freebsd.org/D35198
MFC after:		2 weeks
2022-05-14 14:07:20 +03:00
Dmitry Chagin
b6c8f461f0 sysvsem: Style(9)
MFC after:	2 weeks
2022-05-14 14:06:58 +03:00
Dmitry Chagin
f0b0fdf15e sysvsem: Trim traiing whitespace
MFC after:	2 weeks
2022-05-14 14:06:40 +03:00
Mitchell Horne
db71383b88 kerneldump: remove physical from dump routines
It is unused, especially now that the underlying d_dumper methods do not
accept the argument.

Reviewed by:	markj
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D35174
2022-05-13 10:43:19 -03:00
Mitchell Horne
489ba22236 kerneldump: remove physical argument from d_dumper
The physical address argument is essentially ignored by every dumper
method. In addition, the dump routines don't actually pass a real
address; every call to dump_append() passes a value of zero for
physical.

Reviewed by:	markj
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D35173
2022-05-13 10:42:48 -03:00
Mitchell Horne
0f50da2e09 Drop d_dump from struct cdevsw
It appears to be unused. These days struct disk has a d_dump member,
which is what gets passed to the kernel dump framework.

Reviewed by:	markj
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D35172
2022-05-13 10:42:17 -03:00
Gleb Smirnoff
bb35a4e11d unix: microoptimize unp_connectat() - one less lock on success
This change is also a preparation for further optimization to
allow locked return on success.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35182
2022-05-12 13:22:39 -07:00
Gleb Smirnoff
08f17d1432 unix: make unp_connect2() void
Assert that sockets are of the same type.  unp_connectat() already did
this check.  Add the check to uipc_connect2().

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35181
2022-05-12 13:22:39 -07:00
Gleb Smirnoff
4328318445 sockets: use socket buffer mutexes in struct socket directly
Since c67f3b8b78 the sockbuf mutexes belong to the containing socket,
and socket buffers just point to it.  In 74a68313b5 macros that access
this mutex directly were added.  Go over the core socket code and
eliminate code that reaches the mutex by dereferencing the sockbuf
compatibility pointer.

This change requires a KPI change, as some functions were given the
sockbuf pointer only without any hint if it is a receive or send buffer.

This change doesn't cover the whole kernel, many protocols still use
compatibility pointers internally.  However, it allows operation of a
protocol that doesn't use them.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35152
2022-05-12 13:22:12 -07:00
Gleb Smirnoff
01235012e5 unix/dgram: uipc_listen() is specific for SOCK_STREAM and SOCK_SEQPACKET
Rely on pr_usrreqs_init() to init SOCK_DGRAM to pru_listen_notsupp().
2022-05-12 11:04:40 -07:00
Gleb Smirnoff
3c87ba3c3b unix/dgram: pru_rcvd never called since PR_WANTRCVD not set 2022-05-12 11:04:40 -07:00
Gleb Smirnoff
2e4e5ee23f sockets: delete stale comment from sofree()
First  paragraph refers to old past "we used to" and is no longer
important today.  Second paragraph has just a wrong statement that
socket buffer is destroyed before pru_detach.
2022-05-12 11:02:50 -07:00
Gleb Smirnoff
1f32cef471 unix: don't call sbrelease() in uipc_detach()
Since a982ce0442 the socket buffer is already cleared and released in
unp_dispose() that is called just before uipc_detach().
2022-05-12 11:02:50 -07:00
Dmitry Chagin
586ed32106 kdump: Decode cpuset_t.
Reviewed by:		jhb
Differential revision:	https://reviews.freebsd.org/D34982
MFC after:		2 weeks
2022-05-11 10:40:39 +03:00
Dmitry Chagin
f35093f8d6 Use Linux semantics for the thread affinity syscalls.
Linux has more tolerant checks of the user supplied cpuset_t's.

Minimum cpuset_t size that the Linux kernel permits in case of
getaffinity() is the maximum CPU id, present in the system / NBBY,
the maximum size is not limited.
For setaffinity(), Linux does not limit the size of the user-provided
cpuset_t, internally using only the meaningful part of the set, where
the upper bound is the maximum CPU id, present in the system, no larger
than the size of the kernel cpuset_t.
Unlike FreeBSD, Linux ignores high bits if set in the setaffinity(),
so clear it in the sched_setaffinity() and Linuxulator itself.

Reviewed by:		Pau Amma (man pages)
In collaboration with:	jhb
Differential revision:	https://reviews.freebsd.org/D34849
MFC after:		2 weeks
2022-05-11 10:36:01 +03:00
Gleb Smirnoff
7db54446c6 sockbufs: make sbrelease_internal() private 2022-05-09 10:43:01 -07:00
Gleb Smirnoff
a982ce0442 sockets: remove the socket-on-stack hack from sorflush()
The hack can be tracked down to 4.4BSD, where copy was performed
under splimp() and then after splx() dom_dispose was called.
Stevens has a chapter on this function, but he doesn't answer why
this trick is necessary.  Why can't we call into dom_dispose under
splimp()?  Anyway, with multithreaded kernel the hack seems to be
necessary to avoid LORs between socket buffer lock and different
filesystem locks, especially network file systems.

The new socket buffers KPI sbcut() from 1d2df300e9 allow us to get
rid of the hack.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35125
2022-05-09 10:43:01 -07:00
Gleb Smirnoff
42f2fa9953 sockets: don't call dom_dispose() on a listening socket
sorflush() already did the right thing, so only sofree() needed
a fix.  Turn check into assertion in our only dom_dispose method.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35124
2022-05-09 10:42:57 -07:00
Gleb Smirnoff
c17418a0ba sockets: assert that any protocol with PR_RIGHTS has dom_dispose()
Through the entire history only PF_UNIX has this feature.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35123
2022-05-09 10:42:48 -07:00
Gleb Smirnoff
24df85d29a unix/*: unp_internalize() can sleep, so allocate mbufs with M_WAITOK 2022-05-09 10:42:48 -07:00
Gleb Smirnoff
97f8198e95 sockets: make SO_SND/SO_RCV a enum
Not a functional change now. The enum will also be used for other socket
buffer related KPIs.
2022-05-09 10:42:47 -07:00
Warner Losh
45ae223ac6 msgbuf: Allow microsecond granularity timestamps
Today, kern.msgbuf_show_timestamp=1 will give 1 second granularity
timestamps on dmesg lines. When kern.msgbuf_show_timestamp=2, we'll
produce microsecond level graunlarity.
For example:
old (== 1):
[13] Dual Console: Video Primary, Serial Secondary
[14] lo0: link state changed to UP
[15] bxe0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
[15] bxe0: link state changed to UP
new (== 2):
[13.807015] Dual Console: Video Primary, Serial Secondary
[14.544150] lo0: link state changed to UP
[15.272044] bxe0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
[15.272052] bxe0: link state changed to UP

Sponsored by:		Netflix
2022-05-07 09:32:22 -06:00
Alan Somers
1d2421ad8b Correctly measure system load averages > 1024
The old fixed-point arithmetic used for calculating load averages had an
overflow at 1024.  So on systems with extremely high load, the observed
load average would actually fall back to 0 and shoot up again, creating
a kind of sawtooth graph.

Fix this by using 64-bit math internally, while still reporting the load
average to userspace as a 32-bit number.

Sponsored by:	Axcient
Reviewed by:	imp
Differential Revision: https://reviews.freebsd.org/D35134
2022-05-06 17:25:43 -06:00
John Baldwin
2fdcc2ef6f cpufreq: Remove unused devclass argument to DRIVER_MODULE. 2022-05-06 15:46:58 -07:00
Dmitry Chagin
f04534f5c8 sysvsem: Add a timeout argument to the semop.
For future use in the Linux emulation layer for the semtimedop syscall
split the sys_semop syscall into two counterparts and add
struct timespec *timeout argument to the last one.

Reviewed by:		jhb, kib
Differential revision:	https://reviews.freebsd.org/D35121
MFC after:		2 weeks
2022-05-06 19:51:48 +03:00
Kristof Provost
613acc6483 mbuf: do not restore dying interfaces
When we remove an interface it is first removed from the interface list
V_ifnet (by if_unlink_ifnet()) and marked as IFF_DYING. We then wait for
any possible references to stop being used (i.e.
epoch_wait/epoch_drain_callbacks) before we tear it fully down.

However, the index in ifindex_table is not removed, so m_rcvif_restore()
can still find the (now dying) interface.

This results in panics, for example when dummynet restores the rcvif
pointer and passes a packet to ip6_input() we can panic because the
AF_INET6 domain has already been removed (so we end up dereferencing a
NULL pointer there).

Check that the interface is not dying before we restore it, which is
equivalent to checking its presence in V_ifnet, and thus ensures that
future accesses (while in NET_EPOCH) are safe.

Reviewed by:	glebius
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D34076

(cherry picked from commit 703e533da5)
2022-05-05 14:38:08 -04:00
Gleb Smirnoff
4d7a1361ef ifnet/mbuf: provide KPI to serialize/restore m->m_pkthdr.rcvif
Supplement ifindex table with generation count and use it to
serialize & restore an ifnet pointer.

Reviewed by:		kp
Differential revision:	https://reviews.freebsd.org/D33266
Fun note:		git show e6abef0918

(cherry picked from commit e1882428dc)
2022-05-05 14:38:07 -04:00
Marko Zec
6c741ffbfa Revert "mbuf: do not restore dying interfaces"
This reverts commit 703e533da5.

Revert "ifnet/mbuf: provide KPI to serialize/restore m->m_pkthdr.rcvif"

This reverts commit e1882428dc.

Obtained from: github.com/glebius/FreeBSD/commits/backout-ifindex
2022-05-03 19:11:40 +02:00
Konstantin Belousov
6fe78ad434 subr_unit.c: make userspace tests buildable
by defining a placeholder for UNR_NO_MTX

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2022-04-28 03:00:14 +03:00
Konstantin Belousov
709783373e Fix another race between fork(2) and PROC_REAP_KILL subtree
where we might not yet see a new child when signalling a process.
Ensure that this cannot happen by stopping all reapping subtree,
which ensures that the child is not inside a syscall, in particular
fork(2).

Reported and tested by:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35014
2022-04-28 02:27:35 +03:00
Konstantin Belousov
39794d80ad Fix a race between fork(2) and PROC_REAP_KILL subtree
by repeating iteration over the subtree until there are no new processes
to signal.

Reported and tested by:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35014
2022-04-28 02:27:35 +03:00
Konstantin Belousov
d1df347368 kern_procctl: add possibility to take stop_all_proc_block() around exec
stop_allo_proc_block() must be taken before proctree_lock.

Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35014
2022-04-28 02:27:35 +03:00
Konstantin Belousov
2e7595ef2f Add stop_all_proc_block(9)
It allows to have more than one consumer of thread_signle(SIGNLE_ALLPROC) by
serializing them.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35014
2022-04-28 02:27:35 +03:00
Konstantin Belousov
54a11adbd9 reap_kill(): split children and subtree killers into helpers
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35014
2022-04-28 02:27:34 +03:00
Konstantin Belousov
134529b11b reap_kill(): rename the reap variable to reaper
Suggested and reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35014
2022-04-28 02:27:34 +03:00
Konstantin Belousov
e4ce431e2a reap_kill(): de-inline LIST_FOREACH(), twice
Suggested and reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35014
2022-04-28 02:27:34 +03:00
Konstantin Belousov
b9294a3e15 reaper_abandon_children(): upgrade proctree_lock assert to exclusive
p_reapsibling linkage is protected by proctree_lock, and it is modified
there.

Suggested and reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35014
2022-04-28 02:27:34 +03:00
Konstantin Belousov
e59b940dcb unr(9): allow to avoid internal locking
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35014
2022-04-28 02:27:34 +03:00
Konstantin Belousov
c4be460e84 init_unrhdr(): make it usable by initializing everything
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35014
2022-04-28 02:27:34 +03:00
John Baldwin
1431239494 Add a __witness_used for variables only used under #ifdef WITNESS.
__diagused is now solely used for variables only used under INVARIANTS.

Reviewed by:	mjg
Differential Revision:	https://reviews.freebsd.org/D35085
2022-04-27 11:46:16 -07:00
Dmitry Chagin
4a700f3c32 sigtimedwait: Prevent timeout math overflows.
Our kern_sigtimedwait() calculates absolute sleep timo value as 'uptime+timeout'.
So, when the user specifies a big timeout value (LONG_MAX), the calculated
timo can be less the the current uptime value.
In that case kern_sigtimedwait() returns EAGAIN instead of EINTR, if
unblocked signal was caught.

While here switch to a high-precision sleep method.

Reviewed by:		mav, kib
In collaboration with:	mav
Differential revision:	https://reviews.freebsd.org/D34981
MFC after:		2 weeks
2022-04-25 10:23:15 +03:00
Dmitry Chagin
91e7bdcdcf Add timespecvalid_interval macro and use it.
Reviewed by:		jhb, imp (early rev)
Differential revision:	https://reviews.freebsd.org/D34848
MFC after:		2 weeks
2022-04-25 10:20:54 +03:00
John Baldwin
a4c5d490f6 KTLS: Move OCF function pointers out of ktls_session.
Instead, create a switch structure private to ktls_ocf.c and store a
pointer to the switch in the ocf_session.  This will permit adding an
additional function pointer needed for NIC TLS RX without further
bloating ktls_session.

Reviewed by:	hselasky
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D35011
2022-04-22 15:52:12 -07:00
John Baldwin
92e40a9b92 busdma_bounce: Batch bounce page free operations when possible.
Reviewed by:	imp
Differential Revision:	https://reviews.freebsd.org/D34968
2022-04-21 12:01:55 -07:00
John Baldwin
d4ab3a8d4f busdma_bounce: Add free_bounce_pages helper function.
Deduplicate code to iterate over the bpages list in a bus_dmamap_t
freeing bounce pages during bus_dmamap_unload.

Reviewed by:	imp
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D34967
2022-04-21 10:42:14 -07:00
John Baldwin
10fe9a1fb4 busdma_bounce: Make the map waiting list per-bounce-zone.
When pages are freed to a bounce zone, only maps waiting for pages for
that zone can make forward progress.  If a map for a different bounce
zone is at the head of the global list, then requests that could
otherwise make forward progress will be stalled waiting on the other
bounce zone.  If bounce zones shared bounce pages then a global list
would still make sense to prevent "later" requests from starving an
earlier request but that is not a concern with per-zone bounce page
pools.

Reviewed by:	imp
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D34966
2022-04-21 10:41:09 -07:00
John Baldwin
d11f5d4762 busdma_bounce: Use a simple kproc to invoke deferred requests.
Rather than using a software interrupt with a single handler, just
create a dedicated kernel process woken up with a simple wakeup().

Reviewed by:	imp
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D34965
2022-04-21 10:40:35 -07:00
John Baldwin
c7aa0304d5 Run softclock threads at a hardware ithread priority.
Add a new PI_SOFTCLOCK for use by softclock threads.  Currently this
maps to PI_AV which is the second-highest ithread priority.

Reviewed by:	mav, kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D33693
2022-04-21 10:40:01 -07:00
John Baldwin
3d7e90fc20 cpufreq_curr_sysctl: Use devclass_find to lookup cpufreq devclass.
Reviewed by:	imp
Differential Revision:	https://reviews.freebsd.org/D35002
2022-04-21 10:29:14 -07:00
Kristof Provost
a879e40ca2 callout: fix using shared rmlocks
15b1eb142c changed the callout code to store the CALLOUT_SHAREDLOCK flag
in c_iflags (where it used to be c_flags), but failed to update the
check in softclock_call_cc(). This resulted in the callout code always
taking the write lock, even if a read lock had been requested (with
the CALLOUT_SHAREDLOCK flag in callout_init_rm()).

Reviewed by:	markj
MFC after:	1 week
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D34959
2022-04-20 13:06:50 +02:00
John Baldwin
5bdea8826b devclass_add_driver: Permit NULL to be passed in dcp.
This permits a driver module structure that doesn't want to store a
pointer to the new driver's devclass.

Reviewed by:	imp
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D34962
2022-04-19 10:43:50 -07:00
Mateusz Guzik
c5c981d443 signals: plug a set-but-not-used var
Sponsored by:	Rubicon Communications, LLC ("Netgate")
2022-04-19 12:45:57 +00:00
John Baldwin
d139909d6e destroy_dev_sched*: Don't hold Giant for all deferred destroy_dev.
Rather than using taskqueue_swi_giant which holds Giant for all
deferred destroy_dev calls, create a separate queue for destroyed
devices with D_NEEDGIANT set in the corresponding cdevsw.  The task
for this queue holds Giant whild destroying deferred devices while the
task for the default queue does not hold Giant.

In addition, switch to taskqueue_thread for destroy_dev_sched.
Deferred destroy_dev requests don't need to run at an SWI priority.

Reviewed by:	imp, markj
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D34915
2022-04-18 12:04:30 -07:00
Konstantin Belousov
362ff9867e Revert rest of a5970a529c: use vrefact() when working on fp->f_vnode
Now, since O_PATH-opened file descriptors use use references instead
of the hold references, vrefact() chahges from that revision can be
reverted.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D34906
2022-04-15 16:56:20 +03:00
Ed Maste
f99cc5a389 sysent: regen after 52a1d90c8b, posix_fadvise in capmode 2022-04-14 15:17:36 -04:00
Ed Maste
52a1d90c8b Allow posix_fadvise in capability mode
posix_fadvise operates only on a provided fd.  Noted by
Mathieu <sigsys@gmail.com> in review D34761.

No new CAP_ rights are added for posix_fadvise(), as 'advice' in
general only influences when I/O happens; the fd must have existing
CAP_ rights for actual data access.

Reviewed by:	markj
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D34903
2022-04-14 15:11:21 -04:00
Konstantin Belousov
bf13db086b Mostly revert a5970a529c: Make files opened with O_PATH to not block non-forced unmount
Problem is that open(O_PATH) on nullfs -o nocache is broken then,
because there is no reference on the vnode after the open syscall exits.

Reported and tested by:	ambrisko
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2022-04-14 02:47:04 +03:00
John Baldwin
36fb372264 kern: Move variables only used for MAC under #ifdef MAC. 2022-04-13 16:08:23 -07:00
John Baldwin
4aec198420 sched_ule: Inline value of ts in sched_thread_priority.
This avoids a set but unused warning in kernels without SMP where
TDQ_CPU() doesn't use its argument.
2022-04-13 16:08:23 -07:00
John Baldwin
8758ac757f sched_4bsd: ts is only used in sched_bind for SMP. 2022-04-13 16:08:22 -07:00
John Baldwin
72ff256c51 sched_4bsd: Remove unused variables. 2022-04-12 14:58:59 -07:00
John Baldwin
dbd51c416a realloc(9): Move slab and zone under #ifndef DEBUG_REDZONE. 2022-04-12 14:58:59 -07:00
Mark Johnston
d769609620 tty: Remove an incorrect assertion from ttyinq_line_iterate()
We may legitimately have tib == NULL if we're at the very end of the
queue.

PR:		215373
Reported by:	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2022-04-12 17:30:04 -04:00
Tom Jones
1ea833a572 kdb: set kdb_why when entered via reboot and panic
Reviewed by:	jhb
Sponsored by:   NetApp, Inc.
Sponsored by:   Klara, Inc.
X-NetApp-PR:    #74
Differential Revision:	https://reviews.freebsd.org/D34551
2022-04-12 10:34:40 +01:00
Dmitry Chagin
c6487446d7 getdirentries: return ENOENT for unlinked but still open directory.
To be more compatible to IEEE Std 1003.1-2008 (“POSIX.1”).

Reviewed by:		mjg, Pau Amma (doc)
Differential revision:  https://reviews.freebsd.org/D34680
MFC after:		2 weeks
2022-04-11 23:30:16 +03:00
Konstantin Belousov
eca39864f7 Add sysctl KERN_LOCKF
reporting the shapshot of the active advisory locks.

A new VFS ops method vfs_report_lockf if provided in the mount point
op table.  If it is NULL, as it is currently for all existing
filesystems, vfs_report_lockf() function is used, which gathers
information from the standard implementation inside kern/kern_lockf.c.

Filesystems implementing its own locking (NFSv4 as example) can provide
a custom implementation.

Reviewed by:	markj, rmacklem
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D34756
2022-04-10 00:43:53 +03:00
Konstantin Belousov
147e4fe3f1 kern_lockf.c: remove no longer neeeded UFS headers
Reviewed by:	markj, rmacklem
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D34756
2022-04-10 00:43:53 +03:00
Konstantin Belousov
59e85819be lockf: remove lf_inode from struct lockf_entry
The UFS-specific struct inode cannot be used in generic advisory lock
code.  It was probably used as a shortcut for the debugging, as the
remnants of the code around it indicates.

Use somewhat more verbose and less concentrated, but universal,
VOP_PRINT(), where needed.

Reviewed by:	markj, rmacklem
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D34756
2022-04-10 00:43:53 +03:00
Gordon Bergling
f171938cd6 jail: Remove a double word in a source code comment
- s/a a/a/

MFC after:	3 days
2022-04-09 14:19:17 +02:00
Gordon Bergling
c3721292e3 kern: Remove a double word in a source code comment
- s/for for/for/

MFC after:	3 days
2022-04-09 10:50:04 +02:00
Gordon Bergling
768f9b8b8b kern: Fix a typo in a source code comment
- s/is is/is/

MFC after:	3 days
2022-04-09 09:14:14 +02:00
Andrew Turner
41e6d2091c Enable subr_physmem_test on supported architectures
Only build where it's supported.

While here add support for amd64 to help with testing.

Sponsored by:	The FreeBSD Foundation
2022-04-07 14:31:51 +01:00
Andrew Turner
d8bff5b67c Handle non-page aligned/sized memory in physmem
In some configurations the firmware may pass memory regions that are
not page sized or aligned, e.g. when using 16k pages on arm64. If this
is the case we will calculate many small regions because the alignment
is applied before being inserted. As we round the start up and end down
this will leave a 1 page hole between what should have been a single
region.

Fix by keeping the original alignment until we are just about to insert
the region into the avail array.

Sponsored by:	The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34694
2022-04-06 14:13:29 +01:00
Andrew Turner
8c99dfed54 Port subr_physmem to userspace and add tests
These give us some confidience we haven't broken anything in early
boot code that may be running before the console.

Reviewed by:	emaste
Sponsored by:	The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34691
2022-04-06 14:13:05 +01:00
Mitchell Horne
eb9d205fa6 livedump: add event handler hooks
Add three hooks to the livedump process: before, after, and for each
block of dumped data. This allows, for example, quiescing the system
before the dump begins or protecting data of interest to ensure its
consistency in the final output.

Reviewed by:	markj, kib (previous version)
Reviewed by:	debdrup (manpages)
Reviewed by:	Pau Amma <pauamma@gundo.com> (manpages)
MFC after:	3 weeks
Sponsored by:	Juniper Networks, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D34067
2022-04-05 15:35:05 -03:00
Mitchell Horne
c9114f9f86 Add new vnode dumper to support live minidumps
This dumper can instantiate and write the dump's contents to a
file-backed vnode.

Unlike existing disk or network dumpers, the vnode dumper should not be
invoked during a system panic, and therefore is not added to the global
dumper_configs list. Instead, the vnode dumper is constructed ad-hoc
when a live dump is requested using the new ioctl on /dev/mem. This is
similar in spirit to a kgdb session against the live system via
/dev/mem.

As described briefly in the mem(4) man page, live dumps are not
guaranteed to result in a usuable output file, but offer some debugging
value where forcefully panicing a system to dump its memory is not
desirable/feasible.

A future change to savecore(8) will add an option to save a live dump.

Reviewed by:	markj, Pau Amma <pauamma@gundo.com> (manpages)
Discussed with:	kib
MFC after:	3 weeks
Sponsored by:	Juniper Networks, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D33813
2022-04-05 15:35:05 -03:00
Mitchell Horne
59c27ea18c Split out dumper allocation from list insertion
Add a new function, dumper_create(), to allocate a dumper.
dumper_insert() will call this function and retains the existing
behaviour.

This is desirable for performing live dumps of the system. Here, there
is a need to allocate and configure a dumper structure that is invoked
outside of the typical debugger context. Therefore, it should be
excluded from the list of panic-time dumpers.

free_single_dumper() is made public and renamed to dumper_destroy().

Reviewed by:	kib, markj
MFC after:	1 week
Sponsored by:	Juniper Networks, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D34068
2022-04-05 15:35:05 -03:00
Mateusz Guzik
b7262756e2 vfs: fixup WANTIOCTLCAPS on open
In some cases vn_open_cred overwrites cn_flags, effectively nullifying
initialisation done in NDINIT. This will have to be fixed.

In the meantime make sure the flag is passed.

Reported by:	jenkins
Noted by:	Mathieu <sigsys@gmail.com>
2022-04-02 20:49:01 +02:00
Gordon Bergling
c9b04ee4f8 kern: Fix two typos in source code comments
- s/accomodate/accommodate/

MFC after:	3 days
2022-04-02 14:52:49 +02:00
Gordon Bergling
7181887e82 kern: Fix two typos in source code comments
- s/measurment/measurement/

MFC after:	3 days
2022-04-02 14:15:27 +02:00
Mateusz Guzik
0c805718cb vfs: fix memory leak on lookup with fds with ioctl caps
Reviewed by:	markj
PR:		262515
Noted by:	firk@cantconnect.ru
Differential Revision:	https://reviews.freebsd.org/D34667
2022-04-02 12:09:07 +00:00
Gordon Bergling
669d5ea4e3 kern: Fix a typo in a source code comment
- s/paniced/panicked/

MFC after:	3 days
2022-04-02 10:15:02 +02:00
Ed Maste
e5821a2156 syscalls.master: remove obsolete comment about compatibility tables
Compatibility ABIs no longer use a separate syscalls.master.

Fixes:		be67ea40c5 ("freebsd32: generate from ...")
Sponsored by:	The FreeBSD Foundation
2022-03-30 11:07:00 -04:00
Brooks Davis
8601fca789 sysent: regen for syscallarg_t 2022-03-28 19:43:03 +01:00
Brooks Davis
b1ad6a9000 syscallarg_t: Add a type for system call arguments
This more clearly differentiates system call arguments from integer
registers and return values. On current architectures it has no effect,
but on architectures where pointers are not integers (CHERI) and may
not even share registers (CHERI-MIPS) it is necessiary to differentiate
between system call arguments (syscallarg_t) and integer register values
(register_t).

Obtained from:	CheriBSD

Reviewed by:	imp, kib
Differential Revision:	https://reviews.freebsd.org/D33780
2022-03-28 19:43:03 +01:00
Andrew Turner
f461b95561 Fix a sign mismatch warning in the physmem code
Make sure both sides of a comparison are unsigned. As the values being
compared are size_t make the the value in the for loop size_t too.

Sponsored by:	The FreeBSD Foundation
2022-03-28 11:51:09 +01:00
Mateusz Guzik
2533b5dc82 vfs: add missing bits to vdropl_impl
This completes the patch which was originally meant to go in.

Spotted by:	mhorne
Fixes: c35ec1efdc ("vfs: [1/2] fix stalls in vnode reclaim by not
requeieing from vnlru")
2022-03-27 14:35:37 +00:00
Mateusz Guzik
a4032e2a69 vfs: assorted tidy ups to lookup
No functional changes.
2022-03-26 17:06:09 +00:00
Alexander Leidinger
aeb91e95cf Log euid, rgid and jail on listen queue overflow
If you have numerous jails with multiple similar services running,
this helps to narrow down which services this log is referring to.
2022-03-26 11:17:55 +01:00
Eric van Gyzen
aca2a7faca stack_zero is not needed before stack_save
The man page was recently clarified to commit to this contract.

MFC after:	1 week
Sponsored by:	Dell EMC Isilon
2022-03-25 20:10:38 -05:00
Eric van Gyzen
863070bbf6 ksiginfo_alloc: pass M_WAITOK or M_NOWAIT to uma_zalloc
It expects exactly one of those flags.  A future commit will assert this.

Reviewed by:	rstone
MFC after:	1 month
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D34451
2022-03-25 20:10:37 -05:00
Mateusz Guzik
0f60088399 vfs: set cn_namelen when handling degenerate lookups
Turns out execve looks at it to store binary name, but in order to
trigger the problem one has to be trying to exec '/'. As is the value
would be left uninitialized (or rather set to -1 on debug kernels).

Fixes:	56244d3574 ("vfs: hoist degenerate path lookups out of the
loop")
2022-03-25 18:19:36 +00:00
Mateusz Guzik
4ef6e56ae8 vfs: hoist trailing slash handling out of the loop 2022-03-24 14:36:31 +00:00
Mateusz Guzik
3b6792d28a vfs: factor symlink traversal out of namei
The intent down the road is to eliminate the loop to begin with,
pushing traversal down to vfs_lookup, all while not allocating the
extra buffer.
2022-03-24 13:11:22 +00:00
Mateusz Guzik
d9ea7e2b1e vfs: factor FAILIFEXISTS handling out of vfs_lookup 2022-03-24 11:22:20 +00:00
Mateusz Guzik
56244d3574 vfs: hoist degenerate path lookups out of the loop 2022-03-24 11:22:12 +00:00
Mateusz Guzik
bb92cd7bcd vfs: NDFREE(&nd, NDF_ONLY_PNBUF) -> NDFREE_PNBUF(&nd) 2022-03-24 10:20:51 +00:00
Mark Johnston
1babcad6bc elf: Avoid dumping uninitialized bytes in PRSTATUS core dump notes
elf_prstatus_t contains pad space.

Reported by:	KMSAN
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D34606
2022-03-23 12:53:49 -04:00
Mark Johnston
7524994da0 callout: Remove the CS_EXECUTING flag
It is now unused.

MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D34626
2022-03-23 12:37:02 -04:00
Mark Johnston
b319171861 setitimer: Fix exit race
We use the p_itcallout callout, interlocked by the proc lock, to
schedule timeouts for the setitimer(2) system call.  When a process
exits, the callout must be stopped before the process struct is
recycled.

Currently we attempt to stop the callout in exit1() with the call
_callout_stop_safe(&p->p_itcallout, CS_EXECUTING).  If this call returns
0, then we sleep in order to drain the callout.  However, this happens
only if the callout is not scheduled at all.  If the callout thread is
blocked on the proc lock, then exit1() will not block and the callout
may execute after the process has fully exited, typically resulting in a
panic.

I cannot see a reason to use the CS_EXECUTING flag here.  Instead, use
the regular callout_stop()/callout_drain() dance to halt the callout.

Reported by:	ler
Tested by:	ler, pho
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D34625
2022-03-23 12:36:12 -04:00
Alexander Motin
fd6ca665d2 Fix umtxq_sleep() regression caused by 56070dd2e4.
umtxq_requeue() moves the queue to a different hash chain and different
lock, so we can't rely on msleep_sbt() reacquiring the same old lock.
We have to use PDROP and update the queue chain and so lock pointer.

PR:		262587
MFC after:	2 weeks
2022-03-21 19:55:55 -04:00
firk
bb53dd56c3 kern_tc.c/cputick2usec() (which is used to calculate cputime from
cpu ticks) has some imprecision and, worse, huge timestep (about
20 minutes on 4GHz CPU) near 53.4 days of elapsed time.

kern_time.c/cputick2timespec() (it is used for clock_gettime() for
querying process or thread consumed cpu time) Uses cputick2usec()
and then needlessly converting usec to nsec, obviously losing
precision even with fixed cputick2usec().

kern_time.c/kern_clock_getres() uses some weird (anyway wrong)
formula for getting cputick resolution.

PR:		262215
Reviewed by:	gnn
Differential Revision:	https://reviews.freebsd.org/D34558
2022-03-21 09:33:46 -04:00
Andrew Turner
cab496e16c Make SHMMAXPGS an unsigned long
This is used to calculate sizes that are then stored in unsigned long
fields. Make this unsigned long so the calculations use this type and
not an int that can lead to an integer overflow with a large PAGE_SIZE.

This allows building this on arm64 with PAGE_SIZE of 16k. Further work
will be needed if a 32-bit architecture tries to use a similar sized
page.

Sponsored by:	The FreeBSD Foundation
2022-03-21 10:27:35 +00:00
Colin Percival
2406867f5b tslog: Add CTLFLAG_SKIP to sysctls
The timestamp logs are quite large (often much larger than all the
other sysctls combined) so it's unlikely anyone will want to have
them displayed by `sysctl -a`.

MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D34616
2022-03-20 11:31:16 -07:00
Mateusz Guzik
6ff3e8a316 cache: add a comment about a realpath bug 2022-03-19 15:11:25 +00:00
Mateusz Guzik
eb574ba0b6 vfs: replace VFS_NOTIFY_UPPER_* macros with an enum 2022-03-19 13:15:55 +00:00
Mateusz Guzik
cceb91b025 vfs: add missing flags to db show mount 2022-03-19 12:04:44 +00:00
Mateusz Guzik
93a0ba8f49 vfs: retire the no longer used MNTK_LOOKUP_EXCL_DOTDOT flag
Reviewed by:	markj
Tested by:	pho (previous version)
Differential Revision:	https://reviews.freebsd.org/D34466
2022-03-19 10:47:29 +00:00
Mateusz Guzik
1cb0045c97 vfs: add MNTK_UNLOCKED_INSMNTQUE
Can be used when the fs at hand can synchronize insmntque with other
means than the vnode lock.

Reviewed by:	markj
Tested by:	pho (previous version)
Differential Revision:	https://reviews.freebsd.org/D34466
2022-03-19 10:46:40 +00:00
firk
28d08dc7d0 clock_gettime: Fix CLOCK_THREAD_CPUTIME_ID race
Use a spinlock section instead of a critical section to synchronize with
statclock().  Otherwise the CLOCK_THREAD_CPUTIME_ID clock can appear to
go backwards.

PR:		262273
Reviewed by:	markj
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D34568
2022-03-17 15:39:00 -04:00
Mark Johnston
fc7e121d88 file: Move FILEDESC_FOREACH macros to kern_descrip.c
They are only used in kern_descrip.c, so make them private.  No
functional change intended.

Discussed with:	mjg
Sponsored by:	The FreeBSD Foundation
2022-03-17 15:39:00 -04:00
Mark Johnston
c702242292 file: Avoid a read-after-free of fd tables in sysctl handlers
Some loops access the fd table of a different process, and drop the
filedesc lock while iterating, so they check the table's refcount.
However, we access the table before the first iteration, in order to get
the number of table entries, and this access can be a use-after-free.

Fix the problem by checking the refcount before we start iterating.

Reported by:	pho
Reviewed by:	mjg
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D34575
2022-03-17 15:39:00 -04:00
Mateusz Guzik
0134bbe56f vfs: prefix lookup and relookup with vfs_
Reviewed by:	imp, mckusick
Differential Revision:		https://reviews.freebsd.org/D34530
2022-03-13 14:44:39 +00:00
Mateusz Guzik
02fc4e319c cache: use flexible array member
... instead of 0-sizing the array
2022-03-13 14:43:35 +00:00
John Baldwin
6b71405bfe Store core dump notes for all valid register sets for FreeBSD processes.
In particular, use a generic wrapper around struct regset rather than
requiring per-regset helpers.  This helper replaces the MI
__elfN(note_prstatus) and __elfN(note_fpregset) helpers.  It also
removes the need to explicitly dump NT_ARM_ADDR_MASK in the arm64
__elfN(dump_thread).

Reviewed by:	markj, emaste
Sponsored by:	University of Cambridge, Google, Inc.
Differential Revision:	https://reviews.freebsd.org/D34446
2022-03-10 15:40:19 -08:00
Kornel Duleba
b344de4d0d Extend device_get_property API
In order to support various types of data stored in device
tree properties or ACPI _DSD packages, create a new enum so
the caller can specify the expected type of a property they
want to read, according to the binding. The bus logic will use
that information to process the underlying data.

For example in DT all integer properties are stored in BE format.
In order to get constant results across different platforms we
need to convert its endianness to match the host.

Another example are ACPI_TYPE_INTEGER properties stored
as uint64_t. Before this patch the ACPI logic would refuse
to read them if the provided buffer was smaller than 8 bytes.
Now this can be handled by using DEVICE_PROP_UINT32 type.

Modify the existing consumers of this API to reflect the changes
and update the man pages accordingly.

Reviewed by: mw
Obtained from: Semihalf
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D33457
2022-03-10 12:11:32 +01:00
Kornel Duleba
206dc82bc3 bus_if: Add a default implementation of get_property
There are multiple buses that pretend to be ofw compatible,
e.g ofw_pci, mii_fdt. We now need to provide an implementation
of BUS_GET_PROPERTY for every one of them. Instead of modifying
them one by one it's better to just provide a default
implementation that simply traverses up the device tree.
Remove the now unneeded BUS_GET_PROPERTY implementation in mii_fdt.

Reviewed by: andrew, bz
Obtained from: Semihalf
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D34031
2022-03-10 12:11:32 +01:00
Mateusz Guzik
3a4c5dab92 vfs: [2/2] fix stalls in vnode reclaim by only counting attempts
... and ignoring if they succeded, which matches historical behavior.

Reported by:	pho
2022-03-10 09:41:50 +00:00
Mateusz Guzik
c35ec1efdc vfs: [1/2] fix stalls in vnode reclaim by not requeieing from vnlru
Reported by:	pho
2022-03-10 09:41:50 +00:00
Ed Maste
080b4e8a0c kcov: use __func__ in KASSERT instead of old function name
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2022-03-07 10:47:27 -05:00
Mark Johnston
afb44cb010 rmlock: Temporarily revert commit c84bb8cd77
It appears to have introduced a regression on arm64, possibly due to the
fact that the pcpu pointer is reloaded outside of the critical section
in _rm_rlock().  Until this is resolved one way or another, let's
revert.

Reported by:	Ronald Klop <ronald-lists@klop.ws>
Sponsored by:	The FreeBSD Foundation
2022-03-07 10:43:19 -05:00
Mark Johnston
8dbae4ce32 linker: Permit CTFv3 containers
Reviewed by:	Domagoj Stolfa
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D34362
2022-03-07 10:43:19 -05:00
Mark Johnston
cab9382a2c linker: Simplify CTF container handling
Use sys/ctf.h to provide various definitions required to parse the CTF
header.  No functional change intended.

Reviewed by:	Domagoj Stolfa, emaste
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D34359
2022-03-07 10:43:18 -05:00
Konstantin Belousov
1fb00c8f10 buf_alloc(): Stop using LK_NOWAIT, use LK_NOWITNESS
Despite the buffer taken from cache or free list, it still can be
locked, due to 'lockless lookup' in getblkx() potentially operating on
the freed buffers.  The lock is transient, but prevents the use of
LK_NOWAIT there for the goal of neutralizing WITNESS.

Just use LK_NOWITNESS.

Reported and tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	3 days
2022-03-06 10:29:31 -05:00
Alexander Motin
56070dd2e4 Improve timeout precision of pthread_cond_timedwait().
This code was not touched when all other user-space sleep functions were
switched to sbintime_t and decoupled from hardclock.  When it is possible,
convert supplied times into sbinuptime to supply directly to msleep_sbt()
with C_ABSOLUTE.  This provides the timeout resolution of few microseconds
instead of 2 milliseconds, plus avoids few clock reads and conversions.

Reviewed by:	vangyzen
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D34163
2022-03-03 22:03:09 -05:00
John Baldwin
0b25cbc79d Fix the size returned for NT_FPREGSET.
Sponsored by:	University of Cambridge, Google, Inc.
2022-03-03 17:53:06 -08:00
John Baldwin
9af41803cb Use vnsz2log directly in assertion on its relation to sizeof(struct vnode).
This reduces the size of diffs required to support different values of
vnsz2log.  In CheriBSD, kernels for CHERI architectures have vnodes
larger than 512 bytes and require a value of 9.

Reviewed by:	mjg
Obtained from:	CheriBSD
Sponsored by:	University of Cambridge, Google, Inc.
Differential Revision:	https://reviews.freebsd.org/D34418
2022-03-03 17:52:07 -08:00
Mateusz Guzik
afb08a6d07 cache: hide hash stats behind DEBUG_CACHE
They take a long time to dump and hinder sysctl -a when used with
DIAGNOSTIC.
2022-03-03 17:21:58 +00:00
Mateusz Guzik
f3f3e3c44d fd: add close_range(..., CLOSE_RANGE_CLOEXEC)
For compatibility with Linux.

MFC after:	3 days
Reviewed by:	markj
Differential Revision:	https://reviews.freebsd.org/D34424
2022-03-03 17:21:58 +00:00
Mark Johnston
879b0604a8 proc: Remove assertion that P_WEXIT is not set in proc_rwmem()
exit1() sets P_WEXIT before waiting for holding threads to finish,
rather than after, so this assertion is racy.

Fixes:	12fb39ec3e ("proc: Relax proc_rwmem()'s assertion on the process hold count")
Reported by:	Jenkins
2022-03-01 15:09:45 -05:00
Mark Johnston
12fb39ec3e proc: Relax proc_rwmem()'s assertion on the process hold count
This reference ensures that the process and its associated vmspace will
not be destroyed while proc_rwmem() is executing.  If, however, the
calling thread belongs to the target process, then it is unnecessary to
hold the process.  In particular, fasttrap - a module which enables
userspace dtrace - may frequently call proc_rwmem(), and we'd prefer to
avoid the overhead of locking and bumping the hold count when possible.

Thus, make the assertion conditional on "p != curproc".  Also assert
that the process is not already exiting.  No functional change intended.

MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2022-03-01 12:40:35 -05:00
Warner Losh
b36bd3a906 bus: Create dev_wired_cache
A simple cache to cache differnet locators to the same device.

Sponsored by:		Netflix
Changes Suggested by:	jhb
Differential Revision:	https://reviews.freebsd.org/D32783
2022-03-01 08:06:41 -07:00
Warner Losh
cae7d9ec83 bus: Add ACPI locator support
Add support for printing ACPI paths. This is a bit of a degenerate case
for this interface since it's always just the device handle if the
device has one. But it is illustrtive of how to do this for a few nodes
in the tree.

Sponsored by:		Netflix
Reviewed by:		jhb
Differential Revision:	https://reviews.freebsd.org/D32748
2022-03-01 08:06:41 -07:00
Warner Losh
38e942a345 devctl: Add DEV_GET_PATH
DEV_GET_PATH will get the path to a device based on different locators.

Sponsored by:		Netflix
Reviewed by:		jhb
Differential Revision:	https://reviews.freebsd.org/D32745
2022-03-01 08:06:41 -07:00
Warner Losh
e19db70769 bus: Introduce the bus interface get_device_path
This returns the full path of a the child device requested. Since
there's different ways to recon the entire path, include a 'locator'
method. The default 'FreeBSD' method uses a filesystem-like path name
with each device to the root node separated by /. Other locators will be
UEFI, ACPI and fdt, though others are possible in the future. Make the
locator a string to allow maximum flexibility.

Sponsored by:		Netflix
Reviewed by:		jhb
Differential Revision:	https://reviews.freebsd.org/D32744
2022-03-01 08:06:40 -07:00
Warner Losh
78408171bd devctl2: Change to 644 protections
We make sure that we check for device privs (usually meaning root or
better) for everything. To allow other functions that don't require
this, default to 644 protection.

Sponsored by:		Netflix
Reviewed by:		jhb
Differential Revision:	https://reviews.freebsd.org/D32863
2022-03-01 08:06:40 -07:00
Mark Johnston
89ae8eb74e rmlock: Add required compiler barriers to _rm_runlock()
Also remove excessive whitespace in _rm_rlock().

Reviewed by:	jah, mjg
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D34381
2022-03-01 09:38:45 -05:00
Warner Losh
9891cb1e76 Eliminate curlen, it's set but never used
Sponsored by:		Netflix
2022-02-27 09:02:45 -07:00
Mark Johnston
c84bb8cd77 rmlock: Micro-optimize read locking
Use get_pcpu() instead of an open-coded pcpu_find(td->td_oncpu).  This
eliminates some memory accesses and results in a shorter instruction
sequence.  Note that get_pcpu() didn't exist when rmlocks were added.

Reviewed by:	jah, mjg
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D34377
2022-02-25 13:55:24 -05:00
Marvin Ma
1517b8d5a7 vfs_unregister: fix error handling
Due to misplaced braces, an error from vfs_uninit() in the VFCF_SBDRY
case was ignored.

Reported by:	Anton Rang <rang@acm.org>
Reviewed by:	Anton Rang <rang@acm.org>, markj
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D34375
2022-02-25 12:19:14 -06:00
Warner Losh
2bfdc1ee9b cons: Use bool for boolean variables
MFC After:		3 days
Sponsored by:		Netflix
2022-02-24 10:58:07 -07:00
Jamie Gritton
d7c4ea7d72 posixshm: Allow jails to use kern.ipc.posix_shm_list
PR:		257554
Reported by:	grembo@
2022-02-24 09:30:49 -08:00
Gleb Smirnoff
d2b3a0ed31 sendto: don't clear transient errors for atomic protocols
The changeset 65572cade3 uncovered the fact that top layer of sendto(2)
would clear a transient error code if some data was copied out of uio.
The clearing of the error makes sense for non-atomic protocols, since
they have sent some data.  The atomic protocols send all or nothing.

The current implementation of unix/dgram uses sosend_generic(), which
would always copyout and only then it may fail to deliver a message.
The sosend_dgram(), currently used by UDP only, also has same behavior.

Reported by:		pho
Reviewed by:		pho, markj
Differential revision:	https://reviews.freebsd.org/D34309
2022-02-23 10:24:14 -08:00
Andrew Turner
d58b79d1ce Built all KCSAN atomic interceptors on arm64
These atomic functions are now supported. Add them to KCSAN.

Sponsored by:	The FreeBSD Foundation
2022-02-23 14:45:47 +00:00
Mateusz Guzik
f17ef28674 fd: rename fget*_locked to fget*_noref
This gets rid of the error prone naming where fget_unlocked returns with
a ref held, while fget_locked requires a lock but provides nothing in
terms of making sure the file lives past unlock.

No functional changes.
2022-02-22 18:53:43 +00:00
Robert Wing
0a2f498234 tty: fix a panic with INVARIANTS
watch'ing a tty triggers a refcount wraparound panic, take a reference
on fp after fget_cap_locked() to fix.

Reported by:    Michael Jung <mikej_at_paymentallianceintl.com>
Reviewed by:	hselasky, mjg
Fixes:          f40dd6c803 ("tty: switch ttyhook_register to use fget_cap_locked")
Differential Revision:	https://reviews.freebsd.org/D34335
2022-02-22 09:37:13 -09:00
Mitchell Horne
5a8fceb3bd boottrace: trace annotations for startup and shutdown
Add trace events for execution of SYSINITs (both static and dynamically
loaded), and to the various steps in the shutdown/panic/reboot paths.

Sponsored by:	NetApp, Inc.
Sponsored by:	Klara, Inc.
X-NetApp-PR:	#23
Differential Revision:	https://reviews.freebsd.org/D30187
2022-02-21 20:15:57 -04:00
Mitchell Horne
0aa9ffcd9c init_main.c: sort includes
This is preferred by style(9). Do this ahead of adding another include.

Reviewed by:	imp, kevans, allanjude
MFC after:	3 days
Sponsored by:	NetApp, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D30186
2022-02-21 20:15:51 -04:00
Mitchell Horne
877eea429b kern_linker.c: sort includes
This is preferred by style(9). Do this ahead of adding another include.

Reviewed by:	imp, kevans
MFC after:	3 days
Sponsored by:	NetApp, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D30185
2022-02-21 20:15:51 -04:00
Mitchell Horne
da5b7e90e7 boottrace: a simple boot and shutdown-time tracing facility
Boottrace is a facility for capturing trace events during boot and
shutdown. This includes kernel initialization, as well as rc. It has
been used by NetApp internally for several years, for catching and
diagnosing slow devices or subsystems. It is driven from userspace by
sysctl interface, and the output is a human-readable log of events
(kern.boottrace.log).

This commit adds the core boottrace functionality implementing these
interfaces. Adding the trace annotations themselves to kernel and
userland will happen in follow-up commits. A future commit will also add
a boottrace(4) man page.

For now, boottrace is unconditionally compiled into the kernel but
disabled by default. It can be enabled by setting the
kern.boottrace.enabled tunable to 1 in loader.conf(5).

There is an existing boot-time event tracing facility, which can be
compiled into the kernel with 'options TSLOG'. While there is some
functional overlap between this and boottrace, they are distinct. TSLOG
is suitable for generating detailed timing information and flamegraphs,
and has been used to great success by cperciva@ to diagnose and reduce
the overall system boot time. Boottrace aims to more quickly provide an
overview of timing and resource usage of the boot (and shutdown) process
to a sysadmin who requires this knowledge.

Sponsored by:	NetApp, Inc.
Sponsored by:	Klara, Inc.
X-NetApp-PR:	#23
Differential Revision:	https://reviews.freebsd.org/D30184
2022-02-21 20:15:45 -04:00
Mateusz Guzik
e68a5225e8 fd: add fde_copy
To dedup handrolled memcpy. This will be used later to make fd code
atomic-clean.
2022-02-15 17:51:08 +00:00
Mateusz Guzik
ec12b4f4ff fd: add missing seqc to dupfdopen 2022-02-15 17:51:08 +00:00
Mateusz Guzik
c9a995994b seqc: rename seqc_consistent_nomb to seqc_consistent_no_fence
For more consistency with other primitives.
2022-02-15 17:51:07 +00:00
John Baldwin
becaf6433b Use vmspace->vm_stacktop in place of sv_usrstack in more places.
Reviewed by:	markj
Obtained from:	CheriBSD
Differential Revision:	https://reviews.freebsd.org/D34174
2022-02-14 10:57:30 -08:00
Gleb Smirnoff
65572cade3 unix/dgram: return EAGAIN instead of ENOBUFS when O_NONBLOCK set
This is behavior what some programs expect and what Linux does.  For
example nginx expects EAGAIN when sending messages to /var/run/log,
which it connects to with O_NONBLOCK.  Particularly with nginx the
problem is magnified by the fact that a ENOBUFS on send(2) is also
logged, so situation creates a log-bomb - a failed log message
triggers another log message.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D34187
2022-02-14 09:21:55 -08:00
Mark Johnston
852ff943b9 sleepqueue: Annotate sleepq_max_depth as static
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2022-02-14 10:06:47 -05:00
Mark Johnston
893be9d8ac sleepqueue: Address a lock order reversal
After commit 74cf7cae4d ("softclock: Use dedicated ithreads for
running callouts."), there is a lock order reversal between the per-CPU
callout lock and the scheduler lock.  softclock_thread() locks callout
lock then the scheduler lock, when preparing to switch off-CPU, and
sleepq_remove_thread() stops the timed sleep callout while potentially
holding a scheduler lock.  In the latter case, it's the thread itself
that's locked, and if the thread is sleeping then its lock will be a
sleepqueue lock, but if it's still in the process of going to sleep
it'll be a scheduler lock.

We could perhaps change softclock_thread() to try to acquire locks in
the opposite order, but that'd require dropping and re-acquiring the
callout lock, which seems expensive for an operation that will happen
quite frequently.  We can instead perhaps avoid stopping the
td_slpcallout callout if the thread is still going to sleep, which is
what this patch does.  This will result in a spurious call to
sleepq_timeout(), but some counters suggest that this is very rare.

PR:		261198
Fixes:		74cf7cae4d ("softclock: Use dedicated ithreads for running callouts.")
Reported and tested by:	thj
Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D34204
2022-02-14 10:06:47 -05:00
Mateusz Guzik
6aa246e605 vfs: convert vnsz2log to a macro 2022-02-13 13:07:08 +00:00
Mateusz Guzik
5c31025060 fd: use FILEDESC_FOREACH_{FDE,FP} where appropriate 2022-02-13 13:07:08 +00:00
Mateusz Guzik
809f3121be fd: assign fd_freefile early when copying
This is to simplify an upcomming change.
2022-02-13 13:07:08 +00:00
Mateusz Guzik
893d20c95a fd: move fd table sizing out of fdinit
now it is placed with the rest of actual initialisation
2022-02-13 13:07:08 +00:00
Mateusz Guzik
4103c3cd5b fd: drop volatile keyword from refcounts
While here move a comment where it belongs and do small whitespace clean up.
2022-02-13 13:07:08 +00:00
Mateusz Guzik
b53133a778 proc: load/store p_cowgen using atomic primitives 2022-02-13 13:07:08 +00:00
Mateusz Guzik
29ee49f66b thread: remove dead store from thread_cow_update 2022-02-13 13:07:08 +00:00
John Baldwin
cd0525f615 ktls: Write-lock the INP when changing a transmit TLS session.
The TCP rate pacing code relies on being able to read this pointer
safely while holding an INP lock.  The initial TLS session pointer is
set while holding the write lock already.

Reviewed by:	gallatin, hselasky
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D34086
2022-02-11 15:16:25 -08:00
Mateusz Guzik
1d65a9b47e cache: improve vnode vs name assertion in cache_enter_time 2022-02-11 12:29:26 +00:00
Mateusz Guzik
611470a515 cache: remove NOCACHE handling from cache_fplookup_noentry
It was copy-pasted from locked lookup. As LOOKUP operation cannot have
the flag set it was always ending up setting MAKEENTRY.
2022-02-11 12:29:26 +00:00
Mateusz Guzik
513c7a6e0c fd: make fget_unlocked take a thread argument
Just like other fget routines. This enables embedding fd table pointer
in struct thread, avoiding taking a trip through proc.
2022-02-11 12:29:26 +00:00
Mateusz Guzik
45bb8beacc fd: elide one acquire fence in fget_unlocked_seq
Still validate we got the stable state before returning an error though.
2022-02-11 12:29:26 +00:00
Mateusz Guzik
62849eef5b fd: split fget_unlocked_seq depending on CAPABILITIES
This will simplify an upcoming change.
2022-02-11 12:27:22 +00:00
Mateusz Guzik
b937908e41 fd: split fget_cap depending on CAPABILITIES
This will simplify an upcoming change.
2022-02-11 12:13:27 +00:00
Mateusz Guzik
f40dd6c803 tty: switch ttyhook_register to use fget_cap_locked
It is still wrong-ish as fget* funcs don't expect to operate on abitrary
file descriptor tables, but this at least moves it out of the way of an
upcoming change while being bug-compatible.
2022-02-11 12:13:27 +00:00
Mateusz Guzik
93288e2445 Employ thread_cow_synced in setrlimit
In order to avoid proc lock/unlock on next kernel entry.
2022-02-11 11:44:07 +00:00
Mateusz Guzik
32114b639f Add PROC_COW_CHANGECOUNT and thread_cow_synced
Combined they can be used to avoid a proc lock/unlock cycle in the
syscall handler for curthread, see upcoming examples.
2022-02-11 11:44:07 +00:00
Mateusz Guzik
8a0cb04df4 Add lim_cowsync, similar to crcowsync 2022-02-11 11:44:07 +00:00
Justin Hibbits
6db44b0158 Fix gzip compressed core dumps on big endian architectures
The gzip trailer words (size and CRC) are both little-endian per the spec.

MFC after:	3 days
Sponsored by:	Juniper Networks, Inc.
2022-02-10 09:34:37 -06:00
Dimitry Andric
7d8a4eb943 tty_info: Avoid warning by using logical instead of bitwise operators
Since TD_IS_RUNNING() and TS_ON_RUNQ() are defined as logical
expressions involving '==', clang 14 warns about them being checked with
a bitwise operator instead of a logical one:

```
sys/kern/tty_info.c:124:9: error: use of bitwise '|' with boolean operands [-Werror,-Wbitwise-instead-of-logical]
        runa = TD_IS_RUNNING(td) | TD_ON_RUNQ(td);
               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                 ||
sys/sys/proc.h:562:27: note: expanded from macro 'TD_IS_RUNNING'
                                ^
sys/kern/tty_info.c:124:9: note: cast one or both operands to int to silence this warning
sys/sys/proc.h:562:27: note: expanded from macro 'TD_IS_RUNNING'
                                ^
sys/kern/tty_info.c:129:9: error: use of bitwise '|' with boolean operands [-Werror,-Wbitwise-instead-of-logical]
        runb = TD_IS_RUNNING(td2) | TD_ON_RUNQ(td2);
               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                  ||
sys/sys/proc.h:562:27: note: expanded from macro 'TD_IS_RUNNING'
                                ^
sys/kern/tty_info.c:129:9: note: cast one or both operands to int to silence this warning
sys/sys/proc.h:562:27: note: expanded from macro 'TD_IS_RUNNING'
                                ^
```

Fix this by using logical operators instead. No functional change
intended.

Reviewed by:	cem, emaste, kevans, markj
MFC after:	3 days
Differential Revision: https://reviews.freebsd.org/D34186
2022-02-08 21:21:04 +01:00
Mark Johnston
5de79eeddb ktls: Disallow transmitting empty frames outside of TLS 1.0/CBC mode
There was nothing preventing one from sending an empty fragment on an
arbitrary KTLS TX-enabled socket, but ktls_frame() asserts that this
could not happen.  Though the transmit path handles this case for TLS
1.0 with AES-CBC, we should be strict and allow empty fragments only in
modes where it is explicitly allowed.

Modify sosend_generic() to reject writes to a KTLS-enabled socket if the
number of data bytes is zero, so that userspace cannot trigger the
aforementioned assertion.

Add regression tests to exercise this case.

Reported by:	syzkaller
Reviewed by:	gallatin, jhb
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D34195
2022-02-08 12:40:41 -05:00
Mark Johnston
300cfb96fc file: Make fget*() and getvnode*() consistent about initializing *fpp
Most fget*() functions initialize the output parameter to NULL.  Make
the externally visible interface behave consistently, and make
fget_unlocked_seq() private to kern_descrip.c.

This fixes at least one bug in a consumer, _filemon_wrapper_openat(),
which assumes that getvnode() sets the output file pointer to NULL upon
an error.

Reported by:	syzbot+01c0459408f896a5933a@syzkaller.appspotmail.com
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D34190
2022-02-08 12:40:41 -05:00
Sebastian Huber
3ec0dc367b kern_ntptime.c: Remove ntp_init()
The ntp_init() function did set a couple of global objects to zero.  These
objects are in the .bss section and already initialized to zero during kernel
or module loading.
2022-02-07 14:16:16 -07:00
John Baldwin
949e395966 Trim duplicate code for copying in iovecs for PT_[GS]ETREGSET.
Reviewed by:	andrew, emaste
Differential Revision:	https://reviews.freebsd.org/D34177
2022-02-07 11:49:29 -08:00
Gordon Bergling
5a78ec9e7c kern_fflock: Fix a typo in a source code comment
- s/foward/forward/

MFC after:	3 days
2022-02-06 17:29:43 +01:00
Gordon Bergling
a9bee9c77a kern_racct: Fix a typo in a source code comment
- s/maxumum/maximum/

MFC after:	3 days
2022-02-06 17:28:27 +01:00
Gleb Smirnoff
c999e3481d dmesg: detect wrapped msgbuf on the kernel side and if so, skip first line
Since 59f256ec35 dmesg(8) will always skip first line of the message
buffer, cause it might be incomplete.  The problem is that in most cases
it is complete, valid and contains the "---<<BOOT>>---" marker.  This
skip can be disabled with '-a', but that would also unhide all non-kernel
messages.  Move this functionality from dmesg(8) to kernel, since kernel
actually knows if wrap has happened or not.

The main motivation for the change is not actually the value of the
"---<<BOOT>>---" marker.  The problem breaks unit tests, that clear
message buffer, perform a test and then check the message buffer for
a result.  Example of such test is sys/kern/sonewconn_overflow.
2022-02-05 13:35:31 -08:00
Kyle Evans
642701abc8 kern: harvest entropy from callouts
74cf7cae4d ("softclock: Use dedicated ithreads for running callouts.")
switched callouts away from the swi infrastructure.  It turns out that
this was a major source of entropy in early boot, which we've now lost.

As a result, first boot on hardware without a 'fast' entropy source
would block waiting for fortuna to be seeded with little hope of
progressing without manual intervention.

Let's resolve it by explicitly harvesting entropy in callout_process()
if we've handled any callouts.  cc/curthread/now seem to be reasonable
sources of entropy, so use those.

Discussed with:	jhb (also proposed initial patch)
Reported by:	many
Reviewed by:	cem, markm (both csprng)
Differential Revision:	https://reviews.freebsd.org/D34150
2022-02-03 10:05:06 -06:00
Konstantin Belousov
c02780b78c Add GB_NOWITNESS flag
It prevents WITNESS from recording the lock order for the buffer lock
acquired by getblkx().

Reviewed by:	mckusick
Discussed with:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D34073
2022-02-01 06:54:50 +02:00
John Baldwin
d958bc7963 ktls: Try to enable TOE TLS after marking existing data not ready.
At the moment this is mostly a no-op but in the future there will be
in-flight encrypted data which requires software decryption.  This
same setup is also needed for NIC TLS RX.

Note that this does break TOE TLS RX for AES-CBC ciphers since there
is no software fallback for AES-CBC receive.  This will be resolved
one way or another before 14.0 is released.

Reviewed by:	hselasky
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D34082
2022-01-31 16:39:21 -08:00
Konstantin Belousov
66c5fbca77 insmntque1(): remove useless arguments
Also remove once-used functions to clean up after failed insmntque1(),
which were destructor callbacks in previous life.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D34071
2022-01-31 16:49:08 +02:00
Konstantin Belousov
3d68c4e175 syncer VOP_FSYNC(): unlock syncer vnode around call to VFS_SYNC()
The lock is unneccessary since the mount point is busied, which prevents
unmount and syncer vnode deallocation.  Having the vnode locked causes
innocent LoRs and complicates debugging.

Also stop starting write accounting around it.  Any caller of
VOP_FSYNC() must do it already, and sync_vnode() does.

Reported and tested by:	pho
Reviewed by:	markj, mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D34072
2022-01-31 04:46:21 +02:00
Konstantin Belousov
5875b94c74 buf_alloc(): lock the buffer with LK_NOWAIT
The buffer must not be accessed by any other thread, it is freshly
allocated.  As such, LK_NOWAIT should be nop but also it prevents
recording the order between the buffer lock and any other locks we might
own in the call to getnewbuf().  In particular, if we own FFS snap lock,
it should avoid triggering false positive warning.

Reviewed by:	markj, mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D34072
2022-01-31 04:46:21 +02:00
Konstantin Belousov
531f8cfea0 Use dedicated lock name for pbufs
Also remove a pointer to array variable, use array address directly.

Reviewed by:	markj, mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D34072
2022-01-31 04:46:14 +02:00
Kristof Provost
703e533da5 mbuf: do not restore dying interfaces
When we remove an interface it is first removed from the interface list
V_ifnet (by if_unlink_ifnet()) and marked as IFF_DYING. We then wait for
any possible references to stop being used (i.e.
epoch_wait/epoch_drain_callbacks) before we tear it fully down.

However, the index in ifindex_table is not removed, so m_rcvif_restore()
can still find the (now dying) interface.

This results in panics, for example when dummynet restores the rcvif
pointer and passes a packet to ip6_input() we can panic because the
AF_INET6 domain has already been removed (so we end up dereferencing a
NULL pointer there).

Check that the interface is not dying before we restore it, which is
equivalent to checking its presence in V_ifnet, and thus ensures that
future accesses (while in NET_EPOCH) are safe.

Reviewed by:	glebius
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D34076
2022-01-28 23:09:08 +01:00
Mateusz Guzik
2a7e4cf843 Revert b58ca5df0b ("vfs: remove the now unused insmntque1")
I was somehow convinced that insmntque calls insmntque1 with a NULL
destructor. Unfortunately this worked well enough to not immediately
blow up in simple testing.

Keep not using the destructor in previously patched filesystems though
as it avoids unnecessary casts.

Noted by:	kib
Reported by:	pho
2022-01-27 16:32:22 +00:00
Andrew Turner
548a2ec49b Add PT_GETREGSET
This adds the PT_GETREGSET and PT_SETREGSET ptrace types. These can be
used to access all the registers from a specified core dump note type.
The NT_PRSTATUS and NT_FPREGSET notes are initially supported. Other
machine-dependant types are expected to be added in the future.

The ptrace addr points to a struct iovec pointing at memory to hold the
registers along with its length. On success the length in the iovec is
updated to tell userspace the actual length the kernel wrote or, if the
base address is NULL, the length the kernel would have written.

Because the data field is an int the arguments are backwards when
compared to the Linux PTRACE_GETREGSET call.

Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D19831
2022-01-27 11:40:34 +00:00
Gleb Smirnoff
e1882428dc ifnet/mbuf: provide KPI to serialize/restore m->m_pkthdr.rcvif
Supplement ifindex table with generation count and use it to
serialize & restore an ifnet pointer.

Reviewed by:		kp
Differential revision:	https://reviews.freebsd.org/D33266
Fun note:		git show e6abef0918
2022-01-26 21:58:50 -08:00
Mateusz Guzik
b58ca5df0b vfs: remove the now unused insmntque1
Bump __FreeBSD_version to 1400052.
2022-01-27 01:00:24 +01:00
Kyle Evans
773fa8cd13 execve: disallow argc == 0
The manpage has contained the following verbiage on the matter for just
under 31 years:

"At least one argument must be present in the array"

Previous to this version, it had been prefaced with the weakening phrase
"By convention."

Carry through and document it the rest of the way.  Allowing argc == 0
has been a source of security issues in the past, and it's hard to
imagine a valid use-case for allowing it.  Toss back EINVAL if we ended
up not copying in any args for *execve().

The manpage change can be considered "Obtained from: OpenBSD"

Reviewed by:	emaste, kib, markj (all previous version)
Differential Revision:	https://reviews.freebsd.org/D34045
2022-01-26 13:40:27 -06:00
Hans Petter Selasky
9e2cce7e6a Implement a function to get the next TCP- and TLS- receive sequence number.
This function will be used by coming TLS hardware receive offload support.

Differential Revision:	https://reviews.freebsd.org/D32356
Discussed with:	jhb@
MFC after:	1 week
Sponsored by:	NVIDIA Networking
2022-01-26 12:55:00 +01:00
Hans Petter Selasky
17cbcf33c3 mbuf(9): Assert receive mbufs don't carry a send tag.
Else we would start leaking reference counts.

Discussed with:	jhb@
MFC after:	1 week
Sponsored by:	NVIDIA Networking
2022-01-26 12:55:00 +01:00
John Baldwin
308fc7e5b1 user_getpeername: Use 'bool' for the compat argument.
This matches user_getsockname.

Reviewed by:	brooks, kib
Sponsored by:	The University of Cambridge, Google Inc.
Differential Revision:	https://reviews.freebsd.org/D33987
2022-01-24 09:51:35 -08:00
Konstantin Belousov
fe6db72708 Add security.bsd.allow_ptrace sysctl
that disables any access to ptrace(2) for all processes.

Reviewed by:	emaste
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D33986
2022-01-22 19:36:56 +02:00
Konstantin Belousov
55a0aa2162 p_candebug(), p_cansee(): always allow for curproc
Privilege checks in both functions should allow the current process to
infer information about itself, as well as use the interfaces that are
proclaimed 'debugging', for instance, procctl(2).

Note that in p_cansee() case, explicit comparision of curproc and p
avoids a race where the process might change credentials and cause
thread to compare its cached stale credentials against updated process
creds, effectively disallowing the process to observe itself.

Reviewed by:	emaste
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D33986
2022-01-22 19:36:56 +02:00
Mark Johnston
6be8944d96 ktls: Zero out TLS_GET_RECORD control messages
Otherwise we end up copying one uninitialized byte into the socket
buffer.

Reported by:	KMSAN
Reviewed by:	jhb
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D33953
2022-01-20 15:42:46 -05:00
Mark Johnston
c3196306f0 clockcalib: Fix an overflow bug
tc_counter_mask is an unsigned int and in the TSC timecounter is equal
to UINT_MAX, so the addition tc->tc_counter_mask + 1 can overflow to 0,
resulting in a hang during boot.

Fixes:		c2705ceaeb ("x86: Speed up clock calibration")
Reviewed by:	cperciva
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D33956
2022-01-20 08:23:38 -05:00
Alexander Motin
b7ff445ffa Reduce bufdaemon/bufspacedaemon shutdown time.
Before this change bufdaemon and bufspacedaemon threads used
kthread_shutdown() to stop activity on system shutdown.  The problem is
that kthread_shutdown() has no idea about the wait channel and lock used
by specific thread to wake them up reliably.  As result, up to 9 threads
could consume up to 9 seconds to shutdown for no good reason.

This change introduces specific shutdown functions, knowing how to
properly wake up specific threads, reducing wait for those threads on
shutdown/reboot from average 4 seconds to effectively zero.

MFC after:	2 weeks
Reviewed by:	kib, markj
Differential Revision:  https://reviews.freebsd.org/D33936
2022-01-18 19:26:16 -05:00
Mark Johnston
3ce04aca49 proc: Add a sysctl to fetch virtual address space layout info
This provides information about fixed regions of the target process'
user memory map.

Reviewed by:	kib
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D33708
2022-01-17 16:12:43 -05:00
Mark Johnston
1811c1e957 exec: Reimplement stack address randomization
The approach taken by the stack gap implementation was to insert a
random gap between the top of the fixed stack mapping and the true top
of the main process stack.  This approach was chosen so as to avoid
randomizing the previously fixed address of certain process metadata
stored at the top of the stack, but had some shortcomings.  In
particular, mlockall(2) calls would wire the gap, bloating the process'
memory usage, and RLIMIT_STACK included the size of the gap so small
(< several MB) limits could not be used.

There is little value in storing each process' ps_strings at a fixed
location, as only very old programs hard-code this address; consumers
were converted decades ago to use a sysctl-based interface for this
purpose.  Thus, this change re-implements stack address randomization by
simply breaking the convention of storing ps_strings at a fixed
location, and randomizing the location of the entire stack mapping.
This implementation is simpler and avoids the problems mentioned above,
while being unlikely to break compatibility anywhere the default ASLR
settings are used.

The kern.elfN.aslr.stack_gap sysctl is renamed to kern.elfN.aslr.stack,
and is re-enabled by default.

PR:		260303
Reviewed by:	kib
Discussed with:	emaste, mw
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D33704
2022-01-17 16:12:36 -05:00
Mark Johnston
758d98debe exec: Remove the stack gap implementation
ASLR stack randomization will reappear in a forthcoming commit.  Rather
than inserting a random gap into the stack mapping, the entire stack
mapping itself will be randomized in the same way that other mappings
are when ASLR is enabled.

No functional change intended, as the stack gap implementation is
currently disabled by default.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D33704
2022-01-17 16:11:54 -05:00
Mark Johnston
706f4a81a8 exec: Introduce the PROC_PS_STRINGS() macro
Rather than fetching the ps_strings address directly from a process'
sysentvec, use this macro.  With stack address randomization the
ps_strings address is no longer fixed.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D33704
2022-01-17 16:11:54 -05:00
Mark Johnston
5a8413e779 setrlimit: Remove special handling for RLIMIT_STACK with a stack gap
This will not be required with a forthcoming reimplementation of ASLR
stack randomization.  Moreover, this change was not sufficient to enable
the use of a stack size limit smaller than the stack gap itself.

PR:		260303
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D33704
2022-01-17 11:42:13 -05:00
Mark Johnston
3fc21fdd5f sysent: Add a sv_psstringssz field to struct sysentvec
The size of the ps_strings structure varies between ABIs, so this is
useful for computing the address of the ps_strings structure relative to
the top of the stack when stack address randomization is enabled.

Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D33704
2022-01-17 11:42:07 -05:00
Mark Johnston
1544f5add8 Revert "kern_exec: Add kern.stacktop sysctl."
The current ASLR stack gap feature will be removed, and with that the
need for the kern.stacktop sysctl is gone.  All consumers have been
removed.

This reverts commit a97d697122.

Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D33704
2022-01-17 11:41:58 -05:00
Mark Johnston
dc7526170d posixshm: Report output buffer truncation from kern.ipc.posix_shm_list
PR:		240573
Reviewed by:	kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D33912
2022-01-17 08:35:19 -05:00
Alexander Motin
e76c010899 Fix inverse sleep logic in buf_daemon().
Before commit 3cec5c77d6 buf_daemon() went to longer 1s sleep if
numdirtybuffers <= lodirtybuffers.  After that commit new condition
!BIT_EMPTY(BUF_DOMAINS, &bdlodirty) got opposite -- true when one
or more more domains is above lodirtybuffers.  As result, on freshly
booted system with no dirty buffers buf_daemon() wakes up 10 times
per second and probably only 1 time per second when there is actual
work to do.

MFC after:	1 week
Reviewed by:	kib, markj
Tested by:	pho
Differential revision:	https://reviews.freebsd.org/D33890
2022-01-15 19:32:36 -05:00
Simon J. Gerraty
bacb140f31 Ignore calcru: runtime went backwards for vm_guest
VM's have little control over CPU speed, don't make matters worse
by constantly spaming console.

Reviewed by:	jhb
Differential Revision:	https://reviews.freebsd.org/D33902
2022-01-14 16:07:43 -08:00
Brooks Davis
0910a41ef3 Revert "syscallarg_t: Add a type for system call arguments"
Missed issues in truss on at least armv7 and powerpcspe need to be
resolved before recommit.

This reverts commit 3889fb8af0.
This reverts commit 1544e0f5d1.
2022-01-12 23:29:20 +00:00
Brooks Davis
3889fb8af0 sysent: regen for syscallarg_t 2022-01-12 22:51:25 +00:00
Brooks Davis
1544e0f5d1 syscallarg_t: Add a type for system call arguments
This more clearly differentiates system call arguments from integer
registers and return values. On current architectures it has no effect,
but on architectures where pointers are not integers (CHERI) and may
not even share registers (CHERI-MIPS) it is necessiary to differentiate
between system call arguments (syscallarg_t) and integer register values
(register_t).

Obtained from:	CheriBSD

Reviewed by:	imp, kib
Differential Revision:	https://reviews.freebsd.org/D33780
2022-01-12 22:51:25 +00:00
Colin Percival
c2705ceaeb x86: Speed up clock calibration
Prior to this commit, the TSC and local APIC frequencies were calibrated
at boot time by measuring the clocks before and after a one-second sleep.
This was simple and effective, but had the disadvantage of *requiring a
one-second sleep*.

Rather than making two clock measurements (before and after sleeping) we
now perform many measurements; and rather than simply subtracting the
starting count from the ending count, we calculate a best-fit regression
between the target clock and the reference clock (for which the current
best available timecounter is used). While we do this, we keep track
of an estimate of the uncertainty in the regression slope (aka. the ratio
of clock speeds), and stop measuring when we believe the uncertainty is
less than 1 PPM.

In order to avoid the risk of aliasing resulting from the data-gathering
loop synchronizing with (a multiple of) the frequency of the reference
clock, we add some additional spinning depending upon the iteration number.

For numerical stability and simplicity of implementation, we make use of
floating-point arithmetic for the statistical calculations.

On the author's Dell laptop, this reduces the time spent in calibration
from 2000 ms to 29 ms; on an EC2 c5.xlarge instance, it is reduced from
2000 ms to 2.5 ms.

Reviewed by:	bde (previous version), kib
MFC after:	1 month
Sponsored by:	https://www.patreon.com/cperciva
Differential Revision:	https://reviews.freebsd.org/D33802
2022-01-12 12:34:07 -08:00
Konstantin Belousov
a24afbb4e6 Ignore debugger-injected signals left after detaching
PR:	261010
Reported by:	Martin Simmons <martin@lispworks.com>
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D33787
2022-01-12 07:33:30 +02:00
Alexander Motin
cb1f5d1136 Reduce minimum idle hardclock rate from 2Hz to 1Hz.
On idle 80-thread system it allows to improve package-level idle state
residency and so power consumption by several percent.

MFC after:	2 weeks
2022-01-09 19:25:56 -05:00
Konstantin Belousov
4a4b059a97 Add vfs_remount_ro()
a helper to remount filesystem from rw to ro.

Tested by:	pho
Reviewed by:	markj, mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D33721
2022-01-08 05:41:44 +02:00
John Baldwin
7def1e10b3 bus_dma: Deduplicate locking helper functions.
- Move busdma_lock_mutex to subr_bus_dma.c.

- Move _busdma_lock_dflt to subr_bus_dma.c.  This function was named a
  couple of different things previously.  It is not a public API but
  an internal helper used in place of a NULL pointer.  The prototype
  is in <sys/bus_dma.h> as not all backends include
  <sys/bus_dma_internal.h>.

Reviewed by:	kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D33694
2022-01-05 13:50:40 -08:00
John Baldwin
85b4607324 Deduplicate bus_dma bounce code.
Move mostly duplicated code in various MD bus_dma backends to support
bounce pages into sys/kern/subr_busdma_bounce.c.  This file is
currently #include'd into the backends rather than compiled standalone
since it requires access to internal members of opaque bus_dma
structures such as bus_dmamap_t and bus_dma_tag_t.

Reviewed by:	kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D33684
2022-01-05 13:50:40 -08:00
John Baldwin
753c851387 sigev_findtd: Fix whitespace nit in argument list.
Obtained from:	CheriBSD
2022-01-04 13:37:39 -08:00
Gleb Smirnoff
644ca0846d domains: make domain_init() initialize only global state
Now that each module handles its global and VNET initialization
itself, there is no VNET related stuff left to do in domain_init().

Differential revision:	https://reviews.freebsd.org/D33541
2022-01-03 10:15:22 -08:00
Gleb Smirnoff
24e1c6ae7d domains: init with standard SYSINIT(9) or VNET_SYSINIT()
There left only three modules that used dom_init().  And netipsec
was the last one to use dom_destroy().

Differential revision:	https://reviews.freebsd.org/D33540
2022-01-03 10:15:22 -08:00
Gleb Smirnoff
340c7343f4 protocols: don't execute protosw_init() for every VNET
The function now modifies pr_usrreqs only, which are always
global.  Rename it to pr_usrreqs_init().

Differential revision:	https://reviews.freebsd.org/D33538
2022-01-03 10:15:21 -08:00
Gleb Smirnoff
89128ff3e4 protocols: init with standard SYSINIT(9) or VNET_SYSINIT
The historical BSD network stack loop that rolls over domains and
over protocols has no advantages over more modern SYSINIT(9).
While doing the sweep, split global and per-VNET initializers.

Getting rid of pr_init allows to achieve several things:
o Get rid of ifdef's that protect against double foo_init() when
  both INET and INET6 are compiled in.
o Isolate initializers statically to the module they init.
o Makes code easier to understand and maintain.

Reviewed by:		melifaro
Differential revision:	https://reviews.freebsd.org/D33537
2022-01-03 10:15:21 -08:00
Jessica Clarke
a3e828c91d intrng: Use less confusing return value for intr_pic_add_handler
Currently intr_pic_add_handler either returns the PIC you gave it (which
is useless and risks causing confusion about whether it's creating
another PIC) or, on error, NULL. Instead, convert it to return an int
error code as one would expect.

Note that the only consumer of this API, arm64's gicv3_its, does not use
the return value, so no uses need updating to work with the revised API.

Reviewed by:	markj, mmel
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D33341
2022-01-03 17:08:44 +00:00
Stefan Eßer
ec3af9d0ca sys/kern/sched_4bsd.c: fix typo introduced in previous commit 2022-01-01 15:33:38 +01:00