Commit Graph

19352 Commits

Author SHA1 Message Date
Mark Johnston
38e1d32dab callout: Simplify cpuid validation in callout_reset_sbt_on()
- Remove a flag variable.
- Convert a runtime check of the passed cpuid to a KASSERT.
- Remove the cc_inited flag.  An attempt to schedule a callout before
  SI_SUB_CPU will crash anyway since the per-CPU mutexes won't have been
  initialized, and that flag was only checked in the case where a cpuid
  was explicitly specified by the caller.

No functional change intended.

MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2022-07-13 09:47:33 -04:00
Mark Johnston
ece453d5fa eventtimer: Simplify KTR traces
Stop including the current CPU in all event messages, since it's already
saved in KTR log entries and thus is redundant.  All eventtimer traces
occur in a context where CPU migration is not possible.

MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2022-07-11 15:58:43 -04:00
Mark Johnston
a889a65ba3 eventtimer: Fix several races in the timer reload code
In handleevents(), lock the timer state before fetching the time for the
next event.  A concurrent callout_cc_add() call might be changing the
next event time, and the race can cause handleevents() to program an
out-of-date time, causing the callout to run later (by an unbounded
period, up to the idle hardclock period of 1s) than requested.

In cpu_idleclock(), call getnextcpuevent() with the timer state mutex
held, for similar reasons.  In particular, cpu_idleclock() runs with
interrupts enabled, so an untimely timer interrupt can result in a stale
next event time being programmed.  Further, an interrupt can cause
cpu_idleclock() to use a stale value for "now".

In cpu_activeclock(), disable interrupts before loading "now", so as to
avoid going backwards in time when calling handleevents().  It's ok to
leave interrupts enabled when checking "state->idle", since the race at
worst will cause handleevents() to be called unnecessarily.  But use an
atomic load to indicate that the test is racy.

PR:		264867
Reviewed by:	mav, jhb, kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D35735
2022-07-11 15:58:43 -04:00
Mark Johnston
ebb3cb6195 eventtimer: Pass a pcpu state pointer to getnext(cpu)event()
Callers have already loaded the pointer, so these functions don't need
to fetch it again.

No functional change intended.

MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2022-07-11 15:58:43 -04:00
Mark Johnston
ba71333f60 sched_ule: Fix a typo in a comment
PR:		226107
MFC after:	1 week
2022-07-11 15:58:43 -04:00
Mark Johnston
ef80894c9d sched_ule: Purge an obsolete comment
The referenced bitmask was removed in commit 62fa74d95a.

MFC after:	 1 week
Sponsored by:	The FreeBSD Foundation
2022-07-11 15:58:43 -04:00
Mark Johnston
35dd6d6cb5 sched_ule: Eliminate a superfluous local variable in tdq_move()
No functional change intended.

MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2022-07-11 15:58:43 -04:00
Gleb Smirnoff
c261510ef5 sockets: fix setsockopt(SO_RCVTIMEO) on a listening socket
MFC after:	3 weeks
2022-07-08 11:33:24 -07:00
Mitchell Horne
258958b3c7 ddb: use _FLAGS command macros where appropriate
Some command definitions were forced to use DB_FUNC in order to specify
their required flags, CS_OWN or CS_MORE. Use the new macros to simplify
these.

Reviewed by:	markj, jhb
MFC after:	3 days
Sponsored by:	Juniper Networks, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D35582
2022-07-05 11:56:55 -03:00
Gleb Smirnoff
d8596171c5 sockets: use only soref()/sorele() as socket reference count
o Retire SS_FDREF as it is basically a debug flag on top of already
  existing soref()/sorele().
o Convert SS_PROTOREF into soref()/sorele().
o Change reference model for the listen queues, see below.
o Make sofree() private.  The correct KPI to use is only sorele().
o Make soabort() respect the model and sorele() instead of sofree().

Note on listening queues.  Until now the sockets on a queue had zero
reference count.  And the reference were given only upon accept(2).  The
assumption was that there is no way to see the queued socket from anywhere
except its head.  This is not true, since queued sockets already have pcbs,
which are linked at least into the global pcb lists.  With this change we
put the reference right in the sonewconn() and on accept(2) path we just
hand the existing reference to the file descriptor.

Differential revision:	https://reviews.freebsd.org/D35679
2022-07-04 12:40:51 -07:00
Gleb Smirnoff
bc7605647c sockets: use positive flag for file descriptor socket reference
Rename SS_NOFDREF to SS_FDREF and flip all bitwise operations.
Mark sockets created by socreate() with SS_FDREF.

This change is mostly illustrative. With it we see that SS_FDREF
is a debugging flag, since:
* socreate() takes a reference with soref().
* on accept path solisten_dequeue() takes a reference
  with soref() and then soaccept() sets SS_FDREF.
* soclose() checks SS_FDREF, removes it and does sorele().

Reviewed by:		tuexen
Differential revision:	https://reviews.freebsd.org/D35678
2022-07-04 12:40:51 -07:00
Warner Losh
b69996d1d5 tty: Default to printing kernel stack traceback only on INVARIANT kernels
Change the default from printing a breif kernel thread stack informaton
back to omitting it for non-invariant kernels in response to
SIGINFO/^T. Full and brief stack support can be selected with the
kern.tty_info_kstacks sysctl.

MFC After:		2 weeks
Sponsored by:		Netflix
Reviewed by:		grembo, jhb
Differential Revision:	https://reviews.freebsd.org/D35576
2022-07-02 08:02:12 -06:00
John Baldwin
0bd73da206 busdma_bounce: Use PRI_ITHD scheduling class for worker thread.
Reviewed by:	kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D35641
2022-06-30 10:06:04 -07:00
John Baldwin
0288d4277f Add register sets for NT_THRMISC and NT_PTLWPINFO.
For the kernel this is mostly a non-functional change.  However, this
will be useful for simplifying gcore(1).

Reviewed by:	markj
MFC after:	2 weeks
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D35666
2022-06-30 10:04:56 -07:00
Gleb Smirnoff
66c8e3fccf socket: fix listen(2) on an already listening socket
Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35669
Fixes:			141fe2dcee
2022-06-30 07:50:29 -07:00
Konstantin Belousov
ad175a107b vfs_mount.c: convert explicit panics and KASSERTs to MPASSERT/MPPASS
Reviewed by:	imp, mjg
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35652
2022-06-29 21:31:47 +03:00
Konstantin Belousov
1e54362824 vfs_op_exit(): assert that mnt_vfs_ops stays non-zero for unmount or suspend
Reviewed by:	mjg
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35639
2022-06-29 21:31:47 +03:00
Jamie Gritton
7060da62ff jail: Remove a prison's shared memory when it dies
Add shm_remove_prison(), that removes all POSIX shared memory segments
belonging to a prison.  Call it from prison_cleanup() so a prison
won't be stuck in a dying state due to the resources still held.

PR:		257555
Reported by:	grembo
2022-06-29 10:47:39 -07:00
Jamie Gritton
a9f7455c38 jail: add prison_cleanup() to release resources held by a dying jail
Currently, when a jail starts dying, either by losing its last user
reference or by being explicitly killed,
osd_jail_call(...PR_METHOD_REMOVE...) is called.  Encapsulate this
into a function prison_cleanup() that can then do other cleanup.
2022-06-29 10:33:05 -07:00
Gleb Smirnoff
48a55bbfe9 unix: change error code for recvmsg() failed due to RLIMIT_NOFILE
Instead of returning EMSGSIZE pass the error code from fdallocn() directly
to userland.  That would be EMFILE, which makes much more sense.  This
error code is not listed in the specification[1], but the specification
doesn't cover such edge case at all.  Meanwhile the specification lists
EMSGSIZE as the error code for invalid value of msg_iovlen, and FreeBSD
follows that, see sys_recmsg().  Differentiating these two cases will make
a developer/admin life much easier when debugging.

[1] https://pubs.opengroup.org/onlinepubs/9699919799/functions/recvmsg.html

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35640
2022-06-29 09:42:58 -07:00
Kristof Provost
ab91feabcc ovpn: Introduce OpenVPN DCO support
OpenVPN Data Channel Offload (DCO) moves OpenVPN data plane processing
(i.e. tunneling and cryptography) into the kernel, rather than using tap
devices.
This avoids significant copying and context switching overhead between
kernel and user space and improves OpenVPN throughput.

In my test setup throughput improved from around 660Mbit/s to around
2Gbit/s.

Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D34340
2022-06-28 11:33:10 +02:00
Mateusz Guzik
7388fb714a cache: drop the vfs.cache_rename_add tunable
The functionality has been in use since Jan 2021 -- long enough(tm).
2022-06-27 09:56:20 +02:00
Gleb Smirnoff
458f475df8 unix/dgram: smart socket buffers for one-to-many sockets
A one-to-many unix/dgram socket is a socket that has been bound
with bind(2) and can get multiple connections.  A typical example
is /var/run/log bound by syslogd(8) and receiving multiple
connections from libc syslog(3) API.  Until now all of these
connections shared the same receive socket buffer of the bound
socket.  This made the socket vulnerable to overflow attack.
See 240d5a9b1c for a historical attempt to workaround the problem.

This commit creates a per-connection socket buffer for every single
connected socket and eliminates the problem.  The new behavior will
optimize seldom writers over frequent writers.  See added test case
scenarios and code comments for more detailed description of the
new behavior.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35303
2022-06-24 09:09:11 -07:00
Gleb Smirnoff
1093f16487 unix/dgram: reduce mbuf chain traversals in send(2) and recv(2)
o Use m_pkthdr.memlen from m_uiotombuf()
o Modify unp_internalize() to keep track of allocated space and memory
  as well as pointer to the last buffer.
o Modify unp_addsockcred() to keep track of allocated space and memory
  as well as pointer to the last buffer.
o Record the datagram len/memlen/ctllen in the first (from) mbuf of the
  chain in uipc_sosend_dgram() and reuse it in uipc_soreceive_dgram().

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35302
2022-06-24 09:09:11 -07:00
Gleb Smirnoff
9b841b0e23 m_uiotombuf: write total memory length of the allocated chain in pkthdr
Data allocated by m_uiotombuf() usually goes into a socket buffer.
We are interested in the length of useful data to be added to sb_acc,
as well as total memory used by mbufs.  The later would be added to
sb_mbcnt.  Calculating this value at allocation time allows to save
on extra traversal of the mbuf chain.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35301
2022-06-24 09:09:11 -07:00
Gleb Smirnoff
a7444f807e unix/dgram: use minimal possible socket buffer for PF_UNIX/SOCK_DGRAM
This change fully splits away PF_UNIX/SOCK_DGRAM from other socket
buffer implementations, without any behavior changes.

Generic socket implementation is reduced down to one STAILQ and very
little code.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35300
2022-06-24 09:09:11 -07:00
Gleb Smirnoff
a4fc41423f sockets: enable protocol specific socket buffers
Split struct sockbuf into common shared fields and protocol specific
union, where protocols are free to implement whatever buffer they
want.  Such protocols should mark themselves with PR_SOCKBUF and are
expected to initialize their buffers in their pr_attach and tear
them down in pr_detach.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35299
2022-06-24 09:09:10 -07:00
Gleb Smirnoff
315167c0de unix: provide an option to return locked from unp_connectat()
Use this new version in unix/dgram socket when sending to a target
address.  This removes extra lock release/acquisition and possible
counter-intuitive ENOTCONN.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35298
2022-06-24 09:09:10 -07:00
Gleb Smirnoff
5dc8dd5f3a unix/dgram: inline sbappendaddr_locked() into uipc_sosend_dgram()
This allows to remove one M_NOWAIT allocation and also makes it
more clear what's going on.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35297
2022-06-24 09:09:10 -07:00
Gleb Smirnoff
e3fbbf965e unix/dgram: add a specific receive method - uipc_soreceive_dgram
With this second step PF_UNIX/SOCK_DGRAM has protocol specific
implementation.  This gives some possibility performance
optimizations.  However, it still operates on the same struct
socket as all other sockets do.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35296
2022-06-24 09:09:10 -07:00
Gleb Smirnoff
f384a97c83 unix/dgram: cleanup uipc_send of PF_UNIX/SOCK_DGRAM, step 2
Just remove one level of indentation as the case clause always match.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35295
2022-06-24 09:09:10 -07:00
Gleb Smirnoff
7e5b6b391e unix/dgram: cleanup uipc_send of PF_UNIX/SOCK_DGRAM, step 1
Remove the dead code.  The new uipc_sosend_dgram() handles send()
on PF_UNIX/SOCK_DGRAM in full.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35294
2022-06-24 09:09:10 -07:00
Gleb Smirnoff
3464958246 unix/dgram: add a specific send method - uipc_sosend_dgram()
This is first step towards splitting classic BSD socket
implementation into separate classes.  The first to be
split is PF_UNIX/SOCK_DGRAM as it has most differencies
to SOCK_STREAM sockets and to PF_INET sockets.

Historically a protocol shall provide two methods for sendmsg(2):
pru_sosend and pru_send.  The former is a generic send method,
e.g. sosend_generic() which would internally call the latter,
uipc_send() in our case.  There is one important exception, though,
the sendfile(2) code will call pru_send directly.  But sendfile
doesn't work on SOCK_DGRAM, so we can do the trick.  We will create
socket class specific uipc_sosend_dgram() which will carry only
important bits from sosend_generic() and uipc_send().

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35293
2022-06-24 09:09:10 -07:00
Mitchell Horne
29afffb942 subr_bus: restore bus_null_rescan()
Partially revert the previous change; we need to keep this method as a
specific override for pci_driver subclasses which should not use
pci_rescan_method() -- cardbus and ofw_pcibus. However, change the return
value to ENODEV for the same reasoning given in the original commit, and
use this as the default rescan method in bus_if.m.

Reported by:	jhb
Fixes:		36a8572ee8 ("bus_if: provide a default null rescan method")
MFC with:	36a8572ee8
2022-06-23 16:07:00 -03:00
Mitchell Horne
8701571df9 set_cputicker: use a bool
The third argument to this function indicates whether the supplied
ticker is fixed or variable, i.e. requiring calibration. Give this
argument a type and name that better conveys this purpose.

Reviewed by:	kib, markj
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D35459
2022-06-23 15:15:11 -03:00
Mitchell Horne
36a8572ee8 bus_if: provide a default null rescan method
There is an existing helper method in subr_bus.c, but almost no drivers
know to use it. It also returns the same error as an empty method,
making it not very useful. Move this to bus_if.m and return a more
sensible error code.

This gives a slightly more meaningful error message when attempting
'devctl rescan' on buses and devices alike:
  "Device not configured" --> "Operation not supported by device"

Reviewed by:	imp
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D35501
2022-06-23 15:15:10 -03:00
Chuck Silvers
5bd21cbbd1 vfs: fix vfs_bio_clrbuf() for PAGE_SIZE > block size
Calculate the desired page valid mask using math that will not
overflow the types used.

Sponsored by:	Netflix

Reviewed by:	mckusick, kib, markj
Differential Revision:	https://reviews.freebsd.org/D34837
2022-06-21 17:58:52 -07:00
Mark Johnston
9553bc89db aio: Improve UMA usage
- Remove the AIO proc zone.  This zone gets one allocation per AIO
  daemon process, which isn't enough to warrant a dedicated zone.  Plus,
  unlike other AIO structures, aiops are small (32 bytes with LP64), so
  UMA doesn't provide better space efficiency than malloc(9).  Change
  one of the malloc types in vfs_aio.c to make it more general.

- Don't set the NOFREE flag on the other AIO zones.  This flag means
  that memory allocated to the AIO subsystem is never freed back to the
  VM, so it's always preferable to avoid using it when possible.  NOFREE
  was set without explanation when AIO was converted to use UMA 20 years
  ago, but it does not appear to be required; all of the structures
  allocated from UMA (per-process kaioinfo, kaiocb, and aioliojob) keep
  track of references and get freed only when none exist.  Plus, these
  structures will contain dangling pointer after they're freed (e.g.,
  the "cred", "fd_file" and "uiop" fields of struct kaiocb), so
  use-after-frees are dangerous even when the structures themselves are
  type-stable.

Reviewed by:	asomers
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D35493
2022-06-20 12:48:13 -04:00
Damjan Jovanovic
8c309d48aa struct kinfo_file changes needed for lsof to work using only usermode APIs`
Add kf_pipe_buffer_[in/out/size] fields to kf_pipe, and populate them.

Add a kf_kqueue struct to the kf_un union, to allow querying kqueue state,
and populate it.

Populate the kf_sock_rcv_sb_state and kf_sock_snd_sb_state fields in
kf_sock for INET/INET6 sockets, and populate all other fields for all
transport layer protocols, not just TCP.

Bump __FreeBSD_version.

Differential revision:	https://reviews.freebsd.org/D34184
Reviewed by:	jhb, kib, se
MFC after:	1 week
2022-06-18 12:34:25 +03:00
Damjan Jovanovic
8ae7694913 KERN_LOCKF: report kl_file_fsid consistently with stat(2)
PR:	264723
Reviewed by:	kib
Discussed with:	markj
MFC after:	1 week
2022-06-18 12:34:17 +03:00
Mark Johnston
f6379f7fde socket: Fix a race between kevent(2) and listen(2)
When locking the knote list for a socket, we check whether the socket is
a listening socket in order to select the appropriate mutex; a listening
socket uses the socket lock, while data sockets use socket buffer
mutexes.

If SOLISTENING(so) is false and the knote lock routine locks a socket
buffer, then it must re-check whether the socket is a listening socket
since solisten_proto() could have changed the socket's identity while we
were blocked on the socket buffer lock.

Reported by:	syzkaller
Reviewed by:	glebius
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D35492
2022-06-16 10:20:04 -04:00
Mark Johnston
756bc3adc5 kasan: Create a shadow for the bootstack prior to hammer_time()
When the kernel is compiled with -asan-stack=true, the address sanitizer
will emit inline accesses to the shadow map.  In other words, some
shadow map accesses are not intercepted by the KASAN runtime, so they
cannot be disabled even if the runtime is not yet initialized by
kasan_init() at the end of hammer_time().

This went unnoticed because the loader will initialize all PML4 entries
of the bootstrap page table to point to the same PDP page, so early
shadow map accesses do not raise a page fault, though they are silently
corrupting memory.  In fact, when the loader does not copy the staging
area, we do get a page fault since in that case only the first and last
PML4Es are populated by the loader.  But due to another bug, the loader
always treated KASAN kernels as non-relocatable and thus always copied
the staging area.

It is not really practical to annotate hammer_time() and all callees
with __nosanitizeaddress, so instead add some early initialization which
creates a shadow for the boot stack used by hammer_time().  This is only
needed by KASAN, not by KMSAN, but the shared pmap code handles both.

Reported by:	mhorne
Reviewed by:	kib
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D35449
2022-06-15 11:39:10 -04:00
Doug Ambrisko
ce00b11940 mount: revert the active vnode reporting feature
Revert the computing of active vnode reporting since statfs is used
by a lot of tools.  Only report the vnodes used.

Reported by:	mjg
2022-06-15 07:24:55 -07:00
Mark Johnston
7565431f30 mount: Fix an incorrect assertion in kernel_mount()
The pointer to the mount values may be null if an error occurred while
copying them in, so fix the assertion condition to reflect that
possibility.

While here, move some initialization code into the error == 0 block.  No
functional change intended.

Reported by:	syzkaller
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2022-06-14 12:00:59 -04:00
Mark Johnston
630f633f2a vm_object: Use the vm_object_(set|clear)_flag() helpers
... rather than setting and clearing flags inline.  No functional change
intended.

Reviewed by:	alc, kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D35469
2022-06-14 12:00:59 -04:00
Mark Johnston
e8955bd643 pipe: Use a distinct wait channel for I/O serialization
Suppose a thread tries to read from an empty pipe.  pipe_read() does the
following:

1. pipelock(), possibly sleeping
2. check for buffered data
3. pipeunlock()
4. set PIPE_WANTR and sleep
5. goto 1

pipelock() is an open-coded mutex; if a thread blocks in pipelock(), it
sleeps until the lock holder calls pipeunlock().

Both sleeps use the same wait channel.  So if there are multiple threads
in pipe_read(), a thread T1 in step 3 can wake up a thread T2 sleeping
in step 4.  Then T1 goes to sleep in step 4, and T2 acquires and
releases the pipelock, waking up T1 again.  This can go on indefinitely,
livelocking the process (and potentially starving a would-be writer).

Fix the problem by using a separate wait channel for pipelock().

Reported by:	Paul Floyd <paulf2718@gmail.com>
Reviewed by:	mjg, kib
PR:		264441
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D35415
2022-06-14 12:00:59 -04:00
Cy Schubert
d781401512 kern_thread.c: Fix i386 build
Chase 4493a13e3b by updating static
assertions of struct proc.
2022-06-13 19:35:33 -07:00
Konstantin Belousov
1575804961 reap_kill_proc(): avoid singlethreading any other process if we are exiting
This is racy because curproc process lock is not used, but allows the
process to exit faster.  It is userspace issue to create such race
anyway, and not fullfilling the guarantee that all reaper descendants
are signalled should be fine.

In collaboration with:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D35310
2022-06-13 22:30:03 +03:00
Konstantin Belousov
e0343eacf3 reap_kill_subtree(): hold the reaper when entering it into the queue to handle later
We drop proctree_lock, which allows the process to exit while memoized
in the list to proceed.

Reported and tested by:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D35310
2022-06-13 22:30:03 +03:00
Konstantin Belousov
1d4abf2cfa reap_kill_subtree_once(): handle proctree_lock unlock in reap_kill_proc()
Recorded reaper might loose its reaper status, so we should not assert
it, but check and avoid signalling if this happens.

Reported and tested by:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 week
Differential revision:	https://reviews.freebsd.org/D35310
2022-06-13 22:30:03 +03:00
Konstantin Belousov
addf103ce6 reap_kill_proc: do not retry on thread_single() failure
The failure means that the process does single-threading itself, which
makes our action not needed.

Reported and tested by:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D35310
2022-06-13 22:30:03 +03:00
Konstantin Belousov
008b2e6544 Make stop_all_proc_block interruptible to avoid deadlock with parallel suspension
If we try to single-thread a process which thread entered
procctl(REAP_KILL_SUBTREE), and sleeping waiting for us unlocking
stop_all_proc_blocker, we must be able to finish single-threading.  This
requires the sleep to be interruptible.

Reported and tested by:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D35310
2022-06-13 22:30:03 +03:00
Mark Johnston
2d5ef216b6 thread_single_end(): consistently maintain p_boundary_count for ALLPROC mode
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 week
Differential revision:	https://reviews.freebsd.org/D35310
2022-06-13 22:30:03 +03:00
Konstantin Belousov
1b4701fe1e thread_unsuspend(): do not unuspend the suspended leader thread doing SINGLE_ALLPROC
markj wrote:
tdsendsignal() may unsuspend a target thread. I think there is at least
one bug there: suppose thread T is suspended in
thread_single(SINGLE_ALLPROC) when trying to kill another process with
REAP_KILL. Suppose a different thread sends SIGKILL to T->td_proc. Then,
tdsendsignal() calls thread_unsuspend(T, T->td_proc). thread_unsuspend()
incorrectly decrements T->td_proc->p_suspcount to -1.

Later, when T->td_proc exits, it will wait forever in
thread_single(SINGLE_EXIT) since T->td_proc->p_suspcount never reaches 1.

Since the thread suspension is bounded by time needed to do
thread_single(), skipping the thread_unsuspend_one() call there should
not affect signal delivery if this thread is selected as target.

Reported by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D35310
2022-06-13 22:30:03 +03:00
Konstantin Belousov
b9009b1789 thread_single(): remove already checked conditional expression
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D35310
2022-06-13 22:30:03 +03:00
Konstantin Belousov
4493a13e3b Do not single-thread itself when the process single-threaded some another process
Since both self single-threading and remote single-threading rely on
suspending the thread doing thread_single(), it cannot be mixed: thread
doing thread_suspend_switch() might be subject to thread_suspend_one()
and vice versa.

In collaboration with:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D35310
2022-06-13 22:30:03 +03:00
Konstantin Belousov
dd883e9a7e weed_inhib(): correct the condition to re-suspend a thread
suspended for SINGLE_ALLPROC mode.  There is no need to check for
boundary state.  It is only required to see that the suspension comes
from the ALLPROC mode.

In collaboration with:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D35310
2022-06-13 22:30:03 +03:00
Konstantin Belousov
b9893b3533 weed_inhib(): do not double-suspend already suspended thread if the loop reiterates
In collaboration with:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D35310
2022-06-13 22:30:02 +03:00
Konstantin Belousov
d7a9e6e740 thread_single: wait for P_STOPPED_SINGLE to pass
to avoid ALLPROC mode to try to race with any other single-threading
mode.

In collaboration with:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D35310
2022-06-13 22:30:02 +03:00
Konstantin Belousov
02a2aacbe2 issignal(): ignore signals when process is single-threading for exit
Places that will wait for curproc->p_singlethr to become zero (in the
next commit, the counter of number of external single-threading is
to be introduced), must wait for it interruptible, otherwise we
deadlock.  On the other hand, a signal delivered during this window,
if directed to the waiting thread, would cause the wait loop to become
a busy loop.

Since we are exiting, it is safe to ignore the signals.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D35310
2022-06-13 22:30:02 +03:00
Konstantin Belousov
d3000939c7 P2_WEXIT: avoid thread_single() for exiting process earlier
before the process itself does thread_single(SINGLE_EXIT).  We cannot
single-thread such process in ALLPROC (external) mode, and properly
detect and report the failure to do so due to the process becoming
zombie is easier to prevent than handle.

In collaboration with:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D35310
2022-06-13 22:30:02 +03:00
Doug Ambrisko
6468cd8e0e mount: add vnode usage per file system with mount -v
This avoids the need to drop into the ddb to figure out vnode
usage per file system.  It helps to see if they are or are not
being freed.  Suggestion to report active vnode count was from
kib@

Reviewed by:   	kib
Differential Revision: https://reviews.freebsd.org/D35436
2022-06-13 07:56:38 -07:00
Hans Petter Selasky
b8394039dc mbuf(9): Fix size of mbuf for all 32-bit platforms (i386, ARM, PowerPC and RISCV)
Do this by reducing the size of the MBUF_PEXT_MAX_PGS, causing "struct mbuf" to
be bigger than M_SIZE, and also add a missing padding field to ensure 64-bit
alignment.

Reviewed by:	gallatin@
Reported by:	Elliott Mitchell
Differential revision:	https://reviews.freebsd.org/D35339
MFC after:	1 week
Sponsored by:	NVIDIA Networking
2022-06-07 22:09:10 +02:00
Hans Petter Selasky
fe8c78f0d2 ktls: Add full support for TLS RX offloading via network interface.
Basic TLS RX offloading uses the "csum_flags" field in the mbuf packet
header to figure out if an incoming mbuf has been fully offloaded or
not. This information follows the packet stream via the LRO engine, IP
stack and finally to the TCP stack. The TCP stack preserves the mbuf
packet header also when re-assembling packets after packet loss. When
the mbuf goes into the socket buffer the packet header is demoted and
the offload information is transferred to "m_flags" . Later on a
worker thread will analyze the mbuf flags and decide if the mbufs
making up a TLS record indicate a fully-, partially- or not decrypted
TLS record. Based on these three cases the worker thread will either
pass the packet on as-is or recrypt the decrypted bits, if any, or
decrypt the packet as usual.

During packet loss the kernel TLS code will call back into the network
driver using the send tag, informing about the TCP starting sequence
number of every TLS record that is not fully decrypted by the network
interface. The network interface then stores this information in a
compressed table and starts asking the hardware if it has found a
valid TLS header in the TCP data payload. If the hardware has found a
valid TLS header and the referred TLS header is at a valid TCP
sequence number according to the TCP sequence numbers provided by the
kernel TLS code, the network driver then informs the hardware that it
can resume decryption.

Care has been taken to not merge encrypted and decrypted mbuf chains,
in the LRO engine and when appending mbufs to the socket buffer.

The mbuf's leaf network interface pointer is used to figure out from
which network interface the offloading rule should be allocated. Also
this pointer is used to track route changes.

Currently mbuf send tags are used in both transmit and receive
direction, due to convenience, but may get a new name in the future to
better reflect their usage.

Reviewed by:	jhb@ and gallatin@
Differential revision:	https://reviews.freebsd.org/D32356
Sponsored by:	NVIDIA Networking
2022-06-07 12:58:09 +02:00
Hans Petter Selasky
f0fca64618 ktls: Refer send tag pointer once.
So that the asserts and the actual code see the same values.

Differential revision:	https://reviews.freebsd.org/D32356
MFC after:	1 week
Sponsored by:	NVIDIA Networking
2022-06-07 12:57:03 +02:00
Hans Petter Selasky
4d88d81c31 mbuf(9): Implement a leaf network interface field in the mbuf packet header.
When packets are received they may traverse several network interfaces like
vlan(4) and lagg(9). When doing receive side offloads it is important to
know the first network interface entry point, because that is where all
offloading is taking place. This makes it possible to track receive
side route changes for multiport setups, for example when lagg(9) receives
traffic from more than one port. This avoids having to install multiple
offloading rules for the same stream.

This field works similar to the existing "rcvif" mbuf packet header field.

Submitted by:	jhb@
Reviewed by:	gallatin@ and gnn@
Differential revision:	https://reviews.freebsd.org/D35339
Sponsored by:	NVIDIA Networking
Sponsored by:	Netflix
2022-06-07 12:54:42 +02:00
Gleb Smirnoff
d97922c6c6 unix/*: rewrite unp_internalize() cmsg parsing cycle
Make it a complex, but a single for(;;) statement.  The previous cycle
with some loop logic in the beginning and some loop logic at the end
was confusing.  Both me and markj@ were misleaded to a conclusion that
some checks are unnecessary, while they actually were necessary.

While here, handle an edge case found by Mark, when on 64-bit platform
an incorrect message from userland would underflow length counter, but
return without any error.  Provide a test case for such message.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35375
2022-06-06 10:05:28 -07:00
Yuichiro NAITO
8d95f50052 smp: Use local copies of the setup function pointer and argument
No functional change intended.

PR:		264383
Reviewed by:	jhb, markj
MFC after:	1 week
2022-06-06 11:29:51 -04:00
Gleb Smirnoff
2573e6ced9 unix/dgram: rename unpdg_sendspace to unpdg_maxdgram
Matches the meaning of the variable and sysctl node name.
2022-06-03 12:55:44 -07:00
Gleb Smirnoff
a8e286bb5d sockets: use socket buffer mutexes in struct socket directly
Convert more generic socket code to not use sockbuf compat pointer.
Continuation of 4328318445.
2022-06-03 12:55:44 -07:00
Mitchell Horne
35eb9b10c2 Use KERNEL_PANICKED() in more places
This is slightly more optimized than checking panicstr directly. For
most of these instances performance doesn't matter, but let's make
KERNEL_PANICKED() the common idiom.

Reviewed by:	mjg
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D35373
2022-06-02 10:15:43 -03:00
Gleb Smirnoff
f083739350 soo_aio_*: use socket buffer mutexes in struct socket directly
A miss from commit 4328318445.
2022-05-30 20:46:38 -07:00
Dmitry Chagin
d46174cd88 Finish cpuset_getaffinity() after f35093f8
Split cpuset_getaffinity() into a two counterparts, where the
user_cpuset_getaffinity() is intended to operate on the cpuset_t from
user va, while kern_cpuset_getaffinity() expects the cpuset from kernel
va.
Accordingly, the code that clears the high bits is moved to the
user_cpuset_getaffinity(). Linux sched_getaffinity() syscall returns
the size of set copied to the user-space and then glibc wrapper clears
the high bits.

MFC after:		2 weeks
2022-05-28 20:53:08 +03:00
Dmitry Chagin
31d1b816fe sysent: Get rid of bogus sys/sysent.h include.
Where appropriate hide sysent.h under proper condition.

MFC after:	2 weeks
2022-05-28 20:52:17 +03:00
Gleb Smirnoff
d64f2f42c1 unix: unp_externalize() can M_WAITOK
Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35318
2022-05-27 20:48:38 -07:00
Gleb Smirnoff
d59bc188d6 sockbuf: remove unused mbuf counter and cluster counter
With M_EXTPG mbufs these two counters already do not represent the
reality.  As we are moving towards protocol independent socket buffers,
which may not even use mbufs at all, the counters become less and less
relevant.  The only userland seeing them was 'netstat -x'.

PR:			264181 (exp-run)
Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35334
2022-05-27 08:20:17 -07:00
Gleb Smirnoff
75e7e3ce34 unix: fix incorrect assertion in 4682ac697c
Pointy hat to:	glebius
Fixes:		4682ac697c
2022-05-26 11:35:05 -07:00
Gleb Smirnoff
4682ac697c unix: turn check in unp_externalize() into assertion
In this function we always work with mbufs that we previously
created ourselves in unp_internalize().  They must be valid.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35319
2022-05-25 13:29:20 -07:00
Gleb Smirnoff
579b45e203 unix/*: check new control size in unp_internalize()
Now that we call sbcreatecontrol() with M_WAITOK, we are expected to
pass a valid size.  Return same error code, we are returning for an
oversized control from sockargs().

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35317
2022-05-25 13:29:13 -07:00
Gleb Smirnoff
d60ea9a10a sockets: return EMSGSIZE if control part of message is too large
Specification doesn't list an explicit error code for the control
size specified by msg_control being too large.  But it does list
EMSGSIZE as error code for "message is too large to be sent all at
once (as the socket requires)".  It also lists EINVAL as code for
the "The sum of the iov_len values overflows an ssize_t."  Given
how generic and uninformative EINVAL is, the EMSGSIZE is more
appropriate.

https://pubs.opengroup.org/onlinepubs/9699919799/functions/sendmsg.html

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35316
2022-05-25 13:29:04 -07:00
Gleb Smirnoff
ad51c47fb4 sockbuf: fix assertion in sbcreatecontrol()
Fixes:	6890b58814
2022-05-25 00:19:41 -07:00
Mark Johnston
524dadf7a8 kevent: Fix an off-by-one in filt_timerexpire_l()
Suppose a periodic kevent timer fires close to its deadline, so that
now - kc->next is small.  Then delta ends up being 1, and the next timer
deadline is set to (delta + 1) * kc->to, where kc->to is the timer
period.  This means that the timer fires at half of the requested rate,
and the value returned in kn_data is similarly inaccurate.

PR:		264131
Fixes:		7cb40543e9 ("filt_timerexpire: do not iterate over the interval")
Reviewed by:	kib
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D35313
2022-05-24 20:14:33 -04:00
Mateusz Guzik
cdb337b097 vfs: fix copy-pasto in previous
Reported by:	dchagin
2022-05-20 20:58:11 +00:00
Mateusz Guzik
ec3c225711 vfs: call vn_truncate_locked from kern_truncate
This fixes a bug where the syscall would not bump writecount.

PR:	263999
2022-05-20 17:25:51 +00:00
Mateusz Guzik
6b715687bd vfs: make sure truncate always calls NDFREE_*
While here convert it to NDFREE_NOTHING.
2022-05-20 17:25:51 +00:00
Mark Johnston
4a3e51335e cpuset: Fix the KASAN and KMSAN builds
Rename the "copyin" and "copyout" fields of struct cpuset_copy_cb to
something less generic, since sanitizers define interceptors for
copyin() and copyout() using #define.

Reported by:	syzbot+2db5d644097fc698fb6f@syzkaller.appspotmail.com
Fixes:	47a57144af ("cpuset: Byte swap cpuset for compat32 on big endian architectures")
Sponsored by:	The FreeBSD Foundation
2022-05-20 10:34:25 -04:00
Dmitry Chagin
eca368ecb6 Retire sv_transtrap
Call translate_traps directly from sendsig().

MFC after:		2 weeks
2022-05-20 14:54:03 +03:00
Dmitry Chagin
2479e381cd kqueue: Trim trailing whitespace
MFC after:		1 week
2022-05-19 19:52:02 +03:00
Justin Hibbits
47a57144af cpuset: Byte swap cpuset for compat32 on big endian architectures
Summary:
BITSET uses long as its basic underlying type, which is dependent on the
compile type, meaning on 32-bit builds the basic type is 32 bits, but on
64-bit builds it's 64 bits.  On little endian architectures this doesn't
matter, because the LSB is always at the low bit, so the words get
effectively concatenated moving between 32-bit and 64-bit, but on
big-endian architectures it throws a wrench in, as setting bit 0 in
32-bit mode is equivalent to setting bit 32 in 64-bit mode.  To
demonstrate:

32-bit mode:

BIT_SET(foo, 0):        0x00000001

64-bit sees: 0x0000000100000000

cpuset is the only system interface that uses bitsets, so solve this
by swapping the integer sub-components at the copyin/copyout points.

Reviewed by:	kib
MFC after:	3 days
Sponsored by:	Juniper Networks, Inc.
Differential Revision:	https://reviews.freebsd.org/D35225
2022-05-19 10:49:55 -05:00
Andrew Turner
11a6ecd425 Handle cas failure when the compare succeeds
When locking a priority inherit mutex we perform a compare and swap
operation to try and acquire the mutex. This may fail even when the
compare succeeds.

Check and handle this case.

PR:		263825
Reviewed by:	kib, markj
Sponsored by:	The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35150
2022-05-19 11:30:21 +01:00
Gleb Smirnoff
6890b58814 sockbuf: improve sbcreatecontrol()
o Constify memory pointer.  Make length unsigned.
o Make it never fail with M_WAITOK and assert that length is sane.
2022-05-17 10:10:42 -07:00
Gleb Smirnoff
b46667c63e sockbuf: merge two versions of sbcreatecontrol() into one
No functional change.
2022-05-17 10:10:42 -07:00
Gleb Smirnoff
eac7f0798b unix: garbage collect unp_dispose_mbuf() for brevity 2022-05-17 10:10:41 -07:00
Gleb Smirnoff
2e5bf7c49f unix: fix mbuf leak on close of socket with data
Fixes:	1f32cef471
2022-05-17 10:10:41 -07:00
Vladimir Kondratyev
b6f87b78b5 LinuxKPI: Implement kthread_worker related functions
Kthread worker is a single thread workqueue which can be used in cases
where specific kthread association is necessary, for example, when it
should have RT priority or be assigned to certain cgroup.

This change implements Linux v4.9 interface which mostly hides kthread
internals from users thus allowing to use ordinary taskqueue(9) KPI.
As kthread worker prohibits enqueueing of already pending or canceling
tasks some minimal changes to taskqueue(9) were done.
taskqueue_enqueue_flags() was added to taskqueue KPI which accepts extra
flags parameter. It contains one or more of the following flags:

TASKQUEUE_FAIL_IF_PENDING - taskqueue_enqueue_flags() fails if the task
    is already scheduled to execution. EEXIST is returned and the
    ta_pending counter value remains unchanged.
TASKQUEUE_FAIL_IF_CANCELING - taskqueue_enqueue_flags() fails if the
    task is in the canceling state and ECANCELED is returned.

Required by:	drm-kmod 5.10

MFC after:	1 week
Reviewed by:	hselasky, Pau Amma (docs)
Differential Revision:	https://reviews.freebsd.org/D35051
2022-05-17 15:10:20 +03:00
Rick Macklem
373511338d uipc_socket.c: Modify MSG_TLSAPPDATA to only do Alert Records
Without this patch, the MSG_TLSAPPDATA flag would cause
soreceive_generic() to return ENXIO for any non-application
data record in a TLS receive stream.

This works ok for TLS1.2, since Alert records appear to be
the only non-application data records received.
However, for TLS1.3, there can be post-handshake handshake
records, such as NewSessionKey sent to the client from the
server. These handshake records cannot be handled by the
upcall which does an SSL_read() with length == 0.

It appears that the client can simply throw away these
NewSessionKey records, but to do so, it needs to receive
them within the kernel.

This patch modifies the semantics of MSG_TLSAPPDATA slightly,
so that it only applies to Alert records and not Handshake
records. It is needed to allow the krpc to work with KTLS1.3.

Reviewed by:	hselasky
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D35170
2022-05-14 12:56:50 -07:00
Dmitry Chagin
cb2ae61631 sysvsem: Fix a typo
Per jamie@ rpr can be NULL if the jail is created with sysvsem=disable.
But at least it doesn't appear to be fatal, since rpr is never dereferenced
but is only compared to other prison pointers.

Reviewed by:		jamie
Differential revision:	https://reviews.freebsd.org/D35198
MFC after:		2 weeks
2022-05-14 14:07:20 +03:00
Dmitry Chagin
b6c8f461f0 sysvsem: Style(9)
MFC after:	2 weeks
2022-05-14 14:06:58 +03:00
Dmitry Chagin
f0b0fdf15e sysvsem: Trim traiing whitespace
MFC after:	2 weeks
2022-05-14 14:06:40 +03:00
Mitchell Horne
db71383b88 kerneldump: remove physical from dump routines
It is unused, especially now that the underlying d_dumper methods do not
accept the argument.

Reviewed by:	markj
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D35174
2022-05-13 10:43:19 -03:00
Mitchell Horne
489ba22236 kerneldump: remove physical argument from d_dumper
The physical address argument is essentially ignored by every dumper
method. In addition, the dump routines don't actually pass a real
address; every call to dump_append() passes a value of zero for
physical.

Reviewed by:	markj
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D35173
2022-05-13 10:42:48 -03:00
Mitchell Horne
0f50da2e09 Drop d_dump from struct cdevsw
It appears to be unused. These days struct disk has a d_dump member,
which is what gets passed to the kernel dump framework.

Reviewed by:	markj
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D35172
2022-05-13 10:42:17 -03:00
Gleb Smirnoff
bb35a4e11d unix: microoptimize unp_connectat() - one less lock on success
This change is also a preparation for further optimization to
allow locked return on success.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35182
2022-05-12 13:22:39 -07:00
Gleb Smirnoff
08f17d1432 unix: make unp_connect2() void
Assert that sockets are of the same type.  unp_connectat() already did
this check.  Add the check to uipc_connect2().

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35181
2022-05-12 13:22:39 -07:00
Gleb Smirnoff
4328318445 sockets: use socket buffer mutexes in struct socket directly
Since c67f3b8b78 the sockbuf mutexes belong to the containing socket,
and socket buffers just point to it.  In 74a68313b5 macros that access
this mutex directly were added.  Go over the core socket code and
eliminate code that reaches the mutex by dereferencing the sockbuf
compatibility pointer.

This change requires a KPI change, as some functions were given the
sockbuf pointer only without any hint if it is a receive or send buffer.

This change doesn't cover the whole kernel, many protocols still use
compatibility pointers internally.  However, it allows operation of a
protocol that doesn't use them.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35152
2022-05-12 13:22:12 -07:00
Gleb Smirnoff
01235012e5 unix/dgram: uipc_listen() is specific for SOCK_STREAM and SOCK_SEQPACKET
Rely on pr_usrreqs_init() to init SOCK_DGRAM to pru_listen_notsupp().
2022-05-12 11:04:40 -07:00
Gleb Smirnoff
3c87ba3c3b unix/dgram: pru_rcvd never called since PR_WANTRCVD not set 2022-05-12 11:04:40 -07:00
Gleb Smirnoff
2e4e5ee23f sockets: delete stale comment from sofree()
First  paragraph refers to old past "we used to" and is no longer
important today.  Second paragraph has just a wrong statement that
socket buffer is destroyed before pru_detach.
2022-05-12 11:02:50 -07:00
Gleb Smirnoff
1f32cef471 unix: don't call sbrelease() in uipc_detach()
Since a982ce0442 the socket buffer is already cleared and released in
unp_dispose() that is called just before uipc_detach().
2022-05-12 11:02:50 -07:00
Dmitry Chagin
586ed32106 kdump: Decode cpuset_t.
Reviewed by:		jhb
Differential revision:	https://reviews.freebsd.org/D34982
MFC after:		2 weeks
2022-05-11 10:40:39 +03:00
Dmitry Chagin
f35093f8d6 Use Linux semantics for the thread affinity syscalls.
Linux has more tolerant checks of the user supplied cpuset_t's.

Minimum cpuset_t size that the Linux kernel permits in case of
getaffinity() is the maximum CPU id, present in the system / NBBY,
the maximum size is not limited.
For setaffinity(), Linux does not limit the size of the user-provided
cpuset_t, internally using only the meaningful part of the set, where
the upper bound is the maximum CPU id, present in the system, no larger
than the size of the kernel cpuset_t.
Unlike FreeBSD, Linux ignores high bits if set in the setaffinity(),
so clear it in the sched_setaffinity() and Linuxulator itself.

Reviewed by:		Pau Amma (man pages)
In collaboration with:	jhb
Differential revision:	https://reviews.freebsd.org/D34849
MFC after:		2 weeks
2022-05-11 10:36:01 +03:00
Gleb Smirnoff
7db54446c6 sockbufs: make sbrelease_internal() private 2022-05-09 10:43:01 -07:00
Gleb Smirnoff
a982ce0442 sockets: remove the socket-on-stack hack from sorflush()
The hack can be tracked down to 4.4BSD, where copy was performed
under splimp() and then after splx() dom_dispose was called.
Stevens has a chapter on this function, but he doesn't answer why
this trick is necessary.  Why can't we call into dom_dispose under
splimp()?  Anyway, with multithreaded kernel the hack seems to be
necessary to avoid LORs between socket buffer lock and different
filesystem locks, especially network file systems.

The new socket buffers KPI sbcut() from 1d2df300e9 allow us to get
rid of the hack.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35125
2022-05-09 10:43:01 -07:00
Gleb Smirnoff
42f2fa9953 sockets: don't call dom_dispose() on a listening socket
sorflush() already did the right thing, so only sofree() needed
a fix.  Turn check into assertion in our only dom_dispose method.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35124
2022-05-09 10:42:57 -07:00
Gleb Smirnoff
c17418a0ba sockets: assert that any protocol with PR_RIGHTS has dom_dispose()
Through the entire history only PF_UNIX has this feature.

Reviewed by:		markj
Differential revision:	https://reviews.freebsd.org/D35123
2022-05-09 10:42:48 -07:00
Gleb Smirnoff
24df85d29a unix/*: unp_internalize() can sleep, so allocate mbufs with M_WAITOK 2022-05-09 10:42:48 -07:00
Gleb Smirnoff
97f8198e95 sockets: make SO_SND/SO_RCV a enum
Not a functional change now. The enum will also be used for other socket
buffer related KPIs.
2022-05-09 10:42:47 -07:00
Warner Losh
45ae223ac6 msgbuf: Allow microsecond granularity timestamps
Today, kern.msgbuf_show_timestamp=1 will give 1 second granularity
timestamps on dmesg lines. When kern.msgbuf_show_timestamp=2, we'll
produce microsecond level graunlarity.
For example:
old (== 1):
[13] Dual Console: Video Primary, Serial Secondary
[14] lo0: link state changed to UP
[15] bxe0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
[15] bxe0: link state changed to UP
new (== 2):
[13.807015] Dual Console: Video Primary, Serial Secondary
[14.544150] lo0: link state changed to UP
[15.272044] bxe0: NIC Link is Up, 10000 Mbps full duplex, Flow control: ON - receive & transmit
[15.272052] bxe0: link state changed to UP

Sponsored by:		Netflix
2022-05-07 09:32:22 -06:00
Alan Somers
1d2421ad8b Correctly measure system load averages > 1024
The old fixed-point arithmetic used for calculating load averages had an
overflow at 1024.  So on systems with extremely high load, the observed
load average would actually fall back to 0 and shoot up again, creating
a kind of sawtooth graph.

Fix this by using 64-bit math internally, while still reporting the load
average to userspace as a 32-bit number.

Sponsored by:	Axcient
Reviewed by:	imp
Differential Revision: https://reviews.freebsd.org/D35134
2022-05-06 17:25:43 -06:00
John Baldwin
2fdcc2ef6f cpufreq: Remove unused devclass argument to DRIVER_MODULE. 2022-05-06 15:46:58 -07:00
Dmitry Chagin
f04534f5c8 sysvsem: Add a timeout argument to the semop.
For future use in the Linux emulation layer for the semtimedop syscall
split the sys_semop syscall into two counterparts and add
struct timespec *timeout argument to the last one.

Reviewed by:		jhb, kib
Differential revision:	https://reviews.freebsd.org/D35121
MFC after:		2 weeks
2022-05-06 19:51:48 +03:00
Kristof Provost
613acc6483 mbuf: do not restore dying interfaces
When we remove an interface it is first removed from the interface list
V_ifnet (by if_unlink_ifnet()) and marked as IFF_DYING. We then wait for
any possible references to stop being used (i.e.
epoch_wait/epoch_drain_callbacks) before we tear it fully down.

However, the index in ifindex_table is not removed, so m_rcvif_restore()
can still find the (now dying) interface.

This results in panics, for example when dummynet restores the rcvif
pointer and passes a packet to ip6_input() we can panic because the
AF_INET6 domain has already been removed (so we end up dereferencing a
NULL pointer there).

Check that the interface is not dying before we restore it, which is
equivalent to checking its presence in V_ifnet, and thus ensures that
future accesses (while in NET_EPOCH) are safe.

Reviewed by:	glebius
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D34076

(cherry picked from commit 703e533da5)
2022-05-05 14:38:08 -04:00
Gleb Smirnoff
4d7a1361ef ifnet/mbuf: provide KPI to serialize/restore m->m_pkthdr.rcvif
Supplement ifindex table with generation count and use it to
serialize & restore an ifnet pointer.

Reviewed by:		kp
Differential revision:	https://reviews.freebsd.org/D33266
Fun note:		git show e6abef0918

(cherry picked from commit e1882428dc)
2022-05-05 14:38:07 -04:00
Marko Zec
6c741ffbfa Revert "mbuf: do not restore dying interfaces"
This reverts commit 703e533da5.

Revert "ifnet/mbuf: provide KPI to serialize/restore m->m_pkthdr.rcvif"

This reverts commit e1882428dc.

Obtained from: github.com/glebius/FreeBSD/commits/backout-ifindex
2022-05-03 19:11:40 +02:00
Konstantin Belousov
6fe78ad434 subr_unit.c: make userspace tests buildable
by defining a placeholder for UNR_NO_MTX

Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2022-04-28 03:00:14 +03:00
Konstantin Belousov
709783373e Fix another race between fork(2) and PROC_REAP_KILL subtree
where we might not yet see a new child when signalling a process.
Ensure that this cannot happen by stopping all reapping subtree,
which ensures that the child is not inside a syscall, in particular
fork(2).

Reported and tested by:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35014
2022-04-28 02:27:35 +03:00
Konstantin Belousov
39794d80ad Fix a race between fork(2) and PROC_REAP_KILL subtree
by repeating iteration over the subtree until there are no new processes
to signal.

Reported and tested by:	pho
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35014
2022-04-28 02:27:35 +03:00
Konstantin Belousov
d1df347368 kern_procctl: add possibility to take stop_all_proc_block() around exec
stop_allo_proc_block() must be taken before proctree_lock.

Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35014
2022-04-28 02:27:35 +03:00
Konstantin Belousov
2e7595ef2f Add stop_all_proc_block(9)
It allows to have more than one consumer of thread_signle(SIGNLE_ALLPROC) by
serializing them.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35014
2022-04-28 02:27:35 +03:00
Konstantin Belousov
54a11adbd9 reap_kill(): split children and subtree killers into helpers
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35014
2022-04-28 02:27:34 +03:00
Konstantin Belousov
134529b11b reap_kill(): rename the reap variable to reaper
Suggested and reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35014
2022-04-28 02:27:34 +03:00
Konstantin Belousov
e4ce431e2a reap_kill(): de-inline LIST_FOREACH(), twice
Suggested and reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35014
2022-04-28 02:27:34 +03:00
Konstantin Belousov
b9294a3e15 reaper_abandon_children(): upgrade proctree_lock assert to exclusive
p_reapsibling linkage is protected by proctree_lock, and it is modified
there.

Suggested and reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35014
2022-04-28 02:27:34 +03:00
Konstantin Belousov
e59b940dcb unr(9): allow to avoid internal locking
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35014
2022-04-28 02:27:34 +03:00
Konstantin Belousov
c4be460e84 init_unrhdr(): make it usable by initializing everything
Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D35014
2022-04-28 02:27:34 +03:00
John Baldwin
1431239494 Add a __witness_used for variables only used under #ifdef WITNESS.
__diagused is now solely used for variables only used under INVARIANTS.

Reviewed by:	mjg
Differential Revision:	https://reviews.freebsd.org/D35085
2022-04-27 11:46:16 -07:00
Dmitry Chagin
4a700f3c32 sigtimedwait: Prevent timeout math overflows.
Our kern_sigtimedwait() calculates absolute sleep timo value as 'uptime+timeout'.
So, when the user specifies a big timeout value (LONG_MAX), the calculated
timo can be less the the current uptime value.
In that case kern_sigtimedwait() returns EAGAIN instead of EINTR, if
unblocked signal was caught.

While here switch to a high-precision sleep method.

Reviewed by:		mav, kib
In collaboration with:	mav
Differential revision:	https://reviews.freebsd.org/D34981
MFC after:		2 weeks
2022-04-25 10:23:15 +03:00
Dmitry Chagin
91e7bdcdcf Add timespecvalid_interval macro and use it.
Reviewed by:		jhb, imp (early rev)
Differential revision:	https://reviews.freebsd.org/D34848
MFC after:		2 weeks
2022-04-25 10:20:54 +03:00
John Baldwin
a4c5d490f6 KTLS: Move OCF function pointers out of ktls_session.
Instead, create a switch structure private to ktls_ocf.c and store a
pointer to the switch in the ocf_session.  This will permit adding an
additional function pointer needed for NIC TLS RX without further
bloating ktls_session.

Reviewed by:	hselasky
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D35011
2022-04-22 15:52:12 -07:00
John Baldwin
92e40a9b92 busdma_bounce: Batch bounce page free operations when possible.
Reviewed by:	imp
Differential Revision:	https://reviews.freebsd.org/D34968
2022-04-21 12:01:55 -07:00
John Baldwin
d4ab3a8d4f busdma_bounce: Add free_bounce_pages helper function.
Deduplicate code to iterate over the bpages list in a bus_dmamap_t
freeing bounce pages during bus_dmamap_unload.

Reviewed by:	imp
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D34967
2022-04-21 10:42:14 -07:00
John Baldwin
10fe9a1fb4 busdma_bounce: Make the map waiting list per-bounce-zone.
When pages are freed to a bounce zone, only maps waiting for pages for
that zone can make forward progress.  If a map for a different bounce
zone is at the head of the global list, then requests that could
otherwise make forward progress will be stalled waiting on the other
bounce zone.  If bounce zones shared bounce pages then a global list
would still make sense to prevent "later" requests from starving an
earlier request but that is not a concern with per-zone bounce page
pools.

Reviewed by:	imp
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D34966
2022-04-21 10:41:09 -07:00
John Baldwin
d11f5d4762 busdma_bounce: Use a simple kproc to invoke deferred requests.
Rather than using a software interrupt with a single handler, just
create a dedicated kernel process woken up with a simple wakeup().

Reviewed by:	imp
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D34965
2022-04-21 10:40:35 -07:00
John Baldwin
c7aa0304d5 Run softclock threads at a hardware ithread priority.
Add a new PI_SOFTCLOCK for use by softclock threads.  Currently this
maps to PI_AV which is the second-highest ithread priority.

Reviewed by:	mav, kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D33693
2022-04-21 10:40:01 -07:00
John Baldwin
3d7e90fc20 cpufreq_curr_sysctl: Use devclass_find to lookup cpufreq devclass.
Reviewed by:	imp
Differential Revision:	https://reviews.freebsd.org/D35002
2022-04-21 10:29:14 -07:00
Kristof Provost
a879e40ca2 callout: fix using shared rmlocks
15b1eb142c changed the callout code to store the CALLOUT_SHAREDLOCK flag
in c_iflags (where it used to be c_flags), but failed to update the
check in softclock_call_cc(). This resulted in the callout code always
taking the write lock, even if a read lock had been requested (with
the CALLOUT_SHAREDLOCK flag in callout_init_rm()).

Reviewed by:	markj
MFC after:	1 week
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D34959
2022-04-20 13:06:50 +02:00
John Baldwin
5bdea8826b devclass_add_driver: Permit NULL to be passed in dcp.
This permits a driver module structure that doesn't want to store a
pointer to the new driver's devclass.

Reviewed by:	imp
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D34962
2022-04-19 10:43:50 -07:00
Mateusz Guzik
c5c981d443 signals: plug a set-but-not-used var
Sponsored by:	Rubicon Communications, LLC ("Netgate")
2022-04-19 12:45:57 +00:00
John Baldwin
d139909d6e destroy_dev_sched*: Don't hold Giant for all deferred destroy_dev.
Rather than using taskqueue_swi_giant which holds Giant for all
deferred destroy_dev calls, create a separate queue for destroyed
devices with D_NEEDGIANT set in the corresponding cdevsw.  The task
for this queue holds Giant whild destroying deferred devices while the
task for the default queue does not hold Giant.

In addition, switch to taskqueue_thread for destroy_dev_sched.
Deferred destroy_dev requests don't need to run at an SWI priority.

Reviewed by:	imp, markj
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D34915
2022-04-18 12:04:30 -07:00
Konstantin Belousov
362ff9867e Revert rest of a5970a529c: use vrefact() when working on fp->f_vnode
Now, since O_PATH-opened file descriptors use use references instead
of the hold references, vrefact() chahges from that revision can be
reverted.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D34906
2022-04-15 16:56:20 +03:00
Ed Maste
f99cc5a389 sysent: regen after 52a1d90c8b, posix_fadvise in capmode 2022-04-14 15:17:36 -04:00
Ed Maste
52a1d90c8b Allow posix_fadvise in capability mode
posix_fadvise operates only on a provided fd.  Noted by
Mathieu <sigsys@gmail.com> in review D34761.

No new CAP_ rights are added for posix_fadvise(), as 'advice' in
general only influences when I/O happens; the fd must have existing
CAP_ rights for actual data access.

Reviewed by:	markj
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D34903
2022-04-14 15:11:21 -04:00
Konstantin Belousov
bf13db086b Mostly revert a5970a529c: Make files opened with O_PATH to not block non-forced unmount
Problem is that open(O_PATH) on nullfs -o nocache is broken then,
because there is no reference on the vnode after the open syscall exits.

Reported and tested by:	ambrisko
Reviewed by:	markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2022-04-14 02:47:04 +03:00
John Baldwin
36fb372264 kern: Move variables only used for MAC under #ifdef MAC. 2022-04-13 16:08:23 -07:00
John Baldwin
4aec198420 sched_ule: Inline value of ts in sched_thread_priority.
This avoids a set but unused warning in kernels without SMP where
TDQ_CPU() doesn't use its argument.
2022-04-13 16:08:23 -07:00
John Baldwin
8758ac757f sched_4bsd: ts is only used in sched_bind for SMP. 2022-04-13 16:08:22 -07:00
John Baldwin
72ff256c51 sched_4bsd: Remove unused variables. 2022-04-12 14:58:59 -07:00
John Baldwin
dbd51c416a realloc(9): Move slab and zone under #ifndef DEBUG_REDZONE. 2022-04-12 14:58:59 -07:00
Mark Johnston
d769609620 tty: Remove an incorrect assertion from ttyinq_line_iterate()
We may legitimately have tib == NULL if we're at the very end of the
queue.

PR:		215373
Reported by:	pho
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2022-04-12 17:30:04 -04:00
Tom Jones
1ea833a572 kdb: set kdb_why when entered via reboot and panic
Reviewed by:	jhb
Sponsored by:   NetApp, Inc.
Sponsored by:   Klara, Inc.
X-NetApp-PR:    #74
Differential Revision:	https://reviews.freebsd.org/D34551
2022-04-12 10:34:40 +01:00
Dmitry Chagin
c6487446d7 getdirentries: return ENOENT for unlinked but still open directory.
To be more compatible to IEEE Std 1003.1-2008 (“POSIX.1”).

Reviewed by:		mjg, Pau Amma (doc)
Differential revision:  https://reviews.freebsd.org/D34680
MFC after:		2 weeks
2022-04-11 23:30:16 +03:00
Konstantin Belousov
eca39864f7 Add sysctl KERN_LOCKF
reporting the shapshot of the active advisory locks.

A new VFS ops method vfs_report_lockf if provided in the mount point
op table.  If it is NULL, as it is currently for all existing
filesystems, vfs_report_lockf() function is used, which gathers
information from the standard implementation inside kern/kern_lockf.c.

Filesystems implementing its own locking (NFSv4 as example) can provide
a custom implementation.

Reviewed by:	markj, rmacklem
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D34756
2022-04-10 00:43:53 +03:00
Konstantin Belousov
147e4fe3f1 kern_lockf.c: remove no longer neeeded UFS headers
Reviewed by:	markj, rmacklem
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D34756
2022-04-10 00:43:53 +03:00
Konstantin Belousov
59e85819be lockf: remove lf_inode from struct lockf_entry
The UFS-specific struct inode cannot be used in generic advisory lock
code.  It was probably used as a shortcut for the debugging, as the
remnants of the code around it indicates.

Use somewhat more verbose and less concentrated, but universal,
VOP_PRINT(), where needed.

Reviewed by:	markj, rmacklem
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D34756
2022-04-10 00:43:53 +03:00
Gordon Bergling
f171938cd6 jail: Remove a double word in a source code comment
- s/a a/a/

MFC after:	3 days
2022-04-09 14:19:17 +02:00
Gordon Bergling
c3721292e3 kern: Remove a double word in a source code comment
- s/for for/for/

MFC after:	3 days
2022-04-09 10:50:04 +02:00
Gordon Bergling
768f9b8b8b kern: Fix a typo in a source code comment
- s/is is/is/

MFC after:	3 days
2022-04-09 09:14:14 +02:00
Andrew Turner
41e6d2091c Enable subr_physmem_test on supported architectures
Only build where it's supported.

While here add support for amd64 to help with testing.

Sponsored by:	The FreeBSD Foundation
2022-04-07 14:31:51 +01:00
Andrew Turner
d8bff5b67c Handle non-page aligned/sized memory in physmem
In some configurations the firmware may pass memory regions that are
not page sized or aligned, e.g. when using 16k pages on arm64. If this
is the case we will calculate many small regions because the alignment
is applied before being inserted. As we round the start up and end down
this will leave a 1 page hole between what should have been a single
region.

Fix by keeping the original alignment until we are just about to insert
the region into the avail array.

Sponsored by:	The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34694
2022-04-06 14:13:29 +01:00
Andrew Turner
8c99dfed54 Port subr_physmem to userspace and add tests
These give us some confidience we haven't broken anything in early
boot code that may be running before the console.

Reviewed by:	emaste
Sponsored by:	The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34691
2022-04-06 14:13:05 +01:00
Mitchell Horne
eb9d205fa6 livedump: add event handler hooks
Add three hooks to the livedump process: before, after, and for each
block of dumped data. This allows, for example, quiescing the system
before the dump begins or protecting data of interest to ensure its
consistency in the final output.

Reviewed by:	markj, kib (previous version)
Reviewed by:	debdrup (manpages)
Reviewed by:	Pau Amma <pauamma@gundo.com> (manpages)
MFC after:	3 weeks
Sponsored by:	Juniper Networks, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D34067
2022-04-05 15:35:05 -03:00
Mitchell Horne
c9114f9f86 Add new vnode dumper to support live minidumps
This dumper can instantiate and write the dump's contents to a
file-backed vnode.

Unlike existing disk or network dumpers, the vnode dumper should not be
invoked during a system panic, and therefore is not added to the global
dumper_configs list. Instead, the vnode dumper is constructed ad-hoc
when a live dump is requested using the new ioctl on /dev/mem. This is
similar in spirit to a kgdb session against the live system via
/dev/mem.

As described briefly in the mem(4) man page, live dumps are not
guaranteed to result in a usuable output file, but offer some debugging
value where forcefully panicing a system to dump its memory is not
desirable/feasible.

A future change to savecore(8) will add an option to save a live dump.

Reviewed by:	markj, Pau Amma <pauamma@gundo.com> (manpages)
Discussed with:	kib
MFC after:	3 weeks
Sponsored by:	Juniper Networks, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D33813
2022-04-05 15:35:05 -03:00
Mitchell Horne
59c27ea18c Split out dumper allocation from list insertion
Add a new function, dumper_create(), to allocate a dumper.
dumper_insert() will call this function and retains the existing
behaviour.

This is desirable for performing live dumps of the system. Here, there
is a need to allocate and configure a dumper structure that is invoked
outside of the typical debugger context. Therefore, it should be
excluded from the list of panic-time dumpers.

free_single_dumper() is made public and renamed to dumper_destroy().

Reviewed by:	kib, markj
MFC after:	1 week
Sponsored by:	Juniper Networks, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D34068
2022-04-05 15:35:05 -03:00
Mateusz Guzik
b7262756e2 vfs: fixup WANTIOCTLCAPS on open
In some cases vn_open_cred overwrites cn_flags, effectively nullifying
initialisation done in NDINIT. This will have to be fixed.

In the meantime make sure the flag is passed.

Reported by:	jenkins
Noted by:	Mathieu <sigsys@gmail.com>
2022-04-02 20:49:01 +02:00
Gordon Bergling
c9b04ee4f8 kern: Fix two typos in source code comments
- s/accomodate/accommodate/

MFC after:	3 days
2022-04-02 14:52:49 +02:00
Gordon Bergling
7181887e82 kern: Fix two typos in source code comments
- s/measurment/measurement/

MFC after:	3 days
2022-04-02 14:15:27 +02:00
Mateusz Guzik
0c805718cb vfs: fix memory leak on lookup with fds with ioctl caps
Reviewed by:	markj
PR:		262515
Noted by:	firk@cantconnect.ru
Differential Revision:	https://reviews.freebsd.org/D34667
2022-04-02 12:09:07 +00:00
Gordon Bergling
669d5ea4e3 kern: Fix a typo in a source code comment
- s/paniced/panicked/

MFC after:	3 days
2022-04-02 10:15:02 +02:00
Ed Maste
e5821a2156 syscalls.master: remove obsolete comment about compatibility tables
Compatibility ABIs no longer use a separate syscalls.master.

Fixes:		be67ea40c5 ("freebsd32: generate from ...")
Sponsored by:	The FreeBSD Foundation
2022-03-30 11:07:00 -04:00
Brooks Davis
8601fca789 sysent: regen for syscallarg_t 2022-03-28 19:43:03 +01:00
Brooks Davis
b1ad6a9000 syscallarg_t: Add a type for system call arguments
This more clearly differentiates system call arguments from integer
registers and return values. On current architectures it has no effect,
but on architectures where pointers are not integers (CHERI) and may
not even share registers (CHERI-MIPS) it is necessiary to differentiate
between system call arguments (syscallarg_t) and integer register values
(register_t).

Obtained from:	CheriBSD

Reviewed by:	imp, kib
Differential Revision:	https://reviews.freebsd.org/D33780
2022-03-28 19:43:03 +01:00
Andrew Turner
f461b95561 Fix a sign mismatch warning in the physmem code
Make sure both sides of a comparison are unsigned. As the values being
compared are size_t make the the value in the for loop size_t too.

Sponsored by:	The FreeBSD Foundation
2022-03-28 11:51:09 +01:00
Mateusz Guzik
2533b5dc82 vfs: add missing bits to vdropl_impl
This completes the patch which was originally meant to go in.

Spotted by:	mhorne
Fixes: c35ec1efdc ("vfs: [1/2] fix stalls in vnode reclaim by not
requeieing from vnlru")
2022-03-27 14:35:37 +00:00
Mateusz Guzik
a4032e2a69 vfs: assorted tidy ups to lookup
No functional changes.
2022-03-26 17:06:09 +00:00
Alexander Leidinger
aeb91e95cf Log euid, rgid and jail on listen queue overflow
If you have numerous jails with multiple similar services running,
this helps to narrow down which services this log is referring to.
2022-03-26 11:17:55 +01:00
Eric van Gyzen
aca2a7faca stack_zero is not needed before stack_save
The man page was recently clarified to commit to this contract.

MFC after:	1 week
Sponsored by:	Dell EMC Isilon
2022-03-25 20:10:38 -05:00
Eric van Gyzen
863070bbf6 ksiginfo_alloc: pass M_WAITOK or M_NOWAIT to uma_zalloc
It expects exactly one of those flags.  A future commit will assert this.

Reviewed by:	rstone
MFC after:	1 month
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D34451
2022-03-25 20:10:37 -05:00
Mateusz Guzik
0f60088399 vfs: set cn_namelen when handling degenerate lookups
Turns out execve looks at it to store binary name, but in order to
trigger the problem one has to be trying to exec '/'. As is the value
would be left uninitialized (or rather set to -1 on debug kernels).

Fixes:	56244d3574 ("vfs: hoist degenerate path lookups out of the
loop")
2022-03-25 18:19:36 +00:00
Mateusz Guzik
4ef6e56ae8 vfs: hoist trailing slash handling out of the loop 2022-03-24 14:36:31 +00:00
Mateusz Guzik
3b6792d28a vfs: factor symlink traversal out of namei
The intent down the road is to eliminate the loop to begin with,
pushing traversal down to vfs_lookup, all while not allocating the
extra buffer.
2022-03-24 13:11:22 +00:00
Mateusz Guzik
d9ea7e2b1e vfs: factor FAILIFEXISTS handling out of vfs_lookup 2022-03-24 11:22:20 +00:00
Mateusz Guzik
56244d3574 vfs: hoist degenerate path lookups out of the loop 2022-03-24 11:22:12 +00:00
Mateusz Guzik
bb92cd7bcd vfs: NDFREE(&nd, NDF_ONLY_PNBUF) -> NDFREE_PNBUF(&nd) 2022-03-24 10:20:51 +00:00
Mark Johnston
1babcad6bc elf: Avoid dumping uninitialized bytes in PRSTATUS core dump notes
elf_prstatus_t contains pad space.

Reported by:	KMSAN
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D34606
2022-03-23 12:53:49 -04:00
Mark Johnston
7524994da0 callout: Remove the CS_EXECUTING flag
It is now unused.

MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D34626
2022-03-23 12:37:02 -04:00
Mark Johnston
b319171861 setitimer: Fix exit race
We use the p_itcallout callout, interlocked by the proc lock, to
schedule timeouts for the setitimer(2) system call.  When a process
exits, the callout must be stopped before the process struct is
recycled.

Currently we attempt to stop the callout in exit1() with the call
_callout_stop_safe(&p->p_itcallout, CS_EXECUTING).  If this call returns
0, then we sleep in order to drain the callout.  However, this happens
only if the callout is not scheduled at all.  If the callout thread is
blocked on the proc lock, then exit1() will not block and the callout
may execute after the process has fully exited, typically resulting in a
panic.

I cannot see a reason to use the CS_EXECUTING flag here.  Instead, use
the regular callout_stop()/callout_drain() dance to halt the callout.

Reported by:	ler
Tested by:	ler, pho
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D34625
2022-03-23 12:36:12 -04:00
Alexander Motin
fd6ca665d2 Fix umtxq_sleep() regression caused by 56070dd2e4.
umtxq_requeue() moves the queue to a different hash chain and different
lock, so we can't rely on msleep_sbt() reacquiring the same old lock.
We have to use PDROP and update the queue chain and so lock pointer.

PR:		262587
MFC after:	2 weeks
2022-03-21 19:55:55 -04:00
firk
bb53dd56c3 kern_tc.c/cputick2usec() (which is used to calculate cputime from
cpu ticks) has some imprecision and, worse, huge timestep (about
20 minutes on 4GHz CPU) near 53.4 days of elapsed time.

kern_time.c/cputick2timespec() (it is used for clock_gettime() for
querying process or thread consumed cpu time) Uses cputick2usec()
and then needlessly converting usec to nsec, obviously losing
precision even with fixed cputick2usec().

kern_time.c/kern_clock_getres() uses some weird (anyway wrong)
formula for getting cputick resolution.

PR:		262215
Reviewed by:	gnn
Differential Revision:	https://reviews.freebsd.org/D34558
2022-03-21 09:33:46 -04:00
Andrew Turner
cab496e16c Make SHMMAXPGS an unsigned long
This is used to calculate sizes that are then stored in unsigned long
fields. Make this unsigned long so the calculations use this type and
not an int that can lead to an integer overflow with a large PAGE_SIZE.

This allows building this on arm64 with PAGE_SIZE of 16k. Further work
will be needed if a 32-bit architecture tries to use a similar sized
page.

Sponsored by:	The FreeBSD Foundation
2022-03-21 10:27:35 +00:00
Colin Percival
2406867f5b tslog: Add CTLFLAG_SKIP to sysctls
The timestamp logs are quite large (often much larger than all the
other sysctls combined) so it's unlikely anyone will want to have
them displayed by `sysctl -a`.

MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D34616
2022-03-20 11:31:16 -07:00