As with FIFOs, a path descriptor for a unix socket cannot be used with
kevent().
In principle connectat(2) and bindat(2) could be modified to support an
AT_EMPTY_PATH-like mode which operates on the socket referenced by an
O_PATH fd referencing a unix socket. That would eliminate the path
length limit imposed by sockaddr_un.
Update O_PATH tests.
Reviewed by: kib
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31970
To handle shutdown(SHUT_RD) we flush the receive buffer of the socket.
This may involve searching for control messages of type SCM_RIGHTS,
since we need to close the file references. Closing arbitrary files
with socket buffer locks held is undesirable, mainly due to lock
ordering issues, so we instead make a copy of the socket buffer and
operate on that without any locks. Fields in the original buffer are
cleared.
This behaviour clobbered the AIO job queue associated with a receive
buffer. It could also cause us to leak a KTLS session reference.
Reorder socket buffer fields to address this.
An alternate solution would be to remove the hack in sorflush(), but
this is not quite feasible (yet). In particular, though sorflush()
flags the sockbuf with SBS_CANTRCVMORE, it is possible for more data to
be queued - the flag just prevents userspace from reading more data. I
suspect we should fix this; SBS_CANTRCVMORE represents a terminal state
and protocols can likely just drop any data destined for such a buffer.
Many of them already do, but in some cases the check is racy, and some
KPI churn will be needed to fix everything. This approach is more
straightforward for now.
Reported by: syzbot+104d8ee3430361cb2795@syzkaller.appspotmail.com
Reported by: syzbot+5bd2e7d05f84a59d0d1b@syzkaller.appspotmail.com
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31976
This flag was added during the transition away from the legacy zone
allocator, commit c897b81311. The old
zone allocator effectively provided _NOFREE semantics, but it seems that
they are not required for sockets. In particular, we use reference
counting to keep sockets live.
One somewhat dangerous case is sonewconn(), which returns a pointer to a
socket with reference count 0. This socket is still effectively owned
by the listening socket. Protocols must therefore be careful to
synchronize sonewconn() calls with their pru_close implementations,
since for listening sockets soclose() will abort the child sockets. For
example, TCP holds the listening socket's PCB read locked across the
sonewconn() call, which blocks tcp_usr_close(), and sofree()
synchronizes with a concurrent soabort() of the nascent socket.
However, _NOFREE semantics are not required here.
Eliminating _NOFREE has several benefits: it enables use-after-free
detection (e.g., by KASAN) and lets the system reclaim memory from the
socket zone under memory pressure. No functional change intended.
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31975
Sockets in a listen queue hold a reference to the parent listening
socket. Several code paths release this reference manually when moving
a child socket out of the queue.
Replace comments about the expected post-decrement refcount value with
assertions. Use refcount_load() instead of a plain load. No functional
change intended.
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31974
After releasing the fd reference to a socket "so", we should avoid
testing SOLISTENING(so) since the socket may have been freed. Instead,
directly test whether the list of unaccepted sockets is empty.
Fixes: f4bb1869dd ("Consistently use the SOLISTENING() macro")
Pointy hat: markj
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31973
ktls_get_(rx|tx)_mode() can return an errno value or a TLS mode, so
errors are effectively hidden. Fix this by using a separate output
parameter. Convert to the new socket buffer locking macros while here.
Note that the socket buffer lock is not needed to synchronize the
SOLISTENING check here, we can rely on the PCB lock.
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31977
It allows to override kern.elf{32,64}.allow_wx on per-process basis.
In particular, it makes it possible to run binaries without PT_GNU_STACK
and without elfctl note while allow_wx = 0.
Reviewed by: brooks, emaste, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31779
Reimplement bdf0f24bb1 by checking for the caller' ABI in
the implementation of PT_GET_SC_ARGS, and copying out everything if
it is Linuxolator.
Also fix a minor information leak: if PT_GET_SC_ARGS_ALL is done on the
thread reused after other process, it allows to read some number of that
thread last syscall arguments. Clear td_sa.args in thread_alloc().
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D31968
For an invalid filesystem name used like this:
mount -t asdfs /dev/ada1p5 /usr/obj
emit an error message like this:
mount: /dev/ada1p5: Invalid fstype: Invalid argument
instead of:
mount: /dev/ada1p5: Operation not supported by device
Differential Revision: https://reviews.freebsd.org/D31540
This is one of the pieces required to make modern (ie Focal)
strace(1) work.
Reviewed By: jhb (earlier version)
Sponsored by: EPSRC
Differential Revision: https://reviews.freebsd.org/D28212
Move the type and function pointers for operations on existing send
tags (modify, query, next, free) out of 'struct ifnet' and into a new
'struct if_snd_tag_sw'. A pointer to this structure is added to the
generic part of send tags and is initialized by m_snd_tag_init()
(which now accepts a switch structure as a new argument in place of
the type).
Previously, device driver ifnet methods switched on the type to call
type-specific functions. Now, those type-specific functions are saved
in the switch structure and invoked directly. In addition, this more
gracefully permits multiple implementations of the same tag within a
driver. In particular, NIC TLS for future Chelsio adapters will use a
different implementation than the existing NIC TLS support for T6
adapters.
Reviewed by: gallatin, hselasky, kib (older version)
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D31572
- kern_kcov.c needs to be compiled with -fsanitize=kernel-memory when
KMSAN is configured since it calls into various other subsystems.
- Disable address and memory sanitizers in kcov(4)'s coverage sanitizer
callbacks, as they do not provide useful checking. Moreover, with
KMSAN we may otherwise get false positives since the caller (coverage
sanitizer runtime) is not instrumented.
- Disable KASAN and KMSAN interceptors in subr_coverage.c, as they do
not provide any benefit but do introduce overhead when fuzzing.
Sponsored by: The FreeBSD Foundation
Add HWPMC events to measure latency.
Provide sysctl to choose the number of outstanding events which
trigger HWPMC event.
Obtained from: Semihalf
Sponsored by: Stormshield
Differential revision: https://reviews.freebsd.org/D31283
Some system software expects to be able to read at least the number of
bytes returned by FIONREAD. When control messages are counted in this
return value, this assumption is violated. Follow Linux and OpenBSD
here (as well as our own kevent(EVFILT_READ)) and only return the number
of data bytes available.
Reported by: avg
MFC after: 2 weeks
With lio_listio(2), the opcode is specified by userspace rather than
being hard-coded by the system call (e.g., aio_readv() -> LIO_READV).
kern_lio_listio() calls aio_aqueue() with an opcode of LIO_NOP, which
gets fixed up when the aiocb is copied in.
When copying in a job request for vectored I/O, we need to dynamically
allocate a uio to wrap an iovec. So aiocb_copyin() needs to get the
opcode from the aiocb and then decide whether an allocation is required.
We failed to do this in the COMPAT_FREEBSD32 case. Fix it.
Reported by: syzbot+27eab6f2c2162f2885ee@syzkaller.appspotmail.com
Reviewed by: kib, asomers
Fixes: f30a1ae8d5 ("lio_listio(2): Allow LIO_READV and LIO_WRITEV.")
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31914
soo_aio_queue() did not handle the possibility that the provided socket
is a listening socket. Up until recently, to fix this one would have to
acquire the socket lock first and check, since the socket buffer locks
were destroyed by listen(2).
Now that the socket buffer locks belong to the socket, simply check
SOLISTENING(so) after acquiring them, and make listen(2) return an error
if any AIO jobs are enqueued on the socket.
Add a couple of simple regression test cases.
Note that this fixes things only for the default AIO implementation;
cxgbe(4)'s TCP offload has a separate pru_aio_queue implementation which
requires its own solution.
Reported by: syzbot+c8aa122fa2c6a4e2a28b@syzkaller.appspotmail.com
Reported by: syzbot+39af117d43d4f0faf512@syzkaller.appspotmail.com
Reported by: syzbot+60cceb9569145a0b993b@syzkaller.appspotmail.com
Reported by: syzbot+2d522c5db87710277ca5@syzkaller.appspotmail.com
Reviewed by: tuexen, gallatin, jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31901
When traversing a list of interface addresses, we need to be in a net
epoch section, and protocol ctlinput routines need a stable reference to
the address.
Reported by: syzbot+3219af764ead146a3a4e@syzkaller.appspotmail.com
Reviewed by: kp, melifaro
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31889
osd_register(9) may reallocate and expand the destructor array for a
given object type if no space is available for a new key. This happens
with the object lock held. Thus, when verifying that a given slot in
the array is occupied, we need to hold the object lock to avoid racing
with a reallocation.
Reported by: syzbot+69ce54c7d7d813315dd3@syzkaller.appspotmail.com
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Although it is not specified in the RFCs, the concept that
the NFSv4 server should reply to an RPC request within a
reasonable time is accepted practice within the NFSv4 community.
Without this patch, the NFSv4.2 server attempts to reply to
a Copy operation within 1second by limiting the copy to
vfs.nfs.maxcopyrange bytes (default 10Mbytes). This is crude at
best, given the large variation in I/O subsystem performance.
This patch adds a kernel only flag COPY_FILE_RANGE_TIMEO1SEC
that the NFSv4.2 can specify, which tells VOP_COPY_FILE_RANGE()
to return after approximately 1 second with a partial result and
implements this in vn_generic_copy_file_range(), used by
vop_stdcopyfilerange().
Modifying the NFSv4.2 server to set this flag will be done in
a separate patch. Also under consideration is exposing the
COPY_FILE_RANGE_TIMEO1SEC to userland for use on the FreeBSD
copy_file_range(2) syscall.
MFC after: 2 weeks
Reviewed by: khng
Differential Revision: https://reviews.freebsd.org/D31829
This behaviour appears to date from the 4.4 BSD import. It has two
problems:
1. The update to so_state is not protected by the socket lock, so
concurrent updates to so_state may be lost.
2. Suppose two threads race to call connect(2) on a socket, and one
succeeds while the other fails. Then the failing thread may
incorrectly clear SS_ISCONNECTING, confusing the state machine.
Simply remove the update. It does not appear to be necessary:
pru_connect implementations which call soisconnecting() only do so after
all failure modes have been handled. For instance, tcp_connect() and
tcp6_connect() will never return an error after calling soisconnected().
However, we cannot correctly assert that SS_ISCONNECTED is not set after
an error from soconnect() since the socket lock is not held across the
pru_connect call, so a concurrent connect(2) may have set the flag.
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31699
Now that SOCK_IO_*_LOCK() checks for listening sockets, we can eliminate
some racy SOLISTENING() checks. No functional change intended.
Reviewed by: tuexen
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31660
Currently, most protocols implement pru_listen with something like the
following:
SOCK_LOCK(so);
error = solisten_proto_check(so);
if (error) {
SOCK_UNLOCK(so);
return (error);
}
solisten_proto(so);
SOCK_UNLOCK(so);
solisten_proto_check() fails if the socket is connected or connecting.
However, the socket lock is not used during I/O, so this pattern is
racy.
The change modifies solisten_proto_check() to additionally acquire
socket buffer locks, and the calling thread holds them until
solisten_proto() or solisten_proto_abort() is called. Now that the
socket buffer locks are preserved across a listen(2), this change allows
socket I/O paths to properly interlock with listen(2).
This fixes a large number of syzbot reports, only one is listed below
and the rest will be dup'ed to it.
Reported by: syzbot+9fece8a63c0e27273821@syzkaller.appspotmail.com
Reviewed by: tuexen, gallatin
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31659
In preparation for moving sockbuf locks into the containing socket,
provide alternative macros for the sockbuf I/O locks:
SOCK_IO_SEND_(UN)LOCK() and SOCK_IO_RECV_(UN)LOCK(). These operate on a
socket rather than a socket buffer. Note that these locks are used only
to prevent concurrent readers and writters from interleaving I/O.
When locking for I/O, return an error if the socket is a listening
socket. Currently the check is racy since the sockbuf sx locks are
destroyed during the transition to a listening socket, but that will no
longer be true after some follow-up changes.
Modify a few places to check for errors from
sblock()/SOCK_IO_(SEND|RECV)_LOCK() where they were not before. In
particular, add checks to sendfile() and sorflush().
Reviewed by: tuexen, gallatin
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31657
Linux fails to build now because the mkdtemp in the bootstrapped
environment wants 6 or more X's. Use 10 out of an abundance of caution.
Sponsored by: Netflix
Reviewed by: arichards
Differential Revision: https://reviews.freebsd.org/D31863
Otherwise return from the syscall and next syscall, which could be
kevent(2) on the kqueue that should be notified, races with the kqueue
taskqueue thread, and potentially misses the wakeup. This is reliably
visible when kevent(2) only peeks into events using zeroed timeout.
PR: 258310
Reported by: arichardson, Jan Kokemüller <jan.kokemueller@gmail.com>
Reviewed by: arichardson, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31858
The 'intr_config_hooks' SYSINIT is now taking a nontrivial amount of
time in my profiling; run_interrupt_driven_config_hooks is responsible
for most of it, so this adds useful information to the resulting
flamecharts.
Without this change when virtual console enabled depending on buffer
presence and state different parts of output go to different consoles.
MFC after: 1 month
Protect conscallout with tty lock instead of Giant. In addition to
Giant removal it also closes race on console unset.
Introduce additional lock to protect against concurrent console sets.
Remove consbuf free on console unset as unsafe, making impossible to
change buffer size after first allocation. Instead increase default
buffer size from 8KB to 64KB and processing rate from 5Hz to 10-15Hz
to make the output more smooth.
MFC after: 1 month
Implement lock_spin()/unlock_spin() lock class methods, moving the
assertion to _sleep() instead. Change assertions in callout(9) to
allow spin locks for both regular and C_DIRECT_EXEC cases. In case of
C_DIRECT_EXEC callouts spin locks are the only locks allowed actually.
As the first use case allow taskqueue_enqueue_timeout() use on fast
task queues. It actually becomes more efficient due to avoided extra
context switches in callout(9) thanks to C_DIRECT_EXEC.
MFC after: 2 weeks
Reviewed by: hselasky
Differential Revision: https://reviews.freebsd.org/D31778
The issue was reported by
Alexander Lochmann <alexander.lochmann@tu-dortmund.de>,
who found the problem by performing lock analysis using LockDoc,
see https://doi.org/10.1145/3302424.3303948.
Reviewed by: mckusick
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31780
It has a prohibitive performance impact when running real workloads.
Note this only affects kernels with DIAGNOSTIC.
Reviewed by: markj
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D31784
Switch the main syscall table to use CAPENABLED flags rather than
capabilities.conf. This avoid synchronization issues between
syscalls.master and capabilities.conf (e.g. when renaming a syscall
during development).
For now, move capabilities.conf to sys/compat/freebsd32 and use it
there. Use of sys/compat/freebsd32/syscalls.master should be replaced
by makesyscalls.lua enhancements to allow the main one to be used.
This change results in no changes to generated files after running
`make sysent`.
Reviewed by: kevans, emaste
MFC after: 1 week
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D31350
The CAPENABLED flag indicates that the syscall can be used in capsicum
capability mode. It is intended to replace capabilities.conf.
Reviewed by: kevans, emaste
MFC after: 1 week
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D31349
This changes vn_deallocate() to match the behavior of vn_rdwr() when
picking which ucred to use. That is, vn_deallocate() uses file_cred for
making VOP call if it is non-NULL, or use active_cred otherwise.
Sponsored by: The FreeBSD Foundation
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D31712
Fix the following race between itimer_proc_continue() and process exit.
itimer_proc_continue() may be called via realitexpire(), the real
interval timer. Note that exit1() drains this timer _after_ draining
and freeing itimers. Moreover, itimers_exit() is called without the
process lock held; it only acquires the proc lock when deleting
individual itimers, so once they are drained we free p->p_itimers
without any synchronization. Thus, itimer_proc_continue() may load a
non-NULL p->p_itimers array and iterate over it after it has been freed.
Fix the problem by using the process lock when clearing p->p_itimers, to
synchronize with itimer_proc_continue(). Formally, accesses to this
field should be protected by the process lock anyway, and since the
array is allocated lazily this will not incur any overhead in the common
case.
Reported by: syzbot+c40aa8bf54fe333fc50b@syzkaller.appspotmail.com
Reported by: syzbot+929be2f32503bbc3844f@syzkaller.appspotmail.com
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31759
KTLS OCF support was originally targeted at software backends that
used host CPU cycles to encrypt TLS records. As a result, each KTLS
worker thread queued a single TLS record at a time and waited for it
to be encrypted before processing another TLS record. This works well
for software backends but limits throughput on OCF drivers for
coprocessors that support asynchronous operation such as qat(4) or
ccr(4). This change uses an alternate function (ktls_encrypt_async)
when encrypt TLS records via a coprocessor. This function queues TLS
records for encryption and returns. It defers the work done after a
TLS record has been encrypted (such as marking the mbufs ready) to a
callback invoked asynchronously by the coprocessor driver when a
record has been encrypted.
- Add a struct ktls_ocf_state that holds the per-request state stored
on the stack for synchronous requests. Asynchronous requests malloc
this structure while synchronous requests continue to allocate this
structure on the stack.
- Add a ktls_encrypt_async() variant of ktls_encrypt() which does not
perform request completion after dispatching a request to OCF.
Instead, the ktls_ocf backends invoke ktls_encrypt_cb() when a TLS
record request completes for an asynchronous request.
- Flag AEAD software TLS sessions as async if the backend driver
selected by OCF is an async driver.
- Pull code to create and dispatch an OCF request out of
ktls_encrypt() into a new ktls_encrypt_one() function used by both
ktls_encrypt() and ktls_encrypt_async().
- Pull code to "finish" the VM page shuffling for a file-backed TLS
record into a helper function ktls_finish_noanon() used by both
ktls_encrypt() and ktls_encrypt_cb().
Reviewed by: markj
Tested on: ccr(4) (jhb), qat(4) (markj)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D31665
Move the common kernel function signatures from machine/reg.h to a new
sys/reg.h. This is in preperation for adding PT_GETREGSET to ptrace(2).
Reviewed by: imp, markj
Sponsored by: DARPA, AFRL (original work)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19830
Yield at the end of each loop iteration if there are remaining works as
indicated by the value of *len updated by VOP_DEALLOCATE. Without this,
when calling vop_stddeallocate to zero a large region, the
implementation only zerofills a relatively small chunk and returns.
Sponsored by: The FreeBSD Foundation
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D31705
Restore the pre-1d874ba4f8ba behaviour of disassociating the current
SIGIO recipient before looking up the specified process or process
group. This avoids a lock recursion in the scenario where a process
group is configured to receive SIGIO for an fd when it has already been
so configured.
Reported by: pho
Tested by: pho
Reviewed by: kib
MFC after: 3 days
A future commit to the NFS client uses vop_stddeallocate() for
cases where the NFS server does not support a Deallocate operation.
Change vop_stddeallocate() from static to global so that it can
be called by the NFS client.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D31640
Rehash updates v_hash. Also, rehash moves the vnode to different hash
bucket, which should be noticed in vfs_hash_get() after sleeping for
the vnode lock.
Reviewed by: mckusick, rmacklem
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31464
After vnode lock, recheck v_hash. When vfs_hash_insert() is used with
a predicate, recheck it after the selected vnode is locked. Since
vfs_hash_lock is dropped, vnode could be rehashed during the sleep for
the vnode lock, which could go unnoticed there.
Reported and tested by: pho
Reviewed by: mckusick, rmacklem
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31464
soconnect(...) is equivalent to soconnectat(AT_FDCWD, ...), so rely on
this to save a branch. No functional change intended.
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
- pget()/pfind() will acquire the PID hash bucket locks, which are
sleepable sx locks, but this means that the sigio mutex cannot be held
while calling these functions. Instead, use pget() to hold the
process, after which we lock the sigio and proc locks, respectively.
- funsetownlst() assumes that processes cannot be registered for SIGIO
once they have P_WEXIT set. However, pfind() will happily return
exiting processes, breaking the invariant. Add an explicit check for
P_WEXIT in fsetown() to fix this. [1]
Fixes: f52979098d ("Fix a pair of races in SIGIO registration")
Reported by: syzkaller [1]
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31661
rmsr.r_offset now is set to rqsr.r_offset plus the number of bytes
zeroed before hitting the end-of-file. After this change rmsr.r_offset
no longer contains the EOF when the requested operation range is
completely beyond the end-of-file. Instead in such case rmsr.r_offset is
equal to rqsr.r_offset. Callers can obtain the number of bytes zeroed
by subtracting rqsr.r_offset from rmsr.r_offset.
Sponsored by: The FreeBSD Foundation
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D31677
This avoids any unnecessary works in such case.
Sponsored by: The FreeBSD Foundation
Reviewed by: markj, kib
Differential Revision: https://reviews.freebsd.org/D31655
rmacklem@ spotted two things in the system call:
- Upon returning from a successful operation, vop_stddeallocate can
update rmsr.r_offset to a value greater than file size. This behavior,
although being harmless, can be confusing.
- The EINVAL return value for rqsr.r_offset + rqsr.r_len > OFF_MAX is
undocumented.
This commit has the following changes:
- vop_stddeallocate and shm_deallocate to bound the the affected area
further by the file size.
- The EINVAL case for rqsr.r_offset + rqsr.r_len > OFF_MAX is
documented.
- The fspacectl(2), vn_deallocate(9) and VOP_DEALLOCATE(9)'s return
len is explicitly documented the be the value 0, and the return offset
is restricted to be the smallest of off + len and current file size
suggested by kib@. This semantic allows callers to interact better
with potential file size growth after the call.
Sponsored by: The FreeBSD Foundation
Reviewed by: imp, kib
Differential Revision: https://reviews.freebsd.org/D31604
Allow multiple vector IOs to be started with one system call.
aio_readv() and aio_writev() already used these opcodes under the
covers. This commit makes them available to user space.
Being non-standard extensions, they're only visible if __BSD_VISIBLE is
defined, like the functions.
Reviewed by: asomers, kib
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D31627
Now that we allow recursive unmount attempts to be abandoned upon
exceeding the retry limit, we should avoid leaving an unkillable
thread when a synchronous unmount request was issued against the
base filesystem.
Reviewed by: kib (earlier revision), mkusick
Differential Revision: https://reviews.freebsd.org/D31450
A forcible unmount attempt may fail due to a transient condition, but
it may also fail due to some issue in the filesystem implementation
that will indefinitely prevent successful unmount. In such a case,
the retry logic in the recursive unmount facility will cause the
deferred unmount taskqueue to execute constantly.
Avoid this scenario by imposing a retry limit, with a default value
of 10, beyond which the recursive unmount facility will emit a log
message and give up. Additionally, introduce a grace period, with
a default value of 1s, between successive unmount retries on the
same mount.
Create a new sysctl node, vfs.deferred_unmount, to export the total
number of failed recursive unmount attempts since boot, and to allow
the retry limit and retry grace period to be tuned.
Reviewed by: kib (earlier revision), mkusick
Differential Revision: https://reviews.freebsd.org/D31450
This fixes some of the LORs seen on mount/unmount.
Complete fix will require taking care of unmount as well.
Reviewed by: kib
Tested by: pho (previous version)
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D31611
domain_init() gets reinvoked for each vnet on a system, so we must not
alter global state. Practically speaking, we were creating circular
lists and tying up a softclock thread into an infinite loop.
The breakage here was most easily observed by simply creating a jail
in a new vnet and watching the system suddenly become erratic.
Reported by: markj
Fixes: e0a17c3f06 ("uipc: create dedicated lists for fast ...")
Pointy hat: kevans
This lock no longer exists. It was removed in
a60100fdfc (if: Remove ifnet_rwlock, 2020-11-25)
Reviewed by: mjg
Pointed out by: Dheeraj Kandula <dheerajk@netapp.com>
Different Revision: https://reviews.freebsd.org/D31585
Introduce m_get3() which is similar to m_get2(), but can allocate up to
MJUM16BYTES bytes (m_get2() can only allocate up to MJUMPAGESIZE).
This simplifies the bpf improvement in f13da24715.
Suggested by: glebius
Differential Revision: https://reviews.freebsd.org/D31455
This avoids having to walk all possible protocols only to check if they
have one (vast majority does not).
Original patch by kevans@.
Reviewed by: kevans
Sponsored by: Rubicon Communications, LLC ("Netgate")
When a sigtimedwait(2) caller goes to sleep, it uses a wait channel of
p->p_sigacts with the proc lock as the interlock. However, p_sigacts
can be shared between processes if a child is created with
rfork(RFSIGSHARE | RFPROC). Thus we can end up with two threads
sleeping on the same wait channel using different locks, which is not
permitted.
Fix the problem simply by using a process-unique wait channel, following
the example of sigsuspend. The actual wait channel value is irrelevant
here, sleeping threads are awoken using sleepq_abort().
Reported by: syzbot+8c417afabadb50bb8827@syzkaller.appspotmail.com
Reported by: syzbot+1d89fc2a9ef92ef64fa8@syzkaller.appspotmail.com
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31563
TLS 1.0 empty fragment mbufs have no payload and thus m_epg_npgs is
zero. However, these mbufs need to occupy a "unit" of space for the
purposes of M_NOTREADY tracking similar to regular mbufs. Previously
this was done for the page count returned from ktls_frame() and passed
to ktls_enqueue() as well as the page count passed to pru_ready().
However, sbready() and mb_free_notready() only use m_epg_nrdy to
determine the number of "units" of space in an M_EXT mbuf, so when a
TLS 1.0 fragment was marked ready it would mark one unit of the next
mbuf in the socket buffer as ready as well. To fix, set m_epg_nrdy to
1 for empty fragments. This actually simplifies the code as now only
ktls_frame() has to handle TLS 1.0 fragments explicitly and the rest
of the KTLS functions can just use m_epg_nrdy.
Reviewed by: gallatin
MFC after: 2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D31536
I can see two concerns for adding domains after domainfinalize:
1.) The slow/fast callouts have already been setup.
2.) Userland could create a socket while we're in the middle of
initialization.
We can address #1 fairly easily by tracking whether the domain's been
initialized for at least the default vnet. There are still some concerns
about the callbacks being invoked while a vnet is in the process of
being created/destroyed, but this is a pre-existing issue that the
callbacks must coordinate anyways.
We should also address #2, but technically this has been an issue
anyways because we don't assert on post-domainfinalize additions; we
don't seem to hit it in practice.
Future work can fix that up to make sure we don't find partially
constructed domains, but care must be taken to make sure that at least,
e.g., the usages of pffindproto in ip_input.c can still find them.
Differential Revision: https://reviews.freebsd.org/D25459
This gives any given domain a chance to indicate that it's not actually
supported on the current system. If dom_probe isn't supplied, we assume
the domain is universally applicable as most of them are. Keeping
fully-initialized and registered domains around that physically can't
work on a large majority of FreeBSD deployments is sub-optimal and leads
to errors that aren't consistent with the reality of why the socket
can't be created (e.g. ESOCKTNOSUPPORT) because such scenario has to be
caught upon pru_attach, at which point kicking back the more-appropriate
EAFNOSUPPORT would seem weird.
The initial consumer of this will be hvsock, which is only available on
HyperV guests.
Reviewed by: cem (earlier version), bcr (manpages)
Differential Revision: https://reviews.freebsd.org/D25062
Set NIRES_EMPTYPATH earlies, to have use of EMPTYPATH recorded even if
we are going to return error. When namei_setup() refused to accept dirfd,
which is not of the vnode type, and indicated by ENOTDIR error return,
fall back to kern_fstat(dirfd).
Reported by: dchagin
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D31530
This implements fspacectl(2) support on shared memory objects. The
semantic of SPACECTL_DEALLOC is equivalent to clearing the backing
store and free the pages within the affected range. If the call
succeeds, subsequent reads on the affected range return all zero.
tests/sys/posixshm/posixshm_tests.c is expanded to include a
fspacectl(2) functional test.
Sponsored by: The FreeBSD Foundation
Reviewed by: kevans, kib
Differential Revision: https://reviews.freebsd.org/D31490
The addition of ioflag allows callers passing
IO_SYNC/IO_DATASYNC/IO_DIRECT down to the file system implementation.
The vop_stddeallocate fallback implementation is updated to pass the
ioflag to the file system implementation. vn_deallocate(9) internally is
also changed to pass ioflag to the VOP_DEALLOCATE call.
Sponsored by: The FreeBSD Foundation
Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D31500
Converted vn_write to use this helper.
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D31513
At least Linux x86 ABI's does not use carry bit and expects that the dx register
is preserved. For this add a new sv_set_fork_retval hook and call it from cpu_fork().
Add a short comment about touching dx in x86_set_fork_retval(), for more details
see phab comments from kib@ and imp@.
Reviewed by: kib
Differential revision: https://reviews.freebsd.org/D31472
MFC after: 2 weeks
Previously, if an encrypted netdump failed, such as due to a timeout or
network failure, the key was not saved, so a partial dump was
completely useless.
Send the key first, so the partial dump can be decrypted, because even a
partial dump can be useful.
Reviewed by: bdrewery, markj
MFC after: 1 week
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D31453
When cloning a ktls session (which is needed when we need to
switch output NICs for a NIC TLS session), we need to also
init the reset task, like we do when creating a new tls session.
Reviewed by: jhb
Sponsored by: Netflix
Make kdb_thr_first() and kdb_thr_next() return sane values if the
allproc list and pidhashtbl haven't been initialized yet. This can
happen if the debugger is entered very early on, for example with the
'-d' boot flag.
This allows remote gdb to attach at such a time, and fixes some ddb
commands like 'show threads'.
Be explicit about the static initialization of these variables. This
part has no functional change.
Reviewed by: markj, imp (previous version)
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D31495
This includes a style fix around ioflag checking as well.
Sponsored by: The FreeBSD Foundation
Reviewed by: kib, bcr
Differential Revision: https://reviews.freebsd.org/D31505
For now, just hook the allocation path: upon allocation, items are
marked as initialized (absent M_ZERO). Some zones are exempted from
this when it would otherwise raise false positives.
Use kmsan_orig() to update the origin map for UMA and malloc(9)
allocations. This allows KMSAN to print the return address when an
uninitialized UMA item is implicated in a report. For example:
panic: MSan: Uninitialized UMA memory from m_getm2+0x7fe
Sponsored by: The FreeBSD Foundation
Sanitizer instrumentation of course cannot automatically update shadow
state when devices write to host memory. KMSAN thus hooks into busdma,
both to update shadow state after a device write, and to verify that the
kernel does not publish uninitalized bytes to devices.
To implement this, when KMSAN is configured, each dmamap embeds a memory
descriptor describing the region currently loaded into the map.
bus_dmamap_sync() uses the operation flags to determine whether to
validate the loaded region or to mark it as initialized in the shadow
map.
Note that in cases where the amount of data written is less than the
buffer size, the entire buffer is marked initialized even when it is
not. For example, if a NIC writes a 128B packet into a 2KB buffer, the
entire buffer will be marked initialized, but subsequent accesses past
the first 128 bytes are likely caused by bugs.
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31338
Interrupt and exception handlers must call kmsan_intr_enter() prior to
calling any C code. This is because the KMSAN runtime maintains some
TLS in order to track initialization state of function parameters and
return values across function calls. Then, to ensure that this state is
kept consistent in the face of asynchronous kernel-mode excpeptions, the
runtime uses a stack of TLS blocks, and kmsan_intr_enter() and
kmsan_intr_leave() push and pop that stack, respectively.
Use these functions in amd64 interrupt and exception handlers. Note
that handlers for user->kernel transitions need not be annotated.
Also ensure that trap frames pushed by the CPU and by handlers are
marked as initialized before they are used.
Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31467
- During boot, allocate PDP pages for the shadow maps. The region above
KERNBASE is currently not shadowed.
- Create a dummy shadow for the vm page array. For now, this array is
not protected by the shadow map to help reduce kernel memory usage.
- Grow shadows when growing the kernel map.
- Increase the default kernel stack size when KMSAN is enabled. As with
KASAN, sanitizer instrumentation appears to create stack frames large
enough that the default value is not sufficient.
- Disable UMA's use of the direct map when KMSAN is configured. KMSAN
cannot validate the direct map.
- Disable unmapped I/O when KMSAN configured.
- Lower the limit on paging buffers when KMSAN is configured. Each
buffer has a static MAXPHYS-sized allocation of KVA, which in turn
eats 2*MAXPHYS of space in the shadow map.
Reviewed by: alc, kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31295
KMSAN enables the use of LLVM's MemorySanitizer in the kernel. This
enables precise detection of uses of uninitialized memory. As with
KASAN, this feature has substantial runtime overhead and is intended to
be used as part of some automated testing regime.
The runtime maintains a pair of shadow maps. One is used to track the
state of memory in the kernel map at bit-granularity: a bit in the
kernel map is initialized when the corresponding shadow bit is clear,
and is uninitialized otherwise. The second shadow map stores
information about the origin of uninitialized regions of the kernel map,
simplifying debugging.
KMSAN relies on being able to intercept certain functions which cannot
be instrumented by the compiler. KMSAN thus implements interceptors
which manually update shadow state and in some cases explicitly check
for uninitialized bytes. For instance, all calls to copyout() are
subject to such checks.
The runtime exports several functions which can be used to verify the
shadow map for a given buffer. Helpers provide the same functionality
for a few structures commonly used for I/O, such as CAM CCBs, BIOs and
mbufs. These are handy when debugging a KMSAN report whose
proximate and root causes are far away from each other.
Obtained from: NetBSD
Sponsored by: The FreeBSD Foundation
Some filesystems, e.g., devfs, do not populate va_birthtime in their
GETATTR implementations. To handle this, make sure that va_birthtime is
initialized to the quasi-standard value of { VNOVAL, 0 } before calling
VOP_GETATTR.
Reported by: KMSAN
Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31468
When the device name is provided, we can simply run strncmp() for each
line to quickly skip unrelated ones, that is much faster than sscanf()
and only then strcmp().
MFC after: 2 weeks
These ones were unambiguous cases where the Foundation was the only
listed copyright holder (in the associated license block).
Sponsored by: The FreeBSD Foundation
VOP_LOOKUP() is called with cn_flags bits ISLASTCN and ISOPEN
to indicate that the lookup is for the last component of a pathname
when doing open.
If the cn_flags also indicates if the open is for Reading, Writing or Both,
the NFSv4 client can do an NFSv4 Open operation in the same compound
RPC as Lookup, often avoiding the additional Open RPC now done when
VOP_OPEN() is called.
This patch defines two new cn_flags bits called OPENREAD and OPENWRITE
and sets these in open2nameif() based on FREAD, FWRITE flag bits.
This will allow a subsequent patch to the NFSv4 client to do the Open
operation in the same RPC as Lookup.
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D31431
Use the new PNOLOCK flag to tsleep() to indicate that
we are managing potential races, and don't need to
sleep with a lock, or have a backstop timeout.
Reviewed by: jhb
Sponsored by: Netflix
Add a PNOLOCK flag so that, in the race circumstance where
wakeup races are externally mitigated, tsleep() can be
called with a sleep time of 0 without triggering an
an assertion.
Reviewed by: jhb
Sponsored by: Netflix