OpenVPN Data Channel Offload (DCO) moves OpenVPN data plane processing
(i.e. tunneling and cryptography) into the kernel, rather than using tap
devices.
This avoids significant copying and context switching overhead between
kernel and user space and improves OpenVPN throughput.
In my test setup throughput improved from around 660Mbit/s to around
2Gbit/s.
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D34340
A one-to-many unix/dgram socket is a socket that has been bound
with bind(2) and can get multiple connections. A typical example
is /var/run/log bound by syslogd(8) and receiving multiple
connections from libc syslog(3) API. Until now all of these
connections shared the same receive socket buffer of the bound
socket. This made the socket vulnerable to overflow attack.
See 240d5a9b1c for a historical attempt to workaround the problem.
This commit creates a per-connection socket buffer for every single
connected socket and eliminates the problem. The new behavior will
optimize seldom writers over frequent writers. See added test case
scenarios and code comments for more detailed description of the
new behavior.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35303
o Use m_pkthdr.memlen from m_uiotombuf()
o Modify unp_internalize() to keep track of allocated space and memory
as well as pointer to the last buffer.
o Modify unp_addsockcred() to keep track of allocated space and memory
as well as pointer to the last buffer.
o Record the datagram len/memlen/ctllen in the first (from) mbuf of the
chain in uipc_sosend_dgram() and reuse it in uipc_soreceive_dgram().
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35302
Data allocated by m_uiotombuf() usually goes into a socket buffer.
We are interested in the length of useful data to be added to sb_acc,
as well as total memory used by mbufs. The later would be added to
sb_mbcnt. Calculating this value at allocation time allows to save
on extra traversal of the mbuf chain.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35301
This change fully splits away PF_UNIX/SOCK_DGRAM from other socket
buffer implementations, without any behavior changes.
Generic socket implementation is reduced down to one STAILQ and very
little code.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35300
Split struct sockbuf into common shared fields and protocol specific
union, where protocols are free to implement whatever buffer they
want. Such protocols should mark themselves with PR_SOCKBUF and are
expected to initialize their buffers in their pr_attach and tear
them down in pr_detach.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35299
Use this new version in unix/dgram socket when sending to a target
address. This removes extra lock release/acquisition and possible
counter-intuitive ENOTCONN.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35298
This allows to remove one M_NOWAIT allocation and also makes it
more clear what's going on.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35297
With this second step PF_UNIX/SOCK_DGRAM has protocol specific
implementation. This gives some possibility performance
optimizations. However, it still operates on the same struct
socket as all other sockets do.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35296
Remove the dead code. The new uipc_sosend_dgram() handles send()
on PF_UNIX/SOCK_DGRAM in full.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35294
This is first step towards splitting classic BSD socket
implementation into separate classes. The first to be
split is PF_UNIX/SOCK_DGRAM as it has most differencies
to SOCK_STREAM sockets and to PF_INET sockets.
Historically a protocol shall provide two methods for sendmsg(2):
pru_sosend and pru_send. The former is a generic send method,
e.g. sosend_generic() which would internally call the latter,
uipc_send() in our case. There is one important exception, though,
the sendfile(2) code will call pru_send directly. But sendfile
doesn't work on SOCK_DGRAM, so we can do the trick. We will create
socket class specific uipc_sosend_dgram() which will carry only
important bits from sosend_generic() and uipc_send().
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35293
Partially revert the previous change; we need to keep this method as a
specific override for pci_driver subclasses which should not use
pci_rescan_method() -- cardbus and ofw_pcibus. However, change the return
value to ENODEV for the same reasoning given in the original commit, and
use this as the default rescan method in bus_if.m.
Reported by: jhb
Fixes: 36a8572ee8 ("bus_if: provide a default null rescan method")
MFC with: 36a8572ee8
The third argument to this function indicates whether the supplied
ticker is fixed or variable, i.e. requiring calibration. Give this
argument a type and name that better conveys this purpose.
Reviewed by: kib, markj
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35459
There is an existing helper method in subr_bus.c, but almost no drivers
know to use it. It also returns the same error as an empty method,
making it not very useful. Move this to bus_if.m and return a more
sensible error code.
This gives a slightly more meaningful error message when attempting
'devctl rescan' on buses and devices alike:
"Device not configured" --> "Operation not supported by device"
Reviewed by: imp
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35501
Calculate the desired page valid mask using math that will not
overflow the types used.
Sponsored by: Netflix
Reviewed by: mckusick, kib, markj
Differential Revision: https://reviews.freebsd.org/D34837
- Remove the AIO proc zone. This zone gets one allocation per AIO
daemon process, which isn't enough to warrant a dedicated zone. Plus,
unlike other AIO structures, aiops are small (32 bytes with LP64), so
UMA doesn't provide better space efficiency than malloc(9). Change
one of the malloc types in vfs_aio.c to make it more general.
- Don't set the NOFREE flag on the other AIO zones. This flag means
that memory allocated to the AIO subsystem is never freed back to the
VM, so it's always preferable to avoid using it when possible. NOFREE
was set without explanation when AIO was converted to use UMA 20 years
ago, but it does not appear to be required; all of the structures
allocated from UMA (per-process kaioinfo, kaiocb, and aioliojob) keep
track of references and get freed only when none exist. Plus, these
structures will contain dangling pointer after they're freed (e.g.,
the "cred", "fd_file" and "uiop" fields of struct kaiocb), so
use-after-frees are dangerous even when the structures themselves are
type-stable.
Reviewed by: asomers
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35493
Add kf_pipe_buffer_[in/out/size] fields to kf_pipe, and populate them.
Add a kf_kqueue struct to the kf_un union, to allow querying kqueue state,
and populate it.
Populate the kf_sock_rcv_sb_state and kf_sock_snd_sb_state fields in
kf_sock for INET/INET6 sockets, and populate all other fields for all
transport layer protocols, not just TCP.
Bump __FreeBSD_version.
Differential revision: https://reviews.freebsd.org/D34184
Reviewed by: jhb, kib, se
MFC after: 1 week
When locking the knote list for a socket, we check whether the socket is
a listening socket in order to select the appropriate mutex; a listening
socket uses the socket lock, while data sockets use socket buffer
mutexes.
If SOLISTENING(so) is false and the knote lock routine locks a socket
buffer, then it must re-check whether the socket is a listening socket
since solisten_proto() could have changed the socket's identity while we
were blocked on the socket buffer lock.
Reported by: syzkaller
Reviewed by: glebius
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35492
When the kernel is compiled with -asan-stack=true, the address sanitizer
will emit inline accesses to the shadow map. In other words, some
shadow map accesses are not intercepted by the KASAN runtime, so they
cannot be disabled even if the runtime is not yet initialized by
kasan_init() at the end of hammer_time().
This went unnoticed because the loader will initialize all PML4 entries
of the bootstrap page table to point to the same PDP page, so early
shadow map accesses do not raise a page fault, though they are silently
corrupting memory. In fact, when the loader does not copy the staging
area, we do get a page fault since in that case only the first and last
PML4Es are populated by the loader. But due to another bug, the loader
always treated KASAN kernels as non-relocatable and thus always copied
the staging area.
It is not really practical to annotate hammer_time() and all callees
with __nosanitizeaddress, so instead add some early initialization which
creates a shadow for the boot stack used by hammer_time(). This is only
needed by KASAN, not by KMSAN, but the shared pmap code handles both.
Reported by: mhorne
Reviewed by: kib
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35449
The pointer to the mount values may be null if an error occurred while
copying them in, so fix the assertion condition to reflect that
possibility.
While here, move some initialization code into the error == 0 block. No
functional change intended.
Reported by: syzkaller
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
... rather than setting and clearing flags inline. No functional change
intended.
Reviewed by: alc, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35469
Suppose a thread tries to read from an empty pipe. pipe_read() does the
following:
1. pipelock(), possibly sleeping
2. check for buffered data
3. pipeunlock()
4. set PIPE_WANTR and sleep
5. goto 1
pipelock() is an open-coded mutex; if a thread blocks in pipelock(), it
sleeps until the lock holder calls pipeunlock().
Both sleeps use the same wait channel. So if there are multiple threads
in pipe_read(), a thread T1 in step 3 can wake up a thread T2 sleeping
in step 4. Then T1 goes to sleep in step 4, and T2 acquires and
releases the pipelock, waking up T1 again. This can go on indefinitely,
livelocking the process (and potentially starving a would-be writer).
Fix the problem by using a separate wait channel for pipelock().
Reported by: Paul Floyd <paulf2718@gmail.com>
Reviewed by: mjg, kib
PR: 264441
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D35415
This is racy because curproc process lock is not used, but allows the
process to exit faster. It is userspace issue to create such race
anyway, and not fullfilling the guarantee that all reaper descendants
are signalled should be fine.
In collaboration with: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D35310
We drop proctree_lock, which allows the process to exit while memoized
in the list to proceed.
Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D35310
Recorded reaper might loose its reaper status, so we should not assert
it, but check and avoid signalling if this happens.
Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 week
Differential revision: https://reviews.freebsd.org/D35310
The failure means that the process does single-threading itself, which
makes our action not needed.
Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D35310
If we try to single-thread a process which thread entered
procctl(REAP_KILL_SUBTREE), and sleeping waiting for us unlocking
stop_all_proc_blocker, we must be able to finish single-threading. This
requires the sleep to be interruptible.
Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D35310
markj wrote:
tdsendsignal() may unsuspend a target thread. I think there is at least
one bug there: suppose thread T is suspended in
thread_single(SINGLE_ALLPROC) when trying to kill another process with
REAP_KILL. Suppose a different thread sends SIGKILL to T->td_proc. Then,
tdsendsignal() calls thread_unsuspend(T, T->td_proc). thread_unsuspend()
incorrectly decrements T->td_proc->p_suspcount to -1.
Later, when T->td_proc exits, it will wait forever in
thread_single(SINGLE_EXIT) since T->td_proc->p_suspcount never reaches 1.
Since the thread suspension is bounded by time needed to do
thread_single(), skipping the thread_unsuspend_one() call there should
not affect signal delivery if this thread is selected as target.
Reported by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D35310
Since both self single-threading and remote single-threading rely on
suspending the thread doing thread_single(), it cannot be mixed: thread
doing thread_suspend_switch() might be subject to thread_suspend_one()
and vice versa.
In collaboration with: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D35310
suspended for SINGLE_ALLPROC mode. There is no need to check for
boundary state. It is only required to see that the suspension comes
from the ALLPROC mode.
In collaboration with: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D35310
to avoid ALLPROC mode to try to race with any other single-threading
mode.
In collaboration with: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D35310
Places that will wait for curproc->p_singlethr to become zero (in the
next commit, the counter of number of external single-threading is
to be introduced), must wait for it interruptible, otherwise we
deadlock. On the other hand, a signal delivered during this window,
if directed to the waiting thread, would cause the wait loop to become
a busy loop.
Since we are exiting, it is safe to ignore the signals.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D35310
before the process itself does thread_single(SINGLE_EXIT). We cannot
single-thread such process in ALLPROC (external) mode, and properly
detect and report the failure to do so due to the process becoming
zombie is easier to prevent than handle.
In collaboration with: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D35310
This avoids the need to drop into the ddb to figure out vnode
usage per file system. It helps to see if they are or are not
being freed. Suggestion to report active vnode count was from
kib@
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D35436
Do this by reducing the size of the MBUF_PEXT_MAX_PGS, causing "struct mbuf" to
be bigger than M_SIZE, and also add a missing padding field to ensure 64-bit
alignment.
Reviewed by: gallatin@
Reported by: Elliott Mitchell
Differential revision: https://reviews.freebsd.org/D35339
MFC after: 1 week
Sponsored by: NVIDIA Networking
Basic TLS RX offloading uses the "csum_flags" field in the mbuf packet
header to figure out if an incoming mbuf has been fully offloaded or
not. This information follows the packet stream via the LRO engine, IP
stack and finally to the TCP stack. The TCP stack preserves the mbuf
packet header also when re-assembling packets after packet loss. When
the mbuf goes into the socket buffer the packet header is demoted and
the offload information is transferred to "m_flags" . Later on a
worker thread will analyze the mbuf flags and decide if the mbufs
making up a TLS record indicate a fully-, partially- or not decrypted
TLS record. Based on these three cases the worker thread will either
pass the packet on as-is or recrypt the decrypted bits, if any, or
decrypt the packet as usual.
During packet loss the kernel TLS code will call back into the network
driver using the send tag, informing about the TCP starting sequence
number of every TLS record that is not fully decrypted by the network
interface. The network interface then stores this information in a
compressed table and starts asking the hardware if it has found a
valid TLS header in the TCP data payload. If the hardware has found a
valid TLS header and the referred TLS header is at a valid TCP
sequence number according to the TCP sequence numbers provided by the
kernel TLS code, the network driver then informs the hardware that it
can resume decryption.
Care has been taken to not merge encrypted and decrypted mbuf chains,
in the LRO engine and when appending mbufs to the socket buffer.
The mbuf's leaf network interface pointer is used to figure out from
which network interface the offloading rule should be allocated. Also
this pointer is used to track route changes.
Currently mbuf send tags are used in both transmit and receive
direction, due to convenience, but may get a new name in the future to
better reflect their usage.
Reviewed by: jhb@ and gallatin@
Differential revision: https://reviews.freebsd.org/D32356
Sponsored by: NVIDIA Networking
So that the asserts and the actual code see the same values.
Differential revision: https://reviews.freebsd.org/D32356
MFC after: 1 week
Sponsored by: NVIDIA Networking
When packets are received they may traverse several network interfaces like
vlan(4) and lagg(9). When doing receive side offloads it is important to
know the first network interface entry point, because that is where all
offloading is taking place. This makes it possible to track receive
side route changes for multiport setups, for example when lagg(9) receives
traffic from more than one port. This avoids having to install multiple
offloading rules for the same stream.
This field works similar to the existing "rcvif" mbuf packet header field.
Submitted by: jhb@
Reviewed by: gallatin@ and gnn@
Differential revision: https://reviews.freebsd.org/D35339
Sponsored by: NVIDIA Networking
Sponsored by: Netflix
Make it a complex, but a single for(;;) statement. The previous cycle
with some loop logic in the beginning and some loop logic at the end
was confusing. Both me and markj@ were misleaded to a conclusion that
some checks are unnecessary, while they actually were necessary.
While here, handle an edge case found by Mark, when on 64-bit platform
an incorrect message from userland would underflow length counter, but
return without any error. Provide a test case for such message.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D35375