freebsd-dev

Author	SHA1	Message	Date
Mark Johnston	bf25678226	ktls: Fix error/mode confusion in TCP_*TLS_MODE getsockopt handlers ktls_get_(rx\|tx)_mode() can return an errno value or a TLS mode, so errors are effectively hidden. Fix this by using a separate output parameter. Convert to the new socket buffer locking macros while here. Note that the socket buffer lock is not needed to synchronize the SOLISTENING check here, we can rely on the PCB lock. Reviewed by: jhb Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31977	2021-09-17 14:19:05 -04:00
John Baldwin	c782ea8bb5	Add a switch structure for send tags. Move the type and function pointers for operations on existing send tags (modify, query, next, free) out of 'struct ifnet' and into a new 'struct if_snd_tag_sw'. A pointer to this structure is added to the generic part of send tags and is initialized by m_snd_tag_init() (which now accepts a switch structure as a new argument in place of the type). Previously, device driver ifnet methods switched on the type to call type-specific functions. Now, those type-specific functions are saved in the switch structure and invoked directly. In addition, this more gracefully permits multiple implementations of the same tag within a driver. In particular, NIC TLS for future Chelsio adapters will use a different implementation than the existing NIC TLS support for T6 adapters. Reviewed by: gallatin, hselasky, kib (older version) Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D31572	2021-09-14 11:43:41 -07:00
Mark Johnston	f94acf52a4	socket: Rename sb(un)lock() and interlock with listen(2) In preparation for moving sockbuf locks into the containing socket, provide alternative macros for the sockbuf I/O locks: SOCK_IO_SEND_(UN)LOCK() and SOCK_IO_RECV_(UN)LOCK(). These operate on a socket rather than a socket buffer. Note that these locks are used only to prevent concurrent readers and writters from interleaving I/O. When locking for I/O, return an error if the socket is a listening socket. Currently the check is racy since the sockbuf sx locks are destroyed during the transition to a listening socket, but that will no longer be true after some follow-up changes. Modify a few places to check for errors from sblock()/SOCK_IO_(SEND\|RECV)_LOCK() where they were not before. In particular, add checks to sendfile() and sorflush(). Reviewed by: tuexen, gallatin MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31657	2021-09-07 15:06:48 -04:00
John Baldwin	470e851c4b	ktls: Support asynchronous dispatch of AEAD ciphers. KTLS OCF support was originally targeted at software backends that used host CPU cycles to encrypt TLS records. As a result, each KTLS worker thread queued a single TLS record at a time and waited for it to be encrypted before processing another TLS record. This works well for software backends but limits throughput on OCF drivers for coprocessors that support asynchronous operation such as qat(4) or ccr(4). This change uses an alternate function (ktls_encrypt_async) when encrypt TLS records via a coprocessor. This function queues TLS records for encryption and returns. It defers the work done after a TLS record has been encrypted (such as marking the mbufs ready) to a callback invoked asynchronously by the coprocessor driver when a record has been encrypted. - Add a struct ktls_ocf_state that holds the per-request state stored on the stack for synchronous requests. Asynchronous requests malloc this structure while synchronous requests continue to allocate this structure on the stack. - Add a ktls_encrypt_async() variant of ktls_encrypt() which does not perform request completion after dispatching a request to OCF. Instead, the ktls_ocf backends invoke ktls_encrypt_cb() when a TLS record request completes for an asynchronous request. - Flag AEAD software TLS sessions as async if the backend driver selected by OCF is an async driver. - Pull code to create and dispatch an OCF request out of ktls_encrypt() into a new ktls_encrypt_one() function used by both ktls_encrypt() and ktls_encrypt_async(). - Pull code to "finish" the VM page shuffling for a file-backed TLS record into a helper function ktls_finish_noanon() used by both ktls_encrypt() and ktls_encrypt_cb(). Reviewed by: markj Tested on: ccr(4) (jhb), qat(4) (markj) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D31665	2021-08-30 13:11:52 -07:00
John Baldwin	d16cb228c1	ktls: Fix accounting for TLS 1.0 empty fragments. TLS 1.0 empty fragment mbufs have no payload and thus m_epg_npgs is zero. However, these mbufs need to occupy a "unit" of space for the purposes of M_NOTREADY tracking similar to regular mbufs. Previously this was done for the page count returned from ktls_frame() and passed to ktls_enqueue() as well as the page count passed to pru_ready(). However, sbready() and mb_free_notready() only use m_epg_nrdy to determine the number of "units" of space in an M_EXT mbuf, so when a TLS 1.0 fragment was marked ready it would mark one unit of the next mbuf in the socket buffer as ready as well. To fix, set m_epg_nrdy to 1 for empty fragments. This actually simplifies the code as now only ktls_frame() has to handle TLS 1.0 fragments explicitly and the rest of the KTLS functions can just use m_epg_nrdy. Reviewed by: gallatin MFC after: 2 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D31536	2021-08-16 10:42:46 -07:00
Andrew Gallatin	95c51fafa4	ktls: Init reset tag task for cloned sessions When cloning a ktls session (which is needed when we need to switch output NICs for a NIC TLS session), we need to also init the reset task, like we do when creating a new tls session. Reviewed by: jhb Sponsored by: Netflix	2021-08-11 14:06:43 -04:00
Andrew Gallatin	09066b9866	ktls: Use the new PNOLOCK flag Use the new PNOLOCK flag to tsleep() to indicate that we are managing potential races, and don't need to sleep with a lock, or have a backstop timeout. Reviewed by: jhb Sponsored by: Netflix	2021-08-05 17:19:12 -04:00
Andrew Gallatin	2694c869ff	ktls: fix a panic with INVARIANTS `98215005b7` introduced a new thread that uses tsleep(..0) to sleep forever. This hit an assert due to sleeping with a 0 timeout. So spell "forever" using SBT_MAX instead, which does not trigger the assert. Pointy hat to: gallatin Pointed out by: emaste Sponsored by: Netflix	2021-08-05 13:09:06 -04:00
Andrew Gallatin	98215005b7	ktls: start a thread to keep the 16k ktls buffer zone populated Ktls recently received an optimization where we allocate 16k physically contiguous crypto destination buffers. This provides a large (more than 5%) reduction in CPU use in our workload. However, after several days of uptime, the performance benefit disappears because we have frequent allocation failures from the ktls buffer zone. It turns out that when load drops off, the ktls buffer zone is trimmed, and some 16k buffers are freed back to the OS. When load picks back up again, re-allocating those 16k buffers fails after some number of days of uptime because physical memory has become fragmented. This causes allocations to fail, because they are intentionally done without M_NORECLAIM, so as to avoid pausing the ktls crytpo work thread while the VM system defragments memory. To work around this, this change starts one thread per VM domain to allocate ktls buffers with M_NORECLAIM, as we don't care if this thread is paused while memory is defragged. The thread then frees the buffers back into the ktls buffer zone, thus allowing future allocations to succeed. Note that waking up the thread is intentionally racy, but neither of the races really matter. In the worst case, we could have either spurious wakeups or we could have to wait 1 second until the next rate-limited allocation failure to wake up the thread. This patch has been in use at Netflix on a handful of servers, and seems to fix the issue. Differential Revision: https://reviews.freebsd.org/D31260 Reviewed by: jhb, markj, (jtl, rrs, and dhw reviewed earlier version) Sponsored by: Netflix	2021-08-05 10:19:12 -04:00
Andrew Gallatin	4150a5a87e	ktls: fix NOINET build Reported by: mjguzik Sponsored by: Netflix	2021-07-07 10:40:02 -04:00
Andrew Gallatin	28d0a740dd	ktls: auto-disable ifnet (inline hw) kTLS Ifnet (inline) hw kTLS NICs typically keep state within a TLS record, so that when transmitting in-order, they can continue encryption on each segment sent without DMA'ing extra state from the host. This breaks down when transmits are out of order (eg, TCP retransmits). In this case, the NIC must re-DMA the entire TLS record up to and including the segment being retransmitted. This means that when re-transmitting the last 1448 byte segment of a TLS record, the NIC will have to re-DMA the entire 16KB TLS record. This can lead to the NIC running out of PCIe bus bandwidth well before it saturates the network link if a lot of TCP connections have a high retransmoit rate. This change introduces a new sysctl (kern.ipc.tls.ifnet_max_rexmit_pct), where TCP connections with higher retransmit rate will be switched to SW kTLS so as to conserve PCIe bandwidth. Reviewed by: hselasky, markj, rrs Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D30908	2021-07-06 10:28:32 -04:00
Mateusz Guzik	904a08f342	ktls: switch bare zone_mbuf use to m_free_raw Reviewed by: gallatin Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D30955	2021-07-02 08:30:22 +00:00
John Baldwin	faf0224ff2	ktls: Don't mark existing received mbufs notready for TOE TLS. The TOE driver might receive decrypted TLS records that are enqueued to the socket buffer after ktls_try_toe() returns and before ktls_enable_rx() locks the receive buffer to call sb_mark_notready(). In that case, sb_mark_notready() would incorrectly treat the decrypted TLS record as an encrypted record and schedule it for decryption. This always resulted in the connection being dropped as the data in the control message did not look like a valid TLS header. To fix, don't try to handle software decryption of existing buffers in the socket buffer for TOE TLS in ktls_enable_rx(). If a TOE TLS driver needs to decrypt existing data in the socket buffer, the driver will need to manage that in its tod_alloc_tls_session method. Sponsored by: Chelsio Communications	2021-06-15 17:45:21 -07:00
Andrew Gallatin	ed5e13cfc2	ktls: Fix interaction with RATELIMIT uipc_ktls.c was missing opt_ratelimit.h, so it was never noticing that RATELIMIT was enabled. Once it was enabled, it failed to compile as ktls_modify_txrtlmt() had accrued a compilation error when it was not being compiled in. Sponsored by: Netflix	2021-06-14 10:51:16 -04:00
John Baldwin	6b313a3a60	Include the trailer in the original dst_iov. This avoids creating a duplicate copy on the stack just to append the trailer. Reviewed by: gallatin, markj Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D30139	2021-05-25 16:59:19 -07:00
John Baldwin	21e3c1fbe2	Assume OCF is the only KTLS software backend. This removes support for loadable software backends. The KTLS OCF support is now always included in kernels with KERN_TLS and the ktls_ocf.ko module has been removed. The software encryption routines now take an mbuf directly and use the TLS mbuf as the crypto buffer when possible. Bump __FreeBSD_version for software backends in ports. Reviewed by: gallatin, markj Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D30138	2021-05-25 16:59:19 -07:00
Mark Johnston	89b650872b	ktls: Hide initialization message behind bootverbose We don't typically print anything when a subsystem initializes itself, and KTLS is currently disabled by default anyway. Reviewed by: jhb MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D29097	2021-03-05 13:11:02 -05:00
Mark Johnston	49f6925ca3	ktls: Cache output buffers for software encryption Maintain a cache of physically contiguous runs of pages for use as output buffers when software encryption is configured and in-place encryption is not possible. This makes allocation and free cheaper since in the common case we avoid touching the vm_page structures for the buffer, and fewer calls into UMA are needed. gallatin@ reports a ~10% absolute decrease in CPU usage with sendfile/KTLS on a Xeon after this change. It is possible that we will not be able to allocate these buffers if physical memory is fragmented. To avoid frequently calling into the physical memory allocator in this scenario, rate-limit allocation attempts after a failure. In the failure case we fall back to the old behaviour of allocating a page at a time. N.B.: this scheme could be simplified, either by simply using malloc() and looking up the PAs of the pages backing the buffer, or by falling back to page by page allocation and creating a mapping in the cache zone. This requires some way to save a mapping of an M_EXTPG page array in the mbuf, though. m_data is not really appropriate. The second approach may be possible by saving the mapping in the plinks union of the first vm_page structure of the array, but this would force a vm_page access when freeing an mbuf. Reviewed by: gallatin, jhb Tested by: gallatin Sponsored by: Ampere Computing Submitted by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D28556	2021-03-03 17:34:01 -05:00
John Baldwin	90972f0402	ktls: Use COUNTER_U64_DEFINE_EARLY for the ktls_toe_chacha20 counter. I missed updating this counter when rebasing the changes in `9c64fc4029` after the switch to COUNTER_U64_DEFINE_EARLY in `1755b2b989`. Fixes: `9c64fc4029` Add Chacha20-Poly1305 as a KTLS cipher suite. Sponsored by: Netflix	2021-02-25 15:00:13 -08:00
John Baldwin	9c64fc4029	Add Chacha20-Poly1305 as a KTLS cipher suite. Chacha20-Poly1305 for TLS is an AEAD cipher suite for both TLS 1.2 and TLS 1.3 (RFCs 7905 and 8446). For both versions, Chacha20 uses the server and client IVs as implicit nonces xored with the record sequence number to generate the per-record nonce matching the construction used with AES-GCM for TLS 1.3. Reviewed by: gallatin Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D27839	2021-02-18 09:26:32 -08:00
Mark Johnston	b5aa9ad43a	ktls: Make configuration sysctls available as tunables Reviewed by: gallatin, jhb Sponsored by: Ampere Computing Submitted by: Klara, Inc. MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D28499	2021-02-08 09:19:02 -05:00
Mark Johnston	1755b2b989	ktls: Use COUNTER_U64_DEFINE_EARLY This makes it a bit more straightforward to add new counters when debugging. No functional change intended. Reviewed by: jhb Sponsored by: Ampere Computing Submitted by: Klara, Inc. MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D28498	2021-02-08 09:18:51 -05:00
Gleb Smirnoff	3f43ada98c	Catch up with `6edfd179c8`: mechanically rename IFCAP_NOMAP to IFCAP_MEXTPG. Originally IFCAP_NOMAP meant that the mbuf has external storage pointer that points to unmapped address. Then, this was extended to array of such pointers. Then, such mbufs were augmented with header/trailer. Basically, extended mbufs are extended, and set of features is subject to change. The new name should be generic enough to avoid further renaming.	2021-01-29 11:46:24 -08:00
Mark Johnston	4dc1b17dbb	ktls: Improve handling of the bind_threads tunable a bit - Only check for empty domains if we actually tried to configure domain affinity in the first place. Otherwise setting bind_threads=1 will always cause the sysctl value to be reported as zero. This is harmless since the threads end up being bound, but it's confusing. - Try to improve the sysctl description a bit. Reviewed by: gallatin, jhb Submitted by: Klara, Inc. Sponsored by: Ampere Computing Differential Revision: https://reviews.freebsd.org/D28161	2021-01-19 21:32:33 -05:00
Michael Tuexen	6685e259e3	tcp: don't use KTLS socket option on listening sockets KTLS socket options make use of socket buffers, which are not available for listening sockets. Reported by: syzbot+a8829e888a93a4a04619@syzkaller.appspotmail.com Reviewed by: jhb@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D27948	2021-01-08 08:57:11 +01:00
Andrew Gallatin	02bc3865aa	Optionally bind ktls threads to NUMA domains When ktls_bind_thread is 2, we pick a ktls worker thread that is bound to the same domain as the TCP connection associated with the socket. We use roughly the same code as netinet/tcp_hpts.c to do this. This allows crypto to run on the same domain as the TCP connection is associated with. Assuming TCP_REUSPORT_LB_NUMA (D21636) is in place & in use, this ensures that the crypto source and destination buffers are local to the same NUMA domain as we're running crypto on. This change (when TCP_REUSPORT_LB_NUMA, D21636, is used) reduces cross-domain traffic from over 37% down to about 13% as measured by pcm.x on a dual-socket Xeon using nginx and a Netflix workload. Reviewed by: jhb Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21648	2020-12-19 21:46:09 +00:00
John Baldwin	36e0a362ac	Add m_snd_tag_alloc() as a wrapper around if_snd_tag_alloc(). This gives a more uniform API for send tag life cycle management. Reviewed by: gallatin, hselasky Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D27000	2020-10-29 23:28:39 +00:00
John Baldwin	521eac97f3	Support hardware rate limiting (pacing) with TLS offload. - Add a new send tag type for a send tag that supports both rate limiting (packet pacing) and TLS offload (mostly similar to D22669 but adds a separate structure when allocating the new tag type). - When allocating a send tag for TLS offload, check to see if the connection already has a pacing rate. If so, allocate a tag that supports both rate limiting and TLS offload rather than a plain TLS offload tag. - When setting an initial rate on an existing ifnet KTLS connection, set the rate in the TCP control block inp and then reset the TLS send tag (via ktls_output_eagain) to reallocate a TLS + ratelimit send tag. This allocates the TLS send tag asynchronously from a task queue, so the TLS rate limit tag alloc is always sleepable. - When modifying a rate on a connection using KTLS, look for a TLS send tag. If the send tag is only a plain TLS send tag, assume we failed to allocate a TLS ratelimit tag (either during the TCP_TXTLS_ENABLE socket option, or during the send tag reset triggered by ktls_output_eagain) and ignore the new rate. If the send tag is a ratelimit TLS send tag, change the rate on the TLS tag and leave the inp tag alone. - Lock the inp lock when setting sb_tls_info for a socket send buffer so that the routines in tcp_ratelimit can safely dereference the pointer without needing to grab the socket buffer lock. - Add an IFCAP_TXTLS_RTLMT capability flag and associated administrative controls in ifconfig(8). TLS rate limit tags are only allocated if this capability is enabled. Note that TLS offload (whether unlimited or rate limited) always requires IFCAP_TXTLS[46]. Reviewed by: gallatin, hselasky Relnotes: yes Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D26691	2020-10-29 00:23:16 +00:00
John Baldwin	6bcf3c46d8	Check TF_TOE not the tod pointer to determine if TOE is active. The TF_TOE flag is the check used in the rest of the network stack to determine if TOE is active on a socket. There is at least one path in the cxgbe(4) TOE driver that can leave the tod pointer non-NULL on a socket not using TOE. Reported by: Sony Arpita Das <sonyarpitad@chelsio.com> Reviewed by: np Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D26803	2020-10-19 18:24:06 +00:00
John Baldwin	c2a8fd6f05	Permit sending empty fragments for TLS 1.0. Due to a weakness in the TLS 1.0 protocol, OpenSSL will periodically send empty TLS records ("empty fragments"). These TLS records have no payload (and thus a page count of zero). m_uiotombuf_nomap() was returning NULL instead of an empty mbuf, and a few places needed to be updated to treat an empty TLS record as having a page count of "1" as 0 means "no work to do" (e.g. nothing to encrypt, or nothing to mark ready via sbready()). Reviewed by: gallatin Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D26729	2020-10-13 17:30:34 +00:00
Bjoern A. Zeeb	d29a3de296	uipc_ktls: remove unused static function m_segments() was added with r363464 but never used. Remove it to avoid warnings when compiling kernels. Reported by: rmacklem (also says jhb) Reviewed by: gallatin, jhb Differential Revision: https://reviews.freebsd.org/D26330	2020-09-05 00:19:40 +00:00
Andrew Gallatin	9675d8895a	ktls: Check for a NULL send tag in ktls_cleanup() When using ifnet ktls, and when ktls_reset_send_tag() fails to allocate a replacement tag, it leaves the tls session's snd_tag pointer NULL. ktls_cleanup() tries to release the send tag, and will trip over this NULL pointer and panic unless NULL is checked for. Reviewed by: jhb Sponsored by: Netflix	2020-09-04 17:36:15 +00:00
John Baldwin	3c0e568505	Add support for KTLS RX via software decryption. Allow TLS records to be decrypted in the kernel after being received by a NIC. At a high level this is somewhat similar to software KTLS for the transmit path except in reverse. Protocols enqueue mbufs containing encrypted TLS records (or portions of records) into the tail of a socket buffer and the KTLS layer decrypts those records before returning them to userland applications. However, there is an important difference: - In the transmit case, the socket buffer is always a single "record" holding a chain of mbufs. Not-yet-encrypted mbufs are marked not ready (M_NOTREADY) and released to protocols for transmit by marking mbufs ready once their data is encrypted. - In the receive case, incoming (encrypted) data appended to the socket buffer is still a single stream of data from the protocol, but decrypted TLS records are stored as separate records in the socket buffer and read individually via recvmsg(). Initially I tried to make this work by marking incoming mbufs as M_NOTREADY, but there didn't seemed to be a non-gross way to deal with picking a portion of the mbuf chain and turning it into a new record in the socket buffer after decrypting the TLS record it contained (along with prepending a control message). Also, such mbufs would also need to be "pinned" in some way while they are being decrypted such that a concurrent sbcut() wouldn't free them out from under the thread performing decryption. As such, I settled on the following solution: - Socket buffers now contain an additional chain of mbufs (sb_mtls, sb_mtlstail, and sb_tlscc) containing encrypted mbufs appended by the protocol layer. These mbufs are still marked M_NOTREADY, but soreceive*() generally don't know about them (except that they will block waiting for data to be decrypted for a blocking read). - Each time a new mbuf is appended to this TLS mbuf chain, the socket buffer peeks at the TLS record header at the head of the chain to determine the encrypted record's length. If enough data is queued for the TLS record, the socket is placed on a per-CPU TLS workqueue (reusing the existing KTLS workqueues and worker threads). - The worker thread loops over the TLS mbuf chain decrypting records until it runs out of data. Each record is detached from the TLS mbuf chain while it is being decrypted to keep the mbufs "pinned". However, a new sb_dtlscc field tracks the character count of the detached record and sbcut()/sbdrop() is updated to account for the detached record. After the record is decrypted, the worker thread first checks to see if sbcut() dropped the record. If so, it is freed (can happen when a socket is closed with pending data). Otherwise, the header and trailer are stripped from the original mbufs, a control message is created holding the decrypted TLS header, and the decrypted TLS record is appended to the "normal" socket buffer chain. (Side note: the SBCHECK() infrastucture was very useful as I was able to add assertions there about the TLS chain that caught several bugs during development.) Tested by: rmacklem (various versions) Relnotes: yes Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D24628	2020-07-23 23:48:18 +00:00
John Baldwin	4a711b8d04	Use zfree() instead of explicit_bzero() and free(). In addition to reducing lines of code, this also ensures that the full allocation is always zeroed avoiding possible bugs with incorrect lengths passed to explicit_bzero(). Suggested by: cem Reviewed by: cem, delphij Approved by: csprng (cem) Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D25435	2020-06-25 20:17:34 +00:00
Hans Petter Selasky	4f3c0f3d1e	Fix build issue after r360292 when using both RSS and KERN_TLS options. Sponsored by: Mellanox Technologies	2020-05-26 08:25:24 +00:00
Gleb Smirnoff	6edfd179c8	Step 4.1: mechanically rename M_NOMAP to M_EXTPG Reviewed by: gallatin Differential Revision: https://reviews.freebsd.org/D24598	2020-05-03 00:21:11 +00:00
Gleb Smirnoff	7b6c99d08d	Step 3: anonymize struct mbuf_ext_pgs and move all its fields into mbuf within m_epg namespace. All edits except the 'struct mbuf' declaration and mb_dupcl() were done mechanically with sed: s/->m_ext_pgs.nrdy/->m_epg_nrdy/g s/->m_ext_pgs.hdr_len/->m_epg_hdrlen/g s/->m_ext_pgs.trail_len/->m_epg_trllen/g s/->m_ext_pgs.first_pg_off/->m_epg_1st_off/g s/->m_ext_pgs.last_pg_len/->m_epg_last_len/g s/->m_ext_pgs.flags/->m_epg_flags/g s/->m_ext_pgs.record_type/->m_epg_record_type/g s/->m_ext_pgs.enc_cnt/->m_epg_enc_cnt/g s/->m_ext_pgs.tls/->m_epg_tls/g s/->m_ext_pgs.so/->m_epg_so/g s/->m_ext_pgs.seqno/->m_epg_seqno/g s/->m_ext_pgs.stailq/->m_epg_stailq/g Reviewed by: gallatin Differential Revision: https://reviews.freebsd.org/D24598	2020-05-03 00:12:56 +00:00
Gleb Smirnoff	bccf6e26e9	Step 2.5: Stop using 'struct mbuf_ext_pgs' in the kernel itself. Reviewed by: gallatin Differential Revision: https://reviews.freebsd.org/D24598	2020-05-03 00:08:05 +00:00
Gleb Smirnoff	c4ee38f8e8	Step 2.3: Rename mbuf_ext_pg_len() to m_epg_pagelen() that uses mbuf argument. Reviewed by: gallatin Differential Revision: https://reviews.freebsd.org/D24598	2020-05-02 23:52:35 +00:00
Gleb Smirnoff	d90fe9d0cd	Step 2.1: Build TLS workqueue from mbufs, not struct mbuf_ext_pgs. Reviewed by: gallatin Differential Revision: https://reviews.freebsd.org/D24598	2020-05-02 23:38:13 +00:00
Gleb Smirnoff	eeec834855	Get rid of the mbuf self-pointing pointer. Reviewed by: gallatin Differential Revision: https://reviews.freebsd.org/D24598	2020-05-02 22:56:22 +00:00
Gleb Smirnoff	7433a5a966	Start moving into EPG_/epg_ namespace. There is only one flag, but next commit brings in second flag, so let them already be in the future namespace. Reviewed by: gallatin Differential Revision: https://reviews.freebsd.org/D24598	2020-05-02 22:49:14 +00:00
Gleb Smirnoff	0c1032665c	Continuation of multi page mbuf redesign from r359919. The following series of patches addresses three things: Now that array of pages is embedded into mbuf, we no longer need separate structure to pass around, so struct mbuf_ext_pgs is an artifact of the first implementation. And struct mbuf_ext_pgs_data is a crutch to accomodate the main idea r359919 with minimal churn. Also, M_EXT of type EXT_PGS are just a synonym of M_NOMAP. The namespace for the newfeature is somewhat inconsistent and sometimes has a lengthy prefixes. In these patches we will gradually bring the namespace to "m_epg" prefix for all mbuf fields and most functions. Step 1 of 4: o Anonymize mbuf_ext_pgs_data, embed in m_ext o Embed mbuf_ext_pgs o Start documenting all this entanglement Reviewed by: gallatin Differential Revision: https://reviews.freebsd.org/D24598	2020-05-02 22:39:26 +00:00
John Baldwin	f1f9347546	Initial support for kernel offload of TLS receive. - Add a new TCP_RXTLS_ENABLE socket option to set the encryption and authentication algorithms and keys as well as the initial sequence number. - When reading from a socket using KTLS receive, applications must use recvmsg(). Each successful call to recvmsg() will return a single TLS record. A new TCP control message, TLS_GET_RECORD, will contain the TLS record header of the decrypted record. The regular message buffer passed to recvmsg() will receive the decrypted payload. This is similar to the interface used by Linux's KTLS RX except that Linux does not return the full TLS header in the control message. - Add plumbing to the TOE KTLS interface to request either transmit or receive KTLS sessions. - When a socket is using receive KTLS, redirect reads from soreceive_stream() into soreceive_generic(). - Note that this interface is currently only defined for TLS 1.1 and 1.2, though I believe we will be able to reuse the same interface and structures for 1.3.	2020-04-27 23:17:19 +00:00
John Baldwin	ec1db6e13d	Add the initial sequence number to the TLS enable socket option. This will be needed for KTLS RX. Reviewed by: gallatin Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D24451	2020-04-27 22:31:42 +00:00
Alexander V. Chernikov	454d389645	Fix LINT build #2 after r360292. Pointyhat to: melifaro	2020-04-25 11:35:38 +00:00
Alexander V. Chernikov	983066f05b	Convert route caching to nexthop caching. This change is build on top of nexthop objects introduced in r359823. Nexthops are separate datastructures, containing all necessary information to perform packet forwarding such as gateway interface and mtu. Nexthops are shared among the routes, providing more pre-computed cache-efficient data while requiring less memory. Splitting the LPM code and the attached data solves multiple long-standing problems in the routing layer, drastically reduces the coupling with outher parts of the stack and allows to transparently introduce faster lookup algorithms. Route caching was (re)introduced to minimise (slow) routing lookups, allowing for notably better performance for large TCP senders. Caching works by acquiring rtentry reference, which is protected by per-rtentry mutex. If the routing table is changed (checked by comparing the rtable generation id) or link goes down, cache record gets withdrawn. Nexthops have the same reference counting interface, backed by refcount(9). This change merely replaces rtentry with the actual forwarding nextop as a cached object, which is mostly mechanical. Other moving parts like cache cleanup on rtable change remains the same. Differential Revision: https://reviews.freebsd.org/D24340	2020-04-25 09:06:11 +00:00
Andrew Gallatin	23feb56348	KTLS: Re-work unmapped mbufs to carry ext_pgs in the mbuf itself. While the original implementation of unmapped mbufs was a large step forward in terms of reducing cache misses by enabling mbufs to carry more than a single page for sendfile, they are rather cache unfriendly when accessing the ext_pgs metadata and data. This is because the ext_pgs part of the mbuf is allocated separately, and almost guaranteed to be cold in cache. This change takes advantage of the fact that unmapped mbufs are never used at the same time as pkthdr mbufs. Given this fact, we can overlap the ext_pgs metadata with the mbuf pkthdr, and carry the ext_pgs meta directly in the mbuf itself. Similarly, we can carry the ext_pgs data (TLS hdr/trailer/array of pages) directly after the existing m_ext. In order to be able to carry 5 pages (which is the minimum required for a 16K TLS record which is not perfectly aligned) on LP64, I've had to steal ext_arg2. The only user of this in the xmit path is sendfile, and I've adjusted it to use arg1 when using unmapped mbufs. This change is almost entirely mechanical, except that we change mb_alloc_ext_pgs() to no longer allow allocating pkthdrs, the change to avoid ext_arg2 as mentioned above, and the removal of the ext_pgs zone, This change saves roughly 2% "raw" CPU (~59% -> 57%), or over 3% "scaled" CPU on a Netflix 100% software kTLS workload at 90+ Gb/s on Broadwell Xeons. In a follow-on commit, I plan to remove some hacks to avoid access ext_pgs fields of mbufs, since they will now be in cache. Many thanks to glebius for helping to make this better in the Netflix tree. Reviewed by: hselasky, jhb, rrs, glebius (early version) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D24213	2020-04-14 14:46:06 +00:00
John Baldwin	c034143269	Refactor driver and consumer interfaces for OCF (in-kernel crypto). - The linked list of cryptoini structures used in session initialization is replaced with a new flat structure: struct crypto_session_params. This session includes a new mode to define how the other fields should be interpreted. Available modes include: - COMPRESS (for compression/decompression) - CIPHER (for simply encryption/decryption) - DIGEST (computing and verifying digests) - AEAD (combined auth and encryption such as AES-GCM and AES-CCM) - ETA (combined auth and encryption using encrypt-then-authenticate) Additional modes could be added in the future (e.g. if we wanted to support TLS MtE for AES-CBC in the kernel we could add a new mode for that. TLS modes might also affect how AAD is interpreted, etc.) The flat structure also includes the key lengths and algorithms as before. However, code doesn't have to walk the linked list and switch on the algorithm to determine which key is the auth key vs encryption key. The 'csp_auth_' fields are always used for auth keys and settings and 'csp_cipher_' for cipher. (Compression algorithms are stored in csp_cipher_alg.) - Drivers no longer register a list of supported algorithms. This doesn't quite work when you factor in modes (e.g. a driver might support both AES-CBC and SHA2-256-HMAC separately but not combined for ETA). Instead, a new 'crypto_probesession' method has been added to the kobj interface for symmteric crypto drivers. This method returns a negative value on success (similar to how device_probe works) and the crypto framework uses this value to pick the "best" driver. There are three constants for hardware (e.g. ccr), accelerated software (e.g. aesni), and plain software (cryptosoft) that give preference in that order. One effect of this is that if you request only hardware when creating a new session, you will no longer get a session using accelerated software. Another effect is that the default setting to disallow software crypto via /dev/crypto now disables accelerated software. Once a driver is chosen, 'crypto_newsession' is invoked as before. - Crypto operations are now solely described by the flat 'cryptop' structure. The linked list of descriptors has been removed. A separate enum has been added to describe the type of data buffer in use instead of using CRYPTO_F_* flags to make it easier to add more types in the future if needed (e.g. wired userspace buffers for zero-copy). It will also make it easier to re-introduce separate input and output buffers (in-kernel TLS would benefit from this). Try to make the flags related to IV handling less insane: - CRYPTO_F_IV_SEPARATE means that the IV is stored in the 'crp_iv' member of the operation structure. If this flag is not set, the IV is stored in the data buffer at the 'crp_iv_start' offset. - CRYPTO_F_IV_GENERATE means that a random IV should be generated and stored into the data buffer. This cannot be used with CRYPTO_F_IV_SEPARATE. If a consumer wants to deal with explicit vs implicit IVs, etc. it can always generate the IV however it needs and store partial IVs in the buffer and the full IV/nonce in crp_iv and set CRYPTO_F_IV_SEPARATE. The layout of the buffer is now described via fields in cryptop. crp_aad_start and crp_aad_length define the boundaries of any AAD. Previously with GCM and CCM you defined an auth crd with this range, but for ETA your auth crd had to span both the AAD and plaintext (and they had to be adjacent). crp_payload_start and crp_payload_length define the boundaries of the plaintext/ciphertext. Modes that only do a single operation (COMPRESS, CIPHER, DIGEST) should only use this region and leave the AAD region empty. If a digest is present (or should be generated), it's starting location is marked by crp_digest_start. Instead of using the CRD_F_ENCRYPT flag to determine the direction of the operation, cryptop now includes an 'op' field defining the operation to perform. For digests I've added a new VERIFY digest mode which assumes a digest is present in the input and fails the request with EBADMSG if it doesn't match the internally-computed digest. GCM and CCM already assumed this, and the new AEAD mode requires this for decryption. The new ETA mode now also requires this for decryption, so IPsec and GELI no longer do their own authentication verification. Simple DIGEST operations can also do this, though there are no in-tree consumers. To eventually support some refcounting to close races, the session cookie is now passed to crypto_getop() and clients should no longer set crp_sesssion directly. - Assymteric crypto operation structures should be allocated via crypto_getkreq() and freed via crypto_freekreq(). This permits the crypto layer to track open asym requests and close races with a driver trying to unregister while asym requests are in flight. - crypto_copyback, crypto_copydata, crypto_apply, and crypto_contiguous_subsegment now accept the 'crp' object as the first parameter instead of individual members. This makes it easier to deal with different buffer types in the future as well as separate input and output buffers. It's also simpler for driver writers to use. - bus_dmamap_load_crp() loads a DMA mapping for a crypto buffer. This understands the various types of buffers so that drivers that use DMA do not have to be aware of different buffer types. - Helper routines now exist to build an auth context for HMAC IPAD and OPAD. This reduces some duplicated work among drivers. - Key buffers are now treated as const throughout the framework and in device drivers. However, session key buffers provided when a session is created are expected to remain alive for the duration of the session. - GCM and CCM sessions now only specify a cipher algorithm and a cipher key. The redundant auth information is not needed or used. - For cryptosoft, split up the code a bit such that the 'process' callback now invokes a function pointer in the session. This function pointer is set based on the mode (in effect) though it simplifies a few edge cases that would otherwise be in the switch in 'process'. It does split up GCM vs CCM which I think is more readable even if there is some duplication. - I changed /dev/crypto to support GMAC requests using CRYPTO_AES_NIST_GMAC as an auth algorithm and updated cryptocheck to work with it. - Combined cipher and auth sessions via /dev/crypto now always use ETA mode. The COP_F_CIPHER_FIRST flag is now a no-op that is ignored. This was actually documented as being true in crypto(4) before, but the code had not implemented this before I added the CIPHER_FIRST flag. - I have not yet updated /dev/crypto to be aware of explicit modes for sessions. I will probably do that at some point in the future as well as teach it about IV/nonce and tag lengths for AEAD so we can support all of the NIST KAT tests for GCM and CCM. - I've split up the exising crypto.9 manpage into several pages of which many are written from scratch. - I have converted all drivers and consumers in the tree and verified that they compile, but I have not tested all of them. I have tested the following drivers: - cryptosoft - aesni (AES only) - blake2 - ccr and the following consumers: - cryptodev - IPsec - ktls_ocf - GELI (lightly) I have not tested the following: - ccp - aesni with sha - hifn - kgssapi_krb5 - ubsec - padlock - safe - armv8_crypto (aarch64) - glxsb (i386) - sec (ppc) - cesa (armv7) - cryptocteon (mips64) - nlmsec (mips64) Discussed with: cem Relnotes: yes Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D23677	2020-03-27 18:25:23 +00:00
Andrew Gallatin	98085bae8c	make lacp's use_numa hashing aware of send tags When I did the use_numa support, I missed the fact that there is a separate hash function for send tag nic selection. So when use_numa is enabled, ktls offload does not work properly, as it does not reliably allocate a send tag on the proper egress nic since different egress nics are selected for send-tag allocation and packet transmit. To fix this, this change: - refectors lacp_select_tx_port_by_hash() and lacp_select_tx_port() to make lacp_select_tx_port_by_hash() always called by lacp_select_tx_port() - pre-shifts flowids to convert them to hashes when calling lacp_select_tx_port_by_hash() - adds a numa_domain field to if_snd_tag_alloc_params - plumbs the numa domain into places where we allocate send tags In testing with NIC TLS setup on a NUMA machine, I see thousands of output errors before the change when enabling kern.ipc.tls.ifnet.permitted=1. After the change, I see no errors, and I see the NIC sysctl counters showing active TLS offload sessions. Reviewed by: rrs, hselasky, jhb Sponsored by: Netflix	2020-03-09 13:44:51 +00:00

1 2

63 Commits