freebsd-dev

Author	SHA1	Message	Date
Richard Scheffenegger	2169f71277	tcp: use IPV6_FLOWLABEL_LEN Avoid magic numbers when handling the IPv6 flow ID for DSCP and ECN fields and use the named variable instead. Reviewed By: tuexen, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D39503	2023-04-11 18:53:51 +02:00
Randall Stewart	69c7c81190	Move access to tcp's t_logstate into inline functions and provide new tracepoint and bbpoint capabilities. The TCP stacks have long accessed t_logstate directly, but in order to do tracepoints and the new bbpoints we need to move to using the new inline functions. This adds them and moves rack to now use the tcp_tracepoints. Reviewed by: tuexen, gallatin Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D38831	2023-03-16 11:43:16 -04:00
Alfonso	2f201df1f8	Change hw_tls to a bool Reviewed by: imp Pull Request: https://github.com/freebsd/freebsd-src/pull/512	2023-02-25 09:59:11 -07:00
Andrew Gallatin	c0e4090e3d	ktls: Accurately track if ifnet ktls is enabled This allows us to avoid spurious calls to ktls_disable_ifnet() When we implemented ifnet kTLS, we set a flag in the tx socket buffer (SB_TLS_IFNET) to indicate ifnet kTLS. This flag meant that now, or in the past, ifnet ktls was active on a socket. Later, I added code to switch ifnet ktls sessions to software in the case of lossy TCP connections that have a high retransmit rate. Because TCP was using SB_TLS_IFNET to know if it needed to do math to calculate the retransmit ratio and potentially call into ktls_disable_ifnet(), it was doing unneeded work long after a session was moved to software. This patch carefully tracks whether or not ifnet ktls is still enabled on a TCP connection. Because the inp is now embedded in the tcpcb, and because TCP is the most frequent accessor of this state, it made sense to move this from the socket buffer flags to the tcpcb. Because we now need reliable access to the tcpcb, we take a ref on the inp when creating a tx ktls session. While here, I noticed that rack/bbr were incorrectly implementing tfb_hwtls_change(), and applying the change to all pending sends, when it should apply only to future sends. This change reduces spurious calls to ktls_disable_ifnet() by 95% or so in a Netflix CDN environment. Reviewed by: markj, rrs Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D38380	2023-02-09 12:44:44 -05:00
Gleb Smirnoff	eaabc93764	tcp: retire TCPDEBUG This subsystem is superseded by modern debugging facilities, e.g. DTrace probes and TCP black box logging. We intentionally leave SO_DEBUG in place, as many utilities may set it on a socket. Also the tcp::debug DTrace probes look at this flag on a socket. Reviewed by: gnn, tuexen Discussed with: rscheff, rrs, jtl Differential revision: https://reviews.freebsd.org/D37694	2022-12-14 09:54:06 -08:00
Gleb Smirnoff	e68b379244	tcp: embed inpcb into tcpcb For the TCP protocol inpcb storage specify allocation size that would provide space to most of the data a TCP connection needs, embedding into struct tcpcb several structures, that previously were allocated separately. The most important one is the inpcb itself. With embedding we can provide strong guarantee that with a valid TCP inpcb the tcpcb is always valid and vice versa. Also we reduce number of allocs/frees per connection. The embedded inpcb is placed in the beginning of the struct tcpcb, since in_pcballoc() requires that. However, later we may want to move it around for cache line efficiency, and this can be done with a little effort. The new intotcpcb() macro is ready for such move. The congestion algorithm data, the TCP timers and osd(9) data are also embedded into tcpcb, and temporary struct tcpcb_mem goes away. There was no extra allocation here, but we went through extra pointer every time we accessed this data. One interesting side effect is that now TCP data is allocated from SMR-protected zone. Potentially this allows the TCP stacks or other TCP related modules to utilize that for their own synchronization. Large part of the change was done with sed script: s/tp->ccv->/tp->t_ccv./g s/tp->ccv/\&tp->t_ccv/g s/tp->cc_algo/tp->t_cc/g s/tp->t_timers->tt_/tp->tt_/g s/CCV\(ccv, osd\)/\&CCV(ccv, t_osd)/g Dependency side effect is that code that needs to know struct tcpcb should also know struct inpcb, that added several <netinet/in_pcb.h>. Differential revision: https://reviews.freebsd.org/D37127	2022-12-07 09:00:48 -08:00
Gleb Smirnoff	9eb0e8326d	tcp: provide macros to access inpcb and socket from a tcpcb There should be no functional changes with this commit. Reviewed by: rscheff Differential revision: https://reviews.freebsd.org/D37123	2022-11-08 10:24:40 -08:00
Randall Stewart	cd84e78f09	tcp idle reduce does not work for a server. TCP has an idle-reduce feature that allows a connection to reduce its cwnd after it has been idle more than an RTT. This feature only works for a sending side connection. It does this by at output checking the idle time (t_rcvtime vs ticks) to see if its more than the RTO timeout. The problem comes if you are a web server. You get a request and then send out all the data.. then go idle. The next time you would send is in response to a request from the peer asking for more data. But the thing is you updated t_rcvtime when the request came in so you never reduce. The fix is to do the idle reduce check also on inbound. Reviewed by: tuexen, rscheff Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D36721	2022-10-04 07:09:01 -04:00
Randall Stewart	08af8aac2a	Tcp progress timeout Rack has had the ability to timeout connections that just sit idle automatically. This feature of course is off by default and requires the user set it on (though the socket option has been missing in tcp_usrreq.c). Lets get the progress timeout fully supported in the base stack as well as rack. Reviewed by: tuexen Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D36716	2022-09-27 13:38:20 -04:00
Richard Scheffenegger	a743fc8826	tcp: fix cwnd restricted SACK retransmission loop While doing the initial SACK retransmission segment while heavily cwnd constrained, tcp_output can erroneously send out the entire sendbuffer again. This may happen after a retransmission timeout, which resets snd_nxt to snd_una while the SACK scoreboard is still populated. Reviewed By: tuexen, #transport PR: 264257 PR: 263445 PR: 260393 MFC after: 3 days Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D36637	2022-09-22 13:28:43 +02:00
Michael Tuexen	6d9e911fba	tcp: fix computation of offset Only update the offset if actually retransmitting from the scoreboard. If not done correctly, this may result in trying to (re)-transmit data not being in the socket buffer and therefore resulting in a panic. PR: 264257 PR: 263445 PR: 260393 Reviewed by: rscheff@ MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D36626	2022-09-19 12:49:31 +02:00
Richard Scheffenegger	4012ef7754	tcp: Functional implementation of Accurate ECN The AccECN handshake and TCP header flags are supported, no support yet for the AccECN option. This minimalistic implementation is sufficient to support DCTCP while dramatically cutting the number of ACKs, and provide ECN response from the receiver to the CC modules. Reviewed By: #transport, #manpages, rrs, pauamma Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D21011	2022-08-31 15:05:53 +02:00
Michael Tuexen	bd30a1216e	tcp: improve BBLog for output events when using the FreeBSD stack Put the return value of ip_output()/ip6_output in the output event instead of adding another one in case of an error. This improves consistency with other similar places. Reviewed by: rscheff Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D36085	2022-08-08 13:07:10 +02:00
Richard Scheffenegger	66605ff791	tcp: Undo the increase in sequence number by 1 due to the FIN flag in case of a transient error. If an error occurs while processing a TCP segment with some data and the FIN flag, the back out of the sequence number advance does not take into account the increase by 1 due to the FIN flag. Reviewed By: jch, gnn, #transport, tuexen Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D2970	2022-07-14 03:18:19 +02:00
Hans Petter Selasky	28173d49dc	tcp: Correctly compute the retransmit length for all 64-bit platforms. When the TCP sequence number subtracted is greater than 2^32 minus the window size, or 2^31 minus the window size, the use of unsigned long as an intermediate variable may result in an incorrect retransmit length computation on all 64-bit platforms. While at it create a helper macro to facilitate the computation of the difference between two TCP sequence numbers. Differential Revision: https://reviews.freebsd.org/D35388 Reviewed by: rscheff MFC after: 3 days Sponsored by: NVIDIA Networking	2022-06-03 10:49:17 +02:00
Gleb Smirnoff	4328318445	sockets: use socket buffer mutexes in struct socket directly Since `c67f3b8b78` the sockbuf mutexes belong to the containing socket, and socket buffers just point to it. In `74a68313b5` macros that access this mutex directly were added. Go over the core socket code and eliminate code that reaches the mutex by dereferencing the sockbuf compatibility pointer. This change requires a KPI change, as some functions were given the sockbuf pointer only without any hint if it is a receive or send buffer. This change doesn't cover the whole kernel, many protocols still use compatibility pointers internally. However, it allows operation of a protocol that doesn't use them. Reviewed by: markj Differential revision: https://reviews.freebsd.org/D35152	2022-05-12 13:22:12 -07:00
John Baldwin	732b6d4d50	netinet: Use __diagused for variables only used in KASSERT().	2022-04-13 16:08:19 -07:00
Richard Scheffenegger	2ff07d9220	tcp: Restore correct ECT marking behavior on SACK retransmissions While coalescing all ECN-related code into new common source files, the flag to deal with SACK retransmissions was skipped. This leads to non-compliant ECT-marking of SACK retransmissions, as well as the premature sending of other TCP ECN flags (CWR). Reviewed By: rrs, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D34376	2022-02-25 20:05:32 +01:00
Richard Scheffenegger	f7220c486c	tcp: move ECN handling code to a common file Reduce the burden to maintain correct and extensible ECN related code across multiple stacks and codepaths. Formally no functional change. Incidentally this establishes correct ECN operation in one instance. Reviewed By: rrs, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D34162	2022-02-05 15:04:42 +01:00
Richard Scheffenegger	7994ef3c39	Revert "tcp: move ECN handling code to a common file" This reverts commit `0c424c90ea`.	2022-02-05 01:07:51 +01:00
Richard Scheffenegger	0c424c90ea	tcp: move ECN handling code to a common file Reduce the burden to maintain correct and extensible ECN related code across multiple stacks and codepaths. Formally no functional change. Incidentally this establishes correct ECN operation in one instance. Reviewed By: rrs, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D34162	2022-02-04 22:54:41 +01:00
Richard Scheffenegger	f026275e26	tcp: set IP ECN header codepoint properly TCP RACK can cache the IP header while preparing a new TCP packet for transmission. Thus all the IP ECN codepoint bits need to be assigned, without assuming a clear field beforehand. Reviewed By: tuexen, kbowling, #transport MFC after: 3 days Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D34148	2022-02-03 16:53:41 +01:00
Richard Scheffenegger	1ebf460758	tcp: Access all 12 TCP header flags via inline function In order to consistently provide access to all (including reserved) TCP header flag bits, use an accessor function tcp_get_flags and tcp_set_flags. Also expand any flag variable from uint8_t / char to uint16_t. Reviewed By: hselasky, tuexen, glebius, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D34130	2022-02-03 16:21:58 +01:00
Gleb Smirnoff	5b08b46a6d	tcp: welcome back tcp_output() as the right way to run output on tcpcb. Reviewed by: rrs, tuexen Differential revision: https://reviews.freebsd.org/D33365	2021-12-26 08:47:42 -08:00
Randall Stewart	9e4d9e4c4d	tcp: Preparation for allowing hardware TLS to be able to kick a tcp connection that is retransmitting too much out of hardware and back to software. Hardware TLS is now supported in some interface cards and it works well. Except that when we have connections that retransmit a lot we get into trouble with all the retransmits. This prep step makes way for change that Drew will be making so that we can "kick out" a session from hardware TLS. Reviewed by: mtuexen, gallatin Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30895	2021-06-25 09:30:54 -04:00
Randall Stewart	67e892819b	tcp: Mbuf leak while holding a socket buffer lock. When running at NF the current Rack and BBR changes with the recent commits from Richard that cause the socket buffer lock to be held over the ip_output() call and then finally culminating in a call to tcp_handle_wakeup() we get a lot of leaked mbufs. I don't think that this leak is actually caused by holding the lock or what Richard has done, but is exposing some other bug that has probably been lying dormant for a long time. I will continue to look (using his changes) at what is going on to try to root cause out the issue. In the meantime I can't leave the leaks out for everyone else. So this commit will revert all of Richards changes and move both Rack and BBR back to just doing the old sorwakeup_locked() calls after messing with the so_rcv buffer. We may want to look at adding back in Richards changes after I have pinpointed the root cause of the mbuf leak and fixed it. Reviewed by: mtuexen,rscheff Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30704	2021-06-10 08:33:57 -04:00
Michael Tuexen	500eb6dd80	tcp: Fix sending of TCP segments with IP level options When bringing in TCP over UDP support in https://cgit.FreeBSD.org/src/commit/?id=9e644c23000c2f5028b235f6263d17ffb24d3605, the length of IP level options was considered when locating the transport header. This was incorrect and is fixed by this patch. X-MFC with: https://cgit.FreeBSD.org/src/commit/?id=9e644c23000c2f5028b235f6263d17ffb24d3605 MFC after: 3 days Reviewed by: markj, rscheff Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D30358	2021-05-21 09:49:45 +02:00
Richard Scheffenegger	0471a8c734	tcp: SACK Lost Retransmission Detection (LRD) Recover from excessive losses without reverting to a retransmission timeout (RTO). Disabled by default, enable with sysctl net.inet.tcp.do_lrd=1 Reviewed By: #transport, rrs, tuexen, #manpages Sponsored by: Netapp, Inc. Differential Revision: https://reviews.freebsd.org/D28931	2021-05-10 19:06:20 +02:00
Michael Tuexen	9e644c2300	tcp: add support for TCP over UDP Adding support for TCP over UDP allows communication with TCP stacks which can be implemented in userspace without requiring special privileges or specific support by the OS. This is joint work with rrs. Reviewed by: rrs Sponsored by: Netflix, Inc. MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D29469	2021-04-18 16:16:42 +02:00
Richard Scheffenegger	9f2eeb0262	[tcp] Fix ECN on finalizing sessions. A subtle oversight would subtly change new data packets sent after a shutdown() or close() call, while the send buffer is still draining. MFC after: 3 days Reviewed By: #transport, tuexen Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D29616	2021-04-08 15:26:09 +02:00
Richard Scheffenegger	e53138694a	tcp: Add prr_out in preparation for PRR/nonSACK and LRD Reviewed By: #transport, kbowling MFC after: 3 days Sponsored By: Netapp, Inc. Differential Revision: https://reviews.freebsd.org/D29058	2021-03-06 00:38:22 +01:00
Michael Tuexen	ed782b9f5a	tcp: improve behaviour when using TCP_NOOPT Use ISS for SEG.SEQ when sending a SYN-ACK segment in response to an SYN segment received in the SYN-SENT state on a socket having the IPPROTO_TCP level socket option TCP_NOOPT enabled. Reviewed by: rscheff Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D28656	2021-02-14 12:16:57 +01:00
Richard Scheffenegger	4b72ae16ed	Stop sending tiny new data segments during SACK recovery Consider the currently in-use TCP options when calculating the amount of new data to be injected during SACK loss recovery. That addresses the effect that very small (new) segments could be injected on partial ACKs while still performing a SACK loss recovery. Reported by: Liang Tian Reviewed by: tuexen, chengc_netapp.com MFC after: 2 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D26446	2020-10-09 12:44:56 +00:00
Richard Scheffenegger	e399566123	TCP: send full initial window when timestamps are in use The fastpath in tcp_output tries to send out full segments, and avoid sending partial segments by comparing against the static t_maxseg variable. That value does not consider tcp options like timestamps, while the initial window calculation is using the correct dynamic tcp_maxseg() function. Due to this interaction, the last, full size segment is considered too short and not sent out immediately. Reviewed by: tuexen MFC after: 2 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D26478	2020-09-25 10:38:19 +00:00
Mateusz Guzik	662c13053f	net: clean up empty lines in .c and .h files	2020-09-01 21:19:14 +00:00
Andrew Gallatin	b99781834f	TCP: remove special treatment for hardware (ifnet) TLS Remove most special treatment for ifnet TLS in the TCP stack, except for code to avoid mixing handshakes and bulk data. This code made heroic efforts to send down entire TLS records to NICs. It was added to improve the PCIe bus efficiency of older TLS offload NICs which did not keep state per-session, and so would need to re-DMA the first part(s) of a TLS record if a TLS record was sent in multiple TCP packets or TSOs. Newer TLS offload NICs do not need this feature. At Netflix, we've run extensive QoE tests which show that this feature reduces client quality metrics, presumably because the effort to send TLS records atomically causes the server to both wait too long to send data (leading to buffers running dry), and to send too much data at once (leading to packet loss). Reviewed by: hselasky, jhb, rrs Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D26103	2020-08-19 17:59:06 +00:00
Richard Scheffenegger	9dc7d8a246	TCP: make after-idle work for transactional sessions. The use of t_rcvtime as proxy for the last transmission fails for transactional IO, where the client requests data before the server can respond with a bulk transfer. Set aside a dedicated variable to actually track the last locally sent segment going forward. Reported by: rrs Reviewed by: rrs, tuexen (mentor) Approved by: tuexen (mentor), rgrimes (mentor) MFC after: 2 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D25016	2020-06-24 13:42:42 +00:00
Randall Stewart	f092a3c71c	So it turns out with the right window scaling you can get the code in all stacks to always want to do a window update, even when no data can be sent. Now in cases where you are not pacing that's probably ok, you just send an extra window update or two. However with bbr (and rack if it's paced) every time the pacer goes off it's going to send a "window update". Also in testing bbr I have found that if we are not responding to data right away we end up staying in startup but incorrectly holding a pacing gain of 192 (a loss). This is because the idle window code does not restrict itself to only work with PROBE_BW. In all other states you don't want it doing a PROBE_BW state change. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D25247	2020-06-12 19:56:19 +00:00
Richard Scheffenegger	af2fb894c9	With RFC3168 ECN, CWR SHOULD only be sent with new data Overly conservative data receivers may ignore the CWR flag on other packets, and keep ECE latched. This can result in continuous reduction of the congestion window, and very poor performance when ECN is enabled. Reviewed by: rgrimes (mentor), rrs Approved by: rgrimes (mentor), tuexen (mentor) MFC after: 3 days Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D23364	2020-05-21 21:33:15 +00:00
Richard Scheffenegger	6e16d87751	Handle ECN handshake in simultaneous open While testing simultaneous open TCP with ECN, found that negotiation fails to arrive at the expected final state. Reviewed by: tuexen (mentor) Approved by: tuexen (mentor), rgrimes (mentor) MFC after: 2 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D23373	2020-05-21 21:15:25 +00:00
Gleb Smirnoff	61664ee700	Step 4.2: start divorce of M_EXT and M_EXTPG They have more differences than similarities. For now there is lots of code that would check for M_EXT only and work correctly on M_EXTPG buffers, so still carry M_EXT bit together with M_EXTPG. However, prepare some code for explicit check for M_EXTPG. Reviewed by: gallatin Differential Revision: https://reviews.freebsd.org/D24598	2020-05-03 00:37:16 +00:00
Gleb Smirnoff	6edfd179c8	Step 4.1: mechanically rename M_NOMAP to M_EXTPG Reviewed by: gallatin Differential Revision: https://reviews.freebsd.org/D24598	2020-05-03 00:21:11 +00:00
Gleb Smirnoff	7b6c99d08d	Step 3: anonymize struct mbuf_ext_pgs and move all its fields into mbuf within m_epg namespace. All edits except the 'struct mbuf' declaration and mb_dupcl() were done mechanically with sed: s/->m_ext_pgs.nrdy/->m_epg_nrdy/g s/->m_ext_pgs.hdr_len/->m_epg_hdrlen/g s/->m_ext_pgs.trail_len/->m_epg_trllen/g s/->m_ext_pgs.first_pg_off/->m_epg_1st_off/g s/->m_ext_pgs.last_pg_len/->m_epg_last_len/g s/->m_ext_pgs.flags/->m_epg_flags/g s/->m_ext_pgs.record_type/->m_epg_record_type/g s/->m_ext_pgs.enc_cnt/->m_epg_enc_cnt/g s/->m_ext_pgs.tls/->m_epg_tls/g s/->m_ext_pgs.so/->m_epg_so/g s/->m_ext_pgs.seqno/->m_epg_seqno/g s/->m_ext_pgs.stailq/->m_epg_stailq/g Reviewed by: gallatin Differential Revision: https://reviews.freebsd.org/D24598	2020-05-03 00:12:56 +00:00
Richard Scheffenegger	9028b6e0d9	Prevent premature shrinking of the scaled receive window which can cause a TCP client to use invalid or stale TCP sequence numbers for ACK packets. Packets with old sequence numbers are ignored and not used to update the send window size. This might cause the TCP session to hang indefinitely under some circumstances. Reported by: Cui Cheng Reviewed by: tuexen (mentor), rgrimes (mentor) Approved by: tuexen (mentor), rgrimes (mentor) MFC after: 3 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D24515	2020-04-29 22:01:33 +00:00
Alexander V. Chernikov	983066f05b	Convert route caching to nexthop caching. This change is built on top of nexthop objects introduced in r359823. Nexthops are separate datastructures, containing all necessary information to perform packet forwarding such as gateway interface and mtu. Nexthops are shared among the routes, providing more pre-computed cache-efficient data while requiring less memory. Splitting the LPM code and the attached data solves multiple long-standing problems in the routing layer, drastically reduces the coupling with other parts of the stack and allows to transparently introduce faster lookup algorithms. Route caching was (re)introduced to minimise (slow) routing lookups, allowing for notably better performance for large TCP senders. Caching works by acquiring rtentry reference, which is protected by per-rtentry mutex. If the routing table is changed (checked by comparing the rtable generation id) or link goes down, cache record gets withdrawn. Nexthops have the same reference counting interface, backed by refcount(9). This change merely replaces rtentry with the actual forwarding nexthop as a cached object, which is mostly mechanical. Other moving parts like cache cleanup on rtable change remain the same. Differential Revision: https://reviews.freebsd.org/D24340	2020-04-25 09:06:11 +00:00
Andrew Gallatin	23feb56348	KTLS: Re-work unmapped mbufs to carry ext_pgs in the mbuf itself. While the original implementation of unmapped mbufs was a large step forward in terms of reducing cache misses by enabling mbufs to carry more than a single page for sendfile, they are rather cache unfriendly when accessing the ext_pgs metadata and data. This is because the ext_pgs part of the mbuf is allocated separately, and almost guaranteed to be cold in cache. This change takes advantage of the fact that unmapped mbufs are never used at the same time as pkthdr mbufs. Given this fact, we can overlap the ext_pgs metadata with the mbuf pkthdr, and carry the ext_pgs meta directly in the mbuf itself. Similarly, we can carry the ext_pgs data (TLS hdr/trailer/array of pages) directly after the existing m_ext. In order to be able to carry 5 pages (which is the minimum required for a 16K TLS record which is not perfectly aligned) on LP64, I've had to steal ext_arg2. The only user of this in the xmit path is sendfile, and I've adjusted it to use arg1 when using unmapped mbufs. This change is almost entirely mechanical, except that we change mb_alloc_ext_pgs() to no longer allow allocating pkthdrs, the change to avoid ext_arg2 as mentioned above, and the removal of the ext_pgs zone, This change saves roughly 2% "raw" CPU (~59% -> 57%), or over 3% "scaled" CPU on a Netflix 100% software kTLS workload at 90+ Gb/s on Broadwell Xeons. In a follow-on commit, I plan to remove some hacks to avoid access ext_pgs fields of mbufs, since they will now be in cache. Many thanks to glebius for helping to make this better in the Netflix tree. Reviewed by: hselasky, jhb, rrs, glebius (early version) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D24213	2020-04-14 14:46:06 +00:00
Andrew Gallatin	ee7a9e506e	Avoid a cache miss accessing an mbuf ext_pgs pointer when doing SW kTLS. For a Netflix 90Gb/s 100% TLS software kTLS workload, this reduces the CPI of tcp_m_copym() from ~3.5 to ~2.5 as reported by vtune. Reviewed by: jtl, rrs Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D23998	2020-03-16 14:03:27 +00:00
Michael Tuexen	a357466592	sack_newdata and snd_recover hold the same value. Therefore, use only a single instance: use snd_recover also where sack_newdata was used. Submitted by: Richard Scheffenegger Differential Revision: https://reviews.freebsd.org/D18811	2020-02-13 15:14:46 +00:00
Randall Stewart	481be5de9d	White space cleanup -- remove trailing tab's or spaces from any line. Sponsored by: Netflix Inc.	2020-02-12 13:31:36 +00:00
Michael Tuexen	47e2c17c12	Don't set the ECT codepoint on retransmitted packets during SACK loss recovery. This is required by RFC 3168. Submitted by: Richard Scheffenegger Reviewed by: rgrimes@, tuexen@, Cheng Cui MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D23118	2020-01-25 13:34:29 +00:00
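
Commit 28173d49dc above describes how a 64-bit intermediate can corrupt the difference between two TCP sequence numbers. The sketch below is not the code from that commit; it only illustrates the general pitfall under stated assumptions. The names tcp_seq_t, seq_diff, seq_diff_wide and seq_diff_selfcheck are hypothetical and chosen for this example, and the narrowing cast assumes a two's-complement platform.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical alias for a 32-bit TCP sequence number. */
typedef uint32_t tcp_seq_t;

/*
 * Signed distance from b to a in sequence space.
 * The subtraction is done in 32 bits, so it wraps modulo 2^32;
 * the cast to int32_t then yields a small signed result even when
 * the sequence space has wrapped between b and a.
 * (The narrowing conversion assumes two's complement.)
 */
static int32_t
seq_diff(tcp_seq_t a, tcp_seq_t b)
{
	return (int32_t)(a - b);
}

/*
 * The pitfall: widening both operands first makes the subtraction
 * wrap modulo 2^64 instead of 2^32, so a wrapped sequence number
 * produces a huge value instead of a small distance.
 */
static uint64_t
seq_diff_wide(tcp_seq_t a, tcp_seq_t b)
{
	return (uint64_t)a - (uint64_t)b;
}

/* Small self-check showing both behaviours across a wrap. */
static void
seq_diff_selfcheck(void)
{
	tcp_seq_t before_wrap = 0xFFFFFF00u;	/* 256 below the wrap point */
	tcp_seq_t after_wrap = 0x00000100u;	/* 256 past the wrap point */

	assert(seq_diff(after_wrap, before_wrap) == 512);
	assert(seq_diff_wide(after_wrap, before_wrap) != 512);
}
```

Calling seq_diff_selfcheck() from a test program exercises both helpers. Keeping the arithmetic in 32 bits is the design point: any length derived from the difference, such as a value later clamped with a minimum against a window size, then stays within the window instead of being dominated by the wide, wrapped value.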


333 Commits