freebsd-skq

Author	SHA1	Message	Date
Kristof Provost	c139b3c19b	arp/nd: Cope with late calls to iflladdr_event When tearing down vnet jails we can move an if_bridge out (as part of the normal vnet_if_return()). This can, when it's clearing out its list of member interfaces, change its link layer address. That sends an iflladdr_event, but at that point we've already freed the AF_INET/AF_INET6 if_afdata pointers. In other words: when the iflladdr_event callbacks fire we can't assume that ifp->if_afdata[AF_INET] will be set. Reviewed by: donner@, melifaro@ MFC after: 1 week Sponsored by: Orange Business Services Differential Revision: https://reviews.freebsd.org/D28860	2021-02-23 13:54:07 +01:00
Hans Petter Selasky	9febbc4541	Fix for natd(8) sending wrong sequence number after TCP retransmission, terminating a TCP connection. If a TCP packet must be retransmitted and the data length has changed in the retransmitted packet, due to the internal workings of TCP, typically when ACK packets are lost, then there is a 30% chance that the logic in GetDeltaSeqOut() will find the correct length, which is the last length received. This can be explained as follows: If a "227 Entering Passive Mode" packet must be retransmittet and the length changes from 51 to 50 bytes, for example, then we have three cases for the list scan in GetDeltaSeqOut(), depending on how many prior packets were received modulus N_LINK_TCP_DATA=3: case 1: index 0: original packet 51 index 1: retransmitted packet 50 index 2: not relevant case 2: index 0: not relevant index 1: original packet 51 index 2: retransmitted packet 50 case 3: index 0: retransmitted packet 50 index 1: not relevant index 2: original packet 51 This patch simply changes the searching order for TCP packets, always starting at the last received packet instead of any received packet, in GetDeltaAckIn() and GetDeltaSeqOut(). Else no functional changes. Discussed with: rscheff@ Submitted by: Andreas Longwitz <longwitz@incore.de> PR: 230755 MFC after: 1 week Sponsored by: Mellanox Technologies // NVIDIA Networking	2021-02-22 17:13:58 +01:00
Michael Tuexen	b963ce4588	sctp: improve computation of an alternate net Espeially handle the case where the net passed in is about to be deleted and therefore not in the list of nets anymore. MFC after: 3 days Reported by: syzbot+9756917a7c8381adf5e8@syzkaller.appspotmail.com	2021-02-21 17:13:06 +01:00
Michael Tuexen	5ac839029d	sctp: clear a pointer to a net which will be removed MFC after: 3 days	2021-02-21 13:06:05 +01:00
Richard Scheffenegger	a8e431e153	PRR: use accurate rfc6675_pipe when enabled Reviewed By: #transport, tuexen MFC after: 2 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D28816	2021-02-20 20:11:48 +01:00
Richard Scheffenegger	853fd7a2e3	Ensure cwnd doesn't shrink to zero with PRR Under some circumstances, PRR may end up with a fully collapsed cwnd when finalizing the loss recovery. Reviewed By: #transport, kbowling Reported by: Liang Tian MFC after: 1 week Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D28780	2021-02-19 13:55:32 +01:00
Kyle Evans	4c0bef07be	kern: net: remove TCP_LINGERTIME TCP_LINGERTIME can be traced back to BSD 4.4 Lite and perhaps beyond, in exactly the same form that it appears here modulo slightly different context. It used to be the case that there was a single pr_usrreq method with requests dispatched to it; these exact two lines appeared in tcp_usrreq's PRU_ATTACH handling. The only purpose of this that I can find is to cause surprising behavior on accepted connections. Newly-created sockets will never hit these paths as one cannot set SO_LINGER prior to socket(2). If SO_LINGER is set on a listening socket and inherited, one would expect the timeout to be inherited rather than changed arbitrarily like this -- noting that SO_LINGER is nonsense on a listening socket beyond inheritance, since they cannot be 'connected' by definition. Neither Illumos nor Linux reset the timer like this based on testing and inspection of Illumos, and testing of Linux. Reviewed by: rscheff, tuexen Differential Revision: https://reviews.freebsd.org/D28265	2021-02-18 22:36:01 -06:00
Randall Stewart	e13e4fa6c4	fix Navdeeps LINT_NOINET error.	2021-02-18 07:29:12 -05:00
Randall Stewart	0a4f851074	Fix another pesky missing #ifdef TCPHPTS	2021-02-18 01:27:30 -05:00
Randall Stewart	ab4fad4be1	Add ifdef TCPHPTS around build_ack_entry and do_bpf_and_csum to avoid warnings when HPTS is not included Thanks to Gary Jennejohn for pointing this out.	2021-02-17 12:49:42 -05:00
Randall Stewart	69a34e8d02	Update the LRO processing code so that we can support a further CPU enhancements for compressed acks. These are acks that are compressed into an mbuf. The transport has to be aware of how to process these, and an upcoming update to rack will do so. You need the rack changes to actually test and validate these since if the transport does not support mbuf compression, then the old code paths stay in place. We do in this commit take out the concept of logging if you don't have a lock (which was quite dangerous and was only for some early debugging but has been left in the code). Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D28374	2021-02-17 10:41:01 -05:00
Alexander V. Chernikov	9fdbf7eef5	Make in_localip_more() fib-aware. It fixes loopback route installation for the interfaces in the different fibs using the same prefix. Reviewed By: donner PR: 189088 Differential Revision: https://reviews.freebsd.org/D28673 MFC after: 1 week	2021-02-16 20:00:46 +00:00
Richard Scheffenegger	3c40e1d52c	update the SACK loss recovery to RFC6675, with the following new features: - improved pipe calculation which does not degrade under heavy loss - engaging in Loss Recovery earlier under adverse conditions - Rescue Retransmission in case some of the trailing packets of a request got lost All above changes are toggled with the sysctl "rfc6675_pipe" (disabled by default). Reviewers: #transport, tuexen, lstewart, slavash, jtl, hselasky, kib, rgrimes, chengc_netapp.com, thj, #manpages, kbowling, #netapp, rscheff Reviewed By: #transport Subscribers: imp, melifaro MFC after: 2 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D18985	2021-02-16 13:08:37 +01:00
Alexander V. Chernikov	8268d82cff	Remove per-packet ifa refcounting from IPv6 fast path. Currently ip6_input() calls in6ifa_ifwithaddr() for every local packet, in order to check if the target ip belongs to the local ifa in proper state and increase its counters. in6ifa_ifwithaddr() references found ifa. With epoch changes, both `ip6_input()` and all other current callers of `in6ifa_ifwithaddr()` do not need this reference anymore, as epoch provides stability guarantee. Given that, update `in6ifa_ifwithaddr()` to allow it to return ifa without referencing it, while preserving option for getting referenced ifa if so desired. MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D28648	2021-02-15 22:33:12 +00:00
Michael Tuexen	ed782b9f5a	tcp: improve behaviour when using TCP_NOOPT Use ISS for SEG.SEQ when sending a SYN-ACK segment in response to an SYN segment received in the SYN-SENT state on a socket having the IPPROTO_TCP level socket option TCP_NOOPT enabled. Reviewed by: rscheff Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D28656	2021-02-14 12:16:57 +01:00
Andrey V. Elsukov	c6ded47d0b	[udp] fix possible mbuf and lock leak in udp_input(). In error case we can leave `inp' locked, also we need to free mbuf chain `m' in the same case. Release the lock and use `badunlocked' label to exit with freed mbuf. Also modify UDP error statistic to match the IPv6 code. Remove redundant INP_RUNLOCK() from the `if (last == NULL)' block, there are no ways to reach this point with locked `inp'. Obtained from: Yandex LLC MFC after: 3 days Sponsored by: Yandex LLC	2021-02-11 12:08:41 +03:00
Alexander V. Chernikov	924d1c9a05	Revert "SO_RERROR indicates that receive buffer overflows should be handled as errors." Wrong version of the change was pushed inadvertenly. This reverts commit `4a01b854ca`.	2021-02-08 22:32:32 +00:00
Alexander V. Chernikov	4a01b854ca	SO_RERROR indicates that receive buffer overflows should be handled as errors. Historically receive buffer overflows have been ignored and programs could not tell if they missed messages or messages had been truncated because of overflows. Since programs historically do not expect to get receive overflow errors, this behavior is not the default. This is really really important for programs that use route(4) to keep in sync with the system. If we loose a message then we need to reload the full system state, otherwise the behaviour from that point is undefined and can lead to chasing bogus bug reports.	2021-02-08 21:42:20 +00:00
Neel Chauhan	a08cdb6cfb	Allow setting alias port ranges in libalias and ipfw. This will allow a system to be a true RFC 6598 NAT444 setup, where each network segment (e.g. user, subnet) can have their own dedicated port aliasing ranges. Reviewed by: donner, kp Approved by: 0mp (mentor), donner, kp Differential Revision: https://reviews.freebsd.org/D23450	2021-02-02 13:24:17 -08:00
Hans Petter Selasky	db46c0d0cb	Fix LINT kernel builds after `1a714ff204` . MFC after: 1 week Discussed with: rrs@ Differential Revision: https://reviews.freebsd.org/D28357 Sponsored by: Mellanox Technologies // NVIDIA Networking	2021-02-01 14:24:15 +01:00
Michael Tuexen	bdd4630c9a	sctp: small cleanup, no functional change intended. MFC after: 3 days	2021-02-01 14:04:57 +01:00
Michael Tuexen	af885c57d6	sctp: improve input validation Improve the handling of INIT chunks in specific szenarios and report and appropriate error cause. Thanks to Anatoly Korniltsev for reporting the issue for the userland stack. MFC after: 3 days	2021-01-31 23:46:53 +01:00
Michael Tuexen	8dc6a1edca	sctp: fix a locking issue for old unordered data Thanks to Anatoly Korniltsev for reporting the issue for the userland stack. MFC after: 3 days	2021-01-31 10:46:23 +01:00
Gleb Smirnoff	3f43ada98c	Catch up with `6edfd179c8`: mechanically rename IFCAP_NOMAP to IFCAP_MEXTPG. Originally IFCAP_NOMAP meant that the mbuf has external storage pointer that points to unmapped address. Then, this was extended to array of such pointers. Then, such mbufs were augmented with header/trailer. Basically, extended mbufs are extended, and set of features is subject to change. The new name should be generic enough to avoid further renaming.	2021-01-29 11:46:24 -08:00
Randall Stewart	1a714ff204	This pulls over all the changes that are in the netflix tree that fix the ratelimit code. There were several bugs in tcp_ratelimit itself and we needed further work to support the multiple tag format coming for the joint TLS and Ratelimit dances. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D28357	2021-01-28 11:53:05 -05:00
Hans Petter Selasky	093e723190	Add missing decrement of active ratelimit connections. Reviewed by: rrs@ MFC after: 1 week Sponsored by: Mellanox Technologies // NVIDIA Networking	2021-01-26 18:00:21 +01:00
Hans Petter Selasky	85d8d30f9f	Don't allow allocating a new send tag on an INP which is being torn down. This fixes a potential send tag leak. Reviewed by: rrs@ MFC after: 1 week Sponsored by: Mellanox Technologies // NVIDIA Networking	2021-01-26 18:00:02 +01:00
Richard Scheffenegger	6a376af0cd	TCP PRR: Patch div/0 in tcp_prr_partialack With clearing of recover_fs in `bc7ee8e5bc`, div/0 was observed while processing partial_acks. Suspect that rewind of an erraneous RTO may be causing this - with the above change, recover_fs would no longer retained at the last calculated value, and reset. But CC_RTO_ERR can reenable IN_RECOVERY(), without setting this again. Adding a safety net prior to the division in that function, which I missed in D28114.	2021-01-26 16:06:32 +01:00
Richard Scheffenegger	84761f3df5	Adjust line length in tcp_prr_partialack Summary: Wrap lines before column 80 in new prr code checked in recently. No functional changes. Reviewers: tuexen, rrs, jtl, mm, kbowling, #transport Reviewed By: tuexen, mm, #transport Subscribers: imp, melifaro Differential Revision: https://reviews.freebsd.org/D28329	2021-01-26 14:47:19 +01:00
Michael Tuexen	0f7573ffd6	sctp: fix PR-SCTP stats when adding addtional streams MFC after: 1 week	2021-01-24 00:50:33 +01:00
Michael Tuexen	7a051c0a78	sctp: improve consistency No functional change intended. MFC: 1 week	2021-01-24 00:07:41 +01:00
Alexander V. Chernikov	130aebbab0	Further refactor IPv4 interface route creation. * Fix bug with /32 aliases introduced in `81728a538d`. * Explicitly document business logic for IPv4 ifa routes. * Remove remnants of rtinit() * Deduplicate ifa->route prefix code by moving it into ia_getrtprefix() * Deduplicate conditional check for ifa_maintain_loopback_route() by moving into ia_need_loopback_route() * Remove now-unused flags argument from in_addprefix(). Reviewed by: donner PR: 252883 Differential Revision: https://reviews.freebsd.org/D28246	2021-01-21 21:48:49 +00:00
Richard Scheffenegger	bc7ee8e5bc	Address panic with PRR due to missed initialization of recover_fs Summary: When using the base stack in conjunction with RACK, it appears that infrequently, ++tp->t_dupacks is instantly larger than tcprexmtthresh. This leaves the recover flightsize (sackhint.recover_fs) uninitialized, leading to a div/0 panic. Address this by properly initializing the variable just prior to first use, if it is not properly initialized. In order to prevent stale information from a prior recovery to negatively impact the PRR calculations in this event, also clear recover_fs once loss recovery is finished. Finally, improve the readability of the initialization of recover_fs when t_dupacks == tcprexmtthresh by adjusting the indentation and using the max(1, snd_nxt - snd_una) macro. Reviewers: rrs, kbowling, tuexen, jtl, #transport, gnn!, jmg, manu, #manpages Reviewed By: rrs, kbowling, #transport Subscribers: bdrewery, andrew, rpokala, ae, emaste, bz, bcran, #linuxkpi, imp, melifaro Differential Revision: https://reviews.freebsd.org/D28114	2021-01-20 12:06:34 +01:00
Alex Richardson	a81c165bce	Require uint32_t alignment for ipfw_insn There are many casts of this struct to uint32_t, so we also need to ensure that it is sufficiently aligned to safely perform this cast on architectures that don't allow unaligned accesses. This fixes lots of -Wcast-align warnings. Reviewed By: ae Differential Revision: https://reviews.freebsd.org/D27879	2021-01-19 21:23:25 +00:00
Alex Richardson	be5972695f	libalias: Fix remaining compiler warnings This fixes some sign-compare warnings and adds a missing static to a variable declaration. Differential Revision: https://reviews.freebsd.org/D27883	2021-01-19 21:23:24 +00:00
Alex Richardson	bc596e5632	libalias: Fix -Wcast-align compiler warnings This fixes -Wcast-align warnings caused by the underaligned `struct ip`. This also silences them in the public functions by changing the function signature from char * to void *. This is source and binary compatible and avoids the -Wcast-align warning. Reviewed By: ae, gbe (manpages) Differential Revision: https://reviews.freebsd.org/D27882	2021-01-19 21:23:24 +00:00
Alexander V. Chernikov	f879876721	Fix IPv4 fib bsearch4() lookup array construction. Current code didn't properly handle the case with nested prefixes like 10.0.0.0/24 && 10.0.0.0/25.	2021-01-17 20:32:26 +00:00
Alexander V. Chernikov	81728a538d	Split rtinit() into multiple functions. rtinit[1]() is a function used to add or remove interface address prefix routes, similar to ifa_maintain_loopback_route(). It was intended to be family-agnostic. There is a problem with this approach in reality. 1) IPv6 code does not use it for the ifa routes. There is a separate layer, nd6_prelist_(), providing interface for maintaining interface routes. Its part, responsible for the actual route table interaction, mimics rtenty() code. 2) rtinit tries to combine multiple actions in the same function: constructing proper route attributes and handling iterations over multiple fibs, for the non-zero net.add_addr_allfibs use case. It notably increases the code complexity. 3) dstaddr handling. flags parameter re-uses RTF_ flags. As there is no special flag for p2p connections, host routes and p2p routes are handled in the same way. Additionally, mapping IFA flags to RTF flags makes the interface pretty messy. It make rtinit() to clash with ifa_mainain_loopback_route() for IPV4 interface aliases. 4) rtinit() is the last customer passing non-masked prefixes to rib_action(), complicating rib_action() implementation. 5) rtinit() coupled ifa announce/withdrawal notifications, producing "false positive" ifa messages in certain corner cases. To address all these points, the following has been done: * rtinit() has been split into multiple functions: - Route attribute construction were moved to the per-address-family functions, dealing with (2), (3) and (4). - funnction providing net.add_addr_allfibs handling and route rtsock notificaions is the new routing table inteface. - rtsock ifa notificaion has been moved out as well. resulting set of funcion are only responsible for the actual route notifications. Side effects: * /32 alias does not result in interface routes (/32 route and "host" route) * RTF_PINNED is now set for IPv6 prefixes corresponding to the interface addresses Differential revision: https://reviews.freebsd.org/D28186	2021-01-16 22:42:41 +00:00
Michael Tuexen	d2b3ceddcc	tcp: add sysctl to tolerate TCP segments missing timestamps When timestamp support has been negotiated, TCP segements received without a timestamp should be discarded. However, there are broken TCP implementations (for example, stacks used by Omniswitch 63xx and 64xx models), which send TCP segments without timestamps although they negotiated timestamp support. This patch adds a sysctl variable which tolerates such TCP segments and allows to interoperate with broken stacks. Reviewed by: jtl@, rscheff@ Differential Revision: https://reviews.freebsd.org/D28142 Sponsored by: Netflix, Inc. PR: 252449 MFC after: 1 week	2021-01-14 19:28:25 +01:00
Michael Tuexen	cc3c34859e	tcp: fix handling of TCP RST segments missing timestamps A TCP RST segment should be processed even it is missing TCP timestamps. Reported by: dmgk@, kevans@ Reviewed by: rscheff@, dmgk@ Sponsored by: Netflix, Inc. MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D28143	2021-01-14 14:39:35 +01:00
Mateusz Guzik	6b3a9a0f3d	Convert remaining cap_rights_init users to cap_rights_init_one semantic patch: @@ expression rights, r; @@ - cap_rights_init(&rights, r) + cap_rights_init_one(&rights, r)	2021-01-12 13:16:10 +00:00
Alexander V. Chernikov	2defbe9f0e	Use rn_match instead of doing indirect calls in fib_algo. Relevant inet/inet6 code has the control over deciding what the RIB lookup function currently is. With that in mind, explicitly set it to the current value (rn_match) in the datapath lookups. This avoids cost on indirect call. Differential Revision: https://reviews.freebsd.org/D28066	2021-01-11 23:30:35 +00:00
Alexander V. Chernikov	0da3f8c98d	Bump amount of queued packets in for unresolved ARP/NDP entries to 16. Currently default behaviour is to keep only 1 packet per unresolved entry. Ability to queue more than one packet was added 10 years ago, in r215207, though the default value was kep intact. Things have changed since that time. Systems tend to initiate multiple connections at once for a variety of reasons. For example, recent kern/252278 bug report describe happy-eyeball DNS behaviour sending multiple requests to the DNS server. The primary driver for upper value for the queue length determination is memory consumption. Remote actors should not be able to easily exhaust local memory by sending packets to unresolved arp/ND entries. For now, bump value to 16 packets, to match Darwin implementation. The proper approach would be to switch the limit to calculate memory consumption instead of packet count and limit based on memory. We should MFC this with a variation of D22447. Reviewers: #manpages, #network, bz, emaste Reviewed By: emaste, gbe(doc), jilles(doc) MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D28068	2021-01-11 19:51:11 +00:00
Mark Johnston	501159696c	igmp: Avoid leaking mbuf when source validation fails PR: 252504 Submitted by: Panagiotis Tsolakos <panagiotis.tsolakos@gmail.com> MFC after: 3 days	2021-01-08 13:32:04 -05:00
Alexander V. Chernikov	d68cf57b7f	Refactor rt_addrmsg() and rt_routemsg(). Summary: * Refactor rt_addrmsg(): make V_rt_add_addr_allfibs decision locally. * Fix rt_routemsg() and multipath by accepting nexthop instead of interface pointer. * Refactor rtsock_routemsg(): avoid accessing rtentry fields directly. * Simplify in_addprefix() by moving prefix search to a separate function. Reviewers: #network Subscribers: imp, ae, bz Differential Revision: https://reviews.freebsd.org/D28011	2021-01-07 19:38:19 +00:00
Michael Tuexen	a7aa5eea4f	sctp: improve handling of aborted associations Don't clear a flag, when the structure already has been freed. Reported by: syzbot+07667d16c96779c737b4@syzkaller.appspotmail.com	2021-01-01 15:59:10 +01:00
Alexander V. Chernikov	f733d9701b	Fix default route handling in radix4_lockless algo. Improve nexthop debugging. Reported by: Florian Smeets <flo at smeets.xyz>	2020-12-26 22:51:02 +00:00
Alexander V. Chernikov	f5baf8bb12	Add modular fib lookup framework. This change introduces framework that allows to dynamically attach or detach longest prefix match (lpm) lookup algorithms to speed up datapath route tables lookups. Framework takes care of handling initial synchronisation, route subscription, nhop/nhop groups reference and indexing, dataplane attachments and fib instance algorithm setup/teardown. Framework features automatic algorithm selection, allowing for picking the best matching algorithm on-the-fly based on the amount of routes in the routing table. Currently framework code is guarded under FIB_ALGO config option. An idea is to enable it by default in the next couple of weeks. The following algorithms are provided by default: IPv4: * bsearch4 (lockless binary search in a special IP array), tailored for small-fib (<16 routes) * radix4_lockless (lockless immutable radix, re-created on every rtable change), tailored for small-fib (<1000 routes) * radix4 (base system radix backend) * dpdk_lpm4 (DPDK DIR24-8-based lookups), lockless datastrucure, optimized for large-fib (D27412) IPv6: * radix6_lockless (lockless immutable radix, re-created on every rtable change), tailed for small-fib (<1000 routes) * radix6 (base system radix backend) * dpdk_lpm6 (DPDK DIR24-8-based lookups), lockless datastrucure, optimized for large-fib (D27412) Performance changes: Micro benchmarks (I7-7660U, single-core lookups, 2048k dst, code in D27604): IPv4: 8 routes: radix4: ~20mpps radix4_lockless: ~24.8mpps bsearch4: ~69mpps dpdk_lpm4: ~67 mpps 700k routes: radix4_lockless: 3.3mpps dpdk_lpm4: 46mpps IPv6: 8 routes: radix6_lockless: ~20mpps dpdk_lpm6: ~70mpps 100k routes: radix6_lockless: 13.9mpps dpdk_lpm6: 57mpps Forwarding benchmarks: + 10-15% IPv4 forwarding performance (small-fib, bsearch4) + 25% IPv4 forwarding performance (full-view, dpdk_lpm4) + 20% IPv6 forwarding performance (full-view, dpdk_lpm6) Control: Framwork adds the following runtime sysctls: List algos * net.route.algo.inet.algo_list: bsearch4, radix4_lockless, radix4 * net.route.algo.inet6.algo_list: radix6_lockless, radix6, dpdk_lpm6 Debug level (7=LOG_DEBUG, per-route) net.route.algo.debug_level: 5 Algo selection (currently only for fib 0): net.route.algo.inet.algo: bsearch4 net.route.algo.inet6.algo: radix6_lockless Support for manually changing algos in non-default fib will be added soon. Some sysctl names will be changed in the near future. Differential Revision: https://reviews.freebsd.org/D27401	2020-12-25 11:33:17 +00:00
Michael Tuexen	0ec2ce0d32	Improve input validation for parameters in ASCONF and ASCONF-ACK chunks Thanks to Tolya Korniltsev for drawing my attention to this part of the code by reporting an issue for the userland stack.	2020-12-23 18:03:47 +01:00
Andrew Gallatin	a034518ac8	Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain In order to efficiently serve web traffic on a NUMA machine, one must avoid as many NUMA domain crossings as possible. With SO_REUSEPORT_LB, a number of workers can share a listen socket. However, even if a worker sets affinity to a core or set of cores on a NUMA domain, it will receive connections associated with all NUMA domains in the system. This will lead to cross-domain traffic when the server writes to the socket or calls sendfile(), and memory is allocated on the server's local NUMA node, but transmitted on the NUMA node associated with the TCP connection. Similarly, when the server reads from the socket, he will likely be reading memory allocated on the NUMA domain associated with the TCP connection. This change provides a new socket ioctl, TCP_REUSPORT_LB_NUMA. A server can now tell the kernel to filter traffic so that only incoming connections associated with the desired NUMA domain are given to the server. (Of course, in the case where there are no servers sharing the listen socket on some domain, then as a fallback, traffic will be hashed as normal to all servers sharing the listen socket regardless of domain). This allows a server to deal only with traffic that is local to its NUMA domain, and avoids cross-domain traffic in most cases. This patch, and a corresponding small patch to nginx to use TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted https media content from dual-socket Xeons with only 13% (as measured by pcm.x) cross domain traffic on the memory controller. Reviewed by: jhb, bz (earlier version), bcr (man page) Tested by: gonzo Sponsored by: Netfix Differential Revision: https://reviews.freebsd.org/D21636	2020-12-19 22:04:46 +00:00

1 2 3 4 5 ...

6817 Commits