freebsd-skq/sys/netinet6/in6_pcb.c

1407 lines
39 KiB
C
Raw Normal View History

/*-
* SPDX-License-Identifier: BSD-3-Clause
*
* Copyright (C) 1995, 1996, 1997, and 1998 WIDE Project.
Decompose the current single inpcbinfo lock into two locks: - The existing ipi_lock continues to protect the global inpcb list and inpcb counter. This lock is now relegated to a small number of allocation and free operations, and occasional operations that walk all connections (including, awkwardly, certain UDP multicast receive operations -- something to revisit). - A new ipi_hash_lock protects the two inpcbinfo hash tables for looking up connections and bound sockets, manipulated using new INP_HASH_*() macros. This lock, combined with inpcb locks, protects the 4-tuple address space. Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb connection locks, so may be acquired while manipulating a connection on which a lock is already held, avoiding the need to acquire the inpcbinfo lock preemptively when a binding change might later be required. As a result, however, lookup operations necessarily go through a reference acquire while holding the lookup lock, later acquiring an inpcb lock -- if required. A new function in_pcblookup() looks up connections, and accepts flags indicating how to return the inpcb. Due to lock order changes, callers no longer need acquire locks before performing a lookup: the lookup routine will acquire the ipi_hash_lock as needed. In the future, it will also be able to use alternative lookup and locking strategies transparently to callers, such as pcbgroup lookup. New lookup flags are, supplementing the existing INPLOOKUP_WILDCARD flag: INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb Callers must pass exactly one of these flags (for the time being). Some notes: - All protocols are updated to work within the new regime; especially, TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely eliminated, and global hash lock hold times are dramatically reduced compared to previous locking. - The TCP syncache still relies on the pcbinfo lock, something that we may want to revisit. - Support for reverting to the FreeBSD 7.x locking strategy in TCP input is no longer available -- hash lookup locks are now held only very briefly during inpcb lookup, rather than for potentially extended periods. However, the pcbinfo ipi_lock will still be acquired if a connection state might change such that a connection is added or removed. - Raw IP sockets continue to use the pcbinfo ipi_lock for protection, due to maintaining their own hash tables. - The interface in6_pcblookup_hash_locked() is maintained, which allows callers to acquire hash locks and perform one or more lookups atomically with 4-tuple allocation: this is required only for TCPv6, as there is no in6_pcbconnect_setup(), which there should be. - UDPv6 locking remains significantly more conservative than UDPv4 locking, which relates to source address selection. This needs attention, as it likely significantly reduces parallelism in this code for multithreaded socket use (such as in BIND). - In the UDPv4 and UDPv6 multicast cases, we need to revisit locking somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which is no longer sufficient. A second check once the inpcb lock is held should do the trick, keeping the general case from requiring the inpcb lock for every inpcb visited. - This work reminds us that we need to revisit locking of the v4/v6 flags, which may be accessed lock-free both before and after this change. - Right now, a single lock name is used for the pcbhash lock -- this is undesirable, and probably another argument is required to take care of this (or a char array name field in the pcbinfo?). This is not an MFC candidate for 8.x due to its impact on lookup and locking semantics. It's possible some of these issues could be worked around with compatibility wrappers, if necessary. Reviewed by: bz Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
* Copyright (c) 2010-2011 Juniper Networks, Inc.
* All rights reserved.
*
Decompose the current single inpcbinfo lock into two locks: - The existing ipi_lock continues to protect the global inpcb list and inpcb counter. This lock is now relegated to a small number of allocation and free operations, and occasional operations that walk all connections (including, awkwardly, certain UDP multicast receive operations -- something to revisit). - A new ipi_hash_lock protects the two inpcbinfo hash tables for looking up connections and bound sockets, manipulated using new INP_HASH_*() macros. This lock, combined with inpcb locks, protects the 4-tuple address space. Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb connection locks, so may be acquired while manipulating a connection on which a lock is already held, avoiding the need to acquire the inpcbinfo lock preemptively when a binding change might later be required. As a result, however, lookup operations necessarily go through a reference acquire while holding the lookup lock, later acquiring an inpcb lock -- if required. A new function in_pcblookup() looks up connections, and accepts flags indicating how to return the inpcb. Due to lock order changes, callers no longer need acquire locks before performing a lookup: the lookup routine will acquire the ipi_hash_lock as needed. In the future, it will also be able to use alternative lookup and locking strategies transparently to callers, such as pcbgroup lookup. New lookup flags are, supplementing the existing INPLOOKUP_WILDCARD flag: INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb Callers must pass exactly one of these flags (for the time being). Some notes: - All protocols are updated to work within the new regime; especially, TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely eliminated, and global hash lock hold times are dramatically reduced compared to previous locking. - The TCP syncache still relies on the pcbinfo lock, something that we may want to revisit. - Support for reverting to the FreeBSD 7.x locking strategy in TCP input is no longer available -- hash lookup locks are now held only very briefly during inpcb lookup, rather than for potentially extended periods. However, the pcbinfo ipi_lock will still be acquired if a connection state might change such that a connection is added or removed. - Raw IP sockets continue to use the pcbinfo ipi_lock for protection, due to maintaining their own hash tables. - The interface in6_pcblookup_hash_locked() is maintained, which allows callers to acquire hash locks and perform one or more lookups atomically with 4-tuple allocation: this is required only for TCPv6, as there is no in6_pcbconnect_setup(), which there should be. - UDPv6 locking remains significantly more conservative than UDPv4 locking, which relates to source address selection. This needs attention, as it likely significantly reduces parallelism in this code for multithreaded socket use (such as in BIND). - In the UDPv4 and UDPv6 multicast cases, we need to revisit locking somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which is no longer sufficient. A second check once the inpcb lock is held should do the trick, keeping the general case from requiring the inpcb lock for every inpcb visited. - This work reminds us that we need to revisit locking of the v4/v6 flags, which may be accessed lock-free both before and after this change. - Right now, a single lock name is used for the pcbhash lock -- this is undesirable, and probably another argument is required to take care of this (or a char array name field in the pcbinfo?). This is not an MFC candidate for 8.x due to its impact on lookup and locking semantics. It's possible some of these issues could be worked around with compatibility wrappers, if necessary. Reviewed by: bz Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
* Portions of this software were developed by Robert N. M. Watson under
* contract to Juniper Networks, Inc.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
* 1. Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* 2. Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
* 3. Neither the name of the project nor the names of its contributors
* may be used to endorse or promote products derived from this software
* without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE PROJECT AND CONTRIBUTORS ``AS IS'' AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
* ARE DISCLAIMED. IN NO EVENT SHALL THE PROJECT OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
* OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
* HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
* LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
* OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
* SUCH DAMAGE.
*
2007-12-10 16:03:40 +00:00
* $KAME: in6_pcb.c,v 1.31 2001/05/21 05:45:10 jinmei Exp $
*/
/*-
* Copyright (c) 1982, 1986, 1991, 1993
* The Regents of the University of California. All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
* 1. Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* 2. Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
* 3. Neither the name of the University nor the names of its contributors
* may be used to endorse or promote products derived from this software
* without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
* ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
* OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
* HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
* LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
* OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
* SUCH DAMAGE.
*
* @(#)in_pcb.c 8.2 (Berkeley) 1/4/94
*/
2007-12-10 16:03:40 +00:00
#include <sys/cdefs.h>
__FBSDID("$FreeBSD$");
#include "opt_inet.h"
#include "opt_inet6.h"
#include "opt_ipsec.h"
Implement a CPU-affine TCP and UDP connection lookup data structure, struct inpcbgroup. pcbgroups, or "connection groups", supplement the existing inpcbinfo connection hash table, which when pcbgroups are enabled, might now be thought of more usefully as a per-protocol 4-tuple reservation table. Connections are assigned to connection groups base on a hash of their 4-tuple; wildcard sockets require special handling, and are members of all connection groups. During a connection lookup, a per-connection group lock is employed rather than the global pcbinfo lock. By aligning connection groups with input path processing, connection groups take on an effective CPU affinity, especially when aligned with RSS work placement (see a forthcoming commit for details). This eliminates cache line migration associated with global, protocol-layer data structures in steady state TCP and UDP processing (with the exception of protocol-layer statistics; further commit to follow). Elements of this approach were inspired by Willman, Rixner, and Cox's 2006 USENIX paper, "An Evaluation of Network Stack Parallelization Strategies in Modern Operating Systems". However, there are also significant differences: we maintain the inpcb lock, rather than using the connection group lock for per-connection state. Likewise, the focus of this implementation is alignment with NIC packet distribution strategies such as RSS, rather than pure software strategies. Despite that focus, software distribution is supported through the parallel netisr implementation, and works well in configurations where the number of hardware threads is greater than the number of NIC input queues, such as in the RMI XLR threaded MIPS architecture. Another important difference is the continued maintenance of existing hash tables as "reservation tables" -- these are useful both to distinguish the resource allocation aspect of protocol name management and the more common-case lookup aspect. In configurations where connection tables are aligned with hardware hashes, it is desirable to use the traditional lookup tables for loopback or encapsulated traffic rather than take the expense of hardware hashes that are hard to implement efficiently in software (such as RSS Toeplitz). Connection group support is enabled by compiling "options PCBGROUP" into your kernel configuration; for the time being, this is an experimental feature, and hence is not enabled by default. Subject to the limited MFCability of change dependencies in inpcb, and its change to the inpcbinfo init function signature, this change in principle could be merged to FreeBSD 8.x. Reviewed by: bz Sponsored by: Juniper Networks, Inc.
2011-06-06 12:55:02 +00:00
#include "opt_pcbgroup.h"
#include "opt_route.h"
Several years after initial development, merge prototype support for linking NIC Receive Side Scaling (RSS) to the network stack's connection-group implementation. This prototype (and derived patches) are in use at Juniper and several other FreeBSD-using companies, so despite some reservations about its maturity, merge the patch to the base tree so that it can be iteratively refined in collaboration rather than maintained as a set of gradually diverging patch sets. (1) Merge a software implementation of the Toeplitz hash specified in RSS implemented by David Malone. This is used to allow suitable pcbgroup placement of connections before the first packet is received from the NIC. Software hashing is generally avoided, however, due to high cost of the hash on general-purpose CPUs. (2) In in_rss.c, maintain authoritative versions of RSS state intended to be pushed to each NIC, including keying material, hash algorithm/ configuration, and buckets. Provide software-facing interfaces to hash 2- and 4-tuples for IPv4 and IPv6 using both the RSS standardised Toeplitz and a 'naive' variation with a hash efficient in software but with poor distribution properties. Implement rss_m2cpuid()to be used by netisr and other load balancing code to look up the CPU on which an mbuf should be processed. (3) In the Ethernet link layer, allow netisr distribution using RSS as a source of policy as an alternative to source ordering; continue to default to direct dispatch (i.e., don't try and requeue packets for processing on the 'right' CPU if they arrive in a directly dispatchable context). (4) Allow RSS to control tuning of connection groups in order to align groups with RSS buckets. If a packet arrives on a protocol using connection groups, and contains a suitable hardware-generated hash, use that hash value to select the connection group for pcb lookup for both IPv4 and IPv6. If no hardware-generated Toeplitz hash is available, we fall back on regular PCB lookup risking contention rather than pay the cost of Toeplitz in software -- this is a less scalable but, at my last measurement, faster approach. As core counts go up, we may want to revise this strategy despite CPU overhead. Where device drivers suitably configure NICs, and connection groups / RSS are enabled, this should avoid both lock and line contention during connection lookup for TCP. This commit does not modify any device drivers to tune device RSS configuration to the global RSS configuration; patches are in circulation to do this for at least Chelsio T3 and Intel 1G/10G drivers. Currently, the KPI for device drivers is not particularly robust, nor aware of more advanced features such as runtime reconfiguration/rebalancing. This will hopefully prove a useful starting point for refinement. No MFC is scheduled as we will first want to nail down a more mature and maintainable KPI/KBI for device drivers. Sponsored by: Juniper Networks (original work) Sponsored by: EMC/Isilon (patch update and merge)
2014-03-15 00:57:50 +00:00
#include "opt_rss.h"
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/malloc.h>
#include <sys/mbuf.h>
#include <sys/domain.h>
#include <sys/protosw.h>
#include <sys/socket.h>
#include <sys/socketvar.h>
#include <sys/sockio.h>
#include <sys/errno.h>
#include <sys/time.h>
#include <sys/priv.h>
#include <sys/proc.h>
#include <sys/jail.h>
#include <vm/uma.h>
#include <net/if.h>
#include <net/if_var.h>
#include <net/if_llatbl.h>
#include <net/if_types.h>
#include <net/route.h>
#include <net/route/nhop.h>
#include <netinet/in.h>
#include <netinet/in_var.h>
#include <netinet/in_systm.h>
#include <netinet/tcp_var.h>
#include <netinet/ip6.h>
#include <netinet/ip_var.h>
#include <netinet6/ip6_var.h>
#include <netinet6/nd6.h>
#include <netinet/in_pcb.h>
#include <netinet6/in6_pcb.h>
#include <netinet6/in6_fib.h>
#include <netinet6/scope6_var.h>
int
in6_pcbbind(struct inpcb *inp, struct sockaddr *nam,
struct ucred *cred)
{
struct socket *so = inp->inp_socket;
struct sockaddr_in6 *sin6 = (struct sockaddr_in6 *)NULL;
struct inpcbinfo *pcbinfo = inp->inp_pcbinfo;
u_short lport = 0;
int error, lookupflags = 0;
int reuseport = (so->so_options & SO_REUSEPORT);
/*
* XXX: Maybe we could let SO_REUSEPORT_LB set SO_REUSEPORT bit here
* so that we don't have to add to the (already messy) code below.
*/
int reuseport_lb = (so->so_options & SO_REUSEPORT_LB);
INP_WLOCK_ASSERT(inp);
Decompose the current single inpcbinfo lock into two locks: - The existing ipi_lock continues to protect the global inpcb list and inpcb counter. This lock is now relegated to a small number of allocation and free operations, and occasional operations that walk all connections (including, awkwardly, certain UDP multicast receive operations -- something to revisit). - A new ipi_hash_lock protects the two inpcbinfo hash tables for looking up connections and bound sockets, manipulated using new INP_HASH_*() macros. This lock, combined with inpcb locks, protects the 4-tuple address space. Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb connection locks, so may be acquired while manipulating a connection on which a lock is already held, avoiding the need to acquire the inpcbinfo lock preemptively when a binding change might later be required. As a result, however, lookup operations necessarily go through a reference acquire while holding the lookup lock, later acquiring an inpcb lock -- if required. A new function in_pcblookup() looks up connections, and accepts flags indicating how to return the inpcb. Due to lock order changes, callers no longer need acquire locks before performing a lookup: the lookup routine will acquire the ipi_hash_lock as needed. In the future, it will also be able to use alternative lookup and locking strategies transparently to callers, such as pcbgroup lookup. New lookup flags are, supplementing the existing INPLOOKUP_WILDCARD flag: INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb Callers must pass exactly one of these flags (for the time being). Some notes: - All protocols are updated to work within the new regime; especially, TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely eliminated, and global hash lock hold times are dramatically reduced compared to previous locking. - The TCP syncache still relies on the pcbinfo lock, something that we may want to revisit. - Support for reverting to the FreeBSD 7.x locking strategy in TCP input is no longer available -- hash lookup locks are now held only very briefly during inpcb lookup, rather than for potentially extended periods. However, the pcbinfo ipi_lock will still be acquired if a connection state might change such that a connection is added or removed. - Raw IP sockets continue to use the pcbinfo ipi_lock for protection, due to maintaining their own hash tables. - The interface in6_pcblookup_hash_locked() is maintained, which allows callers to acquire hash locks and perform one or more lookups atomically with 4-tuple allocation: this is required only for TCPv6, as there is no in6_pcbconnect_setup(), which there should be. - UDPv6 locking remains significantly more conservative than UDPv4 locking, which relates to source address selection. This needs attention, as it likely significantly reduces parallelism in this code for multithreaded socket use (such as in BIND). - In the UDPv4 and UDPv6 multicast cases, we need to revisit locking somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which is no longer sufficient. A second check once the inpcb lock is held should do the trick, keeping the general case from requiring the inpcb lock for every inpcb visited. - This work reminds us that we need to revisit locking of the v4/v6 flags, which may be accessed lock-free both before and after this change. - Right now, a single lock name is used for the pcbhash lock -- this is undesirable, and probably another argument is required to take care of this (or a char array name field in the pcbinfo?). This is not an MFC candidate for 8.x due to its impact on lookup and locking semantics. It's possible some of these issues could be worked around with compatibility wrappers, if necessary. Reviewed by: bz Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
INP_HASH_WLOCK_ASSERT(pcbinfo);
ifnet: Replace if_addr_lock rwlock with epoch + mutex Run on LLNW canaries and tested by pho@ gallatin: Using a 14-core, 28-HTT single socket E5-2697 v3 with a 40GbE MLX5 based ConnectX 4-LX NIC, I see an almost 12% improvement in received packet rate, and a larger improvement in bytes delivered all the way to userspace. When the host receiving 64 streams of netperf -H $DUT -t UDP_STREAM -- -m 1, I see, using nstat -I mce0 1 before the patch: InMpps OMpps InGbs OGbs err TCP Est %CPU syscalls csw irq GBfree 4.98 0.00 4.42 0.00 4235592 33 83.80 4720653 2149771 1235 247.32 4.73 0.00 4.20 0.00 4025260 33 82.99 4724900 2139833 1204 247.32 4.72 0.00 4.20 0.00 4035252 33 82.14 4719162 2132023 1264 247.32 4.71 0.00 4.21 0.00 4073206 33 83.68 4744973 2123317 1347 247.32 4.72 0.00 4.21 0.00 4061118 33 80.82 4713615 2188091 1490 247.32 4.72 0.00 4.21 0.00 4051675 33 85.29 4727399 2109011 1205 247.32 4.73 0.00 4.21 0.00 4039056 33 84.65 4724735 2102603 1053 247.32 After the patch InMpps OMpps InGbs OGbs err TCP Est %CPU syscalls csw irq GBfree 5.43 0.00 4.20 0.00 3313143 33 84.96 5434214 1900162 2656 245.51 5.43 0.00 4.20 0.00 3308527 33 85.24 5439695 1809382 2521 245.51 5.42 0.00 4.19 0.00 3316778 33 87.54 5416028 1805835 2256 245.51 5.42 0.00 4.19 0.00 3317673 33 90.44 5426044 1763056 2332 245.51 5.42 0.00 4.19 0.00 3314839 33 88.11 5435732 1792218 2499 245.52 5.44 0.00 4.19 0.00 3293228 33 91.84 5426301 1668597 2121 245.52 Similarly, netperf reports 230Mb/s before the patch, and 270Mb/s after the patch Reviewed by: gallatin Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D15366
2018-05-18 20:13:34 +00:00
if (CK_STAILQ_EMPTY(&V_in6_ifaddrhead)) /* XXX broken! */
return (EADDRNOTAVAIL);
if (inp->inp_lport || !IN6_IS_ADDR_UNSPECIFIED(&inp->in6p_laddr))
return (EINVAL);
if ((so->so_options & (SO_REUSEADDR|SO_REUSEPORT|SO_REUSEPORT_LB)) == 0)
lookupflags = INPLOOKUP_WILDCARD;
2017-02-10 05:58:16 +00:00
if (nam == NULL) {
if ((error = prison_local_ip6(cred, &inp->in6p_laddr,
((inp->inp_flags & IN6P_IPV6_V6ONLY) != 0))) != 0)
return (error);
} else {
sin6 = (struct sockaddr_in6 *)nam;
KASSERT(sin6->sin6_family == AF_INET6,
("%s: invalid address family for %p", __func__, sin6));
KASSERT(sin6->sin6_len == sizeof(*sin6),
("%s: invalid address length for %p", __func__, sin6));
if ((error = sa6_embedscope(sin6, V_ip6_use_defzone)) != 0)
return(error);
if ((error = prison_local_ip6(cred, &sin6->sin6_addr,
((inp->inp_flags & IN6P_IPV6_V6ONLY) != 0))) != 0)
return (error);
MFp4: Bring in updated jail support from bz_jail branch. This enhances the current jail implementation to permit multiple addresses per jail. In addtion to IPv4, IPv6 is supported as well. Due to updated checks it is even possible to have jails without an IP address at all, which basically gives one a chroot with restricted process view, no networking,.. SCTP support was updated and supports IPv6 in jails as well. Cpuset support permits jails to be bound to specific processor sets after creation. Jails can have an unrestricted (no duplicate protection, etc.) name in addition to the hostname. The jail name cannot be changed from within a jail and is considered to be used for management purposes or as audit-token in the future. DDB 'show jails' command was added to aid debugging. Proper compat support permits 32bit jail binaries to be used on 64bit systems to manage jails. Also backward compatibility was preserved where possible: for jail v1 syscalls, as well as with user space management utilities. Both jail as well as prison version were updated for the new features. A gap was intentionally left as the intermediate versions had been used by various patches floating around the last years. Bump __FreeBSD_version for the afore mentioned and in kernel changes. Special thanks to: - Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches and Olivier Houchard (cognet) for initial single-IPv6 patches. - Jeff Roberson (jeff) and Randall Stewart (rrs) for their help, ideas and review on cpuset and SCTP support. - Robert Watson (rwatson) for lots and lots of help, discussions, suggestions and review of most of the patch at various stages. - John Baldwin (jhb) for his help. - Simon L. Nielsen (simon) as early adopter testing changes on cluster machines as well as all the testers and people who provided feedback the last months on freebsd-jail and other channels. - My employer, CK Software GmbH, for the support so I could work on this. Reviewed by: (see above) MFC after: 3 months (this is just so that I get the mail) X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
lport = sin6->sin6_port;
if (IN6_IS_ADDR_MULTICAST(&sin6->sin6_addr)) {
/*
* Treat SO_REUSEADDR as SO_REUSEPORT for multicast;
* allow compepte duplication of binding if
* SO_REUSEPORT is set, or if SO_REUSEADDR is set
* and a multicast address is bound on both
* new and duplicated sockets.
*/
if ((so->so_options & (SO_REUSEADDR|SO_REUSEPORT)) != 0)
reuseport = SO_REUSEADDR|SO_REUSEPORT;
/*
* XXX: How to deal with SO_REUSEPORT_LB here?
* Treat same as SO_REUSEPORT for now.
*/
if ((so->so_options &
(SO_REUSEADDR|SO_REUSEPORT_LB)) != 0)
reuseport_lb = SO_REUSEADDR|SO_REUSEPORT_LB;
} else if (!IN6_IS_ADDR_UNSPECIFIED(&sin6->sin6_addr)) {
struct epoch_tracker et;
struct ifaddr *ifa;
sin6->sin6_port = 0; /* yech... */
NET_EPOCH_ENTER(et);
if ((ifa = ifa_ifwithaddr((struct sockaddr *)sin6)) ==
NULL &&
(inp->inp_flags & INP_BINDANY) == 0) {
NET_EPOCH_EXIT(et);
return (EADDRNOTAVAIL);
}
/*
* XXX: bind to an anycast address might accidentally
* cause sending a packet with anycast source address.
* We should allow to bind to a deprecated address, since
* the application dares to use it.
*/
if (ifa != NULL &&
((struct in6_ifaddr *)ifa)->ia6_flags &
(IN6_IFF_ANYCAST|IN6_IFF_NOTREADY|IN6_IFF_DETACHED)) {
NET_EPOCH_EXIT(et);
return (EADDRNOTAVAIL);
}
NET_EPOCH_EXIT(et);
}
if (lport) {
struct inpcb *t;
struct tcptw *tw;
/* GROSS */
if (ntohs(lport) <= V_ipport_reservedhigh &&
ntohs(lport) >= V_ipport_reservedlow &&
priv_check_cred(cred, PRIV_NETINET_RESERVEDPORT))
return (EACCES);
if (!IN6_IS_ADDR_MULTICAST(&sin6->sin6_addr) &&
priv_check_cred(inp->inp_cred, PRIV_NETINET_REUSEPORT) != 0) {
t = in6_pcblookup_local(pcbinfo,
&sin6->sin6_addr, lport,
INPLOOKUP_WILDCARD, cred);
if (t &&
((inp->inp_flags2 & INP_BINDMULTI) == 0) &&
((t->inp_flags & INP_TIMEWAIT) == 0) &&
(so->so_type != SOCK_STREAM ||
IN6_IS_ADDR_UNSPECIFIED(&t->in6p_faddr)) &&
(!IN6_IS_ADDR_UNSPECIFIED(&sin6->sin6_addr) ||
!IN6_IS_ADDR_UNSPECIFIED(&t->in6p_laddr) ||
(t->inp_flags2 & INP_REUSEPORT) ||
(t->inp_flags2 & INP_REUSEPORT_LB) == 0) &&
(inp->inp_cred->cr_uid !=
t->inp_cred->cr_uid))
return (EADDRINUSE);
/*
* If the socket is a BINDMULTI socket, then
* the credentials need to match and the
* original socket also has to have been bound
* with BINDMULTI.
*/
if (t && (! in_pcbbind_check_bindmulti(inp, t)))
return (EADDRINUSE);
#ifdef INET
if ((inp->inp_flags & IN6P_IPV6_V6ONLY) == 0 &&
IN6_IS_ADDR_UNSPECIFIED(&sin6->sin6_addr)) {
struct sockaddr_in sin;
in6_sin6_2_sin(&sin, sin6);
t = in_pcblookup_local(pcbinfo,
sin.sin_addr, lport,
INPLOOKUP_WILDCARD, cred);
if (t &&
((inp->inp_flags2 & INP_BINDMULTI) == 0) &&
((t->inp_flags &
INP_TIMEWAIT) == 0) &&
(so->so_type != SOCK_STREAM ||
ntohl(t->inp_faddr.s_addr) ==
INADDR_ANY) &&
(inp->inp_cred->cr_uid !=
t->inp_cred->cr_uid))
return (EADDRINUSE);
if (t && (! in_pcbbind_check_bindmulti(inp, t)))
return (EADDRINUSE);
}
#endif
}
t = in6_pcblookup_local(pcbinfo, &sin6->sin6_addr,
lport, lookupflags, cred);
if (t && (t->inp_flags & INP_TIMEWAIT)) {
/*
* XXXRW: If an incpb has had its timewait
* state recycled, we treat the address as
* being in use (for now). This is better
* than a panic, but not desirable.
*/
tw = intotw(t);
if (tw == NULL ||
((reuseport & tw->tw_so_options) == 0 &&
(reuseport_lb & tw->tw_so_options) == 0))
return (EADDRINUSE);
} else if (t && (reuseport & inp_so_options(t)) == 0 &&
(reuseport_lb & inp_so_options(t)) == 0) {
return (EADDRINUSE);
}
#ifdef INET
if ((inp->inp_flags & IN6P_IPV6_V6ONLY) == 0 &&
IN6_IS_ADDR_UNSPECIFIED(&sin6->sin6_addr)) {
struct sockaddr_in sin;
in6_sin6_2_sin(&sin, sin6);
t = in_pcblookup_local(pcbinfo, sin.sin_addr,
lport, lookupflags, cred);
if (t && t->inp_flags & INP_TIMEWAIT) {
tw = intotw(t);
if (tw == NULL)
return (EADDRINUSE);
if ((reuseport & tw->tw_so_options) == 0
&& (reuseport_lb & tw->tw_so_options) == 0
&& (ntohl(t->inp_laddr.s_addr) !=
INADDR_ANY || ((inp->inp_vflag &
INP_IPV6PROTO) ==
(t->inp_vflag & INP_IPV6PROTO))))
return (EADDRINUSE);
} else if (t &&
(reuseport & inp_so_options(t)) == 0 &&
(reuseport_lb & inp_so_options(t)) == 0 &&
(ntohl(t->inp_laddr.s_addr) != INADDR_ANY ||
(t->inp_vflag & INP_IPV6PROTO) != 0)) {
return (EADDRINUSE);
}
}
#endif
}
inp->in6p_laddr = sin6->sin6_addr;
}
if (lport == 0) {
2017-02-10 05:58:16 +00:00
if ((error = in6_pcbsetport(&inp->in6p_laddr, inp, cred)) != 0) {
/* Undo an address bind that may have occurred. */
inp->in6p_laddr = in6addr_any;
return (error);
}
} else {
inp->inp_lport = lport;
if (in_pcbinshash(inp) != 0) {
inp->in6p_laddr = in6addr_any;
inp->inp_lport = 0;
return (EAGAIN);
}
}
return (0);
}
/*
* Transform old in6_pcbconnect() into an inner subroutine for new
* in6_pcbconnect(): Do some validity-checking on the remote
* address (in mbuf 'nam') and then determine local host address
* (i.e., which interface) to use to access that remote host.
*
* This preserves definition of in6_pcbconnect(), while supporting a
* slightly different version for T/TCP. (This is more than
* a bit of a kludge, but cleaning up the internal interfaces would
* have forced minor changes in every protocol).
*/
static int
in6_pcbladdr(struct inpcb *inp, struct sockaddr_in6 *sin6,
struct in6_addr *plocal_addr6)
{
int error = 0;
int scope_ambiguous = 0;
struct in6_addr in6a;
struct epoch_tracker et;
INP_WLOCK_ASSERT(inp);
Decompose the current single inpcbinfo lock into two locks: - The existing ipi_lock continues to protect the global inpcb list and inpcb counter. This lock is now relegated to a small number of allocation and free operations, and occasional operations that walk all connections (including, awkwardly, certain UDP multicast receive operations -- something to revisit). - A new ipi_hash_lock protects the two inpcbinfo hash tables for looking up connections and bound sockets, manipulated using new INP_HASH_*() macros. This lock, combined with inpcb locks, protects the 4-tuple address space. Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb connection locks, so may be acquired while manipulating a connection on which a lock is already held, avoiding the need to acquire the inpcbinfo lock preemptively when a binding change might later be required. As a result, however, lookup operations necessarily go through a reference acquire while holding the lookup lock, later acquiring an inpcb lock -- if required. A new function in_pcblookup() looks up connections, and accepts flags indicating how to return the inpcb. Due to lock order changes, callers no longer need acquire locks before performing a lookup: the lookup routine will acquire the ipi_hash_lock as needed. In the future, it will also be able to use alternative lookup and locking strategies transparently to callers, such as pcbgroup lookup. New lookup flags are, supplementing the existing INPLOOKUP_WILDCARD flag: INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb Callers must pass exactly one of these flags (for the time being). Some notes: - All protocols are updated to work within the new regime; especially, TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely eliminated, and global hash lock hold times are dramatically reduced compared to previous locking. - The TCP syncache still relies on the pcbinfo lock, something that we may want to revisit. - Support for reverting to the FreeBSD 7.x locking strategy in TCP input is no longer available -- hash lookup locks are now held only very briefly during inpcb lookup, rather than for potentially extended periods. However, the pcbinfo ipi_lock will still be acquired if a connection state might change such that a connection is added or removed. - Raw IP sockets continue to use the pcbinfo ipi_lock for protection, due to maintaining their own hash tables. - The interface in6_pcblookup_hash_locked() is maintained, which allows callers to acquire hash locks and perform one or more lookups atomically with 4-tuple allocation: this is required only for TCPv6, as there is no in6_pcbconnect_setup(), which there should be. - UDPv6 locking remains significantly more conservative than UDPv4 locking, which relates to source address selection. This needs attention, as it likely significantly reduces parallelism in this code for multithreaded socket use (such as in BIND). - In the UDPv4 and UDPv6 multicast cases, we need to revisit locking somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which is no longer sufficient. A second check once the inpcb lock is held should do the trick, keeping the general case from requiring the inpcb lock for every inpcb visited. - This work reminds us that we need to revisit locking of the v4/v6 flags, which may be accessed lock-free both before and after this change. - Right now, a single lock name is used for the pcbhash lock -- this is undesirable, and probably another argument is required to take care of this (or a char array name field in the pcbinfo?). This is not an MFC candidate for 8.x due to its impact on lookup and locking semantics. It's possible some of these issues could be worked around with compatibility wrappers, if necessary. Reviewed by: bz Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
INP_HASH_WLOCK_ASSERT(inp->inp_pcbinfo); /* XXXRW: why? */
if (sin6->sin6_port == 0)
return (EADDRNOTAVAIL);
if (sin6->sin6_scope_id == 0 && !V_ip6_use_defzone)
scope_ambiguous = 1;
if ((error = sa6_embedscope(sin6, V_ip6_use_defzone)) != 0)
return(error);
ifnet: Replace if_addr_lock rwlock with epoch + mutex Run on LLNW canaries and tested by pho@ gallatin: Using a 14-core, 28-HTT single socket E5-2697 v3 with a 40GbE MLX5 based ConnectX 4-LX NIC, I see an almost 12% improvement in received packet rate, and a larger improvement in bytes delivered all the way to userspace. When the host receiving 64 streams of netperf -H $DUT -t UDP_STREAM -- -m 1, I see, using nstat -I mce0 1 before the patch: InMpps OMpps InGbs OGbs err TCP Est %CPU syscalls csw irq GBfree 4.98 0.00 4.42 0.00 4235592 33 83.80 4720653 2149771 1235 247.32 4.73 0.00 4.20 0.00 4025260 33 82.99 4724900 2139833 1204 247.32 4.72 0.00 4.20 0.00 4035252 33 82.14 4719162 2132023 1264 247.32 4.71 0.00 4.21 0.00 4073206 33 83.68 4744973 2123317 1347 247.32 4.72 0.00 4.21 0.00 4061118 33 80.82 4713615 2188091 1490 247.32 4.72 0.00 4.21 0.00 4051675 33 85.29 4727399 2109011 1205 247.32 4.73 0.00 4.21 0.00 4039056 33 84.65 4724735 2102603 1053 247.32 After the patch InMpps OMpps InGbs OGbs err TCP Est %CPU syscalls csw irq GBfree 5.43 0.00 4.20 0.00 3313143 33 84.96 5434214 1900162 2656 245.51 5.43 0.00 4.20 0.00 3308527 33 85.24 5439695 1809382 2521 245.51 5.42 0.00 4.19 0.00 3316778 33 87.54 5416028 1805835 2256 245.51 5.42 0.00 4.19 0.00 3317673 33 90.44 5426044 1763056 2332 245.51 5.42 0.00 4.19 0.00 3314839 33 88.11 5435732 1792218 2499 245.52 5.44 0.00 4.19 0.00 3293228 33 91.84 5426301 1668597 2121 245.52 Similarly, netperf reports 230Mb/s before the patch, and 270Mb/s after the patch Reviewed by: gallatin Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D15366
2018-05-18 20:13:34 +00:00
if (!CK_STAILQ_EMPTY(&V_in6_ifaddrhead)) {
/*
* If the destination address is UNSPECIFIED addr,
* use the loopback addr, e.g ::1.
*/
if (IN6_IS_ADDR_UNSPECIFIED(&sin6->sin6_addr))
sin6->sin6_addr = in6addr_loopback;
}
if ((error = prison_remote_ip6(inp->inp_cred, &sin6->sin6_addr)) != 0)
return (error);
NET_EPOCH_ENTER(et);
error = in6_selectsrc_socket(sin6, inp->in6p_outputopts,
inp, inp->inp_cred, scope_ambiguous, &in6a, NULL);
NET_EPOCH_EXIT(et);
if (error)
return (error);
/*
* Do not update this earlier, in case we return with an error.
*
* XXX: this in6_selectsrc_socket result might replace the bound local
* address with the address specified by setsockopt(IPV6_PKTINFO).
* Is it the intended behavior?
*/
*plocal_addr6 = in6a;
/*
* Don't do pcblookup call here; return interface in
* plocal_addr6
* and exit to caller, that will do the lookup.
*/
return (0);
}
/*
* Outer subroutine:
* Connect from a socket to a specified address.
* Both address and port must be specified in argument sin.
* If don't have a local address for this socket yet,
* then pick one.
*/
int
in6_pcbconnect_mbuf(struct inpcb *inp, struct sockaddr *nam,
struct ucred *cred, struct mbuf *m, bool rehash)
{
Decompose the current single inpcbinfo lock into two locks: - The existing ipi_lock continues to protect the global inpcb list and inpcb counter. This lock is now relegated to a small number of allocation and free operations, and occasional operations that walk all connections (including, awkwardly, certain UDP multicast receive operations -- something to revisit). - A new ipi_hash_lock protects the two inpcbinfo hash tables for looking up connections and bound sockets, manipulated using new INP_HASH_*() macros. This lock, combined with inpcb locks, protects the 4-tuple address space. Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb connection locks, so may be acquired while manipulating a connection on which a lock is already held, avoiding the need to acquire the inpcbinfo lock preemptively when a binding change might later be required. As a result, however, lookup operations necessarily go through a reference acquire while holding the lookup lock, later acquiring an inpcb lock -- if required. A new function in_pcblookup() looks up connections, and accepts flags indicating how to return the inpcb. Due to lock order changes, callers no longer need acquire locks before performing a lookup: the lookup routine will acquire the ipi_hash_lock as needed. In the future, it will also be able to use alternative lookup and locking strategies transparently to callers, such as pcbgroup lookup. New lookup flags are, supplementing the existing INPLOOKUP_WILDCARD flag: INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb Callers must pass exactly one of these flags (for the time being). Some notes: - All protocols are updated to work within the new regime; especially, TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely eliminated, and global hash lock hold times are dramatically reduced compared to previous locking. - The TCP syncache still relies on the pcbinfo lock, something that we may want to revisit. - Support for reverting to the FreeBSD 7.x locking strategy in TCP input is no longer available -- hash lookup locks are now held only very briefly during inpcb lookup, rather than for potentially extended periods. However, the pcbinfo ipi_lock will still be acquired if a connection state might change such that a connection is added or removed. - Raw IP sockets continue to use the pcbinfo ipi_lock for protection, due to maintaining their own hash tables. - The interface in6_pcblookup_hash_locked() is maintained, which allows callers to acquire hash locks and perform one or more lookups atomically with 4-tuple allocation: this is required only for TCPv6, as there is no in6_pcbconnect_setup(), which there should be. - UDPv6 locking remains significantly more conservative than UDPv4 locking, which relates to source address selection. This needs attention, as it likely significantly reduces parallelism in this code for multithreaded socket use (such as in BIND). - In the UDPv4 and UDPv6 multicast cases, we need to revisit locking somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which is no longer sufficient. A second check once the inpcb lock is held should do the trick, keeping the general case from requiring the inpcb lock for every inpcb visited. - This work reminds us that we need to revisit locking of the v4/v6 flags, which may be accessed lock-free both before and after this change. - Right now, a single lock name is used for the pcbhash lock -- this is undesirable, and probably another argument is required to take care of this (or a char array name field in the pcbinfo?). This is not an MFC candidate for 8.x due to its impact on lookup and locking semantics. It's possible some of these issues could be worked around with compatibility wrappers, if necessary. Reviewed by: bz Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
struct inpcbinfo *pcbinfo = inp->inp_pcbinfo;
struct sockaddr_in6 *sin6 = (struct sockaddr_in6 *)nam;
struct sockaddr_in6 laddr6;
int error;
KASSERT(sin6->sin6_family == AF_INET6,
("%s: invalid address family for %p", __func__, sin6));
KASSERT(sin6->sin6_len == sizeof(*sin6),
("%s: invalid address length for %p", __func__, sin6));
bzero(&laddr6, sizeof(laddr6));
laddr6.sin6_family = AF_INET6;
INP_WLOCK_ASSERT(inp);
Decompose the current single inpcbinfo lock into two locks: - The existing ipi_lock continues to protect the global inpcb list and inpcb counter. This lock is now relegated to a small number of allocation and free operations, and occasional operations that walk all connections (including, awkwardly, certain UDP multicast receive operations -- something to revisit). - A new ipi_hash_lock protects the two inpcbinfo hash tables for looking up connections and bound sockets, manipulated using new INP_HASH_*() macros. This lock, combined with inpcb locks, protects the 4-tuple address space. Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb connection locks, so may be acquired while manipulating a connection on which a lock is already held, avoiding the need to acquire the inpcbinfo lock preemptively when a binding change might later be required. As a result, however, lookup operations necessarily go through a reference acquire while holding the lookup lock, later acquiring an inpcb lock -- if required. A new function in_pcblookup() looks up connections, and accepts flags indicating how to return the inpcb. Due to lock order changes, callers no longer need acquire locks before performing a lookup: the lookup routine will acquire the ipi_hash_lock as needed. In the future, it will also be able to use alternative lookup and locking strategies transparently to callers, such as pcbgroup lookup. New lookup flags are, supplementing the existing INPLOOKUP_WILDCARD flag: INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb Callers must pass exactly one of these flags (for the time being). Some notes: - All protocols are updated to work within the new regime; especially, TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely eliminated, and global hash lock hold times are dramatically reduced compared to previous locking. - The TCP syncache still relies on the pcbinfo lock, something that we may want to revisit. - Support for reverting to the FreeBSD 7.x locking strategy in TCP input is no longer available -- hash lookup locks are now held only very briefly during inpcb lookup, rather than for potentially extended periods. However, the pcbinfo ipi_lock will still be acquired if a connection state might change such that a connection is added or removed. - Raw IP sockets continue to use the pcbinfo ipi_lock for protection, due to maintaining their own hash tables. - The interface in6_pcblookup_hash_locked() is maintained, which allows callers to acquire hash locks and perform one or more lookups atomically with 4-tuple allocation: this is required only for TCPv6, as there is no in6_pcbconnect_setup(), which there should be. - UDPv6 locking remains significantly more conservative than UDPv4 locking, which relates to source address selection. This needs attention, as it likely significantly reduces parallelism in this code for multithreaded socket use (such as in BIND). - In the UDPv4 and UDPv6 multicast cases, we need to revisit locking somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which is no longer sufficient. A second check once the inpcb lock is held should do the trick, keeping the general case from requiring the inpcb lock for every inpcb visited. - This work reminds us that we need to revisit locking of the v4/v6 flags, which may be accessed lock-free both before and after this change. - Right now, a single lock name is used for the pcbhash lock -- this is undesirable, and probably another argument is required to take care of this (or a char array name field in the pcbinfo?). This is not an MFC candidate for 8.x due to its impact on lookup and locking semantics. It's possible some of these issues could be worked around with compatibility wrappers, if necessary. Reviewed by: bz Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
INP_HASH_WLOCK_ASSERT(pcbinfo);
#ifdef ROUTE_MPATH
if (CALC_FLOWID_OUTBOUND) {
uint32_t hash_type, hash_val;
hash_val = fib6_calc_software_hash(&inp->in6p_laddr,
&sin6->sin6_addr, 0, sin6->sin6_port,
inp->inp_socket->so_proto->pr_protocol, &hash_type);
inp->inp_flowid = hash_val;
inp->inp_flowtype = hash_type;
}
#endif
/*
* Call inner routine, to assign local interface address.
* in6_pcbladdr() may automatically fill in sin6_scope_id.
*/
if ((error = in6_pcbladdr(inp, sin6, &laddr6.sin6_addr)) != 0)
return (error);
Decompose the current single inpcbinfo lock into two locks: - The existing ipi_lock continues to protect the global inpcb list and inpcb counter. This lock is now relegated to a small number of allocation and free operations, and occasional operations that walk all connections (including, awkwardly, certain UDP multicast receive operations -- something to revisit). - A new ipi_hash_lock protects the two inpcbinfo hash tables for looking up connections and bound sockets, manipulated using new INP_HASH_*() macros. This lock, combined with inpcb locks, protects the 4-tuple address space. Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb connection locks, so may be acquired while manipulating a connection on which a lock is already held, avoiding the need to acquire the inpcbinfo lock preemptively when a binding change might later be required. As a result, however, lookup operations necessarily go through a reference acquire while holding the lookup lock, later acquiring an inpcb lock -- if required. A new function in_pcblookup() looks up connections, and accepts flags indicating how to return the inpcb. Due to lock order changes, callers no longer need acquire locks before performing a lookup: the lookup routine will acquire the ipi_hash_lock as needed. In the future, it will also be able to use alternative lookup and locking strategies transparently to callers, such as pcbgroup lookup. New lookup flags are, supplementing the existing INPLOOKUP_WILDCARD flag: INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb Callers must pass exactly one of these flags (for the time being). Some notes: - All protocols are updated to work within the new regime; especially, TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely eliminated, and global hash lock hold times are dramatically reduced compared to previous locking. - The TCP syncache still relies on the pcbinfo lock, something that we may want to revisit. - Support for reverting to the FreeBSD 7.x locking strategy in TCP input is no longer available -- hash lookup locks are now held only very briefly during inpcb lookup, rather than for potentially extended periods. However, the pcbinfo ipi_lock will still be acquired if a connection state might change such that a connection is added or removed. - Raw IP sockets continue to use the pcbinfo ipi_lock for protection, due to maintaining their own hash tables. - The interface in6_pcblookup_hash_locked() is maintained, which allows callers to acquire hash locks and perform one or more lookups atomically with 4-tuple allocation: this is required only for TCPv6, as there is no in6_pcbconnect_setup(), which there should be. - UDPv6 locking remains significantly more conservative than UDPv4 locking, which relates to source address selection. This needs attention, as it likely significantly reduces parallelism in this code for multithreaded socket use (such as in BIND). - In the UDPv4 and UDPv6 multicast cases, we need to revisit locking somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which is no longer sufficient. A second check once the inpcb lock is held should do the trick, keeping the general case from requiring the inpcb lock for every inpcb visited. - This work reminds us that we need to revisit locking of the v4/v6 flags, which may be accessed lock-free both before and after this change. - Right now, a single lock name is used for the pcbhash lock -- this is undesirable, and probably another argument is required to take care of this (or a char array name field in the pcbinfo?). This is not an MFC candidate for 8.x due to its impact on lookup and locking semantics. It's possible some of these issues could be worked around with compatibility wrappers, if necessary. Reviewed by: bz Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
if (in6_pcblookup_hash_locked(pcbinfo, &sin6->sin6_addr,
sin6->sin6_port,
IN6_IS_ADDR_UNSPECIFIED(&inp->in6p_laddr)
? &laddr6.sin6_addr : &inp->in6p_laddr,
Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain In order to efficiently serve web traffic on a NUMA machine, one must avoid as many NUMA domain crossings as possible. With SO_REUSEPORT_LB, a number of workers can share a listen socket. However, even if a worker sets affinity to a core or set of cores on a NUMA domain, it will receive connections associated with all NUMA domains in the system. This will lead to cross-domain traffic when the server writes to the socket or calls sendfile(), and memory is allocated on the server's local NUMA node, but transmitted on the NUMA node associated with the TCP connection. Similarly, when the server reads from the socket, he will likely be reading memory allocated on the NUMA domain associated with the TCP connection. This change provides a new socket ioctl, TCP_REUSPORT_LB_NUMA. A server can now tell the kernel to filter traffic so that only incoming connections associated with the desired NUMA domain are given to the server. (Of course, in the case where there are no servers sharing the listen socket on some domain, then as a fallback, traffic will be hashed as normal to all servers sharing the listen socket regardless of domain). This allows a server to deal only with traffic that is local to its NUMA domain, and avoids cross-domain traffic in most cases. This patch, and a corresponding small patch to nginx to use TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted https media content from dual-socket Xeons with only 13% (as measured by pcm.x) cross domain traffic on the memory controller. Reviewed by: jhb, bz (earlier version), bcr (man page) Tested by: gonzo Sponsored by: Netfix Differential Revision: https://reviews.freebsd.org/D21636
2020-12-19 22:04:46 +00:00
inp->inp_lport, 0, NULL, M_NODOM) != NULL) {
return (EADDRINUSE);
}
if (IN6_IS_ADDR_UNSPECIFIED(&inp->in6p_laddr)) {
if (inp->inp_lport == 0) {
/*
* rehash was required to be true in the past for
* this case; retain that convention. However,
* we now call in_pcb_lport_dest rather than
* in6_pcbbind; the former does not insert into
* the hash table, the latter does. Change rehash
* to false to do the in_pcbinshash below.
*/
KASSERT(rehash == true,
("Rehashing required for unbound inps"));
rehash = false;
error = in_pcb_lport_dest(inp,
(struct sockaddr *) &laddr6, &inp->inp_lport,
(struct sockaddr *) sin6, sin6->sin6_port, cred,
INPLOOKUP_WILDCARD);
2017-02-10 05:58:16 +00:00
if (error)
return (error);
}
inp->in6p_laddr = laddr6.sin6_addr;
}
inp->in6p_faddr = sin6->sin6_addr;
inp->inp_fport = sin6->sin6_port;
/* update flowinfo - draft-itojun-ipv6-flowlabel-api-00 */
inp->inp_flow &= ~IPV6_FLOWLABEL_MASK;
if (inp->inp_flags & IN6P_AUTOFLOWLABEL)
inp->inp_flow |=
(htonl(ip6_randomflowlabel()) & IPV6_FLOWLABEL_MASK);
if (rehash) {
in_pcbrehash_mbuf(inp, m);
} else {
in_pcbinshash_mbuf(inp, m);
}
return (0);
}
int
in6_pcbconnect(struct inpcb *inp, struct sockaddr *nam, struct ucred *cred)
{
return (in6_pcbconnect_mbuf(inp, nam, cred, NULL, true));
}
void
in6_pcbdisconnect(struct inpcb *inp)
{
INP_WLOCK_ASSERT(inp);
Decompose the current single inpcbinfo lock into two locks: - The existing ipi_lock continues to protect the global inpcb list and inpcb counter. This lock is now relegated to a small number of allocation and free operations, and occasional operations that walk all connections (including, awkwardly, certain UDP multicast receive operations -- something to revisit). - A new ipi_hash_lock protects the two inpcbinfo hash tables for looking up connections and bound sockets, manipulated using new INP_HASH_*() macros. This lock, combined with inpcb locks, protects the 4-tuple address space. Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb connection locks, so may be acquired while manipulating a connection on which a lock is already held, avoiding the need to acquire the inpcbinfo lock preemptively when a binding change might later be required. As a result, however, lookup operations necessarily go through a reference acquire while holding the lookup lock, later acquiring an inpcb lock -- if required. A new function in_pcblookup() looks up connections, and accepts flags indicating how to return the inpcb. Due to lock order changes, callers no longer need acquire locks before performing a lookup: the lookup routine will acquire the ipi_hash_lock as needed. In the future, it will also be able to use alternative lookup and locking strategies transparently to callers, such as pcbgroup lookup. New lookup flags are, supplementing the existing INPLOOKUP_WILDCARD flag: INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb Callers must pass exactly one of these flags (for the time being). Some notes: - All protocols are updated to work within the new regime; especially, TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely eliminated, and global hash lock hold times are dramatically reduced compared to previous locking. - The TCP syncache still relies on the pcbinfo lock, something that we may want to revisit. - Support for reverting to the FreeBSD 7.x locking strategy in TCP input is no longer available -- hash lookup locks are now held only very briefly during inpcb lookup, rather than for potentially extended periods. However, the pcbinfo ipi_lock will still be acquired if a connection state might change such that a connection is added or removed. - Raw IP sockets continue to use the pcbinfo ipi_lock for protection, due to maintaining their own hash tables. - The interface in6_pcblookup_hash_locked() is maintained, which allows callers to acquire hash locks and perform one or more lookups atomically with 4-tuple allocation: this is required only for TCPv6, as there is no in6_pcbconnect_setup(), which there should be. - UDPv6 locking remains significantly more conservative than UDPv4 locking, which relates to source address selection. This needs attention, as it likely significantly reduces parallelism in this code for multithreaded socket use (such as in BIND). - In the UDPv4 and UDPv6 multicast cases, we need to revisit locking somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which is no longer sufficient. A second check once the inpcb lock is held should do the trick, keeping the general case from requiring the inpcb lock for every inpcb visited. - This work reminds us that we need to revisit locking of the v4/v6 flags, which may be accessed lock-free both before and after this change. - Right now, a single lock name is used for the pcbhash lock -- this is undesirable, and probably another argument is required to take care of this (or a char array name field in the pcbinfo?). This is not an MFC candidate for 8.x due to its impact on lookup and locking semantics. It's possible some of these issues could be worked around with compatibility wrappers, if necessary. Reviewed by: bz Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
INP_HASH_WLOCK_ASSERT(inp->inp_pcbinfo);
bzero((caddr_t)&inp->in6p_faddr, sizeof(inp->in6p_faddr));
inp->inp_fport = 0;
/* clear flowinfo - draft-itojun-ipv6-flowlabel-api-00 */
inp->inp_flow &= ~IPV6_FLOWLABEL_MASK;
in_pcbrehash(inp);
}
struct sockaddr *
in6_sockaddr(in_port_t port, struct in6_addr *addr_p)
{
struct sockaddr_in6 *sin6;
sin6 = malloc(sizeof *sin6, M_SONAME, M_WAITOK);
bzero(sin6, sizeof *sin6);
sin6->sin6_family = AF_INET6;
sin6->sin6_len = sizeof(*sin6);
sin6->sin6_port = port;
sin6->sin6_addr = *addr_p;
(void)sa6_recoverscope(sin6); /* XXX: should catch errors */
return (struct sockaddr *)sin6;
}
struct sockaddr *
in6_v4mapsin6_sockaddr(in_port_t port, struct in_addr *addr_p)
{
struct sockaddr_in sin;
struct sockaddr_in6 *sin6_p;
bzero(&sin, sizeof sin);
sin.sin_family = AF_INET;
sin.sin_len = sizeof(sin);
sin.sin_port = port;
sin.sin_addr = *addr_p;
sin6_p = malloc(sizeof *sin6_p, M_SONAME,
M_WAITOK);
in6_sin_2_v4mapsin6(&sin, sin6_p);
return (struct sockaddr *)sin6_p;
}
int
in6_getsockaddr(struct socket *so, struct sockaddr **nam)
{
struct inpcb *inp;
struct in6_addr addr;
in_port_t port;
inp = sotoinpcb(so);
KASSERT(inp != NULL, ("in6_getsockaddr: inp == NULL"));
INP_RLOCK(inp);
port = inp->inp_lport;
addr = inp->in6p_laddr;
INP_RUNLOCK(inp);
*nam = in6_sockaddr(port, &addr);
return 0;
}
int
in6_getpeeraddr(struct socket *so, struct sockaddr **nam)
{
struct inpcb *inp;
struct in6_addr addr;
in_port_t port;
inp = sotoinpcb(so);
KASSERT(inp != NULL, ("in6_getpeeraddr: inp == NULL"));
INP_RLOCK(inp);
port = inp->inp_fport;
addr = inp->in6p_faddr;
INP_RUNLOCK(inp);
*nam = in6_sockaddr(port, &addr);
return 0;
}
int
in6_mapped_sockaddr(struct socket *so, struct sockaddr **nam)
{
struct inpcb *inp;
int error;
inp = sotoinpcb(so);
KASSERT(inp != NULL, ("in6_mapped_sockaddr: inp == NULL"));
#ifdef INET
if ((inp->inp_vflag & (INP_IPV4 | INP_IPV6)) == INP_IPV4) {
error = in_getsockaddr(so, nam);
1999-12-21 11:14:12 +00:00
if (error == 0)
in6_sin_2_v4mapsin6_in_sock(nam);
} else
#endif
{
/* scope issues will be handled in in6_getsockaddr(). */
error = in6_getsockaddr(so, nam);
}
return error;
}
int
in6_mapped_peeraddr(struct socket *so, struct sockaddr **nam)
{
struct inpcb *inp;
int error;
inp = sotoinpcb(so);
KASSERT(inp != NULL, ("in6_mapped_peeraddr: inp == NULL"));
#ifdef INET
if ((inp->inp_vflag & (INP_IPV4 | INP_IPV6)) == INP_IPV4) {
error = in_getpeeraddr(so, nam);
1999-12-21 11:14:12 +00:00
if (error == 0)
in6_sin_2_v4mapsin6_in_sock(nam);
} else
#endif
/* scope issues will be handled in in6_getpeeraddr(). */
error = in6_getpeeraddr(so, nam);
return error;
}
/*
* Pass some notification to all connections of a protocol
* associated with address dst. The local address and/or port numbers
* may be specified to limit the search. The "usual action" will be
* taken, depending on the ctlinput cmd. The caller must filter any
* cmds that are uninteresting (e.g., no error in the map).
* Call the protocol specific routine (if any) to report
* any errors for each matching socket.
*/
void
in6_pcbnotify(struct inpcbinfo *pcbinfo, struct sockaddr *dst,
u_int fport_arg, const struct sockaddr *src, u_int lport_arg,
int cmd, void *cmdarg,
2008-01-08 19:08:58 +00:00
struct inpcb *(*notify)(struct inpcb *, int))
{
struct inpcb *inp, *inp_temp;
struct sockaddr_in6 sa6_src, *sa6_dst;
u_short fport = fport_arg, lport = lport_arg;
u_int32_t flowinfo;
int errno;
if ((unsigned)cmd >= PRC_NCMDS || dst->sa_family != AF_INET6)
return;
sa6_dst = (struct sockaddr_in6 *)dst;
if (IN6_IS_ADDR_UNSPECIFIED(&sa6_dst->sin6_addr))
return;
/*
* note that src can be NULL when we get notify by local fragmentation.
*/
sa6_src = (src == NULL) ? sa6_any : *(const struct sockaddr_in6 *)src;
flowinfo = sa6_src.sin6_flowinfo;
/*
* Redirects go to all references to the destination,
* and use in6_rtchange to invalidate the route cache.
* Dead host indications: also use in6_rtchange to invalidate
* the cache, and deliver the error to all the sockets.
* Otherwise, if we have knowledge of the local port and address,
* deliver only to that socket.
*/
if (PRC_IS_REDIRECT(cmd) || cmd == PRC_HOSTDEAD) {
fport = 0;
lport = 0;
bzero((caddr_t)&sa6_src.sin6_addr, sizeof(sa6_src.sin6_addr));
if (cmd != PRC_HOSTDEAD)
notify = in6_rtchange;
}
errno = inet6ctlerrmap[cmd];
INP_INFO_WLOCK(pcbinfo);
CK_LIST_FOREACH_SAFE(inp, pcbinfo->ipi_listhead, inp_list, inp_temp) {
INP_WLOCK(inp);
if ((inp->inp_vflag & INP_IPV6) == 0) {
INP_WUNLOCK(inp);
continue;
}
/*
* If the error designates a new path MTU for a destination
* and the application (associated with this socket) wanted to
* know the value, notify.
* XXX: should we avoid to notify the value to TCP sockets?
*/
if (cmd == PRC_MSGSIZE && cmdarg != NULL)
ip6_notify_pmtu(inp, (struct sockaddr_in6 *)dst,
*(u_int32_t *)cmdarg);
/*
* Detect if we should notify the error. If no source and
* destination ports are specifed, but non-zero flowinfo and
* local address match, notify the error. This is the case
* when the error is delivered with an encrypted buffer
* by ESP. Otherwise, just compare addresses and ports
* as usual.
*/
if (lport == 0 && fport == 0 && flowinfo &&
inp->inp_socket != NULL &&
flowinfo == (inp->inp_flow & IPV6_FLOWLABEL_MASK) &&
IN6_ARE_ADDR_EQUAL(&inp->in6p_laddr, &sa6_src.sin6_addr))
goto do_notify;
else if (!IN6_ARE_ADDR_EQUAL(&inp->in6p_faddr,
&sa6_dst->sin6_addr) ||
inp->inp_socket == 0 ||
(lport && inp->inp_lport != lport) ||
(!IN6_IS_ADDR_UNSPECIFIED(&sa6_src.sin6_addr) &&
!IN6_ARE_ADDR_EQUAL(&inp->in6p_laddr,
&sa6_src.sin6_addr)) ||
(fport && inp->inp_fport != fport)) {
INP_WUNLOCK(inp);
continue;
}
do_notify:
if (notify) {
if ((*notify)(inp, errno))
INP_WUNLOCK(inp);
} else
INP_WUNLOCK(inp);
}
INP_INFO_WUNLOCK(pcbinfo);
}
/*
Decompose the current single inpcbinfo lock into two locks: - The existing ipi_lock continues to protect the global inpcb list and inpcb counter. This lock is now relegated to a small number of allocation and free operations, and occasional operations that walk all connections (including, awkwardly, certain UDP multicast receive operations -- something to revisit). - A new ipi_hash_lock protects the two inpcbinfo hash tables for looking up connections and bound sockets, manipulated using new INP_HASH_*() macros. This lock, combined with inpcb locks, protects the 4-tuple address space. Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb connection locks, so may be acquired while manipulating a connection on which a lock is already held, avoiding the need to acquire the inpcbinfo lock preemptively when a binding change might later be required. As a result, however, lookup operations necessarily go through a reference acquire while holding the lookup lock, later acquiring an inpcb lock -- if required. A new function in_pcblookup() looks up connections, and accepts flags indicating how to return the inpcb. Due to lock order changes, callers no longer need acquire locks before performing a lookup: the lookup routine will acquire the ipi_hash_lock as needed. In the future, it will also be able to use alternative lookup and locking strategies transparently to callers, such as pcbgroup lookup. New lookup flags are, supplementing the existing INPLOOKUP_WILDCARD flag: INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb Callers must pass exactly one of these flags (for the time being). Some notes: - All protocols are updated to work within the new regime; especially, TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely eliminated, and global hash lock hold times are dramatically reduced compared to previous locking. - The TCP syncache still relies on the pcbinfo lock, something that we may want to revisit. - Support for reverting to the FreeBSD 7.x locking strategy in TCP input is no longer available -- hash lookup locks are now held only very briefly during inpcb lookup, rather than for potentially extended periods. However, the pcbinfo ipi_lock will still be acquired if a connection state might change such that a connection is added or removed. - Raw IP sockets continue to use the pcbinfo ipi_lock for protection, due to maintaining their own hash tables. - The interface in6_pcblookup_hash_locked() is maintained, which allows callers to acquire hash locks and perform one or more lookups atomically with 4-tuple allocation: this is required only for TCPv6, as there is no in6_pcbconnect_setup(), which there should be. - UDPv6 locking remains significantly more conservative than UDPv4 locking, which relates to source address selection. This needs attention, as it likely significantly reduces parallelism in this code for multithreaded socket use (such as in BIND). - In the UDPv4 and UDPv6 multicast cases, we need to revisit locking somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which is no longer sufficient. A second check once the inpcb lock is held should do the trick, keeping the general case from requiring the inpcb lock for every inpcb visited. - This work reminds us that we need to revisit locking of the v4/v6 flags, which may be accessed lock-free both before and after this change. - Right now, a single lock name is used for the pcbhash lock -- this is undesirable, and probably another argument is required to take care of this (or a char array name field in the pcbinfo?). This is not an MFC candidate for 8.x due to its impact on lookup and locking semantics. It's possible some of these issues could be worked around with compatibility wrappers, if necessary. Reviewed by: bz Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
* Lookup a PCB based on the local address and port. Caller must hold the
* hash lock. No inpcb locks or references are acquired.
*/
struct inpcb *
in6_pcblookup_local(struct inpcbinfo *pcbinfo, struct in6_addr *laddr,
u_short lport, int lookupflags, struct ucred *cred)
{
struct inpcb *inp;
int matchwild = 3, wildcard;
KASSERT((lookupflags & ~(INPLOOKUP_WILDCARD)) == 0,
("%s: invalid lookup flags %d", __func__, lookupflags));
INP_HASH_LOCK_ASSERT(pcbinfo);
if ((lookupflags & INPLOOKUP_WILDCARD) == 0) {
struct inpcbhead *head;
/*
* Look for an unconnected (wildcard foreign addr) PCB that
* matches the local address and port we're looking for.
*/
head = &pcbinfo->ipi_hashbase[INP_PCBHASH(
INP6_PCBHASHKEY(&in6addr_any), lport, 0,
pcbinfo->ipi_hashmask)];
CK_LIST_FOREACH(inp, head, inp_hash) {
MFp4: Bring in updated jail support from bz_jail branch. This enhances the current jail implementation to permit multiple addresses per jail. In addtion to IPv4, IPv6 is supported as well. Due to updated checks it is even possible to have jails without an IP address at all, which basically gives one a chroot with restricted process view, no networking,.. SCTP support was updated and supports IPv6 in jails as well. Cpuset support permits jails to be bound to specific processor sets after creation. Jails can have an unrestricted (no duplicate protection, etc.) name in addition to the hostname. The jail name cannot be changed from within a jail and is considered to be used for management purposes or as audit-token in the future. DDB 'show jails' command was added to aid debugging. Proper compat support permits 32bit jail binaries to be used on 64bit systems to manage jails. Also backward compatibility was preserved where possible: for jail v1 syscalls, as well as with user space management utilities. Both jail as well as prison version were updated for the new features. A gap was intentionally left as the intermediate versions had been used by various patches floating around the last years. Bump __FreeBSD_version for the afore mentioned and in kernel changes. Special thanks to: - Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches and Olivier Houchard (cognet) for initial single-IPv6 patches. - Jeff Roberson (jeff) and Randall Stewart (rrs) for their help, ideas and review on cpuset and SCTP support. - Robert Watson (rwatson) for lots and lots of help, discussions, suggestions and review of most of the patch at various stages. - John Baldwin (jhb) for his help. - Simon L. Nielsen (simon) as early adopter testing changes on cluster machines as well as all the testers and people who provided feedback the last months on freebsd-jail and other channels. - My employer, CK Software GmbH, for the support so I could work on this. Reviewed by: (see above) MFC after: 3 months (this is just so that I get the mail) X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
/* XXX inp locking */
1999-12-21 11:14:12 +00:00
if ((inp->inp_vflag & INP_IPV6) == 0)
continue;
if (IN6_IS_ADDR_UNSPECIFIED(&inp->in6p_faddr) &&
IN6_ARE_ADDR_EQUAL(&inp->in6p_laddr, laddr) &&
inp->inp_lport == lport) {
MFp4: Bring in updated jail support from bz_jail branch. This enhances the current jail implementation to permit multiple addresses per jail. In addtion to IPv4, IPv6 is supported as well. Due to updated checks it is even possible to have jails without an IP address at all, which basically gives one a chroot with restricted process view, no networking,.. SCTP support was updated and supports IPv6 in jails as well. Cpuset support permits jails to be bound to specific processor sets after creation. Jails can have an unrestricted (no duplicate protection, etc.) name in addition to the hostname. The jail name cannot be changed from within a jail and is considered to be used for management purposes or as audit-token in the future. DDB 'show jails' command was added to aid debugging. Proper compat support permits 32bit jail binaries to be used on 64bit systems to manage jails. Also backward compatibility was preserved where possible: for jail v1 syscalls, as well as with user space management utilities. Both jail as well as prison version were updated for the new features. A gap was intentionally left as the intermediate versions had been used by various patches floating around the last years. Bump __FreeBSD_version for the afore mentioned and in kernel changes. Special thanks to: - Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches and Olivier Houchard (cognet) for initial single-IPv6 patches. - Jeff Roberson (jeff) and Randall Stewart (rrs) for their help, ideas and review on cpuset and SCTP support. - Robert Watson (rwatson) for lots and lots of help, discussions, suggestions and review of most of the patch at various stages. - John Baldwin (jhb) for his help. - Simon L. Nielsen (simon) as early adopter testing changes on cluster machines as well as all the testers and people who provided feedback the last months on freebsd-jail and other channels. - My employer, CK Software GmbH, for the support so I could work on this. Reviewed by: (see above) MFC after: 3 months (this is just so that I get the mail) X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
/* Found. */
if (cred == NULL ||
prison_equal_ip6(cred->cr_prison,
inp->inp_cred->cr_prison))
MFp4: Bring in updated jail support from bz_jail branch. This enhances the current jail implementation to permit multiple addresses per jail. In addtion to IPv4, IPv6 is supported as well. Due to updated checks it is even possible to have jails without an IP address at all, which basically gives one a chroot with restricted process view, no networking,.. SCTP support was updated and supports IPv6 in jails as well. Cpuset support permits jails to be bound to specific processor sets after creation. Jails can have an unrestricted (no duplicate protection, etc.) name in addition to the hostname. The jail name cannot be changed from within a jail and is considered to be used for management purposes or as audit-token in the future. DDB 'show jails' command was added to aid debugging. Proper compat support permits 32bit jail binaries to be used on 64bit systems to manage jails. Also backward compatibility was preserved where possible: for jail v1 syscalls, as well as with user space management utilities. Both jail as well as prison version were updated for the new features. A gap was intentionally left as the intermediate versions had been used by various patches floating around the last years. Bump __FreeBSD_version for the afore mentioned and in kernel changes. Special thanks to: - Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches and Olivier Houchard (cognet) for initial single-IPv6 patches. - Jeff Roberson (jeff) and Randall Stewart (rrs) for their help, ideas and review on cpuset and SCTP support. - Robert Watson (rwatson) for lots and lots of help, discussions, suggestions and review of most of the patch at various stages. - John Baldwin (jhb) for his help. - Simon L. Nielsen (simon) as early adopter testing changes on cluster machines as well as all the testers and people who provided feedback the last months on freebsd-jail and other channels. - My employer, CK Software GmbH, for the support so I could work on this. Reviewed by: (see above) MFC after: 3 months (this is just so that I get the mail) X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
return (inp);
}
}
/*
* Not found.
*/
return (NULL);
} else {
struct inpcbporthead *porthash;
struct inpcbport *phd;
struct inpcb *match = NULL;
/*
* Best fit PCB lookup.
*
* First see if this local port is in use by looking on the
* port hash list.
*/
porthash = &pcbinfo->ipi_porthashbase[INP_PCBPORTHASH(lport,
pcbinfo->ipi_porthashmask)];
CK_LIST_FOREACH(phd, porthash, phd_hash) {
if (phd->phd_port == lport)
break;
}
if (phd != NULL) {
/*
* Port is in use by one or more PCBs. Look for best
* fit.
*/
CK_LIST_FOREACH(inp, &phd->phd_pcblist, inp_portlist) {
wildcard = 0;
MFp4: Bring in updated jail support from bz_jail branch. This enhances the current jail implementation to permit multiple addresses per jail. In addtion to IPv4, IPv6 is supported as well. Due to updated checks it is even possible to have jails without an IP address at all, which basically gives one a chroot with restricted process view, no networking,.. SCTP support was updated and supports IPv6 in jails as well. Cpuset support permits jails to be bound to specific processor sets after creation. Jails can have an unrestricted (no duplicate protection, etc.) name in addition to the hostname. The jail name cannot be changed from within a jail and is considered to be used for management purposes or as audit-token in the future. DDB 'show jails' command was added to aid debugging. Proper compat support permits 32bit jail binaries to be used on 64bit systems to manage jails. Also backward compatibility was preserved where possible: for jail v1 syscalls, as well as with user space management utilities. Both jail as well as prison version were updated for the new features. A gap was intentionally left as the intermediate versions had been used by various patches floating around the last years. Bump __FreeBSD_version for the afore mentioned and in kernel changes. Special thanks to: - Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches and Olivier Houchard (cognet) for initial single-IPv6 patches. - Jeff Roberson (jeff) and Randall Stewart (rrs) for their help, ideas and review on cpuset and SCTP support. - Robert Watson (rwatson) for lots and lots of help, discussions, suggestions and review of most of the patch at various stages. - John Baldwin (jhb) for his help. - Simon L. Nielsen (simon) as early adopter testing changes on cluster machines as well as all the testers and people who provided feedback the last months on freebsd-jail and other channels. - My employer, CK Software GmbH, for the support so I could work on this. Reviewed by: (see above) MFC after: 3 months (this is just so that I get the mail) X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
if (cred != NULL &&
!prison_equal_ip6(cred->cr_prison,
inp->inp_cred->cr_prison))
MFp4: Bring in updated jail support from bz_jail branch. This enhances the current jail implementation to permit multiple addresses per jail. In addtion to IPv4, IPv6 is supported as well. Due to updated checks it is even possible to have jails without an IP address at all, which basically gives one a chroot with restricted process view, no networking,.. SCTP support was updated and supports IPv6 in jails as well. Cpuset support permits jails to be bound to specific processor sets after creation. Jails can have an unrestricted (no duplicate protection, etc.) name in addition to the hostname. The jail name cannot be changed from within a jail and is considered to be used for management purposes or as audit-token in the future. DDB 'show jails' command was added to aid debugging. Proper compat support permits 32bit jail binaries to be used on 64bit systems to manage jails. Also backward compatibility was preserved where possible: for jail v1 syscalls, as well as with user space management utilities. Both jail as well as prison version were updated for the new features. A gap was intentionally left as the intermediate versions had been used by various patches floating around the last years. Bump __FreeBSD_version for the afore mentioned and in kernel changes. Special thanks to: - Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches and Olivier Houchard (cognet) for initial single-IPv6 patches. - Jeff Roberson (jeff) and Randall Stewart (rrs) for their help, ideas and review on cpuset and SCTP support. - Robert Watson (rwatson) for lots and lots of help, discussions, suggestions and review of most of the patch at various stages. - John Baldwin (jhb) for his help. - Simon L. Nielsen (simon) as early adopter testing changes on cluster machines as well as all the testers and people who provided feedback the last months on freebsd-jail and other channels. - My employer, CK Software GmbH, for the support so I could work on this. Reviewed by: (see above) MFC after: 3 months (this is just so that I get the mail) X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
continue;
/* XXX inp locking */
1999-12-21 11:14:12 +00:00
if ((inp->inp_vflag & INP_IPV6) == 0)
continue;
if (!IN6_IS_ADDR_UNSPECIFIED(&inp->in6p_faddr))
wildcard++;
if (!IN6_IS_ADDR_UNSPECIFIED(
&inp->in6p_laddr)) {
if (IN6_IS_ADDR_UNSPECIFIED(laddr))
wildcard++;
else if (!IN6_ARE_ADDR_EQUAL(
MFp4: Bring in updated jail support from bz_jail branch. This enhances the current jail implementation to permit multiple addresses per jail. In addtion to IPv4, IPv6 is supported as well. Due to updated checks it is even possible to have jails without an IP address at all, which basically gives one a chroot with restricted process view, no networking,.. SCTP support was updated and supports IPv6 in jails as well. Cpuset support permits jails to be bound to specific processor sets after creation. Jails can have an unrestricted (no duplicate protection, etc.) name in addition to the hostname. The jail name cannot be changed from within a jail and is considered to be used for management purposes or as audit-token in the future. DDB 'show jails' command was added to aid debugging. Proper compat support permits 32bit jail binaries to be used on 64bit systems to manage jails. Also backward compatibility was preserved where possible: for jail v1 syscalls, as well as with user space management utilities. Both jail as well as prison version were updated for the new features. A gap was intentionally left as the intermediate versions had been used by various patches floating around the last years. Bump __FreeBSD_version for the afore mentioned and in kernel changes. Special thanks to: - Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches and Olivier Houchard (cognet) for initial single-IPv6 patches. - Jeff Roberson (jeff) and Randall Stewart (rrs) for their help, ideas and review on cpuset and SCTP support. - Robert Watson (rwatson) for lots and lots of help, discussions, suggestions and review of most of the patch at various stages. - John Baldwin (jhb) for his help. - Simon L. Nielsen (simon) as early adopter testing changes on cluster machines as well as all the testers and people who provided feedback the last months on freebsd-jail and other channels. - My employer, CK Software GmbH, for the support so I could work on this. Reviewed by: (see above) MFC after: 3 months (this is just so that I get the mail) X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
&inp->in6p_laddr, laddr))
continue;
} else {
if (!IN6_IS_ADDR_UNSPECIFIED(laddr))
wildcard++;
}
if (wildcard < matchwild) {
match = inp;
matchwild = wildcard;
MFp4: Bring in updated jail support from bz_jail branch. This enhances the current jail implementation to permit multiple addresses per jail. In addtion to IPv4, IPv6 is supported as well. Due to updated checks it is even possible to have jails without an IP address at all, which basically gives one a chroot with restricted process view, no networking,.. SCTP support was updated and supports IPv6 in jails as well. Cpuset support permits jails to be bound to specific processor sets after creation. Jails can have an unrestricted (no duplicate protection, etc.) name in addition to the hostname. The jail name cannot be changed from within a jail and is considered to be used for management purposes or as audit-token in the future. DDB 'show jails' command was added to aid debugging. Proper compat support permits 32bit jail binaries to be used on 64bit systems to manage jails. Also backward compatibility was preserved where possible: for jail v1 syscalls, as well as with user space management utilities. Both jail as well as prison version were updated for the new features. A gap was intentionally left as the intermediate versions had been used by various patches floating around the last years. Bump __FreeBSD_version for the afore mentioned and in kernel changes. Special thanks to: - Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches and Olivier Houchard (cognet) for initial single-IPv6 patches. - Jeff Roberson (jeff) and Randall Stewart (rrs) for their help, ideas and review on cpuset and SCTP support. - Robert Watson (rwatson) for lots and lots of help, discussions, suggestions and review of most of the patch at various stages. - John Baldwin (jhb) for his help. - Simon L. Nielsen (simon) as early adopter testing changes on cluster machines as well as all the testers and people who provided feedback the last months on freebsd-jail and other channels. - My employer, CK Software GmbH, for the support so I could work on this. Reviewed by: (see above) MFC after: 3 months (this is just so that I get the mail) X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
if (matchwild == 0)
break;
}
}
}
return (match);
}
}
void
in6_pcbpurgeif0(struct inpcbinfo *pcbinfo, struct ifnet *ifp)
{
struct inpcb *inp;
struct in6_multi *inm;
struct in6_mfilter *imf;
struct ip6_moptions *im6o;
INP_INFO_WLOCK(pcbinfo);
CK_LIST_FOREACH(inp, pcbinfo->ipi_listhead, inp_list) {
INP_WLOCK(inp);
if (__predict_false(inp->inp_flags2 & INP_FREED)) {
INP_WUNLOCK(inp);
continue;
}
im6o = inp->in6p_moptions;
if ((inp->inp_vflag & INP_IPV6) && im6o != NULL) {
/*
Bite the bullet, and make the IPv6 SSM and MLDv2 mega-commit: import from p4 bms_netdev. Summary of changes: * Connect netinet6/in6_mcast.c to build. The legacy KAME KPIs are mostly preserved. * Eliminate now dead code from ip6_output.c. Don't do mbuf bingo, we are not going to do RFC 2292 style CMSG tricks for multicast options as they are not required by any current IPv6 normative reference. * Refactor transports (UDP, raw_ip6) to do own mcast filtering. SCTP, TCP unaffected by this change. * Add ip6_msource, in6_msource structs to in6_var.h. * Hookup mld_ifinfo state to in6_ifextra, allocate from domifattach path. * Eliminate IN6_LOOKUP_MULTI(), it is no longer referenced. Kernel consumers which need this should use in6m_lookup(). * Refactor IPv6 socket group memberships to use a vector (like IPv4). * Update ifmcstat(8) for IPv6 SSM. * Add witness lock order for IN6_MULTI_LOCK. * Move IN6_MULTI_LOCK out of lower ip6_output()/ip6_input() paths. * Introduce IP6STAT_ADD/SUB/INC/DEC as per rwatson's IPv4 cleanup. * Update carp(4) for new IPv6 SSM KPIs. * Virtualize ip6_mrouter socket. Changes mostly localized to IPv6 MROUTING. * Don't do a local group lookup in MROUTING. * Kill unused KAME prototypes in6_purgemkludge(), in6_restoremkludge(). * Preserve KAME DAD timer jitter behaviour in MLDv1 compatibility mode. * Bump __FreeBSD_version to 800084. * Update UPDATING. NOTE WELL: * This code hasn't been tested against real MLDv2 queriers (yet), although the on-wire protocol has been verified in Wireshark. * There are a few unresolved issues in the socket layer APIs to do with scope ID propagation. * There is a LOR present in ip6_output()'s use of in6_setscope() which needs to be resolved. See comments in mld6.c. This is believed to be benign and can't be avoided for the moment without re-introducing an indirect netisr. This work was mostly derived from the IGMPv3 implementation, and has been sponsored by a third party.
2009-04-29 19:19:13 +00:00
* Unselect the outgoing ifp for multicast if it
* is being detached.
*/
if (im6o->im6o_multicast_ifp == ifp)
im6o->im6o_multicast_ifp = NULL;
/*
* Drop multicast group membership if we joined
* through the interface being detached.
*/
restart:
IP6_MFILTER_FOREACH(imf, &im6o->im6o_head) {
if ((inm = imf->im6f_in6m) == NULL)
continue;
if (inm->in6m_ifp != ifp)
continue;
ip6_mfilter_remove(&im6o->im6o_head, imf);
IN6_MULTI_LOCK_ASSERT();
in6_leavegroup_locked(inm, NULL);
ip6_mfilter_free(imf);
goto restart;
}
}
INP_WUNLOCK(inp);
}
INP_INFO_WUNLOCK(pcbinfo);
}
/*
* Check for alternatives when higher level complains
* about service problems. For now, invalidate cached
* routing information. If the route was created dynamically
* (by a redirect), time to try a default gateway again.
*/
void
in6_losing(struct inpcb *inp)
{
RO_INVALIDATE_CACHE(&inp->inp_route6);
}
/*
* After a routing change, flush old routing
* and allocate a (hopefully) better one.
*/
struct inpcb *
in6_rtchange(struct inpcb *inp, int errno __unused)
{
RO_INVALIDATE_CACHE(&inp->inp_route6);
return inp;
}
static struct inpcb *
in6_pcblookup_lbgroup(const struct inpcbinfo *pcbinfo,
const struct in6_addr *laddr, uint16_t lport, const struct in6_addr *faddr,
Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain In order to efficiently serve web traffic on a NUMA machine, one must avoid as many NUMA domain crossings as possible. With SO_REUSEPORT_LB, a number of workers can share a listen socket. However, even if a worker sets affinity to a core or set of cores on a NUMA domain, it will receive connections associated with all NUMA domains in the system. This will lead to cross-domain traffic when the server writes to the socket or calls sendfile(), and memory is allocated on the server's local NUMA node, but transmitted on the NUMA node associated with the TCP connection. Similarly, when the server reads from the socket, he will likely be reading memory allocated on the NUMA domain associated with the TCP connection. This change provides a new socket ioctl, TCP_REUSPORT_LB_NUMA. A server can now tell the kernel to filter traffic so that only incoming connections associated with the desired NUMA domain are given to the server. (Of course, in the case where there are no servers sharing the listen socket on some domain, then as a fallback, traffic will be hashed as normal to all servers sharing the listen socket regardless of domain). This allows a server to deal only with traffic that is local to its NUMA domain, and avoids cross-domain traffic in most cases. This patch, and a corresponding small patch to nginx to use TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted https media content from dual-socket Xeons with only 13% (as measured by pcm.x) cross domain traffic on the memory controller. Reviewed by: jhb, bz (earlier version), bcr (man page) Tested by: gonzo Sponsored by: Netfix Differential Revision: https://reviews.freebsd.org/D21636
2020-12-19 22:04:46 +00:00
uint16_t fport, int lookupflags, uint8_t numa_domain)
{
Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain In order to efficiently serve web traffic on a NUMA machine, one must avoid as many NUMA domain crossings as possible. With SO_REUSEPORT_LB, a number of workers can share a listen socket. However, even if a worker sets affinity to a core or set of cores on a NUMA domain, it will receive connections associated with all NUMA domains in the system. This will lead to cross-domain traffic when the server writes to the socket or calls sendfile(), and memory is allocated on the server's local NUMA node, but transmitted on the NUMA node associated with the TCP connection. Similarly, when the server reads from the socket, he will likely be reading memory allocated on the NUMA domain associated with the TCP connection. This change provides a new socket ioctl, TCP_REUSPORT_LB_NUMA. A server can now tell the kernel to filter traffic so that only incoming connections associated with the desired NUMA domain are given to the server. (Of course, in the case where there are no servers sharing the listen socket on some domain, then as a fallback, traffic will be hashed as normal to all servers sharing the listen socket regardless of domain). This allows a server to deal only with traffic that is local to its NUMA domain, and avoids cross-domain traffic in most cases. This patch, and a corresponding small patch to nginx to use TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted https media content from dual-socket Xeons with only 13% (as measured by pcm.x) cross domain traffic on the memory controller. Reviewed by: jhb, bz (earlier version), bcr (man page) Tested by: gonzo Sponsored by: Netfix Differential Revision: https://reviews.freebsd.org/D21636
2020-12-19 22:04:46 +00:00
struct inpcb *local_wild, *numa_wild;
const struct inpcblbgrouphead *hdr;
struct inpcblbgroup *grp;
uint32_t idx;
INP_HASH_LOCK_ASSERT(pcbinfo);
hdr = &pcbinfo->ipi_lbgrouphashbase[
INP_PCBPORTHASH(lport, pcbinfo->ipi_lbgrouphashmask)];
/*
* Order of socket selection:
* 1. non-wild.
* 2. wild (if lookupflags contains INPLOOKUP_WILDCARD).
*
* NOTE:
* - Load balanced group does not contain jailed sockets.
* - Load balanced does not contain IPv4 mapped INET6 wild sockets.
*/
local_wild = NULL;
Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain In order to efficiently serve web traffic on a NUMA machine, one must avoid as many NUMA domain crossings as possible. With SO_REUSEPORT_LB, a number of workers can share a listen socket. However, even if a worker sets affinity to a core or set of cores on a NUMA domain, it will receive connections associated with all NUMA domains in the system. This will lead to cross-domain traffic when the server writes to the socket or calls sendfile(), and memory is allocated on the server's local NUMA node, but transmitted on the NUMA node associated with the TCP connection. Similarly, when the server reads from the socket, he will likely be reading memory allocated on the NUMA domain associated with the TCP connection. This change provides a new socket ioctl, TCP_REUSPORT_LB_NUMA. A server can now tell the kernel to filter traffic so that only incoming connections associated with the desired NUMA domain are given to the server. (Of course, in the case where there are no servers sharing the listen socket on some domain, then as a fallback, traffic will be hashed as normal to all servers sharing the listen socket regardless of domain). This allows a server to deal only with traffic that is local to its NUMA domain, and avoids cross-domain traffic in most cases. This patch, and a corresponding small patch to nginx to use TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted https media content from dual-socket Xeons with only 13% (as measured by pcm.x) cross domain traffic on the memory controller. Reviewed by: jhb, bz (earlier version), bcr (man page) Tested by: gonzo Sponsored by: Netfix Differential Revision: https://reviews.freebsd.org/D21636
2020-12-19 22:04:46 +00:00
numa_wild = NULL;
CK_LIST_FOREACH(grp, hdr, il_list) {
#ifdef INET
if (!(grp->il_vflag & INP_IPV6))
continue;
#endif
if (grp->il_lport != lport)
continue;
idx = INP_PCBLBGROUP_PKTHASH(INP6_PCBHASHKEY(faddr), lport,
fport) % grp->il_inpcnt;
Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain In order to efficiently serve web traffic on a NUMA machine, one must avoid as many NUMA domain crossings as possible. With SO_REUSEPORT_LB, a number of workers can share a listen socket. However, even if a worker sets affinity to a core or set of cores on a NUMA domain, it will receive connections associated with all NUMA domains in the system. This will lead to cross-domain traffic when the server writes to the socket or calls sendfile(), and memory is allocated on the server's local NUMA node, but transmitted on the NUMA node associated with the TCP connection. Similarly, when the server reads from the socket, he will likely be reading memory allocated on the NUMA domain associated with the TCP connection. This change provides a new socket ioctl, TCP_REUSPORT_LB_NUMA. A server can now tell the kernel to filter traffic so that only incoming connections associated with the desired NUMA domain are given to the server. (Of course, in the case where there are no servers sharing the listen socket on some domain, then as a fallback, traffic will be hashed as normal to all servers sharing the listen socket regardless of domain). This allows a server to deal only with traffic that is local to its NUMA domain, and avoids cross-domain traffic in most cases. This patch, and a corresponding small patch to nginx to use TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted https media content from dual-socket Xeons with only 13% (as measured by pcm.x) cross domain traffic on the memory controller. Reviewed by: jhb, bz (earlier version), bcr (man page) Tested by: gonzo Sponsored by: Netfix Differential Revision: https://reviews.freebsd.org/D21636
2020-12-19 22:04:46 +00:00
if (IN6_ARE_ADDR_EQUAL(&grp->il6_laddr, laddr)) {
if (numa_domain == M_NODOM ||
grp->il_numa_domain == numa_domain) {
return (grp->il_inp[idx]);
}
else
numa_wild = grp->il_inp[idx];
}
if (IN6_IS_ADDR_UNSPECIFIED(&grp->il6_laddr) &&
Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain In order to efficiently serve web traffic on a NUMA machine, one must avoid as many NUMA domain crossings as possible. With SO_REUSEPORT_LB, a number of workers can share a listen socket. However, even if a worker sets affinity to a core or set of cores on a NUMA domain, it will receive connections associated with all NUMA domains in the system. This will lead to cross-domain traffic when the server writes to the socket or calls sendfile(), and memory is allocated on the server's local NUMA node, but transmitted on the NUMA node associated with the TCP connection. Similarly, when the server reads from the socket, he will likely be reading memory allocated on the NUMA domain associated with the TCP connection. This change provides a new socket ioctl, TCP_REUSPORT_LB_NUMA. A server can now tell the kernel to filter traffic so that only incoming connections associated with the desired NUMA domain are given to the server. (Of course, in the case where there are no servers sharing the listen socket on some domain, then as a fallback, traffic will be hashed as normal to all servers sharing the listen socket regardless of domain). This allows a server to deal only with traffic that is local to its NUMA domain, and avoids cross-domain traffic in most cases. This patch, and a corresponding small patch to nginx to use TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted https media content from dual-socket Xeons with only 13% (as measured by pcm.x) cross domain traffic on the memory controller. Reviewed by: jhb, bz (earlier version), bcr (man page) Tested by: gonzo Sponsored by: Netfix Differential Revision: https://reviews.freebsd.org/D21636
2020-12-19 22:04:46 +00:00
(lookupflags & INPLOOKUP_WILDCARD) != 0 &&
(local_wild == NULL || numa_domain == M_NODOM ||
grp->il_numa_domain == numa_domain)) {
local_wild = grp->il_inp[idx];
Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain In order to efficiently serve web traffic on a NUMA machine, one must avoid as many NUMA domain crossings as possible. With SO_REUSEPORT_LB, a number of workers can share a listen socket. However, even if a worker sets affinity to a core or set of cores on a NUMA domain, it will receive connections associated with all NUMA domains in the system. This will lead to cross-domain traffic when the server writes to the socket or calls sendfile(), and memory is allocated on the server's local NUMA node, but transmitted on the NUMA node associated with the TCP connection. Similarly, when the server reads from the socket, he will likely be reading memory allocated on the NUMA domain associated with the TCP connection. This change provides a new socket ioctl, TCP_REUSPORT_LB_NUMA. A server can now tell the kernel to filter traffic so that only incoming connections associated with the desired NUMA domain are given to the server. (Of course, in the case where there are no servers sharing the listen socket on some domain, then as a fallback, traffic will be hashed as normal to all servers sharing the listen socket regardless of domain). This allows a server to deal only with traffic that is local to its NUMA domain, and avoids cross-domain traffic in most cases. This patch, and a corresponding small patch to nginx to use TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted https media content from dual-socket Xeons with only 13% (as measured by pcm.x) cross domain traffic on the memory controller. Reviewed by: jhb, bz (earlier version), bcr (man page) Tested by: gonzo Sponsored by: Netfix Differential Revision: https://reviews.freebsd.org/D21636
2020-12-19 22:04:46 +00:00
}
}
Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain In order to efficiently serve web traffic on a NUMA machine, one must avoid as many NUMA domain crossings as possible. With SO_REUSEPORT_LB, a number of workers can share a listen socket. However, even if a worker sets affinity to a core or set of cores on a NUMA domain, it will receive connections associated with all NUMA domains in the system. This will lead to cross-domain traffic when the server writes to the socket or calls sendfile(), and memory is allocated on the server's local NUMA node, but transmitted on the NUMA node associated with the TCP connection. Similarly, when the server reads from the socket, he will likely be reading memory allocated on the NUMA domain associated with the TCP connection. This change provides a new socket ioctl, TCP_REUSPORT_LB_NUMA. A server can now tell the kernel to filter traffic so that only incoming connections associated with the desired NUMA domain are given to the server. (Of course, in the case where there are no servers sharing the listen socket on some domain, then as a fallback, traffic will be hashed as normal to all servers sharing the listen socket regardless of domain). This allows a server to deal only with traffic that is local to its NUMA domain, and avoids cross-domain traffic in most cases. This patch, and a corresponding small patch to nginx to use TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted https media content from dual-socket Xeons with only 13% (as measured by pcm.x) cross domain traffic on the memory controller. Reviewed by: jhb, bz (earlier version), bcr (man page) Tested by: gonzo Sponsored by: Netfix Differential Revision: https://reviews.freebsd.org/D21636
2020-12-19 22:04:46 +00:00
if (numa_wild != NULL)
return (numa_wild);
return (local_wild);
}
Implement a CPU-affine TCP and UDP connection lookup data structure, struct inpcbgroup. pcbgroups, or "connection groups", supplement the existing inpcbinfo connection hash table, which when pcbgroups are enabled, might now be thought of more usefully as a per-protocol 4-tuple reservation table. Connections are assigned to connection groups base on a hash of their 4-tuple; wildcard sockets require special handling, and are members of all connection groups. During a connection lookup, a per-connection group lock is employed rather than the global pcbinfo lock. By aligning connection groups with input path processing, connection groups take on an effective CPU affinity, especially when aligned with RSS work placement (see a forthcoming commit for details). This eliminates cache line migration associated with global, protocol-layer data structures in steady state TCP and UDP processing (with the exception of protocol-layer statistics; further commit to follow). Elements of this approach were inspired by Willman, Rixner, and Cox's 2006 USENIX paper, "An Evaluation of Network Stack Parallelization Strategies in Modern Operating Systems". However, there are also significant differences: we maintain the inpcb lock, rather than using the connection group lock for per-connection state. Likewise, the focus of this implementation is alignment with NIC packet distribution strategies such as RSS, rather than pure software strategies. Despite that focus, software distribution is supported through the parallel netisr implementation, and works well in configurations where the number of hardware threads is greater than the number of NIC input queues, such as in the RMI XLR threaded MIPS architecture. Another important difference is the continued maintenance of existing hash tables as "reservation tables" -- these are useful both to distinguish the resource allocation aspect of protocol name management and the more common-case lookup aspect. In configurations where connection tables are aligned with hardware hashes, it is desirable to use the traditional lookup tables for loopback or encapsulated traffic rather than take the expense of hardware hashes that are hard to implement efficiently in software (such as RSS Toeplitz). Connection group support is enabled by compiling "options PCBGROUP" into your kernel configuration; for the time being, this is an experimental feature, and hence is not enabled by default. Subject to the limited MFCability of change dependencies in inpcb, and its change to the inpcbinfo init function signature, this change in principle could be merged to FreeBSD 8.x. Reviewed by: bz Sponsored by: Juniper Networks, Inc.
2011-06-06 12:55:02 +00:00
#ifdef PCBGROUP
/*
* Lookup PCB in hash list, using pcbgroup tables.
*/
static struct inpcb *
in6_pcblookup_group(struct inpcbinfo *pcbinfo, struct inpcbgroup *pcbgroup,
struct in6_addr *faddr, u_int fport_arg, struct in6_addr *laddr,
u_int lport_arg, int lookupflags, struct ifnet *ifp)
{
struct inpcbhead *head;
struct inpcb *inp, *tmpinp;
u_short fport = fport_arg, lport = lport_arg;
bool locked;
Implement a CPU-affine TCP and UDP connection lookup data structure, struct inpcbgroup. pcbgroups, or "connection groups", supplement the existing inpcbinfo connection hash table, which when pcbgroups are enabled, might now be thought of more usefully as a per-protocol 4-tuple reservation table. Connections are assigned to connection groups base on a hash of their 4-tuple; wildcard sockets require special handling, and are members of all connection groups. During a connection lookup, a per-connection group lock is employed rather than the global pcbinfo lock. By aligning connection groups with input path processing, connection groups take on an effective CPU affinity, especially when aligned with RSS work placement (see a forthcoming commit for details). This eliminates cache line migration associated with global, protocol-layer data structures in steady state TCP and UDP processing (with the exception of protocol-layer statistics; further commit to follow). Elements of this approach were inspired by Willman, Rixner, and Cox's 2006 USENIX paper, "An Evaluation of Network Stack Parallelization Strategies in Modern Operating Systems". However, there are also significant differences: we maintain the inpcb lock, rather than using the connection group lock for per-connection state. Likewise, the focus of this implementation is alignment with NIC packet distribution strategies such as RSS, rather than pure software strategies. Despite that focus, software distribution is supported through the parallel netisr implementation, and works well in configurations where the number of hardware threads is greater than the number of NIC input queues, such as in the RMI XLR threaded MIPS architecture. Another important difference is the continued maintenance of existing hash tables as "reservation tables" -- these are useful both to distinguish the resource allocation aspect of protocol name management and the more common-case lookup aspect. In configurations where connection tables are aligned with hardware hashes, it is desirable to use the traditional lookup tables for loopback or encapsulated traffic rather than take the expense of hardware hashes that are hard to implement efficiently in software (such as RSS Toeplitz). Connection group support is enabled by compiling "options PCBGROUP" into your kernel configuration; for the time being, this is an experimental feature, and hence is not enabled by default. Subject to the limited MFCability of change dependencies in inpcb, and its change to the inpcbinfo init function signature, this change in principle could be merged to FreeBSD 8.x. Reviewed by: bz Sponsored by: Juniper Networks, Inc.
2011-06-06 12:55:02 +00:00
/*
* First look for an exact match.
*/
tmpinp = NULL;
INP_GROUP_LOCK(pcbgroup);
head = &pcbgroup->ipg_hashbase[INP_PCBHASH(
INP6_PCBHASHKEY(faddr), lport, fport, pcbgroup->ipg_hashmask)];
CK_LIST_FOREACH(inp, head, inp_pcbgrouphash) {
Implement a CPU-affine TCP and UDP connection lookup data structure, struct inpcbgroup. pcbgroups, or "connection groups", supplement the existing inpcbinfo connection hash table, which when pcbgroups are enabled, might now be thought of more usefully as a per-protocol 4-tuple reservation table. Connections are assigned to connection groups base on a hash of their 4-tuple; wildcard sockets require special handling, and are members of all connection groups. During a connection lookup, a per-connection group lock is employed rather than the global pcbinfo lock. By aligning connection groups with input path processing, connection groups take on an effective CPU affinity, especially when aligned with RSS work placement (see a forthcoming commit for details). This eliminates cache line migration associated with global, protocol-layer data structures in steady state TCP and UDP processing (with the exception of protocol-layer statistics; further commit to follow). Elements of this approach were inspired by Willman, Rixner, and Cox's 2006 USENIX paper, "An Evaluation of Network Stack Parallelization Strategies in Modern Operating Systems". However, there are also significant differences: we maintain the inpcb lock, rather than using the connection group lock for per-connection state. Likewise, the focus of this implementation is alignment with NIC packet distribution strategies such as RSS, rather than pure software strategies. Despite that focus, software distribution is supported through the parallel netisr implementation, and works well in configurations where the number of hardware threads is greater than the number of NIC input queues, such as in the RMI XLR threaded MIPS architecture. Another important difference is the continued maintenance of existing hash tables as "reservation tables" -- these are useful both to distinguish the resource allocation aspect of protocol name management and the more common-case lookup aspect. In configurations where connection tables are aligned with hardware hashes, it is desirable to use the traditional lookup tables for loopback or encapsulated traffic rather than take the expense of hardware hashes that are hard to implement efficiently in software (such as RSS Toeplitz). Connection group support is enabled by compiling "options PCBGROUP" into your kernel configuration; for the time being, this is an experimental feature, and hence is not enabled by default. Subject to the limited MFCability of change dependencies in inpcb, and its change to the inpcbinfo init function signature, this change in principle could be merged to FreeBSD 8.x. Reviewed by: bz Sponsored by: Juniper Networks, Inc.
2011-06-06 12:55:02 +00:00
/* XXX inp locking */
if ((inp->inp_vflag & INP_IPV6) == 0)
continue;
if (IN6_ARE_ADDR_EQUAL(&inp->in6p_faddr, faddr) &&
IN6_ARE_ADDR_EQUAL(&inp->in6p_laddr, laddr) &&
inp->inp_fport == fport &&
inp->inp_lport == lport) {
/*
* XXX We should be able to directly return
* the inp here, without any checks.
* Well unless both bound with SO_REUSEPORT?
*/
if (prison_flag(inp->inp_cred, PR_IP6))
goto found;
if (tmpinp == NULL)
tmpinp = inp;
}
}
if (tmpinp != NULL) {
inp = tmpinp;
goto found;
}
/*
* Then look for a wildcard match in the pcbgroup.
*/
if ((lookupflags & INPLOOKUP_WILDCARD) != 0) {
struct inpcb *local_wild = NULL, *local_exact = NULL;
struct inpcb *jail_wild = NULL;
int injail;
/*
* Order of socket selection - we always prefer jails.
* 1. jailed, non-wild.
* 2. jailed, wild.
* 3. non-jailed, non-wild.
* 4. non-jailed, wild.
*/
head = &pcbgroup->ipg_hashbase[
INP_PCBHASH(INADDR_ANY, lport, 0, pcbgroup->ipg_hashmask)];
CK_LIST_FOREACH(inp, head, inp_pcbgrouphash) {
/* XXX inp locking */
if ((inp->inp_vflag & INP_IPV6) == 0)
continue;
if (!IN6_IS_ADDR_UNSPECIFIED(&inp->in6p_faddr) ||
inp->inp_lport != lport) {
continue;
}
injail = prison_flag(inp->inp_cred, PR_IP6);
if (injail) {
if (prison_check_ip6(inp->inp_cred,
laddr) != 0)
continue;
} else {
if (local_exact != NULL)
continue;
}
if (IN6_ARE_ADDR_EQUAL(&inp->in6p_laddr, laddr)) {
if (injail)
goto found;
else
local_exact = inp;
} else if (IN6_IS_ADDR_UNSPECIFIED(&inp->in6p_laddr)) {
if (injail)
jail_wild = inp;
else
local_wild = inp;
}
} /* LIST_FOREACH */
inp = jail_wild;
if (inp == NULL)
inp = jail_wild;
if (inp == NULL)
inp = local_exact;
if (inp == NULL)
inp = local_wild;
if (inp != NULL)
goto found;
}
Implement a CPU-affine TCP and UDP connection lookup data structure, struct inpcbgroup. pcbgroups, or "connection groups", supplement the existing inpcbinfo connection hash table, which when pcbgroups are enabled, might now be thought of more usefully as a per-protocol 4-tuple reservation table. Connections are assigned to connection groups base on a hash of their 4-tuple; wildcard sockets require special handling, and are members of all connection groups. During a connection lookup, a per-connection group lock is employed rather than the global pcbinfo lock. By aligning connection groups with input path processing, connection groups take on an effective CPU affinity, especially when aligned with RSS work placement (see a forthcoming commit for details). This eliminates cache line migration associated with global, protocol-layer data structures in steady state TCP and UDP processing (with the exception of protocol-layer statistics; further commit to follow). Elements of this approach were inspired by Willman, Rixner, and Cox's 2006 USENIX paper, "An Evaluation of Network Stack Parallelization Strategies in Modern Operating Systems". However, there are also significant differences: we maintain the inpcb lock, rather than using the connection group lock for per-connection state. Likewise, the focus of this implementation is alignment with NIC packet distribution strategies such as RSS, rather than pure software strategies. Despite that focus, software distribution is supported through the parallel netisr implementation, and works well in configurations where the number of hardware threads is greater than the number of NIC input queues, such as in the RMI XLR threaded MIPS architecture. Another important difference is the continued maintenance of existing hash tables as "reservation tables" -- these are useful both to distinguish the resource allocation aspect of protocol name management and the more common-case lookup aspect. In configurations where connection tables are aligned with hardware hashes, it is desirable to use the traditional lookup tables for loopback or encapsulated traffic rather than take the expense of hardware hashes that are hard to implement efficiently in software (such as RSS Toeplitz). Connection group support is enabled by compiling "options PCBGROUP" into your kernel configuration; for the time being, this is an experimental feature, and hence is not enabled by default. Subject to the limited MFCability of change dependencies in inpcb, and its change to the inpcbinfo init function signature, this change in principle could be merged to FreeBSD 8.x. Reviewed by: bz Sponsored by: Juniper Networks, Inc.
2011-06-06 12:55:02 +00:00
/*
* Then look for a wildcard match, if requested.
*/
if ((lookupflags & INPLOOKUP_WILDCARD) != 0) {
struct inpcb *local_wild = NULL, *local_exact = NULL;
struct inpcb *jail_wild = NULL;
int injail;
/*
* Order of socket selection - we always prefer jails.
* 1. jailed, non-wild.
* 2. jailed, wild.
* 3. non-jailed, non-wild.
* 4. non-jailed, wild.
*/
head = &pcbinfo->ipi_wildbase[INP_PCBHASH(
INP6_PCBHASHKEY(&in6addr_any), lport, 0,
pcbinfo->ipi_wildmask)];
CK_LIST_FOREACH(inp, head, inp_pcbgroup_wild) {
Implement a CPU-affine TCP and UDP connection lookup data structure, struct inpcbgroup. pcbgroups, or "connection groups", supplement the existing inpcbinfo connection hash table, which when pcbgroups are enabled, might now be thought of more usefully as a per-protocol 4-tuple reservation table. Connections are assigned to connection groups base on a hash of their 4-tuple; wildcard sockets require special handling, and are members of all connection groups. During a connection lookup, a per-connection group lock is employed rather than the global pcbinfo lock. By aligning connection groups with input path processing, connection groups take on an effective CPU affinity, especially when aligned with RSS work placement (see a forthcoming commit for details). This eliminates cache line migration associated with global, protocol-layer data structures in steady state TCP and UDP processing (with the exception of protocol-layer statistics; further commit to follow). Elements of this approach were inspired by Willman, Rixner, and Cox's 2006 USENIX paper, "An Evaluation of Network Stack Parallelization Strategies in Modern Operating Systems". However, there are also significant differences: we maintain the inpcb lock, rather than using the connection group lock for per-connection state. Likewise, the focus of this implementation is alignment with NIC packet distribution strategies such as RSS, rather than pure software strategies. Despite that focus, software distribution is supported through the parallel netisr implementation, and works well in configurations where the number of hardware threads is greater than the number of NIC input queues, such as in the RMI XLR threaded MIPS architecture. Another important difference is the continued maintenance of existing hash tables as "reservation tables" -- these are useful both to distinguish the resource allocation aspect of protocol name management and the more common-case lookup aspect. In configurations where connection tables are aligned with hardware hashes, it is desirable to use the traditional lookup tables for loopback or encapsulated traffic rather than take the expense of hardware hashes that are hard to implement efficiently in software (such as RSS Toeplitz). Connection group support is enabled by compiling "options PCBGROUP" into your kernel configuration; for the time being, this is an experimental feature, and hence is not enabled by default. Subject to the limited MFCability of change dependencies in inpcb, and its change to the inpcbinfo init function signature, this change in principle could be merged to FreeBSD 8.x. Reviewed by: bz Sponsored by: Juniper Networks, Inc.
2011-06-06 12:55:02 +00:00
/* XXX inp locking */
if ((inp->inp_vflag & INP_IPV6) == 0)
continue;
if (!IN6_IS_ADDR_UNSPECIFIED(&inp->in6p_faddr) ||
inp->inp_lport != lport) {
continue;
}
injail = prison_flag(inp->inp_cred, PR_IP6);
if (injail) {
if (prison_check_ip6(inp->inp_cred,
laddr) != 0)
continue;
} else {
if (local_exact != NULL)
continue;
}
if (IN6_ARE_ADDR_EQUAL(&inp->in6p_laddr, laddr)) {
if (injail)
goto found;
else
local_exact = inp;
} else if (IN6_IS_ADDR_UNSPECIFIED(&inp->in6p_laddr)) {
if (injail)
jail_wild = inp;
else
local_wild = inp;
}
} /* LIST_FOREACH */
inp = jail_wild;
if (inp == NULL)
inp = jail_wild;
if (inp == NULL)
inp = local_exact;
if (inp == NULL)
inp = local_wild;
if (inp != NULL)
goto found;
} /* if ((lookupflags & INPLOOKUP_WILDCARD) != 0) */
INP_GROUP_UNLOCK(pcbgroup);
return (NULL);
found:
if (lookupflags & INPLOOKUP_WLOCKPCB)
locked = INP_TRY_WLOCK(inp);
else if (lookupflags & INPLOOKUP_RLOCKPCB)
locked = INP_TRY_RLOCK(inp);
else
Implement a CPU-affine TCP and UDP connection lookup data structure, struct inpcbgroup. pcbgroups, or "connection groups", supplement the existing inpcbinfo connection hash table, which when pcbgroups are enabled, might now be thought of more usefully as a per-protocol 4-tuple reservation table. Connections are assigned to connection groups base on a hash of their 4-tuple; wildcard sockets require special handling, and are members of all connection groups. During a connection lookup, a per-connection group lock is employed rather than the global pcbinfo lock. By aligning connection groups with input path processing, connection groups take on an effective CPU affinity, especially when aligned with RSS work placement (see a forthcoming commit for details). This eliminates cache line migration associated with global, protocol-layer data structures in steady state TCP and UDP processing (with the exception of protocol-layer statistics; further commit to follow). Elements of this approach were inspired by Willman, Rixner, and Cox's 2006 USENIX paper, "An Evaluation of Network Stack Parallelization Strategies in Modern Operating Systems". However, there are also significant differences: we maintain the inpcb lock, rather than using the connection group lock for per-connection state. Likewise, the focus of this implementation is alignment with NIC packet distribution strategies such as RSS, rather than pure software strategies. Despite that focus, software distribution is supported through the parallel netisr implementation, and works well in configurations where the number of hardware threads is greater than the number of NIC input queues, such as in the RMI XLR threaded MIPS architecture. Another important difference is the continued maintenance of existing hash tables as "reservation tables" -- these are useful both to distinguish the resource allocation aspect of protocol name management and the more common-case lookup aspect. In configurations where connection tables are aligned with hardware hashes, it is desirable to use the traditional lookup tables for loopback or encapsulated traffic rather than take the expense of hardware hashes that are hard to implement efficiently in software (such as RSS Toeplitz). Connection group support is enabled by compiling "options PCBGROUP" into your kernel configuration; for the time being, this is an experimental feature, and hence is not enabled by default. Subject to the limited MFCability of change dependencies in inpcb, and its change to the inpcbinfo init function signature, this change in principle could be merged to FreeBSD 8.x. Reviewed by: bz Sponsored by: Juniper Networks, Inc.
2011-06-06 12:55:02 +00:00
panic("%s: locking buf", __func__);
if (!locked)
in_pcbref(inp);
INP_GROUP_UNLOCK(pcbgroup);
if (!locked) {
if (lookupflags & INPLOOKUP_WLOCKPCB) {
INP_WLOCK(inp);
if (in_pcbrele_wlocked(inp))
return (NULL);
} else {
INP_RLOCK(inp);
if (in_pcbrele_rlocked(inp))
return (NULL);
}
}
#ifdef INVARIANTS
if (lookupflags & INPLOOKUP_WLOCKPCB)
INP_WLOCK_ASSERT(inp);
else
INP_RLOCK_ASSERT(inp);
#endif
Implement a CPU-affine TCP and UDP connection lookup data structure, struct inpcbgroup. pcbgroups, or "connection groups", supplement the existing inpcbinfo connection hash table, which when pcbgroups are enabled, might now be thought of more usefully as a per-protocol 4-tuple reservation table. Connections are assigned to connection groups base on a hash of their 4-tuple; wildcard sockets require special handling, and are members of all connection groups. During a connection lookup, a per-connection group lock is employed rather than the global pcbinfo lock. By aligning connection groups with input path processing, connection groups take on an effective CPU affinity, especially when aligned with RSS work placement (see a forthcoming commit for details). This eliminates cache line migration associated with global, protocol-layer data structures in steady state TCP and UDP processing (with the exception of protocol-layer statistics; further commit to follow). Elements of this approach were inspired by Willman, Rixner, and Cox's 2006 USENIX paper, "An Evaluation of Network Stack Parallelization Strategies in Modern Operating Systems". However, there are also significant differences: we maintain the inpcb lock, rather than using the connection group lock for per-connection state. Likewise, the focus of this implementation is alignment with NIC packet distribution strategies such as RSS, rather than pure software strategies. Despite that focus, software distribution is supported through the parallel netisr implementation, and works well in configurations where the number of hardware threads is greater than the number of NIC input queues, such as in the RMI XLR threaded MIPS architecture. Another important difference is the continued maintenance of existing hash tables as "reservation tables" -- these are useful both to distinguish the resource allocation aspect of protocol name management and the more common-case lookup aspect. In configurations where connection tables are aligned with hardware hashes, it is desirable to use the traditional lookup tables for loopback or encapsulated traffic rather than take the expense of hardware hashes that are hard to implement efficiently in software (such as RSS Toeplitz). Connection group support is enabled by compiling "options PCBGROUP" into your kernel configuration; for the time being, this is an experimental feature, and hence is not enabled by default. Subject to the limited MFCability of change dependencies in inpcb, and its change to the inpcbinfo init function signature, this change in principle could be merged to FreeBSD 8.x. Reviewed by: bz Sponsored by: Juniper Networks, Inc.
2011-06-06 12:55:02 +00:00
return (inp);
}
#endif /* PCBGROUP */
/*
* Lookup PCB in hash list. Used in in_pcb.c as well as here.
*/
struct inpcb *
Decompose the current single inpcbinfo lock into two locks: - The existing ipi_lock continues to protect the global inpcb list and inpcb counter. This lock is now relegated to a small number of allocation and free operations, and occasional operations that walk all connections (including, awkwardly, certain UDP multicast receive operations -- something to revisit). - A new ipi_hash_lock protects the two inpcbinfo hash tables for looking up connections and bound sockets, manipulated using new INP_HASH_*() macros. This lock, combined with inpcb locks, protects the 4-tuple address space. Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb connection locks, so may be acquired while manipulating a connection on which a lock is already held, avoiding the need to acquire the inpcbinfo lock preemptively when a binding change might later be required. As a result, however, lookup operations necessarily go through a reference acquire while holding the lookup lock, later acquiring an inpcb lock -- if required. A new function in_pcblookup() looks up connections, and accepts flags indicating how to return the inpcb. Due to lock order changes, callers no longer need acquire locks before performing a lookup: the lookup routine will acquire the ipi_hash_lock as needed. In the future, it will also be able to use alternative lookup and locking strategies transparently to callers, such as pcbgroup lookup. New lookup flags are, supplementing the existing INPLOOKUP_WILDCARD flag: INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb Callers must pass exactly one of these flags (for the time being). Some notes: - All protocols are updated to work within the new regime; especially, TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely eliminated, and global hash lock hold times are dramatically reduced compared to previous locking. - The TCP syncache still relies on the pcbinfo lock, something that we may want to revisit. - Support for reverting to the FreeBSD 7.x locking strategy in TCP input is no longer available -- hash lookup locks are now held only very briefly during inpcb lookup, rather than for potentially extended periods. However, the pcbinfo ipi_lock will still be acquired if a connection state might change such that a connection is added or removed. - Raw IP sockets continue to use the pcbinfo ipi_lock for protection, due to maintaining their own hash tables. - The interface in6_pcblookup_hash_locked() is maintained, which allows callers to acquire hash locks and perform one or more lookups atomically with 4-tuple allocation: this is required only for TCPv6, as there is no in6_pcbconnect_setup(), which there should be. - UDPv6 locking remains significantly more conservative than UDPv4 locking, which relates to source address selection. This needs attention, as it likely significantly reduces parallelism in this code for multithreaded socket use (such as in BIND). - In the UDPv4 and UDPv6 multicast cases, we need to revisit locking somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which is no longer sufficient. A second check once the inpcb lock is held should do the trick, keeping the general case from requiring the inpcb lock for every inpcb visited. - This work reminds us that we need to revisit locking of the v4/v6 flags, which may be accessed lock-free both before and after this change. - Right now, a single lock name is used for the pcbhash lock -- this is undesirable, and probably another argument is required to take care of this (or a char array name field in the pcbinfo?). This is not an MFC candidate for 8.x due to its impact on lookup and locking semantics. It's possible some of these issues could be worked around with compatibility wrappers, if necessary. Reviewed by: bz Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
in6_pcblookup_hash_locked(struct inpcbinfo *pcbinfo, struct in6_addr *faddr,
u_int fport_arg, struct in6_addr *laddr, u_int lport_arg,
Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain In order to efficiently serve web traffic on a NUMA machine, one must avoid as many NUMA domain crossings as possible. With SO_REUSEPORT_LB, a number of workers can share a listen socket. However, even if a worker sets affinity to a core or set of cores on a NUMA domain, it will receive connections associated with all NUMA domains in the system. This will lead to cross-domain traffic when the server writes to the socket or calls sendfile(), and memory is allocated on the server's local NUMA node, but transmitted on the NUMA node associated with the TCP connection. Similarly, when the server reads from the socket, he will likely be reading memory allocated on the NUMA domain associated with the TCP connection. This change provides a new socket ioctl, TCP_REUSPORT_LB_NUMA. A server can now tell the kernel to filter traffic so that only incoming connections associated with the desired NUMA domain are given to the server. (Of course, in the case where there are no servers sharing the listen socket on some domain, then as a fallback, traffic will be hashed as normal to all servers sharing the listen socket regardless of domain). This allows a server to deal only with traffic that is local to its NUMA domain, and avoids cross-domain traffic in most cases. This patch, and a corresponding small patch to nginx to use TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted https media content from dual-socket Xeons with only 13% (as measured by pcm.x) cross domain traffic on the memory controller. Reviewed by: jhb, bz (earlier version), bcr (man page) Tested by: gonzo Sponsored by: Netfix Differential Revision: https://reviews.freebsd.org/D21636
2020-12-19 22:04:46 +00:00
int lookupflags, struct ifnet *ifp, uint8_t numa_domain)
{
struct inpcbhead *head;
MFp4: Bring in updated jail support from bz_jail branch. This enhances the current jail implementation to permit multiple addresses per jail. In addtion to IPv4, IPv6 is supported as well. Due to updated checks it is even possible to have jails without an IP address at all, which basically gives one a chroot with restricted process view, no networking,.. SCTP support was updated and supports IPv6 in jails as well. Cpuset support permits jails to be bound to specific processor sets after creation. Jails can have an unrestricted (no duplicate protection, etc.) name in addition to the hostname. The jail name cannot be changed from within a jail and is considered to be used for management purposes or as audit-token in the future. DDB 'show jails' command was added to aid debugging. Proper compat support permits 32bit jail binaries to be used on 64bit systems to manage jails. Also backward compatibility was preserved where possible: for jail v1 syscalls, as well as with user space management utilities. Both jail as well as prison version were updated for the new features. A gap was intentionally left as the intermediate versions had been used by various patches floating around the last years. Bump __FreeBSD_version for the afore mentioned and in kernel changes. Special thanks to: - Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches and Olivier Houchard (cognet) for initial single-IPv6 patches. - Jeff Roberson (jeff) and Randall Stewart (rrs) for their help, ideas and review on cpuset and SCTP support. - Robert Watson (rwatson) for lots and lots of help, discussions, suggestions and review of most of the patch at various stages. - John Baldwin (jhb) for his help. - Simon L. Nielsen (simon) as early adopter testing changes on cluster machines as well as all the testers and people who provided feedback the last months on freebsd-jail and other channels. - My employer, CK Software GmbH, for the support so I could work on this. Reviewed by: (see above) MFC after: 3 months (this is just so that I get the mail) X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
struct inpcb *inp, *tmpinp;
u_short fport = fport_arg, lport = lport_arg;
KASSERT((lookupflags & ~(INPLOOKUP_WILDCARD)) == 0,
("%s: invalid lookup flags %d", __func__, lookupflags));
Decompose the current single inpcbinfo lock into two locks: - The existing ipi_lock continues to protect the global inpcb list and inpcb counter. This lock is now relegated to a small number of allocation and free operations, and occasional operations that walk all connections (including, awkwardly, certain UDP multicast receive operations -- something to revisit). - A new ipi_hash_lock protects the two inpcbinfo hash tables for looking up connections and bound sockets, manipulated using new INP_HASH_*() macros. This lock, combined with inpcb locks, protects the 4-tuple address space. Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb connection locks, so may be acquired while manipulating a connection on which a lock is already held, avoiding the need to acquire the inpcbinfo lock preemptively when a binding change might later be required. As a result, however, lookup operations necessarily go through a reference acquire while holding the lookup lock, later acquiring an inpcb lock -- if required. A new function in_pcblookup() looks up connections, and accepts flags indicating how to return the inpcb. Due to lock order changes, callers no longer need acquire locks before performing a lookup: the lookup routine will acquire the ipi_hash_lock as needed. In the future, it will also be able to use alternative lookup and locking strategies transparently to callers, such as pcbgroup lookup. New lookup flags are, supplementing the existing INPLOOKUP_WILDCARD flag: INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb Callers must pass exactly one of these flags (for the time being). Some notes: - All protocols are updated to work within the new regime; especially, TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely eliminated, and global hash lock hold times are dramatically reduced compared to previous locking. - The TCP syncache still relies on the pcbinfo lock, something that we may want to revisit. - Support for reverting to the FreeBSD 7.x locking strategy in TCP input is no longer available -- hash lookup locks are now held only very briefly during inpcb lookup, rather than for potentially extended periods. However, the pcbinfo ipi_lock will still be acquired if a connection state might change such that a connection is added or removed. - Raw IP sockets continue to use the pcbinfo ipi_lock for protection, due to maintaining their own hash tables. - The interface in6_pcblookup_hash_locked() is maintained, which allows callers to acquire hash locks and perform one or more lookups atomically with 4-tuple allocation: this is required only for TCPv6, as there is no in6_pcbconnect_setup(), which there should be. - UDPv6 locking remains significantly more conservative than UDPv4 locking, which relates to source address selection. This needs attention, as it likely significantly reduces parallelism in this code for multithreaded socket use (such as in BIND). - In the UDPv4 and UDPv6 multicast cases, we need to revisit locking somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which is no longer sufficient. A second check once the inpcb lock is held should do the trick, keeping the general case from requiring the inpcb lock for every inpcb visited. - This work reminds us that we need to revisit locking of the v4/v6 flags, which may be accessed lock-free both before and after this change. - Right now, a single lock name is used for the pcbhash lock -- this is undesirable, and probably another argument is required to take care of this (or a char array name field in the pcbinfo?). This is not an MFC candidate for 8.x due to its impact on lookup and locking semantics. It's possible some of these issues could be worked around with compatibility wrappers, if necessary. Reviewed by: bz Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
INP_HASH_LOCK_ASSERT(pcbinfo);
/*
* First look for an exact match.
*/
MFp4: Bring in updated jail support from bz_jail branch. This enhances the current jail implementation to permit multiple addresses per jail. In addtion to IPv4, IPv6 is supported as well. Due to updated checks it is even possible to have jails without an IP address at all, which basically gives one a chroot with restricted process view, no networking,.. SCTP support was updated and supports IPv6 in jails as well. Cpuset support permits jails to be bound to specific processor sets after creation. Jails can have an unrestricted (no duplicate protection, etc.) name in addition to the hostname. The jail name cannot be changed from within a jail and is considered to be used for management purposes or as audit-token in the future. DDB 'show jails' command was added to aid debugging. Proper compat support permits 32bit jail binaries to be used on 64bit systems to manage jails. Also backward compatibility was preserved where possible: for jail v1 syscalls, as well as with user space management utilities. Both jail as well as prison version were updated for the new features. A gap was intentionally left as the intermediate versions had been used by various patches floating around the last years. Bump __FreeBSD_version for the afore mentioned and in kernel changes. Special thanks to: - Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches and Olivier Houchard (cognet) for initial single-IPv6 patches. - Jeff Roberson (jeff) and Randall Stewart (rrs) for their help, ideas and review on cpuset and SCTP support. - Robert Watson (rwatson) for lots and lots of help, discussions, suggestions and review of most of the patch at various stages. - John Baldwin (jhb) for his help. - Simon L. Nielsen (simon) as early adopter testing changes on cluster machines as well as all the testers and people who provided feedback the last months on freebsd-jail and other channels. - My employer, CK Software GmbH, for the support so I could work on this. Reviewed by: (see above) MFC after: 3 months (this is just so that I get the mail) X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
tmpinp = NULL;
head = &pcbinfo->ipi_hashbase[INP_PCBHASH(
INP6_PCBHASHKEY(faddr), lport, fport, pcbinfo->ipi_hashmask)];
CK_LIST_FOREACH(inp, head, inp_hash) {
MFp4: Bring in updated jail support from bz_jail branch. This enhances the current jail implementation to permit multiple addresses per jail. In addtion to IPv4, IPv6 is supported as well. Due to updated checks it is even possible to have jails without an IP address at all, which basically gives one a chroot with restricted process view, no networking,.. SCTP support was updated and supports IPv6 in jails as well. Cpuset support permits jails to be bound to specific processor sets after creation. Jails can have an unrestricted (no duplicate protection, etc.) name in addition to the hostname. The jail name cannot be changed from within a jail and is considered to be used for management purposes or as audit-token in the future. DDB 'show jails' command was added to aid debugging. Proper compat support permits 32bit jail binaries to be used on 64bit systems to manage jails. Also backward compatibility was preserved where possible: for jail v1 syscalls, as well as with user space management utilities. Both jail as well as prison version were updated for the new features. A gap was intentionally left as the intermediate versions had been used by various patches floating around the last years. Bump __FreeBSD_version for the afore mentioned and in kernel changes. Special thanks to: - Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches and Olivier Houchard (cognet) for initial single-IPv6 patches. - Jeff Roberson (jeff) and Randall Stewart (rrs) for their help, ideas and review on cpuset and SCTP support. - Robert Watson (rwatson) for lots and lots of help, discussions, suggestions and review of most of the patch at various stages. - John Baldwin (jhb) for his help. - Simon L. Nielsen (simon) as early adopter testing changes on cluster machines as well as all the testers and people who provided feedback the last months on freebsd-jail and other channels. - My employer, CK Software GmbH, for the support so I could work on this. Reviewed by: (see above) MFC after: 3 months (this is just so that I get the mail) X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
/* XXX inp locking */
1999-12-21 11:14:12 +00:00
if ((inp->inp_vflag & INP_IPV6) == 0)
continue;
if (IN6_ARE_ADDR_EQUAL(&inp->in6p_faddr, faddr) &&
IN6_ARE_ADDR_EQUAL(&inp->in6p_laddr, laddr) &&
inp->inp_fport == fport &&
inp->inp_lport == lport) {
/*
MFp4: Bring in updated jail support from bz_jail branch. This enhances the current jail implementation to permit multiple addresses per jail. In addtion to IPv4, IPv6 is supported as well. Due to updated checks it is even possible to have jails without an IP address at all, which basically gives one a chroot with restricted process view, no networking,.. SCTP support was updated and supports IPv6 in jails as well. Cpuset support permits jails to be bound to specific processor sets after creation. Jails can have an unrestricted (no duplicate protection, etc.) name in addition to the hostname. The jail name cannot be changed from within a jail and is considered to be used for management purposes or as audit-token in the future. DDB 'show jails' command was added to aid debugging. Proper compat support permits 32bit jail binaries to be used on 64bit systems to manage jails. Also backward compatibility was preserved where possible: for jail v1 syscalls, as well as with user space management utilities. Both jail as well as prison version were updated for the new features. A gap was intentionally left as the intermediate versions had been used by various patches floating around the last years. Bump __FreeBSD_version for the afore mentioned and in kernel changes. Special thanks to: - Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches and Olivier Houchard (cognet) for initial single-IPv6 patches. - Jeff Roberson (jeff) and Randall Stewart (rrs) for their help, ideas and review on cpuset and SCTP support. - Robert Watson (rwatson) for lots and lots of help, discussions, suggestions and review of most of the patch at various stages. - John Baldwin (jhb) for his help. - Simon L. Nielsen (simon) as early adopter testing changes on cluster machines as well as all the testers and people who provided feedback the last months on freebsd-jail and other channels. - My employer, CK Software GmbH, for the support so I could work on this. Reviewed by: (see above) MFC after: 3 months (this is just so that I get the mail) X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
* XXX We should be able to directly return
* the inp here, without any checks.
* Well unless both bound with SO_REUSEPORT?
*/
if (prison_flag(inp->inp_cred, PR_IP6))
MFp4: Bring in updated jail support from bz_jail branch. This enhances the current jail implementation to permit multiple addresses per jail. In addtion to IPv4, IPv6 is supported as well. Due to updated checks it is even possible to have jails without an IP address at all, which basically gives one a chroot with restricted process view, no networking,.. SCTP support was updated and supports IPv6 in jails as well. Cpuset support permits jails to be bound to specific processor sets after creation. Jails can have an unrestricted (no duplicate protection, etc.) name in addition to the hostname. The jail name cannot be changed from within a jail and is considered to be used for management purposes or as audit-token in the future. DDB 'show jails' command was added to aid debugging. Proper compat support permits 32bit jail binaries to be used on 64bit systems to manage jails. Also backward compatibility was preserved where possible: for jail v1 syscalls, as well as with user space management utilities. Both jail as well as prison version were updated for the new features. A gap was intentionally left as the intermediate versions had been used by various patches floating around the last years. Bump __FreeBSD_version for the afore mentioned and in kernel changes. Special thanks to: - Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches and Olivier Houchard (cognet) for initial single-IPv6 patches. - Jeff Roberson (jeff) and Randall Stewart (rrs) for their help, ideas and review on cpuset and SCTP support. - Robert Watson (rwatson) for lots and lots of help, discussions, suggestions and review of most of the patch at various stages. - John Baldwin (jhb) for his help. - Simon L. Nielsen (simon) as early adopter testing changes on cluster machines as well as all the testers and people who provided feedback the last months on freebsd-jail and other channels. - My employer, CK Software GmbH, for the support so I could work on this. Reviewed by: (see above) MFC after: 3 months (this is just so that I get the mail) X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
return (inp);
if (tmpinp == NULL)
tmpinp = inp;
}
}
MFp4: Bring in updated jail support from bz_jail branch. This enhances the current jail implementation to permit multiple addresses per jail. In addtion to IPv4, IPv6 is supported as well. Due to updated checks it is even possible to have jails without an IP address at all, which basically gives one a chroot with restricted process view, no networking,.. SCTP support was updated and supports IPv6 in jails as well. Cpuset support permits jails to be bound to specific processor sets after creation. Jails can have an unrestricted (no duplicate protection, etc.) name in addition to the hostname. The jail name cannot be changed from within a jail and is considered to be used for management purposes or as audit-token in the future. DDB 'show jails' command was added to aid debugging. Proper compat support permits 32bit jail binaries to be used on 64bit systems to manage jails. Also backward compatibility was preserved where possible: for jail v1 syscalls, as well as with user space management utilities. Both jail as well as prison version were updated for the new features. A gap was intentionally left as the intermediate versions had been used by various patches floating around the last years. Bump __FreeBSD_version for the afore mentioned and in kernel changes. Special thanks to: - Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches and Olivier Houchard (cognet) for initial single-IPv6 patches. - Jeff Roberson (jeff) and Randall Stewart (rrs) for their help, ideas and review on cpuset and SCTP support. - Robert Watson (rwatson) for lots and lots of help, discussions, suggestions and review of most of the patch at various stages. - John Baldwin (jhb) for his help. - Simon L. Nielsen (simon) as early adopter testing changes on cluster machines as well as all the testers and people who provided feedback the last months on freebsd-jail and other channels. - My employer, CK Software GmbH, for the support so I could work on this. Reviewed by: (see above) MFC after: 3 months (this is just so that I get the mail) X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
if (tmpinp != NULL)
return (tmpinp);
/*
* Then look in lb group (for wildcard match).
*/
if ((lookupflags & INPLOOKUP_WILDCARD) != 0) {
inp = in6_pcblookup_lbgroup(pcbinfo, laddr, lport, faddr,
Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain In order to efficiently serve web traffic on a NUMA machine, one must avoid as many NUMA domain crossings as possible. With SO_REUSEPORT_LB, a number of workers can share a listen socket. However, even if a worker sets affinity to a core or set of cores on a NUMA domain, it will receive connections associated with all NUMA domains in the system. This will lead to cross-domain traffic when the server writes to the socket or calls sendfile(), and memory is allocated on the server's local NUMA node, but transmitted on the NUMA node associated with the TCP connection. Similarly, when the server reads from the socket, he will likely be reading memory allocated on the NUMA domain associated with the TCP connection. This change provides a new socket ioctl, TCP_REUSPORT_LB_NUMA. A server can now tell the kernel to filter traffic so that only incoming connections associated with the desired NUMA domain are given to the server. (Of course, in the case where there are no servers sharing the listen socket on some domain, then as a fallback, traffic will be hashed as normal to all servers sharing the listen socket regardless of domain). This allows a server to deal only with traffic that is local to its NUMA domain, and avoids cross-domain traffic in most cases. This patch, and a corresponding small patch to nginx to use TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted https media content from dual-socket Xeons with only 13% (as measured by pcm.x) cross domain traffic on the memory controller. Reviewed by: jhb, bz (earlier version), bcr (man page) Tested by: gonzo Sponsored by: Netfix Differential Revision: https://reviews.freebsd.org/D21636
2020-12-19 22:04:46 +00:00
fport, lookupflags, numa_domain);
if (inp != NULL)
return (inp);
}
MFp4: Bring in updated jail support from bz_jail branch. This enhances the current jail implementation to permit multiple addresses per jail. In addtion to IPv4, IPv6 is supported as well. Due to updated checks it is even possible to have jails without an IP address at all, which basically gives one a chroot with restricted process view, no networking,.. SCTP support was updated and supports IPv6 in jails as well. Cpuset support permits jails to be bound to specific processor sets after creation. Jails can have an unrestricted (no duplicate protection, etc.) name in addition to the hostname. The jail name cannot be changed from within a jail and is considered to be used for management purposes or as audit-token in the future. DDB 'show jails' command was added to aid debugging. Proper compat support permits 32bit jail binaries to be used on 64bit systems to manage jails. Also backward compatibility was preserved where possible: for jail v1 syscalls, as well as with user space management utilities. Both jail as well as prison version were updated for the new features. A gap was intentionally left as the intermediate versions had been used by various patches floating around the last years. Bump __FreeBSD_version for the afore mentioned and in kernel changes. Special thanks to: - Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches and Olivier Houchard (cognet) for initial single-IPv6 patches. - Jeff Roberson (jeff) and Randall Stewart (rrs) for their help, ideas and review on cpuset and SCTP support. - Robert Watson (rwatson) for lots and lots of help, discussions, suggestions and review of most of the patch at various stages. - John Baldwin (jhb) for his help. - Simon L. Nielsen (simon) as early adopter testing changes on cluster machines as well as all the testers and people who provided feedback the last months on freebsd-jail and other channels. - My employer, CK Software GmbH, for the support so I could work on this. Reviewed by: (see above) MFC after: 3 months (this is just so that I get the mail) X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
/*
* Then look for a wildcard match, if requested.
*/
if ((lookupflags & INPLOOKUP_WILDCARD) != 0) {
MFp4: Bring in updated jail support from bz_jail branch. This enhances the current jail implementation to permit multiple addresses per jail. In addtion to IPv4, IPv6 is supported as well. Due to updated checks it is even possible to have jails without an IP address at all, which basically gives one a chroot with restricted process view, no networking,.. SCTP support was updated and supports IPv6 in jails as well. Cpuset support permits jails to be bound to specific processor sets after creation. Jails can have an unrestricted (no duplicate protection, etc.) name in addition to the hostname. The jail name cannot be changed from within a jail and is considered to be used for management purposes or as audit-token in the future. DDB 'show jails' command was added to aid debugging. Proper compat support permits 32bit jail binaries to be used on 64bit systems to manage jails. Also backward compatibility was preserved where possible: for jail v1 syscalls, as well as with user space management utilities. Both jail as well as prison version were updated for the new features. A gap was intentionally left as the intermediate versions had been used by various patches floating around the last years. Bump __FreeBSD_version for the afore mentioned and in kernel changes. Special thanks to: - Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches and Olivier Houchard (cognet) for initial single-IPv6 patches. - Jeff Roberson (jeff) and Randall Stewart (rrs) for their help, ideas and review on cpuset and SCTP support. - Robert Watson (rwatson) for lots and lots of help, discussions, suggestions and review of most of the patch at various stages. - John Baldwin (jhb) for his help. - Simon L. Nielsen (simon) as early adopter testing changes on cluster machines as well as all the testers and people who provided feedback the last months on freebsd-jail and other channels. - My employer, CK Software GmbH, for the support so I could work on this. Reviewed by: (see above) MFC after: 3 months (this is just so that I get the mail) X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
struct inpcb *local_wild = NULL, *local_exact = NULL;
struct inpcb *jail_wild = NULL;
int injail;
/*
* Order of socket selection - we always prefer jails.
* 1. jailed, non-wild.
* 2. jailed, wild.
* 3. non-jailed, non-wild.
* 4. non-jailed, wild.
*/
head = &pcbinfo->ipi_hashbase[INP_PCBHASH(
INP6_PCBHASHKEY(&in6addr_any), lport, 0,
pcbinfo->ipi_hashmask)];
CK_LIST_FOREACH(inp, head, inp_hash) {
MFp4: Bring in updated jail support from bz_jail branch. This enhances the current jail implementation to permit multiple addresses per jail. In addtion to IPv4, IPv6 is supported as well. Due to updated checks it is even possible to have jails without an IP address at all, which basically gives one a chroot with restricted process view, no networking,.. SCTP support was updated and supports IPv6 in jails as well. Cpuset support permits jails to be bound to specific processor sets after creation. Jails can have an unrestricted (no duplicate protection, etc.) name in addition to the hostname. The jail name cannot be changed from within a jail and is considered to be used for management purposes or as audit-token in the future. DDB 'show jails' command was added to aid debugging. Proper compat support permits 32bit jail binaries to be used on 64bit systems to manage jails. Also backward compatibility was preserved where possible: for jail v1 syscalls, as well as with user space management utilities. Both jail as well as prison version were updated for the new features. A gap was intentionally left as the intermediate versions had been used by various patches floating around the last years. Bump __FreeBSD_version for the afore mentioned and in kernel changes. Special thanks to: - Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches and Olivier Houchard (cognet) for initial single-IPv6 patches. - Jeff Roberson (jeff) and Randall Stewart (rrs) for their help, ideas and review on cpuset and SCTP support. - Robert Watson (rwatson) for lots and lots of help, discussions, suggestions and review of most of the patch at various stages. - John Baldwin (jhb) for his help. - Simon L. Nielsen (simon) as early adopter testing changes on cluster machines as well as all the testers and people who provided feedback the last months on freebsd-jail and other channels. - My employer, CK Software GmbH, for the support so I could work on this. Reviewed by: (see above) MFC after: 3 months (this is just so that I get the mail) X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
/* XXX inp locking */
1999-12-21 11:14:12 +00:00
if ((inp->inp_vflag & INP_IPV6) == 0)
continue;
MFp4: Bring in updated jail support from bz_jail branch. This enhances the current jail implementation to permit multiple addresses per jail. In addtion to IPv4, IPv6 is supported as well. Due to updated checks it is even possible to have jails without an IP address at all, which basically gives one a chroot with restricted process view, no networking,.. SCTP support was updated and supports IPv6 in jails as well. Cpuset support permits jails to be bound to specific processor sets after creation. Jails can have an unrestricted (no duplicate protection, etc.) name in addition to the hostname. The jail name cannot be changed from within a jail and is considered to be used for management purposes or as audit-token in the future. DDB 'show jails' command was added to aid debugging. Proper compat support permits 32bit jail binaries to be used on 64bit systems to manage jails. Also backward compatibility was preserved where possible: for jail v1 syscalls, as well as with user space management utilities. Both jail as well as prison version were updated for the new features. A gap was intentionally left as the intermediate versions had been used by various patches floating around the last years. Bump __FreeBSD_version for the afore mentioned and in kernel changes. Special thanks to: - Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches and Olivier Houchard (cognet) for initial single-IPv6 patches. - Jeff Roberson (jeff) and Randall Stewart (rrs) for their help, ideas and review on cpuset and SCTP support. - Robert Watson (rwatson) for lots and lots of help, discussions, suggestions and review of most of the patch at various stages. - John Baldwin (jhb) for his help. - Simon L. Nielsen (simon) as early adopter testing changes on cluster machines as well as all the testers and people who provided feedback the last months on freebsd-jail and other channels. - My employer, CK Software GmbH, for the support so I could work on this. Reviewed by: (see above) MFC after: 3 months (this is just so that I get the mail) X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
if (!IN6_IS_ADDR_UNSPECIFIED(&inp->in6p_faddr) ||
inp->inp_lport != lport) {
continue;
}
injail = prison_flag(inp->inp_cred, PR_IP6);
MFp4: Bring in updated jail support from bz_jail branch. This enhances the current jail implementation to permit multiple addresses per jail. In addtion to IPv4, IPv6 is supported as well. Due to updated checks it is even possible to have jails without an IP address at all, which basically gives one a chroot with restricted process view, no networking,.. SCTP support was updated and supports IPv6 in jails as well. Cpuset support permits jails to be bound to specific processor sets after creation. Jails can have an unrestricted (no duplicate protection, etc.) name in addition to the hostname. The jail name cannot be changed from within a jail and is considered to be used for management purposes or as audit-token in the future. DDB 'show jails' command was added to aid debugging. Proper compat support permits 32bit jail binaries to be used on 64bit systems to manage jails. Also backward compatibility was preserved where possible: for jail v1 syscalls, as well as with user space management utilities. Both jail as well as prison version were updated for the new features. A gap was intentionally left as the intermediate versions had been used by various patches floating around the last years. Bump __FreeBSD_version for the afore mentioned and in kernel changes. Special thanks to: - Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches and Olivier Houchard (cognet) for initial single-IPv6 patches. - Jeff Roberson (jeff) and Randall Stewart (rrs) for their help, ideas and review on cpuset and SCTP support. - Robert Watson (rwatson) for lots and lots of help, discussions, suggestions and review of most of the patch at various stages. - John Baldwin (jhb) for his help. - Simon L. Nielsen (simon) as early adopter testing changes on cluster machines as well as all the testers and people who provided feedback the last months on freebsd-jail and other channels. - My employer, CK Software GmbH, for the support so I could work on this. Reviewed by: (see above) MFC after: 3 months (this is just so that I get the mail) X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
if (injail) {
if (prison_check_ip6(inp->inp_cred,
laddr) != 0)
continue;
MFp4: Bring in updated jail support from bz_jail branch. This enhances the current jail implementation to permit multiple addresses per jail. In addtion to IPv4, IPv6 is supported as well. Due to updated checks it is even possible to have jails without an IP address at all, which basically gives one a chroot with restricted process view, no networking,.. SCTP support was updated and supports IPv6 in jails as well. Cpuset support permits jails to be bound to specific processor sets after creation. Jails can have an unrestricted (no duplicate protection, etc.) name in addition to the hostname. The jail name cannot be changed from within a jail and is considered to be used for management purposes or as audit-token in the future. DDB 'show jails' command was added to aid debugging. Proper compat support permits 32bit jail binaries to be used on 64bit systems to manage jails. Also backward compatibility was preserved where possible: for jail v1 syscalls, as well as with user space management utilities. Both jail as well as prison version were updated for the new features. A gap was intentionally left as the intermediate versions had been used by various patches floating around the last years. Bump __FreeBSD_version for the afore mentioned and in kernel changes. Special thanks to: - Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches and Olivier Houchard (cognet) for initial single-IPv6 patches. - Jeff Roberson (jeff) and Randall Stewart (rrs) for their help, ideas and review on cpuset and SCTP support. - Robert Watson (rwatson) for lots and lots of help, discussions, suggestions and review of most of the patch at various stages. - John Baldwin (jhb) for his help. - Simon L. Nielsen (simon) as early adopter testing changes on cluster machines as well as all the testers and people who provided feedback the last months on freebsd-jail and other channels. - My employer, CK Software GmbH, for the support so I could work on this. Reviewed by: (see above) MFC after: 3 months (this is just so that I get the mail) X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
} else {
if (local_exact != NULL)
continue;
}
if (IN6_ARE_ADDR_EQUAL(&inp->in6p_laddr, laddr)) {
if (injail)
return (inp);
MFp4: Bring in updated jail support from bz_jail branch. This enhances the current jail implementation to permit multiple addresses per jail. In addtion to IPv4, IPv6 is supported as well. Due to updated checks it is even possible to have jails without an IP address at all, which basically gives one a chroot with restricted process view, no networking,.. SCTP support was updated and supports IPv6 in jails as well. Cpuset support permits jails to be bound to specific processor sets after creation. Jails can have an unrestricted (no duplicate protection, etc.) name in addition to the hostname. The jail name cannot be changed from within a jail and is considered to be used for management purposes or as audit-token in the future. DDB 'show jails' command was added to aid debugging. Proper compat support permits 32bit jail binaries to be used on 64bit systems to manage jails. Also backward compatibility was preserved where possible: for jail v1 syscalls, as well as with user space management utilities. Both jail as well as prison version were updated for the new features. A gap was intentionally left as the intermediate versions had been used by various patches floating around the last years. Bump __FreeBSD_version for the afore mentioned and in kernel changes. Special thanks to: - Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches and Olivier Houchard (cognet) for initial single-IPv6 patches. - Jeff Roberson (jeff) and Randall Stewart (rrs) for their help, ideas and review on cpuset and SCTP support. - Robert Watson (rwatson) for lots and lots of help, discussions, suggestions and review of most of the patch at various stages. - John Baldwin (jhb) for his help. - Simon L. Nielsen (simon) as early adopter testing changes on cluster machines as well as all the testers and people who provided feedback the last months on freebsd-jail and other channels. - My employer, CK Software GmbH, for the support so I could work on this. Reviewed by: (see above) MFC after: 3 months (this is just so that I get the mail) X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
else
local_exact = inp;
} else if (IN6_IS_ADDR_UNSPECIFIED(&inp->in6p_laddr)) {
if (injail)
jail_wild = inp;
else
local_wild = inp;
}
MFp4: Bring in updated jail support from bz_jail branch. This enhances the current jail implementation to permit multiple addresses per jail. In addtion to IPv4, IPv6 is supported as well. Due to updated checks it is even possible to have jails without an IP address at all, which basically gives one a chroot with restricted process view, no networking,.. SCTP support was updated and supports IPv6 in jails as well. Cpuset support permits jails to be bound to specific processor sets after creation. Jails can have an unrestricted (no duplicate protection, etc.) name in addition to the hostname. The jail name cannot be changed from within a jail and is considered to be used for management purposes or as audit-token in the future. DDB 'show jails' command was added to aid debugging. Proper compat support permits 32bit jail binaries to be used on 64bit systems to manage jails. Also backward compatibility was preserved where possible: for jail v1 syscalls, as well as with user space management utilities. Both jail as well as prison version were updated for the new features. A gap was intentionally left as the intermediate versions had been used by various patches floating around the last years. Bump __FreeBSD_version for the afore mentioned and in kernel changes. Special thanks to: - Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches and Olivier Houchard (cognet) for initial single-IPv6 patches. - Jeff Roberson (jeff) and Randall Stewart (rrs) for their help, ideas and review on cpuset and SCTP support. - Robert Watson (rwatson) for lots and lots of help, discussions, suggestions and review of most of the patch at various stages. - John Baldwin (jhb) for his help. - Simon L. Nielsen (simon) as early adopter testing changes on cluster machines as well as all the testers and people who provided feedback the last months on freebsd-jail and other channels. - My employer, CK Software GmbH, for the support so I could work on this. Reviewed by: (see above) MFC after: 3 months (this is just so that I get the mail) X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
} /* LIST_FOREACH */
if (jail_wild != NULL)
return (jail_wild);
if (local_exact != NULL)
return (local_exact);
if (local_wild != NULL)
return (local_wild);
} /* if ((lookupflags & INPLOOKUP_WILDCARD) != 0) */
/*
* Not found.
*/
return (NULL);
}
Decompose the current single inpcbinfo lock into two locks: - The existing ipi_lock continues to protect the global inpcb list and inpcb counter. This lock is now relegated to a small number of allocation and free operations, and occasional operations that walk all connections (including, awkwardly, certain UDP multicast receive operations -- something to revisit). - A new ipi_hash_lock protects the two inpcbinfo hash tables for looking up connections and bound sockets, manipulated using new INP_HASH_*() macros. This lock, combined with inpcb locks, protects the 4-tuple address space. Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb connection locks, so may be acquired while manipulating a connection on which a lock is already held, avoiding the need to acquire the inpcbinfo lock preemptively when a binding change might later be required. As a result, however, lookup operations necessarily go through a reference acquire while holding the lookup lock, later acquiring an inpcb lock -- if required. A new function in_pcblookup() looks up connections, and accepts flags indicating how to return the inpcb. Due to lock order changes, callers no longer need acquire locks before performing a lookup: the lookup routine will acquire the ipi_hash_lock as needed. In the future, it will also be able to use alternative lookup and locking strategies transparently to callers, such as pcbgroup lookup. New lookup flags are, supplementing the existing INPLOOKUP_WILDCARD flag: INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb Callers must pass exactly one of these flags (for the time being). Some notes: - All protocols are updated to work within the new regime; especially, TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely eliminated, and global hash lock hold times are dramatically reduced compared to previous locking. - The TCP syncache still relies on the pcbinfo lock, something that we may want to revisit. - Support for reverting to the FreeBSD 7.x locking strategy in TCP input is no longer available -- hash lookup locks are now held only very briefly during inpcb lookup, rather than for potentially extended periods. However, the pcbinfo ipi_lock will still be acquired if a connection state might change such that a connection is added or removed. - Raw IP sockets continue to use the pcbinfo ipi_lock for protection, due to maintaining their own hash tables. - The interface in6_pcblookup_hash_locked() is maintained, which allows callers to acquire hash locks and perform one or more lookups atomically with 4-tuple allocation: this is required only for TCPv6, as there is no in6_pcbconnect_setup(), which there should be. - UDPv6 locking remains significantly more conservative than UDPv4 locking, which relates to source address selection. This needs attention, as it likely significantly reduces parallelism in this code for multithreaded socket use (such as in BIND). - In the UDPv4 and UDPv6 multicast cases, we need to revisit locking somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which is no longer sufficient. A second check once the inpcb lock is held should do the trick, keeping the general case from requiring the inpcb lock for every inpcb visited. - This work reminds us that we need to revisit locking of the v4/v6 flags, which may be accessed lock-free both before and after this change. - Right now, a single lock name is used for the pcbhash lock -- this is undesirable, and probably another argument is required to take care of this (or a char array name field in the pcbinfo?). This is not an MFC candidate for 8.x due to its impact on lookup and locking semantics. It's possible some of these issues could be worked around with compatibility wrappers, if necessary. Reviewed by: bz Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
/*
* Lookup PCB in hash list, using pcbinfo tables. This variation locks the
* hash list lock, and will return the inpcb locked (i.e., requires
* INPLOOKUP_LOCKPCB).
*/
static struct inpcb *
in6_pcblookup_hash(struct inpcbinfo *pcbinfo, struct in6_addr *faddr,
u_int fport, struct in6_addr *laddr, u_int lport, int lookupflags,
Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain In order to efficiently serve web traffic on a NUMA machine, one must avoid as many NUMA domain crossings as possible. With SO_REUSEPORT_LB, a number of workers can share a listen socket. However, even if a worker sets affinity to a core or set of cores on a NUMA domain, it will receive connections associated with all NUMA domains in the system. This will lead to cross-domain traffic when the server writes to the socket or calls sendfile(), and memory is allocated on the server's local NUMA node, but transmitted on the NUMA node associated with the TCP connection. Similarly, when the server reads from the socket, he will likely be reading memory allocated on the NUMA domain associated with the TCP connection. This change provides a new socket ioctl, TCP_REUSPORT_LB_NUMA. A server can now tell the kernel to filter traffic so that only incoming connections associated with the desired NUMA domain are given to the server. (Of course, in the case where there are no servers sharing the listen socket on some domain, then as a fallback, traffic will be hashed as normal to all servers sharing the listen socket regardless of domain). This allows a server to deal only with traffic that is local to its NUMA domain, and avoids cross-domain traffic in most cases. This patch, and a corresponding small patch to nginx to use TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted https media content from dual-socket Xeons with only 13% (as measured by pcm.x) cross domain traffic on the memory controller. Reviewed by: jhb, bz (earlier version), bcr (man page) Tested by: gonzo Sponsored by: Netfix Differential Revision: https://reviews.freebsd.org/D21636
2020-12-19 22:04:46 +00:00
struct ifnet *ifp, uint8_t numa_domain)
Decompose the current single inpcbinfo lock into two locks: - The existing ipi_lock continues to protect the global inpcb list and inpcb counter. This lock is now relegated to a small number of allocation and free operations, and occasional operations that walk all connections (including, awkwardly, certain UDP multicast receive operations -- something to revisit). - A new ipi_hash_lock protects the two inpcbinfo hash tables for looking up connections and bound sockets, manipulated using new INP_HASH_*() macros. This lock, combined with inpcb locks, protects the 4-tuple address space. Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb connection locks, so may be acquired while manipulating a connection on which a lock is already held, avoiding the need to acquire the inpcbinfo lock preemptively when a binding change might later be required. As a result, however, lookup operations necessarily go through a reference acquire while holding the lookup lock, later acquiring an inpcb lock -- if required. A new function in_pcblookup() looks up connections, and accepts flags indicating how to return the inpcb. Due to lock order changes, callers no longer need acquire locks before performing a lookup: the lookup routine will acquire the ipi_hash_lock as needed. In the future, it will also be able to use alternative lookup and locking strategies transparently to callers, such as pcbgroup lookup. New lookup flags are, supplementing the existing INPLOOKUP_WILDCARD flag: INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb Callers must pass exactly one of these flags (for the time being). Some notes: - All protocols are updated to work within the new regime; especially, TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely eliminated, and global hash lock hold times are dramatically reduced compared to previous locking. - The TCP syncache still relies on the pcbinfo lock, something that we may want to revisit. - Support for reverting to the FreeBSD 7.x locking strategy in TCP input is no longer available -- hash lookup locks are now held only very briefly during inpcb lookup, rather than for potentially extended periods. However, the pcbinfo ipi_lock will still be acquired if a connection state might change such that a connection is added or removed. - Raw IP sockets continue to use the pcbinfo ipi_lock for protection, due to maintaining their own hash tables. - The interface in6_pcblookup_hash_locked() is maintained, which allows callers to acquire hash locks and perform one or more lookups atomically with 4-tuple allocation: this is required only for TCPv6, as there is no in6_pcbconnect_setup(), which there should be. - UDPv6 locking remains significantly more conservative than UDPv4 locking, which relates to source address selection. This needs attention, as it likely significantly reduces parallelism in this code for multithreaded socket use (such as in BIND). - In the UDPv4 and UDPv6 multicast cases, we need to revisit locking somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which is no longer sufficient. A second check once the inpcb lock is held should do the trick, keeping the general case from requiring the inpcb lock for every inpcb visited. - This work reminds us that we need to revisit locking of the v4/v6 flags, which may be accessed lock-free both before and after this change. - Right now, a single lock name is used for the pcbhash lock -- this is undesirable, and probably another argument is required to take care of this (or a char array name field in the pcbinfo?). This is not an MFC candidate for 8.x due to its impact on lookup and locking semantics. It's possible some of these issues could be worked around with compatibility wrappers, if necessary. Reviewed by: bz Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
{
struct inpcb *inp;
inp = in6_pcblookup_hash_locked(pcbinfo, faddr, fport, laddr, lport,
lookupflags & INPLOOKUP_WILDCARD, ifp, numa_domain);
Decompose the current single inpcbinfo lock into two locks: - The existing ipi_lock continues to protect the global inpcb list and inpcb counter. This lock is now relegated to a small number of allocation and free operations, and occasional operations that walk all connections (including, awkwardly, certain UDP multicast receive operations -- something to revisit). - A new ipi_hash_lock protects the two inpcbinfo hash tables for looking up connections and bound sockets, manipulated using new INP_HASH_*() macros. This lock, combined with inpcb locks, protects the 4-tuple address space. Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb connection locks, so may be acquired while manipulating a connection on which a lock is already held, avoiding the need to acquire the inpcbinfo lock preemptively when a binding change might later be required. As a result, however, lookup operations necessarily go through a reference acquire while holding the lookup lock, later acquiring an inpcb lock -- if required. A new function in_pcblookup() looks up connections, and accepts flags indicating how to return the inpcb. Due to lock order changes, callers no longer need acquire locks before performing a lookup: the lookup routine will acquire the ipi_hash_lock as needed. In the future, it will also be able to use alternative lookup and locking strategies transparently to callers, such as pcbgroup lookup. New lookup flags are, supplementing the existing INPLOOKUP_WILDCARD flag: INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb Callers must pass exactly one of these flags (for the time being). Some notes: - All protocols are updated to work within the new regime; especially, TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely eliminated, and global hash lock hold times are dramatically reduced compared to previous locking. - The TCP syncache still relies on the pcbinfo lock, something that we may want to revisit. - Support for reverting to the FreeBSD 7.x locking strategy in TCP input is no longer available -- hash lookup locks are now held only very briefly during inpcb lookup, rather than for potentially extended periods. However, the pcbinfo ipi_lock will still be acquired if a connection state might change such that a connection is added or removed. - Raw IP sockets continue to use the pcbinfo ipi_lock for protection, due to maintaining their own hash tables. - The interface in6_pcblookup_hash_locked() is maintained, which allows callers to acquire hash locks and perform one or more lookups atomically with 4-tuple allocation: this is required only for TCPv6, as there is no in6_pcbconnect_setup(), which there should be. - UDPv6 locking remains significantly more conservative than UDPv4 locking, which relates to source address selection. This needs attention, as it likely significantly reduces parallelism in this code for multithreaded socket use (such as in BIND). - In the UDPv4 and UDPv6 multicast cases, we need to revisit locking somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which is no longer sufficient. A second check once the inpcb lock is held should do the trick, keeping the general case from requiring the inpcb lock for every inpcb visited. - This work reminds us that we need to revisit locking of the v4/v6 flags, which may be accessed lock-free both before and after this change. - Right now, a single lock name is used for the pcbhash lock -- this is undesirable, and probably another argument is required to take care of this (or a char array name field in the pcbinfo?). This is not an MFC candidate for 8.x due to its impact on lookup and locking semantics. It's possible some of these issues could be worked around with compatibility wrappers, if necessary. Reviewed by: bz Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
if (inp != NULL) {
if (lookupflags & INPLOOKUP_WLOCKPCB) {
INP_WLOCK(inp);
} else if (lookupflags & INPLOOKUP_RLOCKPCB) {
INP_RLOCK(inp);
} else
panic("%s: locking bug", __func__);
if (__predict_false(inp->inp_flags2 & INP_FREED)) {
INP_UNLOCK(inp);
inp = NULL;
}
}
Decompose the current single inpcbinfo lock into two locks: - The existing ipi_lock continues to protect the global inpcb list and inpcb counter. This lock is now relegated to a small number of allocation and free operations, and occasional operations that walk all connections (including, awkwardly, certain UDP multicast receive operations -- something to revisit). - A new ipi_hash_lock protects the two inpcbinfo hash tables for looking up connections and bound sockets, manipulated using new INP_HASH_*() macros. This lock, combined with inpcb locks, protects the 4-tuple address space. Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb connection locks, so may be acquired while manipulating a connection on which a lock is already held, avoiding the need to acquire the inpcbinfo lock preemptively when a binding change might later be required. As a result, however, lookup operations necessarily go through a reference acquire while holding the lookup lock, later acquiring an inpcb lock -- if required. A new function in_pcblookup() looks up connections, and accepts flags indicating how to return the inpcb. Due to lock order changes, callers no longer need acquire locks before performing a lookup: the lookup routine will acquire the ipi_hash_lock as needed. In the future, it will also be able to use alternative lookup and locking strategies transparently to callers, such as pcbgroup lookup. New lookup flags are, supplementing the existing INPLOOKUP_WILDCARD flag: INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb Callers must pass exactly one of these flags (for the time being). Some notes: - All protocols are updated to work within the new regime; especially, TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely eliminated, and global hash lock hold times are dramatically reduced compared to previous locking. - The TCP syncache still relies on the pcbinfo lock, something that we may want to revisit. - Support for reverting to the FreeBSD 7.x locking strategy in TCP input is no longer available -- hash lookup locks are now held only very briefly during inpcb lookup, rather than for potentially extended periods. However, the pcbinfo ipi_lock will still be acquired if a connection state might change such that a connection is added or removed. - Raw IP sockets continue to use the pcbinfo ipi_lock for protection, due to maintaining their own hash tables. - The interface in6_pcblookup_hash_locked() is maintained, which allows callers to acquire hash locks and perform one or more lookups atomically with 4-tuple allocation: this is required only for TCPv6, as there is no in6_pcbconnect_setup(), which there should be. - UDPv6 locking remains significantly more conservative than UDPv4 locking, which relates to source address selection. This needs attention, as it likely significantly reduces parallelism in this code for multithreaded socket use (such as in BIND). - In the UDPv4 and UDPv6 multicast cases, we need to revisit locking somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which is no longer sufficient. A second check once the inpcb lock is held should do the trick, keeping the general case from requiring the inpcb lock for every inpcb visited. - This work reminds us that we need to revisit locking of the v4/v6 flags, which may be accessed lock-free both before and after this change. - Right now, a single lock name is used for the pcbhash lock -- this is undesirable, and probably another argument is required to take care of this (or a char array name field in the pcbinfo?). This is not an MFC candidate for 8.x due to its impact on lookup and locking semantics. It's possible some of these issues could be worked around with compatibility wrappers, if necessary. Reviewed by: bz Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
return (inp);
}
/*
* Public inpcb lookup routines, accepting a 4-tuple, and optionally, an mbuf
* from which a pre-calculated hash value may be extracted.
Implement a CPU-affine TCP and UDP connection lookup data structure, struct inpcbgroup. pcbgroups, or "connection groups", supplement the existing inpcbinfo connection hash table, which when pcbgroups are enabled, might now be thought of more usefully as a per-protocol 4-tuple reservation table. Connections are assigned to connection groups base on a hash of their 4-tuple; wildcard sockets require special handling, and are members of all connection groups. During a connection lookup, a per-connection group lock is employed rather than the global pcbinfo lock. By aligning connection groups with input path processing, connection groups take on an effective CPU affinity, especially when aligned with RSS work placement (see a forthcoming commit for details). This eliminates cache line migration associated with global, protocol-layer data structures in steady state TCP and UDP processing (with the exception of protocol-layer statistics; further commit to follow). Elements of this approach were inspired by Willman, Rixner, and Cox's 2006 USENIX paper, "An Evaluation of Network Stack Parallelization Strategies in Modern Operating Systems". However, there are also significant differences: we maintain the inpcb lock, rather than using the connection group lock for per-connection state. Likewise, the focus of this implementation is alignment with NIC packet distribution strategies such as RSS, rather than pure software strategies. Despite that focus, software distribution is supported through the parallel netisr implementation, and works well in configurations where the number of hardware threads is greater than the number of NIC input queues, such as in the RMI XLR threaded MIPS architecture. Another important difference is the continued maintenance of existing hash tables as "reservation tables" -- these are useful both to distinguish the resource allocation aspect of protocol name management and the more common-case lookup aspect. In configurations where connection tables are aligned with hardware hashes, it is desirable to use the traditional lookup tables for loopback or encapsulated traffic rather than take the expense of hardware hashes that are hard to implement efficiently in software (such as RSS Toeplitz). Connection group support is enabled by compiling "options PCBGROUP" into your kernel configuration; for the time being, this is an experimental feature, and hence is not enabled by default. Subject to the limited MFCability of change dependencies in inpcb, and its change to the inpcbinfo init function signature, this change in principle could be merged to FreeBSD 8.x. Reviewed by: bz Sponsored by: Juniper Networks, Inc.
2011-06-06 12:55:02 +00:00
*
* Possibly more of this logic should be in in6_pcbgroup.c.
Decompose the current single inpcbinfo lock into two locks: - The existing ipi_lock continues to protect the global inpcb list and inpcb counter. This lock is now relegated to a small number of allocation and free operations, and occasional operations that walk all connections (including, awkwardly, certain UDP multicast receive operations -- something to revisit). - A new ipi_hash_lock protects the two inpcbinfo hash tables for looking up connections and bound sockets, manipulated using new INP_HASH_*() macros. This lock, combined with inpcb locks, protects the 4-tuple address space. Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb connection locks, so may be acquired while manipulating a connection on which a lock is already held, avoiding the need to acquire the inpcbinfo lock preemptively when a binding change might later be required. As a result, however, lookup operations necessarily go through a reference acquire while holding the lookup lock, later acquiring an inpcb lock -- if required. A new function in_pcblookup() looks up connections, and accepts flags indicating how to return the inpcb. Due to lock order changes, callers no longer need acquire locks before performing a lookup: the lookup routine will acquire the ipi_hash_lock as needed. In the future, it will also be able to use alternative lookup and locking strategies transparently to callers, such as pcbgroup lookup. New lookup flags are, supplementing the existing INPLOOKUP_WILDCARD flag: INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb Callers must pass exactly one of these flags (for the time being). Some notes: - All protocols are updated to work within the new regime; especially, TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely eliminated, and global hash lock hold times are dramatically reduced compared to previous locking. - The TCP syncache still relies on the pcbinfo lock, something that we may want to revisit. - Support for reverting to the FreeBSD 7.x locking strategy in TCP input is no longer available -- hash lookup locks are now held only very briefly during inpcb lookup, rather than for potentially extended periods. However, the pcbinfo ipi_lock will still be acquired if a connection state might change such that a connection is added or removed. - Raw IP sockets continue to use the pcbinfo ipi_lock for protection, due to maintaining their own hash tables. - The interface in6_pcblookup_hash_locked() is maintained, which allows callers to acquire hash locks and perform one or more lookups atomically with 4-tuple allocation: this is required only for TCPv6, as there is no in6_pcbconnect_setup(), which there should be. - UDPv6 locking remains significantly more conservative than UDPv4 locking, which relates to source address selection. This needs attention, as it likely significantly reduces parallelism in this code for multithreaded socket use (such as in BIND). - In the UDPv4 and UDPv6 multicast cases, we need to revisit locking somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which is no longer sufficient. A second check once the inpcb lock is held should do the trick, keeping the general case from requiring the inpcb lock for every inpcb visited. - This work reminds us that we need to revisit locking of the v4/v6 flags, which may be accessed lock-free both before and after this change. - Right now, a single lock name is used for the pcbhash lock -- this is undesirable, and probably another argument is required to take care of this (or a char array name field in the pcbinfo?). This is not an MFC candidate for 8.x due to its impact on lookup and locking semantics. It's possible some of these issues could be worked around with compatibility wrappers, if necessary. Reviewed by: bz Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
*/
struct inpcb *
in6_pcblookup(struct inpcbinfo *pcbinfo, struct in6_addr *faddr, u_int fport,
struct in6_addr *laddr, u_int lport, int lookupflags, struct ifnet *ifp)
{
Several years after initial development, merge prototype support for linking NIC Receive Side Scaling (RSS) to the network stack's connection-group implementation. This prototype (and derived patches) are in use at Juniper and several other FreeBSD-using companies, so despite some reservations about its maturity, merge the patch to the base tree so that it can be iteratively refined in collaboration rather than maintained as a set of gradually diverging patch sets. (1) Merge a software implementation of the Toeplitz hash specified in RSS implemented by David Malone. This is used to allow suitable pcbgroup placement of connections before the first packet is received from the NIC. Software hashing is generally avoided, however, due to high cost of the hash on general-purpose CPUs. (2) In in_rss.c, maintain authoritative versions of RSS state intended to be pushed to each NIC, including keying material, hash algorithm/ configuration, and buckets. Provide software-facing interfaces to hash 2- and 4-tuples for IPv4 and IPv6 using both the RSS standardised Toeplitz and a 'naive' variation with a hash efficient in software but with poor distribution properties. Implement rss_m2cpuid()to be used by netisr and other load balancing code to look up the CPU on which an mbuf should be processed. (3) In the Ethernet link layer, allow netisr distribution using RSS as a source of policy as an alternative to source ordering; continue to default to direct dispatch (i.e., don't try and requeue packets for processing on the 'right' CPU if they arrive in a directly dispatchable context). (4) Allow RSS to control tuning of connection groups in order to align groups with RSS buckets. If a packet arrives on a protocol using connection groups, and contains a suitable hardware-generated hash, use that hash value to select the connection group for pcb lookup for both IPv4 and IPv6. If no hardware-generated Toeplitz hash is available, we fall back on regular PCB lookup risking contention rather than pay the cost of Toeplitz in software -- this is a less scalable but, at my last measurement, faster approach. As core counts go up, we may want to revise this strategy despite CPU overhead. Where device drivers suitably configure NICs, and connection groups / RSS are enabled, this should avoid both lock and line contention during connection lookup for TCP. This commit does not modify any device drivers to tune device RSS configuration to the global RSS configuration; patches are in circulation to do this for at least Chelsio T3 and Intel 1G/10G drivers. Currently, the KPI for device drivers is not particularly robust, nor aware of more advanced features such as runtime reconfiguration/rebalancing. This will hopefully prove a useful starting point for refinement. No MFC is scheduled as we will first want to nail down a more mature and maintainable KPI/KBI for device drivers. Sponsored by: Juniper Networks (original work) Sponsored by: EMC/Isilon (patch update and merge)
2014-03-15 00:57:50 +00:00
#if defined(PCBGROUP) && !defined(RSS)
Implement a CPU-affine TCP and UDP connection lookup data structure, struct inpcbgroup. pcbgroups, or "connection groups", supplement the existing inpcbinfo connection hash table, which when pcbgroups are enabled, might now be thought of more usefully as a per-protocol 4-tuple reservation table. Connections are assigned to connection groups base on a hash of their 4-tuple; wildcard sockets require special handling, and are members of all connection groups. During a connection lookup, a per-connection group lock is employed rather than the global pcbinfo lock. By aligning connection groups with input path processing, connection groups take on an effective CPU affinity, especially when aligned with RSS work placement (see a forthcoming commit for details). This eliminates cache line migration associated with global, protocol-layer data structures in steady state TCP and UDP processing (with the exception of protocol-layer statistics; further commit to follow). Elements of this approach were inspired by Willman, Rixner, and Cox's 2006 USENIX paper, "An Evaluation of Network Stack Parallelization Strategies in Modern Operating Systems". However, there are also significant differences: we maintain the inpcb lock, rather than using the connection group lock for per-connection state. Likewise, the focus of this implementation is alignment with NIC packet distribution strategies such as RSS, rather than pure software strategies. Despite that focus, software distribution is supported through the parallel netisr implementation, and works well in configurations where the number of hardware threads is greater than the number of NIC input queues, such as in the RMI XLR threaded MIPS architecture. Another important difference is the continued maintenance of existing hash tables as "reservation tables" -- these are useful both to distinguish the resource allocation aspect of protocol name management and the more common-case lookup aspect. In configurations where connection tables are aligned with hardware hashes, it is desirable to use the traditional lookup tables for loopback or encapsulated traffic rather than take the expense of hardware hashes that are hard to implement efficiently in software (such as RSS Toeplitz). Connection group support is enabled by compiling "options PCBGROUP" into your kernel configuration; for the time being, this is an experimental feature, and hence is not enabled by default. Subject to the limited MFCability of change dependencies in inpcb, and its change to the inpcbinfo init function signature, this change in principle could be merged to FreeBSD 8.x. Reviewed by: bz Sponsored by: Juniper Networks, Inc.
2011-06-06 12:55:02 +00:00
struct inpcbgroup *pcbgroup;
#endif
KASSERT((lookupflags & ~INPLOOKUP_MASK) == 0,
("%s: invalid lookup flags %d", __func__, lookupflags));
KASSERT((lookupflags & (INPLOOKUP_RLOCKPCB | INPLOOKUP_WLOCKPCB)) != 0,
("%s: LOCKPCB not set", __func__));
Decompose the current single inpcbinfo lock into two locks: - The existing ipi_lock continues to protect the global inpcb list and inpcb counter. This lock is now relegated to a small number of allocation and free operations, and occasional operations that walk all connections (including, awkwardly, certain UDP multicast receive operations -- something to revisit). - A new ipi_hash_lock protects the two inpcbinfo hash tables for looking up connections and bound sockets, manipulated using new INP_HASH_*() macros. This lock, combined with inpcb locks, protects the 4-tuple address space. Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb connection locks, so may be acquired while manipulating a connection on which a lock is already held, avoiding the need to acquire the inpcbinfo lock preemptively when a binding change might later be required. As a result, however, lookup operations necessarily go through a reference acquire while holding the lookup lock, later acquiring an inpcb lock -- if required. A new function in_pcblookup() looks up connections, and accepts flags indicating how to return the inpcb. Due to lock order changes, callers no longer need acquire locks before performing a lookup: the lookup routine will acquire the ipi_hash_lock as needed. In the future, it will also be able to use alternative lookup and locking strategies transparently to callers, such as pcbgroup lookup. New lookup flags are, supplementing the existing INPLOOKUP_WILDCARD flag: INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb Callers must pass exactly one of these flags (for the time being). Some notes: - All protocols are updated to work within the new regime; especially, TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely eliminated, and global hash lock hold times are dramatically reduced compared to previous locking. - The TCP syncache still relies on the pcbinfo lock, something that we may want to revisit. - Support for reverting to the FreeBSD 7.x locking strategy in TCP input is no longer available -- hash lookup locks are now held only very briefly during inpcb lookup, rather than for potentially extended periods. However, the pcbinfo ipi_lock will still be acquired if a connection state might change such that a connection is added or removed. - Raw IP sockets continue to use the pcbinfo ipi_lock for protection, due to maintaining their own hash tables. - The interface in6_pcblookup_hash_locked() is maintained, which allows callers to acquire hash locks and perform one or more lookups atomically with 4-tuple allocation: this is required only for TCPv6, as there is no in6_pcbconnect_setup(), which there should be. - UDPv6 locking remains significantly more conservative than UDPv4 locking, which relates to source address selection. This needs attention, as it likely significantly reduces parallelism in this code for multithreaded socket use (such as in BIND). - In the UDPv4 and UDPv6 multicast cases, we need to revisit locking somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which is no longer sufficient. A second check once the inpcb lock is held should do the trick, keeping the general case from requiring the inpcb lock for every inpcb visited. - This work reminds us that we need to revisit locking of the v4/v6 flags, which may be accessed lock-free both before and after this change. - Right now, a single lock name is used for the pcbhash lock -- this is undesirable, and probably another argument is required to take care of this (or a char array name field in the pcbinfo?). This is not an MFC candidate for 8.x due to its impact on lookup and locking semantics. It's possible some of these issues could be worked around with compatibility wrappers, if necessary. Reviewed by: bz Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
Several years after initial development, merge prototype support for linking NIC Receive Side Scaling (RSS) to the network stack's connection-group implementation. This prototype (and derived patches) are in use at Juniper and several other FreeBSD-using companies, so despite some reservations about its maturity, merge the patch to the base tree so that it can be iteratively refined in collaboration rather than maintained as a set of gradually diverging patch sets. (1) Merge a software implementation of the Toeplitz hash specified in RSS implemented by David Malone. This is used to allow suitable pcbgroup placement of connections before the first packet is received from the NIC. Software hashing is generally avoided, however, due to high cost of the hash on general-purpose CPUs. (2) In in_rss.c, maintain authoritative versions of RSS state intended to be pushed to each NIC, including keying material, hash algorithm/ configuration, and buckets. Provide software-facing interfaces to hash 2- and 4-tuples for IPv4 and IPv6 using both the RSS standardised Toeplitz and a 'naive' variation with a hash efficient in software but with poor distribution properties. Implement rss_m2cpuid()to be used by netisr and other load balancing code to look up the CPU on which an mbuf should be processed. (3) In the Ethernet link layer, allow netisr distribution using RSS as a source of policy as an alternative to source ordering; continue to default to direct dispatch (i.e., don't try and requeue packets for processing on the 'right' CPU if they arrive in a directly dispatchable context). (4) Allow RSS to control tuning of connection groups in order to align groups with RSS buckets. If a packet arrives on a protocol using connection groups, and contains a suitable hardware-generated hash, use that hash value to select the connection group for pcb lookup for both IPv4 and IPv6. If no hardware-generated Toeplitz hash is available, we fall back on regular PCB lookup risking contention rather than pay the cost of Toeplitz in software -- this is a less scalable but, at my last measurement, faster approach. As core counts go up, we may want to revise this strategy despite CPU overhead. Where device drivers suitably configure NICs, and connection groups / RSS are enabled, this should avoid both lock and line contention during connection lookup for TCP. This commit does not modify any device drivers to tune device RSS configuration to the global RSS configuration; patches are in circulation to do this for at least Chelsio T3 and Intel 1G/10G drivers. Currently, the KPI for device drivers is not particularly robust, nor aware of more advanced features such as runtime reconfiguration/rebalancing. This will hopefully prove a useful starting point for refinement. No MFC is scheduled as we will first want to nail down a more mature and maintainable KPI/KBI for device drivers. Sponsored by: Juniper Networks (original work) Sponsored by: EMC/Isilon (patch update and merge)
2014-03-15 00:57:50 +00:00
/*
* When not using RSS, use connection groups in preference to the
* reservation table when looking up 4-tuples. When using RSS, just
* use the reservation table, due to the cost of the Toeplitz hash
* in software.
*
* XXXRW: This policy belongs in the pcbgroup code, as in principle
* we could be doing RSS with a non-Toeplitz hash that is affordable
* in software.
*/
#if defined(PCBGROUP) && !defined(RSS)
Implement a CPU-affine TCP and UDP connection lookup data structure, struct inpcbgroup. pcbgroups, or "connection groups", supplement the existing inpcbinfo connection hash table, which when pcbgroups are enabled, might now be thought of more usefully as a per-protocol 4-tuple reservation table. Connections are assigned to connection groups base on a hash of their 4-tuple; wildcard sockets require special handling, and are members of all connection groups. During a connection lookup, a per-connection group lock is employed rather than the global pcbinfo lock. By aligning connection groups with input path processing, connection groups take on an effective CPU affinity, especially when aligned with RSS work placement (see a forthcoming commit for details). This eliminates cache line migration associated with global, protocol-layer data structures in steady state TCP and UDP processing (with the exception of protocol-layer statistics; further commit to follow). Elements of this approach were inspired by Willman, Rixner, and Cox's 2006 USENIX paper, "An Evaluation of Network Stack Parallelization Strategies in Modern Operating Systems". However, there are also significant differences: we maintain the inpcb lock, rather than using the connection group lock for per-connection state. Likewise, the focus of this implementation is alignment with NIC packet distribution strategies such as RSS, rather than pure software strategies. Despite that focus, software distribution is supported through the parallel netisr implementation, and works well in configurations where the number of hardware threads is greater than the number of NIC input queues, such as in the RMI XLR threaded MIPS architecture. Another important difference is the continued maintenance of existing hash tables as "reservation tables" -- these are useful both to distinguish the resource allocation aspect of protocol name management and the more common-case lookup aspect. In configurations where connection tables are aligned with hardware hashes, it is desirable to use the traditional lookup tables for loopback or encapsulated traffic rather than take the expense of hardware hashes that are hard to implement efficiently in software (such as RSS Toeplitz). Connection group support is enabled by compiling "options PCBGROUP" into your kernel configuration; for the time being, this is an experimental feature, and hence is not enabled by default. Subject to the limited MFCability of change dependencies in inpcb, and its change to the inpcbinfo init function signature, this change in principle could be merged to FreeBSD 8.x. Reviewed by: bz Sponsored by: Juniper Networks, Inc.
2011-06-06 12:55:02 +00:00
if (in_pcbgroup_enabled(pcbinfo)) {
pcbgroup = in6_pcbgroup_bytuple(pcbinfo, laddr, lport, faddr,
fport);
return (in6_pcblookup_group(pcbinfo, pcbgroup, faddr, fport,
laddr, lport, lookupflags, ifp));
}
#endif
return (in6_pcblookup_hash(pcbinfo, faddr, fport, laddr, lport,
Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain In order to efficiently serve web traffic on a NUMA machine, one must avoid as many NUMA domain crossings as possible. With SO_REUSEPORT_LB, a number of workers can share a listen socket. However, even if a worker sets affinity to a core or set of cores on a NUMA domain, it will receive connections associated with all NUMA domains in the system. This will lead to cross-domain traffic when the server writes to the socket or calls sendfile(), and memory is allocated on the server's local NUMA node, but transmitted on the NUMA node associated with the TCP connection. Similarly, when the server reads from the socket, he will likely be reading memory allocated on the NUMA domain associated with the TCP connection. This change provides a new socket ioctl, TCP_REUSPORT_LB_NUMA. A server can now tell the kernel to filter traffic so that only incoming connections associated with the desired NUMA domain are given to the server. (Of course, in the case where there are no servers sharing the listen socket on some domain, then as a fallback, traffic will be hashed as normal to all servers sharing the listen socket regardless of domain). This allows a server to deal only with traffic that is local to its NUMA domain, and avoids cross-domain traffic in most cases. This patch, and a corresponding small patch to nginx to use TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted https media content from dual-socket Xeons with only 13% (as measured by pcm.x) cross domain traffic on the memory controller. Reviewed by: jhb, bz (earlier version), bcr (man page) Tested by: gonzo Sponsored by: Netfix Differential Revision: https://reviews.freebsd.org/D21636
2020-12-19 22:04:46 +00:00
lookupflags, ifp, M_NODOM));
}
struct inpcb *
in6_pcblookup_mbuf(struct inpcbinfo *pcbinfo, struct in6_addr *faddr,
u_int fport, struct in6_addr *laddr, u_int lport, int lookupflags,
struct ifnet *ifp, struct mbuf *m)
{
Implement a CPU-affine TCP and UDP connection lookup data structure, struct inpcbgroup. pcbgroups, or "connection groups", supplement the existing inpcbinfo connection hash table, which when pcbgroups are enabled, might now be thought of more usefully as a per-protocol 4-tuple reservation table. Connections are assigned to connection groups base on a hash of their 4-tuple; wildcard sockets require special handling, and are members of all connection groups. During a connection lookup, a per-connection group lock is employed rather than the global pcbinfo lock. By aligning connection groups with input path processing, connection groups take on an effective CPU affinity, especially when aligned with RSS work placement (see a forthcoming commit for details). This eliminates cache line migration associated with global, protocol-layer data structures in steady state TCP and UDP processing (with the exception of protocol-layer statistics; further commit to follow). Elements of this approach were inspired by Willman, Rixner, and Cox's 2006 USENIX paper, "An Evaluation of Network Stack Parallelization Strategies in Modern Operating Systems". However, there are also significant differences: we maintain the inpcb lock, rather than using the connection group lock for per-connection state. Likewise, the focus of this implementation is alignment with NIC packet distribution strategies such as RSS, rather than pure software strategies. Despite that focus, software distribution is supported through the parallel netisr implementation, and works well in configurations where the number of hardware threads is greater than the number of NIC input queues, such as in the RMI XLR threaded MIPS architecture. Another important difference is the continued maintenance of existing hash tables as "reservation tables" -- these are useful both to distinguish the resource allocation aspect of protocol name management and the more common-case lookup aspect. In configurations where connection tables are aligned with hardware hashes, it is desirable to use the traditional lookup tables for loopback or encapsulated traffic rather than take the expense of hardware hashes that are hard to implement efficiently in software (such as RSS Toeplitz). Connection group support is enabled by compiling "options PCBGROUP" into your kernel configuration; for the time being, this is an experimental feature, and hence is not enabled by default. Subject to the limited MFCability of change dependencies in inpcb, and its change to the inpcbinfo init function signature, this change in principle could be merged to FreeBSD 8.x. Reviewed by: bz Sponsored by: Juniper Networks, Inc.
2011-06-06 12:55:02 +00:00
#ifdef PCBGROUP
struct inpcbgroup *pcbgroup;
#endif
Decompose the current single inpcbinfo lock into two locks: - The existing ipi_lock continues to protect the global inpcb list and inpcb counter. This lock is now relegated to a small number of allocation and free operations, and occasional operations that walk all connections (including, awkwardly, certain UDP multicast receive operations -- something to revisit). - A new ipi_hash_lock protects the two inpcbinfo hash tables for looking up connections and bound sockets, manipulated using new INP_HASH_*() macros. This lock, combined with inpcb locks, protects the 4-tuple address space. Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb connection locks, so may be acquired while manipulating a connection on which a lock is already held, avoiding the need to acquire the inpcbinfo lock preemptively when a binding change might later be required. As a result, however, lookup operations necessarily go through a reference acquire while holding the lookup lock, later acquiring an inpcb lock -- if required. A new function in_pcblookup() looks up connections, and accepts flags indicating how to return the inpcb. Due to lock order changes, callers no longer need acquire locks before performing a lookup: the lookup routine will acquire the ipi_hash_lock as needed. In the future, it will also be able to use alternative lookup and locking strategies transparently to callers, such as pcbgroup lookup. New lookup flags are, supplementing the existing INPLOOKUP_WILDCARD flag: INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb Callers must pass exactly one of these flags (for the time being). Some notes: - All protocols are updated to work within the new regime; especially, TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely eliminated, and global hash lock hold times are dramatically reduced compared to previous locking. - The TCP syncache still relies on the pcbinfo lock, something that we may want to revisit. - Support for reverting to the FreeBSD 7.x locking strategy in TCP input is no longer available -- hash lookup locks are now held only very briefly during inpcb lookup, rather than for potentially extended periods. However, the pcbinfo ipi_lock will still be acquired if a connection state might change such that a connection is added or removed. - Raw IP sockets continue to use the pcbinfo ipi_lock for protection, due to maintaining their own hash tables. - The interface in6_pcblookup_hash_locked() is maintained, which allows callers to acquire hash locks and perform one or more lookups atomically with 4-tuple allocation: this is required only for TCPv6, as there is no in6_pcbconnect_setup(), which there should be. - UDPv6 locking remains significantly more conservative than UDPv4 locking, which relates to source address selection. This needs attention, as it likely significantly reduces parallelism in this code for multithreaded socket use (such as in BIND). - In the UDPv4 and UDPv6 multicast cases, we need to revisit locking somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which is no longer sufficient. A second check once the inpcb lock is held should do the trick, keeping the general case from requiring the inpcb lock for every inpcb visited. - This work reminds us that we need to revisit locking of the v4/v6 flags, which may be accessed lock-free both before and after this change. - Right now, a single lock name is used for the pcbhash lock -- this is undesirable, and probably another argument is required to take care of this (or a char array name field in the pcbinfo?). This is not an MFC candidate for 8.x due to its impact on lookup and locking semantics. It's possible some of these issues could be worked around with compatibility wrappers, if necessary. Reviewed by: bz Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
KASSERT((lookupflags & ~INPLOOKUP_MASK) == 0,
("%s: invalid lookup flags %d", __func__, lookupflags));
KASSERT((lookupflags & (INPLOOKUP_RLOCKPCB | INPLOOKUP_WLOCKPCB)) != 0,
Decompose the current single inpcbinfo lock into two locks: - The existing ipi_lock continues to protect the global inpcb list and inpcb counter. This lock is now relegated to a small number of allocation and free operations, and occasional operations that walk all connections (including, awkwardly, certain UDP multicast receive operations -- something to revisit). - A new ipi_hash_lock protects the two inpcbinfo hash tables for looking up connections and bound sockets, manipulated using new INP_HASH_*() macros. This lock, combined with inpcb locks, protects the 4-tuple address space. Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb connection locks, so may be acquired while manipulating a connection on which a lock is already held, avoiding the need to acquire the inpcbinfo lock preemptively when a binding change might later be required. As a result, however, lookup operations necessarily go through a reference acquire while holding the lookup lock, later acquiring an inpcb lock -- if required. A new function in_pcblookup() looks up connections, and accepts flags indicating how to return the inpcb. Due to lock order changes, callers no longer need acquire locks before performing a lookup: the lookup routine will acquire the ipi_hash_lock as needed. In the future, it will also be able to use alternative lookup and locking strategies transparently to callers, such as pcbgroup lookup. New lookup flags are, supplementing the existing INPLOOKUP_WILDCARD flag: INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb Callers must pass exactly one of these flags (for the time being). Some notes: - All protocols are updated to work within the new regime; especially, TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely eliminated, and global hash lock hold times are dramatically reduced compared to previous locking. - The TCP syncache still relies on the pcbinfo lock, something that we may want to revisit. - Support for reverting to the FreeBSD 7.x locking strategy in TCP input is no longer available -- hash lookup locks are now held only very briefly during inpcb lookup, rather than for potentially extended periods. However, the pcbinfo ipi_lock will still be acquired if a connection state might change such that a connection is added or removed. - Raw IP sockets continue to use the pcbinfo ipi_lock for protection, due to maintaining their own hash tables. - The interface in6_pcblookup_hash_locked() is maintained, which allows callers to acquire hash locks and perform one or more lookups atomically with 4-tuple allocation: this is required only for TCPv6, as there is no in6_pcbconnect_setup(), which there should be. - UDPv6 locking remains significantly more conservative than UDPv4 locking, which relates to source address selection. This needs attention, as it likely significantly reduces parallelism in this code for multithreaded socket use (such as in BIND). - In the UDPv4 and UDPv6 multicast cases, we need to revisit locking somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which is no longer sufficient. A second check once the inpcb lock is held should do the trick, keeping the general case from requiring the inpcb lock for every inpcb visited. - This work reminds us that we need to revisit locking of the v4/v6 flags, which may be accessed lock-free both before and after this change. - Right now, a single lock name is used for the pcbhash lock -- this is undesirable, and probably another argument is required to take care of this (or a char array name field in the pcbinfo?). This is not an MFC candidate for 8.x due to its impact on lookup and locking semantics. It's possible some of these issues could be worked around with compatibility wrappers, if necessary. Reviewed by: bz Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
("%s: LOCKPCB not set", __func__));
Implement a CPU-affine TCP and UDP connection lookup data structure, struct inpcbgroup. pcbgroups, or "connection groups", supplement the existing inpcbinfo connection hash table, which when pcbgroups are enabled, might now be thought of more usefully as a per-protocol 4-tuple reservation table. Connections are assigned to connection groups base on a hash of their 4-tuple; wildcard sockets require special handling, and are members of all connection groups. During a connection lookup, a per-connection group lock is employed rather than the global pcbinfo lock. By aligning connection groups with input path processing, connection groups take on an effective CPU affinity, especially when aligned with RSS work placement (see a forthcoming commit for details). This eliminates cache line migration associated with global, protocol-layer data structures in steady state TCP and UDP processing (with the exception of protocol-layer statistics; further commit to follow). Elements of this approach were inspired by Willman, Rixner, and Cox's 2006 USENIX paper, "An Evaluation of Network Stack Parallelization Strategies in Modern Operating Systems". However, there are also significant differences: we maintain the inpcb lock, rather than using the connection group lock for per-connection state. Likewise, the focus of this implementation is alignment with NIC packet distribution strategies such as RSS, rather than pure software strategies. Despite that focus, software distribution is supported through the parallel netisr implementation, and works well in configurations where the number of hardware threads is greater than the number of NIC input queues, such as in the RMI XLR threaded MIPS architecture. Another important difference is the continued maintenance of existing hash tables as "reservation tables" -- these are useful both to distinguish the resource allocation aspect of protocol name management and the more common-case lookup aspect. In configurations where connection tables are aligned with hardware hashes, it is desirable to use the traditional lookup tables for loopback or encapsulated traffic rather than take the expense of hardware hashes that are hard to implement efficiently in software (such as RSS Toeplitz). Connection group support is enabled by compiling "options PCBGROUP" into your kernel configuration; for the time being, this is an experimental feature, and hence is not enabled by default. Subject to the limited MFCability of change dependencies in inpcb, and its change to the inpcbinfo init function signature, this change in principle could be merged to FreeBSD 8.x. Reviewed by: bz Sponsored by: Juniper Networks, Inc.
2011-06-06 12:55:02 +00:00
#ifdef PCBGROUP
Several years after initial development, merge prototype support for linking NIC Receive Side Scaling (RSS) to the network stack's connection-group implementation. This prototype (and derived patches) are in use at Juniper and several other FreeBSD-using companies, so despite some reservations about its maturity, merge the patch to the base tree so that it can be iteratively refined in collaboration rather than maintained as a set of gradually diverging patch sets. (1) Merge a software implementation of the Toeplitz hash specified in RSS implemented by David Malone. This is used to allow suitable pcbgroup placement of connections before the first packet is received from the NIC. Software hashing is generally avoided, however, due to high cost of the hash on general-purpose CPUs. (2) In in_rss.c, maintain authoritative versions of RSS state intended to be pushed to each NIC, including keying material, hash algorithm/ configuration, and buckets. Provide software-facing interfaces to hash 2- and 4-tuples for IPv4 and IPv6 using both the RSS standardised Toeplitz and a 'naive' variation with a hash efficient in software but with poor distribution properties. Implement rss_m2cpuid()to be used by netisr and other load balancing code to look up the CPU on which an mbuf should be processed. (3) In the Ethernet link layer, allow netisr distribution using RSS as a source of policy as an alternative to source ordering; continue to default to direct dispatch (i.e., don't try and requeue packets for processing on the 'right' CPU if they arrive in a directly dispatchable context). (4) Allow RSS to control tuning of connection groups in order to align groups with RSS buckets. If a packet arrives on a protocol using connection groups, and contains a suitable hardware-generated hash, use that hash value to select the connection group for pcb lookup for both IPv4 and IPv6. If no hardware-generated Toeplitz hash is available, we fall back on regular PCB lookup risking contention rather than pay the cost of Toeplitz in software -- this is a less scalable but, at my last measurement, faster approach. As core counts go up, we may want to revise this strategy despite CPU overhead. Where device drivers suitably configure NICs, and connection groups / RSS are enabled, this should avoid both lock and line contention during connection lookup for TCP. This commit does not modify any device drivers to tune device RSS configuration to the global RSS configuration; patches are in circulation to do this for at least Chelsio T3 and Intel 1G/10G drivers. Currently, the KPI for device drivers is not particularly robust, nor aware of more advanced features such as runtime reconfiguration/rebalancing. This will hopefully prove a useful starting point for refinement. No MFC is scheduled as we will first want to nail down a more mature and maintainable KPI/KBI for device drivers. Sponsored by: Juniper Networks (original work) Sponsored by: EMC/Isilon (patch update and merge)
2014-03-15 00:57:50 +00:00
/*
* If we can use a hardware-generated hash to look up the connection
* group, use that connection group to find the inpcb. Otherwise
* fall back on a software hash -- or the reservation table if we're
* using RSS.
*
* XXXRW: As above, that policy belongs in the pcbgroup code.
*/
if (in_pcbgroup_enabled(pcbinfo) &&
M_HASHTYPE_TEST(m, M_HASHTYPE_NONE) == 0) {
Implement a CPU-affine TCP and UDP connection lookup data structure, struct inpcbgroup. pcbgroups, or "connection groups", supplement the existing inpcbinfo connection hash table, which when pcbgroups are enabled, might now be thought of more usefully as a per-protocol 4-tuple reservation table. Connections are assigned to connection groups base on a hash of their 4-tuple; wildcard sockets require special handling, and are members of all connection groups. During a connection lookup, a per-connection group lock is employed rather than the global pcbinfo lock. By aligning connection groups with input path processing, connection groups take on an effective CPU affinity, especially when aligned with RSS work placement (see a forthcoming commit for details). This eliminates cache line migration associated with global, protocol-layer data structures in steady state TCP and UDP processing (with the exception of protocol-layer statistics; further commit to follow). Elements of this approach were inspired by Willman, Rixner, and Cox's 2006 USENIX paper, "An Evaluation of Network Stack Parallelization Strategies in Modern Operating Systems". However, there are also significant differences: we maintain the inpcb lock, rather than using the connection group lock for per-connection state. Likewise, the focus of this implementation is alignment with NIC packet distribution strategies such as RSS, rather than pure software strategies. Despite that focus, software distribution is supported through the parallel netisr implementation, and works well in configurations where the number of hardware threads is greater than the number of NIC input queues, such as in the RMI XLR threaded MIPS architecture. Another important difference is the continued maintenance of existing hash tables as "reservation tables" -- these are useful both to distinguish the resource allocation aspect of protocol name management and the more common-case lookup aspect. In configurations where connection tables are aligned with hardware hashes, it is desirable to use the traditional lookup tables for loopback or encapsulated traffic rather than take the expense of hardware hashes that are hard to implement efficiently in software (such as RSS Toeplitz). Connection group support is enabled by compiling "options PCBGROUP" into your kernel configuration; for the time being, this is an experimental feature, and hence is not enabled by default. Subject to the limited MFCability of change dependencies in inpcb, and its change to the inpcbinfo init function signature, this change in principle could be merged to FreeBSD 8.x. Reviewed by: bz Sponsored by: Juniper Networks, Inc.
2011-06-06 12:55:02 +00:00
pcbgroup = in6_pcbgroup_byhash(pcbinfo, M_HASHTYPE_GET(m),
m->m_pkthdr.flowid);
if (pcbgroup != NULL)
return (in6_pcblookup_group(pcbinfo, pcbgroup, faddr,
fport, laddr, lport, lookupflags, ifp));
Several years after initial development, merge prototype support for linking NIC Receive Side Scaling (RSS) to the network stack's connection-group implementation. This prototype (and derived patches) are in use at Juniper and several other FreeBSD-using companies, so despite some reservations about its maturity, merge the patch to the base tree so that it can be iteratively refined in collaboration rather than maintained as a set of gradually diverging patch sets. (1) Merge a software implementation of the Toeplitz hash specified in RSS implemented by David Malone. This is used to allow suitable pcbgroup placement of connections before the first packet is received from the NIC. Software hashing is generally avoided, however, due to high cost of the hash on general-purpose CPUs. (2) In in_rss.c, maintain authoritative versions of RSS state intended to be pushed to each NIC, including keying material, hash algorithm/ configuration, and buckets. Provide software-facing interfaces to hash 2- and 4-tuples for IPv4 and IPv6 using both the RSS standardised Toeplitz and a 'naive' variation with a hash efficient in software but with poor distribution properties. Implement rss_m2cpuid()to be used by netisr and other load balancing code to look up the CPU on which an mbuf should be processed. (3) In the Ethernet link layer, allow netisr distribution using RSS as a source of policy as an alternative to source ordering; continue to default to direct dispatch (i.e., don't try and requeue packets for processing on the 'right' CPU if they arrive in a directly dispatchable context). (4) Allow RSS to control tuning of connection groups in order to align groups with RSS buckets. If a packet arrives on a protocol using connection groups, and contains a suitable hardware-generated hash, use that hash value to select the connection group for pcb lookup for both IPv4 and IPv6. If no hardware-generated Toeplitz hash is available, we fall back on regular PCB lookup risking contention rather than pay the cost of Toeplitz in software -- this is a less scalable but, at my last measurement, faster approach. As core counts go up, we may want to revise this strategy despite CPU overhead. Where device drivers suitably configure NICs, and connection groups / RSS are enabled, this should avoid both lock and line contention during connection lookup for TCP. This commit does not modify any device drivers to tune device RSS configuration to the global RSS configuration; patches are in circulation to do this for at least Chelsio T3 and Intel 1G/10G drivers. Currently, the KPI for device drivers is not particularly robust, nor aware of more advanced features such as runtime reconfiguration/rebalancing. This will hopefully prove a useful starting point for refinement. No MFC is scheduled as we will first want to nail down a more mature and maintainable KPI/KBI for device drivers. Sponsored by: Juniper Networks (original work) Sponsored by: EMC/Isilon (patch update and merge)
2014-03-15 00:57:50 +00:00
#ifndef RSS
Implement a CPU-affine TCP and UDP connection lookup data structure, struct inpcbgroup. pcbgroups, or "connection groups", supplement the existing inpcbinfo connection hash table, which when pcbgroups are enabled, might now be thought of more usefully as a per-protocol 4-tuple reservation table. Connections are assigned to connection groups base on a hash of their 4-tuple; wildcard sockets require special handling, and are members of all connection groups. During a connection lookup, a per-connection group lock is employed rather than the global pcbinfo lock. By aligning connection groups with input path processing, connection groups take on an effective CPU affinity, especially when aligned with RSS work placement (see a forthcoming commit for details). This eliminates cache line migration associated with global, protocol-layer data structures in steady state TCP and UDP processing (with the exception of protocol-layer statistics; further commit to follow). Elements of this approach were inspired by Willman, Rixner, and Cox's 2006 USENIX paper, "An Evaluation of Network Stack Parallelization Strategies in Modern Operating Systems". However, there are also significant differences: we maintain the inpcb lock, rather than using the connection group lock for per-connection state. Likewise, the focus of this implementation is alignment with NIC packet distribution strategies such as RSS, rather than pure software strategies. Despite that focus, software distribution is supported through the parallel netisr implementation, and works well in configurations where the number of hardware threads is greater than the number of NIC input queues, such as in the RMI XLR threaded MIPS architecture. Another important difference is the continued maintenance of existing hash tables as "reservation tables" -- these are useful both to distinguish the resource allocation aspect of protocol name management and the more common-case lookup aspect. In configurations where connection tables are aligned with hardware hashes, it is desirable to use the traditional lookup tables for loopback or encapsulated traffic rather than take the expense of hardware hashes that are hard to implement efficiently in software (such as RSS Toeplitz). Connection group support is enabled by compiling "options PCBGROUP" into your kernel configuration; for the time being, this is an experimental feature, and hence is not enabled by default. Subject to the limited MFCability of change dependencies in inpcb, and its change to the inpcbinfo init function signature, this change in principle could be merged to FreeBSD 8.x. Reviewed by: bz Sponsored by: Juniper Networks, Inc.
2011-06-06 12:55:02 +00:00
pcbgroup = in6_pcbgroup_bytuple(pcbinfo, laddr, lport, faddr,
fport);
return (in6_pcblookup_group(pcbinfo, pcbgroup, faddr, fport,
laddr, lport, lookupflags, ifp));
Several years after initial development, merge prototype support for linking NIC Receive Side Scaling (RSS) to the network stack's connection-group implementation. This prototype (and derived patches) are in use at Juniper and several other FreeBSD-using companies, so despite some reservations about its maturity, merge the patch to the base tree so that it can be iteratively refined in collaboration rather than maintained as a set of gradually diverging patch sets. (1) Merge a software implementation of the Toeplitz hash specified in RSS implemented by David Malone. This is used to allow suitable pcbgroup placement of connections before the first packet is received from the NIC. Software hashing is generally avoided, however, due to high cost of the hash on general-purpose CPUs. (2) In in_rss.c, maintain authoritative versions of RSS state intended to be pushed to each NIC, including keying material, hash algorithm/ configuration, and buckets. Provide software-facing interfaces to hash 2- and 4-tuples for IPv4 and IPv6 using both the RSS standardised Toeplitz and a 'naive' variation with a hash efficient in software but with poor distribution properties. Implement rss_m2cpuid()to be used by netisr and other load balancing code to look up the CPU on which an mbuf should be processed. (3) In the Ethernet link layer, allow netisr distribution using RSS as a source of policy as an alternative to source ordering; continue to default to direct dispatch (i.e., don't try and requeue packets for processing on the 'right' CPU if they arrive in a directly dispatchable context). (4) Allow RSS to control tuning of connection groups in order to align groups with RSS buckets. If a packet arrives on a protocol using connection groups, and contains a suitable hardware-generated hash, use that hash value to select the connection group for pcb lookup for both IPv4 and IPv6. If no hardware-generated Toeplitz hash is available, we fall back on regular PCB lookup risking contention rather than pay the cost of Toeplitz in software -- this is a less scalable but, at my last measurement, faster approach. As core counts go up, we may want to revise this strategy despite CPU overhead. Where device drivers suitably configure NICs, and connection groups / RSS are enabled, this should avoid both lock and line contention during connection lookup for TCP. This commit does not modify any device drivers to tune device RSS configuration to the global RSS configuration; patches are in circulation to do this for at least Chelsio T3 and Intel 1G/10G drivers. Currently, the KPI for device drivers is not particularly robust, nor aware of more advanced features such as runtime reconfiguration/rebalancing. This will hopefully prove a useful starting point for refinement. No MFC is scheduled as we will first want to nail down a more mature and maintainable KPI/KBI for device drivers. Sponsored by: Juniper Networks (original work) Sponsored by: EMC/Isilon (patch update and merge)
2014-03-15 00:57:50 +00:00
#endif
Implement a CPU-affine TCP and UDP connection lookup data structure, struct inpcbgroup. pcbgroups, or "connection groups", supplement the existing inpcbinfo connection hash table, which when pcbgroups are enabled, might now be thought of more usefully as a per-protocol 4-tuple reservation table. Connections are assigned to connection groups base on a hash of their 4-tuple; wildcard sockets require special handling, and are members of all connection groups. During a connection lookup, a per-connection group lock is employed rather than the global pcbinfo lock. By aligning connection groups with input path processing, connection groups take on an effective CPU affinity, especially when aligned with RSS work placement (see a forthcoming commit for details). This eliminates cache line migration associated with global, protocol-layer data structures in steady state TCP and UDP processing (with the exception of protocol-layer statistics; further commit to follow). Elements of this approach were inspired by Willman, Rixner, and Cox's 2006 USENIX paper, "An Evaluation of Network Stack Parallelization Strategies in Modern Operating Systems". However, there are also significant differences: we maintain the inpcb lock, rather than using the connection group lock for per-connection state. Likewise, the focus of this implementation is alignment with NIC packet distribution strategies such as RSS, rather than pure software strategies. Despite that focus, software distribution is supported through the parallel netisr implementation, and works well in configurations where the number of hardware threads is greater than the number of NIC input queues, such as in the RMI XLR threaded MIPS architecture. Another important difference is the continued maintenance of existing hash tables as "reservation tables" -- these are useful both to distinguish the resource allocation aspect of protocol name management and the more common-case lookup aspect. In configurations where connection tables are aligned with hardware hashes, it is desirable to use the traditional lookup tables for loopback or encapsulated traffic rather than take the expense of hardware hashes that are hard to implement efficiently in software (such as RSS Toeplitz). Connection group support is enabled by compiling "options PCBGROUP" into your kernel configuration; for the time being, this is an experimental feature, and hence is not enabled by default. Subject to the limited MFCability of change dependencies in inpcb, and its change to the inpcbinfo init function signature, this change in principle could be merged to FreeBSD 8.x. Reviewed by: bz Sponsored by: Juniper Networks, Inc.
2011-06-06 12:55:02 +00:00
}
#endif
Decompose the current single inpcbinfo lock into two locks: - The existing ipi_lock continues to protect the global inpcb list and inpcb counter. This lock is now relegated to a small number of allocation and free operations, and occasional operations that walk all connections (including, awkwardly, certain UDP multicast receive operations -- something to revisit). - A new ipi_hash_lock protects the two inpcbinfo hash tables for looking up connections and bound sockets, manipulated using new INP_HASH_*() macros. This lock, combined with inpcb locks, protects the 4-tuple address space. Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb connection locks, so may be acquired while manipulating a connection on which a lock is already held, avoiding the need to acquire the inpcbinfo lock preemptively when a binding change might later be required. As a result, however, lookup operations necessarily go through a reference acquire while holding the lookup lock, later acquiring an inpcb lock -- if required. A new function in_pcblookup() looks up connections, and accepts flags indicating how to return the inpcb. Due to lock order changes, callers no longer need acquire locks before performing a lookup: the lookup routine will acquire the ipi_hash_lock as needed. In the future, it will also be able to use alternative lookup and locking strategies transparently to callers, such as pcbgroup lookup. New lookup flags are, supplementing the existing INPLOOKUP_WILDCARD flag: INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb Callers must pass exactly one of these flags (for the time being). Some notes: - All protocols are updated to work within the new regime; especially, TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely eliminated, and global hash lock hold times are dramatically reduced compared to previous locking. - The TCP syncache still relies on the pcbinfo lock, something that we may want to revisit. - Support for reverting to the FreeBSD 7.x locking strategy in TCP input is no longer available -- hash lookup locks are now held only very briefly during inpcb lookup, rather than for potentially extended periods. However, the pcbinfo ipi_lock will still be acquired if a connection state might change such that a connection is added or removed. - Raw IP sockets continue to use the pcbinfo ipi_lock for protection, due to maintaining their own hash tables. - The interface in6_pcblookup_hash_locked() is maintained, which allows callers to acquire hash locks and perform one or more lookups atomically with 4-tuple allocation: this is required only for TCPv6, as there is no in6_pcbconnect_setup(), which there should be. - UDPv6 locking remains significantly more conservative than UDPv4 locking, which relates to source address selection. This needs attention, as it likely significantly reduces parallelism in this code for multithreaded socket use (such as in BIND). - In the UDPv4 and UDPv6 multicast cases, we need to revisit locking somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which is no longer sufficient. A second check once the inpcb lock is held should do the trick, keeping the general case from requiring the inpcb lock for every inpcb visited. - This work reminds us that we need to revisit locking of the v4/v6 flags, which may be accessed lock-free both before and after this change. - Right now, a single lock name is used for the pcbhash lock -- this is undesirable, and probably another argument is required to take care of this (or a char array name field in the pcbinfo?). This is not an MFC candidate for 8.x due to its impact on lookup and locking semantics. It's possible some of these issues could be worked around with compatibility wrappers, if necessary. Reviewed by: bz Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
return (in6_pcblookup_hash(pcbinfo, faddr, fport, laddr, lport,
Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain In order to efficiently serve web traffic on a NUMA machine, one must avoid as many NUMA domain crossings as possible. With SO_REUSEPORT_LB, a number of workers can share a listen socket. However, even if a worker sets affinity to a core or set of cores on a NUMA domain, it will receive connections associated with all NUMA domains in the system. This will lead to cross-domain traffic when the server writes to the socket or calls sendfile(), and memory is allocated on the server's local NUMA node, but transmitted on the NUMA node associated with the TCP connection. Similarly, when the server reads from the socket, he will likely be reading memory allocated on the NUMA domain associated with the TCP connection. This change provides a new socket ioctl, TCP_REUSPORT_LB_NUMA. A server can now tell the kernel to filter traffic so that only incoming connections associated with the desired NUMA domain are given to the server. (Of course, in the case where there are no servers sharing the listen socket on some domain, then as a fallback, traffic will be hashed as normal to all servers sharing the listen socket regardless of domain). This allows a server to deal only with traffic that is local to its NUMA domain, and avoids cross-domain traffic in most cases. This patch, and a corresponding small patch to nginx to use TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted https media content from dual-socket Xeons with only 13% (as measured by pcm.x) cross domain traffic on the memory controller. Reviewed by: jhb, bz (earlier version), bcr (man page) Tested by: gonzo Sponsored by: Netfix Differential Revision: https://reviews.freebsd.org/D21636
2020-12-19 22:04:46 +00:00
lookupflags, ifp, m->m_pkthdr.numa_domain));
Decompose the current single inpcbinfo lock into two locks: - The existing ipi_lock continues to protect the global inpcb list and inpcb counter. This lock is now relegated to a small number of allocation and free operations, and occasional operations that walk all connections (including, awkwardly, certain UDP multicast receive operations -- something to revisit). - A new ipi_hash_lock protects the two inpcbinfo hash tables for looking up connections and bound sockets, manipulated using new INP_HASH_*() macros. This lock, combined with inpcb locks, protects the 4-tuple address space. Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb connection locks, so may be acquired while manipulating a connection on which a lock is already held, avoiding the need to acquire the inpcbinfo lock preemptively when a binding change might later be required. As a result, however, lookup operations necessarily go through a reference acquire while holding the lookup lock, later acquiring an inpcb lock -- if required. A new function in_pcblookup() looks up connections, and accepts flags indicating how to return the inpcb. Due to lock order changes, callers no longer need acquire locks before performing a lookup: the lookup routine will acquire the ipi_hash_lock as needed. In the future, it will also be able to use alternative lookup and locking strategies transparently to callers, such as pcbgroup lookup. New lookup flags are, supplementing the existing INPLOOKUP_WILDCARD flag: INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb Callers must pass exactly one of these flags (for the time being). Some notes: - All protocols are updated to work within the new regime; especially, TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely eliminated, and global hash lock hold times are dramatically reduced compared to previous locking. - The TCP syncache still relies on the pcbinfo lock, something that we may want to revisit. - Support for reverting to the FreeBSD 7.x locking strategy in TCP input is no longer available -- hash lookup locks are now held only very briefly during inpcb lookup, rather than for potentially extended periods. However, the pcbinfo ipi_lock will still be acquired if a connection state might change such that a connection is added or removed. - Raw IP sockets continue to use the pcbinfo ipi_lock for protection, due to maintaining their own hash tables. - The interface in6_pcblookup_hash_locked() is maintained, which allows callers to acquire hash locks and perform one or more lookups atomically with 4-tuple allocation: this is required only for TCPv6, as there is no in6_pcbconnect_setup(), which there should be. - UDPv6 locking remains significantly more conservative than UDPv4 locking, which relates to source address selection. This needs attention, as it likely significantly reduces parallelism in this code for multithreaded socket use (such as in BIND). - In the UDPv4 and UDPv6 multicast cases, we need to revisit locking somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which is no longer sufficient. A second check once the inpcb lock is held should do the trick, keeping the general case from requiring the inpcb lock for every inpcb visited. - This work reminds us that we need to revisit locking of the v4/v6 flags, which may be accessed lock-free both before and after this change. - Right now, a single lock name is used for the pcbhash lock -- this is undesirable, and probably another argument is required to take care of this (or a char array name field in the pcbinfo?). This is not an MFC candidate for 8.x due to its impact on lookup and locking semantics. It's possible some of these issues could be worked around with compatibility wrappers, if necessary. Reviewed by: bz Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
}
void
init_sin6(struct sockaddr_in6 *sin6, struct mbuf *m, int srcordst)
{
struct ip6_hdr *ip;
ip = mtod(m, struct ip6_hdr *);
bzero(sin6, sizeof(*sin6));
sin6->sin6_len = sizeof(*sin6);
sin6->sin6_family = AF_INET6;
sin6->sin6_addr = srcordst ? ip->ip6_dst : ip->ip6_src;
(void)sa6_recoverscope(sin6); /* XXX: should catch errors... */
return;
}