2005-01-07 02:30:35 +00:00
|
|
|
/*-
|
2017-11-20 19:43:44 +00:00
|
|
|
* SPDX-License-Identifier: BSD-3-Clause
|
|
|
|
*
|
1999-11-22 02:45:11 +00:00
|
|
|
* Copyright (C) 1995, 1996, 1997, and 1998 WIDE Project.
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
|
|
|
* Copyright (c) 2010-2011 Juniper Networks, Inc.
|
1999-11-22 02:45:11 +00:00
|
|
|
* All rights reserved.
|
|
|
|
*
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
|
|
|
* Portions of this software were developed by Robert N. M. Watson under
|
|
|
|
* contract to Juniper Networks, Inc.
|
|
|
|
*
|
1999-11-22 02:45:11 +00:00
|
|
|
* Redistribution and use in source and binary forms, with or without
|
|
|
|
* modification, are permitted provided that the following conditions
|
|
|
|
* are met:
|
|
|
|
* 1. Redistributions of source code must retain the above copyright
|
|
|
|
* notice, this list of conditions and the following disclaimer.
|
|
|
|
* 2. Redistributions in binary form must reproduce the above copyright
|
|
|
|
* notice, this list of conditions and the following disclaimer in the
|
|
|
|
* documentation and/or other materials provided with the distribution.
|
|
|
|
* 3. Neither the name of the project nor the names of its contributors
|
|
|
|
* may be used to endorse or promote products derived from this software
|
|
|
|
* without specific prior written permission.
|
|
|
|
*
|
|
|
|
* THIS SOFTWARE IS PROVIDED BY THE PROJECT AND CONTRIBUTORS ``AS IS'' AND
|
|
|
|
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
|
|
|
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
|
|
|
* ARE DISCLAIMED. IN NO EVENT SHALL THE PROJECT OR CONTRIBUTORS BE LIABLE
|
|
|
|
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
|
|
|
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
|
|
|
|
* OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
|
|
|
|
* HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
|
|
|
|
* LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
|
|
|
|
* OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
|
|
|
|
* SUCH DAMAGE.
|
|
|
|
*
|
2007-12-10 16:03:40 +00:00
|
|
|
* $KAME: in6_pcb.c,v 1.31 2001/05/21 05:45:10 jinmei Exp $
|
1999-11-22 02:45:11 +00:00
|
|
|
*/
|
|
|
|
|
2005-01-07 02:30:35 +00:00
|
|
|
/*-
|
1999-11-22 02:45:11 +00:00
|
|
|
* Copyright (c) 1982, 1986, 1991, 1993
|
|
|
|
* The Regents of the University of California. All rights reserved.
|
|
|
|
*
|
|
|
|
* Redistribution and use in source and binary forms, with or without
|
|
|
|
* modification, are permitted provided that the following conditions
|
|
|
|
* are met:
|
|
|
|
* 1. Redistributions of source code must retain the above copyright
|
|
|
|
* notice, this list of conditions and the following disclaimer.
|
|
|
|
* 2. Redistributions in binary form must reproduce the above copyright
|
|
|
|
* notice, this list of conditions and the following disclaimer in the
|
|
|
|
* documentation and/or other materials provided with the distribution.
|
2017-02-28 23:42:47 +00:00
|
|
|
* 3. Neither the name of the University nor the names of its contributors
|
1999-11-22 02:45:11 +00:00
|
|
|
* may be used to endorse or promote products derived from this software
|
|
|
|
* without specific prior written permission.
|
|
|
|
*
|
|
|
|
* THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
|
|
|
|
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
|
|
|
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
|
|
|
* ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
|
|
|
|
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
|
|
|
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
|
|
|
|
* OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
|
|
|
|
* HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
|
|
|
|
* LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
|
|
|
|
* OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
|
|
|
|
* SUCH DAMAGE.
|
|
|
|
*
|
|
|
|
* @(#)in_pcb.c 8.2 (Berkeley) 1/4/94
|
|
|
|
*/
|
|
|
|
|
2007-12-10 16:03:40 +00:00
|
|
|
#include <sys/cdefs.h>
|
|
|
|
__FBSDID("$FreeBSD$");
|
|
|
|
|
2001-06-11 12:39:29 +00:00
|
|
|
#include "opt_inet.h"
|
|
|
|
#include "opt_inet6.h"
|
1999-12-22 19:13:38 +00:00
|
|
|
#include "opt_ipsec.h"
|
Implement a CPU-affine TCP and UDP connection lookup data structure,
struct inpcbgroup. pcbgroups, or "connection groups", supplement the
existing inpcbinfo connection hash table, which when pcbgroups are
enabled, might now be thought of more usefully as a per-protocol
4-tuple reservation table.
Connections are assigned to connection groups base on a hash of their
4-tuple; wildcard sockets require special handling, and are members
of all connection groups. During a connection lookup, a
per-connection group lock is employed rather than the global pcbinfo
lock. By aligning connection groups with input path processing,
connection groups take on an effective CPU affinity, especially when
aligned with RSS work placement (see a forthcoming commit for
details). This eliminates cache line migration associated with
global, protocol-layer data structures in steady state TCP and UDP
processing (with the exception of protocol-layer statistics; further
commit to follow).
Elements of this approach were inspired by Willman, Rixner, and Cox's
2006 USENIX paper, "An Evaluation of Network Stack Parallelization
Strategies in Modern Operating Systems". However, there are also
significant differences: we maintain the inpcb lock, rather than using
the connection group lock for per-connection state.
Likewise, the focus of this implementation is alignment with NIC
packet distribution strategies such as RSS, rather than pure software
strategies. Despite that focus, software distribution is supported
through the parallel netisr implementation, and works well in
configurations where the number of hardware threads is greater than
the number of NIC input queues, such as in the RMI XLR threaded MIPS
architecture.
Another important difference is the continued maintenance of existing
hash tables as "reservation tables" -- these are useful both to
distinguish the resource allocation aspect of protocol name management
and the more common-case lookup aspect. In configurations where
connection tables are aligned with hardware hashes, it is desirable to
use the traditional lookup tables for loopback or encapsulated traffic
rather than take the expense of hardware hashes that are hard to
implement efficiently in software (such as RSS Toeplitz).
Connection group support is enabled by compiling "options PCBGROUP"
into your kernel configuration; for the time being, this is an
experimental feature, and hence is not enabled by default.
Subject to the limited MFCability of change dependencies in inpcb,
and its change to the inpcbinfo init function signature, this change
in principle could be merged to FreeBSD 8.x.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-06-06 12:55:02 +00:00
|
|
|
#include "opt_pcbgroup.h"
|
2020-10-18 17:15:47 +00:00
|
|
|
#include "opt_route.h"
|
Several years after initial development, merge prototype support for
linking NIC Receive Side Scaling (RSS) to the network stack's
connection-group implementation. This prototype (and derived patches)
are in use at Juniper and several other FreeBSD-using companies, so
despite some reservations about its maturity, merge the patch to the
base tree so that it can be iteratively refined in collaboration rather
than maintained as a set of gradually diverging patch sets.
(1) Merge a software implementation of the Toeplitz hash specified in
RSS implemented by David Malone. This is used to allow suitable
pcbgroup placement of connections before the first packet is
received from the NIC. Software hashing is generally avoided,
however, due to high cost of the hash on general-purpose CPUs.
(2) In in_rss.c, maintain authoritative versions of RSS state intended
to be pushed to each NIC, including keying material, hash
algorithm/ configuration, and buckets. Provide software-facing
interfaces to hash 2- and 4-tuples for IPv4 and IPv6 using both
the RSS standardised Toeplitz and a 'naive' variation with a hash
efficient in software but with poor distribution properties.
Implement rss_m2cpuid()to be used by netisr and other load
balancing code to look up the CPU on which an mbuf should be
processed.
(3) In the Ethernet link layer, allow netisr distribution using RSS as
a source of policy as an alternative to source ordering; continue
to default to direct dispatch (i.e., don't try and requeue packets
for processing on the 'right' CPU if they arrive in a directly
dispatchable context).
(4) Allow RSS to control tuning of connection groups in order to align
groups with RSS buckets. If a packet arrives on a protocol using
connection groups, and contains a suitable hardware-generated
hash, use that hash value to select the connection group for pcb
lookup for both IPv4 and IPv6. If no hardware-generated Toeplitz
hash is available, we fall back on regular PCB lookup risking
contention rather than pay the cost of Toeplitz in software --
this is a less scalable but, at my last measurement, faster
approach. As core counts go up, we may want to revise this
strategy despite CPU overhead.
Where device drivers suitably configure NICs, and connection groups /
RSS are enabled, this should avoid both lock and line contention during
connection lookup for TCP. This commit does not modify any device
drivers to tune device RSS configuration to the global RSS
configuration; patches are in circulation to do this for at least
Chelsio T3 and Intel 1G/10G drivers. Currently, the KPI for device
drivers is not particularly robust, nor aware of more advanced features
such as runtime reconfiguration/rebalancing. This will hopefully prove
a useful starting point for refinement.
No MFC is scheduled as we will first want to nail down a more mature
and maintainable KPI/KBI for device drivers.
Sponsored by: Juniper Networks (original work)
Sponsored by: EMC/Isilon (patch update and merge)
2014-03-15 00:57:50 +00:00
|
|
|
#include "opt_rss.h"
|
1999-11-22 02:45:11 +00:00
|
|
|
|
|
|
|
#include <sys/param.h>
|
|
|
|
#include <sys/systm.h>
|
|
|
|
#include <sys/malloc.h>
|
|
|
|
#include <sys/mbuf.h>
|
2000-01-09 19:17:30 +00:00
|
|
|
#include <sys/domain.h>
|
1999-11-22 02:45:11 +00:00
|
|
|
#include <sys/protosw.h>
|
|
|
|
#include <sys/socket.h>
|
|
|
|
#include <sys/socketvar.h>
|
|
|
|
#include <sys/sockio.h>
|
|
|
|
#include <sys/errno.h>
|
|
|
|
#include <sys/time.h>
|
2006-11-06 13:42:10 +00:00
|
|
|
#include <sys/priv.h>
|
1999-11-22 02:45:11 +00:00
|
|
|
#include <sys/proc.h>
|
|
|
|
#include <sys/jail.h>
|
|
|
|
|
2002-03-20 08:03:54 +00:00
|
|
|
#include <vm/uma.h>
|
1999-11-22 02:45:11 +00:00
|
|
|
|
|
|
|
#include <net/if.h>
|
2013-10-26 17:58:36 +00:00
|
|
|
#include <net/if_var.h>
|
2016-06-02 17:51:29 +00:00
|
|
|
#include <net/if_llatbl.h>
|
1999-11-22 02:45:11 +00:00
|
|
|
#include <net/if_types.h>
|
|
|
|
#include <net/route.h>
|
2020-04-25 09:06:11 +00:00
|
|
|
#include <net/route/nhop.h>
|
1999-11-22 02:45:11 +00:00
|
|
|
|
|
|
|
#include <netinet/in.h>
|
|
|
|
#include <netinet/in_var.h>
|
|
|
|
#include <netinet/in_systm.h>
|
2002-06-10 20:05:46 +00:00
|
|
|
#include <netinet/tcp_var.h>
|
2000-07-04 16:35:15 +00:00
|
|
|
#include <netinet/ip6.h>
|
2000-01-09 19:17:30 +00:00
|
|
|
#include <netinet/ip_var.h>
|
2008-08-20 01:05:56 +00:00
|
|
|
|
1999-11-22 02:45:11 +00:00
|
|
|
#include <netinet6/ip6_var.h>
|
|
|
|
#include <netinet6/nd6.h>
|
|
|
|
#include <netinet/in_pcb.h>
|
|
|
|
#include <netinet6/in6_pcb.h>
|
2020-04-25 09:06:11 +00:00
|
|
|
#include <netinet6/in6_fib.h>
|
2005-07-25 12:31:43 +00:00
|
|
|
#include <netinet6/scope6_var.h>
|
1999-11-22 02:45:11 +00:00
|
|
|
|
|
|
|
int
|
2017-05-17 00:34:34 +00:00
|
|
|
in6_pcbbind(struct inpcb *inp, struct sockaddr *nam,
|
2007-07-05 16:23:49 +00:00
|
|
|
struct ucred *cred)
|
1999-11-22 02:45:11 +00:00
|
|
|
{
|
|
|
|
struct socket *so = inp->inp_socket;
|
|
|
|
struct sockaddr_in6 *sin6 = (struct sockaddr_in6 *)NULL;
|
|
|
|
struct inpcbinfo *pcbinfo = inp->inp_pcbinfo;
|
|
|
|
u_short lport = 0;
|
2011-05-23 15:23:18 +00:00
|
|
|
int error, lookupflags = 0;
|
|
|
|
int reuseport = (so->so_options & SO_REUSEPORT);
|
1999-11-22 02:45:11 +00:00
|
|
|
|
2018-06-06 15:45:57 +00:00
|
|
|
/*
|
|
|
|
* XXX: Maybe we could let SO_REUSEPORT_LB set SO_REUSEPORT bit here
|
|
|
|
* so that we don't have to add to the (already messy) code below.
|
|
|
|
*/
|
|
|
|
int reuseport_lb = (so->so_options & SO_REUSEPORT_LB);
|
|
|
|
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK_ASSERT(inp);
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
|
|
|
INP_HASH_WLOCK_ASSERT(pcbinfo);
|
2004-07-27 23:44:03 +00:00
|
|
|
|
2018-05-18 20:13:34 +00:00
|
|
|
if (CK_STAILQ_EMPTY(&V_in6_ifaddrhead)) /* XXX broken! */
|
1999-11-22 02:45:11 +00:00
|
|
|
return (EADDRNOTAVAIL);
|
|
|
|
if (inp->inp_lport || !IN6_IS_ADDR_UNSPECIFIED(&inp->in6p_laddr))
|
2003-10-06 14:02:09 +00:00
|
|
|
return (EINVAL);
|
2018-06-06 15:45:57 +00:00
|
|
|
if ((so->so_options & (SO_REUSEADDR|SO_REUSEPORT|SO_REUSEPORT_LB)) == 0)
|
2011-05-23 15:23:18 +00:00
|
|
|
lookupflags = INPLOOKUP_WILDCARD;
|
2017-02-10 05:58:16 +00:00
|
|
|
if (nam == NULL) {
|
2009-02-05 14:25:53 +00:00
|
|
|
if ((error = prison_local_ip6(cred, &inp->in6p_laddr,
|
|
|
|
((inp->inp_flags & IN6P_IPV6_V6ONLY) != 0))) != 0)
|
|
|
|
return (error);
|
|
|
|
} else {
|
1999-11-22 02:45:11 +00:00
|
|
|
sin6 = (struct sockaddr_in6 *)nam;
|
2021-05-03 16:51:04 +00:00
|
|
|
KASSERT(sin6->sin6_family == AF_INET6,
|
|
|
|
("%s: invalid address family for %p", __func__, sin6));
|
|
|
|
KASSERT(sin6->sin6_len == sizeof(*sin6),
|
|
|
|
("%s: invalid address length for %p", __func__, sin6));
|
1999-11-22 02:45:11 +00:00
|
|
|
|
Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).
This is the first in a series of commits over the course
of the next few weeks.
Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.
We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.
Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch
2008-08-17 23:27:27 +00:00
|
|
|
if ((error = sa6_embedscope(sin6, V_ip6_use_defzone)) != 0)
|
2005-07-25 12:31:43 +00:00
|
|
|
return(error);
|
1999-11-22 02:45:11 +00:00
|
|
|
|
2009-02-05 14:06:09 +00:00
|
|
|
if ((error = prison_local_ip6(cred, &sin6->sin6_addr,
|
|
|
|
((inp->inp_flags & IN6P_IPV6_V6ONLY) != 0))) != 0)
|
|
|
|
return (error);
|
MFp4:
Bring in updated jail support from bz_jail branch.
This enhances the current jail implementation to permit multiple
addresses per jail. In addtion to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..
SCTP support was updated and supports IPv6 in jails as well.
Cpuset support permits jails to be bound to specific processor
sets after creation.
Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.
DDB 'show jails' command was added to aid debugging.
Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.
Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.
Bump __FreeBSD_version for the afore mentioned and in kernel changes.
Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.
Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
|
|
|
|
1999-11-22 02:45:11 +00:00
|
|
|
lport = sin6->sin6_port;
|
|
|
|
if (IN6_IS_ADDR_MULTICAST(&sin6->sin6_addr)) {
|
|
|
|
/*
|
|
|
|
* Treat SO_REUSEADDR as SO_REUSEPORT for multicast;
|
|
|
|
* allow compepte duplication of binding if
|
|
|
|
* SO_REUSEPORT is set, or if SO_REUSEADDR is set
|
|
|
|
* and a multicast address is bound on both
|
|
|
|
* new and duplicated sockets.
|
|
|
|
*/
|
2013-07-12 19:08:33 +00:00
|
|
|
if ((so->so_options & (SO_REUSEADDR|SO_REUSEPORT)) != 0)
|
1999-11-22 02:45:11 +00:00
|
|
|
reuseport = SO_REUSEADDR|SO_REUSEPORT;
|
2018-06-06 15:45:57 +00:00
|
|
|
/*
|
|
|
|
* XXX: How to deal with SO_REUSEPORT_LB here?
|
|
|
|
* Treat same as SO_REUSEPORT for now.
|
|
|
|
*/
|
|
|
|
if ((so->so_options &
|
|
|
|
(SO_REUSEADDR|SO_REUSEPORT_LB)) != 0)
|
|
|
|
reuseport_lb = SO_REUSEADDR|SO_REUSEPORT_LB;
|
1999-11-22 02:45:11 +00:00
|
|
|
} else if (!IN6_IS_ADDR_UNSPECIFIED(&sin6->sin6_addr)) {
|
2019-01-09 01:11:19 +00:00
|
|
|
struct epoch_tracker et;
|
2009-06-23 20:19:09 +00:00
|
|
|
struct ifaddr *ifa;
|
1999-11-22 02:45:11 +00:00
|
|
|
|
|
|
|
sin6->sin6_port = 0; /* yech... */
|
2019-01-09 01:11:19 +00:00
|
|
|
NET_EPOCH_ENTER(et);
|
2009-06-23 20:19:09 +00:00
|
|
|
if ((ifa = ifa_ifwithaddr((struct sockaddr *)sin6)) ==
|
|
|
|
NULL &&
|
2009-06-01 10:30:00 +00:00
|
|
|
(inp->inp_flags & INP_BINDANY) == 0) {
|
2019-01-09 01:11:19 +00:00
|
|
|
NET_EPOCH_EXIT(et);
|
2003-10-06 14:02:09 +00:00
|
|
|
return (EADDRNOTAVAIL);
|
2009-06-01 10:30:00 +00:00
|
|
|
}
|
1999-11-22 02:45:11 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* XXX: bind to an anycast address might accidentally
|
|
|
|
* cause sending a packet with anycast source address.
|
2001-06-11 12:39:29 +00:00
|
|
|
* We should allow to bind to a deprecated address, since
|
2003-10-08 18:26:08 +00:00
|
|
|
* the application dares to use it.
|
1999-11-22 02:45:11 +00:00
|
|
|
*/
|
2009-06-23 20:19:09 +00:00
|
|
|
if (ifa != NULL &&
|
|
|
|
((struct in6_ifaddr *)ifa)->ia6_flags &
|
2001-06-11 12:39:29 +00:00
|
|
|
(IN6_IFF_ANYCAST|IN6_IFF_NOTREADY|IN6_IFF_DETACHED)) {
|
2019-01-09 01:11:19 +00:00
|
|
|
NET_EPOCH_EXIT(et);
|
2003-10-06 14:02:09 +00:00
|
|
|
return (EADDRNOTAVAIL);
|
1999-11-22 02:45:11 +00:00
|
|
|
}
|
2019-01-09 01:11:19 +00:00
|
|
|
NET_EPOCH_EXIT(et);
|
1999-11-22 02:45:11 +00:00
|
|
|
}
|
|
|
|
if (lport) {
|
|
|
|
struct inpcb *t;
|
2011-11-06 09:29:52 +00:00
|
|
|
struct tcptw *tw;
|
1999-11-22 02:45:11 +00:00
|
|
|
|
|
|
|
/* GROSS */
|
Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).
This is the first in a series of commits over the course
of the next few weeks.
Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.
We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.
Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch
2008-08-17 23:27:27 +00:00
|
|
|
if (ntohs(lport) <= V_ipport_reservedhigh &&
|
|
|
|
ntohs(lport) >= V_ipport_reservedlow &&
|
2018-12-11 19:32:16 +00:00
|
|
|
priv_check_cred(cred, PRIV_NETINET_RESERVEDPORT))
|
2003-10-06 14:02:09 +00:00
|
|
|
return (EACCES);
|
2006-06-27 11:35:53 +00:00
|
|
|
if (!IN6_IS_ADDR_MULTICAST(&sin6->sin6_addr) &&
|
2018-12-11 19:32:16 +00:00
|
|
|
priv_check_cred(inp->inp_cred, PRIV_NETINET_REUSEPORT) != 0) {
|
2000-01-09 19:17:30 +00:00
|
|
|
t = in6_pcblookup_local(pcbinfo,
|
1999-11-22 02:45:11 +00:00
|
|
|
&sin6->sin6_addr, lport,
|
2008-07-10 13:31:11 +00:00
|
|
|
INPLOOKUP_WILDCARD, cred);
|
2004-07-27 16:35:09 +00:00
|
|
|
if (t &&
|
2014-07-12 05:46:33 +00:00
|
|
|
((inp->inp_flags2 & INP_BINDMULTI) == 0) &&
|
2009-03-15 09:58:31 +00:00
|
|
|
((t->inp_flags & INP_TIMEWAIT) == 0) &&
|
2004-07-27 16:35:09 +00:00
|
|
|
(so->so_type != SOCK_STREAM ||
|
|
|
|
IN6_IS_ADDR_UNSPECIFIED(&t->in6p_faddr)) &&
|
2002-05-31 11:52:35 +00:00
|
|
|
(!IN6_IS_ADDR_UNSPECIFIED(&sin6->sin6_addr) ||
|
2007-07-05 16:29:40 +00:00
|
|
|
!IN6_IS_ADDR_UNSPECIFIED(&t->in6p_laddr) ||
|
2018-06-06 15:45:57 +00:00
|
|
|
(t->inp_flags2 & INP_REUSEPORT) ||
|
|
|
|
(t->inp_flags2 & INP_REUSEPORT_LB) == 0) &&
|
2011-11-06 10:47:20 +00:00
|
|
|
(inp->inp_cred->cr_uid !=
|
2008-10-04 15:06:34 +00:00
|
|
|
t->inp_cred->cr_uid))
|
2002-05-31 11:52:35 +00:00
|
|
|
return (EADDRINUSE);
|
2014-07-12 05:46:33 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If the socket is a BINDMULTI socket, then
|
|
|
|
* the credentials need to match and the
|
|
|
|
* original socket also has to have been bound
|
|
|
|
* with BINDMULTI.
|
|
|
|
*/
|
|
|
|
if (t && (! in_pcbbind_check_bindmulti(inp, t)))
|
|
|
|
return (EADDRINUSE);
|
|
|
|
|
2011-04-30 11:04:34 +00:00
|
|
|
#ifdef INET
|
2001-06-11 12:39:29 +00:00
|
|
|
if ((inp->inp_flags & IN6P_IPV6_V6ONLY) == 0 &&
|
2000-01-09 19:17:30 +00:00
|
|
|
IN6_IS_ADDR_UNSPECIFIED(&sin6->sin6_addr)) {
|
|
|
|
struct sockaddr_in sin;
|
|
|
|
|
|
|
|
in6_sin6_2_sin(&sin, sin6);
|
|
|
|
t = in_pcblookup_local(pcbinfo,
|
2008-07-10 13:31:11 +00:00
|
|
|
sin.sin_addr, lport,
|
|
|
|
INPLOOKUP_WILDCARD, cred);
|
2004-07-27 16:35:09 +00:00
|
|
|
if (t &&
|
2014-07-12 05:46:33 +00:00
|
|
|
((inp->inp_flags2 & INP_BINDMULTI) == 0) &&
|
2009-03-15 09:58:31 +00:00
|
|
|
((t->inp_flags &
|
2004-07-27 16:35:09 +00:00
|
|
|
INP_TIMEWAIT) == 0) &&
|
|
|
|
(so->so_type != SOCK_STREAM ||
|
|
|
|
ntohl(t->inp_faddr.s_addr) ==
|
|
|
|
INADDR_ANY) &&
|
2008-10-04 15:06:34 +00:00
|
|
|
(inp->inp_cred->cr_uid !=
|
|
|
|
t->inp_cred->cr_uid))
|
2000-01-09 19:17:30 +00:00
|
|
|
return (EADDRINUSE);
|
2014-07-12 05:46:33 +00:00
|
|
|
|
|
|
|
if (t && (! in_pcbbind_check_bindmulti(inp, t)))
|
|
|
|
return (EADDRINUSE);
|
2000-01-09 19:17:30 +00:00
|
|
|
}
|
2011-04-30 11:04:34 +00:00
|
|
|
#endif
|
1999-11-22 02:45:11 +00:00
|
|
|
}
|
|
|
|
t = in6_pcblookup_local(pcbinfo, &sin6->sin6_addr,
|
2011-05-23 15:23:18 +00:00
|
|
|
lport, lookupflags, cred);
|
2011-11-06 09:29:52 +00:00
|
|
|
if (t && (t->inp_flags & INP_TIMEWAIT)) {
|
|
|
|
/*
|
|
|
|
* XXXRW: If an incpb has had its timewait
|
|
|
|
* state recycled, we treat the address as
|
|
|
|
* being in use (for now). This is better
|
|
|
|
* than a panic, but not desirable.
|
|
|
|
*/
|
|
|
|
tw = intotw(t);
|
|
|
|
if (tw == NULL ||
|
2018-06-06 15:45:57 +00:00
|
|
|
((reuseport & tw->tw_so_options) == 0 &&
|
|
|
|
(reuseport_lb & tw->tw_so_options) == 0))
|
2011-11-06 09:29:52 +00:00
|
|
|
return (EADDRINUSE);
|
2018-06-06 15:45:57 +00:00
|
|
|
} else if (t && (reuseport & inp_so_options(t)) == 0 &&
|
|
|
|
(reuseport_lb & inp_so_options(t)) == 0) {
|
2003-10-06 14:02:09 +00:00
|
|
|
return (EADDRINUSE);
|
2011-11-06 09:29:52 +00:00
|
|
|
}
|
2011-04-30 11:04:34 +00:00
|
|
|
#ifdef INET
|
2001-06-11 12:39:29 +00:00
|
|
|
if ((inp->inp_flags & IN6P_IPV6_V6ONLY) == 0 &&
|
2018-04-24 19:55:12 +00:00
|
|
|
IN6_IS_ADDR_UNSPECIFIED(&sin6->sin6_addr)) {
|
2000-01-09 19:17:30 +00:00
|
|
|
struct sockaddr_in sin;
|
|
|
|
|
|
|
|
in6_sin6_2_sin(&sin, sin6);
|
|
|
|
t = in_pcblookup_local(pcbinfo, sin.sin_addr,
|
2018-06-06 15:45:57 +00:00
|
|
|
lport, lookupflags, cred);
|
2009-03-15 09:58:31 +00:00
|
|
|
if (t && t->inp_flags & INP_TIMEWAIT) {
|
2011-11-06 09:29:52 +00:00
|
|
|
tw = intotw(t);
|
|
|
|
if (tw == NULL)
|
|
|
|
return (EADDRINUSE);
|
|
|
|
if ((reuseport & tw->tw_so_options) == 0
|
2018-06-06 15:45:57 +00:00
|
|
|
&& (reuseport_lb & tw->tw_so_options) == 0
|
2018-04-24 19:55:12 +00:00
|
|
|
&& (ntohl(t->inp_laddr.s_addr) !=
|
2018-06-06 15:45:57 +00:00
|
|
|
INADDR_ANY || ((inp->inp_vflag &
|
|
|
|
INP_IPV6PROTO) ==
|
|
|
|
(t->inp_vflag & INP_IPV6PROTO))))
|
2003-06-17 00:31:30 +00:00
|
|
|
return (EADDRINUSE);
|
2013-07-04 18:38:00 +00:00
|
|
|
} else if (t &&
|
2018-04-24 19:55:12 +00:00
|
|
|
(reuseport & inp_so_options(t)) == 0 &&
|
2018-06-06 15:45:57 +00:00
|
|
|
(reuseport_lb & inp_so_options(t)) == 0 &&
|
2018-04-24 19:55:12 +00:00
|
|
|
(ntohl(t->inp_laddr.s_addr) != INADDR_ANY ||
|
2018-06-06 15:45:57 +00:00
|
|
|
(t->inp_vflag & INP_IPV6PROTO) != 0)) {
|
2002-05-31 11:52:35 +00:00
|
|
|
return (EADDRINUSE);
|
2018-06-06 15:45:57 +00:00
|
|
|
}
|
2000-01-09 19:17:30 +00:00
|
|
|
}
|
2011-04-30 11:04:34 +00:00
|
|
|
#endif
|
1999-11-22 02:45:11 +00:00
|
|
|
}
|
|
|
|
inp->in6p_laddr = sin6->sin6_addr;
|
|
|
|
}
|
|
|
|
if (lport == 0) {
|
2017-02-10 05:58:16 +00:00
|
|
|
if ((error = in6_pcbsetport(&inp->in6p_laddr, inp, cred)) != 0) {
|
2011-03-12 16:45:15 +00:00
|
|
|
/* Undo an address bind that may have occurred. */
|
|
|
|
inp->in6p_laddr = in6addr_any;
|
2009-02-05 14:06:09 +00:00
|
|
|
return (error);
|
2011-03-12 16:45:15 +00:00
|
|
|
}
|
2008-10-04 17:07:58 +00:00
|
|
|
} else {
|
2000-07-04 16:35:15 +00:00
|
|
|
inp->inp_lport = lport;
|
|
|
|
if (in_pcbinshash(inp) != 0) {
|
|
|
|
inp->in6p_laddr = in6addr_any;
|
|
|
|
inp->inp_lport = 0;
|
|
|
|
return (EAGAIN);
|
|
|
|
}
|
1999-11-22 02:45:11 +00:00
|
|
|
}
|
2003-10-06 14:02:09 +00:00
|
|
|
return (0);
|
1999-11-22 02:45:11 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Transform old in6_pcbconnect() into an inner subroutine for new
|
|
|
|
* in6_pcbconnect(): Do some validity-checking on the remote
|
|
|
|
* address (in mbuf 'nam') and then determine local host address
|
|
|
|
* (i.e., which interface) to use to access that remote host.
|
|
|
|
*
|
|
|
|
* This preserves definition of in6_pcbconnect(), while supporting a
|
|
|
|
* slightly different version for T/TCP. (This is more than
|
|
|
|
* a bit of a kludge, but cleaning up the internal interfaces would
|
|
|
|
* have forced minor changes in every protocol).
|
|
|
|
*/
|
2014-09-10 13:17:35 +00:00
|
|
|
static int
|
2021-05-03 16:51:04 +00:00
|
|
|
in6_pcbladdr(struct inpcb *inp, struct sockaddr_in6 *sin6,
|
2009-06-23 22:08:55 +00:00
|
|
|
struct in6_addr *plocal_addr6)
|
1999-11-22 02:45:11 +00:00
|
|
|
{
|
|
|
|
int error = 0;
|
2005-07-25 12:31:43 +00:00
|
|
|
int scope_ambiguous = 0;
|
2009-06-23 22:08:55 +00:00
|
|
|
struct in6_addr in6a;
|
2021-02-13 14:32:10 +00:00
|
|
|
struct epoch_tracker et;
|
1999-11-22 02:45:11 +00:00
|
|
|
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK_ASSERT(inp);
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
|
|
|
INP_HASH_WLOCK_ASSERT(inp->inp_pcbinfo); /* XXXRW: why? */
|
2006-04-25 12:09:58 +00:00
|
|
|
|
1999-11-22 02:45:11 +00:00
|
|
|
if (sin6->sin6_port == 0)
|
|
|
|
return (EADDRNOTAVAIL);
|
|
|
|
|
Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).
This is the first in a series of commits over the course
of the next few weeks.
Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.
We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.
Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch
2008-08-17 23:27:27 +00:00
|
|
|
if (sin6->sin6_scope_id == 0 && !V_ip6_use_defzone)
|
2005-07-25 12:31:43 +00:00
|
|
|
scope_ambiguous = 1;
|
Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).
This is the first in a series of commits over the course
of the next few weeks.
Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.
We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.
Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch
2008-08-17 23:27:27 +00:00
|
|
|
if ((error = sa6_embedscope(sin6, V_ip6_use_defzone)) != 0)
|
2005-07-25 12:31:43 +00:00
|
|
|
return(error);
|
|
|
|
|
2018-05-18 20:13:34 +00:00
|
|
|
if (!CK_STAILQ_EMPTY(&V_in6_ifaddrhead)) {
|
1999-11-22 02:45:11 +00:00
|
|
|
/*
|
|
|
|
* If the destination address is UNSPECIFIED addr,
|
|
|
|
* use the loopback addr, e.g ::1.
|
|
|
|
*/
|
|
|
|
if (IN6_IS_ADDR_UNSPECIFIED(&sin6->sin6_addr))
|
|
|
|
sin6->sin6_addr = in6addr_loopback;
|
|
|
|
}
|
2009-02-05 14:06:09 +00:00
|
|
|
if ((error = prison_remote_ip6(inp->inp_cred, &sin6->sin6_addr)) != 0)
|
|
|
|
return (error);
|
2005-07-25 12:31:43 +00:00
|
|
|
|
2021-02-13 14:32:10 +00:00
|
|
|
NET_EPOCH_ENTER(et);
|
2016-01-10 13:40:29 +00:00
|
|
|
error = in6_selectsrc_socket(sin6, inp->in6p_outputopts,
|
|
|
|
inp, inp->inp_cred, scope_ambiguous, &in6a, NULL);
|
2021-02-13 14:32:10 +00:00
|
|
|
NET_EPOCH_EXIT(et);
|
2009-06-23 22:08:55 +00:00
|
|
|
if (error)
|
|
|
|
return (error);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Do not update this earlier, in case we return with an error.
|
|
|
|
*
|
2016-01-10 13:40:29 +00:00
|
|
|
* XXX: this in6_selectsrc_socket result might replace the bound local
|
2010-01-24 10:22:39 +00:00
|
|
|
* address with the address specified by setsockopt(IPV6_PKTINFO).
|
2009-06-23 22:08:55 +00:00
|
|
|
* Is it the intended behavior?
|
|
|
|
*/
|
|
|
|
*plocal_addr6 = in6a;
|
|
|
|
|
2005-07-25 12:31:43 +00:00
|
|
|
/*
|
|
|
|
* Don't do pcblookup call here; return interface in
|
|
|
|
* plocal_addr6
|
|
|
|
* and exit to caller, that will do the lookup.
|
|
|
|
*/
|
|
|
|
|
2003-10-06 14:02:09 +00:00
|
|
|
return (0);
|
1999-11-22 02:45:11 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Outer subroutine:
|
|
|
|
* Connect from a socket to a specified address.
|
|
|
|
* Both address and port must be specified in argument sin.
|
|
|
|
* If don't have a local address for this socket yet,
|
|
|
|
* then pick one.
|
|
|
|
*/
|
|
|
|
int
|
2017-05-17 00:34:34 +00:00
|
|
|
in6_pcbconnect_mbuf(struct inpcb *inp, struct sockaddr *nam,
|
2020-01-12 17:52:32 +00:00
|
|
|
struct ucred *cred, struct mbuf *m, bool rehash)
|
1999-11-22 02:45:11 +00:00
|
|
|
{
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
|
|
|
struct inpcbinfo *pcbinfo = inp->inp_pcbinfo;
|
2017-05-17 00:34:34 +00:00
|
|
|
struct sockaddr_in6 *sin6 = (struct sockaddr_in6 *)nam;
|
2020-05-18 22:53:12 +00:00
|
|
|
struct sockaddr_in6 laddr6;
|
1999-11-22 02:45:11 +00:00
|
|
|
int error;
|
|
|
|
|
2021-05-03 16:51:04 +00:00
|
|
|
KASSERT(sin6->sin6_family == AF_INET6,
|
|
|
|
("%s: invalid address family for %p", __func__, sin6));
|
|
|
|
KASSERT(sin6->sin6_len == sizeof(*sin6),
|
|
|
|
("%s: invalid address length for %p", __func__, sin6));
|
|
|
|
|
2020-05-18 22:53:12 +00:00
|
|
|
bzero(&laddr6, sizeof(laddr6));
|
|
|
|
laddr6.sin6_family = AF_INET6;
|
|
|
|
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK_ASSERT(inp);
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
|
|
|
INP_HASH_WLOCK_ASSERT(pcbinfo);
|
2004-07-27 23:44:03 +00:00
|
|
|
|
2020-10-18 17:15:47 +00:00
|
|
|
#ifdef ROUTE_MPATH
|
|
|
|
if (CALC_FLOWID_OUTBOUND) {
|
|
|
|
uint32_t hash_type, hash_val;
|
|
|
|
|
|
|
|
hash_val = fib6_calc_software_hash(&inp->in6p_laddr,
|
|
|
|
&sin6->sin6_addr, 0, sin6->sin6_port,
|
|
|
|
inp->inp_socket->so_proto->pr_protocol, &hash_type);
|
|
|
|
inp->inp_flowid = hash_val;
|
|
|
|
inp->inp_flowtype = hash_type;
|
|
|
|
}
|
|
|
|
#endif
|
1999-11-22 02:45:11 +00:00
|
|
|
/*
|
2002-04-19 04:46:24 +00:00
|
|
|
* Call inner routine, to assign local interface address.
|
|
|
|
* in6_pcbladdr() may automatically fill in sin6_scope_id.
|
1999-11-22 02:45:11 +00:00
|
|
|
*/
|
2021-05-03 16:51:04 +00:00
|
|
|
if ((error = in6_pcbladdr(inp, sin6, &laddr6.sin6_addr)) != 0)
|
2003-10-06 14:02:09 +00:00
|
|
|
return (error);
|
1999-11-22 02:45:11 +00:00
|
|
|
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
|
|
|
if (in6_pcblookup_hash_locked(pcbinfo, &sin6->sin6_addr,
|
1999-11-22 02:45:11 +00:00
|
|
|
sin6->sin6_port,
|
|
|
|
IN6_IS_ADDR_UNSPECIFIED(&inp->in6p_laddr)
|
2020-05-18 22:53:12 +00:00
|
|
|
? &laddr6.sin6_addr : &inp->in6p_laddr,
|
Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain
In order to efficiently serve web traffic on a NUMA
machine, one must avoid as many NUMA domain crossings as
possible. With SO_REUSEPORT_LB, a number of workers can share a
listen socket. However, even if a worker sets affinity to a core
or set of cores on a NUMA domain, it will receive connections
associated with all NUMA domains in the system. This will lead to
cross-domain traffic when the server writes to the socket or
calls sendfile(), and memory is allocated on the server's local
NUMA node, but transmitted on the NUMA node associated with the
TCP connection. Similarly, when the server reads from the socket,
he will likely be reading memory allocated on the NUMA domain
associated with the TCP connection.
This change provides a new socket ioctl, TCP_REUSPORT_LB_NUMA. A
server can now tell the kernel to filter traffic so that only
incoming connections associated with the desired NUMA domain are
given to the server. (Of course, in the case where there are no
servers sharing the listen socket on some domain, then as a
fallback, traffic will be hashed as normal to all servers sharing
the listen socket regardless of domain). This allows a server to
deal only with traffic that is local to its NUMA domain, and
avoids cross-domain traffic in most cases.
This patch, and a corresponding small patch to nginx to use
TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted
https media content from dual-socket Xeons with only 13% (as
measured by pcm.x) cross domain traffic on the memory controller.
Reviewed by: jhb, bz (earlier version), bcr (man page)
Tested by: gonzo
Sponsored by: Netfix
Differential Revision: https://reviews.freebsd.org/D21636
2020-12-19 22:04:46 +00:00
|
|
|
inp->inp_lport, 0, NULL, M_NODOM) != NULL) {
|
1999-11-22 02:45:11 +00:00
|
|
|
return (EADDRINUSE);
|
|
|
|
}
|
|
|
|
if (IN6_IS_ADDR_UNSPECIFIED(&inp->in6p_laddr)) {
|
|
|
|
if (inp->inp_lport == 0) {
|
2020-05-18 22:53:12 +00:00
|
|
|
/*
|
|
|
|
* rehash was required to be true in the past for
|
|
|
|
* this case; retain that convention. However,
|
|
|
|
* we now call in_pcb_lport_dest rather than
|
|
|
|
* in6_pcbbind; the former does not insert into
|
|
|
|
* the hash table, the latter does. Change rehash
|
|
|
|
* to false to do the in_pcbinshash below.
|
|
|
|
*/
|
2020-01-12 17:52:32 +00:00
|
|
|
KASSERT(rehash == true,
|
|
|
|
("Rehashing required for unbound inps"));
|
2020-05-18 22:53:12 +00:00
|
|
|
rehash = false;
|
|
|
|
error = in_pcb_lport_dest(inp,
|
|
|
|
(struct sockaddr *) &laddr6, &inp->inp_lport,
|
2020-11-14 14:50:34 +00:00
|
|
|
(struct sockaddr *) sin6, sin6->sin6_port, cred,
|
|
|
|
INPLOOKUP_WILDCARD);
|
2017-02-10 05:58:16 +00:00
|
|
|
if (error)
|
1999-11-22 02:45:11 +00:00
|
|
|
return (error);
|
|
|
|
}
|
2020-05-18 22:53:12 +00:00
|
|
|
inp->in6p_laddr = laddr6.sin6_addr;
|
1999-11-22 02:45:11 +00:00
|
|
|
}
|
|
|
|
inp->in6p_faddr = sin6->sin6_addr;
|
|
|
|
inp->inp_fport = sin6->sin6_port;
|
2001-06-11 12:39:29 +00:00
|
|
|
/* update flowinfo - draft-itojun-ipv6-flowlabel-api-00 */
|
Another step assimilating IPv[46] PCB code - directly use
the inpcb names rather than the following IPv6 compat macros:
in6pcb,in6p_sp, in6p_ip6_nxt,in6p_flowinfo,in6p_vflag,
in6p_flags,in6p_socket,in6p_lport,in6p_fport,in6p_ppcb and
sotoin6pcb().
Apart from removing duplicate code in netipsec, this is a pure
whitespace, not a functional change.
Discussed with: rwatson
Reviewed by: rwatson (version before review requested changes)
MFC after: 4 weeks (set the timer and see then)
2008-12-15 21:50:54 +00:00
|
|
|
inp->inp_flow &= ~IPV6_FLOWLABEL_MASK;
|
|
|
|
if (inp->inp_flags & IN6P_AUTOFLOWLABEL)
|
|
|
|
inp->inp_flow |=
|
2003-10-01 21:24:28 +00:00
|
|
|
(htonl(ip6_randomflowlabel()) & IPV6_FLOWLABEL_MASK);
|
1999-11-22 02:45:11 +00:00
|
|
|
|
2020-01-12 17:52:32 +00:00
|
|
|
if (rehash) {
|
|
|
|
in_pcbrehash_mbuf(inp, m);
|
|
|
|
} else {
|
|
|
|
in_pcbinshash_mbuf(inp, m);
|
|
|
|
}
|
2007-07-01 11:41:27 +00:00
|
|
|
|
1999-11-22 02:45:11 +00:00
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
Add _mbuf() variants of various inpcb-related interfaces, including lookup,
hash install, etc. For now, these are arguments are unused, but as we add
RSS support, we will want to use hashes extracted from mbufs, rather than
manually calculated hashes of header fields, due to the expensive of the
software version of Toeplitz (and similar hashes).
Add notes that it would be nice to be able to pass mbufs into lookup
routines in pf(4), optimising firewall lookup in the same way, but the
code structure there doesn't facilitate that currently.
(In principle there is no reason this couldn't be MFCed -- the change
extends rather than modifies the KBI. However, it won't be useful without
other previous possibly less MFCable changes.)
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-06-04 16:33:06 +00:00
|
|
|
int
|
|
|
|
in6_pcbconnect(struct inpcb *inp, struct sockaddr *nam, struct ucred *cred)
|
|
|
|
{
|
|
|
|
|
2020-01-12 17:52:32 +00:00
|
|
|
return (in6_pcbconnect_mbuf(inp, nam, cred, NULL, true));
|
Add _mbuf() variants of various inpcb-related interfaces, including lookup,
hash install, etc. For now, these are arguments are unused, but as we add
RSS support, we will want to use hashes extracted from mbufs, rather than
manually calculated hashes of header fields, due to the expensive of the
software version of Toeplitz (and similar hashes).
Add notes that it would be nice to be able to pass mbufs into lookup
routines in pf(4), optimising firewall lookup in the same way, but the
code structure there doesn't facilitate that currently.
(In principle there is no reason this couldn't be MFCed -- the change
extends rather than modifies the KBI. However, it won't be useful without
other previous possibly less MFCable changes.)
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-06-04 16:33:06 +00:00
|
|
|
}
|
|
|
|
|
1999-11-22 02:45:11 +00:00
|
|
|
void
|
2007-07-05 16:23:49 +00:00
|
|
|
in6_pcbdisconnect(struct inpcb *inp)
|
1999-11-22 02:45:11 +00:00
|
|
|
{
|
2004-07-27 23:44:03 +00:00
|
|
|
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK_ASSERT(inp);
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
|
|
|
INP_HASH_WLOCK_ASSERT(inp->inp_pcbinfo);
|
2004-07-27 23:44:03 +00:00
|
|
|
|
1999-11-22 02:45:11 +00:00
|
|
|
bzero((caddr_t)&inp->in6p_faddr, sizeof(inp->in6p_faddr));
|
|
|
|
inp->inp_fport = 0;
|
2001-06-11 12:39:29 +00:00
|
|
|
/* clear flowinfo - draft-itojun-ipv6-flowlabel-api-00 */
|
Another step assimilating IPv[46] PCB code - directly use
the inpcb names rather than the following IPv6 compat macros:
in6pcb,in6p_sp, in6p_ip6_nxt,in6p_flowinfo,in6p_vflag,
in6p_flags,in6p_socket,in6p_lport,in6p_fport,in6p_ppcb and
sotoin6pcb().
Apart from removing duplicate code in netipsec, this is a pure
whitespace, not a functional change.
Discussed with: rwatson
Reviewed by: rwatson (version before review requested changes)
MFC after: 4 weeks (set the timer and see then)
2008-12-15 21:50:54 +00:00
|
|
|
inp->inp_flow &= ~IPV6_FLOWLABEL_MASK;
|
1999-11-22 02:45:11 +00:00
|
|
|
in_pcbrehash(inp);
|
|
|
|
}
|
|
|
|
|
2002-08-21 11:57:12 +00:00
|
|
|
struct sockaddr *
|
2007-07-05 16:23:49 +00:00
|
|
|
in6_sockaddr(in_port_t port, struct in6_addr *addr_p)
|
2002-08-21 11:57:12 +00:00
|
|
|
{
|
|
|
|
struct sockaddr_in6 *sin6;
|
|
|
|
|
2008-10-23 15:53:51 +00:00
|
|
|
sin6 = malloc(sizeof *sin6, M_SONAME, M_WAITOK);
|
2002-08-21 11:57:12 +00:00
|
|
|
bzero(sin6, sizeof *sin6);
|
|
|
|
sin6->sin6_family = AF_INET6;
|
|
|
|
sin6->sin6_len = sizeof(*sin6);
|
|
|
|
sin6->sin6_port = port;
|
|
|
|
sin6->sin6_addr = *addr_p;
|
2005-07-25 12:31:43 +00:00
|
|
|
(void)sa6_recoverscope(sin6); /* XXX: should catch errors */
|
2002-08-21 11:57:12 +00:00
|
|
|
|
|
|
|
return (struct sockaddr *)sin6;
|
|
|
|
}
|
|
|
|
|
|
|
|
struct sockaddr *
|
2007-07-05 16:23:49 +00:00
|
|
|
in6_v4mapsin6_sockaddr(in_port_t port, struct in_addr *addr_p)
|
2002-08-21 11:57:12 +00:00
|
|
|
{
|
|
|
|
struct sockaddr_in sin;
|
|
|
|
struct sockaddr_in6 *sin6_p;
|
|
|
|
|
|
|
|
bzero(&sin, sizeof sin);
|
|
|
|
sin.sin_family = AF_INET;
|
|
|
|
sin.sin_len = sizeof(sin);
|
|
|
|
sin.sin_port = port;
|
|
|
|
sin.sin_addr = *addr_p;
|
|
|
|
|
2008-10-23 15:53:51 +00:00
|
|
|
sin6_p = malloc(sizeof *sin6_p, M_SONAME,
|
2003-02-19 05:47:46 +00:00
|
|
|
M_WAITOK);
|
2002-08-21 11:57:12 +00:00
|
|
|
in6_sin_2_v4mapsin6(&sin, sin6_p);
|
|
|
|
|
|
|
|
return (struct sockaddr *)sin6_p;
|
|
|
|
}
|
|
|
|
|
1999-11-22 02:45:11 +00:00
|
|
|
int
|
2007-07-05 16:23:49 +00:00
|
|
|
in6_getsockaddr(struct socket *so, struct sockaddr **nam)
|
1999-11-22 02:45:11 +00:00
|
|
|
{
|
2017-05-17 00:34:34 +00:00
|
|
|
struct inpcb *inp;
|
2002-08-21 11:57:12 +00:00
|
|
|
struct in6_addr addr;
|
|
|
|
in_port_t port;
|
1999-11-22 02:45:11 +00:00
|
|
|
|
|
|
|
inp = sotoinpcb(so);
|
2007-05-11 10:20:51 +00:00
|
|
|
KASSERT(inp != NULL, ("in6_getsockaddr: inp == NULL"));
|
2006-04-12 02:52:14 +00:00
|
|
|
|
2008-04-19 14:36:19 +00:00
|
|
|
INP_RLOCK(inp);
|
2002-08-21 11:57:12 +00:00
|
|
|
port = inp->inp_lport;
|
|
|
|
addr = inp->in6p_laddr;
|
2008-04-19 14:36:19 +00:00
|
|
|
INP_RUNLOCK(inp);
|
1999-11-22 02:45:11 +00:00
|
|
|
|
2002-08-21 11:57:12 +00:00
|
|
|
*nam = in6_sockaddr(port, &addr);
|
1999-11-22 02:45:11 +00:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
int
|
2007-07-05 16:23:49 +00:00
|
|
|
in6_getpeeraddr(struct socket *so, struct sockaddr **nam)
|
1999-11-22 02:45:11 +00:00
|
|
|
{
|
|
|
|
struct inpcb *inp;
|
2002-08-21 11:57:12 +00:00
|
|
|
struct in6_addr addr;
|
|
|
|
in_port_t port;
|
1999-11-22 02:45:11 +00:00
|
|
|
|
|
|
|
inp = sotoinpcb(so);
|
2007-05-11 10:20:51 +00:00
|
|
|
KASSERT(inp != NULL, ("in6_getpeeraddr: inp == NULL"));
|
2006-04-12 02:52:14 +00:00
|
|
|
|
2008-04-19 14:36:19 +00:00
|
|
|
INP_RLOCK(inp);
|
2002-08-21 11:57:12 +00:00
|
|
|
port = inp->inp_fport;
|
|
|
|
addr = inp->in6p_faddr;
|
2008-04-19 14:36:19 +00:00
|
|
|
INP_RUNLOCK(inp);
|
1999-11-22 02:45:11 +00:00
|
|
|
|
2002-08-21 11:57:12 +00:00
|
|
|
*nam = in6_sockaddr(port, &addr);
|
1999-11-22 02:45:11 +00:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
int
|
|
|
|
in6_mapped_sockaddr(struct socket *so, struct sockaddr **nam)
|
|
|
|
{
|
2006-04-12 02:52:14 +00:00
|
|
|
struct inpcb *inp;
|
1999-11-22 02:45:11 +00:00
|
|
|
int error;
|
|
|
|
|
2006-04-12 02:52:14 +00:00
|
|
|
inp = sotoinpcb(so);
|
|
|
|
KASSERT(inp != NULL, ("in6_mapped_sockaddr: inp == NULL"));
|
|
|
|
|
2011-04-30 11:04:34 +00:00
|
|
|
#ifdef INET
|
2004-01-10 08:11:51 +00:00
|
|
|
if ((inp->inp_vflag & (INP_IPV4 | INP_IPV6)) == INP_IPV4) {
|
2007-05-11 10:20:51 +00:00
|
|
|
error = in_getsockaddr(so, nam);
|
1999-12-21 11:14:12 +00:00
|
|
|
if (error == 0)
|
1999-11-22 02:45:11 +00:00
|
|
|
in6_sin_2_v4mapsin6_in_sock(nam);
|
2011-04-30 11:04:34 +00:00
|
|
|
} else
|
|
|
|
#endif
|
|
|
|
{
|
2007-05-11 10:20:51 +00:00
|
|
|
/* scope issues will be handled in in6_getsockaddr(). */
|
|
|
|
error = in6_getsockaddr(so, nam);
|
2003-10-08 18:26:08 +00:00
|
|
|
}
|
1999-11-22 02:45:11 +00:00
|
|
|
|
|
|
|
return error;
|
|
|
|
}
|
|
|
|
|
|
|
|
int
|
|
|
|
in6_mapped_peeraddr(struct socket *so, struct sockaddr **nam)
|
|
|
|
{
|
2006-04-12 02:52:14 +00:00
|
|
|
struct inpcb *inp;
|
1999-11-22 02:45:11 +00:00
|
|
|
int error;
|
|
|
|
|
2006-04-12 02:52:14 +00:00
|
|
|
inp = sotoinpcb(so);
|
|
|
|
KASSERT(inp != NULL, ("in6_mapped_peeraddr: inp == NULL"));
|
|
|
|
|
2011-04-30 11:04:34 +00:00
|
|
|
#ifdef INET
|
2004-01-10 08:11:51 +00:00
|
|
|
if ((inp->inp_vflag & (INP_IPV4 | INP_IPV6)) == INP_IPV4) {
|
2007-05-11 10:20:51 +00:00
|
|
|
error = in_getpeeraddr(so, nam);
|
1999-12-21 11:14:12 +00:00
|
|
|
if (error == 0)
|
1999-11-22 02:45:11 +00:00
|
|
|
in6_sin_2_v4mapsin6_in_sock(nam);
|
|
|
|
} else
|
2011-04-30 11:04:34 +00:00
|
|
|
#endif
|
2007-05-11 10:20:51 +00:00
|
|
|
/* scope issues will be handled in in6_getpeeraddr(). */
|
|
|
|
error = in6_getpeeraddr(so, nam);
|
1999-11-22 02:45:11 +00:00
|
|
|
|
|
|
|
return error;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Pass some notification to all connections of a protocol
|
|
|
|
* associated with address dst. The local address and/or port numbers
|
|
|
|
* may be specified to limit the search. The "usual action" will be
|
|
|
|
* taken, depending on the ctlinput cmd. The caller must filter any
|
|
|
|
* cmds that are uninteresting (e.g., no error in the map).
|
|
|
|
* Call the protocol specific routine (if any) to report
|
|
|
|
* any errors for each matching socket.
|
|
|
|
*/
|
|
|
|
void
|
2007-07-05 16:23:49 +00:00
|
|
|
in6_pcbnotify(struct inpcbinfo *pcbinfo, struct sockaddr *dst,
|
|
|
|
u_int fport_arg, const struct sockaddr *src, u_int lport_arg,
|
|
|
|
int cmd, void *cmdarg,
|
2008-01-08 19:08:58 +00:00
|
|
|
struct inpcb *(*notify)(struct inpcb *, int))
|
1999-11-22 02:45:11 +00:00
|
|
|
{
|
2008-04-06 21:20:56 +00:00
|
|
|
struct inpcb *inp, *inp_temp;
|
2001-06-11 12:39:29 +00:00
|
|
|
struct sockaddr_in6 sa6_src, *sa6_dst;
|
1999-11-22 02:45:11 +00:00
|
|
|
u_short fport = fport_arg, lport = lport_arg;
|
2001-06-11 12:39:29 +00:00
|
|
|
u_int32_t flowinfo;
|
2006-04-12 02:52:14 +00:00
|
|
|
int errno;
|
1999-11-22 02:45:11 +00:00
|
|
|
|
2003-09-11 21:40:21 +00:00
|
|
|
if ((unsigned)cmd >= PRC_NCMDS || dst->sa_family != AF_INET6)
|
1999-11-22 02:45:11 +00:00
|
|
|
return;
|
2001-06-11 12:39:29 +00:00
|
|
|
|
|
|
|
sa6_dst = (struct sockaddr_in6 *)dst;
|
|
|
|
if (IN6_IS_ADDR_UNSPECIFIED(&sa6_dst->sin6_addr))
|
1999-11-22 02:45:11 +00:00
|
|
|
return;
|
|
|
|
|
2001-06-11 12:39:29 +00:00
|
|
|
/*
|
|
|
|
* note that src can be NULL when we get notify by local fragmentation.
|
|
|
|
*/
|
2002-02-27 02:44:45 +00:00
|
|
|
sa6_src = (src == NULL) ? sa6_any : *(const struct sockaddr_in6 *)src;
|
2001-06-11 12:39:29 +00:00
|
|
|
flowinfo = sa6_src.sin6_flowinfo;
|
|
|
|
|
1999-11-22 02:45:11 +00:00
|
|
|
/*
|
|
|
|
* Redirects go to all references to the destination,
|
2000-07-04 16:35:15 +00:00
|
|
|
* and use in6_rtchange to invalidate the route cache.
|
|
|
|
* Dead host indications: also use in6_rtchange to invalidate
|
|
|
|
* the cache, and deliver the error to all the sockets.
|
1999-11-22 02:45:11 +00:00
|
|
|
* Otherwise, if we have knowledge of the local port and address,
|
|
|
|
* deliver only to that socket.
|
|
|
|
*/
|
|
|
|
if (PRC_IS_REDIRECT(cmd) || cmd == PRC_HOSTDEAD) {
|
|
|
|
fport = 0;
|
|
|
|
lport = 0;
|
2001-06-11 12:39:29 +00:00
|
|
|
bzero((caddr_t)&sa6_src.sin6_addr, sizeof(sa6_src.sin6_addr));
|
2000-07-04 16:35:15 +00:00
|
|
|
|
2001-06-11 12:39:29 +00:00
|
|
|
if (cmd != PRC_HOSTDEAD)
|
|
|
|
notify = in6_rtchange;
|
1999-11-22 02:45:11 +00:00
|
|
|
}
|
|
|
|
errno = inet6ctlerrmap[cmd];
|
2004-08-06 03:45:45 +00:00
|
|
|
INP_INFO_WLOCK(pcbinfo);
|
2018-06-12 22:18:20 +00:00
|
|
|
CK_LIST_FOREACH_SAFE(inp, pcbinfo->ipi_listhead, inp_list, inp_temp) {
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK(inp);
|
2007-07-05 16:29:40 +00:00
|
|
|
if ((inp->inp_vflag & INP_IPV6) == 0) {
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp);
|
1999-11-22 02:45:11 +00:00
|
|
|
continue;
|
2004-08-06 03:45:45 +00:00
|
|
|
}
|
2000-07-04 16:35:15 +00:00
|
|
|
|
2004-02-13 14:50:01 +00:00
|
|
|
/*
|
|
|
|
* If the error designates a new path MTU for a destination
|
|
|
|
* and the application (associated with this socket) wanted to
|
2015-03-04 11:20:01 +00:00
|
|
|
* know the value, notify.
|
2004-02-13 14:50:01 +00:00
|
|
|
* XXX: should we avoid to notify the value to TCP sockets?
|
|
|
|
*/
|
2015-03-06 05:50:39 +00:00
|
|
|
if (cmd == PRC_MSGSIZE && cmdarg != NULL)
|
2004-02-13 14:50:01 +00:00
|
|
|
ip6_notify_pmtu(inp, (struct sockaddr_in6 *)dst,
|
2015-03-04 11:20:01 +00:00
|
|
|
*(u_int32_t *)cmdarg);
|
2004-02-13 14:50:01 +00:00
|
|
|
|
2001-06-11 12:39:29 +00:00
|
|
|
/*
|
|
|
|
* Detect if we should notify the error. If no source and
|
|
|
|
* destination ports are specifed, but non-zero flowinfo and
|
|
|
|
* local address match, notify the error. This is the case
|
|
|
|
* when the error is delivered with an encrypted buffer
|
|
|
|
* by ESP. Otherwise, just compare addresses and ports
|
|
|
|
* as usual.
|
|
|
|
*/
|
|
|
|
if (lport == 0 && fport == 0 && flowinfo &&
|
|
|
|
inp->inp_socket != NULL &&
|
Another step assimilating IPv[46] PCB code - directly use
the inpcb names rather than the following IPv6 compat macros:
in6pcb,in6p_sp, in6p_ip6_nxt,in6p_flowinfo,in6p_vflag,
in6p_flags,in6p_socket,in6p_lport,in6p_fport,in6p_ppcb and
sotoin6pcb().
Apart from removing duplicate code in netipsec, this is a pure
whitespace, not a functional change.
Discussed with: rwatson
Reviewed by: rwatson (version before review requested changes)
MFC after: 4 weeks (set the timer and see then)
2008-12-15 21:50:54 +00:00
|
|
|
flowinfo == (inp->inp_flow & IPV6_FLOWLABEL_MASK) &&
|
2001-06-11 12:39:29 +00:00
|
|
|
IN6_ARE_ADDR_EQUAL(&inp->in6p_laddr, &sa6_src.sin6_addr))
|
|
|
|
goto do_notify;
|
|
|
|
else if (!IN6_ARE_ADDR_EQUAL(&inp->in6p_faddr,
|
|
|
|
&sa6_dst->sin6_addr) ||
|
|
|
|
inp->inp_socket == 0 ||
|
|
|
|
(lport && inp->inp_lport != lport) ||
|
|
|
|
(!IN6_IS_ADDR_UNSPECIFIED(&sa6_src.sin6_addr) &&
|
|
|
|
!IN6_ARE_ADDR_EQUAL(&inp->in6p_laddr,
|
|
|
|
&sa6_src.sin6_addr)) ||
|
2004-08-06 03:45:45 +00:00
|
|
|
(fport && inp->inp_fport != fport)) {
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp);
|
1999-11-22 02:45:11 +00:00
|
|
|
continue;
|
2004-08-06 03:45:45 +00:00
|
|
|
}
|
2000-07-04 16:35:15 +00:00
|
|
|
|
2001-06-11 12:39:29 +00:00
|
|
|
do_notify:
|
2004-08-21 17:38:48 +00:00
|
|
|
if (notify) {
|
|
|
|
if ((*notify)(inp, errno))
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2004-08-21 17:38:48 +00:00
|
|
|
} else
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp);
|
1999-11-22 02:45:11 +00:00
|
|
|
}
|
2004-08-06 03:45:45 +00:00
|
|
|
INP_INFO_WUNLOCK(pcbinfo);
|
1999-11-22 02:45:11 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
|
|
|
* Lookup a PCB based on the local address and port. Caller must hold the
|
|
|
|
* hash lock. No inpcb locks or references are acquired.
|
1999-11-22 02:45:11 +00:00
|
|
|
*/
|
|
|
|
struct inpcb *
|
2007-07-05 16:23:49 +00:00
|
|
|
in6_pcblookup_local(struct inpcbinfo *pcbinfo, struct in6_addr *laddr,
|
2011-05-23 15:23:18 +00:00
|
|
|
u_short lport, int lookupflags, struct ucred *cred)
|
1999-11-22 02:45:11 +00:00
|
|
|
{
|
2017-05-17 00:34:34 +00:00
|
|
|
struct inpcb *inp;
|
1999-11-22 02:45:11 +00:00
|
|
|
int matchwild = 3, wildcard;
|
|
|
|
|
2011-05-23 15:23:18 +00:00
|
|
|
KASSERT((lookupflags & ~(INPLOOKUP_WILDCARD)) == 0,
|
|
|
|
("%s: invalid lookup flags %d", __func__, lookupflags));
|
|
|
|
|
2019-11-11 06:28:25 +00:00
|
|
|
INP_HASH_LOCK_ASSERT(pcbinfo);
|
2006-04-25 12:09:58 +00:00
|
|
|
|
2011-05-23 15:23:18 +00:00
|
|
|
if ((lookupflags & INPLOOKUP_WILDCARD) == 0) {
|
1999-11-22 02:45:11 +00:00
|
|
|
struct inpcbhead *head;
|
|
|
|
/*
|
|
|
|
* Look for an unconnected (wildcard foreign addr) PCB that
|
|
|
|
* matches the local address and port we're looking for.
|
|
|
|
*/
|
2014-09-10 12:35:42 +00:00
|
|
|
head = &pcbinfo->ipi_hashbase[INP_PCBHASH(
|
|
|
|
INP6_PCBHASHKEY(&in6addr_any), lport, 0,
|
|
|
|
pcbinfo->ipi_hashmask)];
|
2018-06-12 22:18:20 +00:00
|
|
|
CK_LIST_FOREACH(inp, head, inp_hash) {
|
MFp4:
Bring in updated jail support from bz_jail branch.
This enhances the current jail implementation to permit multiple
addresses per jail. In addtion to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..
SCTP support was updated and supports IPv6 in jails as well.
Cpuset support permits jails to be bound to specific processor
sets after creation.
Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.
DDB 'show jails' command was added to aid debugging.
Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.
Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.
Bump __FreeBSD_version for the afore mentioned and in kernel changes.
Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.
Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
|
|
|
/* XXX inp locking */
|
1999-12-21 11:14:12 +00:00
|
|
|
if ((inp->inp_vflag & INP_IPV6) == 0)
|
1999-11-22 02:45:11 +00:00
|
|
|
continue;
|
|
|
|
if (IN6_IS_ADDR_UNSPECIFIED(&inp->in6p_faddr) &&
|
|
|
|
IN6_ARE_ADDR_EQUAL(&inp->in6p_laddr, laddr) &&
|
|
|
|
inp->inp_lport == lport) {
|
MFp4:
Bring in updated jail support from bz_jail branch.
This enhances the current jail implementation to permit multiple
addresses per jail. In addtion to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..
SCTP support was updated and supports IPv6 in jails as well.
Cpuset support permits jails to be bound to specific processor
sets after creation.
Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.
DDB 'show jails' command was added to aid debugging.
Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.
Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.
Bump __FreeBSD_version for the afore mentioned and in kernel changes.
Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.
Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
|
|
|
/* Found. */
|
|
|
|
if (cred == NULL ||
|
2009-05-27 14:11:23 +00:00
|
|
|
prison_equal_ip6(cred->cr_prison,
|
|
|
|
inp->inp_cred->cr_prison))
|
MFp4:
Bring in updated jail support from bz_jail branch.
This enhances the current jail implementation to permit multiple
addresses per jail. In addtion to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..
SCTP support was updated and supports IPv6 in jails as well.
Cpuset support permits jails to be bound to specific processor
sets after creation.
Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.
DDB 'show jails' command was added to aid debugging.
Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.
Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.
Bump __FreeBSD_version for the afore mentioned and in kernel changes.
Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.
Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
|
|
|
return (inp);
|
1999-11-22 02:45:11 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
/*
|
|
|
|
* Not found.
|
|
|
|
*/
|
|
|
|
return (NULL);
|
|
|
|
} else {
|
|
|
|
struct inpcbporthead *porthash;
|
|
|
|
struct inpcbport *phd;
|
|
|
|
struct inpcb *match = NULL;
|
|
|
|
/*
|
|
|
|
* Best fit PCB lookup.
|
|
|
|
*
|
|
|
|
* First see if this local port is in use by looking on the
|
|
|
|
* port hash list.
|
|
|
|
*/
|
2007-04-30 23:12:05 +00:00
|
|
|
porthash = &pcbinfo->ipi_porthashbase[INP_PCBPORTHASH(lport,
|
|
|
|
pcbinfo->ipi_porthashmask)];
|
2018-06-12 22:18:20 +00:00
|
|
|
CK_LIST_FOREACH(phd, porthash, phd_hash) {
|
1999-11-22 02:45:11 +00:00
|
|
|
if (phd->phd_port == lport)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (phd != NULL) {
|
|
|
|
/*
|
|
|
|
* Port is in use by one or more PCBs. Look for best
|
|
|
|
* fit.
|
|
|
|
*/
|
2018-06-12 22:18:20 +00:00
|
|
|
CK_LIST_FOREACH(inp, &phd->phd_pcblist, inp_portlist) {
|
1999-11-22 02:45:11 +00:00
|
|
|
wildcard = 0;
|
MFp4:
Bring in updated jail support from bz_jail branch.
This enhances the current jail implementation to permit multiple
addresses per jail. In addtion to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..
SCTP support was updated and supports IPv6 in jails as well.
Cpuset support permits jails to be bound to specific processor
sets after creation.
Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.
DDB 'show jails' command was added to aid debugging.
Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.
Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.
Bump __FreeBSD_version for the afore mentioned and in kernel changes.
Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.
Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
|
|
|
if (cred != NULL &&
|
2009-05-27 14:11:23 +00:00
|
|
|
!prison_equal_ip6(cred->cr_prison,
|
|
|
|
inp->inp_cred->cr_prison))
|
MFp4:
Bring in updated jail support from bz_jail branch.
This enhances the current jail implementation to permit multiple
addresses per jail. In addtion to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..
SCTP support was updated and supports IPv6 in jails as well.
Cpuset support permits jails to be bound to specific processor
sets after creation.
Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.
DDB 'show jails' command was added to aid debugging.
Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.
Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.
Bump __FreeBSD_version for the afore mentioned and in kernel changes.
Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.
Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
|
|
|
continue;
|
|
|
|
/* XXX inp locking */
|
1999-12-21 11:14:12 +00:00
|
|
|
if ((inp->inp_vflag & INP_IPV6) == 0)
|
1999-11-22 02:45:11 +00:00
|
|
|
continue;
|
|
|
|
if (!IN6_IS_ADDR_UNSPECIFIED(&inp->in6p_faddr))
|
|
|
|
wildcard++;
|
|
|
|
if (!IN6_IS_ADDR_UNSPECIFIED(
|
|
|
|
&inp->in6p_laddr)) {
|
|
|
|
if (IN6_IS_ADDR_UNSPECIFIED(laddr))
|
|
|
|
wildcard++;
|
|
|
|
else if (!IN6_ARE_ADDR_EQUAL(
|
MFp4:
Bring in updated jail support from bz_jail branch.
This enhances the current jail implementation to permit multiple
addresses per jail. In addtion to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..
SCTP support was updated and supports IPv6 in jails as well.
Cpuset support permits jails to be bound to specific processor
sets after creation.
Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.
DDB 'show jails' command was added to aid debugging.
Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.
Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.
Bump __FreeBSD_version for the afore mentioned and in kernel changes.
Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.
Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
|
|
|
&inp->in6p_laddr, laddr))
|
1999-11-22 02:45:11 +00:00
|
|
|
continue;
|
|
|
|
} else {
|
|
|
|
if (!IN6_IS_ADDR_UNSPECIFIED(laddr))
|
|
|
|
wildcard++;
|
|
|
|
}
|
|
|
|
if (wildcard < matchwild) {
|
|
|
|
match = inp;
|
|
|
|
matchwild = wildcard;
|
MFp4:
Bring in updated jail support from bz_jail branch.
This enhances the current jail implementation to permit multiple
addresses per jail. In addtion to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..
SCTP support was updated and supports IPv6 in jails as well.
Cpuset support permits jails to be bound to specific processor
sets after creation.
Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.
DDB 'show jails' command was added to aid debugging.
Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.
Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.
Bump __FreeBSD_version for the afore mentioned and in kernel changes.
Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.
Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
|
|
|
if (matchwild == 0)
|
1999-11-22 02:45:11 +00:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return (match);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2001-08-04 17:10:14 +00:00
|
|
|
void
|
2007-07-05 16:23:49 +00:00
|
|
|
in6_pcbpurgeif0(struct inpcbinfo *pcbinfo, struct ifnet *ifp)
|
2001-08-04 17:10:14 +00:00
|
|
|
{
|
2019-08-02 07:41:36 +00:00
|
|
|
struct inpcb *inp;
|
2019-06-25 11:54:41 +00:00
|
|
|
struct in6_multi *inm;
|
|
|
|
struct in6_mfilter *imf;
|
2001-08-04 17:10:14 +00:00
|
|
|
struct ip6_moptions *im6o;
|
|
|
|
|
2015-08-03 12:13:54 +00:00
|
|
|
INP_INFO_WLOCK(pcbinfo);
|
2019-08-02 07:41:36 +00:00
|
|
|
CK_LIST_FOREACH(inp, pcbinfo->ipi_listhead, inp_list) {
|
|
|
|
INP_WLOCK(inp);
|
|
|
|
if (__predict_false(inp->inp_flags2 & INP_FREED)) {
|
|
|
|
INP_WUNLOCK(inp);
|
2018-09-27 15:32:37 +00:00
|
|
|
continue;
|
|
|
|
}
|
2019-08-02 07:41:36 +00:00
|
|
|
im6o = inp->in6p_moptions;
|
|
|
|
if ((inp->inp_vflag & INP_IPV6) && im6o != NULL) {
|
2001-08-04 17:10:14 +00:00
|
|
|
/*
|
Bite the bullet, and make the IPv6 SSM and MLDv2 mega-commit:
import from p4 bms_netdev. Summary of changes:
* Connect netinet6/in6_mcast.c to build.
The legacy KAME KPIs are mostly preserved.
* Eliminate now dead code from ip6_output.c.
Don't do mbuf bingo, we are not going to do RFC 2292 style
CMSG tricks for multicast options as they are not required
by any current IPv6 normative reference.
* Refactor transports (UDP, raw_ip6) to do own mcast filtering.
SCTP, TCP unaffected by this change.
* Add ip6_msource, in6_msource structs to in6_var.h.
* Hookup mld_ifinfo state to in6_ifextra, allocate from
domifattach path.
* Eliminate IN6_LOOKUP_MULTI(), it is no longer referenced.
Kernel consumers which need this should use in6m_lookup().
* Refactor IPv6 socket group memberships to use a vector (like IPv4).
* Update ifmcstat(8) for IPv6 SSM.
* Add witness lock order for IN6_MULTI_LOCK.
* Move IN6_MULTI_LOCK out of lower ip6_output()/ip6_input() paths.
* Introduce IP6STAT_ADD/SUB/INC/DEC as per rwatson's IPv4 cleanup.
* Update carp(4) for new IPv6 SSM KPIs.
* Virtualize ip6_mrouter socket.
Changes mostly localized to IPv6 MROUTING.
* Don't do a local group lookup in MROUTING.
* Kill unused KAME prototypes in6_purgemkludge(), in6_restoremkludge().
* Preserve KAME DAD timer jitter behaviour in MLDv1 compatibility mode.
* Bump __FreeBSD_version to 800084.
* Update UPDATING.
NOTE WELL:
* This code hasn't been tested against real MLDv2 queriers
(yet), although the on-wire protocol has been verified in Wireshark.
* There are a few unresolved issues in the socket layer APIs to
do with scope ID propagation.
* There is a LOR present in ip6_output()'s use of
in6_setscope() which needs to be resolved. See comments in mld6.c.
This is believed to be benign and can't be avoided for the moment
without re-introducing an indirect netisr.
This work was mostly derived from the IGMPv3 implementation, and
has been sponsored by a third party.
2009-04-29 19:19:13 +00:00
|
|
|
* Unselect the outgoing ifp for multicast if it
|
|
|
|
* is being detached.
|
2001-08-04 17:10:14 +00:00
|
|
|
*/
|
|
|
|
if (im6o->im6o_multicast_ifp == ifp)
|
|
|
|
im6o->im6o_multicast_ifp = NULL;
|
|
|
|
/*
|
|
|
|
* Drop multicast group membership if we joined
|
|
|
|
* through the interface being detached.
|
|
|
|
*/
|
2019-06-25 11:54:41 +00:00
|
|
|
restart:
|
|
|
|
IP6_MFILTER_FOREACH(imf, &im6o->im6o_head) {
|
|
|
|
if ((inm = imf->im6f_in6m) == NULL)
|
|
|
|
continue;
|
|
|
|
if (inm->in6m_ifp != ifp)
|
|
|
|
continue;
|
|
|
|
ip6_mfilter_remove(&im6o->im6o_head, imf);
|
|
|
|
IN6_MULTI_LOCK_ASSERT();
|
|
|
|
in6_leavegroup_locked(inm, NULL);
|
|
|
|
ip6_mfilter_free(imf);
|
|
|
|
goto restart;
|
2001-08-04 17:10:14 +00:00
|
|
|
}
|
|
|
|
}
|
2019-08-02 07:41:36 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2001-08-04 17:10:14 +00:00
|
|
|
}
|
2015-08-03 12:13:54 +00:00
|
|
|
INP_INFO_WUNLOCK(pcbinfo);
|
2001-08-04 17:10:14 +00:00
|
|
|
}
|
|
|
|
|
1999-11-22 02:45:11 +00:00
|
|
|
/*
|
|
|
|
* Check for alternatives when higher level complains
|
|
|
|
* about service problems. For now, invalidate cached
|
|
|
|
* routing information. If the route was created dynamically
|
|
|
|
* (by a redirect), time to try a default gateway again.
|
|
|
|
*/
|
|
|
|
void
|
2018-09-03 22:27:27 +00:00
|
|
|
in6_losing(struct inpcb *inp)
|
1999-11-22 02:45:11 +00:00
|
|
|
{
|
2007-07-05 16:23:49 +00:00
|
|
|
|
2018-09-03 22:27:27 +00:00
|
|
|
RO_INVALIDATE_CACHE(&inp->inp_route6);
|
1999-11-22 02:45:11 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* After a routing change, flush old routing
|
|
|
|
* and allocate a (hopefully) better one.
|
|
|
|
*/
|
2002-06-14 08:35:21 +00:00
|
|
|
struct inpcb *
|
2018-09-03 22:27:27 +00:00
|
|
|
in6_rtchange(struct inpcb *inp, int errno __unused)
|
1999-11-22 02:45:11 +00:00
|
|
|
{
|
2016-03-24 07:54:56 +00:00
|
|
|
|
2018-09-03 22:27:27 +00:00
|
|
|
RO_INVALIDATE_CACHE(&inp->inp_route6);
|
2002-06-14 08:35:21 +00:00
|
|
|
return inp;
|
1999-11-22 02:45:11 +00:00
|
|
|
}
|
|
|
|
|
2018-06-06 15:45:57 +00:00
|
|
|
static struct inpcb *
|
|
|
|
in6_pcblookup_lbgroup(const struct inpcbinfo *pcbinfo,
|
|
|
|
const struct in6_addr *laddr, uint16_t lport, const struct in6_addr *faddr,
|
Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain
In order to efficiently serve web traffic on a NUMA
machine, one must avoid as many NUMA domain crossings as
possible. With SO_REUSEPORT_LB, a number of workers can share a
listen socket. However, even if a worker sets affinity to a core
or set of cores on a NUMA domain, it will receive connections
associated with all NUMA domains in the system. This will lead to
cross-domain traffic when the server writes to the socket or
calls sendfile(), and memory is allocated on the server's local
NUMA node, but transmitted on the NUMA node associated with the
TCP connection. Similarly, when the server reads from the socket,
he will likely be reading memory allocated on the NUMA domain
associated with the TCP connection.
This change provides a new socket ioctl, TCP_REUSPORT_LB_NUMA. A
server can now tell the kernel to filter traffic so that only
incoming connections associated with the desired NUMA domain are
given to the server. (Of course, in the case where there are no
servers sharing the listen socket on some domain, then as a
fallback, traffic will be hashed as normal to all servers sharing
the listen socket regardless of domain). This allows a server to
deal only with traffic that is local to its NUMA domain, and
avoids cross-domain traffic in most cases.
This patch, and a corresponding small patch to nginx to use
TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted
https media content from dual-socket Xeons with only 13% (as
measured by pcm.x) cross domain traffic on the memory controller.
Reviewed by: jhb, bz (earlier version), bcr (man page)
Tested by: gonzo
Sponsored by: Netfix
Differential Revision: https://reviews.freebsd.org/D21636
2020-12-19 22:04:46 +00:00
|
|
|
uint16_t fport, int lookupflags, uint8_t numa_domain)
|
2018-06-06 15:45:57 +00:00
|
|
|
{
|
Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain
In order to efficiently serve web traffic on a NUMA
machine, one must avoid as many NUMA domain crossings as
possible. With SO_REUSEPORT_LB, a number of workers can share a
listen socket. However, even if a worker sets affinity to a core
or set of cores on a NUMA domain, it will receive connections
associated with all NUMA domains in the system. This will lead to
cross-domain traffic when the server writes to the socket or
calls sendfile(), and memory is allocated on the server's local
NUMA node, but transmitted on the NUMA node associated with the
TCP connection. Similarly, when the server reads from the socket,
he will likely be reading memory allocated on the NUMA domain
associated with the TCP connection.
This change provides a new socket ioctl, TCP_REUSPORT_LB_NUMA. A
server can now tell the kernel to filter traffic so that only
incoming connections associated with the desired NUMA domain are
given to the server. (Of course, in the case where there are no
servers sharing the listen socket on some domain, then as a
fallback, traffic will be hashed as normal to all servers sharing
the listen socket regardless of domain). This allows a server to
deal only with traffic that is local to its NUMA domain, and
avoids cross-domain traffic in most cases.
This patch, and a corresponding small patch to nginx to use
TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted
https media content from dual-socket Xeons with only 13% (as
measured by pcm.x) cross domain traffic on the memory controller.
Reviewed by: jhb, bz (earlier version), bcr (man page)
Tested by: gonzo
Sponsored by: Netfix
Differential Revision: https://reviews.freebsd.org/D21636
2020-12-19 22:04:46 +00:00
|
|
|
struct inpcb *local_wild, *numa_wild;
|
2018-06-06 15:45:57 +00:00
|
|
|
const struct inpcblbgrouphead *hdr;
|
|
|
|
struct inpcblbgroup *grp;
|
|
|
|
uint32_t idx;
|
|
|
|
|
|
|
|
INP_HASH_LOCK_ASSERT(pcbinfo);
|
|
|
|
|
2018-12-05 17:06:00 +00:00
|
|
|
hdr = &pcbinfo->ipi_lbgrouphashbase[
|
|
|
|
INP_PCBPORTHASH(lport, pcbinfo->ipi_lbgrouphashmask)];
|
2018-06-06 15:45:57 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Order of socket selection:
|
|
|
|
* 1. non-wild.
|
|
|
|
* 2. wild (if lookupflags contains INPLOOKUP_WILDCARD).
|
|
|
|
*
|
|
|
|
* NOTE:
|
|
|
|
* - Load balanced group does not contain jailed sockets.
|
|
|
|
* - Load balanced does not contain IPv4 mapped INET6 wild sockets.
|
|
|
|
*/
|
2018-10-22 16:09:01 +00:00
|
|
|
local_wild = NULL;
|
Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain
In order to efficiently serve web traffic on a NUMA
machine, one must avoid as many NUMA domain crossings as
possible. With SO_REUSEPORT_LB, a number of workers can share a
listen socket. However, even if a worker sets affinity to a core
or set of cores on a NUMA domain, it will receive connections
associated with all NUMA domains in the system. This will lead to
cross-domain traffic when the server writes to the socket or
calls sendfile(), and memory is allocated on the server's local
NUMA node, but transmitted on the NUMA node associated with the
TCP connection. Similarly, when the server reads from the socket,
he will likely be reading memory allocated on the NUMA domain
associated with the TCP connection.
This change provides a new socket ioctl, TCP_REUSPORT_LB_NUMA. A
server can now tell the kernel to filter traffic so that only
incoming connections associated with the desired NUMA domain are
given to the server. (Of course, in the case where there are no
servers sharing the listen socket on some domain, then as a
fallback, traffic will be hashed as normal to all servers sharing
the listen socket regardless of domain). This allows a server to
deal only with traffic that is local to its NUMA domain, and
avoids cross-domain traffic in most cases.
This patch, and a corresponding small patch to nginx to use
TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted
https media content from dual-socket Xeons with only 13% (as
measured by pcm.x) cross domain traffic on the memory controller.
Reviewed by: jhb, bz (earlier version), bcr (man page)
Tested by: gonzo
Sponsored by: Netfix
Differential Revision: https://reviews.freebsd.org/D21636
2020-12-19 22:04:46 +00:00
|
|
|
numa_wild = NULL;
|
2018-09-10 19:00:29 +00:00
|
|
|
CK_LIST_FOREACH(grp, hdr, il_list) {
|
2018-08-27 18:13:20 +00:00
|
|
|
#ifdef INET
|
|
|
|
if (!(grp->il_vflag & INP_IPV6))
|
|
|
|
continue;
|
|
|
|
#endif
|
2018-10-22 16:09:01 +00:00
|
|
|
if (grp->il_lport != lport)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
idx = INP_PCBLBGROUP_PKTHASH(INP6_PCBHASHKEY(faddr), lport,
|
|
|
|
fport) % grp->il_inpcnt;
|
Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain
In order to efficiently serve web traffic on a NUMA
machine, one must avoid as many NUMA domain crossings as
possible. With SO_REUSEPORT_LB, a number of workers can share a
listen socket. However, even if a worker sets affinity to a core
or set of cores on a NUMA domain, it will receive connections
associated with all NUMA domains in the system. This will lead to
cross-domain traffic when the server writes to the socket or
calls sendfile(), and memory is allocated on the server's local
NUMA node, but transmitted on the NUMA node associated with the
TCP connection. Similarly, when the server reads from the socket,
he will likely be reading memory allocated on the NUMA domain
associated with the TCP connection.
This change provides a new socket ioctl, TCP_REUSPORT_LB_NUMA. A
server can now tell the kernel to filter traffic so that only
incoming connections associated with the desired NUMA domain are
given to the server. (Of course, in the case where there are no
servers sharing the listen socket on some domain, then as a
fallback, traffic will be hashed as normal to all servers sharing
the listen socket regardless of domain). This allows a server to
deal only with traffic that is local to its NUMA domain, and
avoids cross-domain traffic in most cases.
This patch, and a corresponding small patch to nginx to use
TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted
https media content from dual-socket Xeons with only 13% (as
measured by pcm.x) cross domain traffic on the memory controller.
Reviewed by: jhb, bz (earlier version), bcr (man page)
Tested by: gonzo
Sponsored by: Netfix
Differential Revision: https://reviews.freebsd.org/D21636
2020-12-19 22:04:46 +00:00
|
|
|
if (IN6_ARE_ADDR_EQUAL(&grp->il6_laddr, laddr)) {
|
|
|
|
if (numa_domain == M_NODOM ||
|
|
|
|
grp->il_numa_domain == numa_domain) {
|
|
|
|
return (grp->il_inp[idx]);
|
|
|
|
}
|
|
|
|
else
|
|
|
|
numa_wild = grp->il_inp[idx];
|
|
|
|
}
|
2018-10-22 16:09:01 +00:00
|
|
|
if (IN6_IS_ADDR_UNSPECIFIED(&grp->il6_laddr) &&
|
Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain
In order to efficiently serve web traffic on a NUMA
machine, one must avoid as many NUMA domain crossings as
possible. With SO_REUSEPORT_LB, a number of workers can share a
listen socket. However, even if a worker sets affinity to a core
or set of cores on a NUMA domain, it will receive connections
associated with all NUMA domains in the system. This will lead to
cross-domain traffic when the server writes to the socket or
calls sendfile(), and memory is allocated on the server's local
NUMA node, but transmitted on the NUMA node associated with the
TCP connection. Similarly, when the server reads from the socket,
he will likely be reading memory allocated on the NUMA domain
associated with the TCP connection.
This change provides a new socket ioctl, TCP_REUSPORT_LB_NUMA. A
server can now tell the kernel to filter traffic so that only
incoming connections associated with the desired NUMA domain are
given to the server. (Of course, in the case where there are no
servers sharing the listen socket on some domain, then as a
fallback, traffic will be hashed as normal to all servers sharing
the listen socket regardless of domain). This allows a server to
deal only with traffic that is local to its NUMA domain, and
avoids cross-domain traffic in most cases.
This patch, and a corresponding small patch to nginx to use
TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted
https media content from dual-socket Xeons with only 13% (as
measured by pcm.x) cross domain traffic on the memory controller.
Reviewed by: jhb, bz (earlier version), bcr (man page)
Tested by: gonzo
Sponsored by: Netfix
Differential Revision: https://reviews.freebsd.org/D21636
2020-12-19 22:04:46 +00:00
|
|
|
(lookupflags & INPLOOKUP_WILDCARD) != 0 &&
|
|
|
|
(local_wild == NULL || numa_domain == M_NODOM ||
|
|
|
|
grp->il_numa_domain == numa_domain)) {
|
2018-10-22 16:09:01 +00:00
|
|
|
local_wild = grp->il_inp[idx];
|
Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain
In order to efficiently serve web traffic on a NUMA
machine, one must avoid as many NUMA domain crossings as
possible. With SO_REUSEPORT_LB, a number of workers can share a
listen socket. However, even if a worker sets affinity to a core
or set of cores on a NUMA domain, it will receive connections
associated with all NUMA domains in the system. This will lead to
cross-domain traffic when the server writes to the socket or
calls sendfile(), and memory is allocated on the server's local
NUMA node, but transmitted on the NUMA node associated with the
TCP connection. Similarly, when the server reads from the socket,
he will likely be reading memory allocated on the NUMA domain
associated with the TCP connection.
This change provides a new socket ioctl, TCP_REUSPORT_LB_NUMA. A
server can now tell the kernel to filter traffic so that only
incoming connections associated with the desired NUMA domain are
given to the server. (Of course, in the case where there are no
servers sharing the listen socket on some domain, then as a
fallback, traffic will be hashed as normal to all servers sharing
the listen socket regardless of domain). This allows a server to
deal only with traffic that is local to its NUMA domain, and
avoids cross-domain traffic in most cases.
This patch, and a corresponding small patch to nginx to use
TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted
https media content from dual-socket Xeons with only 13% (as
measured by pcm.x) cross domain traffic on the memory controller.
Reviewed by: jhb, bz (earlier version), bcr (man page)
Tested by: gonzo
Sponsored by: Netfix
Differential Revision: https://reviews.freebsd.org/D21636
2020-12-19 22:04:46 +00:00
|
|
|
}
|
2018-06-06 15:45:57 +00:00
|
|
|
}
|
Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain
In order to efficiently serve web traffic on a NUMA
machine, one must avoid as many NUMA domain crossings as
possible. With SO_REUSEPORT_LB, a number of workers can share a
listen socket. However, even if a worker sets affinity to a core
or set of cores on a NUMA domain, it will receive connections
associated with all NUMA domains in the system. This will lead to
cross-domain traffic when the server writes to the socket or
calls sendfile(), and memory is allocated on the server's local
NUMA node, but transmitted on the NUMA node associated with the
TCP connection. Similarly, when the server reads from the socket,
he will likely be reading memory allocated on the NUMA domain
associated with the TCP connection.
This change provides a new socket ioctl, TCP_REUSPORT_LB_NUMA. A
server can now tell the kernel to filter traffic so that only
incoming connections associated with the desired NUMA domain are
given to the server. (Of course, in the case where there are no
servers sharing the listen socket on some domain, then as a
fallback, traffic will be hashed as normal to all servers sharing
the listen socket regardless of domain). This allows a server to
deal only with traffic that is local to its NUMA domain, and
avoids cross-domain traffic in most cases.
This patch, and a corresponding small patch to nginx to use
TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted
https media content from dual-socket Xeons with only 13% (as
measured by pcm.x) cross domain traffic on the memory controller.
Reviewed by: jhb, bz (earlier version), bcr (man page)
Tested by: gonzo
Sponsored by: Netfix
Differential Revision: https://reviews.freebsd.org/D21636
2020-12-19 22:04:46 +00:00
|
|
|
if (numa_wild != NULL)
|
|
|
|
return (numa_wild);
|
2018-10-22 16:09:01 +00:00
|
|
|
return (local_wild);
|
2018-06-06 15:45:57 +00:00
|
|
|
}
|
|
|
|
|
Implement a CPU-affine TCP and UDP connection lookup data structure,
struct inpcbgroup. pcbgroups, or "connection groups", supplement the
existing inpcbinfo connection hash table, which when pcbgroups are
enabled, might now be thought of more usefully as a per-protocol
4-tuple reservation table.
Connections are assigned to connection groups base on a hash of their
4-tuple; wildcard sockets require special handling, and are members
of all connection groups. During a connection lookup, a
per-connection group lock is employed rather than the global pcbinfo
lock. By aligning connection groups with input path processing,
connection groups take on an effective CPU affinity, especially when
aligned with RSS work placement (see a forthcoming commit for
details). This eliminates cache line migration associated with
global, protocol-layer data structures in steady state TCP and UDP
processing (with the exception of protocol-layer statistics; further
commit to follow).
Elements of this approach were inspired by Willman, Rixner, and Cox's
2006 USENIX paper, "An Evaluation of Network Stack Parallelization
Strategies in Modern Operating Systems". However, there are also
significant differences: we maintain the inpcb lock, rather than using
the connection group lock for per-connection state.
Likewise, the focus of this implementation is alignment with NIC
packet distribution strategies such as RSS, rather than pure software
strategies. Despite that focus, software distribution is supported
through the parallel netisr implementation, and works well in
configurations where the number of hardware threads is greater than
the number of NIC input queues, such as in the RMI XLR threaded MIPS
architecture.
Another important difference is the continued maintenance of existing
hash tables as "reservation tables" -- these are useful both to
distinguish the resource allocation aspect of protocol name management
and the more common-case lookup aspect. In configurations where
connection tables are aligned with hardware hashes, it is desirable to
use the traditional lookup tables for loopback or encapsulated traffic
rather than take the expense of hardware hashes that are hard to
implement efficiently in software (such as RSS Toeplitz).
Connection group support is enabled by compiling "options PCBGROUP"
into your kernel configuration; for the time being, this is an
experimental feature, and hence is not enabled by default.
Subject to the limited MFCability of change dependencies in inpcb,
and its change to the inpcbinfo init function signature, this change
in principle could be merged to FreeBSD 8.x.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-06-06 12:55:02 +00:00
|
|
|
#ifdef PCBGROUP
|
|
|
|
/*
|
|
|
|
* Lookup PCB in hash list, using pcbgroup tables.
|
|
|
|
*/
|
|
|
|
static struct inpcb *
|
|
|
|
in6_pcblookup_group(struct inpcbinfo *pcbinfo, struct inpcbgroup *pcbgroup,
|
|
|
|
struct in6_addr *faddr, u_int fport_arg, struct in6_addr *laddr,
|
|
|
|
u_int lport_arg, int lookupflags, struct ifnet *ifp)
|
|
|
|
{
|
|
|
|
struct inpcbhead *head;
|
|
|
|
struct inpcb *inp, *tmpinp;
|
|
|
|
u_short fport = fport_arg, lport = lport_arg;
|
2018-03-21 15:54:46 +00:00
|
|
|
bool locked;
|
Implement a CPU-affine TCP and UDP connection lookup data structure,
struct inpcbgroup. pcbgroups, or "connection groups", supplement the
existing inpcbinfo connection hash table, which when pcbgroups are
enabled, might now be thought of more usefully as a per-protocol
4-tuple reservation table.
Connections are assigned to connection groups base on a hash of their
4-tuple; wildcard sockets require special handling, and are members
of all connection groups. During a connection lookup, a
per-connection group lock is employed rather than the global pcbinfo
lock. By aligning connection groups with input path processing,
connection groups take on an effective CPU affinity, especially when
aligned with RSS work placement (see a forthcoming commit for
details). This eliminates cache line migration associated with
global, protocol-layer data structures in steady state TCP and UDP
processing (with the exception of protocol-layer statistics; further
commit to follow).
Elements of this approach were inspired by Willman, Rixner, and Cox's
2006 USENIX paper, "An Evaluation of Network Stack Parallelization
Strategies in Modern Operating Systems". However, there are also
significant differences: we maintain the inpcb lock, rather than using
the connection group lock for per-connection state.
Likewise, the focus of this implementation is alignment with NIC
packet distribution strategies such as RSS, rather than pure software
strategies. Despite that focus, software distribution is supported
through the parallel netisr implementation, and works well in
configurations where the number of hardware threads is greater than
the number of NIC input queues, such as in the RMI XLR threaded MIPS
architecture.
Another important difference is the continued maintenance of existing
hash tables as "reservation tables" -- these are useful both to
distinguish the resource allocation aspect of protocol name management
and the more common-case lookup aspect. In configurations where
connection tables are aligned with hardware hashes, it is desirable to
use the traditional lookup tables for loopback or encapsulated traffic
rather than take the expense of hardware hashes that are hard to
implement efficiently in software (such as RSS Toeplitz).
Connection group support is enabled by compiling "options PCBGROUP"
into your kernel configuration; for the time being, this is an
experimental feature, and hence is not enabled by default.
Subject to the limited MFCability of change dependencies in inpcb,
and its change to the inpcbinfo init function signature, this change
in principle could be merged to FreeBSD 8.x.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-06-06 12:55:02 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* First look for an exact match.
|
|
|
|
*/
|
|
|
|
tmpinp = NULL;
|
|
|
|
INP_GROUP_LOCK(pcbgroup);
|
2014-09-10 12:35:42 +00:00
|
|
|
head = &pcbgroup->ipg_hashbase[INP_PCBHASH(
|
|
|
|
INP6_PCBHASHKEY(faddr), lport, fport, pcbgroup->ipg_hashmask)];
|
2018-06-13 23:19:54 +00:00
|
|
|
CK_LIST_FOREACH(inp, head, inp_pcbgrouphash) {
|
Implement a CPU-affine TCP and UDP connection lookup data structure,
struct inpcbgroup. pcbgroups, or "connection groups", supplement the
existing inpcbinfo connection hash table, which when pcbgroups are
enabled, might now be thought of more usefully as a per-protocol
4-tuple reservation table.
Connections are assigned to connection groups base on a hash of their
4-tuple; wildcard sockets require special handling, and are members
of all connection groups. During a connection lookup, a
per-connection group lock is employed rather than the global pcbinfo
lock. By aligning connection groups with input path processing,
connection groups take on an effective CPU affinity, especially when
aligned with RSS work placement (see a forthcoming commit for
details). This eliminates cache line migration associated with
global, protocol-layer data structures in steady state TCP and UDP
processing (with the exception of protocol-layer statistics; further
commit to follow).
Elements of this approach were inspired by Willman, Rixner, and Cox's
2006 USENIX paper, "An Evaluation of Network Stack Parallelization
Strategies in Modern Operating Systems". However, there are also
significant differences: we maintain the inpcb lock, rather than using
the connection group lock for per-connection state.
Likewise, the focus of this implementation is alignment with NIC
packet distribution strategies such as RSS, rather than pure software
strategies. Despite that focus, software distribution is supported
through the parallel netisr implementation, and works well in
configurations where the number of hardware threads is greater than
the number of NIC input queues, such as in the RMI XLR threaded MIPS
architecture.
Another important difference is the continued maintenance of existing
hash tables as "reservation tables" -- these are useful both to
distinguish the resource allocation aspect of protocol name management
and the more common-case lookup aspect. In configurations where
connection tables are aligned with hardware hashes, it is desirable to
use the traditional lookup tables for loopback or encapsulated traffic
rather than take the expense of hardware hashes that are hard to
implement efficiently in software (such as RSS Toeplitz).
Connection group support is enabled by compiling "options PCBGROUP"
into your kernel configuration; for the time being, this is an
experimental feature, and hence is not enabled by default.
Subject to the limited MFCability of change dependencies in inpcb,
and its change to the inpcbinfo init function signature, this change
in principle could be merged to FreeBSD 8.x.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-06-06 12:55:02 +00:00
|
|
|
/* XXX inp locking */
|
|
|
|
if ((inp->inp_vflag & INP_IPV6) == 0)
|
|
|
|
continue;
|
|
|
|
if (IN6_ARE_ADDR_EQUAL(&inp->in6p_faddr, faddr) &&
|
|
|
|
IN6_ARE_ADDR_EQUAL(&inp->in6p_laddr, laddr) &&
|
|
|
|
inp->inp_fport == fport &&
|
|
|
|
inp->inp_lport == lport) {
|
|
|
|
/*
|
|
|
|
* XXX We should be able to directly return
|
|
|
|
* the inp here, without any checks.
|
|
|
|
* Well unless both bound with SO_REUSEPORT?
|
|
|
|
*/
|
|
|
|
if (prison_flag(inp->inp_cred, PR_IP6))
|
|
|
|
goto found;
|
|
|
|
if (tmpinp == NULL)
|
|
|
|
tmpinp = inp;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (tmpinp != NULL) {
|
|
|
|
inp = tmpinp;
|
|
|
|
goto found;
|
|
|
|
}
|
|
|
|
|
2014-07-12 05:46:33 +00:00
|
|
|
/*
|
|
|
|
* Then look for a wildcard match in the pcbgroup.
|
|
|
|
*/
|
|
|
|
if ((lookupflags & INPLOOKUP_WILDCARD) != 0) {
|
|
|
|
struct inpcb *local_wild = NULL, *local_exact = NULL;
|
|
|
|
struct inpcb *jail_wild = NULL;
|
|
|
|
int injail;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Order of socket selection - we always prefer jails.
|
|
|
|
* 1. jailed, non-wild.
|
|
|
|
* 2. jailed, wild.
|
|
|
|
* 3. non-jailed, non-wild.
|
|
|
|
* 4. non-jailed, wild.
|
|
|
|
*/
|
|
|
|
head = &pcbgroup->ipg_hashbase[
|
|
|
|
INP_PCBHASH(INADDR_ANY, lport, 0, pcbgroup->ipg_hashmask)];
|
2018-06-13 23:19:54 +00:00
|
|
|
CK_LIST_FOREACH(inp, head, inp_pcbgrouphash) {
|
2014-07-12 05:46:33 +00:00
|
|
|
/* XXX inp locking */
|
|
|
|
if ((inp->inp_vflag & INP_IPV6) == 0)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (!IN6_IS_ADDR_UNSPECIFIED(&inp->in6p_faddr) ||
|
|
|
|
inp->inp_lport != lport) {
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
injail = prison_flag(inp->inp_cred, PR_IP6);
|
|
|
|
if (injail) {
|
|
|
|
if (prison_check_ip6(inp->inp_cred,
|
|
|
|
laddr) != 0)
|
|
|
|
continue;
|
|
|
|
} else {
|
|
|
|
if (local_exact != NULL)
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (IN6_ARE_ADDR_EQUAL(&inp->in6p_laddr, laddr)) {
|
|
|
|
if (injail)
|
|
|
|
goto found;
|
|
|
|
else
|
|
|
|
local_exact = inp;
|
|
|
|
} else if (IN6_IS_ADDR_UNSPECIFIED(&inp->in6p_laddr)) {
|
|
|
|
if (injail)
|
|
|
|
jail_wild = inp;
|
|
|
|
else
|
|
|
|
local_wild = inp;
|
|
|
|
}
|
|
|
|
} /* LIST_FOREACH */
|
|
|
|
|
|
|
|
inp = jail_wild;
|
|
|
|
if (inp == NULL)
|
|
|
|
inp = jail_wild;
|
|
|
|
if (inp == NULL)
|
|
|
|
inp = local_exact;
|
|
|
|
if (inp == NULL)
|
|
|
|
inp = local_wild;
|
|
|
|
if (inp != NULL)
|
|
|
|
goto found;
|
|
|
|
}
|
|
|
|
|
Implement a CPU-affine TCP and UDP connection lookup data structure,
struct inpcbgroup. pcbgroups, or "connection groups", supplement the
existing inpcbinfo connection hash table, which when pcbgroups are
enabled, might now be thought of more usefully as a per-protocol
4-tuple reservation table.
Connections are assigned to connection groups base on a hash of their
4-tuple; wildcard sockets require special handling, and are members
of all connection groups. During a connection lookup, a
per-connection group lock is employed rather than the global pcbinfo
lock. By aligning connection groups with input path processing,
connection groups take on an effective CPU affinity, especially when
aligned with RSS work placement (see a forthcoming commit for
details). This eliminates cache line migration associated with
global, protocol-layer data structures in steady state TCP and UDP
processing (with the exception of protocol-layer statistics; further
commit to follow).
Elements of this approach were inspired by Willman, Rixner, and Cox's
2006 USENIX paper, "An Evaluation of Network Stack Parallelization
Strategies in Modern Operating Systems". However, there are also
significant differences: we maintain the inpcb lock, rather than using
the connection group lock for per-connection state.
Likewise, the focus of this implementation is alignment with NIC
packet distribution strategies such as RSS, rather than pure software
strategies. Despite that focus, software distribution is supported
through the parallel netisr implementation, and works well in
configurations where the number of hardware threads is greater than
the number of NIC input queues, such as in the RMI XLR threaded MIPS
architecture.
Another important difference is the continued maintenance of existing
hash tables as "reservation tables" -- these are useful both to
distinguish the resource allocation aspect of protocol name management
and the more common-case lookup aspect. In configurations where
connection tables are aligned with hardware hashes, it is desirable to
use the traditional lookup tables for loopback or encapsulated traffic
rather than take the expense of hardware hashes that are hard to
implement efficiently in software (such as RSS Toeplitz).
Connection group support is enabled by compiling "options PCBGROUP"
into your kernel configuration; for the time being, this is an
experimental feature, and hence is not enabled by default.
Subject to the limited MFCability of change dependencies in inpcb,
and its change to the inpcbinfo init function signature, this change
in principle could be merged to FreeBSD 8.x.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-06-06 12:55:02 +00:00
|
|
|
/*
|
|
|
|
* Then look for a wildcard match, if requested.
|
|
|
|
*/
|
|
|
|
if ((lookupflags & INPLOOKUP_WILDCARD) != 0) {
|
|
|
|
struct inpcb *local_wild = NULL, *local_exact = NULL;
|
|
|
|
struct inpcb *jail_wild = NULL;
|
|
|
|
int injail;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Order of socket selection - we always prefer jails.
|
|
|
|
* 1. jailed, non-wild.
|
|
|
|
* 2. jailed, wild.
|
|
|
|
* 3. non-jailed, non-wild.
|
|
|
|
* 4. non-jailed, wild.
|
|
|
|
*/
|
2014-09-10 12:35:42 +00:00
|
|
|
head = &pcbinfo->ipi_wildbase[INP_PCBHASH(
|
|
|
|
INP6_PCBHASHKEY(&in6addr_any), lport, 0,
|
|
|
|
pcbinfo->ipi_wildmask)];
|
2018-06-13 23:19:54 +00:00
|
|
|
CK_LIST_FOREACH(inp, head, inp_pcbgroup_wild) {
|
Implement a CPU-affine TCP and UDP connection lookup data structure,
struct inpcbgroup. pcbgroups, or "connection groups", supplement the
existing inpcbinfo connection hash table, which when pcbgroups are
enabled, might now be thought of more usefully as a per-protocol
4-tuple reservation table.
Connections are assigned to connection groups base on a hash of their
4-tuple; wildcard sockets require special handling, and are members
of all connection groups. During a connection lookup, a
per-connection group lock is employed rather than the global pcbinfo
lock. By aligning connection groups with input path processing,
connection groups take on an effective CPU affinity, especially when
aligned with RSS work placement (see a forthcoming commit for
details). This eliminates cache line migration associated with
global, protocol-layer data structures in steady state TCP and UDP
processing (with the exception of protocol-layer statistics; further
commit to follow).
Elements of this approach were inspired by Willman, Rixner, and Cox's
2006 USENIX paper, "An Evaluation of Network Stack Parallelization
Strategies in Modern Operating Systems". However, there are also
significant differences: we maintain the inpcb lock, rather than using
the connection group lock for per-connection state.
Likewise, the focus of this implementation is alignment with NIC
packet distribution strategies such as RSS, rather than pure software
strategies. Despite that focus, software distribution is supported
through the parallel netisr implementation, and works well in
configurations where the number of hardware threads is greater than
the number of NIC input queues, such as in the RMI XLR threaded MIPS
architecture.
Another important difference is the continued maintenance of existing
hash tables as "reservation tables" -- these are useful both to
distinguish the resource allocation aspect of protocol name management
and the more common-case lookup aspect. In configurations where
connection tables are aligned with hardware hashes, it is desirable to
use the traditional lookup tables for loopback or encapsulated traffic
rather than take the expense of hardware hashes that are hard to
implement efficiently in software (such as RSS Toeplitz).
Connection group support is enabled by compiling "options PCBGROUP"
into your kernel configuration; for the time being, this is an
experimental feature, and hence is not enabled by default.
Subject to the limited MFCability of change dependencies in inpcb,
and its change to the inpcbinfo init function signature, this change
in principle could be merged to FreeBSD 8.x.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-06-06 12:55:02 +00:00
|
|
|
/* XXX inp locking */
|
|
|
|
if ((inp->inp_vflag & INP_IPV6) == 0)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (!IN6_IS_ADDR_UNSPECIFIED(&inp->in6p_faddr) ||
|
|
|
|
inp->inp_lport != lport) {
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
injail = prison_flag(inp->inp_cred, PR_IP6);
|
|
|
|
if (injail) {
|
|
|
|
if (prison_check_ip6(inp->inp_cred,
|
|
|
|
laddr) != 0)
|
|
|
|
continue;
|
|
|
|
} else {
|
|
|
|
if (local_exact != NULL)
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (IN6_ARE_ADDR_EQUAL(&inp->in6p_laddr, laddr)) {
|
|
|
|
if (injail)
|
|
|
|
goto found;
|
|
|
|
else
|
|
|
|
local_exact = inp;
|
|
|
|
} else if (IN6_IS_ADDR_UNSPECIFIED(&inp->in6p_laddr)) {
|
|
|
|
if (injail)
|
|
|
|
jail_wild = inp;
|
|
|
|
else
|
|
|
|
local_wild = inp;
|
|
|
|
}
|
|
|
|
} /* LIST_FOREACH */
|
|
|
|
|
|
|
|
inp = jail_wild;
|
|
|
|
if (inp == NULL)
|
|
|
|
inp = jail_wild;
|
|
|
|
if (inp == NULL)
|
|
|
|
inp = local_exact;
|
|
|
|
if (inp == NULL)
|
|
|
|
inp = local_wild;
|
|
|
|
if (inp != NULL)
|
|
|
|
goto found;
|
|
|
|
} /* if ((lookupflags & INPLOOKUP_WILDCARD) != 0) */
|
|
|
|
INP_GROUP_UNLOCK(pcbgroup);
|
|
|
|
return (NULL);
|
|
|
|
|
|
|
|
found:
|
2018-03-21 15:54:46 +00:00
|
|
|
if (lookupflags & INPLOOKUP_WLOCKPCB)
|
|
|
|
locked = INP_TRY_WLOCK(inp);
|
|
|
|
else if (lookupflags & INPLOOKUP_RLOCKPCB)
|
|
|
|
locked = INP_TRY_RLOCK(inp);
|
|
|
|
else
|
Implement a CPU-affine TCP and UDP connection lookup data structure,
struct inpcbgroup. pcbgroups, or "connection groups", supplement the
existing inpcbinfo connection hash table, which when pcbgroups are
enabled, might now be thought of more usefully as a per-protocol
4-tuple reservation table.
Connections are assigned to connection groups base on a hash of their
4-tuple; wildcard sockets require special handling, and are members
of all connection groups. During a connection lookup, a
per-connection group lock is employed rather than the global pcbinfo
lock. By aligning connection groups with input path processing,
connection groups take on an effective CPU affinity, especially when
aligned with RSS work placement (see a forthcoming commit for
details). This eliminates cache line migration associated with
global, protocol-layer data structures in steady state TCP and UDP
processing (with the exception of protocol-layer statistics; further
commit to follow).
Elements of this approach were inspired by Willman, Rixner, and Cox's
2006 USENIX paper, "An Evaluation of Network Stack Parallelization
Strategies in Modern Operating Systems". However, there are also
significant differences: we maintain the inpcb lock, rather than using
the connection group lock for per-connection state.
Likewise, the focus of this implementation is alignment with NIC
packet distribution strategies such as RSS, rather than pure software
strategies. Despite that focus, software distribution is supported
through the parallel netisr implementation, and works well in
configurations where the number of hardware threads is greater than
the number of NIC input queues, such as in the RMI XLR threaded MIPS
architecture.
Another important difference is the continued maintenance of existing
hash tables as "reservation tables" -- these are useful both to
distinguish the resource allocation aspect of protocol name management
and the more common-case lookup aspect. In configurations where
connection tables are aligned with hardware hashes, it is desirable to
use the traditional lookup tables for loopback or encapsulated traffic
rather than take the expense of hardware hashes that are hard to
implement efficiently in software (such as RSS Toeplitz).
Connection group support is enabled by compiling "options PCBGROUP"
into your kernel configuration; for the time being, this is an
experimental feature, and hence is not enabled by default.
Subject to the limited MFCability of change dependencies in inpcb,
and its change to the inpcbinfo init function signature, this change
in principle could be merged to FreeBSD 8.x.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-06-06 12:55:02 +00:00
|
|
|
panic("%s: locking buf", __func__);
|
2018-03-21 15:54:46 +00:00
|
|
|
if (!locked)
|
|
|
|
in_pcbref(inp);
|
|
|
|
INP_GROUP_UNLOCK(pcbgroup);
|
|
|
|
if (!locked) {
|
|
|
|
if (lookupflags & INPLOOKUP_WLOCKPCB) {
|
|
|
|
INP_WLOCK(inp);
|
|
|
|
if (in_pcbrele_wlocked(inp))
|
|
|
|
return (NULL);
|
|
|
|
} else {
|
|
|
|
INP_RLOCK(inp);
|
|
|
|
if (in_pcbrele_rlocked(inp))
|
|
|
|
return (NULL);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
#ifdef INVARIANTS
|
|
|
|
if (lookupflags & INPLOOKUP_WLOCKPCB)
|
|
|
|
INP_WLOCK_ASSERT(inp);
|
|
|
|
else
|
|
|
|
INP_RLOCK_ASSERT(inp);
|
|
|
|
#endif
|
Implement a CPU-affine TCP and UDP connection lookup data structure,
struct inpcbgroup. pcbgroups, or "connection groups", supplement the
existing inpcbinfo connection hash table, which when pcbgroups are
enabled, might now be thought of more usefully as a per-protocol
4-tuple reservation table.
Connections are assigned to connection groups base on a hash of their
4-tuple; wildcard sockets require special handling, and are members
of all connection groups. During a connection lookup, a
per-connection group lock is employed rather than the global pcbinfo
lock. By aligning connection groups with input path processing,
connection groups take on an effective CPU affinity, especially when
aligned with RSS work placement (see a forthcoming commit for
details). This eliminates cache line migration associated with
global, protocol-layer data structures in steady state TCP and UDP
processing (with the exception of protocol-layer statistics; further
commit to follow).
Elements of this approach were inspired by Willman, Rixner, and Cox's
2006 USENIX paper, "An Evaluation of Network Stack Parallelization
Strategies in Modern Operating Systems". However, there are also
significant differences: we maintain the inpcb lock, rather than using
the connection group lock for per-connection state.
Likewise, the focus of this implementation is alignment with NIC
packet distribution strategies such as RSS, rather than pure software
strategies. Despite that focus, software distribution is supported
through the parallel netisr implementation, and works well in
configurations where the number of hardware threads is greater than
the number of NIC input queues, such as in the RMI XLR threaded MIPS
architecture.
Another important difference is the continued maintenance of existing
hash tables as "reservation tables" -- these are useful both to
distinguish the resource allocation aspect of protocol name management
and the more common-case lookup aspect. In configurations where
connection tables are aligned with hardware hashes, it is desirable to
use the traditional lookup tables for loopback or encapsulated traffic
rather than take the expense of hardware hashes that are hard to
implement efficiently in software (such as RSS Toeplitz).
Connection group support is enabled by compiling "options PCBGROUP"
into your kernel configuration; for the time being, this is an
experimental feature, and hence is not enabled by default.
Subject to the limited MFCability of change dependencies in inpcb,
and its change to the inpcbinfo init function signature, this change
in principle could be merged to FreeBSD 8.x.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-06-06 12:55:02 +00:00
|
|
|
return (inp);
|
|
|
|
}
|
|
|
|
#endif /* PCBGROUP */
|
|
|
|
|
1999-11-22 02:45:11 +00:00
|
|
|
/*
|
2020-05-18 22:53:12 +00:00
|
|
|
* Lookup PCB in hash list. Used in in_pcb.c as well as here.
|
1999-11-22 02:45:11 +00:00
|
|
|
*/
|
2020-05-18 22:53:12 +00:00
|
|
|
struct inpcb *
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
|
|
|
in6_pcblookup_hash_locked(struct inpcbinfo *pcbinfo, struct in6_addr *faddr,
|
|
|
|
u_int fport_arg, struct in6_addr *laddr, u_int lport_arg,
|
Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain
In order to efficiently serve web traffic on a NUMA
machine, one must avoid as many NUMA domain crossings as
possible. With SO_REUSEPORT_LB, a number of workers can share a
listen socket. However, even if a worker sets affinity to a core
or set of cores on a NUMA domain, it will receive connections
associated with all NUMA domains in the system. This will lead to
cross-domain traffic when the server writes to the socket or
calls sendfile(), and memory is allocated on the server's local
NUMA node, but transmitted on the NUMA node associated with the
TCP connection. Similarly, when the server reads from the socket,
he will likely be reading memory allocated on the NUMA domain
associated with the TCP connection.
This change provides a new socket ioctl, TCP_REUSPORT_LB_NUMA. A
server can now tell the kernel to filter traffic so that only
incoming connections associated with the desired NUMA domain are
given to the server. (Of course, in the case where there are no
servers sharing the listen socket on some domain, then as a
fallback, traffic will be hashed as normal to all servers sharing
the listen socket regardless of domain). This allows a server to
deal only with traffic that is local to its NUMA domain, and
avoids cross-domain traffic in most cases.
This patch, and a corresponding small patch to nginx to use
TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted
https media content from dual-socket Xeons with only 13% (as
measured by pcm.x) cross domain traffic on the memory controller.
Reviewed by: jhb, bz (earlier version), bcr (man page)
Tested by: gonzo
Sponsored by: Netfix
Differential Revision: https://reviews.freebsd.org/D21636
2020-12-19 22:04:46 +00:00
|
|
|
int lookupflags, struct ifnet *ifp, uint8_t numa_domain)
|
1999-11-22 02:45:11 +00:00
|
|
|
{
|
|
|
|
struct inpcbhead *head;
|
MFp4:
Bring in updated jail support from bz_jail branch.
This enhances the current jail implementation to permit multiple
addresses per jail. In addtion to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..
SCTP support was updated and supports IPv6 in jails as well.
Cpuset support permits jails to be bound to specific processor
sets after creation.
Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.
DDB 'show jails' command was added to aid debugging.
Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.
Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.
Bump __FreeBSD_version for the afore mentioned and in kernel changes.
Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.
Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
|
|
|
struct inpcb *inp, *tmpinp;
|
1999-11-22 02:45:11 +00:00
|
|
|
u_short fport = fport_arg, lport = lport_arg;
|
2001-06-11 12:39:29 +00:00
|
|
|
|
2011-05-23 15:23:18 +00:00
|
|
|
KASSERT((lookupflags & ~(INPLOOKUP_WILDCARD)) == 0,
|
|
|
|
("%s: invalid lookup flags %d", __func__, lookupflags));
|
|
|
|
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
|
|
|
INP_HASH_LOCK_ASSERT(pcbinfo);
|
2006-04-25 12:09:58 +00:00
|
|
|
|
1999-11-22 02:45:11 +00:00
|
|
|
/*
|
|
|
|
* First look for an exact match.
|
|
|
|
*/
|
MFp4:
Bring in updated jail support from bz_jail branch.
This enhances the current jail implementation to permit multiple
addresses per jail. In addtion to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..
SCTP support was updated and supports IPv6 in jails as well.
Cpuset support permits jails to be bound to specific processor
sets after creation.
Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.
DDB 'show jails' command was added to aid debugging.
Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.
Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.
Bump __FreeBSD_version for the afore mentioned and in kernel changes.
Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.
Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
|
|
|
tmpinp = NULL;
|
2014-09-10 12:35:42 +00:00
|
|
|
head = &pcbinfo->ipi_hashbase[INP_PCBHASH(
|
|
|
|
INP6_PCBHASHKEY(faddr), lport, fport, pcbinfo->ipi_hashmask)];
|
2018-06-12 22:18:20 +00:00
|
|
|
CK_LIST_FOREACH(inp, head, inp_hash) {
|
MFp4:
Bring in updated jail support from bz_jail branch.
This enhances the current jail implementation to permit multiple
addresses per jail. In addtion to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..
SCTP support was updated and supports IPv6 in jails as well.
Cpuset support permits jails to be bound to specific processor
sets after creation.
Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.
DDB 'show jails' command was added to aid debugging.
Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.
Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.
Bump __FreeBSD_version for the afore mentioned and in kernel changes.
Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.
Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
|
|
|
/* XXX inp locking */
|
1999-12-21 11:14:12 +00:00
|
|
|
if ((inp->inp_vflag & INP_IPV6) == 0)
|
1999-11-22 02:45:11 +00:00
|
|
|
continue;
|
|
|
|
if (IN6_ARE_ADDR_EQUAL(&inp->in6p_faddr, faddr) &&
|
|
|
|
IN6_ARE_ADDR_EQUAL(&inp->in6p_laddr, laddr) &&
|
|
|
|
inp->inp_fport == fport &&
|
|
|
|
inp->inp_lport == lport) {
|
|
|
|
/*
|
MFp4:
Bring in updated jail support from bz_jail branch.
This enhances the current jail implementation to permit multiple
addresses per jail. In addtion to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..
SCTP support was updated and supports IPv6 in jails as well.
Cpuset support permits jails to be bound to specific processor
sets after creation.
Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.
DDB 'show jails' command was added to aid debugging.
Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.
Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.
Bump __FreeBSD_version for the afore mentioned and in kernel changes.
Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.
Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
|
|
|
* XXX We should be able to directly return
|
|
|
|
* the inp here, without any checks.
|
|
|
|
* Well unless both bound with SO_REUSEPORT?
|
1999-11-22 02:45:11 +00:00
|
|
|
*/
|
2009-05-27 14:11:23 +00:00
|
|
|
if (prison_flag(inp->inp_cred, PR_IP6))
|
MFp4:
Bring in updated jail support from bz_jail branch.
This enhances the current jail implementation to permit multiple
addresses per jail. In addtion to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..
SCTP support was updated and supports IPv6 in jails as well.
Cpuset support permits jails to be bound to specific processor
sets after creation.
Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.
DDB 'show jails' command was added to aid debugging.
Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.
Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.
Bump __FreeBSD_version for the afore mentioned and in kernel changes.
Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.
Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
|
|
|
return (inp);
|
|
|
|
if (tmpinp == NULL)
|
|
|
|
tmpinp = inp;
|
1999-11-22 02:45:11 +00:00
|
|
|
}
|
|
|
|
}
|
MFp4:
Bring in updated jail support from bz_jail branch.
This enhances the current jail implementation to permit multiple
addresses per jail. In addtion to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..
SCTP support was updated and supports IPv6 in jails as well.
Cpuset support permits jails to be bound to specific processor
sets after creation.
Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.
DDB 'show jails' command was added to aid debugging.
Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.
Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.
Bump __FreeBSD_version for the afore mentioned and in kernel changes.
Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.
Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
|
|
|
if (tmpinp != NULL)
|
|
|
|
return (tmpinp);
|
1999-11-22 02:45:11 +00:00
|
|
|
|
2018-06-06 15:45:57 +00:00
|
|
|
/*
|
|
|
|
* Then look in lb group (for wildcard match).
|
|
|
|
*/
|
2018-11-01 15:52:49 +00:00
|
|
|
if ((lookupflags & INPLOOKUP_WILDCARD) != 0) {
|
2018-06-06 15:45:57 +00:00
|
|
|
inp = in6_pcblookup_lbgroup(pcbinfo, laddr, lport, faddr,
|
Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain
In order to efficiently serve web traffic on a NUMA
machine, one must avoid as many NUMA domain crossings as
possible. With SO_REUSEPORT_LB, a number of workers can share a
listen socket. However, even if a worker sets affinity to a core
or set of cores on a NUMA domain, it will receive connections
associated with all NUMA domains in the system. This will lead to
cross-domain traffic when the server writes to the socket or
calls sendfile(), and memory is allocated on the server's local
NUMA node, but transmitted on the NUMA node associated with the
TCP connection. Similarly, when the server reads from the socket,
he will likely be reading memory allocated on the NUMA domain
associated with the TCP connection.
This change provides a new socket ioctl, TCP_REUSPORT_LB_NUMA. A
server can now tell the kernel to filter traffic so that only
incoming connections associated with the desired NUMA domain are
given to the server. (Of course, in the case where there are no
servers sharing the listen socket on some domain, then as a
fallback, traffic will be hashed as normal to all servers sharing
the listen socket regardless of domain). This allows a server to
deal only with traffic that is local to its NUMA domain, and
avoids cross-domain traffic in most cases.
This patch, and a corresponding small patch to nginx to use
TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted
https media content from dual-socket Xeons with only 13% (as
measured by pcm.x) cross domain traffic on the memory controller.
Reviewed by: jhb, bz (earlier version), bcr (man page)
Tested by: gonzo
Sponsored by: Netfix
Differential Revision: https://reviews.freebsd.org/D21636
2020-12-19 22:04:46 +00:00
|
|
|
fport, lookupflags, numa_domain);
|
2018-11-01 15:52:49 +00:00
|
|
|
if (inp != NULL)
|
2018-06-06 15:45:57 +00:00
|
|
|
return (inp);
|
|
|
|
}
|
|
|
|
|
MFp4:
Bring in updated jail support from bz_jail branch.
This enhances the current jail implementation to permit multiple
addresses per jail. In addtion to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..
SCTP support was updated and supports IPv6 in jails as well.
Cpuset support permits jails to be bound to specific processor
sets after creation.
Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.
DDB 'show jails' command was added to aid debugging.
Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.
Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.
Bump __FreeBSD_version for the afore mentioned and in kernel changes.
Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.
Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
|
|
|
/*
|
|
|
|
* Then look for a wildcard match, if requested.
|
|
|
|
*/
|
2011-05-23 15:23:18 +00:00
|
|
|
if ((lookupflags & INPLOOKUP_WILDCARD) != 0) {
|
MFp4:
Bring in updated jail support from bz_jail branch.
This enhances the current jail implementation to permit multiple
addresses per jail. In addtion to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..
SCTP support was updated and supports IPv6 in jails as well.
Cpuset support permits jails to be bound to specific processor
sets after creation.
Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.
DDB 'show jails' command was added to aid debugging.
Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.
Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.
Bump __FreeBSD_version for the afore mentioned and in kernel changes.
Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.
Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
|
|
|
struct inpcb *local_wild = NULL, *local_exact = NULL;
|
|
|
|
struct inpcb *jail_wild = NULL;
|
|
|
|
int injail;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Order of socket selection - we always prefer jails.
|
|
|
|
* 1. jailed, non-wild.
|
|
|
|
* 2. jailed, wild.
|
|
|
|
* 3. non-jailed, non-wild.
|
|
|
|
* 4. non-jailed, wild.
|
|
|
|
*/
|
2014-09-10 12:35:42 +00:00
|
|
|
head = &pcbinfo->ipi_hashbase[INP_PCBHASH(
|
|
|
|
INP6_PCBHASHKEY(&in6addr_any), lport, 0,
|
|
|
|
pcbinfo->ipi_hashmask)];
|
2018-06-12 22:18:20 +00:00
|
|
|
CK_LIST_FOREACH(inp, head, inp_hash) {
|
MFp4:
Bring in updated jail support from bz_jail branch.
This enhances the current jail implementation to permit multiple
addresses per jail. In addtion to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..
SCTP support was updated and supports IPv6 in jails as well.
Cpuset support permits jails to be bound to specific processor
sets after creation.
Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.
DDB 'show jails' command was added to aid debugging.
Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.
Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.
Bump __FreeBSD_version for the afore mentioned and in kernel changes.
Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.
Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
|
|
|
/* XXX inp locking */
|
1999-12-21 11:14:12 +00:00
|
|
|
if ((inp->inp_vflag & INP_IPV6) == 0)
|
1999-11-22 02:45:11 +00:00
|
|
|
continue;
|
MFp4:
Bring in updated jail support from bz_jail branch.
This enhances the current jail implementation to permit multiple
addresses per jail. In addtion to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..
SCTP support was updated and supports IPv6 in jails as well.
Cpuset support permits jails to be bound to specific processor
sets after creation.
Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.
DDB 'show jails' command was added to aid debugging.
Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.
Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.
Bump __FreeBSD_version for the afore mentioned and in kernel changes.
Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.
Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
|
|
|
|
|
|
|
if (!IN6_IS_ADDR_UNSPECIFIED(&inp->in6p_faddr) ||
|
|
|
|
inp->inp_lport != lport) {
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2009-05-27 14:11:23 +00:00
|
|
|
injail = prison_flag(inp->inp_cred, PR_IP6);
|
MFp4:
Bring in updated jail support from bz_jail branch.
This enhances the current jail implementation to permit multiple
addresses per jail. In addtion to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..
SCTP support was updated and supports IPv6 in jails as well.
Cpuset support permits jails to be bound to specific processor
sets after creation.
Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.
DDB 'show jails' command was added to aid debugging.
Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.
Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.
Bump __FreeBSD_version for the afore mentioned and in kernel changes.
Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.
Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
|
|
|
if (injail) {
|
2009-02-05 14:06:09 +00:00
|
|
|
if (prison_check_ip6(inp->inp_cred,
|
|
|
|
laddr) != 0)
|
1999-11-22 02:45:11 +00:00
|
|
|
continue;
|
MFp4:
Bring in updated jail support from bz_jail branch.
This enhances the current jail implementation to permit multiple
addresses per jail. In addtion to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..
SCTP support was updated and supports IPv6 in jails as well.
Cpuset support permits jails to be bound to specific processor
sets after creation.
Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.
DDB 'show jails' command was added to aid debugging.
Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.
Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.
Bump __FreeBSD_version for the afore mentioned and in kernel changes.
Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.
Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
|
|
|
} else {
|
|
|
|
if (local_exact != NULL)
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (IN6_ARE_ADDR_EQUAL(&inp->in6p_laddr, laddr)) {
|
|
|
|
if (injail)
|
1999-11-22 02:45:11 +00:00
|
|
|
return (inp);
|
MFp4:
Bring in updated jail support from bz_jail branch.
This enhances the current jail implementation to permit multiple
addresses per jail. In addtion to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..
SCTP support was updated and supports IPv6 in jails as well.
Cpuset support permits jails to be bound to specific processor
sets after creation.
Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.
DDB 'show jails' command was added to aid debugging.
Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.
Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.
Bump __FreeBSD_version for the afore mentioned and in kernel changes.
Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.
Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
|
|
|
else
|
|
|
|
local_exact = inp;
|
|
|
|
} else if (IN6_IS_ADDR_UNSPECIFIED(&inp->in6p_laddr)) {
|
|
|
|
if (injail)
|
|
|
|
jail_wild = inp;
|
|
|
|
else
|
1999-11-22 02:45:11 +00:00
|
|
|
local_wild = inp;
|
|
|
|
}
|
MFp4:
Bring in updated jail support from bz_jail branch.
This enhances the current jail implementation to permit multiple
addresses per jail. In addtion to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..
SCTP support was updated and supports IPv6 in jails as well.
Cpuset support permits jails to be bound to specific processor
sets after creation.
Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.
DDB 'show jails' command was added to aid debugging.
Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.
Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.
Bump __FreeBSD_version for the afore mentioned and in kernel changes.
Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.
Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
|
|
|
} /* LIST_FOREACH */
|
|
|
|
|
|
|
|
if (jail_wild != NULL)
|
|
|
|
return (jail_wild);
|
|
|
|
if (local_exact != NULL)
|
|
|
|
return (local_exact);
|
|
|
|
if (local_wild != NULL)
|
|
|
|
return (local_wild);
|
2011-05-23 15:23:18 +00:00
|
|
|
} /* if ((lookupflags & INPLOOKUP_WILDCARD) != 0) */
|
1999-11-22 02:45:11 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Not found.
|
|
|
|
*/
|
|
|
|
return (NULL);
|
|
|
|
}
|
|
|
|
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
|
|
|
/*
|
|
|
|
* Lookup PCB in hash list, using pcbinfo tables. This variation locks the
|
|
|
|
* hash list lock, and will return the inpcb locked (i.e., requires
|
|
|
|
* INPLOOKUP_LOCKPCB).
|
|
|
|
*/
|
|
|
|
static struct inpcb *
|
|
|
|
in6_pcblookup_hash(struct inpcbinfo *pcbinfo, struct in6_addr *faddr,
|
|
|
|
u_int fport, struct in6_addr *laddr, u_int lport, int lookupflags,
|
Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain
In order to efficiently serve web traffic on a NUMA
machine, one must avoid as many NUMA domain crossings as
possible. With SO_REUSEPORT_LB, a number of workers can share a
listen socket. However, even if a worker sets affinity to a core
or set of cores on a NUMA domain, it will receive connections
associated with all NUMA domains in the system. This will lead to
cross-domain traffic when the server writes to the socket or
calls sendfile(), and memory is allocated on the server's local
NUMA node, but transmitted on the NUMA node associated with the
TCP connection. Similarly, when the server reads from the socket,
he will likely be reading memory allocated on the NUMA domain
associated with the TCP connection.
This change provides a new socket ioctl, TCP_REUSPORT_LB_NUMA. A
server can now tell the kernel to filter traffic so that only
incoming connections associated with the desired NUMA domain are
given to the server. (Of course, in the case where there are no
servers sharing the listen socket on some domain, then as a
fallback, traffic will be hashed as normal to all servers sharing
the listen socket regardless of domain). This allows a server to
deal only with traffic that is local to its NUMA domain, and
avoids cross-domain traffic in most cases.
This patch, and a corresponding small patch to nginx to use
TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted
https media content from dual-socket Xeons with only 13% (as
measured by pcm.x) cross domain traffic on the memory controller.
Reviewed by: jhb, bz (earlier version), bcr (man page)
Tested by: gonzo
Sponsored by: Netfix
Differential Revision: https://reviews.freebsd.org/D21636
2020-12-19 22:04:46 +00:00
|
|
|
struct ifnet *ifp, uint8_t numa_domain)
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
|
|
|
{
|
|
|
|
struct inpcb *inp;
|
|
|
|
|
|
|
|
inp = in6_pcblookup_hash_locked(pcbinfo, faddr, fport, laddr, lport,
|
2021-03-19 02:06:13 +00:00
|
|
|
lookupflags & INPLOOKUP_WILDCARD, ifp, numa_domain);
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
|
|
|
if (inp != NULL) {
|
2018-07-01 01:01:59 +00:00
|
|
|
if (lookupflags & INPLOOKUP_WLOCKPCB) {
|
|
|
|
INP_WLOCK(inp);
|
|
|
|
} else if (lookupflags & INPLOOKUP_RLOCKPCB) {
|
|
|
|
INP_RLOCK(inp);
|
|
|
|
} else
|
|
|
|
panic("%s: locking bug", __func__);
|
2021-03-19 02:06:13 +00:00
|
|
|
if (__predict_false(inp->inp_flags2 & INP_FREED)) {
|
|
|
|
INP_UNLOCK(inp);
|
|
|
|
inp = NULL;
|
2018-07-01 01:01:59 +00:00
|
|
|
}
|
|
|
|
}
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
|
|
|
return (inp);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
Add _mbuf() variants of various inpcb-related interfaces, including lookup,
hash install, etc. For now, these are arguments are unused, but as we add
RSS support, we will want to use hashes extracted from mbufs, rather than
manually calculated hashes of header fields, due to the expensive of the
software version of Toeplitz (and similar hashes).
Add notes that it would be nice to be able to pass mbufs into lookup
routines in pf(4), optimising firewall lookup in the same way, but the
code structure there doesn't facilitate that currently.
(In principle there is no reason this couldn't be MFCed -- the change
extends rather than modifies the KBI. However, it won't be useful without
other previous possibly less MFCable changes.)
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-06-04 16:33:06 +00:00
|
|
|
* Public inpcb lookup routines, accepting a 4-tuple, and optionally, an mbuf
|
|
|
|
* from which a pre-calculated hash value may be extracted.
|
Implement a CPU-affine TCP and UDP connection lookup data structure,
struct inpcbgroup. pcbgroups, or "connection groups", supplement the
existing inpcbinfo connection hash table, which when pcbgroups are
enabled, might now be thought of more usefully as a per-protocol
4-tuple reservation table.
Connections are assigned to connection groups base on a hash of their
4-tuple; wildcard sockets require special handling, and are members
of all connection groups. During a connection lookup, a
per-connection group lock is employed rather than the global pcbinfo
lock. By aligning connection groups with input path processing,
connection groups take on an effective CPU affinity, especially when
aligned with RSS work placement (see a forthcoming commit for
details). This eliminates cache line migration associated with
global, protocol-layer data structures in steady state TCP and UDP
processing (with the exception of protocol-layer statistics; further
commit to follow).
Elements of this approach were inspired by Willman, Rixner, and Cox's
2006 USENIX paper, "An Evaluation of Network Stack Parallelization
Strategies in Modern Operating Systems". However, there are also
significant differences: we maintain the inpcb lock, rather than using
the connection group lock for per-connection state.
Likewise, the focus of this implementation is alignment with NIC
packet distribution strategies such as RSS, rather than pure software
strategies. Despite that focus, software distribution is supported
through the parallel netisr implementation, and works well in
configurations where the number of hardware threads is greater than
the number of NIC input queues, such as in the RMI XLR threaded MIPS
architecture.
Another important difference is the continued maintenance of existing
hash tables as "reservation tables" -- these are useful both to
distinguish the resource allocation aspect of protocol name management
and the more common-case lookup aspect. In configurations where
connection tables are aligned with hardware hashes, it is desirable to
use the traditional lookup tables for loopback or encapsulated traffic
rather than take the expense of hardware hashes that are hard to
implement efficiently in software (such as RSS Toeplitz).
Connection group support is enabled by compiling "options PCBGROUP"
into your kernel configuration; for the time being, this is an
experimental feature, and hence is not enabled by default.
Subject to the limited MFCability of change dependencies in inpcb,
and its change to the inpcbinfo init function signature, this change
in principle could be merged to FreeBSD 8.x.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-06-06 12:55:02 +00:00
|
|
|
*
|
|
|
|
* Possibly more of this logic should be in in6_pcbgroup.c.
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
|
|
|
*/
|
|
|
|
struct inpcb *
|
|
|
|
in6_pcblookup(struct inpcbinfo *pcbinfo, struct in6_addr *faddr, u_int fport,
|
|
|
|
struct in6_addr *laddr, u_int lport, int lookupflags, struct ifnet *ifp)
|
|
|
|
{
|
Several years after initial development, merge prototype support for
linking NIC Receive Side Scaling (RSS) to the network stack's
connection-group implementation. This prototype (and derived patches)
are in use at Juniper and several other FreeBSD-using companies, so
despite some reservations about its maturity, merge the patch to the
base tree so that it can be iteratively refined in collaboration rather
than maintained as a set of gradually diverging patch sets.
(1) Merge a software implementation of the Toeplitz hash specified in
RSS implemented by David Malone. This is used to allow suitable
pcbgroup placement of connections before the first packet is
received from the NIC. Software hashing is generally avoided,
however, due to high cost of the hash on general-purpose CPUs.
(2) In in_rss.c, maintain authoritative versions of RSS state intended
to be pushed to each NIC, including keying material, hash
algorithm/ configuration, and buckets. Provide software-facing
interfaces to hash 2- and 4-tuples for IPv4 and IPv6 using both
the RSS standardised Toeplitz and a 'naive' variation with a hash
efficient in software but with poor distribution properties.
Implement rss_m2cpuid()to be used by netisr and other load
balancing code to look up the CPU on which an mbuf should be
processed.
(3) In the Ethernet link layer, allow netisr distribution using RSS as
a source of policy as an alternative to source ordering; continue
to default to direct dispatch (i.e., don't try and requeue packets
for processing on the 'right' CPU if they arrive in a directly
dispatchable context).
(4) Allow RSS to control tuning of connection groups in order to align
groups with RSS buckets. If a packet arrives on a protocol using
connection groups, and contains a suitable hardware-generated
hash, use that hash value to select the connection group for pcb
lookup for both IPv4 and IPv6. If no hardware-generated Toeplitz
hash is available, we fall back on regular PCB lookup risking
contention rather than pay the cost of Toeplitz in software --
this is a less scalable but, at my last measurement, faster
approach. As core counts go up, we may want to revise this
strategy despite CPU overhead.
Where device drivers suitably configure NICs, and connection groups /
RSS are enabled, this should avoid both lock and line contention during
connection lookup for TCP. This commit does not modify any device
drivers to tune device RSS configuration to the global RSS
configuration; patches are in circulation to do this for at least
Chelsio T3 and Intel 1G/10G drivers. Currently, the KPI for device
drivers is not particularly robust, nor aware of more advanced features
such as runtime reconfiguration/rebalancing. This will hopefully prove
a useful starting point for refinement.
No MFC is scheduled as we will first want to nail down a more mature
and maintainable KPI/KBI for device drivers.
Sponsored by: Juniper Networks (original work)
Sponsored by: EMC/Isilon (patch update and merge)
2014-03-15 00:57:50 +00:00
|
|
|
#if defined(PCBGROUP) && !defined(RSS)
|
Implement a CPU-affine TCP and UDP connection lookup data structure,
struct inpcbgroup. pcbgroups, or "connection groups", supplement the
existing inpcbinfo connection hash table, which when pcbgroups are
enabled, might now be thought of more usefully as a per-protocol
4-tuple reservation table.
Connections are assigned to connection groups base on a hash of their
4-tuple; wildcard sockets require special handling, and are members
of all connection groups. During a connection lookup, a
per-connection group lock is employed rather than the global pcbinfo
lock. By aligning connection groups with input path processing,
connection groups take on an effective CPU affinity, especially when
aligned with RSS work placement (see a forthcoming commit for
details). This eliminates cache line migration associated with
global, protocol-layer data structures in steady state TCP and UDP
processing (with the exception of protocol-layer statistics; further
commit to follow).
Elements of this approach were inspired by Willman, Rixner, and Cox's
2006 USENIX paper, "An Evaluation of Network Stack Parallelization
Strategies in Modern Operating Systems". However, there are also
significant differences: we maintain the inpcb lock, rather than using
the connection group lock for per-connection state.
Likewise, the focus of this implementation is alignment with NIC
packet distribution strategies such as RSS, rather than pure software
strategies. Despite that focus, software distribution is supported
through the parallel netisr implementation, and works well in
configurations where the number of hardware threads is greater than
the number of NIC input queues, such as in the RMI XLR threaded MIPS
architecture.
Another important difference is the continued maintenance of existing
hash tables as "reservation tables" -- these are useful both to
distinguish the resource allocation aspect of protocol name management
and the more common-case lookup aspect. In configurations where
connection tables are aligned with hardware hashes, it is desirable to
use the traditional lookup tables for loopback or encapsulated traffic
rather than take the expense of hardware hashes that are hard to
implement efficiently in software (such as RSS Toeplitz).
Connection group support is enabled by compiling "options PCBGROUP"
into your kernel configuration; for the time being, this is an
experimental feature, and hence is not enabled by default.
Subject to the limited MFCability of change dependencies in inpcb,
and its change to the inpcbinfo init function signature, this change
in principle could be merged to FreeBSD 8.x.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-06-06 12:55:02 +00:00
|
|
|
struct inpcbgroup *pcbgroup;
|
|
|
|
#endif
|
|
|
|
|
Add _mbuf() variants of various inpcb-related interfaces, including lookup,
hash install, etc. For now, these are arguments are unused, but as we add
RSS support, we will want to use hashes extracted from mbufs, rather than
manually calculated hashes of header fields, due to the expensive of the
software version of Toeplitz (and similar hashes).
Add notes that it would be nice to be able to pass mbufs into lookup
routines in pf(4), optimising firewall lookup in the same way, but the
code structure there doesn't facilitate that currently.
(In principle there is no reason this couldn't be MFCed -- the change
extends rather than modifies the KBI. However, it won't be useful without
other previous possibly less MFCable changes.)
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-06-04 16:33:06 +00:00
|
|
|
KASSERT((lookupflags & ~INPLOOKUP_MASK) == 0,
|
|
|
|
("%s: invalid lookup flags %d", __func__, lookupflags));
|
2021-04-16 21:56:14 +00:00
|
|
|
KASSERT((lookupflags & (INPLOOKUP_RLOCKPCB | INPLOOKUP_WLOCKPCB)) != 0,
|
|
|
|
("%s: LOCKPCB not set", __func__));
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
|
|
|
|
Several years after initial development, merge prototype support for
linking NIC Receive Side Scaling (RSS) to the network stack's
connection-group implementation. This prototype (and derived patches)
are in use at Juniper and several other FreeBSD-using companies, so
despite some reservations about its maturity, merge the patch to the
base tree so that it can be iteratively refined in collaboration rather
than maintained as a set of gradually diverging patch sets.
(1) Merge a software implementation of the Toeplitz hash specified in
RSS implemented by David Malone. This is used to allow suitable
pcbgroup placement of connections before the first packet is
received from the NIC. Software hashing is generally avoided,
however, due to high cost of the hash on general-purpose CPUs.
(2) In in_rss.c, maintain authoritative versions of RSS state intended
to be pushed to each NIC, including keying material, hash
algorithm/ configuration, and buckets. Provide software-facing
interfaces to hash 2- and 4-tuples for IPv4 and IPv6 using both
the RSS standardised Toeplitz and a 'naive' variation with a hash
efficient in software but with poor distribution properties.
Implement rss_m2cpuid()to be used by netisr and other load
balancing code to look up the CPU on which an mbuf should be
processed.
(3) In the Ethernet link layer, allow netisr distribution using RSS as
a source of policy as an alternative to source ordering; continue
to default to direct dispatch (i.e., don't try and requeue packets
for processing on the 'right' CPU if they arrive in a directly
dispatchable context).
(4) Allow RSS to control tuning of connection groups in order to align
groups with RSS buckets. If a packet arrives on a protocol using
connection groups, and contains a suitable hardware-generated
hash, use that hash value to select the connection group for pcb
lookup for both IPv4 and IPv6. If no hardware-generated Toeplitz
hash is available, we fall back on regular PCB lookup risking
contention rather than pay the cost of Toeplitz in software --
this is a less scalable but, at my last measurement, faster
approach. As core counts go up, we may want to revise this
strategy despite CPU overhead.
Where device drivers suitably configure NICs, and connection groups /
RSS are enabled, this should avoid both lock and line contention during
connection lookup for TCP. This commit does not modify any device
drivers to tune device RSS configuration to the global RSS
configuration; patches are in circulation to do this for at least
Chelsio T3 and Intel 1G/10G drivers. Currently, the KPI for device
drivers is not particularly robust, nor aware of more advanced features
such as runtime reconfiguration/rebalancing. This will hopefully prove
a useful starting point for refinement.
No MFC is scheduled as we will first want to nail down a more mature
and maintainable KPI/KBI for device drivers.
Sponsored by: Juniper Networks (original work)
Sponsored by: EMC/Isilon (patch update and merge)
2014-03-15 00:57:50 +00:00
|
|
|
/*
|
|
|
|
* When not using RSS, use connection groups in preference to the
|
|
|
|
* reservation table when looking up 4-tuples. When using RSS, just
|
|
|
|
* use the reservation table, due to the cost of the Toeplitz hash
|
|
|
|
* in software.
|
|
|
|
*
|
|
|
|
* XXXRW: This policy belongs in the pcbgroup code, as in principle
|
|
|
|
* we could be doing RSS with a non-Toeplitz hash that is affordable
|
|
|
|
* in software.
|
|
|
|
*/
|
|
|
|
#if defined(PCBGROUP) && !defined(RSS)
|
Implement a CPU-affine TCP and UDP connection lookup data structure,
struct inpcbgroup. pcbgroups, or "connection groups", supplement the
existing inpcbinfo connection hash table, which when pcbgroups are
enabled, might now be thought of more usefully as a per-protocol
4-tuple reservation table.
Connections are assigned to connection groups base on a hash of their
4-tuple; wildcard sockets require special handling, and are members
of all connection groups. During a connection lookup, a
per-connection group lock is employed rather than the global pcbinfo
lock. By aligning connection groups with input path processing,
connection groups take on an effective CPU affinity, especially when
aligned with RSS work placement (see a forthcoming commit for
details). This eliminates cache line migration associated with
global, protocol-layer data structures in steady state TCP and UDP
processing (with the exception of protocol-layer statistics; further
commit to follow).
Elements of this approach were inspired by Willman, Rixner, and Cox's
2006 USENIX paper, "An Evaluation of Network Stack Parallelization
Strategies in Modern Operating Systems". However, there are also
significant differences: we maintain the inpcb lock, rather than using
the connection group lock for per-connection state.
Likewise, the focus of this implementation is alignment with NIC
packet distribution strategies such as RSS, rather than pure software
strategies. Despite that focus, software distribution is supported
through the parallel netisr implementation, and works well in
configurations where the number of hardware threads is greater than
the number of NIC input queues, such as in the RMI XLR threaded MIPS
architecture.
Another important difference is the continued maintenance of existing
hash tables as "reservation tables" -- these are useful both to
distinguish the resource allocation aspect of protocol name management
and the more common-case lookup aspect. In configurations where
connection tables are aligned with hardware hashes, it is desirable to
use the traditional lookup tables for loopback or encapsulated traffic
rather than take the expense of hardware hashes that are hard to
implement efficiently in software (such as RSS Toeplitz).
Connection group support is enabled by compiling "options PCBGROUP"
into your kernel configuration; for the time being, this is an
experimental feature, and hence is not enabled by default.
Subject to the limited MFCability of change dependencies in inpcb,
and its change to the inpcbinfo init function signature, this change
in principle could be merged to FreeBSD 8.x.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-06-06 12:55:02 +00:00
|
|
|
if (in_pcbgroup_enabled(pcbinfo)) {
|
|
|
|
pcbgroup = in6_pcbgroup_bytuple(pcbinfo, laddr, lport, faddr,
|
|
|
|
fport);
|
|
|
|
return (in6_pcblookup_group(pcbinfo, pcbgroup, faddr, fport,
|
|
|
|
laddr, lport, lookupflags, ifp));
|
|
|
|
}
|
|
|
|
#endif
|
Add _mbuf() variants of various inpcb-related interfaces, including lookup,
hash install, etc. For now, these are arguments are unused, but as we add
RSS support, we will want to use hashes extracted from mbufs, rather than
manually calculated hashes of header fields, due to the expensive of the
software version of Toeplitz (and similar hashes).
Add notes that it would be nice to be able to pass mbufs into lookup
routines in pf(4), optimising firewall lookup in the same way, but the
code structure there doesn't facilitate that currently.
(In principle there is no reason this couldn't be MFCed -- the change
extends rather than modifies the KBI. However, it won't be useful without
other previous possibly less MFCable changes.)
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-06-04 16:33:06 +00:00
|
|
|
return (in6_pcblookup_hash(pcbinfo, faddr, fport, laddr, lport,
|
Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain
In order to efficiently serve web traffic on a NUMA
machine, one must avoid as many NUMA domain crossings as
possible. With SO_REUSEPORT_LB, a number of workers can share a
listen socket. However, even if a worker sets affinity to a core
or set of cores on a NUMA domain, it will receive connections
associated with all NUMA domains in the system. This will lead to
cross-domain traffic when the server writes to the socket or
calls sendfile(), and memory is allocated on the server's local
NUMA node, but transmitted on the NUMA node associated with the
TCP connection. Similarly, when the server reads from the socket,
he will likely be reading memory allocated on the NUMA domain
associated with the TCP connection.
This change provides a new socket ioctl, TCP_REUSPORT_LB_NUMA. A
server can now tell the kernel to filter traffic so that only
incoming connections associated with the desired NUMA domain are
given to the server. (Of course, in the case where there are no
servers sharing the listen socket on some domain, then as a
fallback, traffic will be hashed as normal to all servers sharing
the listen socket regardless of domain). This allows a server to
deal only with traffic that is local to its NUMA domain, and
avoids cross-domain traffic in most cases.
This patch, and a corresponding small patch to nginx to use
TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted
https media content from dual-socket Xeons with only 13% (as
measured by pcm.x) cross domain traffic on the memory controller.
Reviewed by: jhb, bz (earlier version), bcr (man page)
Tested by: gonzo
Sponsored by: Netfix
Differential Revision: https://reviews.freebsd.org/D21636
2020-12-19 22:04:46 +00:00
|
|
|
lookupflags, ifp, M_NODOM));
|
Add _mbuf() variants of various inpcb-related interfaces, including lookup,
hash install, etc. For now, these are arguments are unused, but as we add
RSS support, we will want to use hashes extracted from mbufs, rather than
manually calculated hashes of header fields, due to the expensive of the
software version of Toeplitz (and similar hashes).
Add notes that it would be nice to be able to pass mbufs into lookup
routines in pf(4), optimising firewall lookup in the same way, but the
code structure there doesn't facilitate that currently.
(In principle there is no reason this couldn't be MFCed -- the change
extends rather than modifies the KBI. However, it won't be useful without
other previous possibly less MFCable changes.)
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-06-04 16:33:06 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
struct inpcb *
|
|
|
|
in6_pcblookup_mbuf(struct inpcbinfo *pcbinfo, struct in6_addr *faddr,
|
|
|
|
u_int fport, struct in6_addr *laddr, u_int lport, int lookupflags,
|
|
|
|
struct ifnet *ifp, struct mbuf *m)
|
|
|
|
{
|
Implement a CPU-affine TCP and UDP connection lookup data structure,
struct inpcbgroup. pcbgroups, or "connection groups", supplement the
existing inpcbinfo connection hash table, which when pcbgroups are
enabled, might now be thought of more usefully as a per-protocol
4-tuple reservation table.
Connections are assigned to connection groups base on a hash of their
4-tuple; wildcard sockets require special handling, and are members
of all connection groups. During a connection lookup, a
per-connection group lock is employed rather than the global pcbinfo
lock. By aligning connection groups with input path processing,
connection groups take on an effective CPU affinity, especially when
aligned with RSS work placement (see a forthcoming commit for
details). This eliminates cache line migration associated with
global, protocol-layer data structures in steady state TCP and UDP
processing (with the exception of protocol-layer statistics; further
commit to follow).
Elements of this approach were inspired by Willman, Rixner, and Cox's
2006 USENIX paper, "An Evaluation of Network Stack Parallelization
Strategies in Modern Operating Systems". However, there are also
significant differences: we maintain the inpcb lock, rather than using
the connection group lock for per-connection state.
Likewise, the focus of this implementation is alignment with NIC
packet distribution strategies such as RSS, rather than pure software
strategies. Despite that focus, software distribution is supported
through the parallel netisr implementation, and works well in
configurations where the number of hardware threads is greater than
the number of NIC input queues, such as in the RMI XLR threaded MIPS
architecture.
Another important difference is the continued maintenance of existing
hash tables as "reservation tables" -- these are useful both to
distinguish the resource allocation aspect of protocol name management
and the more common-case lookup aspect. In configurations where
connection tables are aligned with hardware hashes, it is desirable to
use the traditional lookup tables for loopback or encapsulated traffic
rather than take the expense of hardware hashes that are hard to
implement efficiently in software (such as RSS Toeplitz).
Connection group support is enabled by compiling "options PCBGROUP"
into your kernel configuration; for the time being, this is an
experimental feature, and hence is not enabled by default.
Subject to the limited MFCability of change dependencies in inpcb,
and its change to the inpcbinfo init function signature, this change
in principle could be merged to FreeBSD 8.x.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-06-06 12:55:02 +00:00
|
|
|
#ifdef PCBGROUP
|
|
|
|
struct inpcbgroup *pcbgroup;
|
|
|
|
#endif
|
|
|
|
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
|
|
|
KASSERT((lookupflags & ~INPLOOKUP_MASK) == 0,
|
|
|
|
("%s: invalid lookup flags %d", __func__, lookupflags));
|
2021-04-16 21:56:14 +00:00
|
|
|
KASSERT((lookupflags & (INPLOOKUP_RLOCKPCB | INPLOOKUP_WLOCKPCB)) != 0,
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
|
|
|
("%s: LOCKPCB not set", __func__));
|
|
|
|
|
Implement a CPU-affine TCP and UDP connection lookup data structure,
struct inpcbgroup. pcbgroups, or "connection groups", supplement the
existing inpcbinfo connection hash table, which when pcbgroups are
enabled, might now be thought of more usefully as a per-protocol
4-tuple reservation table.
Connections are assigned to connection groups base on a hash of their
4-tuple; wildcard sockets require special handling, and are members
of all connection groups. During a connection lookup, a
per-connection group lock is employed rather than the global pcbinfo
lock. By aligning connection groups with input path processing,
connection groups take on an effective CPU affinity, especially when
aligned with RSS work placement (see a forthcoming commit for
details). This eliminates cache line migration associated with
global, protocol-layer data structures in steady state TCP and UDP
processing (with the exception of protocol-layer statistics; further
commit to follow).
Elements of this approach were inspired by Willman, Rixner, and Cox's
2006 USENIX paper, "An Evaluation of Network Stack Parallelization
Strategies in Modern Operating Systems". However, there are also
significant differences: we maintain the inpcb lock, rather than using
the connection group lock for per-connection state.
Likewise, the focus of this implementation is alignment with NIC
packet distribution strategies such as RSS, rather than pure software
strategies. Despite that focus, software distribution is supported
through the parallel netisr implementation, and works well in
configurations where the number of hardware threads is greater than
the number of NIC input queues, such as in the RMI XLR threaded MIPS
architecture.
Another important difference is the continued maintenance of existing
hash tables as "reservation tables" -- these are useful both to
distinguish the resource allocation aspect of protocol name management
and the more common-case lookup aspect. In configurations where
connection tables are aligned with hardware hashes, it is desirable to
use the traditional lookup tables for loopback or encapsulated traffic
rather than take the expense of hardware hashes that are hard to
implement efficiently in software (such as RSS Toeplitz).
Connection group support is enabled by compiling "options PCBGROUP"
into your kernel configuration; for the time being, this is an
experimental feature, and hence is not enabled by default.
Subject to the limited MFCability of change dependencies in inpcb,
and its change to the inpcbinfo init function signature, this change
in principle could be merged to FreeBSD 8.x.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-06-06 12:55:02 +00:00
|
|
|
#ifdef PCBGROUP
|
Several years after initial development, merge prototype support for
linking NIC Receive Side Scaling (RSS) to the network stack's
connection-group implementation. This prototype (and derived patches)
are in use at Juniper and several other FreeBSD-using companies, so
despite some reservations about its maturity, merge the patch to the
base tree so that it can be iteratively refined in collaboration rather
than maintained as a set of gradually diverging patch sets.
(1) Merge a software implementation of the Toeplitz hash specified in
RSS implemented by David Malone. This is used to allow suitable
pcbgroup placement of connections before the first packet is
received from the NIC. Software hashing is generally avoided,
however, due to high cost of the hash on general-purpose CPUs.
(2) In in_rss.c, maintain authoritative versions of RSS state intended
to be pushed to each NIC, including keying material, hash
algorithm/ configuration, and buckets. Provide software-facing
interfaces to hash 2- and 4-tuples for IPv4 and IPv6 using both
the RSS standardised Toeplitz and a 'naive' variation with a hash
efficient in software but with poor distribution properties.
Implement rss_m2cpuid()to be used by netisr and other load
balancing code to look up the CPU on which an mbuf should be
processed.
(3) In the Ethernet link layer, allow netisr distribution using RSS as
a source of policy as an alternative to source ordering; continue
to default to direct dispatch (i.e., don't try and requeue packets
for processing on the 'right' CPU if they arrive in a directly
dispatchable context).
(4) Allow RSS to control tuning of connection groups in order to align
groups with RSS buckets. If a packet arrives on a protocol using
connection groups, and contains a suitable hardware-generated
hash, use that hash value to select the connection group for pcb
lookup for both IPv4 and IPv6. If no hardware-generated Toeplitz
hash is available, we fall back on regular PCB lookup risking
contention rather than pay the cost of Toeplitz in software --
this is a less scalable but, at my last measurement, faster
approach. As core counts go up, we may want to revise this
strategy despite CPU overhead.
Where device drivers suitably configure NICs, and connection groups /
RSS are enabled, this should avoid both lock and line contention during
connection lookup for TCP. This commit does not modify any device
drivers to tune device RSS configuration to the global RSS
configuration; patches are in circulation to do this for at least
Chelsio T3 and Intel 1G/10G drivers. Currently, the KPI for device
drivers is not particularly robust, nor aware of more advanced features
such as runtime reconfiguration/rebalancing. This will hopefully prove
a useful starting point for refinement.
No MFC is scheduled as we will first want to nail down a more mature
and maintainable KPI/KBI for device drivers.
Sponsored by: Juniper Networks (original work)
Sponsored by: EMC/Isilon (patch update and merge)
2014-03-15 00:57:50 +00:00
|
|
|
/*
|
|
|
|
* If we can use a hardware-generated hash to look up the connection
|
|
|
|
* group, use that connection group to find the inpcb. Otherwise
|
|
|
|
* fall back on a software hash -- or the reservation table if we're
|
|
|
|
* using RSS.
|
|
|
|
*
|
|
|
|
* XXXRW: As above, that policy belongs in the pcbgroup code.
|
|
|
|
*/
|
|
|
|
if (in_pcbgroup_enabled(pcbinfo) &&
|
2014-12-01 11:45:24 +00:00
|
|
|
M_HASHTYPE_TEST(m, M_HASHTYPE_NONE) == 0) {
|
Implement a CPU-affine TCP and UDP connection lookup data structure,
struct inpcbgroup. pcbgroups, or "connection groups", supplement the
existing inpcbinfo connection hash table, which when pcbgroups are
enabled, might now be thought of more usefully as a per-protocol
4-tuple reservation table.
Connections are assigned to connection groups base on a hash of their
4-tuple; wildcard sockets require special handling, and are members
of all connection groups. During a connection lookup, a
per-connection group lock is employed rather than the global pcbinfo
lock. By aligning connection groups with input path processing,
connection groups take on an effective CPU affinity, especially when
aligned with RSS work placement (see a forthcoming commit for
details). This eliminates cache line migration associated with
global, protocol-layer data structures in steady state TCP and UDP
processing (with the exception of protocol-layer statistics; further
commit to follow).
Elements of this approach were inspired by Willman, Rixner, and Cox's
2006 USENIX paper, "An Evaluation of Network Stack Parallelization
Strategies in Modern Operating Systems". However, there are also
significant differences: we maintain the inpcb lock, rather than using
the connection group lock for per-connection state.
Likewise, the focus of this implementation is alignment with NIC
packet distribution strategies such as RSS, rather than pure software
strategies. Despite that focus, software distribution is supported
through the parallel netisr implementation, and works well in
configurations where the number of hardware threads is greater than
the number of NIC input queues, such as in the RMI XLR threaded MIPS
architecture.
Another important difference is the continued maintenance of existing
hash tables as "reservation tables" -- these are useful both to
distinguish the resource allocation aspect of protocol name management
and the more common-case lookup aspect. In configurations where
connection tables are aligned with hardware hashes, it is desirable to
use the traditional lookup tables for loopback or encapsulated traffic
rather than take the expense of hardware hashes that are hard to
implement efficiently in software (such as RSS Toeplitz).
Connection group support is enabled by compiling "options PCBGROUP"
into your kernel configuration; for the time being, this is an
experimental feature, and hence is not enabled by default.
Subject to the limited MFCability of change dependencies in inpcb,
and its change to the inpcbinfo init function signature, this change
in principle could be merged to FreeBSD 8.x.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-06-06 12:55:02 +00:00
|
|
|
pcbgroup = in6_pcbgroup_byhash(pcbinfo, M_HASHTYPE_GET(m),
|
|
|
|
m->m_pkthdr.flowid);
|
|
|
|
if (pcbgroup != NULL)
|
|
|
|
return (in6_pcblookup_group(pcbinfo, pcbgroup, faddr,
|
|
|
|
fport, laddr, lport, lookupflags, ifp));
|
Several years after initial development, merge prototype support for
linking NIC Receive Side Scaling (RSS) to the network stack's
connection-group implementation. This prototype (and derived patches)
are in use at Juniper and several other FreeBSD-using companies, so
despite some reservations about its maturity, merge the patch to the
base tree so that it can be iteratively refined in collaboration rather
than maintained as a set of gradually diverging patch sets.
(1) Merge a software implementation of the Toeplitz hash specified in
RSS implemented by David Malone. This is used to allow suitable
pcbgroup placement of connections before the first packet is
received from the NIC. Software hashing is generally avoided,
however, due to high cost of the hash on general-purpose CPUs.
(2) In in_rss.c, maintain authoritative versions of RSS state intended
to be pushed to each NIC, including keying material, hash
algorithm/ configuration, and buckets. Provide software-facing
interfaces to hash 2- and 4-tuples for IPv4 and IPv6 using both
the RSS standardised Toeplitz and a 'naive' variation with a hash
efficient in software but with poor distribution properties.
Implement rss_m2cpuid()to be used by netisr and other load
balancing code to look up the CPU on which an mbuf should be
processed.
(3) In the Ethernet link layer, allow netisr distribution using RSS as
a source of policy as an alternative to source ordering; continue
to default to direct dispatch (i.e., don't try and requeue packets
for processing on the 'right' CPU if they arrive in a directly
dispatchable context).
(4) Allow RSS to control tuning of connection groups in order to align
groups with RSS buckets. If a packet arrives on a protocol using
connection groups, and contains a suitable hardware-generated
hash, use that hash value to select the connection group for pcb
lookup for both IPv4 and IPv6. If no hardware-generated Toeplitz
hash is available, we fall back on regular PCB lookup risking
contention rather than pay the cost of Toeplitz in software --
this is a less scalable but, at my last measurement, faster
approach. As core counts go up, we may want to revise this
strategy despite CPU overhead.
Where device drivers suitably configure NICs, and connection groups /
RSS are enabled, this should avoid both lock and line contention during
connection lookup for TCP. This commit does not modify any device
drivers to tune device RSS configuration to the global RSS
configuration; patches are in circulation to do this for at least
Chelsio T3 and Intel 1G/10G drivers. Currently, the KPI for device
drivers is not particularly robust, nor aware of more advanced features
such as runtime reconfiguration/rebalancing. This will hopefully prove
a useful starting point for refinement.
No MFC is scheduled as we will first want to nail down a more mature
and maintainable KPI/KBI for device drivers.
Sponsored by: Juniper Networks (original work)
Sponsored by: EMC/Isilon (patch update and merge)
2014-03-15 00:57:50 +00:00
|
|
|
#ifndef RSS
|
Implement a CPU-affine TCP and UDP connection lookup data structure,
struct inpcbgroup. pcbgroups, or "connection groups", supplement the
existing inpcbinfo connection hash table, which when pcbgroups are
enabled, might now be thought of more usefully as a per-protocol
4-tuple reservation table.
Connections are assigned to connection groups base on a hash of their
4-tuple; wildcard sockets require special handling, and are members
of all connection groups. During a connection lookup, a
per-connection group lock is employed rather than the global pcbinfo
lock. By aligning connection groups with input path processing,
connection groups take on an effective CPU affinity, especially when
aligned with RSS work placement (see a forthcoming commit for
details). This eliminates cache line migration associated with
global, protocol-layer data structures in steady state TCP and UDP
processing (with the exception of protocol-layer statistics; further
commit to follow).
Elements of this approach were inspired by Willman, Rixner, and Cox's
2006 USENIX paper, "An Evaluation of Network Stack Parallelization
Strategies in Modern Operating Systems". However, there are also
significant differences: we maintain the inpcb lock, rather than using
the connection group lock for per-connection state.
Likewise, the focus of this implementation is alignment with NIC
packet distribution strategies such as RSS, rather than pure software
strategies. Despite that focus, software distribution is supported
through the parallel netisr implementation, and works well in
configurations where the number of hardware threads is greater than
the number of NIC input queues, such as in the RMI XLR threaded MIPS
architecture.
Another important difference is the continued maintenance of existing
hash tables as "reservation tables" -- these are useful both to
distinguish the resource allocation aspect of protocol name management
and the more common-case lookup aspect. In configurations where
connection tables are aligned with hardware hashes, it is desirable to
use the traditional lookup tables for loopback or encapsulated traffic
rather than take the expense of hardware hashes that are hard to
implement efficiently in software (such as RSS Toeplitz).
Connection group support is enabled by compiling "options PCBGROUP"
into your kernel configuration; for the time being, this is an
experimental feature, and hence is not enabled by default.
Subject to the limited MFCability of change dependencies in inpcb,
and its change to the inpcbinfo init function signature, this change
in principle could be merged to FreeBSD 8.x.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-06-06 12:55:02 +00:00
|
|
|
pcbgroup = in6_pcbgroup_bytuple(pcbinfo, laddr, lport, faddr,
|
|
|
|
fport);
|
|
|
|
return (in6_pcblookup_group(pcbinfo, pcbgroup, faddr, fport,
|
|
|
|
laddr, lport, lookupflags, ifp));
|
Several years after initial development, merge prototype support for
linking NIC Receive Side Scaling (RSS) to the network stack's
connection-group implementation. This prototype (and derived patches)
are in use at Juniper and several other FreeBSD-using companies, so
despite some reservations about its maturity, merge the patch to the
base tree so that it can be iteratively refined in collaboration rather
than maintained as a set of gradually diverging patch sets.
(1) Merge a software implementation of the Toeplitz hash specified in
RSS implemented by David Malone. This is used to allow suitable
pcbgroup placement of connections before the first packet is
received from the NIC. Software hashing is generally avoided,
however, due to high cost of the hash on general-purpose CPUs.
(2) In in_rss.c, maintain authoritative versions of RSS state intended
to be pushed to each NIC, including keying material, hash
algorithm/ configuration, and buckets. Provide software-facing
interfaces to hash 2- and 4-tuples for IPv4 and IPv6 using both
the RSS standardised Toeplitz and a 'naive' variation with a hash
efficient in software but with poor distribution properties.
Implement rss_m2cpuid()to be used by netisr and other load
balancing code to look up the CPU on which an mbuf should be
processed.
(3) In the Ethernet link layer, allow netisr distribution using RSS as
a source of policy as an alternative to source ordering; continue
to default to direct dispatch (i.e., don't try and requeue packets
for processing on the 'right' CPU if they arrive in a directly
dispatchable context).
(4) Allow RSS to control tuning of connection groups in order to align
groups with RSS buckets. If a packet arrives on a protocol using
connection groups, and contains a suitable hardware-generated
hash, use that hash value to select the connection group for pcb
lookup for both IPv4 and IPv6. If no hardware-generated Toeplitz
hash is available, we fall back on regular PCB lookup risking
contention rather than pay the cost of Toeplitz in software --
this is a less scalable but, at my last measurement, faster
approach. As core counts go up, we may want to revise this
strategy despite CPU overhead.
Where device drivers suitably configure NICs, and connection groups /
RSS are enabled, this should avoid both lock and line contention during
connection lookup for TCP. This commit does not modify any device
drivers to tune device RSS configuration to the global RSS
configuration; patches are in circulation to do this for at least
Chelsio T3 and Intel 1G/10G drivers. Currently, the KPI for device
drivers is not particularly robust, nor aware of more advanced features
such as runtime reconfiguration/rebalancing. This will hopefully prove
a useful starting point for refinement.
No MFC is scheduled as we will first want to nail down a more mature
and maintainable KPI/KBI for device drivers.
Sponsored by: Juniper Networks (original work)
Sponsored by: EMC/Isilon (patch update and merge)
2014-03-15 00:57:50 +00:00
|
|
|
#endif
|
Implement a CPU-affine TCP and UDP connection lookup data structure,
struct inpcbgroup. pcbgroups, or "connection groups", supplement the
existing inpcbinfo connection hash table, which when pcbgroups are
enabled, might now be thought of more usefully as a per-protocol
4-tuple reservation table.
Connections are assigned to connection groups base on a hash of their
4-tuple; wildcard sockets require special handling, and are members
of all connection groups. During a connection lookup, a
per-connection group lock is employed rather than the global pcbinfo
lock. By aligning connection groups with input path processing,
connection groups take on an effective CPU affinity, especially when
aligned with RSS work placement (see a forthcoming commit for
details). This eliminates cache line migration associated with
global, protocol-layer data structures in steady state TCP and UDP
processing (with the exception of protocol-layer statistics; further
commit to follow).
Elements of this approach were inspired by Willman, Rixner, and Cox's
2006 USENIX paper, "An Evaluation of Network Stack Parallelization
Strategies in Modern Operating Systems". However, there are also
significant differences: we maintain the inpcb lock, rather than using
the connection group lock for per-connection state.
Likewise, the focus of this implementation is alignment with NIC
packet distribution strategies such as RSS, rather than pure software
strategies. Despite that focus, software distribution is supported
through the parallel netisr implementation, and works well in
configurations where the number of hardware threads is greater than
the number of NIC input queues, such as in the RMI XLR threaded MIPS
architecture.
Another important difference is the continued maintenance of existing
hash tables as "reservation tables" -- these are useful both to
distinguish the resource allocation aspect of protocol name management
and the more common-case lookup aspect. In configurations where
connection tables are aligned with hardware hashes, it is desirable to
use the traditional lookup tables for loopback or encapsulated traffic
rather than take the expense of hardware hashes that are hard to
implement efficiently in software (such as RSS Toeplitz).
Connection group support is enabled by compiling "options PCBGROUP"
into your kernel configuration; for the time being, this is an
experimental feature, and hence is not enabled by default.
Subject to the limited MFCability of change dependencies in inpcb,
and its change to the inpcbinfo init function signature, this change
in principle could be merged to FreeBSD 8.x.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-06-06 12:55:02 +00:00
|
|
|
}
|
|
|
|
#endif
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
|
|
|
return (in6_pcblookup_hash(pcbinfo, faddr, fport, laddr, lport,
|
Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain
In order to efficiently serve web traffic on a NUMA
machine, one must avoid as many NUMA domain crossings as
possible. With SO_REUSEPORT_LB, a number of workers can share a
listen socket. However, even if a worker sets affinity to a core
or set of cores on a NUMA domain, it will receive connections
associated with all NUMA domains in the system. This will lead to
cross-domain traffic when the server writes to the socket or
calls sendfile(), and memory is allocated on the server's local
NUMA node, but transmitted on the NUMA node associated with the
TCP connection. Similarly, when the server reads from the socket,
he will likely be reading memory allocated on the NUMA domain
associated with the TCP connection.
This change provides a new socket ioctl, TCP_REUSPORT_LB_NUMA. A
server can now tell the kernel to filter traffic so that only
incoming connections associated with the desired NUMA domain are
given to the server. (Of course, in the case where there are no
servers sharing the listen socket on some domain, then as a
fallback, traffic will be hashed as normal to all servers sharing
the listen socket regardless of domain). This allows a server to
deal only with traffic that is local to its NUMA domain, and
avoids cross-domain traffic in most cases.
This patch, and a corresponding small patch to nginx to use
TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted
https media content from dual-socket Xeons with only 13% (as
measured by pcm.x) cross domain traffic on the memory controller.
Reviewed by: jhb, bz (earlier version), bcr (man page)
Tested by: gonzo
Sponsored by: Netfix
Differential Revision: https://reviews.freebsd.org/D21636
2020-12-19 22:04:46 +00:00
|
|
|
lookupflags, ifp, m->m_pkthdr.numa_domain));
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
|
|
|
}
|
|
|
|
|
1999-11-22 02:45:11 +00:00
|
|
|
void
|
2017-03-06 04:01:58 +00:00
|
|
|
init_sin6(struct sockaddr_in6 *sin6, struct mbuf *m, int srcordst)
|
1999-11-22 02:45:11 +00:00
|
|
|
{
|
|
|
|
struct ip6_hdr *ip;
|
|
|
|
|
|
|
|
ip = mtod(m, struct ip6_hdr *);
|
|
|
|
bzero(sin6, sizeof(*sin6));
|
|
|
|
sin6->sin6_len = sizeof(*sin6);
|
|
|
|
sin6->sin6_family = AF_INET6;
|
2017-03-06 04:01:58 +00:00
|
|
|
sin6->sin6_addr = srcordst ? ip->ip6_dst : ip->ip6_src;
|
2005-07-25 12:31:43 +00:00
|
|
|
|
|
|
|
(void)sa6_recoverscope(sin6); /* XXX: should catch errors... */
|
1999-11-22 02:45:11 +00:00
|
|
|
|
|
|
|
return;
|
|
|
|
}
|