2005-01-07 01:45:51 +00:00
|
|
|
/*-
|
2017-11-20 19:43:44 +00:00
|
|
|
* SPDX-License-Identifier: BSD-3-Clause
|
|
|
|
*
|
1995-09-21 17:55:49 +00:00
|
|
|
* Copyright (c) 1982, 1986, 1991, 1993, 1995
|
2007-02-17 21:02:38 +00:00
|
|
|
* The Regents of the University of California.
|
2009-03-11 00:29:22 +00:00
|
|
|
* Copyright (c) 2007-2009 Robert N. M. Watson
|
Continue to refine inpcb reference counting and locking, in preparation for
reworking of inpcbinfo locking:
(1) Convert inpcb reference counting from manually manipulated integers to
the refcount(9) KPI. This allows the refcount to be managed atomically
with an inpcb read lock rather than write lock, or even with no inpcb
lock at all. As a result, in_pcbref() also no longer requires an inpcb
lock, so can be performed solely using the lock used to look up an
inpcb.
(2) Shift more inpcb freeing activity from the in_pcbrele() context (via
in_pcbfree_internal) to the explicit in_pcbfree() context. This means
that the inpcb refcount is increasingly used only to maintain memory
stability, not actually defer the clean up of inpcb protocol parts.
This is desirable as many of those protocol parts required the pcbinfo
lock, which we'd like not to acquire in in_pcbrele() contexts. Document
this in comments better.
(3) Introduce new read-locked and write-locked in_pcbrele() variations,
in_pcbrele_rlocked() and in_pcbrele_wlocked(), which allow the inpcb to
be properly unlocked as needed. in_pcbrele() is a wrapper around the
latter, and should probably go away at some point. This makes it
easier to use this weak reference model when holding only a read lock,
as will happen in the future.
This may well be safe to MFC, but some more KBI analysis is required.
Reviewed by: bz
MFC after: 3 weeks
Sponsored by: Juniper Networks, Inc.
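The reference-counting model described in point (1) can be approximated in userland. The following is a minimal sketch using C11 atomics rather than the kernel's refcount(9) KPI; struct sketch_inpcb and the sketch_* functions are hypothetical names for illustration, not code from in_pcb.c. It shows that taking a reference needs no lock, and that only the caller dropping the last reference learns that the memory may be released.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for an inpcb; only the refcount matters here. */
struct sketch_inpcb {
	atomic_uint refcount;
	bool freed;		/* set when storage would go back to the allocator */
};

/* A new object starts with one reference held by its creator. */
static void
sketch_pcb_init(struct sketch_inpcb *inp)
{
	atomic_init(&inp->refcount, 1);
	inp->freed = false;
}

/* Take a reference: a single atomic increment, no lock needed. */
static void
sketch_pcbref(struct sketch_inpcb *inp)
{
	unsigned int old;

	old = atomic_fetch_add_explicit(&inp->refcount, 1,
	    memory_order_relaxed);
	assert(old > 0);	/* caller must already hold a valid reference */
	(void)old;
}

/*
 * Drop a reference.  Returns true only for the caller that released the
 * last reference, which is the point where memory may be reclaimed.
 */
static bool
sketch_pcbrele(struct sketch_inpcb *inp)
{
	unsigned int old;

	old = atomic_fetch_sub_explicit(&inp->refcount, 1,
	    memory_order_acq_rel);
	assert(old > 0);
	if (old != 1)
		return (false);
	inp->freed = true;
	return (true);
}
```

In this sketch the count only guards memory stability, matching point (2): protocol-level teardown would happen elsewhere, in an explicit free path.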
2011-05-23 19:32:02 +00:00
|
|
|
* Copyright (c) 2010-2011 Juniper Networks, Inc.
|
2007-02-17 21:02:38 +00:00
|
|
|
* All rights reserved.
|
1994-05-24 10:09:53 +00:00
|
|
|
*
|
2011-05-23 19:32:02 +00:00
|
|
|
* Portions of this software were developed by Robert N. M. Watson under
|
|
|
|
* contract to Juniper Networks, Inc.
|
|
|
|
*
|
1994-05-24 10:09:53 +00:00
|
|
|
* Redistribution and use in source and binary forms, with or without
|
|
|
|
* modification, are permitted provided that the following conditions
|
|
|
|
* are met:
|
|
|
|
* 1. Redistributions of source code must retain the above copyright
|
|
|
|
* notice, this list of conditions and the following disclaimer.
|
|
|
|
* 2. Redistributions in binary form must reproduce the above copyright
|
|
|
|
* notice, this list of conditions and the following disclaimer in the
|
|
|
|
* documentation and/or other materials provided with the distribution.
|
2017-02-28 23:42:47 +00:00
|
|
|
* 3. Neither the name of the University nor the names of its contributors
|
1994-05-24 10:09:53 +00:00
|
|
|
* may be used to endorse or promote products derived from this software
|
|
|
|
* without specific prior written permission.
|
|
|
|
*
|
|
|
|
* THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
|
|
|
|
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
|
|
|
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
|
|
|
* ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
|
|
|
|
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
|
|
|
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
|
|
|
|
* OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
|
|
|
|
* HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
|
|
|
|
* LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
|
|
|
|
* OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
|
|
|
|
* SUCH DAMAGE.
|
|
|
|
*
|
1995-09-21 17:55:49 +00:00
|
|
|
* @(#)in_pcb.c 8.4 (Berkeley) 5/24/95
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
|
|
|
|
2007-10-07 20:44:24 +00:00
|
|
|
#include <sys/cdefs.h>
|
|
|
|
__FBSDID("$FreeBSD$");
|
|
|
|
|
2007-02-17 21:02:38 +00:00
|
|
|
#include "opt_ddb.h"
|
1999-12-22 19:13:38 +00:00
|
|
|
#include "opt_ipsec.h"
|
2011-03-12 21:46:37 +00:00
|
|
|
#include "opt_inet.h"
|
1999-12-07 17:39:16 +00:00
|
|
|
#include "opt_inet6.h"
|
Implement kernel support for hardware rate limited sockets.
- Add RATELIMIT kernel configuration keyword which must be set to
enable the new functionality.
- Add support for hardware driven, Receive Side Scaling, RSS aware, rate
limited sendqueues and expose the functionality through the already
established SO_MAX_PACING_RATE setsockopt(). The API supports rates in
the range from 1 to 4Gbytes/s which are suitable for regular TCP and
UDP streams. The setsockopt(2) manual page has been updated.
- Add rate limit function callback API to "struct ifnet" which supports
the following operations: if_snd_tag_alloc(), if_snd_tag_modify(),
if_snd_tag_query() and if_snd_tag_free().
- Add support to ifconfig to view, set and clear the IFCAP_TXRTLMT
flag, which tells if a network driver supports rate limiting or not.
- This patch also adds support for rate limiting through VLAN and LAGG
intermediate network devices.
- How rate limiting works:
1) The userspace application calls setsockopt() after accepting or
making a new connection to set the rate which is then stored in the
socket structure in the kernel. Later on when packets are transmitted
a check is made in the transmit path for rate changes. A rate change
implies a non-blocking ifp->if_snd_tag_alloc() call will be made to the
destination network interface, which then sets up a custom sendqueue
with the given rate limitation parameter. A "struct m_snd_tag" pointer is
returned which serves as a "snd_tag" hint in the m_pkthdr for the
subsequently transmitted mbufs.
2) When the network driver sees the "m->m_pkthdr.snd_tag" different
from NULL, it will move the packets into a designated rate limited sendqueue
given by the snd_tag pointer. It is up to the individual drivers how the rate
limited traffic will be rate limited.
3) Route changes are detected by the NIC drivers in the ifp->if_transmit()
routine when the ifnet pointer in the incoming snd_tag mismatches the
one of the network interface. The network adapter frees the mbuf and
returns EAGAIN, which causes ip_output() to release and clear the send
tag. On the next ip_output() call, allocation of a new "snd_tag" is attempted.
4) When the PCB is detached the custom sendqueue will be released by a
non-blocking ifp->if_snd_tag_free() call to the currently bound network
interface.
Reviewed by: wblock (manpages), adrian, gallatin, scottl (network)
Differential Revision: https://reviews.freebsd.org/D3687
Sponsored by: Mellanox Technologies
MFC after: 3 months
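Steps 1) to 4) above can be modelled in userland as a small state machine. This is a sketch under stated assumptions, not the kernel code: the sketch_* types and functions are hypothetical, a NULL ifp in the tag stands for "no send tag", and tag allocation is assumed never to fail.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical interface; counts the send tags currently bound to it. */
struct sketch_ifnet {
	int live_tags;
};

/* Hypothetical send tag: remembers its interface and rate. */
struct sketch_snd_tag {
	struct sketch_ifnet *ifp;	/* NULL means no tag is allocated */
	uint32_t rate;
};

struct sketch_pcb {
	uint32_t max_pacing_rate;	/* stored by setsockopt(), step 1 */
	struct sketch_snd_tag tag;
};

/* Step 4: release the tag on the interface it is bound to. */
static void
sketch_tag_free(struct sketch_pcb *pcb)
{
	if (pcb->tag.ifp != NULL) {
		pcb->tag.ifp->live_tags--;
		pcb->tag.ifp = NULL;
		pcb->tag.rate = 0;
	}
}

/* Step 1: on transmit, allocate or modify the tag if the rate changed. */
static void
sketch_rate_check(struct sketch_pcb *pcb, struct sketch_ifnet *ifp)
{
	assert(ifp != NULL);
	if (pcb->max_pacing_rate == 0) {
		sketch_tag_free(pcb);
	} else if (pcb->tag.ifp == NULL) {
		pcb->tag.ifp = ifp;		/* models tag allocation */
		pcb->tag.rate = pcb->max_pacing_rate;
		ifp->live_tags++;
	} else if (pcb->tag.rate != pcb->max_pacing_rate) {
		pcb->tag.rate = pcb->max_pacing_rate;	/* models modify */
	}
}

/* Step 3: the driver rejects a tag that belongs to another interface. */
static int
sketch_if_transmit(const struct sketch_ifnet *ifp,
    const struct sketch_snd_tag *tag)
{
	if (tag->ifp != NULL && tag->ifp != ifp)
		return (EAGAIN);
	return (0);
}

/* On EAGAIN the output path drops the stale tag; the next call retries. */
static int
sketch_ip_output(struct sketch_pcb *pcb, struct sketch_ifnet *route_ifp)
{
	int error;

	sketch_rate_check(pcb, route_ifp);
	error = sketch_if_transmit(route_ifp, &pcb->tag);
	if (error == EAGAIN)
		sketch_tag_free(pcb);
	return (error);
}
```

Calling sketch_ip_output() with a different interface after a route change returns EAGAIN once, and the following call succeeds with a fresh tag on the new interface.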
2017-01-18 13:31:17 +00:00
|
|
|
#include "opt_ratelimit.h"
|
Implement a CPU-affine TCP and UDP connection lookup data structure,
struct inpcbgroup. pcbgroups, or "connection groups", supplement the
existing inpcbinfo connection hash table, which when pcbgroups are
enabled, might now be thought of more usefully as a per-protocol
4-tuple reservation table.
Connections are assigned to connection groups based on a hash of their
4-tuple; wildcard sockets require special handling, and are members
of all connection groups. During a connection lookup, a
per-connection group lock is employed rather than the global pcbinfo
lock. By aligning connection groups with input path processing,
connection groups take on an effective CPU affinity, especially when
aligned with RSS work placement (see a forthcoming commit for
details). This eliminates cache line migration associated with
global, protocol-layer data structures in steady state TCP and UDP
processing (with the exception of protocol-layer statistics; further
commit to follow).
Elements of this approach were inspired by Willman, Rixner, and Cox's
2006 USENIX paper, "An Evaluation of Network Stack Parallelization
Strategies in Modern Operating Systems". However, there are also
significant differences: we maintain the inpcb lock, rather than using
the connection group lock for per-connection state.
Likewise, the focus of this implementation is alignment with NIC
packet distribution strategies such as RSS, rather than pure software
strategies. Despite that focus, software distribution is supported
through the parallel netisr implementation, and works well in
configurations where the number of hardware threads is greater than
the number of NIC input queues, such as in the RMI XLR threaded MIPS
architecture.
Another important difference is the continued maintenance of existing
hash tables as "reservation tables" -- these are useful both to
distinguish the resource allocation aspect of protocol name management
and the more common-case lookup aspect. In configurations where
connection tables are aligned with hardware hashes, it is desirable to
use the traditional lookup tables for loopback or encapsulated traffic
rather than take the expense of hardware hashes that are hard to
implement efficiently in software (such as RSS Toeplitz).
Connection group support is enabled by compiling "options PCBGROUP"
into your kernel configuration; for the time being, this is an
experimental feature, and hence is not enabled by default.
Subject to the limited MFCability of change dependencies in inpcb,
and its change to the inpcbinfo init function signature, this change
in principle could be merged to FreeBSD 8.x.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
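The group assignment rule described above (hash the 4-tuple, treat wildcard sockets as members of every group) can be sketched as follows. All names are hypothetical, and the mixing function is a simple placeholder chosen for brevity; it is not the hash the kernel uses.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical IPv4 connection 4-tuple; faddr == 0 marks a wildcard. */
struct sketch_4tuple {
	uint32_t laddr, faddr;
	uint16_t lport, fport;
};

/* Returned for wildcard sockets, which belong to all groups. */
#define	SKETCH_WILDCARD_ALL	(-1)

/* Placeholder integer mixing, used only to spread tuples over groups. */
static uint32_t
sketch_tuple_hash(const struct sketch_4tuple *t)
{
	uint32_t h;

	h = t->laddr ^ t->faddr;
	h ^= ((uint32_t)t->lport << 16) | t->fport;
	h ^= h >> 16;
	h *= 0x45d9f3bU;
	h ^= h >> 16;
	return (h);
}

/* Map a connection to one group, or to all groups if it is a wildcard. */
static int
sketch_pcbgroup_index(const struct sketch_4tuple *t, unsigned int ngroups)
{
	assert(ngroups > 0);
	if (t->faddr == 0)
		return (SKETCH_WILDCARD_ALL);
	return ((int)(sketch_tuple_hash(t) % ngroups));
}
```

Because the index depends only on the tuple, every packet of a connection maps to the same group, so a lookup would need only that group's lock.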
2011-06-06 12:55:02 +00:00
|
|
|
#include "opt_pcbgroup.h"
|
Several years after initial development, merge prototype support for
linking NIC Receive Side Scaling (RSS) to the network stack's
connection-group implementation. This prototype (and derived patches)
are in use at Juniper and several other FreeBSD-using companies, so
despite some reservations about its maturity, merge the patch to the
base tree so that it can be iteratively refined in collaboration rather
than maintained as a set of gradually diverging patch sets.
(1) Merge a software implementation of the Toeplitz hash specified in
RSS implemented by David Malone. This is used to allow suitable
pcbgroup placement of connections before the first packet is
received from the NIC. Software hashing is generally avoided,
however, due to high cost of the hash on general-purpose CPUs.
(2) In in_rss.c, maintain authoritative versions of RSS state intended
to be pushed to each NIC, including keying material, hash
algorithm/configuration, and buckets. Provide software-facing
interfaces to hash 2- and 4-tuples for IPv4 and IPv6 using both
the RSS standardised Toeplitz and a 'naive' variation with a hash
efficient in software but with poor distribution properties.
Implement rss_m2cpuid() to be used by netisr and other load
balancing code to look up the CPU on which an mbuf should be
processed.
(3) In the Ethernet link layer, allow netisr distribution using RSS as
a source of policy as an alternative to source ordering; continue
to default to direct dispatch (i.e., don't try and requeue packets
for processing on the 'right' CPU if they arrive in a directly
dispatchable context).
(4) Allow RSS to control tuning of connection groups in order to align
groups with RSS buckets. If a packet arrives on a protocol using
connection groups, and contains a suitable hardware-generated
hash, use that hash value to select the connection group for pcb
lookup for both IPv4 and IPv6. If no hardware-generated Toeplitz
hash is available, we fall back on regular PCB lookup risking
contention rather than pay the cost of Toeplitz in software --
this is a less scalable but, at my last measurement, faster
approach. As core counts go up, we may want to revise this
strategy despite CPU overhead.
Where device drivers suitably configure NICs, and connection groups /
RSS are enabled, this should avoid both lock and line contention during
connection lookup for TCP. This commit does not modify any device
drivers to tune device RSS configuration to the global RSS
configuration; patches are in circulation to do this for at least
Chelsio T3 and Intel 1G/10G drivers. Currently, the KPI for device
drivers is not particularly robust, nor aware of more advanced features
such as runtime reconfiguration/rebalancing. This will hopefully prove
a useful starting point for refinement.
No MFC is scheduled as we will first want to nail down a more mature
and maintainable KPI/KBI for device drivers.
Sponsored by: Juniper Networks (original work)
Sponsored by: EMC/Isilon (patch update and merge)
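For reference, a software Toeplitz hash of the kind mentioned in point (1) can be sketched as below. This is a bit-at-a-time illustration, not the implementation merged by this commit; sketch_toeplitz() and sketch_key are hypothetical, the 8-byte key is arbitrary sample data, and key bits past the end of the key are treated as zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Arbitrary sample key for illustration, not real keying material. */
static const uint8_t sketch_key[8] = {
	0x6d, 0x5a, 0x56, 0xda, 0x25, 0x5b, 0x0e, 0xc2
};

/*
 * For every set bit of the input (most significant bit first), XOR in
 * the 32-bit window of the key that starts at that bit position.
 */
static uint32_t
sketch_toeplitz(const uint8_t *key, size_t keylen, const uint8_t *data,
    size_t datalen)
{
	uint32_t hash = 0;
	uint32_t window;
	size_t i, b, bit;

	assert(keylen >= 4);
	window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
	    ((uint32_t)key[2] << 8) | (uint32_t)key[3];
	for (i = 0; i < datalen; i++) {
		for (b = 0; b < 8; b++) {
			if (data[i] & (0x80u >> b))
				hash ^= window;
			/* Slide the window left by one key bit. */
			bit = 32 + i * 8 + b;
			window <<= 1;
			if (bit / 8 < keylen &&
			    (key[bit / 8] & (0x80u >> (bit % 8))))
				window |= 1;
		}
	}
	return (hash);
}
```

The inner loop runs once per input bit, which gives a sense of why the commit message describes software hashing as costly on general-purpose CPUs.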
2014-03-15 00:57:50 +00:00
|
|
|
#include "opt_rss.h"
|
1999-12-07 17:39:16 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
#include <sys/param.h>
|
|
|
|
#include <sys/systm.h>
|
2015-07-29 08:12:05 +00:00
|
|
|
#include <sys/lock.h>
|
1994-05-24 10:09:53 +00:00
|
|
|
#include <sys/malloc.h>
|
|
|
|
#include <sys/mbuf.h>
|
2011-04-20 08:00:29 +00:00
|
|
|
#include <sys/callout.h>
|
2016-01-09 09:34:39 +00:00
|
|
|
#include <sys/eventhandler.h>
|
1999-12-07 17:39:16 +00:00
|
|
|
#include <sys/domain.h>
|
1994-05-24 10:09:53 +00:00
|
|
|
#include <sys/protosw.h>
|
2015-07-29 08:12:05 +00:00
|
|
|
#include <sys/rmlock.h>
|
2018-04-19 13:37:59 +00:00
|
|
|
#include <sys/smp.h>
|
1994-05-24 10:09:53 +00:00
|
|
|
#include <sys/socket.h>
|
|
|
|
#include <sys/socketvar.h>
|
2017-01-18 13:31:17 +00:00
|
|
|
#include <sys/sockio.h>
|
2006-11-06 13:42:10 +00:00
|
|
|
#include <sys/priv.h>
|
1994-05-24 10:09:53 +00:00
|
|
|
#include <sys/proc.h>
|
2011-05-23 19:32:02 +00:00
|
|
|
#include <sys/refcount.h>
|
This implements the mumbled about "Jail" feature.
This is a seriously beefed up chroot kind of thing. The process
is jailed along the same lines as a chroot does it, but with
additional tough restrictions imposed on what the superuser can do.
For all I know, it is safe to hand over the root bit inside a
prison to the customer living in that prison, this is what
it was developed for in fact: "real virtual servers".
Each prison has an ip number associated with it, which all IP
communications will be coerced to use and each prison has its own
hostname.
Needless to say, you need more RAM this way, but the advantage is
that each customer can run their own particular version of apache
and not stomp on the toes of their neighbors.
It generally does what one would expect, but setting up a jail
still takes a little knowledge.
A few notes:
I have no scripts for setting up a jail, don't ask me for them.
The IP number should be an alias on one of the interfaces.
mount a /proc in each jail, it will make ps more useable.
/proc/<pid>/status tells the hostname of the prison for
jailed processes.
Quotas are only sensible if you have a mountpoint per prison.
There are no provisions for stopping resource-hogging.
Some "#ifdef INET" and similar may be missing (send patches!)
If somebody wants to take it from here and develop it into
more of a "virtual machine" they should be most welcome!
Tools, comments, patches & documentation most welcome.
Have fun...
Sponsored by: http://www.rndassociates.com/
Run for almost a year by: http://www.servetheweb.com/
1999-04-28 11:38:52 +00:00
|
|
|
#include <sys/jail.h>
|
1996-01-19 08:00:58 +00:00
|
|
|
#include <sys/kernel.h>
|
|
|
|
#include <sys/sysctl.h>
|
1998-03-28 10:18:26 +00:00
|
|
|
|
2007-02-17 21:02:38 +00:00
|
|
|
#ifdef DDB
|
|
|
|
#include <ddb/ddb.h>
|
|
|
|
#endif
|
|
|
|
|
2002-03-20 05:48:55 +00:00
|
|
|
#include <vm/uma.h>
|
1994-05-24 10:09:53 +00:00
|
|
|
|
|
|
|
#include <net/if.h>
|
2013-10-26 17:58:36 +00:00
|
|
|
#include <net/if_var.h>
|
1999-12-07 17:39:16 +00:00
|
|
|
#include <net/if_types.h>
|
2016-06-02 17:51:29 +00:00
|
|
|
#include <net/if_llatbl.h>
|
1994-05-24 10:09:53 +00:00
|
|
|
#include <net/route.h>
|
2015-01-18 18:06:40 +00:00
|
|
|
#include <net/rss_config.h>
|
2009-08-01 19:26:27 +00:00
|
|
|
#include <net/vnet.h>
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2011-04-30 11:04:34 +00:00
|
|
|
#if defined(INET) || defined(INET6)
|
1994-05-24 10:09:53 +00:00
|
|
|
#include <netinet/in.h>
|
|
|
|
#include <netinet/in_pcb.h>
|
2019-06-25 11:54:41 +00:00
|
|
|
#ifdef INET
|
|
|
|
#include <netinet/in_var.h>
|
2020-04-14 23:06:25 +00:00
|
|
|
#include <netinet/in_fib.h>
|
2019-06-25 11:54:41 +00:00
|
|
|
#endif
|
1994-05-24 10:09:53 +00:00
|
|
|
#include <netinet/ip_var.h>
|
2003-02-19 22:32:43 +00:00
|
|
|
#include <netinet/tcp_var.h>
|
2018-04-19 13:37:59 +00:00
|
|
|
#ifdef TCPHPTS
|
|
|
|
#include <netinet/tcp_hpts.h>
|
|
|
|
#endif
|
2005-01-02 01:50:57 +00:00
|
|
|
#include <netinet/udp.h>
|
|
|
|
#include <netinet/udp_var.h>
|
1999-12-07 17:39:16 +00:00
|
|
|
#ifdef INET6
|
|
|
|
#include <netinet/ip6.h>
|
2011-03-12 21:46:37 +00:00
|
|
|
#include <netinet6/in6_pcb.h>
|
2011-04-30 11:04:34 +00:00
|
|
|
#include <netinet6/in6_var.h>
|
|
|
|
#include <netinet6/ip6_var.h>
|
1999-12-07 17:39:16 +00:00
|
|
|
#endif /* INET6 */
|
2020-04-14 23:06:25 +00:00
|
|
|
#include <net/route/nhop.h>
|
2019-06-25 11:54:41 +00:00
|
|
|
#endif
|
1999-12-07 17:39:16 +00:00
|
|
|
|
2017-02-06 08:49:57 +00:00
|
|
|
#include <netipsec/ipsec_support.h>
|
2002-10-16 02:25:05 +00:00
|
|
|
|
2006-10-22 11:52:19 +00:00
|
|
|
#include <security/mac/mac_framework.h>
|
|
|
|
|
2018-06-06 15:45:57 +00:00
|
|
|
#define INPCBLBGROUP_SIZMIN 8
|
|
|
|
#define INPCBLBGROUP_SIZMAX 256
|
|
|
|
|
2011-04-20 08:00:29 +00:00
|
|
|
static struct callout ipport_tick_callout;
|
|
|
|
|
1996-01-19 08:00:58 +00:00
|
|
|
/*
|
|
|
|
* These configure the range of local port addresses assigned to
|
|
|
|
* "unspecified" outgoing connections/packets/whatever.
|
|
|
|
*/
|
Build on Jeff Roberson's linker-set based dynamic per-CPU allocator
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator. Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...). This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.
Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack. Virtualized global variables are
tagged at compile-time, placing them in a special linker set, which is
loaded into a contiguous region of kernel memory. Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of the kernel linker.
Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy. Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address. When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.
This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.
Bump __FreeBSD_version and update UPDATING.
Portions submitted by: bz
Reviewed by: bz, zec
Discussed with: gnn, jamie, jeff, jhb, julian, sam
Suggested by: peter
Approved by: re (kensmith)
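The "reference copy plus per-vnet copy" scheme can be imitated in userland. The sketch below is an approximation under stated assumptions: a plain struct stands in for the linker set, offsetof() stands in for the symbol-to-offset conversion, and all sketch_* and SKETCH_* names are hypothetical. The default values are the ones this file assigns to the corresponding variables.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the linker set holding the virtualized globals. */
struct sketch_vnet_globals {
	int ipport_randomized;
	int ipport_randomcps;
	int ipport_randomtime;
};

/* The statically initialized reference copy. */
static const struct sketch_vnet_globals sketch_reference = {
	.ipport_randomized = 1,
	.ipport_randomcps = 10,
	.ipport_randomtime = 45,
};

/* Each network stack instance owns a private copy of the region. */
struct sketch_vnet {
	unsigned char *base;
};

/* Accessor: per-vnet base address plus the variable's offset. */
#define	SKETCH_VNET(vnet, type, name)					\
	(*(type *)((vnet)->base +					\
	    offsetof(struct sketch_vnet_globals, name)))

/* Creating an instance copies the reference values into its region. */
static int
sketch_vnet_create(struct sketch_vnet *vnet)
{
	vnet->base = malloc(sizeof(sketch_reference));
	if (vnet->base == NULL)
		return (-1);
	memcpy(vnet->base, &sketch_reference, sizeof(sketch_reference));
	return (0);
}

static void
sketch_vnet_destroy(struct sketch_vnet *vnet)
{
	free(vnet->base);
	vnet->base = NULL;
}
```

Writing through SKETCH_VNET() on one instance leaves the reference copy and every other instance untouched, which is the isolation property the commit message describes.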
2009-07-14 22:48:30 +00:00
|
|
|
VNET_DEFINE(int, ipport_lowfirstauto) = IPPORT_RESERVED - 1; /* 1023 */
|
|
|
|
VNET_DEFINE(int, ipport_lowlastauto) = IPPORT_RESERVEDSTART; /* 600 */
|
|
|
|
VNET_DEFINE(int, ipport_firstauto) = IPPORT_EPHEMERALFIRST; /* 10000 */
|
|
|
|
VNET_DEFINE(int, ipport_lastauto) = IPPORT_EPHEMERALLAST; /* 65535 */
|
|
|
|
VNET_DEFINE(int, ipport_hifirstauto) = IPPORT_HIFIRSTAUTO; /* 49152 */
|
|
|
|
VNET_DEFINE(int, ipport_hilastauto) = IPPORT_HILASTAUTO; /* 65535 */
|
1996-01-19 08:00:58 +00:00
|
|
|
|
The ancient and outdated concept of "privileged ports" in UNIX-type
OSes has probably caused more problems than it ever solved. Allow the
user to retire the old behavior by specifying their own privileged
range with,
net.inet.ip.portrange.reservedhigh default = IPPORT_RESERVED - 1
net.inet.ip.portrange.reservedlow default = 0
Now you can run that webserver without ever needing root at all. Or
just imagine, an ftpd that can really drop privileges, rather than
just set the euid, and still do PORT data transfers from 20/tcp.
Two edge cases to note,
# sysctl net.inet.ip.portrange.reservedhigh=0
Opens all ports to everyone, and,
# sysctl net.inet.ip.portrange.reservedhigh=65535
Locks all network activity to root only (which could actually have
been achieved before with ipfw(8), but is somewhat more
complicated).
For those who stick to the old religion that 0-1023 belong to root and
root alone, don't touch the knobs (or even lock them by raising
securelevel(8)), and nothing changes.
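The effect of the two knobs can be expressed as a range check. This is a sketch of the behaviour the message describes, not the kernel's bind path; struct sketch_portrange and sketch_port_needs_privilege() are hypothetical, and port 0 is assumed to mean "let the system choose", so it never counts as reserved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical holder for the two sysctl values. */
struct sketch_portrange {
	int reservedlow;	/* default 0 */
	int reservedhigh;	/* default IPPORT_RESERVED - 1, i.e. 1023 */
};

/* True if binding to an explicit port would require privilege. */
static bool
sketch_port_needs_privilege(const struct sketch_portrange *r, uint16_t port)
{
	assert(r->reservedlow >= 0);
	if (port == 0)
		return (false);
	return (port >= r->reservedlow && port <= r->reservedhigh);
}
```

With reservedhigh set to 0 no explicit port falls inside the range, and with 65535 every port does, which matches the two edge cases listed above.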
2003-02-21 05:28:27 +00:00
|
|
|
/*
|
|
|
|
* Reserved ports accessible only to root. There are significant
|
|
|
|
* security considerations that must be accounted for when changing these,
|
|
|
|
* but the security benefits can be great. Please be careful.
|
|
|
|
*/
|
2009-07-14 22:48:30 +00:00
|
|
|
VNET_DEFINE(int, ipport_reservedhigh) = IPPORT_RESERVED - 1; /* 1023 */
|
|
|
|
VNET_DEFINE(int, ipport_reservedlow);
|
2003-02-21 05:28:27 +00:00
|
|
|
|
2005-01-02 01:50:57 +00:00
|
|
|
/* Variables dealing with random ephemeral port allocation. */
|
2009-07-14 22:48:30 +00:00
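The "reference copy plus one copy per network stack" scheme described in the commit message above can be illustrated with a small user-space sketch. Everything here is hypothetical: `struct vnet_ref_set` stands in for the linker set, `toy_vnet_alloc()` for stack creation, and `TOY_VNET()` for the accessor macro. Only the three default values (1, 10, 45) are taken from the `VNET_DEFINE` lines in this file. The real kernel mechanism works on linker-set offsets rather than a C struct.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the linker set holding the reference copies. */
struct vnet_ref_set {
	int ipport_randomized;
	int ipport_randomcps;
	int ipport_randomtime;
};

/* Statically initialized reference copy (defaults as in the source above). */
static const struct vnet_ref_set vnet_reference = {
	.ipport_randomized = 1,
	.ipport_randomcps = 10,
	.ipport_randomtime = 45,
};

/* Hypothetical per-stack instance: owns a private copy of the whole set. */
struct toy_vnet {
	void *data;
};

/* Create a stack instance and initialize its variables from the reference copy. */
static struct toy_vnet *
toy_vnet_alloc(void)
{
	struct toy_vnet *vnet;

	vnet = malloc(sizeof(*vnet));
	if (vnet == NULL)
		return (NULL);
	vnet->data = malloc(sizeof(vnet_reference));
	if (vnet->data == NULL) {
		free(vnet);
		return (NULL);
	}
	memcpy(vnet->data, &vnet_reference, sizeof(vnet_reference));
	return (vnet);
}

static void
toy_vnet_free(struct toy_vnet *vnet)
{
	free(vnet->data);
	free(vnet);
}

/* Accessor: per-instance base address plus the offset of the symbol. */
#define	TOY_VNET(vnet, field)						\
	(*(int *)((char *)(vnet)->data +				\
	    offsetof(struct vnet_ref_set, field)))
```

Writing through `TOY_VNET(a, ipport_randomcps)` changes only instance `a`; a second instance and the reference copy keep the default of 10.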
|
|
|
VNET_DEFINE(int, ipport_randomized) = 1; /* user controlled via sysctl */
|
|
|
|
VNET_DEFINE(int, ipport_randomcps) = 10; /* user controlled via sysctl */
|
|
|
|
VNET_DEFINE(int, ipport_randomtime) = 45; /* user controlled via sysctl */
|
|
|
|
VNET_DEFINE(int, ipport_stoprandom); /* toggled by ipport_tick */
|
|
|
|
VNET_DEFINE(int, ipport_tcpallocs);
|
2018-07-24 16:35:52 +00:00
|
|
|
VNET_DEFINE_STATIC(int, ipport_tcplastcount);
|
2009-07-14 22:48:30 +00:00
|
|
|
|
2009-07-16 21:13:04 +00:00
|
|
|
#define V_ipport_tcplastcount VNET(ipport_tcplastcount)
|
2004-04-22 08:32:14 +00:00
|
|
|
|
2011-05-30 18:07:35 +00:00
|
|
|
static void in_pcbremlists(struct inpcb *inp);
|
|
|
|
#ifdef INET
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need to acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags,
supplementing the existing INPLOOKUP_WILDCARD flag, are:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
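The commit message above requires callers to pass exactly one of INPLOOKUP_RLOCKPCB and INPLOOKUP_WLOCKPCB. A minimal sketch of such a check follows. The `TOY_` bit values and the helper name are assumptions made for illustration; the real flag definitions live in the kernel headers and are not shown in this file.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values; the kernel's actual definitions may differ. */
#define	TOY_INPLOOKUP_WILDCARD	0x1
#define	TOY_INPLOOKUP_RLOCKPCB	0x2
#define	TOY_INPLOOKUP_WLOCKPCB	0x4

/*
 * Return true when exactly one of the two lock flags is set.
 * Other flags, such as the wildcard flag, are ignored by the check.
 */
static bool
toy_lookupflags_valid(int lookupflags)
{
	int lockflags;

	lockflags = lookupflags &
	    (TOY_INPLOOKUP_RLOCKPCB | TOY_INPLOOKUP_WLOCKPCB);
	/* Nonzero with a single bit set: x & (x - 1) clears the lowest bit. */
	return (lockflags != 0 && (lockflags & (lockflags - 1)) == 0);
}
```

In a kernel context this kind of predicate would typically sit inside an assertion at the top of the lookup routine, so that a caller passing neither or both flags is caught early.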
|
|
|
static struct inpcb *in_pcblookup_hash_locked(struct inpcbinfo *pcbinfo,
|
|
|
|
struct in_addr faddr, u_int fport_arg,
|
|
|
|
struct in_addr laddr, u_int lport_arg,
|
|
|
|
int lookupflags, struct ifnet *ifp);
|
2011-04-30 11:04:34 +00:00
|
|
|
|
1996-08-12 14:05:54 +00:00
|
|
|
#define RANGECHK(var, min, max) \
|
|
|
|
if ((var) < (min)) { (var) = (min); } \
|
|
|
|
else if ((var) > (max)) { (var) = (max); }
|
|
|
|
|
|
|
|
static int
|
2000-07-04 11:25:35 +00:00
|
|
|
sysctl_net_ipport_check(SYSCTL_HANDLER_ARGS)
|
1996-08-12 14:05:54 +00:00
|
|
|
{
|
2004-04-06 10:59:11 +00:00
|
|
|
int error;
|
|
|
|
|
Permit building kernels with options VIMAGE, restricted to only a single
active network stack instance. Turning on options VIMAGE at compile
time yields the following changes relative to default kernel build:
1) V_ accessor macros for virtualized variables resolve to structure
fields via base pointers, instead of being resolved as fields in global
structs or plain global variables. As an example, V_ifnet becomes:
options VIMAGE: ((struct vnet_net *) vnet_net)->_ifnet
default build: vnet_net_0._ifnet
options VIMAGE_GLOBALS: ifnet
2) INIT_VNET_* macros will declare and set up base pointers to be used
by V_ accessor macros, instead of resolving to whitespace:
INIT_VNET_NET(ifp->if_vnet); becomes
struct vnet_net *vnet_net = (ifp->if_vnet)->mod_data[VNET_MOD_NET];
3) Memory for vnet modules registered via vnet_mod_register() is now
allocated at run time in sys/kern/kern_vimage.c, instead of per vnet
module structs being declared as globals. If required, vnet modules
can now request the framework to provide them with allocated bzeroed
memory by filling in the vmi_size field in their vmi_modinfo structures.
4) structs socket, ifnet, inpcbinfo, tcpcb and syncache_head are
extended to hold a pointer to the parent vnet. options VIMAGE builds
will fill in those fields as required.
5) curvnet is introduced as a new global variable in options VIMAGE
builds, always pointing to the default and only struct vnet.
6) struct sysctl_oid has been extended with two additional fields to
store major and minor virtualization module identifiers, oid_v_subs and
oid_v_mod. SYSCTL_V_* family of macros will fill in those fields
accordingly, and store the offset in the appropriate vnet container
struct in oid_arg1.
In sysctl handlers dealing with virtualized sysctls, the
SYSCTL_RESOLVE_V_ARG1() macro will compute the address of the target
variable and make it available in arg1 variable for further processing.
Unused fields in structs vnet_inet, vnet_inet6 and vnet_ipfw have
been deleted.
Reviewed by: bz, rwatson
Approved by: julian (mentor)
2009-04-30 13:36:26 +00:00
|
|
|
error = sysctl_handle_int(oidp, arg1, arg2, req);
|
2004-04-06 10:59:11 +00:00
|
|
|
if (error == 0) {
|
Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).
This is the first in a series of commits over the course
of the next few weeks.
Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.
We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.
Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch
2008-08-17 23:27:27 +00:00
|
|
|
RANGECHK(V_ipport_lowfirstauto, 1, IPPORT_RESERVED - 1);
|
|
|
|
RANGECHK(V_ipport_lowlastauto, 1, IPPORT_RESERVED - 1);
|
|
|
|
RANGECHK(V_ipport_firstauto, IPPORT_RESERVED, IPPORT_MAX);
|
|
|
|
RANGECHK(V_ipport_lastauto, IPPORT_RESERVED, IPPORT_MAX);
|
|
|
|
RANGECHK(V_ipport_hifirstauto, IPPORT_RESERVED, IPPORT_MAX);
|
|
|
|
RANGECHK(V_ipport_hilastauto, IPPORT_RESERVED, IPPORT_MAX);
|
1996-08-12 14:05:54 +00:00
|
|
|
}
|
2004-04-06 10:59:11 +00:00
|
|
|
return (error);
|
1996-08-12 14:05:54 +00:00
|
|
|
}
|
1996-02-22 21:32:23 +00:00
|
|
|
|
1996-08-12 14:05:54 +00:00
|
|
|
#undef RANGECHK
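The clamping that sysctl_net_ipport_check() applies after a successful write can be tried out in user space. This is a sketch under stated assumptions: the `TOY_` constants mirror the usual values of IPPORT_RESERVED (1024) and IPPORT_MAX (65535), the macro is wrapped in `do { } while (0)` so it behaves as a single statement, and the two helper functions are hypothetical.

```c
#include <assert.h>

/* Assumed values, mirroring IPPORT_RESERVED and IPPORT_MAX. */
#define	TOY_IPPORT_RESERVED	1024
#define	TOY_IPPORT_MAX		65535

/* Clamp var into [min, max], as RANGECHK does in the handler above. */
#define	TOY_RANGECHK(var, min, max) do {				\
	if ((var) < (min)) { (var) = (min); }				\
	else if ((var) > (max)) { (var) = (max); }			\
} while (0)

/* Low (reserved) port range: 1 .. IPPORT_RESERVED - 1. */
static int
toy_clamp_low(int port)
{
	TOY_RANGECHK(port, 1, TOY_IPPORT_RESERVED - 1);
	return (port);
}

/* Default and high port ranges: IPPORT_RESERVED .. IPPORT_MAX. */
static int
toy_clamp_high(int port)
{
	TOY_RANGECHK(port, TOY_IPPORT_RESERVED, TOY_IPPORT_MAX);
	return (port);
}
```

An out-of-range value is silently pulled to the nearest bound rather than rejected, so the write itself still succeeds.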
|
1996-01-19 08:00:58 +00:00
|
|
|
|
2020-02-26 14:26:36 +00:00
|
|
|
static SYSCTL_NODE(_net_inet_ip, IPPROTO_IP, portrange,
|
|
|
|
CTLFLAG_RW | CTLFLAG_MPSAFE, 0,
|
2011-11-07 15:43:11 +00:00
|
|
|
"IP Ports");
|
1996-08-12 14:05:54 +00:00
|
|
|
|
2014-11-07 09:39:05 +00:00
|
|
|
SYSCTL_PROC(_net_inet_ip_portrange, OID_AUTO, lowfirst,
|
2020-02-26 14:26:36 +00:00
|
|
|
CTLFLAG_VNET | CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_NEEDGIANT,
|
|
|
|
&VNET_NAME(ipport_lowfirstauto), 0, &sysctl_net_ipport_check, "I",
|
|
|
|
"");
|
2014-11-07 09:39:05 +00:00
|
|
|
SYSCTL_PROC(_net_inet_ip_portrange, OID_AUTO, lowlast,
|
2020-02-26 14:26:36 +00:00
|
|
|
CTLFLAG_VNET | CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_NEEDGIANT,
|
|
|
|
&VNET_NAME(ipport_lowlastauto), 0, &sysctl_net_ipport_check, "I",
|
|
|
|
"");
|
2014-11-07 09:39:05 +00:00
|
|
|
SYSCTL_PROC(_net_inet_ip_portrange, OID_AUTO, first,
|
2020-02-26 14:26:36 +00:00
|
|
|
CTLFLAG_VNET | CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_NEEDGIANT,
|
|
|
|
&VNET_NAME(ipport_firstauto), 0, &sysctl_net_ipport_check, "I",
|
|
|
|
"");
|
2014-11-07 09:39:05 +00:00
|
|
|
SYSCTL_PROC(_net_inet_ip_portrange, OID_AUTO, last,
|
2020-02-26 14:26:36 +00:00
|
|
|
CTLFLAG_VNET | CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_NEEDGIANT,
|
|
|
|
&VNET_NAME(ipport_lastauto), 0, &sysctl_net_ipport_check, "I",
|
|
|
|
"");
|
2014-11-07 09:39:05 +00:00
|
|
|
SYSCTL_PROC(_net_inet_ip_portrange, OID_AUTO, hifirst,
|
2020-02-26 14:26:36 +00:00
|
|
|
CTLFLAG_VNET | CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_NEEDGIANT,
|
|
|
|
&VNET_NAME(ipport_hifirstauto), 0, &sysctl_net_ipport_check, "I",
|
|
|
|
"");
|
2014-11-07 09:39:05 +00:00
|
|
|
SYSCTL_PROC(_net_inet_ip_portrange, OID_AUTO, hilast,
|
2020-02-26 14:26:36 +00:00
|
|
|
CTLFLAG_VNET | CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_NEEDGIANT,
|
|
|
|
&VNET_NAME(ipport_hilastauto), 0, &sysctl_net_ipport_check, "I",
|
|
|
|
"");
|
2014-11-07 09:39:05 +00:00
|
|
|
SYSCTL_INT(_net_inet_ip_portrange, OID_AUTO, reservedhigh,
|
|
|
|
CTLFLAG_VNET | CTLFLAG_RW | CTLFLAG_SECURE,
|
|
|
|
&VNET_NAME(ipport_reservedhigh), 0, "");
|
|
|
|
SYSCTL_INT(_net_inet_ip_portrange, OID_AUTO, reservedlow,
|
2009-07-14 22:48:30 +00:00
|
|
|
CTLFLAG_RW|CTLFLAG_SECURE, &VNET_NAME(ipport_reservedlow), 0, "");
|
2014-11-07 09:39:05 +00:00
|
|
|
SYSCTL_INT(_net_inet_ip_portrange, OID_AUTO, randomized,
|
|
|
|
CTLFLAG_VNET | CTLFLAG_RW,
|
2009-07-14 22:48:30 +00:00
|
|
|
&VNET_NAME(ipport_randomized), 0, "Enable random port allocation");
|
2014-11-07 09:39:05 +00:00
|
|
|
SYSCTL_INT(_net_inet_ip_portrange, OID_AUTO, randomcps,
|
|
|
|
CTLFLAG_VNET | CTLFLAG_RW,
|
2009-07-14 22:48:30 +00:00
|
|
|
&VNET_NAME(ipport_randomcps), 0, "Maximum number of random port "
|
Step 1.5 of importing the network stack virtualization infrastructure
from the vimage project, as per plan established at devsummit 08/08:
http://wiki.freebsd.org/Image/Notes200808DevSummit
Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator
macros, and CURVNET_SET() context setting macros, all currently
resolving to NOPs.
Prepare for virtualization of selected SYSCTL objects by introducing a
family of SYSCTL_V_*() macros, currently resolving to their global
counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT().
Move selected #defines from sys/sys/vimage.h to newly introduced header
files specific to virtualized subsystems (sys/net/vnet.h,
sys/netinet/vinet.h etc.).
All the changes are verified to have zero functional impact at this
point in time by doing MD5 comparison between pre- and post-change
object files(*).
(*) netipsec/keysock.c did not validate depending on compile time options.
Implemented by: julian, bz, brooks, zec
Reviewed by: julian, bz, brooks, kris, rwatson, ...
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation
2008-10-02 15:37:58 +00:00
|
|
|
"allocations before switching to a sequential one");
|
2014-11-07 09:39:05 +00:00
|
|
|
SYSCTL_INT(_net_inet_ip_portrange, OID_AUTO, randomtime,
|
|
|
|
CTLFLAG_VNET | CTLFLAG_RW,
|
2009-07-14 22:48:30 +00:00
|
|
|
&VNET_NAME(ipport_randomtime), 0,
|
2008-10-02 15:37:58 +00:00
|
|
|
"Minimum time to keep sequential port "
|
|
|
|
"allocation before switching to a random one");
|
2019-08-01 14:17:31 +00:00
|
|
|
|
|
|
|
#ifdef RATELIMIT
|
|
|
|
counter_u64_t rate_limit_active;
|
|
|
|
counter_u64_t rate_limit_alloc_fail;
|
|
|
|
counter_u64_t rate_limit_set_ok;
|
|
|
|
|
2020-02-26 14:26:36 +00:00
|
|
|
static SYSCTL_NODE(_net_inet_ip, OID_AUTO, rl, CTLFLAG_RD | CTLFLAG_MPSAFE, 0,
|
2019-08-01 14:17:31 +00:00
|
|
|
"IP Rate Limiting");
|
|
|
|
SYSCTL_COUNTER_U64(_net_inet_ip_rl, OID_AUTO, active, CTLFLAG_RD,
|
|
|
|
&rate_limit_active, "Active rate limited connections");
|
|
|
|
SYSCTL_COUNTER_U64(_net_inet_ip_rl, OID_AUTO, alloc_fail, CTLFLAG_RD,
|
|
|
|
&rate_limit_alloc_fail, "Rate limited connection failures");
|
|
|
|
SYSCTL_COUNTER_U64(_net_inet_ip_rl, OID_AUTO, set_ok, CTLFLAG_RD,
|
|
|
|
&rate_limit_set_ok, "Rate limited setting succeeded");
|
|
|
|
#endif /* RATELIMIT */
|
|
|
|
|
2012-01-22 02:13:19 +00:00
|
|
|
#endif /* INET */
|
1995-11-14 20:34:56 +00:00
|
|
|
|
Improved connection establishment performance by doing local port lookups via
a hashed port list. In the new scheme, in_pcblookup() goes away and is
replaced by a new routine, in_pcblookup_local() for doing the local port
check. Note that this implementation is space inefficient in that the PCB
struct is now too large to fit into 128 bytes. I might deal with this in the
future by using the new zone allocator, but I wanted these changes to be
extensively tested in their current form first.
Also:
1) Fixed off-by-one errors in the port lookup loops in in_pcbbind().
2) Got rid of some unneeded rehashing. Added a new routine, in_pcbinshash(),
to do the initial hash insertion.
3) Renamed in_pcblookuphash() to in_pcblookup_hash() for easier readability.
4) Added a new routine, in_pcbremlists() to remove the PCB from the various
hash lists.
5) Added/deleted comments where appropriate.
6) Removed unnecessary splnet() locking. In general, the PCB functions should
be called at splnet()...there are unfortunately a few exceptions, however.
7) Reorganized a few structs for better cache line behavior.
8) Killed my TCP_ACK_HACK kludge. It may come back in a different form in
the future, however.
These changes have been tested on wcarchive for more than a month. In tests
done here, connection establishment overhead is reduced by more than 50
times, thus getting rid of one of the major networking scalability problems.
Still to do: make tcp_fasttimo/tcp_slowtimo scale well for systems with a
large number of connections. tcp_fasttimo is easy; tcp_slowtimo is difficult.
WARNING: Anything that knows about inpcb and tcpcb structs will have to be
recompiled; at the very least, this includes netstat(1).
1998-01-27 09:15:13 +00:00
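The idea of "local port lookups via a hashed port list" from the commit message above can be sketched as a chained hash table keyed on the local port. All names here are hypothetical, the table size is an arbitrary power of two, and ports are kept in host byte order for simplicity; the kernel's own hash macro and list types are not reproduced.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	TOY_PORTHASH_SIZE	16	/* must be a power of two */

/* Hypothetical, heavily reduced PCB: just a local port and a chain link. */
struct toy_pcb {
	uint16_t	lport;
	struct toy_pcb	*hash_next;
};

static struct toy_pcb *toy_porthash[TOY_PORTHASH_SIZE];

/* Bucket index: mask the port with (size - 1). */
static unsigned
toy_porthash_index(uint16_t lport)
{
	return (lport & (TOY_PORTHASH_SIZE - 1));
}

/* Insert a PCB at the head of its bucket's chain. */
static void
toy_pcb_inshash(struct toy_pcb *pcb)
{
	unsigned idx = toy_porthash_index(pcb->lport);

	pcb->hash_next = toy_porthash[idx];
	toy_porthash[idx] = pcb;
}

/* Walk only the one bucket that can contain the port. */
static struct toy_pcb *
toy_pcb_lookup_local(uint16_t lport)
{
	struct toy_pcb *pcb;

	for (pcb = toy_porthash[toy_porthash_index(lport)]; pcb != NULL;
	    pcb = pcb->hash_next) {
		if (pcb->lport == lport)
			return (pcb);
	}
	return (NULL);
}
```

The gain over a single global list is that a bind-time port check walks one short chain instead of every PCB in the system.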
|
|
|
/*
|
|
|
|
* in_pcb.c: manage the Protocol Control Blocks.
|
|
|
|
*
|
2005-07-19 12:24:27 +00:00
|
|
|
* NOTE: It is assumed that most of these functions will be called with
|
|
|
|
* the pcbinfo lock held, and often, the inpcb lock held, as these utility
|
|
|
|
* functions often modify hash chains or addresses in pcbs.
|
1998-01-27 09:15:13 +00:00
|
|
|
*/
|
|
|
|
|
2018-06-06 15:45:57 +00:00
|
|
|
static struct inpcblbgroup *
|
|
|
|
in_pcblbgroup_alloc(struct inpcblbgrouphead *hdr, u_char vflag,
|
|
|
|
uint16_t port, const union in_dependaddr *addr, int size)
|
|
|
|
{
|
|
|
|
struct inpcblbgroup *grp;
|
|
|
|
size_t bytes;
|
|
|
|
|
|
|
|
bytes = __offsetof(struct inpcblbgroup, il_inp[size]);
|
|
|
|
grp = malloc(bytes, M_PCB, M_ZERO | M_NOWAIT);
|
|
|
|
if (!grp)
|
|
|
|
return (NULL);
|
|
|
|
grp->il_vflag = vflag;
|
|
|
|
grp->il_lport = port;
|
|
|
|
grp->il_dependladdr = *addr;
|
|
|
|
grp->il_inpsiz = size;
|
2018-09-10 19:00:29 +00:00
|
|
|
CK_LIST_INSERT_HEAD(hdr, grp, il_list);
|
2018-06-06 15:45:57 +00:00
|
|
|
return (grp);
|
|
|
|
}
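in_pcblbgroup_alloc() sizes its allocation from the offset of a trailing array element, so that the header and `size` slots come from one malloc() call. A portable user-space sketch of that pattern, using a C99 flexible array member, might look like this. The struct and function names are hypothetical, and `calloc()` stands in for the zeroing, non-sleeping kernel allocation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical group: fixed header followed by a variable number of slots. */
struct toy_group {
	int	size;		/* capacity of slots[] */
	int	count;		/* slots currently in use */
	void	*slots[];	/* flexible array member */
};

/* Bytes needed for the header plus 'size' slots. */
static size_t
toy_group_bytes(int size)
{
	return (offsetof(struct toy_group, slots) +
	    (size_t)size * sizeof(void *));
}

/* Allocate a zeroed group with room for 'size' slots, or NULL on failure. */
static struct toy_group *
toy_group_alloc(int size)
{
	struct toy_group *grp;

	if (size < 0)
		return (NULL);
	grp = calloc(1, toy_group_bytes(size));
	if (grp == NULL)
		return (NULL);
	grp->size = size;
	return (grp);
}
```

Because the capacity is fixed at allocation time, growing or shrinking such a group means allocating a new one and copying the slots across.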
|
|
|
|
|
|
|
|
static void
|
2018-09-10 19:00:29 +00:00
|
|
|
in_pcblbgroup_free_deferred(epoch_context_t ctx)
|
2018-06-06 15:45:57 +00:00
|
|
|
{
|
2018-09-10 19:00:29 +00:00
|
|
|
struct inpcblbgroup *grp;
|
2018-06-06 15:45:57 +00:00
|
|
|
|
2018-09-10 19:00:29 +00:00
|
|
|
grp = __containerof(ctx, struct inpcblbgroup, il_epoch_ctx);
|
2018-09-03 17:39:09 +00:00
|
|
|
free(grp, M_PCB);
|
2018-06-06 15:45:57 +00:00
|
|
|
}
|
|
|
|
|
2018-09-10 19:00:29 +00:00
|
|
|
static void
|
|
|
|
in_pcblbgroup_free(struct inpcblbgroup *grp)
|
|
|
|
{
|
|
|
|
|
|
|
|
CK_LIST_REMOVE(grp, il_list);
|
2020-01-15 06:05:20 +00:00
|
|
|
NET_EPOCH_CALL(in_pcblbgroup_free_deferred, &grp->il_epoch_ctx);
|
2018-09-10 19:00:29 +00:00
|
|
|
}
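The two functions above split freeing into "unlink now, free later": the group is removed from its list immediately, and the memory is released by a callback that recovers the group from its embedded context. A toy user-space model of that pattern is sketched below. The pending queue, `toy_epoch_call()` and `toy_epoch_drain()` are hypothetical stand-ins; a real epoch implementation runs callbacks only once concurrent readers have left, which this sketch does not model.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical callback context, embedded in the object to be freed. */
struct toy_epoch_ctx {
	void			(*cb)(struct toy_epoch_ctx *);
	struct toy_epoch_ctx	*next;
};

struct toy_group {
	int			lport;
	struct toy_epoch_ctx	epoch_ctx;
};

/* Recover the enclosing object from a pointer to one of its members. */
#define	TOY_CONTAINEROF(ptr, type, member)				\
	((type *)((char *)(ptr) - offsetof(type, member)))

static struct toy_epoch_ctx *toy_pending;	/* queued callbacks */
static int toy_freed;				/* groups freed so far */

/* Deferred callback: map the context back to its group and free it. */
static void
toy_group_free_deferred(struct toy_epoch_ctx *ctx)
{
	struct toy_group *grp;

	grp = TOY_CONTAINEROF(ctx, struct toy_group, epoch_ctx);
	free(grp);
	toy_freed++;
}

/* Queue a callback instead of running it immediately. */
static void
toy_epoch_call(void (*cb)(struct toy_epoch_ctx *), struct toy_epoch_ctx *ctx)
{
	ctx->cb = cb;
	ctx->next = toy_pending;
	toy_pending = ctx;
}

/* Run all queued callbacks; stands in for "all readers have left". */
static void
toy_epoch_drain(void)
{
	struct toy_epoch_ctx *ctx;

	while (toy_pending != NULL) {
		ctx = toy_pending;
		toy_pending = ctx->next;	/* read before cb frees ctx */
		ctx->cb(ctx);
	}
}
```

Embedding the context in the object means no extra allocation is needed at free time, which matters on a path that must not fail.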
|
|
|
|
|
2018-06-06 15:45:57 +00:00
|
|
|
static struct inpcblbgroup *
|
|
|
|
in_pcblbgroup_resize(struct inpcblbgrouphead *hdr,
|
|
|
|
struct inpcblbgroup *old_grp, int size)
|
|
|
|
{
|
|
|
|
struct inpcblbgroup *grp;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
grp = in_pcblbgroup_alloc(hdr, old_grp->il_vflag,
|
|
|
|
old_grp->il_lport, &old_grp->il_dependladdr, size);
|
2018-11-01 15:51:49 +00:00
|
|
|
if (grp == NULL)
|
2018-06-06 15:45:57 +00:00
|
|
|
return (NULL);
|
|
|
|
|
|
|
|
KASSERT(old_grp->il_inpcnt < grp->il_inpsiz,
|
|
|
|
("invalid new local group size %d and old local group count %d",
|
|
|
|
grp->il_inpsiz, old_grp->il_inpcnt));
|
|
|
|
|
|
|
|
for (i = 0; i < old_grp->il_inpcnt; ++i)
|
|
|
|
grp->il_inp[i] = old_grp->il_inp[i];
|
|
|
|
grp->il_inpcnt = old_grp->il_inpcnt;
|
|
|
|
in_pcblbgroup_free(old_grp);
|
|
|
|
return (grp);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* PCB at index 'i' is removed from the group. Pull up the ones below il_inp[i]
|
|
|
|
* and shrink group if possible.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
in_pcblbgroup_reorder(struct inpcblbgrouphead *hdr, struct inpcblbgroup **grpp,
|
|
|
|
int i)
|
|
|
|
{
|
2018-11-01 15:51:49 +00:00
|
|
|
struct inpcblbgroup *grp, *new_grp;
|
2018-06-06 15:45:57 +00:00
|
|
|
|
2018-11-01 15:51:49 +00:00
|
|
|
grp = *grpp;
|
2018-06-06 15:45:57 +00:00
|
|
|
for (; i + 1 < grp->il_inpcnt; ++i)
|
|
|
|
grp->il_inp[i] = grp->il_inp[i + 1];
|
|
|
|
grp->il_inpcnt--;
|
|
|
|
|
|
|
|
if (grp->il_inpsiz > INPCBLBGROUP_SIZMIN &&
|
2018-11-01 15:51:49 +00:00
|
|
|
grp->il_inpcnt <= grp->il_inpsiz / 4) {
|
2018-06-06 15:45:57 +00:00
|
|
|
/* Shrink this group. */
|
2018-11-01 15:51:49 +00:00
|
|
|
new_grp = in_pcblbgroup_resize(hdr, grp, grp->il_inpsiz / 2);
|
|
|
|
if (new_grp != NULL)
|
2018-06-06 15:45:57 +00:00
|
|
|
*grpp = new_grp;
|
|
|
|
}
|
|
|
|
}
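The removal path can be read as two separate decisions: close the gap left at index i, then decide whether the array is sparse enough to halve. A hedged sketch in plain C, with hypothetical helper names (pull_up, capacity_after_remove) and an assumed GROUP_SIZMIN of 8 standing in for INPCBLBGROUP_SIZMIN, whose definition is not shown here:

```c
#include <stddef.h>

#define	GROUP_SIZMIN	8	/* assumed floor, stand-in for INPCBLBGROUP_SIZMIN */

/* Remove slot i by shifting later entries down one place; returns the new count. */
static size_t
pull_up(void **slots, size_t count, size_t i)
{
	for (; i + 1 < count; i++)
		slots[i] = slots[i + 1];
	return (count - 1);
}

/* Capacity to use after a removal: halve once the array is at most a quarter full. */
static size_t
capacity_after_remove(size_t size, size_t count)
{
	if (size > GROUP_SIZMIN && count <= size / 4)
		return (size / 2);
	return (size);
}
```

Halving only at one-quarter occupancy leaves the new array at most half full, which keeps the KASSERT in in_pcblbgroup_resize() satisfied and avoids resizing back and forth when membership hovers around a boundary.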
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Add PCB to load balance group for SO_REUSEPORT_LB option.
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
in_pcbinslbgrouphash(struct inpcb *inp)
|
|
|
|
{
|
2018-09-07 21:11:41 +00:00
|
|
|
static const struct timeval interval = { 60, 0 };
|
|
|
|
static struct timeval lastprint;
|
2018-06-06 15:45:57 +00:00
|
|
|
struct inpcbinfo *pcbinfo;
|
|
|
|
struct inpcblbgrouphead *hdr;
|
|
|
|
struct inpcblbgroup *grp;
|
2018-11-01 15:51:49 +00:00
|
|
|
uint32_t idx;
|
2018-06-06 15:45:57 +00:00
|
|
|
|
|
|
|
pcbinfo = inp->inp_pcbinfo;
|
|
|
|
|
|
|
|
INP_WLOCK_ASSERT(inp);
|
|
|
|
INP_HASH_WLOCK_ASSERT(pcbinfo);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Don't allow jailed socket to join local group.
|
|
|
|
*/
|
2018-11-01 15:51:49 +00:00
|
|
|
if (inp->inp_socket != NULL && jailed(inp->inp_socket->so_cred))
|
2018-06-06 15:45:57 +00:00
|
|
|
return (0);
|
|
|
|
|
|
|
|
#ifdef INET6
|
|
|
|
/*
|
|
|
|
* Don't allow IPv4 mapped INET6 wild socket.
|
|
|
|
*/
|
|
|
|
if ((inp->inp_vflag & INP_IPV4) &&
|
|
|
|
inp->inp_laddr.s_addr == INADDR_ANY &&
|
|
|
|
INP_CHECK_SOCKAF(inp->inp_socket, AF_INET6)) {
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2018-12-05 17:06:00 +00:00
|
|
|
idx = INP_PCBPORTHASH(inp->inp_lport, pcbinfo->ipi_lbgrouphashmask);
|
2018-11-01 15:51:49 +00:00
|
|
|
hdr = &pcbinfo->ipi_lbgrouphashbase[idx];
|
2018-09-10 19:00:29 +00:00
|
|
|
CK_LIST_FOREACH(grp, hdr, il_list) {
|
2018-06-06 15:45:57 +00:00
|
|
|
if (grp->il_vflag == inp->inp_vflag &&
|
|
|
|
grp->il_lport == inp->inp_lport &&
|
|
|
|
memcmp(&grp->il_dependladdr,
|
2018-11-01 15:51:49 +00:00
|
|
|
&inp->inp_inc.inc_ie.ie_dependladdr,
|
|
|
|
sizeof(grp->il_dependladdr)) == 0)
|
2018-06-06 15:45:57 +00:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (grp == NULL) {
|
|
|
|
/* Create new load balance group. */
|
|
|
|
grp = in_pcblbgroup_alloc(hdr, inp->inp_vflag,
|
|
|
|
inp->inp_lport, &inp->inp_inc.inc_ie.ie_dependladdr,
|
|
|
|
INPCBLBGROUP_SIZMIN);
|
2018-11-01 15:51:49 +00:00
|
|
|
if (grp == NULL)
|
2018-06-06 15:45:57 +00:00
|
|
|
return (ENOBUFS);
|
|
|
|
} else if (grp->il_inpcnt == grp->il_inpsiz) {
|
|
|
|
if (grp->il_inpsiz >= INPCBLBGROUP_SIZMAX) {
|
2018-09-07 21:11:41 +00:00
|
|
|
if (ratecheck(&lastprint, &interval))
|
2018-06-06 15:45:57 +00:00
|
|
|
printf("lb group port %d, limit reached\n",
|
|
|
|
ntohs(grp->il_lport));
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Expand this local group. */
|
|
|
|
grp = in_pcblbgroup_resize(hdr, grp, grp->il_inpsiz * 2);
|
2018-11-01 15:51:49 +00:00
|
|
|
if (grp == NULL)
|
2018-06-06 15:45:57 +00:00
|
|
|
return (ENOBUFS);
|
|
|
|
}
|
|
|
|
|
|
|
|
KASSERT(grp->il_inpcnt < grp->il_inpsiz,
|
2018-11-01 15:51:49 +00:00
|
|
|
("invalid local group size %d and count %d", grp->il_inpsiz,
|
|
|
|
grp->il_inpcnt));
|
2018-06-06 15:45:57 +00:00
|
|
|
|
|
|
|
grp->il_inp[grp->il_inpcnt] = inp;
|
|
|
|
grp->il_inpcnt++;
|
|
|
|
return (0);
|
|
|
|
}
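The limit-reached message above is throttled with ratecheck(9) so that a busy port cannot flood the console. The sketch below shows the idea in userland C at whole-second resolution. rate_ok is a hypothetical helper, the clock value is passed in explicitly so the logic can be exercised without waiting, and it illustrates the pattern rather than reproducing the kernel routine.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical helper modelled on ratecheck(9): true if an event may fire at `now`. */
static bool
rate_ok(time_t *last, time_t min_interval, time_t now)
{
	/* A zero timestamp means "never fired", so the first event always passes. */
	if (*last == 0 || now - *last >= min_interval) {
		*last = now;	/* remember when the event was last allowed */
		return (true);
	}
	return (false);
}
```

With a 60-second interval, as in the function above, a caller would print only when rate_ok() returns true and stay silent otherwise.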
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Remove PCB from load balance group.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
in_pcbremlbgrouphash(struct inpcb *inp)
|
|
|
|
{
|
|
|
|
struct inpcbinfo *pcbinfo;
|
|
|
|
struct inpcblbgrouphead *hdr;
|
|
|
|
struct inpcblbgroup *grp;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
pcbinfo = inp->inp_pcbinfo;
|
|
|
|
|
|
|
|
INP_WLOCK_ASSERT(inp);
|
|
|
|
INP_HASH_WLOCK_ASSERT(pcbinfo);
|
|
|
|
|
|
|
|
hdr = &pcbinfo->ipi_lbgrouphashbase[
|
2018-12-05 17:06:00 +00:00
|
|
|
INP_PCBPORTHASH(inp->inp_lport, pcbinfo->ipi_lbgrouphashmask)];
|
2018-09-10 19:00:29 +00:00
|
|
|
CK_LIST_FOREACH(grp, hdr, il_list) {
|
2018-06-06 15:45:57 +00:00
|
|
|
for (i = 0; i < grp->il_inpcnt; ++i) {
|
|
|
|
if (grp->il_inp[i] != inp)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (grp->il_inpcnt == 1) {
|
|
|
|
/* We are the last, free this local group. */
|
|
|
|
in_pcblbgroup_free(grp);
|
|
|
|
} else {
|
|
|
|
/* Pull up inpcbs, shrink group if possible. */
|
|
|
|
in_pcblbgroup_reorder(hdr, &grp, i);
|
|
|
|
}
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2017-05-15 21:58:36 +00:00
|
|
|
/*
|
|
|
|
* Different protocols initialize their inpcbs differently - giving
|
|
|
|
* different names to the lock. But they are all disposed of the same way.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
inpcb_fini(void *mem, int size)
|
|
|
|
{
|
|
|
|
struct inpcb *inp = mem;
|
|
|
|
|
|
|
|
INP_LOCK_DESTROY(inp);
|
|
|
|
}
|
|
|
|
|
2010-03-14 18:59:11 +00:00
|
|
|
/*
|
|
|
|
* Initialize an inpcbinfo -- we should be able to reduce the number of
|
|
|
|
* arguments in time.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
in_pcbinfo_init(struct inpcbinfo *pcbinfo, const char *name,
|
|
|
|
struct inpcbhead *listhead, int hash_nelements, int porthash_nelements,
|
2017-05-15 21:58:36 +00:00
|
|
|
char *inpcbzone_name, uma_init inpcbzone_init, u_int hashfields)
|
2010-03-14 18:59:11 +00:00
|
|
|
{
|
|
|
|
|
2018-12-05 17:06:00 +00:00
|
|
|
porthash_nelements = imin(porthash_nelements, IPPORT_MAX + 1);
|
|
|
|
|
2010-03-14 18:59:11 +00:00
|
|
|
INP_INFO_LOCK_INIT(pcbinfo, name);
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. The new lookup flags,
which supplement the existing INPLOOKUP_WILDCARD flag, are:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
|
|
|
INP_HASH_LOCK_INIT(pcbinfo, "pcbinfohash"); /* XXXRW: argument? */
|
2015-08-03 12:13:54 +00:00
|
|
|
INP_LIST_LOCK_INIT(pcbinfo, "pcbinfolist");
|
2010-03-14 18:59:11 +00:00
|
|
|
#ifdef VIMAGE
|
|
|
|
pcbinfo->ipi_vnet = curvnet;
|
|
|
|
#endif
|
|
|
|
pcbinfo->ipi_listhead = listhead;
|
2018-06-12 22:18:20 +00:00
|
|
|
CK_LIST_INIT(pcbinfo->ipi_listhead);
|
2011-05-30 09:43:55 +00:00
|
|
|
pcbinfo->ipi_count = 0;
|
2010-03-14 18:59:11 +00:00
|
|
|
pcbinfo->ipi_hashbase = hashinit(hash_nelements, M_PCB,
|
|
|
|
&pcbinfo->ipi_hashmask);
|
|
|
|
pcbinfo->ipi_porthashbase = hashinit(porthash_nelements, M_PCB,
|
|
|
|
&pcbinfo->ipi_porthashmask);
|
2018-12-05 17:06:00 +00:00
|
|
|
pcbinfo->ipi_lbgrouphashbase = hashinit(porthash_nelements, M_PCB,
|
2018-06-06 15:45:57 +00:00
|
|
|
&pcbinfo->ipi_lbgrouphashmask);
|
Implement a CPU-affine TCP and UDP connection lookup data structure,
struct inpcbgroup. pcbgroups, or "connection groups", supplement the
existing inpcbinfo connection hash table, which when pcbgroups are
enabled, might now be thought of more usefully as a per-protocol
4-tuple reservation table.
Connections are assigned to connection groups based on a hash of their
4-tuple; wildcard sockets require special handling, and are members
of all connection groups. During a connection lookup, a
per-connection group lock is employed rather than the global pcbinfo
lock. By aligning connection groups with input path processing,
connection groups take on an effective CPU affinity, especially when
aligned with RSS work placement (see a forthcoming commit for
details). This eliminates cache line migration associated with
global, protocol-layer data structures in steady state TCP and UDP
processing (with the exception of protocol-layer statistics; further
commit to follow).
Elements of this approach were inspired by Willman, Rixner, and Cox's
2006 USENIX paper, "An Evaluation of Network Stack Parallelization
Strategies in Modern Operating Systems". However, there are also
significant differences: we maintain the inpcb lock, rather than using
the connection group lock for per-connection state.
Likewise, the focus of this implementation is alignment with NIC
packet distribution strategies such as RSS, rather than pure software
strategies. Despite that focus, software distribution is supported
through the parallel netisr implementation, and works well in
configurations where the number of hardware threads is greater than
the number of NIC input queues, such as in the RMI XLR threaded MIPS
architecture.
Another important difference is the continued maintenance of existing
hash tables as "reservation tables" -- these are useful both to
distinguish the resource allocation aspect of protocol name management
and the more common-case lookup aspect. In configurations where
connection tables are aligned with hardware hashes, it is desirable to
use the traditional lookup tables for loopback or encapsulated traffic
rather than take the expense of hardware hashes that are hard to
implement efficiently in software (such as RSS Toeplitz).
Connection group support is enabled by compiling "options PCBGROUP"
into your kernel configuration; for the time being, this is an
experimental feature, and hence is not enabled by default.
Subject to the limited MFCability of change dependencies in inpcb,
and its change to the inpcbinfo init function signature, this change
in principle could be merged to FreeBSD 8.x.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-06-06 12:55:02 +00:00
|
|
|
#ifdef PCBGROUP
|
|
|
|
in_pcbgroup_init(pcbinfo, hashfields, hash_nelements);
|
|
|
|
#endif
|
2010-03-14 18:59:11 +00:00
|
|
|
pcbinfo->ipi_zone = uma_zcreate(inpcbzone_name, sizeof(struct inpcb),
|
2017-05-15 21:58:36 +00:00
|
|
|
NULL, NULL, inpcbzone_init, inpcb_fini, UMA_ALIGN_PTR, 0);
|
2010-03-14 18:59:11 +00:00
|
|
|
uma_zone_set_max(pcbinfo->ipi_zone, maxsockets);
|
2012-12-08 12:51:06 +00:00
|
|
|
uma_zone_set_warning(pcbinfo->ipi_zone,
|
|
|
|
"kern.ipc.maxsockets limit reached");
|
2010-03-14 18:59:11 +00:00
|
|
|
}
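The initializer above clamps porthash_nelements to IPPORT_MAX + 1 and sizes both the port table and the load-balance group table from it. A rough sketch of how a requested size becomes a mask and how a port then selects a bucket is given below. It assumes that hashinit(9) rounds the request down to a power of two and returns that size minus one as the mask, and that INP_PCBPORTHASH() reduces to a bitwise AND on the host-order port. Neither definition appears in this section, so treat both as assumptions. hash_mask and port_bucket are hypothetical names.

```c
#include <stdint.h>

/* Largest power of two <= nelements, minus one (assumes nelements >= 1). */
static uint32_t
hash_mask(uint32_t nelements)
{
	uint32_t size = 1;

	while (size <= nelements / 2)
		size <<= 1;
	return (size - 1);
}

/* Bucket index for a port already converted to host byte order. */
static uint32_t
port_bucket(uint16_t port_host_order, uint32_t mask)
{
	return (port_host_order & mask);
}
```

Under these assumptions a request for 100 buckets yields 64 of them, and the clamp keeps the table from ever having more buckets than there are port numbers.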
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Destroy an inpcbinfo.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
in_pcbinfo_destroy(struct inpcbinfo *pcbinfo)
|
|
|
|
{
|
|
|
|
|
2011-05-30 09:43:55 +00:00
|
|
|
KASSERT(pcbinfo->ipi_count == 0,
|
|
|
|
("%s: ipi_count = %u", __func__, pcbinfo->ipi_count));
|
|
|
|
|
2010-03-14 18:59:11 +00:00
|
|
|
hashdestroy(pcbinfo->ipi_hashbase, M_PCB, pcbinfo->ipi_hashmask);
|
|
|
|
hashdestroy(pcbinfo->ipi_porthashbase, M_PCB,
|
|
|
|
pcbinfo->ipi_porthashmask);
|
2018-06-06 15:45:57 +00:00
|
|
|
hashdestroy(pcbinfo->ipi_lbgrouphashbase, M_PCB,
|
|
|
|
pcbinfo->ipi_lbgrouphashmask);
|
2011-06-06 12:55:02 +00:00
|
|
|
#ifdef PCBGROUP
|
|
|
|
in_pcbgroup_destroy(pcbinfo);
|
|
|
|
#endif
|
2010-03-14 18:59:11 +00:00
|
|
|
uma_zdestroy(pcbinfo->ipi_zone);
|
2015-08-03 12:13:54 +00:00
|
|
|
INP_LIST_LOCK_DESTROY(pcbinfo);
|
2011-05-30 09:43:55 +00:00
|
|
|
INP_HASH_LOCK_DESTROY(pcbinfo);
|
2010-03-14 18:59:11 +00:00
|
|
|
INP_INFO_LOCK_DESTROY(pcbinfo);
|
|
|
|
}
|
|
|
|
|
Improved connection establishment performance by doing local port lookups via
a hashed port list. In the new scheme, in_pcblookup() goes away and is
replaced by a new routine, in_pcblookup_local() for doing the local port
check. Note that this implementation is space inefficient in that the PCB
struct is now too large to fit into 128 bytes. I might deal with this in the
future by using the new zone allocator, but I wanted these changes to be
extensively tested in their current form first.
Also:
1) Fixed off-by-one errors in the port lookup loops in in_pcbbind().
2) Got rid of some unneeded rehashing. Added a new routine, in_pcbinshash(),
to do the initial hash insertion.
3) Renamed in_pcblookuphash() to in_pcblookup_hash() for easier readability.
4) Added a new routine, in_pcbremlists() to remove the PCB from the various
hash lists.
5) Added/deleted comments where appropriate.
6) Removed unnecessary splnet() locking. In general, the PCB functions should
be called at splnet()...there are unfortunately a few exceptions, however.
7) Reorganized a few structs for better cache line behavior.
8) Killed my TCP_ACK_HACK kludge. It may come back in a different form in
the future, however.
These changes have been tested on wcarchive for more than a month. In tests
done here, connection establishment overhead is reduced by more than 50
times, thus getting rid of one of the major networking scalability problems.
Still to do: make tcp_fastimo/tcp_slowtimo scale well for systems with a
large number of connections. tcp_fastimo is easy; tcp_slowtimo is difficult.
WARNING: Anything that knows about inpcb and tcpcb structs will have to be
recompiled; at the very least, this includes netstat(1).
1998-01-27 09:15:13 +00:00
|
|
|
/*
|
|
|
|
* Allocate a PCB and associate it with the socket.
|
2006-07-18 22:34:27 +00:00
|
|
|
* On success return with the PCB locked.
|
Improved connection establishment performance by doing local port lookups via
a hashed port list. In the new scheme, in_pcblookup() goes away and is
replaced by a new routine, in_pcblookup_local() for doing the local port
check. Note that this implementation is space inefficient in that the PCB
struct is now too large to fit into 128 bytes. I might deal with this in the
future by using the new zone allocator, but I wanted these changes to be
extensively tested in their current form first.
Also:
1) Fixed off-by-one errors in the port lookup loops in in_pcbbind().
2) Got rid of some unneeded rehashing. Adding a new routine, in_pcbinshash()
to do the initialial hash insertion.
3) Renamed in_pcblookuphash() to in_pcblookup_hash() for easier readability.
4) Added a new routine, in_pcbremlists() to remove the PCB from the various
hash lists.
5) Added/deleted comments where appropriate.
6) Removed unnecessary splnet() locking. In general, the PCB functions should
be called at splnet()...there are unfortunately a few exceptions, however.
7) Reorganized a few structs for better cache line behavior.
8) Killed my TCP_ACK_HACK kludge. It may come back in a different form in
the future, however.
These changes have been tested on wcarchive for more than a month. In tests
done here, connection establishment overhead is reduced by more than 50
times, thus getting rid of one of the major networking scalability problems.
Still to do: make tcp_fasttimo/tcp_slowtimo scale well for systems with a
large number of connections. tcp_fasttimo is easy; tcp_slowtimo is difficult.
WARNING: Anything that knows about inpcb and tcpcb structs will have to be
recompiled; at the very least, this includes netstat(1).
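The hashed port list described above can be illustrated with a minimal
userspace sketch. It is not the kernel implementation: every sketch_* name,
the bucket count, and the masking hash are assumptions made for illustration.
It only shows why a local port check becomes a walk of one short bucket
instead of a walk of every PCB.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PORTHASH_SIZE 64	/* hypothetical bucket count, power of two */

/* Hypothetical, stripped-down PCB: only what the port hash needs. */
struct sketch_portpcb {
	uint16_t lport;				/* local port */
	struct sketch_portpcb *hash_next;	/* next PCB in the same bucket */
};

struct sketch_porthash {
	struct sketch_portpcb *bucket[SKETCH_PORTHASH_SIZE];
};

/* Map a local port to a bucket by masking off the low bits. */
static unsigned
sketch_porthash_index(uint16_t lport)
{
	return (lport & (SKETCH_PORTHASH_SIZE - 1));
}

/* Initial hash insertion: push the PCB onto the head of its bucket. */
static void
sketch_pcb_inshash(struct sketch_porthash *h, struct sketch_portpcb *p)
{
	unsigned i = sketch_porthash_index(p->lport);

	p->hash_next = h->bucket[i];
	h->bucket[i] = p;
}

/* Local port check: walk a single bucket rather than the whole PCB list. */
static struct sketch_portpcb *
sketch_pcb_lookup_local(const struct sketch_porthash *h, uint16_t lport)
{
	struct sketch_portpcb *p;

	for (p = h->bucket[sketch_porthash_index(lport)]; p != NULL;
	    p = p->hash_next)
		if (p->lport == lport)
			return (p);
	return (NULL);
}
```

Ports that collide in one bucket are still told apart by the explicit lport
comparison, so the hash only narrows the search and never decides the match.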
1998-01-27 09:15:13 +00:00
|
|
|
*/
|
1994-05-24 10:09:53 +00:00
|
|
|
int
|
2006-07-18 22:34:27 +00:00
|
|
|
in_pcballoc(struct socket *so, struct inpcbinfo *pcbinfo)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
2006-01-22 01:16:25 +00:00
|
|
|
struct inpcb *inp;
|
2001-07-26 19:19:49 +00:00
|
|
|
int error;
|
Introduce a MAC label reference in 'struct inpcb', which caches
the MAC label referenced from 'struct socket' in the IPv4 and
IPv6-based protocols. This permits MAC labels to be checked during
network delivery operations without dereferencing inp->inp_socket
to get to so->so_label, which will eventually avoid our having to
grab the socket lock during delivery at the network layer.
This change introduces 'struct inpcb' as a labeled object to the
MAC Framework, along with the normal circus of entry points:
initialization, creation from socket, destruction, as well as a
delivery access control check.
For most policies, the inpcb label will simply be a cache of the
socket label, so a new protocol switch method is introduced,
pr_sosetlabel() to notify protocols that the socket layer label
has been updated so that the cache can be updated while holding
appropriate locks. Most protocols implement this using
pru_sosetlabel_null(), but IPv4/IPv6 protocols using inpcbs use
the worker function in_pcbsosetlabel(), which calls into the
MAC Framework to perform a cache update.
Biba, LOMAC, and MLS implement these entry points, as do the stub
and test policies.
Reviewed by: sam, bms
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
2003-11-18 00:39:07 +00:00
|
|
|
|
|
|
|
error = 0;
|
2006-07-18 22:34:27 +00:00
|
|
|
inp = uma_zalloc(pcbinfo->ipi_zone, M_NOWAIT);
|
1994-05-24 10:09:53 +00:00
|
|
|
if (inp == NULL)
|
|
|
|
return (ENOBUFS);
|
2017-05-24 17:47:16 +00:00
|
|
|
bzero(&inp->inp_start_zero, inp_zero_size);
|
2019-04-25 15:37:28 +00:00
|
|
|
#ifdef NUMA
|
|
|
|
inp->inp_numa_domain = M_NODOM;
|
|
|
|
#endif
|
1995-04-09 01:29:31 +00:00
|
|
|
inp->inp_pcbinfo = pcbinfo;
|
1994-05-24 10:09:53 +00:00
|
|
|
inp->inp_socket = so;
|
2008-10-04 15:06:34 +00:00
|
|
|
inp->inp_cred = crhold(so->so_cred);
|
Add code to allow the system to handle multiple routing tables.
This particular implementation is designed to be fully backwards compatible
and to be MFC-able to 7.x (and 6.x)
Currently the only protocol that can make use of the multiple tables is IPv4.
Similar functionality exists in OpenBSD and Linux.
From my notes:
-----
One thing where FreeBSD has been falling behind, and which by chance I
have some time to work on, is "policy based routing", which allows different
packet streams to be routed by more than just the destination address.
Constraints:
------------
I want to make some form of this available in the 6.x tree
(and by extension 7.x) , but FreeBSD in general needs it so I might as
well do it in -current and back port the portions I need.
One of the ways that this can be done is to have the ability to
instantiate multiple kernel routing tables (which I will now
refer to as "Forwarding Information Bases" or "FIBs" for political
correctness reasons). Which FIB a particular packet uses to make
the next hop decision can be decided by a number of mechanisms.
The policies these mechanisms implement are the "Policies" referred
to in "Policy based routing".
One of the constraints I have if I try to back port this work to
6.x is that it must be implemented as an EXTENSION to the existing
ABIs in 6.x so that third party applications do not need to be
recompiled in the timespan of the branch.
This first version will not have some of the bells and whistles that
will come with later versions. It will, for example, be limited to 16
tables in the first commit.
Implementation method, Compatible version. (part 1)
-------------------------------
For this reason I have implemented a "sufficient subset" of a
multiple routing table solution in Perforce, and back-ported it
to 6.x. (also in Perforce though not always caught up with what I
have done in -current/P4). The subset allows a number of FIBs
to be defined at compile time (8 is sufficient for my purposes in 6.x)
and implements the changes needed to allow IPV4 to use them. I have not
done the changes for ipv6 simply because I do not need it, and I do not
have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it.
Other protocol families are left untouched and should there be
users with proprietary protocol families, they should continue to work
and be oblivious to the existence of the extra FIBs.
To understand how this is done, one must know that the current FIB
code starts everything off with a single dimensional array of
pointers to FIB head structures (One per protocol family), each of
which in turn points to the trie of routes available to that family.
The basic change in the ABI compatible version of the change is to
extend that array to be a 2 dimensional array, so that
instead of protocol family X looking at rt_tables[X] for the
table it needs, it looks at rt_tables[Y][X], where for all
protocol families except ipv4 Y is always 0.
Code that is unaware of the change always just sees the first row
of the table, which of course looks just like the one dimensional
array that existed before.
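The two dimensional layout can be illustrated with a minimal userspace
sketch. All sketch_* names, the array sizes, and the address family indices
are assumptions for illustration only; the real table, its element type, and
its accessors are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NUM_FIBS	16	/* hypothetical; mirrors the 16-table limit */
#define SKETCH_AF_MAX	4	/* hypothetical, tiny address family space */
#define SKETCH_AF_INET	2	/* hypothetical index standing in for IPv4 */

/* Hypothetical stand-in for a per-family routing table head. */
struct sketch_rib_head {
	int id;
};

/* rt_tables[fib][family] replaces the old one dimensional rt_tables[family]. */
static struct sketch_rib_head *sketch_rt_tables[SKETCH_NUM_FIBS][SKETCH_AF_MAX];

/* Only IPv4 honours the FIB number; every other family is forced to row 0. */
static struct sketch_rib_head *
sketch_rt_table_get(unsigned fib, unsigned family)
{
	assert(fib < SKETCH_NUM_FIBS && family < SKETCH_AF_MAX);
	if (family != SKETCH_AF_INET)
		fib = 0;
	return (sketch_rt_tables[fib][family]);
}

/* Legacy entry point: callers unaware of FIBs always see the first row. */
static struct sketch_rib_head *
sketch_rt_table_get_legacy(unsigned family)
{
	return (sketch_rt_table_get(0, family));
}
```

Keeping the legacy accessor as a thin wrapper over row 0 is what lets existing
callers keep working unchanged in this sketch.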
The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
are all maintained, but refer only to the first row of the array,
so that existing callers in proprietary protocols can continue to
do the "right thing".
Some new entry points are added, for the exclusive use of ipv4 code
called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(),
which have an extra argument which refers the code to the correct row.
In addition, there are some new entry points (currently called
rtalloc_fib() and friends) that check the Address family being
looked up and call either rtalloc() (and friends) if the protocol
is not IPv4 forcing the action to row 0 or to the appropriate row
if it IS IPv4 (and that info is available). These are for calling
from code that is not specific to any particular protocol. The way
these are implemented would change in the non ABI preserving code
to be added later.
One feature of the first version of the code is that for ipv4,
the interface routes show up automatically on all the FIBs, so
that no matter what FIB you select you always have the basic
direct attached hosts available to you. (rtinit() does this
automatically).
You CAN delete an interface route from one FIB should you want
to but by default it's there. ARP information is also available
in each FIB. It's assumed that the same machine would have the
same MAC address, regardless of which FIB you are using to get
to it.
This brings us to how the correct FIB is selected for an outgoing
IPV4 packet.
Firstly, all packets have a FIB associated with them. If nothing
has been done to change it, it will be FIB 0. The FIB is changed
in the following ways.
Packets fall into one of a number of classes.
1/ locally generated packets, coming from a socket/PCB.
Such packets select a FIB from a number associated with the
socket/PCB. This in turn is inherited from the process,
but can be changed by a socket option. The process in turn
inherits it on fork. I have written a utility called setfib
that acts a bit like nice.
setfib -3 ping target.example.com # will use fib 3 for ping.
It is an obvious extension to make it a property of a jail
but I have not done so. It can be achieved by combining the setfib and
jail commands.
2/ packets received on an interface for forwarding.
By default these packets would use table 0,
(or possibly a number settable in a sysctl(not yet)).
but prior to routing the firewall can inspect them (see below).
(possibly in the future you may be able to associate a FIB
with packets received on an interface.. An ifconfig arg, but not yet.)
3/ packets inspected by a packet classifier, which can arbitrarily
associate a fib with it on a packet by packet basis.
A fib assigned to a packet by a packet classifier
(such as ipfw) would over-ride a fib associated by
a more default source. (such as cases 1 or 2).
4/ a tcp listen socket associated with a fib will generate
accept sockets that are associated with that same fib.
5/ Packets generated in response to some other packet (e.g. reset
or icmp packets). These should use the FIB associated with the
packet being responded to.
6/ Packets generated during encapsulation.
gif, tun and other tunnel interfaces will encapsulate using the FIB
that was in effect with the process that set up the tunnel.
Thus setfib 1 ifconfig gif0 [tunnel instructions]
will set the fib for the tunnel to use to be fib 1.
Routing messages would be associated with their
process, and thus select one FIB or another.
messages from the kernel would be associated with the fib they
refer to and would only be received by a routing socket associated
with that fib. (not yet implemented)
In addition Netstat has been edited to be able to cope with the
fact that the array is now 2 dimensional. (It looks in system
memory using libkvm (!)). Old versions of netstat see only the first FIB.
In addition two sysctls are added to give:
a) the number of FIBs compiled in (active)
b) the default FIB of the calling process.
Early testing experience:
-------------------------
Basically our (IronPort's) appliance does this functionality already
using ipfw fwd but that method has some drawbacks.
For example,
It can't fully simulate a routing table because it can't influence the
socket's choice of local address when a connect() is done.
Testing during the generating of these changes has been
remarkably smooth so far. Multiple tables have co-existed
with no notable side effects, and packets have been routed
accordingly.
ipfw has grown 2 new keywords:
setfib N ip from any to any
count ip from any to any fib N
In pf there seems to be a requirement to be able to give symbolic names to the
fibs but I do not have that capacity. I am not sure if it is required.
SCTP has, interestingly enough, built in support for this, called VRFs
in Cisco parlance. It will be interesting to see how that handles it
when it suddenly actually does something.
Where to next:
--------------------
After committing the ABI compatible version and MFCing it, I'd
like to proceed in a forward direction in -current. This will
result in some roto-tilling in the routing code.
Firstly: the current code's idea of having a separate tree per
protocol family, all of the same format, and pointed to by the
1 dimensional array is a bit silly. Especially when one considers that
there is code that makes assumptions about every protocol having the
same internal structures there. Some protocols don't WANT that
sort of structure. (for example the whole idea of a netmask is foreign
to appletalk). This needs to be made opaque to the external code.
My suggested first change is to add routing method pointers to the
'domain' structure, along with information pointing to the data.
Instead of having an array of pointers to uniform structures,
there would be an array pointing to the 'domain' structures
for each protocol address domain (protocol family),
and the methods this reached would be called. The methods would have
an argument that gives FIB number, but the protocol would be free
to ignore it.
When the ABI can be changed it raises the possibility of the
addition of a fib entry into the "struct route". Currently,
the structure contains the sockaddr of the destination, and the resulting
fib entry. To make this work fully, one could add a fib number
so that given an address and a fib, one can find the third element, the
fib entry.
Interaction with the ARP layer/ LL layer would need to be
revisited as well. Qing Li has been working on this already.
This work was sponsored by Ironport Systems/Cisco
Reviewed by: several including rwatson, bz and mlair (parts each)
Obtained from: Ironport systems/Cisco
2008-05-09 23:03:00 +00:00
|
|
|
inp->inp_inc.inc_fibnum = so->so_fibnum;
|
2003-11-18 00:39:07 +00:00
|
|
|
#ifdef MAC
|
2007-10-24 19:04:04 +00:00
|
|
|
error = mac_inpcb_init(inp, M_NOWAIT);
|
2003-11-18 00:39:07 +00:00
|
|
|
if (error != 0)
|
|
|
|
goto out;
|
2007-10-24 19:04:04 +00:00
|
|
|
mac_inpcb_create(so, inp);
|
2003-11-18 00:39:07 +00:00
|
|
|
#endif
|
2017-02-06 08:49:57 +00:00
|
|
|
#if defined(IPSEC) || defined(IPSEC_SUPPORT)
|
|
|
|
error = ipsec_init_pcbpolicy(inp);
|
2007-12-22 10:06:11 +00:00
|
|
|
if (error != 0) {
|
|
|
|
#ifdef MAC
|
|
|
|
mac_inpcb_destroy(inp);
|
|
|
|
#endif
|
2003-11-18 00:39:07 +00:00
|
|
|
goto out;
|
2008-03-17 13:04:56 +00:00
|
|
|
}
|
2007-07-03 12:13:45 +00:00
|
|
|
#endif /*IPSEC*/
|
2006-11-30 10:54:54 +00:00
|
|
|
#ifdef INET6
|
2003-02-19 22:32:43 +00:00
|
|
|
if (INP_SOCKAF(so) == AF_INET6) {
|
|
|
|
inp->inp_vflag |= INP_IPV6PROTO;
|
Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).
This is the first in a series of commits over the course
of the next few weeks.
Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.
We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.
Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch
2008-08-17 23:27:27 +00:00
|
|
|
if (V_ip6_v6only)
|
2003-02-19 22:32:43 +00:00
|
|
|
inp->inp_flags |= IN6P_IPV6_V6ONLY;
|
|
|
|
}
|
2000-04-02 03:49:25 +00:00
|
|
|
#endif
|
2015-08-03 12:13:54 +00:00
|
|
|
INP_WLOCK(inp);
|
|
|
|
INP_LIST_WLOCK(pcbinfo);
|
2018-06-12 22:18:20 +00:00
|
|
|
CK_LIST_INSERT_HEAD(pcbinfo->ipi_listhead, inp, inp_list);
|
1998-03-24 18:06:34 +00:00
|
|
|
pcbinfo->ipi_count++;
|
1994-05-24 10:09:53 +00:00
|
|
|
so->so_pcb = (caddr_t)inp;
|
2001-06-11 12:39:29 +00:00
|
|
|
#ifdef INET6
|
2008-08-17 23:27:27 +00:00
|
|
|
if (V_ip6_auto_flowlabel)
|
2001-06-11 12:39:29 +00:00
|
|
|
inp->inp_flags |= IN6P_AUTOFLOWLABEL;
|
|
|
|
#endif
|
2006-07-18 22:34:27 +00:00
|
|
|
inp->inp_gencnt = ++pcbinfo->ipi_gencnt;
|
Continue to refine inpcb reference counting and locking, in preparation for
reworking of inpcbinfo locking:
(1) Convert inpcb reference counting from manually manipulated integers to
the refcount(9) KPI. This allows the refcount to be managed atomically
with an inpcb read lock rather than write lock, or even with no inpcb
lock at all. As a result, in_pcbref() also no longer requires an inpcb
lock, so can be performed solely using the lock used to look up an
inpcb.
(2) Shift more inpcb freeing activity from the in_pcbrele() context (via
in_pcbfree_internal) to the explicit in_pcbfree() context. This means
that the inpcb refcount is increasingly used only to maintain memory
stability, not actually defer the clean up of inpcb protocol parts.
This is desirable as many of those protocol parts required the pcbinfo
lock, which we'd like not to acquire in in_pcbrele() contexts. Document
this in comments better.
(3) Introduce new read-locked and write-locked in_pcbrele() variations,
in_pcbrele_rlocked() and in_pcbrele_wlocked(), which allow the inpcb to
be properly unlocked as needed. in_pcbrele() is a wrapper around the
latter, and should probably go away at some point. This makes it
easier to use this weak reference model when holding only a read lock,
as will happen in the future.
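The reference model in (1) can be illustrated with a minimal userspace
sketch. It uses C11 <stdatomic.h> rather than the kernel's refcount(9) KPI,
and every sketch_* name is hypothetical. The point it shows is that taking
and dropping a reference needs no lock, and that the last release is the one
that frees the memory.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical, stripped-down PCB carrying only the reference count. */
struct sketch_refpcb {
	atomic_uint refcount;
	bool freed;	/* stands in for returning memory to the allocator */
};

/* Start with one reference, held by the owning list. */
static void
sketch_pcb_init(struct sketch_refpcb *p)
{
	atomic_init(&p->refcount, 1);
	p->freed = false;
}

/* Take an extra reference; no lock on the PCB is needed for this step. */
static void
sketch_pcb_ref(struct sketch_refpcb *p)
{
	atomic_fetch_add_explicit(&p->refcount, 1, memory_order_relaxed);
}

/*
 * Drop a reference. Returns true only for the caller that dropped the
 * last one, which is then responsible for releasing the memory.
 */
static bool
sketch_pcb_rele(struct sketch_refpcb *p)
{
	if (atomic_fetch_sub_explicit(&p->refcount, 1,
	    memory_order_acq_rel) != 1)
		return (false);
	p->freed = true;
	return (true);
}
```

In this sketch the count only keeps the memory stable; protocol teardown
would happen elsewhere, which matches the split described in (2).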
This may well be safe to MFC, but some more KBI analysis is required.
Reviewed by: bz
MFC after: 3 weeks
Sponsored by: Juniper Networks, Inc.
2011-05-23 19:32:02 +00:00
|
|
|
refcount_init(&inp->inp_refcount, 1); /* Reference from inpcbinfo */
|
2017-03-25 15:06:28 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Routes in inpcb's can cache L2 as well; they are guaranteed
|
|
|
|
* to be cleaned up.
|
|
|
|
*/
|
|
|
|
inp->inp_route.ro_flags = RT_LLE_CACHE;
|
2015-08-03 12:13:54 +00:00
|
|
|
INP_LIST_WUNLOCK(pcbinfo);
|
2017-02-14 21:33:10 +00:00
|
|
|
#if defined(IPSEC) || defined(IPSEC_SUPPORT) || defined(MAC)
|
2003-11-18 00:39:07 +00:00
|
|
|
out:
|
2008-10-04 15:06:34 +00:00
|
|
|
if (error != 0) {
|
|
|
|
crfree(inp->inp_cred);
|
2003-11-18 00:39:07 +00:00
|
|
|
uma_zfree(pcbinfo->ipi_zone, inp);
|
2008-10-04 15:06:34 +00:00
|
|
|
}
|
2003-11-18 00:39:07 +00:00
|
|
|
#endif
|
|
|
|
return (error);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
|
2011-04-30 11:04:34 +00:00
|
|
|
#ifdef INET
|
1994-05-24 10:09:53 +00:00
|
|
|
int
|
2006-01-22 01:16:25 +00:00
|
|
|
in_pcbbind(struct inpcb *inp, struct sockaddr *nam, struct ucred *cred)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
2002-10-20 21:44:31 +00:00
|
|
|
int anonport, error;
|
|
|
|
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK_ASSERT(inp);
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
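The "exactly one of these flags" rule can be checked mechanically. The
sketch below is an illustration only: the SKETCH_* names and their numeric
values are hypothetical and are not the kernel's INPLOOKUP_* definitions.

```c
#include <stdbool.h>

/* Hypothetical flag values; the real INPLOOKUP_* constants may differ. */
#define SKETCH_LOOKUP_WILDCARD	0x1
#define SKETCH_LOOKUP_RLOCKPCB	0x2
#define SKETCH_LOOKUP_WLOCKPCB	0x4

#define SKETCH_LOOKUP_LOCKMASK \
	(SKETCH_LOOKUP_RLOCKPCB | SKETCH_LOOKUP_WLOCKPCB)

/*
 * Valid only when exactly one lock flag is set. The wildcard flag is
 * independent of the lock flags and does not affect the result.
 */
static bool
sketch_lookup_flags_valid(int flags)
{
	int lock = flags & SKETCH_LOOKUP_LOCKMASK;

	return (lock == SKETCH_LOOKUP_RLOCKPCB ||
	    lock == SKETCH_LOOKUP_WLOCKPCB);
}
```

A lookup routine could assert this on entry so that a caller passing neither
or both lock flags fails early rather than returning an inpcb in an
unexpected lock state.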
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
|
|
|
INP_HASH_WLOCK_ASSERT(inp->inp_pcbinfo);
|
2003-11-08 23:02:36 +00:00
|
|
|
|
2002-10-20 21:44:31 +00:00
|
|
|
if (inp->inp_lport != 0 || inp->inp_laddr.s_addr != INADDR_ANY)
|
|
|
|
return (EINVAL);
|
2013-01-25 20:14:27 +00:00
|
|
|
anonport = nam == NULL || ((struct sockaddr_in *)nam)->sin_port == 0;
|
2002-10-20 21:44:31 +00:00
|
|
|
error = in_pcbbind_setup(inp, nam, &inp->inp_laddr.s_addr,
|
2004-03-27 21:05:46 +00:00
|
|
|
&inp->inp_lport, cred);
|
2002-10-20 21:44:31 +00:00
|
|
|
if (error)
|
|
|
|
return (error);
|
|
|
|
if (in_pcbinshash(inp) != 0) {
|
|
|
|
inp->inp_laddr.s_addr = INADDR_ANY;
|
|
|
|
inp->inp_lport = 0;
|
|
|
|
return (EAGAIN);
|
|
|
|
}
|
|
|
|
if (anonport)
|
|
|
|
inp->inp_flags |= INP_ANONPORT;
|
|
|
|
return (0);
|
|
|
|
}
|
2011-04-30 11:04:34 +00:00
|
|
|
#endif
|
2002-10-20 21:44:31 +00:00
|
|
|
|
2020-05-18 22:53:12 +00:00
|
|
|
#if defined(INET) || defined(INET6)
|
2014-07-29 23:42:51 +00:00
|
|
|
/*
|
2020-05-18 22:53:12 +00:00
|
|
|
* Assign a local port like in_pcb_lport(), but also used with connect()
|
|
|
|
* and a foreign address and port. If fsa is non-NULL, choose a local port
|
|
|
|
* that is unused with those, otherwise one that is completely unused.
|
2020-05-19 01:05:13 +00:00
|
|
|
* lsa can be NULL for IPv6.
|
2014-07-29 23:42:51 +00:00
|
|
|
*/
|
2011-03-12 21:46:37 +00:00
|
|
|
int
|
2020-05-18 22:53:12 +00:00
|
|
|
in_pcb_lport_dest(struct inpcb *inp, struct sockaddr *lsa, u_short *lportp,
|
|
|
|
struct sockaddr *fsa, u_short fport, struct ucred *cred, int lookupflags)
|
2011-03-12 21:46:37 +00:00
|
|
|
{
|
|
|
|
struct inpcbinfo *pcbinfo;
|
|
|
|
struct inpcb *tmpinp;
|
|
|
|
unsigned short *lastport;
|
|
|
|
int count, dorandom, error;
|
|
|
|
u_short aux, first, last, lport;
|
|
|
|
#ifdef INET
|
2020-05-18 22:53:12 +00:00
|
|
|
struct in_addr laddr, faddr;
|
|
|
|
#endif
|
|
|
|
#ifdef INET6
|
|
|
|
struct in6_addr *laddr6, *faddr6;
|
2011-03-12 21:46:37 +00:00
|
|
|
#endif
|
|
|
|
|
|
|
|
pcbinfo = inp->inp_pcbinfo;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Because no actual state changes occur here, a global write lock on
|
|
|
|
* the pcbinfo isn't required.
|
|
|
|
*/
|
|
|
|
INP_LOCK_ASSERT(inp);
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
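The "exactly one of these flags" rule from the commit message above can be checked with a small helper. This is a minimal sketch under stated assumptions: the LK_* constants and their values, and lock_flags_valid() itself, are hypothetical stand-ins and are not the kernel's INPLOOKUP_* definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for INPLOOKUP_WILDCARD / _RLOCKPCB / _WLOCKPCB. */
#define	LK_WILDCARD	0x1
#define	LK_RLOCKPCB	0x2
#define	LK_WLOCKPCB	0x4

/*
 * Return true when exactly one of the two locking flags is present.
 * Other flags (such as the wildcard flag) are ignored by this check.
 */
static bool
lock_flags_valid(int flags)
{
	int lock = flags & (LK_RLOCKPCB | LK_WLOCKPCB);

	/* Non-zero and a power of two means exactly one bit is set. */
	return (lock != 0 && (lock & (lock - 1)) == 0);
}
```

A lookup routine following this convention could assert lock_flags_valid(flags) on entry, so that a caller passing neither or both locking flags is caught early.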
|
|
|
INP_HASH_LOCK_ASSERT(pcbinfo);
|
2011-03-12 21:46:37 +00:00
|
|
|
|
|
|
|
if (inp->inp_flags & INP_HIGHPORT) {
|
|
|
|
first = V_ipport_hifirstauto; /* sysctl */
|
|
|
|
last = V_ipport_hilastauto;
|
|
|
|
lastport = &pcbinfo->ipi_lasthi;
|
|
|
|
} else if (inp->inp_flags & INP_LOWPORT) {
|
2018-12-11 19:32:16 +00:00
|
|
|
error = priv_check_cred(cred, PRIV_NETINET_RESERVEDPORT);
|
2011-03-12 21:46:37 +00:00
|
|
|
if (error)
|
|
|
|
return (error);
|
|
|
|
first = V_ipport_lowfirstauto; /* 1023 */
|
|
|
|
last = V_ipport_lowlastauto; /* 600 */
|
|
|
|
lastport = &pcbinfo->ipi_lastlow;
|
|
|
|
} else {
|
|
|
|
first = V_ipport_firstauto; /* sysctl */
|
|
|
|
last = V_ipport_lastauto;
|
|
|
|
lastport = &pcbinfo->ipi_lastport;
|
|
|
|
}
|
|
|
|
/*
|
2014-04-07 01:53:03 +00:00
|
|
|
* For UDP(-Lite), use random port allocation as long as the user
|
2011-03-12 21:46:37 +00:00
|
|
|
* allows it. For TCP (and as of yet unknown) connections,
|
|
|
|
* use random port allocation only if the user allows it AND
|
|
|
|
* ipport_tick() allows it.
|
|
|
|
*/
|
|
|
|
if (V_ipport_randomized &&
|
2014-04-07 01:53:03 +00:00
|
|
|
(!V_ipport_stoprandom || pcbinfo == &V_udbinfo ||
|
|
|
|
pcbinfo == &V_ulitecbinfo))
|
2011-03-12 21:46:37 +00:00
|
|
|
dorandom = 1;
|
|
|
|
else
|
|
|
|
dorandom = 0;
|
|
|
|
/*
|
|
|
|
* It makes no sense to do random port allocation if
|
|
|
|
* we have the only port available.
|
|
|
|
*/
|
|
|
|
if (first == last)
|
|
|
|
dorandom = 0;
|
2014-04-07 01:53:03 +00:00
|
|
|
/* Make sure to not include UDP(-Lite) packets in the count. */
|
|
|
|
if (pcbinfo != &V_udbinfo && pcbinfo != &V_ulitecbinfo)
|
2011-03-12 21:46:37 +00:00
|
|
|
V_ipport_tcpallocs++;
|
|
|
|
/*
|
|
|
|
* Instead of having two loops further down counting up or down
|
|
|
|
* make sure that first is always <= last and go with only one
|
|
|
|
* code path implementing all logic.
|
|
|
|
*/
|
|
|
|
if (first > last) {
|
|
|
|
aux = first;
|
|
|
|
first = last;
|
|
|
|
last = aux;
|
|
|
|
}
|
|
|
|
|
|
|
|
#ifdef INET
|
2020-05-18 22:53:12 +00:00
|
|
|
laddr.s_addr = INADDR_ANY;
|
2011-03-19 19:08:54 +00:00
|
|
|
if ((inp->inp_vflag & (INP_IPV4|INP_IPV6)) == INP_IPV4) {
|
2020-05-19 01:05:13 +00:00
|
|
|
if (lsa != NULL)
|
|
|
|
laddr = ((struct sockaddr_in *)lsa)->sin_addr;
|
2020-05-18 22:53:12 +00:00
|
|
|
if (fsa != NULL)
|
|
|
|
faddr = ((struct sockaddr_in *)fsa)->sin_addr;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
#ifdef INET6
|
2020-05-19 01:05:13 +00:00
|
|
|
laddr6 = NULL;
|
|
|
|
if ((inp->inp_vflag & INP_IPV6) != 0) {
|
|
|
|
if (lsa != NULL)
|
|
|
|
laddr6 = &((struct sockaddr_in6 *)lsa)->sin6_addr;
|
2020-05-18 22:53:12 +00:00
|
|
|
if (fsa != NULL)
|
|
|
|
faddr6 = &((struct sockaddr_in6 *)fsa)->sin6_addr;
|
2011-03-12 21:46:37 +00:00
|
|
|
}
|
|
|
|
#endif
|
2020-05-18 22:53:12 +00:00
|
|
|
|
|
|
|
tmpinp = NULL;
|
2011-03-12 21:46:37 +00:00
|
|
|
lport = *lportp;
|
|
|
|
|
|
|
|
if (dorandom)
|
|
|
|
*lastport = first + (arc4random() % (last - first));
|
|
|
|
|
|
|
|
count = last - first;
|
|
|
|
|
|
|
|
do {
|
|
|
|
if (count-- < 0) /* completely used? */
|
|
|
|
return (EADDRNOTAVAIL);
|
|
|
|
++*lastport;
|
|
|
|
if (*lastport < first || *lastport > last)
|
|
|
|
*lastport = first;
|
|
|
|
lport = htons(*lastport);
|
|
|
|
|
2020-05-18 22:53:12 +00:00
|
|
|
if (fsa != NULL) {
|
|
|
|
|
|
|
|
#ifdef INET
|
|
|
|
if (lsa->sa_family == AF_INET) {
|
|
|
|
tmpinp = in_pcblookup_hash_locked(pcbinfo,
|
|
|
|
faddr, fport, laddr, lport, lookupflags,
|
|
|
|
NULL);
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
#ifdef INET6
|
|
|
|
if (lsa->sa_family == AF_INET6) {
|
|
|
|
tmpinp = in6_pcblookup_hash_locked(pcbinfo,
|
|
|
|
faddr6, fport, laddr6, lport, lookupflags,
|
|
|
|
NULL);
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
} else {
|
2011-03-12 21:46:37 +00:00
|
|
|
#ifdef INET6
|
2020-05-18 22:53:12 +00:00
|
|
|
if ((inp->inp_vflag & INP_IPV6) != 0)
|
|
|
|
tmpinp = in6_pcblookup_local(pcbinfo,
|
|
|
|
&inp->in6p_laddr, lport, lookupflags, cred);
|
2011-03-12 21:46:37 +00:00
|
|
|
#endif
|
|
|
|
#if defined(INET) && defined(INET6)
|
2020-05-18 22:53:12 +00:00
|
|
|
else
|
2011-03-12 21:46:37 +00:00
|
|
|
#endif
|
|
|
|
#ifdef INET
|
2020-05-18 22:53:12 +00:00
|
|
|
tmpinp = in_pcblookup_local(pcbinfo, laddr,
|
|
|
|
lport, lookupflags, cred);
|
2011-03-12 21:46:37 +00:00
|
|
|
#endif
|
2020-05-18 22:53:12 +00:00
|
|
|
}
|
2011-03-12 21:46:37 +00:00
|
|
|
} while (tmpinp != NULL);
|
|
|
|
|
|
|
|
*lportp = lport;
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
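The search loop in in_pcb_lport_dest() can be sketched in user space as follows. This is a minimal illustration, not the kernel code: pick_port(), port_in_use_fn, struct port_set and port_set_contains() are hypothetical names standing in for the PCB lookups, ports are kept in host byte order, and the randomized starting point is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the PCB lookup: true if the port is taken. */
typedef bool (*port_in_use_fn)(uint16_t port, void *arg);

/* Hypothetical table of ports that are already in use. */
struct port_set {
	const uint16_t	*ports;
	size_t		 n;
};

static bool
port_set_contains(uint16_t port, void *arg)
{
	const struct port_set *set = arg;
	size_t i;

	for (i = 0; i < set->n; i++)
		if (set->ports[i] == port)
			return (true);
	return (false);
}

/*
 * Scan the range [first, last] (given in either order), starting just
 * after *lastport and wrapping at the end of the range.  Returns 0 and
 * stores the chosen port in *out, or -1 once every port has been tried.
 */
static int
pick_port(uint16_t first, uint16_t last, uint16_t *lastport,
    port_in_use_fn in_use, void *arg, uint16_t *out)
{
	uint16_t aux;
	int count;

	/* Normalize so that a single loop handles both directions. */
	if (first > last) {
		aux = first;
		first = last;
		last = aux;
	}
	count = last - first;
	do {
		if (count-- < 0)	/* completely used? */
			return (-1);
		++*lastport;
		if (*lastport < first || *lastport > last)
			*lastport = first;
	} while (in_use(*lastport, arg));
	*out = *lastport;
	return (0);
}
```

With the cursor at 1000, a range of 1000 to 1003 and ports 1001 and 1002 taken, the sketch settles on 1003. Passing the range as 1003 to 1000 gives the same result because of the swap.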
|
2013-07-04 18:38:00 +00:00
|
|
|
|
2020-05-18 22:53:12 +00:00
|
|
|
/*
|
|
|
|
* Select a local port (number) to use.
|
|
|
|
*/
|
|
|
|
int
|
|
|
|
in_pcb_lport(struct inpcb *inp, struct in_addr *laddrp, u_short *lportp,
|
|
|
|
struct ucred *cred, int lookupflags)
|
|
|
|
{
|
|
|
|
struct sockaddr_in laddr;
|
|
|
|
|
|
|
|
if (laddrp) {
|
|
|
|
bzero(&laddr, sizeof(laddr));
|
|
|
|
laddr.sin_family = AF_INET;
|
|
|
|
laddr.sin_addr = *laddrp;
|
|
|
|
}
|
|
|
|
return (in_pcb_lport_dest(inp, laddrp ? (struct sockaddr *) &laddr :
|
|
|
|
NULL, lportp, NULL, 0, cred, lookupflags));
|
|
|
|
}
|
|
|
|
|
2013-07-04 18:38:00 +00:00
|
|
|
/*
|
|
|
|
* Return cached socket options.
|
|
|
|
*/
|
2018-06-06 15:45:57 +00:00
|
|
|
int
|
2013-07-04 18:38:00 +00:00
|
|
|
inp_so_options(const struct inpcb *inp)
|
|
|
|
{
|
2018-06-06 15:45:57 +00:00
|
|
|
int so_options;
|
2013-07-04 18:38:00 +00:00
|
|
|
|
2018-06-06 15:45:57 +00:00
|
|
|
so_options = 0;
|
2013-07-04 18:38:00 +00:00
|
|
|
|
2018-06-06 15:45:57 +00:00
|
|
|
if ((inp->inp_flags2 & INP_REUSEPORT_LB) != 0)
|
|
|
|
so_options |= SO_REUSEPORT_LB;
|
|
|
|
if ((inp->inp_flags2 & INP_REUSEPORT) != 0)
|
|
|
|
so_options |= SO_REUSEPORT;
|
|
|
|
if ((inp->inp_flags2 & INP_REUSEADDR) != 0)
|
|
|
|
so_options |= SO_REUSEADDR;
|
|
|
|
return (so_options);
|
2013-07-04 18:38:00 +00:00
|
|
|
}
|
2011-03-12 21:46:37 +00:00
|
|
|
#endif /* INET || INET6 */
|
|
|
|
|
2014-07-10 03:10:56 +00:00
|
|
|
/*
|
|
|
|
* Check if a new BINDMULTI socket is allowed to be created.
|
|
|
|
*
|
|
|
|
* ni points to the new inp.
|
|
|
|
* oi points to the existing inp.
|
|
|
|
*
|
|
|
|
* This checks whether the existing inp also has BINDMULTI and
|
|
|
|
* whether the credentials match.
|
|
|
|
*/
|
2014-07-12 05:40:13 +00:00
|
|
|
int
|
2014-07-10 03:10:56 +00:00
|
|
|
in_pcbbind_check_bindmulti(const struct inpcb *ni, const struct inpcb *oi)
|
|
|
|
{
|
|
|
|
/* Check permissions match */
|
|
|
|
if ((ni->inp_flags2 & INP_BINDMULTI) &&
|
|
|
|
(ni->inp_cred->cr_uid !=
|
|
|
|
oi->inp_cred->cr_uid))
|
|
|
|
return (0);
|
|
|
|
|
|
|
|
/* Check the existing inp has BINDMULTI set */
|
|
|
|
if ((ni->inp_flags2 & INP_BINDMULTI) &&
|
|
|
|
((oi->inp_flags2 & INP_BINDMULTI) == 0))
|
|
|
|
return (0);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We're okay - either INP_BINDMULTI isn't set on ni, or
|
|
|
|
* it is and it matches the checks.
|
|
|
|
*/
|
|
|
|
return (1);
|
|
|
|
}
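The two checks in in_pcbbind_check_bindmulti() collapse into one rule: a new socket without BINDMULTI always passes, and one with BINDMULTI passes only if the existing socket has the same owner and also has BINDMULTI set. The sketch below restates that rule in user space; struct sk, SK_BINDMULTI and bindmulti_allowed() are hypothetical stand-ins for the inpcb fields and INP_BINDMULTI, chosen only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define	SK_BINDMULTI	0x1	/* hypothetical stand-in for INP_BINDMULTI */

/* Hypothetical reduction of an inpcb to the two fields the check reads. */
struct sk {
	uint32_t	flags2;
	uint32_t	uid;
};

/* ni is the new socket, oi the existing one, as in the original. */
static bool
bindmulti_allowed(const struct sk *ni, const struct sk *oi)
{
	/* Without BINDMULTI on the new socket there is nothing to check. */
	if ((ni->flags2 & SK_BINDMULTI) == 0)
		return (true);
	/* Otherwise the owners must match and oi must have BINDMULTI too. */
	return (ni->uid == oi->uid && (oi->flags2 & SK_BINDMULTI) != 0);
}
```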
|
|
|
|
|
2014-07-12 05:40:13 +00:00
|
|
|
#ifdef INET
|
2002-10-20 21:44:31 +00:00
|
|
|
/*
|
|
|
|
* Set up a bind operation on a PCB, performing port allocation
|
|
|
|
* as required, but do not actually modify the PCB. Callers can
|
|
|
|
* either complete the bind by setting inp_laddr/inp_lport and
|
|
|
|
* calling in_pcbinshash(), or they can just use the resulting
|
|
|
|
* port and address to authorise the sending of a once-off packet.
|
|
|
|
*
|
|
|
|
* On error, the values of *laddrp and *lportp are not changed.
|
|
|
|
*/
|
|
|
|
int
|
2006-01-22 01:16:25 +00:00
|
|
|
in_pcbbind_setup(struct inpcb *inp, struct sockaddr *nam, in_addr_t *laddrp,
|
|
|
|
u_short *lportp, struct ucred *cred)
|
2002-10-20 21:44:31 +00:00
|
|
|
{
|
|
|
|
struct socket *so = inp->inp_socket;
|
1995-04-09 01:29:31 +00:00
|
|
|
struct sockaddr_in *sin;
|
Improved connection establishment performance by doing local port lookups via
a hashed port list. In the new scheme, in_pcblookup() goes away and is
replaced by a new routine, in_pcblookup_local() for doing the local port
check. Note that this implementation is space inefficient in that the PCB
struct is now too large to fit into 128 bytes. I might deal with this in the
future by using the new zone allocator, but I wanted these changes to be
extensively tested in their current form first.
Also:
1) Fixed off-by-one errors in the port lookup loops in in_pcbbind().
2) Got rid of some unneeded rehashing. Adding a new routine, in_pcbinshash()
to do the initial hash insertion.
3) Renamed in_pcblookuphash() to in_pcblookup_hash() for easier readability.
4) Added a new routine, in_pcbremlists() to remove the PCB from the various
hash lists.
5) Added/deleted comments where appropriate.
6) Removed unnecessary splnet() locking. In general, the PCB functions should
be called at splnet()...there are unfortunately a few exceptions, however.
7) Reorganized a few structs for better cache line behavior.
8) Killed my TCP_ACK_HACK kludge. It may come back in a different form in
the future, however.
These changes have been tested on wcarchive for more than a month. In tests
done here, connection establishment overhead is reduced by more than 50
times, thus getting rid of one of the major networking scalability problems.
Still to do: make tcp_fastimo/tcp_slowtimo scale well for systems with a
large number of connections. tcp_fastimo is easy; tcp_slowtimo is difficult.
WARNING: Anything that knows about inpcb and tcpcb structs will have to be
recompiled; at the very least, this includes netstat(1).
1998-01-27 09:15:13 +00:00
|
|
|
struct inpcbinfo *pcbinfo = inp->inp_pcbinfo;
|
2002-10-20 21:44:31 +00:00
|
|
|
struct in_addr laddr;
|
1994-05-24 10:09:53 +00:00
|
|
|
u_short lport = 0;
|
2011-05-23 15:23:18 +00:00
|
|
|
int lookupflags = 0, reuseport = (so->so_options & SO_REUSEPORT);
|
MFp4:
Bring in updated jail support from bz_jail branch.
This enhances the current jail implementation to permit multiple
addresses per jail. In addition to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..
SCTP support was updated and supports IPv6 in jails as well.
Cpuset support permits jails to be bound to specific processor
sets after creation.
Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.
DDB 'show jails' command was added to aid debugging.
Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.
Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.
Bump __FreeBSD_version for the aforementioned and in-kernel changes.
Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.
Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
|
|
|
int error;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2018-06-06 15:45:57 +00:00
|
|
|
/*
|
|
|
|
* XXX: Maybe we could let SO_REUSEPORT_LB set SO_REUSEPORT bit here
|
|
|
|
* so that we don't have to add to the (already messy) code below.
|
|
|
|
*/
|
|
|
|
int reuseport_lb = (so->so_options & SO_REUSEPORT_LB);
|
|
|
|
|
2008-04-17 21:38:18 +00:00
|
|
|
/*
|
2011-05-30 09:43:55 +00:00
|
|
|
* No state changes, so read locks are sufficient here.
|
2008-04-17 21:38:18 +00:00
|
|
|
*/
|
2003-11-08 23:02:36 +00:00
|
|
|
INP_LOCK_ASSERT(inp);
|
2011-05-30 09:43:55 +00:00
|
|
|
INP_HASH_LOCK_ASSERT(pcbinfo);
|
2003-11-08 23:02:36 +00:00
|
|
|
|
2018-05-18 20:13:34 +00:00
|
|
|
if (CK_STAILQ_EMPTY(&V_in_ifaddrhead)) /* XXX broken! */
|
1994-05-24 10:09:53 +00:00
|
|
|
return (EADDRNOTAVAIL);
|
2002-10-20 21:44:31 +00:00
|
|
|
laddr.s_addr = *laddrp;
|
|
|
|
if (nam != NULL && laddr.s_addr != INADDR_ANY)
|
1994-05-24 10:09:53 +00:00
|
|
|
return (EINVAL);
|
2018-06-06 15:45:57 +00:00
|
|
|
if ((so->so_options & (SO_REUSEADDR|SO_REUSEPORT|SO_REUSEPORT_LB)) == 0)
|
2011-05-23 15:23:18 +00:00
|
|
|
lookupflags = INPLOOKUP_WILDCARD;
|
2017-02-10 05:58:16 +00:00
|
|
|
if (nam == NULL) {
|
2009-02-05 14:25:53 +00:00
|
|
|
if ((error = prison_local_ip4(cred, &laddr)) != 0)
|
|
|
|
return (error);
|
|
|
|
} else {
|
1997-08-16 19:16:27 +00:00
|
|
|
sin = (struct sockaddr_in *)nam;
|
|
|
|
if (nam->sa_len != sizeof (*sin))
|
1994-05-24 10:09:53 +00:00
|
|
|
return (EINVAL);
|
|
|
|
#ifdef notdef
|
|
|
|
/*
|
|
|
|
* We should check the family, but old programs
|
|
|
|
* incorrectly fail to initialize it.
|
|
|
|
*/
|
|
|
|
if (sin->sin_family != AF_INET)
|
|
|
|
return (EAFNOSUPPORT);
|
|
|
|
#endif
|
2009-02-05 14:06:09 +00:00
|
|
|
error = prison_local_ip4(cred, &sin->sin_addr);
|
|
|
|
if (error)
|
|
|
|
return (error);
|
2002-10-20 21:44:31 +00:00
|
|
|
if (sin->sin_port != *lportp) {
|
|
|
|
/* Don't allow the port to change. */
|
|
|
|
if (*lportp != 0)
|
|
|
|
return (EINVAL);
|
|
|
|
lport = sin->sin_port;
|
|
|
|
}
|
|
|
|
/* NB: lport is left as 0 if the port isn't being changed. */
|
1994-05-24 10:09:53 +00:00
|
|
|
if (IN_MULTICAST(ntohl(sin->sin_addr.s_addr))) {
|
|
|
|
/*
|
|
|
|
* Treat SO_REUSEADDR as SO_REUSEPORT for multicast;
|
|
|
|
* allow complete duplication of binding if
|
|
|
|
* SO_REUSEPORT is set, or if SO_REUSEADDR is set
|
|
|
|
* and a multicast address is bound on both
|
|
|
|
* new and duplicated sockets.
|
|
|
|
*/
|
2013-07-12 19:08:33 +00:00
|
|
|
if ((so->so_options & (SO_REUSEADDR|SO_REUSEPORT)) != 0)
|
1994-05-24 10:09:53 +00:00
|
|
|
reuseport = SO_REUSEADDR|SO_REUSEPORT;
|
2018-06-06 15:45:57 +00:00
|
|
|
/*
|
|
|
|
* XXX: How to deal with SO_REUSEPORT_LB here?
|
|
|
|
* Treat same as SO_REUSEPORT for now.
|
|
|
|
*/
|
|
|
|
if ((so->so_options &
|
|
|
|
(SO_REUSEADDR|SO_REUSEPORT_LB)) != 0)
|
|
|
|
reuseport_lb = SO_REUSEADDR|SO_REUSEPORT_LB;
|
1994-05-24 10:09:53 +00:00
|
|
|
} else if (sin->sin_addr.s_addr != INADDR_ANY) {
|
|
|
|
sin->sin_port = 0; /* yech... */
|
2001-11-06 00:48:01 +00:00
|
|
|
bzero(&sin->sin_zero, sizeof(sin->sin_zero));
|
2009-01-09 17:16:18 +00:00
|
|
|
/*
|
2018-06-06 15:45:57 +00:00
|
|
|
* Is the address a local IP address?
|
2009-06-01 10:30:00 +00:00
|
|
|
* If INP_BINDANY is set, then the socket may be bound
|
2009-01-09 18:38:57 +00:00
|
|
|
* to any endpoint address, local or not.
|
2009-01-09 17:16:18 +00:00
|
|
|
*/
|
2009-06-01 10:30:00 +00:00
|
|
|
if ((inp->inp_flags & INP_BINDANY) == 0 &&
|
2018-06-06 15:45:57 +00:00
|
|
|
ifa_ifwithaddr_check((struct sockaddr *)sin) == 0)
|
1994-05-24 10:09:53 +00:00
|
|
|
return (EADDRNOTAVAIL);
|
|
|
|
}
|
2002-10-20 21:44:31 +00:00
|
|
|
laddr = sin->sin_addr;
|
1994-05-24 10:09:53 +00:00
|
|
|
if (lport) {
|
|
|
|
struct inpcb *t;
|
2006-04-04 12:26:07 +00:00
|
|
|
struct tcptw *tw;
|
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/* GROSS */
|
Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).
This is the first in a series of commits over the course
of the next few weeks.
Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.
We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.
Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch
2008-08-17 23:27:27 +00:00
|
|
|
if (ntohs(lport) <= V_ipport_reservedhigh &&
|
|
|
|
ntohs(lport) >= V_ipport_reservedlow &&
|
2018-12-11 19:32:16 +00:00
|
|
|
priv_check_cred(cred, PRIV_NETINET_RESERVEDPORT))
|
1995-09-21 17:55:49 +00:00
|
|
|
return (EACCES);
|
2006-06-27 11:35:53 +00:00
|
|
|
if (!IN_MULTICAST(ntohl(sin->sin_addr.s_addr)) &&
|
2018-12-11 19:32:16 +00:00
|
|
|
priv_check_cred(inp->inp_cred, PRIV_NETINET_REUSEPORT) != 0) {
|
2008-07-10 13:31:11 +00:00
|
|
|
t = in_pcblookup_local(pcbinfo, sin->sin_addr,
|
2008-11-29 14:32:14 +00:00
|
|
|
lport, INPLOOKUP_WILDCARD, cred);
|
2003-02-19 22:32:43 +00:00
|
|
|
/*
|
|
|
|
* XXX
|
|
|
|
* This entire block sorely needs a rewrite.
|
|
|
|
*/
|
2002-05-31 11:52:35 +00:00
|
|
|
if (t &&
|
2014-07-10 03:10:56 +00:00
|
|
|
((inp->inp_flags2 & INP_BINDMULTI) == 0) &&
|
2009-03-15 09:58:31 +00:00
|
|
|
((t->inp_flags & INP_TIMEWAIT) == 0) &&
|
2004-05-20 06:35:02 +00:00
|
|
|
(so->so_type != SOCK_STREAM ||
|
|
|
|
ntohl(t->inp_faddr.s_addr) == INADDR_ANY) &&
|
2002-05-31 11:52:35 +00:00
|
|
|
(ntohl(sin->sin_addr.s_addr) != INADDR_ANY ||
|
|
|
|
ntohl(t->inp_laddr.s_addr) != INADDR_ANY ||
|
2018-06-06 15:45:57 +00:00
|
|
|
(t->inp_flags2 & INP_REUSEPORT) ||
|
|
|
|
(t->inp_flags2 & INP_REUSEPORT_LB) == 0) &&
|
2008-10-04 15:06:34 +00:00
|
|
|
(inp->inp_cred->cr_uid !=
|
|
|
|
t->inp_cred->cr_uid))
|
2002-05-31 11:52:35 +00:00
|
|
|
return (EADDRINUSE);
|
2014-07-10 03:10:56 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If the socket is a BINDMULTI socket, then
|
|
|
|
* the credentials need to match and the
|
|
|
|
* original socket also has to have been bound
|
|
|
|
* with BINDMULTI.
|
|
|
|
*/
|
|
|
|
if (t && (! in_pcbbind_check_bindmulti(inp, t)))
|
|
|
|
return (EADDRINUSE);
|
1998-03-01 19:39:29 +00:00
|
|
|
}
|
Improved connection establishment performance by doing local port lookups via
a hashed port list. In the new scheme, in_pcblookup() goes away and is
replaced by a new routine, in_pcblookup_local() for doing the local port
check. Note that this implementation is space inefficient in that the PCB
struct is now too large to fit into 128 bytes. I might deal with this in the
future by using the new zone allocator, but I wanted these changes to be
extensively tested in their current form first.
Also:
1) Fixed off-by-one errors in the port lookup loops in in_pcbbind().
2) Got rid of some unneeded rehashing. Added a new routine, in_pcbinshash()
to do the initial hash insertion.
3) Renamed in_pcblookuphash() to in_pcblookup_hash() for easier readability.
4) Added a new routine, in_pcbremlists() to remove the PCB from the various
hash lists.
5) Added/deleted comments where appropriate.
6) Removed unnecessary splnet() locking. In general, the PCB functions should
be called at splnet()...there are unfortunately a few exceptions, however.
7) Reorganized a few structs for better cache line behavior.
8) Killed my TCP_ACK_HACK kludge. It may come back in a different form in
the future, however.
These changes have been tested on wcarchive for more than a month. In tests
done here, connection establishment overhead is reduced by more than 50
times, thus getting rid of one of the major networking scalability problems.
Still to do: make tcp_fasttimo/tcp_slowtimo scale well for systems with a
large number of connections. tcp_fasttimo is easy; tcp_slowtimo is difficult.
WARNING: Anything that knows about inpcb and tcpcb structs will have to be
recompiled; at the very least, this includes netstat(1).
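The hashed port list described above can be pictured with a small user-space sketch. This is not the kernel code: `struct pcb`, `struct port_table`, `PORT_HASH_SIZE` and the three helper functions are hypothetical names introduced only for illustration. It shows why a local port check becomes a short bucket walk instead of a scan over every PCB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PORT_HASH_SIZE 64	/* hypothetical size; must be a power of two */

/* Hypothetical stand-in for a PCB; only the fields this sketch needs. */
struct pcb {
	uint16_t	 lport;		/* local port (host byte order here) */
	struct pcb	*port_next;	/* next PCB in the same hash bucket */
};

struct port_table {
	struct pcb	*bucket[PORT_HASH_SIZE];
};

/* Map a port to a bucket index by masking off the low bits. */
static unsigned
port_hash(uint16_t lport)
{
	return (lport & (PORT_HASH_SIZE - 1));
}

/* Link a bound PCB at the head of the bucket for its local port. */
static void
port_insert(struct port_table *pt, struct pcb *p)
{
	unsigned h;

	assert(pt != NULL && p != NULL);
	h = port_hash(p->lport);
	p->port_next = pt->bucket[h];
	pt->bucket[h] = p;
}

/* Return a PCB bound to lport, or NULL if no PCB uses that port. */
static struct pcb *
port_lookup_local(const struct port_table *pt, uint16_t lport)
{
	struct pcb *p;

	for (p = pt->bucket[port_hash(lport)]; p != NULL; p = p->port_next)
		if (p->lport == lport)
			return (p);
	return (NULL);
}
```

Only the PCBs that collide in one bucket are visited, so the cost of the check depends on the bucket length rather than on the total number of connections.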
1998-01-27 09:15:13 +00:00
|
|
|
t = in_pcblookup_local(pcbinfo, sin->sin_addr,
|
2011-05-23 15:23:18 +00:00
|
|
|
lport, lookupflags, cred);
|
2009-03-15 09:58:31 +00:00
|
|
|
if (t && (t->inp_flags & INP_TIMEWAIT)) {
|
2006-04-04 12:26:07 +00:00
|
|
|
/*
|
|
|
|
* XXXRW: If an inpcb has had its timewait
|
|
|
|
* state recycled, we treat the address as
|
|
|
|
* being in use (for now). This is better
|
|
|
|
* than a panic, but not desirable.
|
|
|
|
*/
|
2011-11-06 09:17:48 +00:00
|
|
|
tw = intotw(t);
|
2006-04-04 12:26:07 +00:00
|
|
|
if (tw == NULL ||
|
2018-06-06 15:45:57 +00:00
|
|
|
((reuseport & tw->tw_so_options) == 0 &&
|
|
|
|
(reuseport_lb &
|
|
|
|
tw->tw_so_options) == 0)) {
|
2003-02-19 22:32:43 +00:00
|
|
|
return (EADDRINUSE);
|
2018-06-06 15:45:57 +00:00
|
|
|
}
|
2014-07-10 03:10:56 +00:00
|
|
|
} else if (t &&
|
2018-06-06 15:45:57 +00:00
|
|
|
((inp->inp_flags2 & INP_BINDMULTI) == 0) &&
|
|
|
|
(reuseport & inp_so_options(t)) == 0 &&
|
|
|
|
(reuseport_lb & inp_so_options(t)) == 0) {
|
2006-11-30 10:54:54 +00:00
|
|
|
#ifdef INET6
|
2002-05-31 11:52:35 +00:00
|
|
|
if (ntohl(sin->sin_addr.s_addr) !=
|
|
|
|
INADDR_ANY ||
|
|
|
|
ntohl(t->inp_laddr.s_addr) !=
|
|
|
|
INADDR_ANY ||
|
2011-11-06 10:47:20 +00:00
|
|
|
(inp->inp_vflag & INP_IPV6PROTO) == 0 ||
|
|
|
|
(t->inp_vflag & INP_IPV6PROTO) == 0)
|
2006-11-30 10:54:54 +00:00
|
|
|
#endif
|
2018-06-06 15:45:57 +00:00
|
|
|
return (EADDRINUSE);
|
2014-07-10 03:10:56 +00:00
|
|
|
if (t && (! in_pcbbind_check_bindmulti(inp, t)))
|
|
|
|
return (EADDRINUSE);
|
1999-12-07 17:39:16 +00:00
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
}
|
2002-10-20 21:44:31 +00:00
|
|
|
if (*lportp != 0)
|
|
|
|
lport = *lportp;
|
1996-02-22 21:32:23 +00:00
|
|
|
if (lport == 0) {
|
2017-02-10 05:58:16 +00:00
|
|
|
error = in_pcb_lport(inp, &laddr, &lport, cred, lookupflags);
|
2011-03-12 21:46:37 +00:00
|
|
|
if (error != 0)
|
|
|
|
return (error);
|
2008-03-04 19:16:21 +00:00
|
|
|
|
1996-02-22 21:32:23 +00:00
|
|
|
}
|
2002-10-20 21:44:31 +00:00
|
|
|
*laddrp = laddr.s_addr;
|
|
|
|
*lportp = lport;
|
1994-05-24 10:09:53 +00:00
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
1995-02-08 20:22:09 +00:00
|
|
|
/*
|
2002-10-21 13:55:50 +00:00
|
|
|
* Connect from a socket to a specified address.
|
|
|
|
* Both address and port must be specified in argument sin.
|
|
|
|
* If we don't have a local address for this socket yet,
|
|
|
|
* then pick one.
|
1995-02-08 20:22:09 +00:00
|
|
|
*/
|
2002-10-21 13:55:50 +00:00
|
|
|
int
|
Add _mbuf() variants of various inpcb-related interfaces, including lookup,
hash install, etc. For now, these arguments are unused, but as we add
RSS support, we will want to use hashes extracted from mbufs, rather than
manually calculated hashes of header fields, due to the expense of the
software version of Toeplitz (and similar hashes).
Add notes that it would be nice to be able to pass mbufs into lookup
routines in pf(4), optimising firewall lookup in the same way, but the
code structure there doesn't facilitate that currently.
(In principle there is no reason this couldn't be MFCed -- the change
extends rather than modifies the KBI. However, it won't be useful without
other previous possibly less MFCable changes.)
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
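The motivation for passing a packet into the lookup and hash routines can be sketched as follows. This is an illustration under stated assumptions, not the kernel interface: `struct pkt`, `struct tuple4`, `soft_hash()` and `conn_hash()` are hypothetical, and `soft_hash()` is a cheap multiplicative mix standing in for a software hash (it is not Toeplitz).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical packet header: a hash supplied by the hardware, if any. */
struct pkt {
	bool		has_flowid;
	uint32_t	flowid;
};

/* Hypothetical connection 4-tuple. */
struct tuple4 {
	uint32_t	laddr, faddr;
	uint16_t	lport, fport;
};

/* Stand-in for a software hash over the header fields; NOT Toeplitz. */
static uint32_t
soft_hash(const struct tuple4 *t)
{
	uint32_t h;

	h = t->laddr ^ t->faddr;
	h ^= ((uint32_t)t->lport << 16) | t->fport;
	h *= 2654435761u;
	return (h);
}

/*
 * Prefer the hash carried by the packet; fall back to computing one
 * in software when no packet is given or it carries no hash.
 */
static uint32_t
conn_hash(const struct pkt *m, const struct tuple4 *t)
{
	assert(t != NULL);
	if (m != NULL && m->has_flowid)
		return (m->flowid);
	return (soft_hash(t));
}
```

Callers without a packet pass `NULL`, which mirrors how a non-`_mbuf` wrapper can forward to the `_mbuf` variant.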
2011-06-04 16:33:06 +00:00
|
|
|
in_pcbconnect_mbuf(struct inpcb *inp, struct sockaddr *nam,
|
2020-01-12 17:52:32 +00:00
|
|
|
struct ucred *cred, struct mbuf *m, bool rehash)
|
2002-10-21 13:55:50 +00:00
|
|
|
{
|
|
|
|
u_short lport, fport;
|
|
|
|
in_addr_t laddr, faddr;
|
|
|
|
int anonport, error;
|
|
|
|
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK_ASSERT(inp);
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. The new lookup flags,
which supplement the existing INPLOOKUP_WILDCARD flag, are:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
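The rule that callers pass exactly one of the two locking flags can be checked with a single bit test. The sketch below is a user-space illustration; the flag names come from the message above, but the numeric values and the `INPLOOKUP_LOCKMASK` and `lookup_flags_valid()` names are assumptions made for this example and may not match the kernel headers.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder values for illustration only; not taken from the headers. */
#define INPLOOKUP_WILDCARD	0x01
#define INPLOOKUP_RLOCKPCB	0x02
#define INPLOOKUP_WLOCKPCB	0x04
#define INPLOOKUP_LOCKMASK	(INPLOOKUP_RLOCKPCB | INPLOOKUP_WLOCKPCB)

/* Return true when exactly one locking flag is present in flags. */
static bool
lookup_flags_valid(int flags)
{
	int lock;

	/* The lock bits must not overlap the wildcard bit. */
	assert((INPLOOKUP_LOCKMASK & INPLOOKUP_WILDCARD) == 0);

	lock = flags & INPLOOKUP_LOCKMASK;
	/* Exactly one bit set: non-zero and a power of two. */
	return (lock != 0 && (lock & (lock - 1)) == 0);
}
```

A lookup routine could assert this condition on entry, so that a caller passing neither flag or both flags is caught early.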
2011-05-30 09:43:55 +00:00
|
|
|
INP_HASH_WLOCK_ASSERT(inp->inp_pcbinfo);
|
2004-08-11 04:35:20 +00:00
|
|
|
|
2002-10-21 13:55:50 +00:00
|
|
|
lport = inp->inp_lport;
|
|
|
|
laddr = inp->inp_laddr.s_addr;
|
|
|
|
anonport = (lport == 0);
|
|
|
|
error = in_pcbconnect_setup(inp, nam, &laddr, &lport, &faddr, &fport,
|
2004-03-27 21:05:46 +00:00
|
|
|
NULL, cred);
|
2002-10-21 13:55:50 +00:00
|
|
|
if (error)
|
|
|
|
return (error);
|
|
|
|
|
|
|
|
/* Do the initial binding of the local address if required. */
|
|
|
|
if (inp->inp_laddr.s_addr == INADDR_ANY && inp->inp_lport == 0) {
|
2020-01-12 17:52:32 +00:00
|
|
|
KASSERT(rehash == true,
|
|
|
|
("Rehashing required for unbound inps"));
|
2002-10-21 13:55:50 +00:00
|
|
|
inp->inp_lport = lport;
|
|
|
|
inp->inp_laddr.s_addr = laddr;
|
|
|
|
if (in_pcbinshash(inp) != 0) {
|
|
|
|
inp->inp_laddr.s_addr = INADDR_ANY;
|
|
|
|
inp->inp_lport = 0;
|
|
|
|
return (EAGAIN);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Commit the remaining changes. */
|
|
|
|
inp->inp_lport = lport;
|
|
|
|
inp->inp_laddr.s_addr = laddr;
|
|
|
|
inp->inp_faddr.s_addr = faddr;
|
|
|
|
inp->inp_fport = fport;
|
2020-01-12 17:52:32 +00:00
|
|
|
if (rehash) {
|
|
|
|
in_pcbrehash_mbuf(inp, m);
|
|
|
|
} else {
|
|
|
|
in_pcbinshash_mbuf(inp, m);
|
|
|
|
}
|
2007-07-01 11:41:27 +00:00
|
|
|
|
2002-10-21 13:55:50 +00:00
|
|
|
if (anonport)
|
|
|
|
inp->inp_flags |= INP_ANONPORT;
|
|
|
|
return (0);
|
|
|
|
}
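The initial-binding step in the function above follows a set, insert, roll-back pattern: the local address and port are written first, and reset if the hash insertion fails. A minimal user-space sketch of that pattern follows; `struct conn`, `hash_insert()` and `bind_local()` are hypothetical, and the `fail` parameter exists only so the failure path can be exercised.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical connection state; zero means "unbound". */
struct conn {
	uint32_t	laddr;
	uint16_t	lport;
};

/* Hypothetical hash insertion that fails on request. */
static int
hash_insert(struct conn *c, bool fail)
{
	(void)c;
	return (fail ? -1 : 0);
}

/*
 * Tentatively bind the connection, then undo the binding if the
 * hash insertion does not succeed.
 */
static int
bind_local(struct conn *c, uint32_t laddr, uint16_t lport, bool fail)
{
	assert(c->laddr == 0 && c->lport == 0);

	c->laddr = laddr;
	c->lport = lport;
	if (hash_insert(c, fail) != 0) {
		/* Roll back so the caller still sees an unbound socket. */
		c->laddr = 0;
		c->lport = 0;
		return (EAGAIN);
	}
	return (0);
}
```

Restoring the fields on failure keeps the structure consistent with the hash tables, which never saw the connection.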
|
1995-02-08 20:22:09 +00:00
|
|
|
|
2011-06-04 16:33:06 +00:00
|
|
|
int
|
|
|
|
in_pcbconnect(struct inpcb *inp, struct sockaddr *nam, struct ucred *cred)
|
|
|
|
{
|
|
|
|
|
2020-01-12 17:52:32 +00:00
|
|
|
return (in_pcbconnect_mbuf(inp, nam, cred, NULL, true));
|
2011-06-04 16:33:06 +00:00
|
|
|
}
|
|
|
|
|
2008-10-03 12:21:21 +00:00
|
|
|
/*
|
|
|
|
* Do proper source address selection on an unbound socket in case
|
|
|
|
* of connect. Take jails into account as well.
|
|
|
|
*/
|
2014-04-24 12:52:31 +00:00
|
|
|
int
|
2008-10-03 12:21:21 +00:00
|
|
|
in_pcbladdr(struct inpcb *inp, struct in_addr *faddr, struct in_addr *laddr,
|
|
|
|
struct ucred *cred)
|
|
|
|
{
|
|
|
|
struct ifaddr *ifa;
|
|
|
|
struct sockaddr *sa;
|
2020-04-14 23:06:25 +00:00
|
|
|
struct sockaddr_in *sin, dst;
|
|
|
|
struct nhop_object *nh;
|
2008-10-03 12:21:21 +00:00
|
|
|
int error;
|
|
|
|
|
2020-01-22 06:10:41 +00:00
|
|
|
NET_EPOCH_ASSERT();
|
2008-11-29 14:32:14 +00:00
|
|
|
KASSERT(laddr != NULL, ("%s: laddr NULL", __func__));
|
2010-01-17 12:57:11 +00:00
|
|
|
/*
|
|
|
|
* Bypass source address selection and use the primary jail IP
|
|
|
|
* if requested.
|
|
|
|
*/
|
|
|
|
if (cred != NULL && !prison_saddrsel_ip4(cred, laddr))
|
|
|
|
return (0);
|
|
|
|
|
2008-10-03 12:21:21 +00:00
|
|
|
error = 0;
|
|
|
|
|
2020-04-14 23:06:25 +00:00
|
|
|
nh = NULL;
|
|
|
|
bzero(&dst, sizeof(dst));
|
|
|
|
sin = &dst;
|
2008-10-03 12:21:21 +00:00
|
|
|
sin->sin_family = AF_INET;
|
|
|
|
sin->sin_len = sizeof(struct sockaddr_in);
|
|
|
|
sin->sin_addr.s_addr = faddr->s_addr;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If route is known our src addr is taken from the i/f,
|
|
|
|
* else punt.
|
|
|
|
*
|
|
|
|
* Find out route to destination.
|
|
|
|
*/
|
|
|
|
if ((inp->inp_socket->so_options & SO_DONTROUTE) == 0)
|
2020-04-14 23:06:25 +00:00
|
|
|
nh = fib4_lookup(inp->inp_inc.inc_fibnum, *faddr,
|
|
|
|
0, NHR_NONE, 0);
|
2008-10-03 12:21:21 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If we found a route, use the address corresponding to
|
|
|
|
* the outgoing interface.
|
2020-02-12 13:31:36 +00:00
|
|
|
*
|
2008-10-03 12:21:21 +00:00
|
|
|
* Otherwise assume faddr is reachable on a directly connected
|
|
|
|
* network and try to find a corresponding interface to take
|
|
|
|
* the source address from.
|
|
|
|
*/
|
2020-04-14 23:06:25 +00:00
|
|
|
if (nh == NULL || nh->nh_ifp == NULL) {
|
2009-06-23 20:19:09 +00:00
|
|
|
struct in_ifaddr *ia;
|
2008-10-03 12:21:21 +00:00
|
|
|
struct ifnet *ifp;
|
|
|
|
|
2014-09-11 20:21:03 +00:00
|
|
|
ia = ifatoia(ifa_ifwithdstaddr((struct sockaddr *)sin,
|
2014-09-16 15:28:19 +00:00
|
|
|
inp->inp_socket->so_fibnum));
|
2018-05-23 21:02:14 +00:00
|
|
|
if (ia == NULL) {
|
2014-09-11 20:21:03 +00:00
|
|
|
ia = ifatoia(ifa_ifwithnet((struct sockaddr *)sin, 0,
|
2014-09-16 15:28:19 +00:00
|
|
|
inp->inp_socket->so_fibnum));
|
2018-05-23 21:02:14 +00:00
|
|
|
|
|
|
|
}
|
2008-10-03 12:21:21 +00:00
|
|
|
if (ia == NULL) {
|
|
|
|
error = ENETUNREACH;
|
|
|
|
goto done;
|
|
|
|
}
|
|
|
|
|
2009-05-27 14:11:23 +00:00
|
|
|
if (cred == NULL || !prison_flag(cred, PR_IP4)) {
|
2008-10-03 12:21:21 +00:00
|
|
|
laddr->s_addr = ia->ia_addr.sin_addr.s_addr;
|
|
|
|
goto done;
|
|
|
|
}
|
|
|
|
|
|
|
|
ifp = ia->ia_ifp;
|
|
|
|
ia = NULL;
|
2018-05-18 20:13:34 +00:00
|
|
|
CK_STAILQ_FOREACH(ifa, &ifp->if_addrhead, ifa_link) {
|
2008-10-03 12:21:21 +00:00
|
|
|
|
|
|
|
sa = ifa->ifa_addr;
|
|
|
|
if (sa->sa_family != AF_INET)
|
|
|
|
continue;
|
|
|
|
sin = (struct sockaddr_in *)sa;
|
2009-02-05 14:06:09 +00:00
|
|
|
if (prison_check_ip4(cred, &sin->sin_addr) == 0) {
|
2008-10-03 12:21:21 +00:00
|
|
|
ia = (struct in_ifaddr *)ifa;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (ia != NULL) {
|
|
|
|
laddr->s_addr = ia->ia_addr.sin_addr.s_addr;
|
|
|
|
goto done;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* 3. As a last resort return the 'default' jail address. */
|
2009-02-05 14:06:09 +00:00
|
|
|
error = prison_get_ip4(cred, laddr);
|
2008-10-03 12:21:21 +00:00
|
|
|
goto done;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If the outgoing interface on the route found is not
|
|
|
|
* a loopback interface, use the address from that interface.
|
|
|
|
* In case of jails do those three steps:
|
|
|
|
* 1. check if the interface address belongs to the jail. If so use it.
|
|
|
|
* 2. check if we have any address on the outgoing interface
|
|
|
|
* belonging to this jail. If so use it.
|
|
|
|
* 3. as a last resort return the 'default' jail address.
|
|
|
|
*/
|
2020-04-14 23:06:25 +00:00
|
|
|
if ((nh->nh_ifp->if_flags & IFF_LOOPBACK) == 0) {
|
2009-06-23 20:19:09 +00:00
|
|
|
struct in_ifaddr *ia;
|
2009-04-19 22:25:09 +00:00
|
|
|
struct ifnet *ifp;
|
2008-10-03 12:21:21 +00:00
|
|
|
|
|
|
|
/* If not jailed, use the default returned. */
|
2009-05-27 14:11:23 +00:00
|
|
|
if (cred == NULL || !prison_flag(cred, PR_IP4)) {
|
2020-04-14 23:06:25 +00:00
|
|
|
ia = (struct in_ifaddr *)nh->nh_ifa;
|
2008-10-03 12:21:21 +00:00
|
|
|
laddr->s_addr = ia->ia_addr.sin_addr.s_addr;
|
|
|
|
goto done;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Jailed. */
|
|
|
|
/* 1. Check if the iface address belongs to the jail. */
|
2020-04-14 23:06:25 +00:00
|
|
|
sin = (struct sockaddr_in *)nh->nh_ifa->ifa_addr;
|
2009-02-05 14:06:09 +00:00
|
|
|
if (prison_check_ip4(cred, &sin->sin_addr) == 0) {
|
2020-04-14 23:06:25 +00:00
|
|
|
ia = (struct in_ifaddr *)nh->nh_ifa;
|
2008-10-03 12:21:21 +00:00
|
|
|
laddr->s_addr = ia->ia_addr.sin_addr.s_addr;
|
|
|
|
goto done;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* 2. Check if we have any address on the outgoing interface
|
|
|
|
* belonging to this jail.
|
|
|
|
*/
|
2009-06-23 20:19:09 +00:00
|
|
|
ia = NULL;
|
2020-04-14 23:06:25 +00:00
|
|
|
ifp = nh->nh_ifp;
|
2018-05-18 20:13:34 +00:00
|
|
|
CK_STAILQ_FOREACH(ifa, &ifp->if_addrhead, ifa_link) {
|
2008-10-03 12:21:21 +00:00
|
|
|
sa = ifa->ifa_addr;
|
|
|
|
if (sa->sa_family != AF_INET)
|
|
|
|
continue;
|
|
|
|
sin = (struct sockaddr_in *)sa;
|
2009-02-05 14:06:09 +00:00
|
|
|
if (prison_check_ip4(cred, &sin->sin_addr) == 0) {
|
2008-10-03 12:21:21 +00:00
|
|
|
ia = (struct in_ifaddr *)ifa;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (ia != NULL) {
|
|
|
|
laddr->s_addr = ia->ia_addr.sin_addr.s_addr;
|
|
|
|
goto done;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* 3. As a last resort return the 'default' jail address. */
|
2009-02-05 14:06:09 +00:00
|
|
|
error = prison_get_ip4(cred, laddr);
|
2008-10-03 12:21:21 +00:00
|
|
|
goto done;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The outgoing interface is marked with 'loopback net', so a route
|
|
|
|
* to ourselves is here.
|
|
|
|
* Try to find the interface of the destination address and then
|
|
|
|
* take the address from there. That interface is not necessarily
|
|
|
|
* a loopback interface.
|
|
|
|
* In case of jails, check that it is an address of the jail
|
|
|
|
* and if we cannot find one, fall back to the 'default' jail address.
|
|
|
|
*/
|
2020-04-14 23:06:25 +00:00
|
|
|
if ((nh->nh_ifp->if_flags & IFF_LOOPBACK) != 0) {
|
2009-06-23 20:19:09 +00:00
|
|
|
struct in_ifaddr *ia;
|
2008-10-03 12:21:21 +00:00
|
|
|
|
2020-04-14 23:06:25 +00:00
|
|
|
ia = ifatoia(ifa_ifwithdstaddr(sintosa(&dst),
|
2014-09-16 15:28:19 +00:00
|
|
|
inp->inp_socket->so_fibnum));
|
2008-10-03 12:21:21 +00:00
|
|
|
if (ia == NULL)
|
2020-04-14 23:06:25 +00:00
|
|
|
ia = ifatoia(ifa_ifwithnet(sintosa(&dst), 0,
|
2014-09-16 15:28:19 +00:00
|
|
|
inp->inp_socket->so_fibnum));
|
2009-09-14 22:19:47 +00:00
|
|
|
if (ia == NULL)
|
2020-04-14 23:06:25 +00:00
|
|
|
ia = ifatoia(ifa_ifwithaddr(sintosa(&dst)));
|
2008-10-03 12:21:21 +00:00
|
|
|
|
2009-05-27 14:11:23 +00:00
|
|
|
if (cred == NULL || !prison_flag(cred, PR_IP4)) {
|
2008-10-03 12:21:21 +00:00
|
|
|
if (ia == NULL) {
|
|
|
|
error = ENETUNREACH;
|
|
|
|
goto done;
|
|
|
|
}
|
|
|
|
laddr->s_addr = ia->ia_addr.sin_addr.s_addr;
|
|
|
|
goto done;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Jailed. */
|
|
|
|
if (ia != NULL) {
|
|
|
|
struct ifnet *ifp;
|
|
|
|
|
|
|
|
ifp = ia->ia_ifp;
|
|
|
|
ia = NULL;
|
2018-05-18 20:13:34 +00:00
|
|
|
CK_STAILQ_FOREACH(ifa, &ifp->if_addrhead, ifa_link) {
|
2008-10-03 12:21:21 +00:00
|
|
|
sa = ifa->ifa_addr;
|
|
|
|
if (sa->sa_family != AF_INET)
|
|
|
|
continue;
|
|
|
|
sin = (struct sockaddr_in *)sa;
|
2009-02-05 14:06:09 +00:00
|
|
|
if (prison_check_ip4(cred,
|
|
|
|
&sin->sin_addr) == 0) {
|
2008-10-03 12:21:21 +00:00
|
|
|
ia = (struct in_ifaddr *)ifa;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (ia != NULL) {
|
|
|
|
laddr->s_addr = ia->ia_addr.sin_addr.s_addr;
|
|
|
|
goto done;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/* 3. As a last resort return the 'default' jail address. */
|
2009-02-05 14:06:09 +00:00
|
|
|
error = prison_get_ip4(cred, laddr);
|
2008-10-03 12:21:21 +00:00
|
|
|
goto done;
|
|
|
|
}
|
|
|
|
|
|
|
|
done:
|
|
|
|
return (error);
|
|
|
|
}
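The three-step choice for jailed sockets described in the comments above can be condensed into a small user-space sketch. It assumes a simplified model: `struct jail4`, `jail_owns()` and `jail_pick_src()` are hypothetical, addresses are plain integers, and the first address in the jail's list plays the role of the 'default' jail address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical jail: permitted IPv4 addresses; the first is the default. */
struct jail4 {
	const uint32_t	*addrs;
	size_t		 naddrs;
};

/* Return true if address a belongs to jail j. */
static bool
jail_owns(const struct jail4 *j, uint32_t a)
{
	for (size_t i = 0; i < j->naddrs; i++)
		if (j->addrs[i] == a)
			return (true);
	return (false);
}

/*
 * route_addr: source address suggested by the route lookup.
 * if_addrs:   all IPv4 addresses on the outgoing interface.
 * Returns 0 and fills *src, or -1 if the jail has no IPv4 address.
 */
static int
jail_pick_src(const struct jail4 *j, uint32_t route_addr,
    const uint32_t *if_addrs, size_t nif, uint32_t *src)
{
	assert(j != NULL && src != NULL);

	/* 1. Use the route's interface address if the jail owns it. */
	if (jail_owns(j, route_addr)) {
		*src = route_addr;
		return (0);
	}
	/* 2. Otherwise use any jail address on the outgoing interface. */
	for (size_t i = 0; i < nif; i++) {
		if (jail_owns(j, if_addrs[i])) {
			*src = if_addrs[i];
			return (0);
		}
	}
	/* 3. As a last resort, use the jail's default address. */
	if (j->naddrs == 0)
		return (-1);
	*src = j->addrs[0];
	return (0);
}
```

The ordering matters: an address the routing table already prefers wins, and the default is used only when the outgoing interface carries no address of the jail.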
|
|
|
|
|
2002-10-21 13:55:50 +00:00
|
|
|
/*
|
|
|
|
* Set up for a connect from a socket to the specified address.
|
|
|
|
* On entry, *laddrp and *lportp should contain the current local
|
|
|
|
* address and port for the PCB; these are updated to the values
|
|
|
|
* that should be placed in inp_laddr and inp_lport to complete
|
|
|
|
* the connect.
|
|
|
|
*
|
|
|
|
* On success, *faddrp and *fportp will be set to the remote address
|
|
|
|
* and port. These are not updated in the error case.
|
|
|
|
*
|
|
|
|
* If the operation fails because the connection already exists,
|
|
|
|
* *oinpp will be set to the PCB of that connection so that the
|
|
|
|
* caller can decide to override it. In all other cases, *oinpp
|
|
|
|
* is set to NULL.
|
|
|
|
*/
|
1995-02-08 20:22:09 +00:00
|
|
|
int
|
2006-01-22 01:16:25 +00:00
|
|
|
in_pcbconnect_setup(struct inpcb *inp, struct sockaddr *nam,
|
|
|
|
in_addr_t *laddrp, u_short *lportp, in_addr_t *faddrp, u_short *fportp,
|
|
|
|
struct inpcb **oinpp, struct ucred *cred)
|
1995-02-08 20:22:09 +00:00
|
|
|
{
|
2015-07-29 08:12:05 +00:00
|
|
|
struct rm_priotracker in_ifa_tracker;
|
2002-10-21 13:55:50 +00:00
|
|
|
struct sockaddr_in *sin = (struct sockaddr_in *)nam;
|
1994-05-24 10:09:53 +00:00
|
|
|
struct in_ifaddr *ia;
|
2002-10-21 13:55:50 +00:00
|
|
|
struct inpcb *oinp;
|
2009-02-05 14:06:09 +00:00
|
|
|
struct in_addr laddr, faddr;
|
2002-10-21 13:55:50 +00:00
|
|
|
u_short lport, fport;
|
|
|
|
int error;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2008-04-17 21:38:18 +00:00
|
|
|
/*
|
|
|
|
* Because a global state change doesn't actually occur here, a read
|
|
|
|
* lock is sufficient.
|
|
|
|
*/
|
2020-01-22 06:10:41 +00:00
|
|
|
NET_EPOCH_ASSERT();
|
2004-08-11 04:35:20 +00:00
|
|
|
INP_LOCK_ASSERT(inp);
|
2011-05-30 09:43:55 +00:00
|
|
|
INP_HASH_LOCK_ASSERT(inp->inp_pcbinfo);
|
2004-08-11 04:35:20 +00:00
|
|
|
|
2002-10-21 13:55:50 +00:00
|
|
|
if (oinpp != NULL)
|
|
|
|
*oinpp = NULL;
|
1997-08-16 19:16:27 +00:00
|
|
|
if (nam->sa_len != sizeof (*sin))
|
1994-05-24 10:09:53 +00:00
|
|
|
return (EINVAL);
|
|
|
|
if (sin->sin_family != AF_INET)
|
|
|
|
return (EAFNOSUPPORT);
|
|
|
|
if (sin->sin_port == 0)
|
|
|
|
return (EADDRNOTAVAIL);
|
2002-10-21 13:55:50 +00:00
|
|
|
laddr.s_addr = *laddrp;
|
|
|
|
lport = *lportp;
|
|
|
|
faddr = sin->sin_addr;
|
|
|
|
fport = sin->sin_port;
|
2008-10-03 12:21:21 +00:00
|
|
|
|
2018-05-18 20:13:34 +00:00
|
|
|
if (!CK_STAILQ_EMPTY(&V_in_ifaddrhead)) {
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
|
|
|
* If the destination address is INADDR_ANY,
|
|
|
|
* use the primary local address.
|
|
|
|
* If the supplied address is INADDR_BROADCAST,
|
|
|
|
* and the primary interface supports broadcast,
|
|
|
|
* choose the broadcast address for that interface.
|
|
|
|
*/
|
MFp4:
Bring in updated jail support from bz_jail branch.
This enhances the current jail implementation to permit multiple
addresses per jail. In addtion to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..
SCTP support was updated and supports IPv6 in jails as well.
Cpuset support permits jails to be bound to specific processor
sets after creation.
Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.
DDB 'show jails' command was added to aid debugging.
Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.
Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around over the last few years.
Bump __FreeBSD_version for the aforementioned and in-kernel changes.
Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.
Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
|
|
|
if (faddr.s_addr == INADDR_ANY) {
|
2015-07-29 08:12:05 +00:00
|
|
|
IN_IFADDR_RLOCK(&in_ifa_tracker);
|
2009-02-05 14:06:09 +00:00
|
|
|
faddr =
|
2018-05-18 20:13:34 +00:00
|
|
|
IA_SIN(CK_STAILQ_FIRST(&V_in_ifaddrhead))->sin_addr;
|
2015-07-29 08:12:05 +00:00
|
|
|
IN_IFADDR_RUNLOCK(&in_ifa_tracker);
|
2009-02-05 14:06:09 +00:00
|
|
|
if (cred != NULL &&
|
|
|
|
(error = prison_get_ip4(cred, &faddr)) != 0)
|
|
|
|
return (error);
|
2009-06-25 11:52:33 +00:00
|
|
|
} else if (faddr.s_addr == (u_long)INADDR_BROADCAST) {
|
2015-07-29 08:12:05 +00:00
|
|
|
IN_IFADDR_RLOCK(&in_ifa_tracker);
|
2018-05-18 20:13:34 +00:00
|
|
|
if (CK_STAILQ_FIRST(&V_in_ifaddrhead)->ia_ifp->if_flags &
|
2009-06-25 11:52:33 +00:00
|
|
|
IFF_BROADCAST)
|
2018-05-18 20:13:34 +00:00
|
|
|
faddr = satosin(&CK_STAILQ_FIRST(
|
2009-06-25 11:52:33 +00:00
|
|
|
&V_in_ifaddrhead)->ia_broadaddr)->sin_addr;
|
2015-07-29 08:12:05 +00:00
|
|
|
IN_IFADDR_RUNLOCK(&in_ifa_tracker);
|
2009-06-25 11:52:33 +00:00
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
2002-10-21 13:55:50 +00:00
|
|
|
if (laddr.s_addr == INADDR_ANY) {
|
2011-01-08 22:33:46 +00:00
|
|
|
error = in_pcbladdr(inp, &faddr, &laddr, cred);
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
|
|
|
* If the destination address is multicast and an outgoing
|
2011-01-08 22:33:46 +00:00
|
|
|
* interface has been set as a multicast option, prefer the
|
1994-05-24 10:09:53 +00:00
|
|
|
* address of that interface as our source address.
|
|
|
|
*/
|
2002-10-21 13:55:50 +00:00
|
|
|
if (IN_MULTICAST(ntohl(faddr.s_addr)) &&
|
1994-05-24 10:09:53 +00:00
|
|
|
inp->inp_moptions != NULL) {
|
|
|
|
struct ip_moptions *imo;
|
|
|
|
struct ifnet *ifp;
|
|
|
|
|
|
|
|
imo = inp->inp_moptions;
|
|
|
|
if (imo->imo_multicast_ifp != NULL) {
|
|
|
|
ifp = imo->imo_multicast_ifp;
|
2015-07-29 08:12:05 +00:00
|
|
|
IN_IFADDR_RLOCK(&in_ifa_tracker);
|
2018-05-18 20:13:34 +00:00
|
|
|
CK_STAILQ_FOREACH(ia, &V_in_ifaddrhead, ia_link) {
|
2011-01-26 17:31:03 +00:00
|
|
|
if ((ia->ia_ifp == ifp) &&
|
|
|
|
(cred == NULL ||
|
|
|
|
prison_check_ip4(cred,
|
|
|
|
&ia->ia_addr.sin_addr) == 0))
|
1994-05-24 10:09:53 +00:00
|
|
|
break;
|
2011-01-26 17:31:03 +00:00
|
|
|
}
|
|
|
|
if (ia == NULL)
|
2011-01-08 22:33:46 +00:00
|
|
|
error = EADDRNOTAVAIL;
|
2011-01-26 17:31:03 +00:00
|
|
|
else {
|
2011-01-08 22:33:46 +00:00
|
|
|
laddr = ia->ia_addr.sin_addr;
|
|
|
|
error = 0;
|
2009-06-25 11:52:33 +00:00
|
|
|
}
|
2015-07-29 08:12:05 +00:00
|
|
|
IN_IFADDR_RUNLOCK(&in_ifa_tracker);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
}
|
2011-01-08 22:33:46 +00:00
|
|
|
if (error)
|
|
|
|
return (error);
|
1995-02-08 20:22:09 +00:00
|
|
|
}
|
2020-05-18 22:53:12 +00:00
|
|
|
if (lport != 0) {
|
|
|
|
oinp = in_pcblookup_hash_locked(inp->inp_pcbinfo, faddr,
|
|
|
|
fport, laddr, lport, 0, NULL);
|
|
|
|
if (oinp != NULL) {
|
|
|
|
if (oinpp != NULL)
|
|
|
|
*oinpp = oinp;
|
|
|
|
return (EADDRINUSE);
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
struct sockaddr_in lsin, fsin;
|
|
|
|
|
|
|
|
bzero(&lsin, sizeof(lsin));
|
|
|
|
bzero(&fsin, sizeof(fsin));
|
|
|
|
lsin.sin_family = AF_INET;
|
|
|
|
lsin.sin_addr = laddr;
|
|
|
|
fsin.sin_family = AF_INET;
|
|
|
|
fsin.sin_addr = faddr;
|
|
|
|
error = in_pcb_lport_dest(inp, (struct sockaddr *)&lsin,
|
|
|
|
&lport, (struct sockaddr *)&fsin, fport, cred,
|
|
|
|
INPLOOKUP_WILDCARD);
|
2002-10-21 13:55:50 +00:00
|
|
|
if (error)
|
|
|
|
return (error);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
2002-10-21 13:55:50 +00:00
|
|
|
*laddrp = laddr.s_addr;
|
|
|
|
*lportp = lport;
|
|
|
|
*faddrp = faddr.s_addr;
|
|
|
|
*fportp = fport;
|
1994-05-24 10:09:53 +00:00
|
|
|
return (0);
|
|
|
|
}
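The argument checks at the top of the function that ends above (address length, then address family, then port) can be sketched in user space as follows. This is only an illustration under stated assumptions: check_connect_addr() is a hypothetical helper that does not exist in the kernel, and the length is passed explicitly because sa_len is not available on every platform.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

/*
 * Hypothetical helper (not a kernel function): validate a connect()
 * destination in the same order as the checks above.
 */
int
check_connect_addr(const struct sockaddr *nam, socklen_t len)
{
	struct sockaddr_in sin;

	/* The caller must supply exactly one IPv4 socket address. */
	if (len != sizeof(sin))
		return (EINVAL);
	/* Copy out rather than cast, to avoid alignment and aliasing issues. */
	memcpy(&sin, nam, sizeof(sin));
	if (sin.sin_family != AF_INET)
		return (EAFNOSUPPORT);
	/* Port 0 is never a valid destination port. */
	if (sin.sin_port == 0)
		return (EADDRNOTAVAIL);
	return (0);
}
```

The order matters: the length is checked first so that the family and port fields are only read once the buffer is known to be large enough.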
|
|
|
|
|
1994-05-25 09:21:21 +00:00
|
|
|
void
|
2006-01-22 01:16:25 +00:00
|
|
|
in_pcbdisconnect(struct inpcb *inp)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
2005-06-01 11:39:42 +00:00
|
|
|
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK_ASSERT(inp);
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need to acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags,
supplementing the existing INPLOOKUP_WILDCARD flag, are:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilize 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
|
|
|
INP_HASH_WLOCK_ASSERT(inp->inp_pcbinfo);
|
1994-05-24 10:09:53 +00:00
|
|
|
|
|
|
|
inp->inp_faddr.s_addr = INADDR_ANY;
|
|
|
|
inp->inp_fport = 0;
|
1995-04-09 01:29:31 +00:00
|
|
|
in_pcbrehash(inp);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
2012-01-22 02:13:19 +00:00
|
|
|
#endif /* INET */
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2006-04-01 16:04:42 +00:00
|
|
|
/*
|
Add a reference count to struct inpcb, which may be explicitly
incremented using in_pcbref(), and decremented using in_pcbfree()
or in_pcbrele(). Protocols using only current in_pcballoc() and
in_pcbfree() calls will see the same semantics, but it is now
possible for TCP to call in_pcbref() and in_pcbrele() to prevent
an inpcb from being freed when both tcbinfo and per-inpcb locks
are released. This makes it possible to safely transition from
holding only the inpcb lock to both tcbinfo and inpcb lock
without re-looking up a connection in the input path, timer
path, etc.
Notice that in_pcbrele() does not unlock the connection after
decrementing the refcount, if the connection remains, so that
the caller can continue to use it; in_pcbrele() returns a flag
indicating whether or not the inpcb pointer is still valid, and
in_pcbfree() is now a simple wrapper around in_pcbrele().
MFC after: 1 month
Discussed with: bz, kmacy
Reviewed by: bz, gnn, kmacy
Tested by: kmacy
2008-12-08 20:18:50 +00:00
|
|
|
* in_pcbdetach() is responsible for disassociating a socket from an inpcb.
|
2008-09-29 13:50:17 +00:00
|
|
|
* For most protocols, this will be invoked immediately prior to calling
|
2008-12-08 20:18:50 +00:00
|
|
|
* in_pcbfree(). However, with TCP the inpcb may significantly outlive the
|
|
|
|
* socket, in which case in_pcbfree() is deferred.
|
2006-04-01 16:04:42 +00:00
|
|
|
*/
|
1994-05-25 09:21:21 +00:00
|
|
|
void
|
2006-01-22 01:16:25 +00:00
|
|
|
in_pcbdetach(struct inpcb *inp)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
2006-04-01 16:04:42 +00:00
|
|
|
|
2008-11-26 12:54:31 +00:00
|
|
|
KASSERT(inp->inp_socket != NULL, ("%s: inp_socket == NULL", __func__));
|
2008-09-29 13:50:17 +00:00
|
|
|
|
Implement kernel support for hardware rate limited sockets.
- Add RATELIMIT kernel configuration keyword which must be set to
enable the new functionality.
- Add support for hardware driven, Receive Side Scaling, RSS aware, rate
limited sendqueues and expose the functionality through the already
established SO_MAX_PACING_RATE setsockopt(). The API supports rates in
the range from 1 to 4Gbytes/s which are suitable for regular TCP and
UDP streams. The setsockopt(2) manual page has been updated.
- Add rate limit function callback API to "struct ifnet" which supports
the following operations: if_snd_tag_alloc(), if_snd_tag_modify(),
if_snd_tag_query() and if_snd_tag_free().
- Add support to ifconfig to view, set and clear the IFCAP_TXRTLMT
flag, which tells if a network driver supports rate limiting or not.
- This patch also adds support for rate limiting through VLAN and LAGG
intermediate network devices.
- How rate limiting works:
1) The userspace application calls setsockopt() after accepting or
making a new connection to set the rate which is then stored in the
socket structure in the kernel. Later on when packets are transmitted
a check is made in the transmit path for rate changes. A rate change
implies a non-blocking ifp->if_snd_tag_alloc() call will be made to the
destination network interface, which then sets up a custom sendqueue
with the given rate limitation parameter. A "struct m_snd_tag" pointer is
returned which serves as a "snd_tag" hint in the m_pkthdr for the
subsequently transmitted mbufs.
2) When the network driver sees the "m->m_pkthdr.snd_tag" different
from NULL, it will move the packets into a designated rate limited sendqueue
given by the snd_tag pointer. How the rate-limited traffic is actually paced
is up to the individual driver.
3) Route changes are detected by the NIC drivers in the ifp->if_transmit()
routine when the ifnet pointer in the incoming snd_tag mismatches the
one of the network interface. The network adapter frees the mbuf and
returns EAGAIN which causes the ip_output() to release and clear the send
tag. On the next ip_output() call, allocation of a new "snd_tag" is attempted.
4) When the PCB is detached the custom sendqueue will be released by a
non-blocking ifp->if_snd_tag_free() call to the currently bound network
interface.
Reviewed by: wblock (manpages), adrian, gallatin, scottl (network)
Differential Revision: https://reviews.freebsd.org/D3687
Sponsored by: Mellanox Technologies
MFC after: 3 months
2017-01-18 13:31:17 +00:00
|
|
|
#ifdef RATELIMIT
|
|
|
|
if (inp->inp_snd_tag != NULL)
|
|
|
|
in_pcbdetach_txrtlmt(inp);
|
|
|
|
#endif
|
2006-04-01 16:04:42 +00:00
|
|
|
inp->inp_socket->so_pcb = NULL;
|
|
|
|
inp->inp_socket = NULL;
|
|
|
|
}
|
|
|
|
|
2008-12-08 20:18:50 +00:00
|
|
|
/*
|
|
|
|
* in_pcbref() bumps the reference count on an inpcb in order to maintain
|
|
|
|
* stability of an inpcb pointer despite the inpcb lock being released. This
|
|
|
|
* is used in TCP when the inpcbinfo lock needs to be acquired or upgraded,
|
Implement a CPU-affine TCP and UDP connection lookup data structure,
struct inpcbgroup. pcbgroups, or "connection groups", supplement the
existing inpcbinfo connection hash table, which when pcbgroups are
enabled, might now be thought of more usefully as a per-protocol
4-tuple reservation table.
Connections are assigned to connection groups based on a hash of their
4-tuple; wildcard sockets require special handling, and are members
of all connection groups. During a connection lookup, a
per-connection group lock is employed rather than the global pcbinfo
lock. By aligning connection groups with input path processing,
connection groups take on an effective CPU affinity, especially when
aligned with RSS work placement (see a forthcoming commit for
details). This eliminates cache line migration associated with
global, protocol-layer data structures in steady state TCP and UDP
processing (with the exception of protocol-layer statistics; further
commit to follow).
Elements of this approach were inspired by Willman, Rixner, and Cox's
2006 USENIX paper, "An Evaluation of Network Stack Parallelization
Strategies in Modern Operating Systems". However, there are also
significant differences: we maintain the inpcb lock, rather than using
the connection group lock for per-connection state.
Likewise, the focus of this implementation is alignment with NIC
packet distribution strategies such as RSS, rather than pure software
strategies. Despite that focus, software distribution is supported
through the parallel netisr implementation, and works well in
configurations where the number of hardware threads is greater than
the number of NIC input queues, such as in the RMI XLR threaded MIPS
architecture.
Another important difference is the continued maintenance of existing
hash tables as "reservation tables" -- these are useful both to
distinguish the resource allocation aspect of protocol name management
and the more common-case lookup aspect. In configurations where
connection tables are aligned with hardware hashes, it is desirable to
use the traditional lookup tables for loopback or encapsulated traffic
rather than take the expense of hardware hashes that are hard to
implement efficiently in software (such as RSS Toeplitz).
Connection group support is enabled by compiling "options PCBGROUP"
into your kernel configuration; for the time being, this is an
experimental feature, and hence is not enabled by default.
Subject to the limited MFCability of change dependencies in inpcb,
and its change to the inpcbinfo init function signature, this change
in principle could be merged to FreeBSD 8.x.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-06-06 12:55:02 +00:00
|
|
|
* but where the inpcb lock may already be held, or when acquiring a reference
|
|
|
|
* via a pcbgroup.
|
2008-12-08 20:18:50 +00:00
|
|
|
*
|
Continue to refine inpcb reference counting and locking, in preparation for
reworking of inpcbinfo locking:
(1) Convert inpcb reference counting from manually manipulated integers to
the refcount(9) KPI. This allows the refcount to be managed atomically
with an inpcb read lock rather than write lock, or even with no inpcb
lock at all. As a result, in_pcbref() also no longer requires an inpcb
lock, so can be performed solely using the lock used to look up an
inpcb.
(2) Shift more inpcb freeing activity from the in_pcbrele() context (via
in_pcbfree_internal) to the explicit in_pcbfree() context. This means
that the inpcb refcount is increasingly used only to maintain memory
stability, not actually defer the clean up of inpcb protocol parts.
This is desirable as many of those protocol parts required the pcbinfo
lock, which we'd like not to acquire in in_pcbrele() contexts. Document
this in comments better.
(3) Introduce new read-locked and write-locked in_pcbrele() variations,
in_pcbrele_rlocked() and in_pcbrele_wlocked(), which allow the inpcb to
be properly unlocked as needed. in_pcbrele() is a wrapper around the
latter, and should probably go away at some point. This makes it
easier to use this weak reference model when holding only a read lock,
as will happen in the future.
This may well be safe to MFC, but some more KBI analysis is required.
Reviewed by: bz
MFC after: 3 weeks
Sponsored by: Juniper Networks, Inc.
2011-05-23 19:32:02 +00:00
|
|
|
* in_pcbref() should be used only to provide brief memory stability, and
|
|
|
|
* must always be followed by a call to INP_WLOCK() and in_pcbrele() to
|
|
|
|
* garbage collect the inpcb if it has been in_pcbfree()'d from another
|
|
|
|
* context. Until in_pcbrele() has returned that the inpcb is still valid,
|
|
|
|
* lock and rele are the *only* safe operations that may be performed on the
|
|
|
|
* inpcb.
|
|
|
|
*
|
2008-12-08 20:18:50 +00:00
|
|
|
* While the inpcb will not be freed, releasing the inpcb lock means that the
|
|
|
|
* connection's state may change, so the caller should be careful to
|
|
|
|
* revalidate any cached state on reacquiring the lock. Drop the reference
|
|
|
|
* using in_pcbrele().
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
in_pcbref(struct inpcb *inp)
|
|
|
|
{
|
|
|
|
|
|
|
|
KASSERT(inp->inp_refcount > 0, ("%s: refcount 0", __func__));
|
|
|
|
|
2011-05-23 19:32:02 +00:00
|
|
|
refcount_acquire(&inp->inp_refcount);
|
2008-12-08 20:18:50 +00:00
|
|
|
}
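The weak-reference pattern described in the in_pcbref() comment above can be sketched in user space. This is a minimal illustration, not the kernel implementation: struct pcb, pcb_alloc(), pcb_ref(), pcb_rele() and pcb_free() are hypothetical names, C11 atomics stand in for refcount(9), and all locking is omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct inpcb; only the refcount is modeled. */
struct pcb {
	atomic_uint refcount;
};

/* Allocate an object holding one reference, owned by the creator. */
struct pcb *
pcb_alloc(void)
{
	struct pcb *p = malloc(sizeof(*p));

	if (p != NULL)
		atomic_init(&p->refcount, 1);
	return (p);
}

/* Take an extra reference; the caller must already know the object is live. */
void
pcb_ref(struct pcb *p)
{
	unsigned int old = atomic_fetch_add(&p->refcount, 1);

	assert(old > 0);
	(void)old;
}

/*
 * Drop one reference.  Returns true if this was the last reference and the
 * memory was released, in which case the pointer must not be used again.
 */
bool
pcb_rele(struct pcb *p)
{
	unsigned int old = atomic_fetch_sub(&p->refcount, 1);

	assert(old > 0);
	if (old != 1)
		return (false);
	free(p);
	return (true);
}

/* The owner's "free" only drops the owner's reference. */
bool
pcb_free(struct pcb *p)
{
	return (pcb_rele(p));
}
```

In this sketch a lookup path would call pcb_ref() while it still holds whatever lock made the lookup safe, drop that lock, and later call pcb_rele(). A true return means a concurrent pcb_free() already ran and the pointer is gone, so the caller must not touch the object again.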
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Drop a refcount on an inpcb elevated using in_pcbref(); because a call to
|
|
|
|
* in_pcbfree() may have been made between in_pcbref() and in_pcbrele(), we
|
|
|
|
* return a flag indicating whether or not the inpcb remains valid. If it is
|
|
|
|
* valid, we return with the inpcb lock held.
|
2011-05-23 19:32:02 +00:00
|
|
|
*
|
|
|
|
* Notice that, unlike in_pcbref(), the inpcb lock must be held to drop a
|
|
|
|
* reference on an inpcb. Historically more work was done here (actually, in
|
|
|
|
* in_pcbfree_internal()) but has been moved to in_pcbfree() to avoid the
|
|
|
|
* need for the pcbinfo lock in in_pcbrele(). Deferring the free is entirely
|
|
|
|
* about memory stability (and continued use of the write lock).
|
2008-12-08 20:18:50 +00:00
|
|
|
*/
|
|
|
|
int
|
2011-05-23 19:32:02 +00:00
|
|
|
in_pcbrele_rlocked(struct inpcb *inp)
|
2008-12-08 20:18:50 +00:00
|
|
|
{
|
2011-05-23 19:32:02 +00:00
|
|
|
struct inpcbinfo *pcbinfo;
|
|
|
|
|
|
|
|
KASSERT(inp->inp_refcount > 0, ("%s: refcount 0", __func__));
|
|
|
|
|
|
|
|
INP_RLOCK_ASSERT(inp);
|
|
|
|
|
There is a complex race in in_pcblookup_hash() and in_pcblookup_group().
Both functions need to obtain a lock on the found PCB, and they can't do
classic inter-lock with the PCB hash lock, due to lock order reversal.
To keep the PCB stable, these functions put a reference on it and after PCB
lock is acquired drop it. If the reference was the last one, this means
we've raced with in_pcbfree() and the PCB is no longer valid.
This approach works okay only if we are acquiring writer-lock on the PCB.
In case of reader-lock, the following scenario can happen:
- 2 threads locate pcb, and do in_pcbref() on it.
- These 2 threads drop the inp hash lock.
- Another thread comes to delete pcb via in_pcbfree(), it obtains hash lock,
does in_pcbremlists(), drops hash lock, and runs in_pcbrele_wlocked(), which
doesn't free the pcb due to two references on it. Then it unlocks the pcb.
- 2 aforementioned threads acquire reader lock on the pcb and run
in_pcbrele_rlocked(). One gets 0 from in_pcbrele_rlocked() and continues,
second gets 1 and considers pcb freed, returns.
- The thread that got 0 continues working with detached pcb, which later
leads to panic in the underlying protocol level.
To fix that problem, an additional inpcb flag is introduced: INP_FREED. We
check for that flag in the in_pcbrele_rlocked() and if it is set, we pretend
that that was the last reference.
Discussed with: rwatson, jhb
Reported by: Vladimir Medvedkin <medved rambler-co.ru>
2012-10-02 12:03:02 +00:00
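The effect of the INP_FREED check can be reproduced in a single-threaded userland sketch. Everything here is an assumption made for illustration: `struct pcb`, `PCB_FREED`, `pcb_free()` and `pcb_rele()` are hypothetical analogues of the kernel names, the locks are left out, and the interleaving of the three threads is replayed sequentially by the caller.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

#define PCB_FREED 0x1u	/* hypothetical analogue of INP_FREED */

struct pcb {
	atomic_uint  refcount;
	unsigned int flags;	/* protected by the inpcb lock in the kernel */
};

static struct pcb *
pcb_alloc(void)
{
	struct pcb *p = malloc(sizeof(*p));

	if (p == NULL)
		return (NULL);
	atomic_init(&p->refcount, 1);	/* the owner's reference */
	p->flags = 0;
	return (p);
}

/* Analogue of in_pcbref(): temporary reference taken by a lookup. */
static void
pcb_ref(struct pcb *p)
{
	atomic_fetch_add(&p->refcount, 1);
}

/*
 * Analogue of in_pcbrele_rlocked(): returns true when the caller must stop
 * using p, either because the memory was released or because the pcb was
 * already marked freed by its owner.
 */
static bool
pcb_rele(struct pcb *p)
{
	bool freed = (p->flags & PCB_FREED) != 0;	/* read before dropping */

	if (atomic_fetch_sub(&p->refcount, 1) != 1)
		return (freed);	/* not the last reference: report the flag */
	free(p);
	return (true);
}

/* Analogue of in_pcbfree(): mark the pcb detached, then drop the owner's ref. */
static bool
pcb_free(struct pcb *p)
{
	p->flags |= PCB_FREED;
	if (atomic_fetch_sub(&p->refcount, 1) != 1)
		return (false);	/* lookups still hold references */
	free(p);
	return (true);
}
```

Without the flag test, the first of two racing lookups would be told the pcb is still usable even though it has already been removed from the lists; with it, both lookups back off.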
|
|
|
if (refcount_release(&inp->inp_refcount) == 0) {
|
|
|
|
/*
|
|
|
|
* If the inpcb has been freed, let the caller know, even if
|
|
|
|
* this isn't the last reference.
|
|
|
|
*/
|
|
|
|
if (inp->inp_flags2 & INP_FREED) {
|
|
|
|
INP_RUNLOCK(inp);
|
|
|
|
return (1);
|
|
|
|
}
|
2011-05-23 19:32:02 +00:00
|
|
|
return (0);
|
2012-10-02 12:03:02 +00:00
|
|
|
}
|
2020-02-12 13:31:36 +00:00
|
|
|
|
2011-05-23 19:32:02 +00:00
|
|
|
KASSERT(inp->inp_socket == NULL, ("%s: inp_socket != NULL", __func__));
|
2018-04-19 13:37:59 +00:00
|
|
|
#ifdef TCPHPTS
|
|
|
|
if (inp->inp_in_hpts || inp->inp_in_input) {
|
|
|
|
struct tcp_hpts_entry *hpts;
|
|
|
|
/*
|
2020-02-12 13:31:36 +00:00
|
|
|
* We should not be on the hpts at
|
2018-04-19 13:37:59 +00:00
|
|
|
* this point in any form. We must
|
|
|
|
* get the lock to be sure.
|
|
|
|
*/
|
|
|
|
hpts = tcp_hpts_lock(inp);
|
|
|
|
if (inp->inp_in_hpts)
|
|
|
|
panic("Hpts:%p inp:%p at free still on hpts",
|
|
|
|
hpts, inp);
|
|
|
|
mtx_unlock(&hpts->p_mtx);
|
|
|
|
hpts = tcp_input_lock(inp);
|
2020-02-12 13:31:36 +00:00
|
|
|
if (inp->inp_in_input)
|
2018-04-19 13:37:59 +00:00
|
|
|
panic("Hpts:%p inp:%p at free still on input hpts",
|
|
|
|
hpts, inp);
|
|
|
|
mtx_unlock(&hpts->p_mtx);
|
|
|
|
}
|
|
|
|
#endif
|
2011-05-23 19:32:02 +00:00
|
|
|
INP_RUNLOCK(inp);
|
|
|
|
pcbinfo = inp->inp_pcbinfo;
|
|
|
|
uma_zfree(pcbinfo->ipi_zone, inp);
|
|
|
|
return (1);
|
|
|
|
}
|
|
|
|
|
|
|
|
int
|
|
|
|
in_pcbrele_wlocked(struct inpcb *inp)
|
|
|
|
{
|
|
|
|
struct inpcbinfo *pcbinfo;
|
2008-12-08 20:18:50 +00:00
|
|
|
|
|
|
|
KASSERT(inp->inp_refcount > 0, ("%s: refcount 0", __func__));
|
|
|
|
|
|
|
|
INP_WLOCK_ASSERT(inp);
|
|
|
|
|
2015-11-25 14:45:43 +00:00
|
|
|
if (refcount_release(&inp->inp_refcount) == 0) {
|
|
|
|
/*
|
|
|
|
* If the inpcb has been freed, let the caller know, even if
|
|
|
|
* this isn't the last reference.
|
|
|
|
*/
|
|
|
|
if (inp->inp_flags2 & INP_FREED) {
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
return (1);
|
|
|
|
}
|
2008-12-08 20:18:50 +00:00
|
|
|
return (0);
|
2015-11-25 14:45:43 +00:00
|
|
|
}
|
2011-05-23 19:32:02 +00:00
|
|
|
|
|
|
|
KASSERT(inp->inp_socket == NULL, ("%s: inp_socket != NULL", __func__));
|
2018-04-19 13:37:59 +00:00
|
|
|
#ifdef TCPHPTS
|
|
|
|
if (inp->inp_in_hpts || inp->inp_in_input) {
|
|
|
|
struct tcp_hpts_entry *hpts;
|
|
|
|
/*
|
2020-02-12 13:31:36 +00:00
|
|
|
* We should not be on the hpts at
|
2018-04-19 13:37:59 +00:00
|
|
|
* this point in any form. We must
|
|
|
|
* get the lock to be sure.
|
|
|
|
*/
|
|
|
|
hpts = tcp_hpts_lock(inp);
|
|
|
|
if (inp->inp_in_hpts)
|
|
|
|
panic("Hpts:%p inp:%p at free still on hpts",
|
|
|
|
hpts, inp);
|
|
|
|
mtx_unlock(&hpts->p_mtx);
|
|
|
|
hpts = tcp_input_lock(inp);
|
2020-02-12 13:31:36 +00:00
|
|
|
if (inp->inp_in_input)
|
2018-04-19 13:37:59 +00:00
|
|
|
panic("Hpts:%p inp:%p at free still on input hpts",
|
|
|
|
hpts, inp);
|
|
|
|
mtx_unlock(&hpts->p_mtx);
|
|
|
|
}
|
|
|
|
#endif
|
2011-05-23 19:32:02 +00:00
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
pcbinfo = inp->inp_pcbinfo;
|
|
|
|
uma_zfree(pcbinfo->ipi_zone, inp);
|
2008-12-08 20:18:50 +00:00
|
|
|
return (1);
|
|
|
|
}
|
|
|
|
|
2011-05-23 19:32:02 +00:00
|
|
|
/*
|
|
|
|
* Temporary wrapper.
|
|
|
|
*/
|
|
|
|
int
|
|
|
|
in_pcbrele(struct inpcb *inp)
|
|
|
|
{
|
|
|
|
|
|
|
|
return (in_pcbrele_wlocked(inp));
|
|
|
|
}
|
|
|
|
|
2018-05-20 02:17:30 +00:00
|
|
|
void
|
|
|
|
in_pcblist_rele_rlocked(epoch_context_t ctx)
|
|
|
|
{
|
|
|
|
struct in_pcblist *il;
|
|
|
|
struct inpcb *inp;
|
|
|
|
struct inpcbinfo *pcbinfo;
|
|
|
|
int i, n;
|
|
|
|
|
|
|
|
il = __containerof(ctx, struct in_pcblist, il_epoch_ctx);
|
|
|
|
pcbinfo = il->il_pcbinfo;
|
|
|
|
n = il->il_count;
|
|
|
|
INP_INFO_WLOCK(pcbinfo);
|
|
|
|
for (i = 0; i < n; i++) {
|
|
|
|
inp = il->il_inp_list[i];
|
|
|
|
INP_RLOCK(inp);
|
|
|
|
if (!in_pcbrele_rlocked(inp))
|
|
|
|
INP_RUNLOCK(inp);
|
|
|
|
}
|
|
|
|
INP_INFO_WUNLOCK(pcbinfo);
|
|
|
|
free(il, M_TEMP);
|
|
|
|
}
|
|
|
|
|
2018-06-12 22:18:27 +00:00
|
|
|
static void
|
|
|
|
inpcbport_free(epoch_context_t ctx)
|
|
|
|
{
|
|
|
|
struct inpcbport *phd;
|
|
|
|
|
|
|
|
phd = __containerof(ctx, struct inpcbport, phd_epoch_ctx);
|
|
|
|
free(phd, M_PCB);
|
|
|
|
}
|
|
|
|
|
2018-06-12 22:18:15 +00:00
|
|
|
static void
|
|
|
|
in_pcbfree_deferred(epoch_context_t ctx)
|
|
|
|
{
|
|
|
|
struct inpcb *inp;
|
|
|
|
int released __unused;
|
|
|
|
|
|
|
|
inp = __containerof(ctx, struct inpcb, inp_epoch_ctx);
|
|
|
|
|
|
|
|
INP_WLOCK(inp);
|
2019-02-13 15:46:05 +00:00
|
|
|
CURVNET_SET(inp->inp_vnet);
|
2018-06-12 22:18:15 +00:00
|
|
|
#ifdef INET
|
2018-08-15 20:23:08 +00:00
|
|
|
struct ip_moptions *imo = inp->inp_moptions;
|
2018-06-12 22:18:15 +00:00
|
|
|
inp->inp_moptions = NULL;
|
|
|
|
#endif
|
|
|
|
/* XXXRW: Do as much as possible here. */
|
|
|
|
#if defined(IPSEC) || defined(IPSEC_SUPPORT)
|
|
|
|
if (inp->inp_sp != NULL)
|
|
|
|
ipsec_delete_pcbpolicy(inp);
|
|
|
|
#endif
|
|
|
|
#ifdef INET6
|
2018-08-15 20:23:08 +00:00
|
|
|
struct ip6_moptions *im6o = NULL;
|
2018-06-12 22:18:15 +00:00
|
|
|
if (inp->inp_vflag & INP_IPV6PROTO) {
|
|
|
|
ip6_freepcbopts(inp->in6p_outputopts);
|
2018-08-15 20:23:08 +00:00
|
|
|
im6o = inp->in6p_moptions;
|
2018-06-12 22:18:15 +00:00
|
|
|
inp->in6p_moptions = NULL;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
if (inp->inp_options)
|
|
|
|
(void)m_free(inp->inp_options);
|
|
|
|
inp->inp_vflag = 0;
|
|
|
|
crfree(inp->inp_cred);
|
|
|
|
#ifdef MAC
|
|
|
|
mac_inpcb_destroy(inp);
|
|
|
|
#endif
|
|
|
|
released = in_pcbrele_wlocked(inp);
|
|
|
|
MPASS(released);
|
2018-08-15 20:23:08 +00:00
|
|
|
#ifdef INET6
|
|
|
|
ip6_freemoptions(im6o);
|
|
|
|
#endif
|
|
|
|
#ifdef INET
|
|
|
|
inp_freemoptions(imo);
|
2020-02-12 13:31:36 +00:00
|
|
|
#endif
|
2019-02-13 15:46:05 +00:00
|
|
|
CURVNET_RESTORE();
|
2018-06-12 22:18:15 +00:00
|
|
|
}
|
|
|
|
|
2018-05-20 04:38:04 +00:00
|
|
|
/*
|
|
|
|
* Unconditionally schedule an inpcb to be freed by decrementing its
|
|
|
|
* reference count, which should occur only after the inpcb has been detached
|
|
|
|
* from its socket. If another thread holds a temporary reference (acquired
|
|
|
|
* using in_pcbref()) then the free is deferred until that reference is
|
|
|
|
* released using in_pcbrele(), but the inpcb is still unlocked. Almost all
|
|
|
|
* work, including removal from global lists, is done in this context, where
|
|
|
|
* the pcbinfo lock is held.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
in_pcbfree(struct inpcb *inp)
|
|
|
|
{
|
2018-05-21 16:13:43 +00:00
|
|
|
struct inpcbinfo *pcbinfo = inp->inp_pcbinfo;
|
|
|
|
|
2018-05-20 04:38:04 +00:00
|
|
|
KASSERT(inp->inp_socket == NULL, ("%s: inp_socket != NULL", __func__));
|
|
|
|
KASSERT((inp->inp_flags2 & INP_FREED) == 0,
|
|
|
|
("%s: called twice for pcb %p", __func__, inp));
|
|
|
|
if (inp->inp_flags2 & INP_FREED) {
|
2018-05-20 00:22:28 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2018-05-20 04:38:04 +00:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
INP_WLOCK_ASSERT(inp);
|
2018-05-21 16:13:43 +00:00
|
|
|
INP_LIST_WLOCK(pcbinfo);
|
|
|
|
in_pcbremlists(inp);
|
|
|
|
INP_LIST_WUNLOCK(pcbinfo);
|
2018-05-20 04:38:04 +00:00
|
|
|
RO_INVALIDATE_CACHE(&inp->inp_route);
|
2018-06-12 22:18:15 +00:00
|
|
|
/* mark as destruction in progress */
|
2018-05-21 16:13:43 +00:00
|
|
|
inp->inp_flags2 |= INP_FREED;
|
2018-06-12 22:18:15 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2020-01-15 06:05:20 +00:00
|
|
|
NET_EPOCH_CALL(in_pcbfree_deferred, &inp->inp_epoch_ctx);
|
2008-12-08 20:18:50 +00:00
|
|
|
}
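The deferred-free pattern used by in_pcbfree() and in_pcbfree_deferred() (queue a callback on an embedded context, then recover the enclosing object inside the callback) can be sketched in userland as follows. This is only an illustration under stated assumptions: `struct defer_ctx`, `defer_call()`, `defer_run()` and `CONTAINER_OF()` are hypothetical stand-ins for `epoch_context_t`, `NET_EPOCH_CALL()`, the end of an epoch grace period, and `__containerof()`; no real epoch semantics are modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for epoch_context_t: a node on a deferred-work list. */
struct defer_ctx {
	struct defer_ctx *next;
	void            (*fn)(struct defer_ctx *);
};

/* Userland equivalent of the kernel's __containerof(). */
#define CONTAINER_OF(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct pcb {
	int              id;
	struct defer_ctx ctx;	/* embedded, like inp_epoch_ctx */
};

static struct defer_ctx *pending;	/* callbacks waiting to run */
static int freed_count;			/* number of pcbs released so far */

/* Analogue of NET_EPOCH_CALL(): queue the callback instead of running it. */
static void
defer_call(void (*fn)(struct defer_ctx *), struct defer_ctx *ctx)
{
	assert(ctx != NULL);
	ctx->fn = fn;
	ctx->next = pending;
	pending = ctx;
}

/* Stand-in for the end of a grace period: run everything that was queued. */
static void
defer_run(void)
{
	while (pending != NULL) {
		struct defer_ctx *ctx = pending;

		pending = ctx->next;	/* unlink before the callback frees it */
		ctx->fn(ctx);
	}
}

/* Analogue of in_pcbfree_deferred(): recover the pcb and release it. */
static void
pcb_free_deferred(struct defer_ctx *ctx)
{
	struct pcb *p = CONTAINER_OF(ctx, struct pcb, ctx);

	free(p);
	freed_count++;
}
```

The point of the indirection is that the memory stays valid until `defer_run()` executes, which in the kernel corresponds to the moment when no reader can still be traversing the lists without a lock.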
|
|
|
|
|
2006-04-25 11:17:35 +00:00
|
|
|
/*
|
2008-09-29 13:50:17 +00:00
|
|
|
* in_pcbdrop() removes an inpcb from hashed lists, releasing its address and
|
|
|
|
* port reservation, and preventing it from being returned by inpcb lookups.
|
|
|
|
*
|
|
|
|
* It is used by TCP to mark an inpcb as unused and avoid future packet
|
|
|
|
* delivery or event notification when a socket remains open but TCP has
|
|
|
|
* closed. This might occur as a result of a shutdown()-initiated TCP close
|
|
|
|
* or a RST on the wire, and allows the port binding to be reused while still
|
|
|
|
* maintaining the invariant that so_pcb always points to a valid inpcb until
|
|
|
|
* in_pcbdetach().
|
|
|
|
*
|
|
|
|
* XXXRW: Possibly in_pcbdrop() should also prevent future notifications by
|
|
|
|
* in_pcbnotifyall() and in_pcbpurgeif0()?
|
2006-04-25 11:17:35 +00:00
|
|
|
*/
|
|
|
|
void
|
|
|
|
in_pcbdrop(struct inpcb *inp)
|
|
|
|
{
|
|
|
|
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK_ASSERT(inp);
|
2018-07-04 02:47:16 +00:00
|
|
|
#ifdef INVARIANTS
|
|
|
|
if (inp->inp_socket != NULL && inp->inp_ppcb != NULL)
|
|
|
|
MPASS(inp->inp_refcount > 1);
|
|
|
|
#endif
|
2006-04-25 11:17:35 +00:00
|
|
|
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need to acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
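The rule that callers must pass exactly one of the two lock flags can be expressed as a small check. The flag values and the function name below are hypothetical and chosen only for this sketch; the real definitions live in the kernel headers and may differ.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values, for illustration only. */
#define LOOKUP_WILDCARD 0x1
#define LOOKUP_RLOCKPCB 0x2
#define LOOKUP_WLOCKPCB 0x4

/*
 * Returns true when exactly one of the read-lock and write-lock flags is
 * set.  The wildcard flag is independent and may be combined with either.
 */
static bool
lookup_flags_valid(int flags)
{
	int lock = flags & (LOOKUP_RLOCKPCB | LOOKUP_WLOCKPCB);

	return (lock == LOOKUP_RLOCKPCB || lock == LOOKUP_WLOCKPCB);
}
```

A lookup routine could assert this condition on entry, so that a caller passing neither flag or both is caught immediately rather than receiving an inpcb in an unexpected lock state.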
|
|
|
/*
|
|
|
|
* XXXRW: Possibly we should protect the setting of INP_DROPPED with
|
|
|
|
* the hash lock...?
|
|
|
|
*/
|
2009-03-15 09:58:31 +00:00
|
|
|
inp->inp_flags |= INP_DROPPED;
|
2009-03-11 00:29:22 +00:00
|
|
|
if (inp->inp_flags & INP_INHASHLIST) {
|
2006-04-25 11:17:35 +00:00
|
|
|
struct inpcbport *phd = inp->inp_phd;
|
|
|
|
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
|
|
|
INP_HASH_WLOCK(inp->inp_pcbinfo);
|
2018-06-06 15:45:57 +00:00
|
|
|
in_pcbremlbgrouphash(inp);
|
2018-06-12 22:18:20 +00:00
|
|
|
CK_LIST_REMOVE(inp, inp_hash);
|
|
|
|
CK_LIST_REMOVE(inp, inp_portlist);
|
|
|
|
if (CK_LIST_FIRST(&phd->phd_pcblist) == NULL) {
|
|
|
|
CK_LIST_REMOVE(phd, phd_hash);
|
2020-01-15 06:05:20 +00:00
|
|
|
NET_EPOCH_CALL(inpcbport_free, &phd->phd_epoch_ctx);
|
2006-04-25 11:17:35 +00:00
|
|
|
}
|
2011-05-30 09:43:55 +00:00
|
|
|
INP_HASH_WUNLOCK(inp->inp_pcbinfo);
|
2009-03-11 00:29:22 +00:00
|
|
|
inp->inp_flags &= ~INP_INHASHLIST;
|
Implement a CPU-affine TCP and UDP connection lookup data structure,
struct inpcbgroup. pcbgroups, or "connection groups", supplement the
existing inpcbinfo connection hash table, which when pcbgroups are
enabled, might now be thought of more usefully as a per-protocol
4-tuple reservation table.
Connections are assigned to connection groups based on a hash of their
4-tuple; wildcard sockets require special handling, and are members
of all connection groups. During a connection lookup, a
per-connection group lock is employed rather than the global pcbinfo
lock. By aligning connection groups with input path processing,
connection groups take on an effective CPU affinity, especially when
aligned with RSS work placement (see a forthcoming commit for
details). This eliminates cache line migration associated with
global, protocol-layer data structures in steady state TCP and UDP
processing (with the exception of protocol-layer statistics; further
commit to follow).
Elements of this approach were inspired by Willman, Rixner, and Cox's
2006 USENIX paper, "An Evaluation of Network Stack Parallelization
Strategies in Modern Operating Systems". However, there are also
significant differences: we maintain the inpcb lock, rather than using
the connection group lock for per-connection state.
Likewise, the focus of this implementation is alignment with NIC
packet distribution strategies such as RSS, rather than pure software
strategies. Despite that focus, software distribution is supported
through the parallel netisr implementation, and works well in
configurations where the number of hardware threads is greater than
the number of NIC input queues, such as in the RMI XLR threaded MIPS
architecture.
Another important difference is the continued maintenance of existing
hash tables as "reservation tables" -- these are useful both to
distinguish the resource allocation aspect of protocol name management
and the more common-case lookup aspect. In configurations where
connection tables are aligned with hardware hashes, it is desirable to
use the traditional lookup tables for loopback or encapsulated traffic
rather than take the expense of hardware hashes that are hard to
implement efficiently in software (such as RSS Toeplitz).
Connection group support is enabled by compiling "options PCBGROUP"
into your kernel configuration; for the time being, this is an
experimental feature, and hence is not enabled by default.
Subject to the limited MFCability of change dependencies in inpcb,
and its change to the inpcbinfo init function signature, this change
in principle could be merged to FreeBSD 8.x.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
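The assignment of connections to connection groups by a hash of the 4-tuple can be illustrated with a small sketch. Everything here is an assumption made for illustration: `struct four_tuple`, `tuple_hash()` and `group_for()` are hypothetical names, and the mixing function is an arbitrary integer hash, not RSS Toeplitz or whatever hash the kernel uses.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical IPv4 4-tuple; field names are illustrative only. */
struct four_tuple {
	uint32_t laddr;		/* local address */
	uint32_t faddr;		/* foreign address */
	uint16_t lport;		/* local port */
	uint16_t fport;		/* foreign port */
};

/* Arbitrary mixing hash over the 4-tuple (an assumption, not Toeplitz). */
static uint32_t
tuple_hash(const struct four_tuple *t)
{
	uint32_t h;

	h = t->laddr ^ t->faddr;
	h ^= ((uint32_t)t->lport << 16) | t->fport;
	h ^= h >> 16;
	h *= 0x45d9f3bU;
	h ^= h >> 16;
	return (h);
}

/*
 * Map a connection to one of ngroups connection groups.  The same
 * 4-tuple always maps to the same group, so lookups for a given
 * connection only ever need that group's lock.  ngroups must be > 0.
 */
static unsigned
group_for(const struct four_tuple *t, unsigned ngroups)
{
	return (tuple_hash(t) % ngroups);
}
```

Wildcard sockets have no complete 4-tuple, which is why the text above says they need special handling and belong to every group.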
2011-06-06 12:55:02 +00:00
|
|
|
#ifdef PCBGROUP
|
|
|
|
in_pcbgroup_remove(inp);
|
|
|
|
#endif
|
2006-04-25 11:17:35 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2011-04-30 11:04:34 +00:00
|
|
|
#ifdef INET
|
2007-05-11 10:20:51 +00:00
|
|
|
/*
|
|
|
|
* Common routines to return the socket addresses associated with inpcbs.
|
|
|
|
*/
|
2002-08-21 11:57:12 +00:00
|
|
|
struct sockaddr *
|
2006-01-22 01:16:25 +00:00
|
|
|
in_sockaddr(in_port_t port, struct in_addr *addr_p)
|
2002-08-21 11:57:12 +00:00
|
|
|
{
|
|
|
|
struct sockaddr_in *sin;
|
|
|
|
|
2008-10-23 15:53:51 +00:00
|
|
|
sin = malloc(sizeof *sin, M_SONAME,
|
2003-02-19 05:47:46 +00:00
|
|
|
M_WAITOK | M_ZERO);
|
2002-08-21 11:57:12 +00:00
|
|
|
sin->sin_family = AF_INET;
|
|
|
|
sin->sin_len = sizeof(*sin);
|
|
|
|
sin->sin_addr = *addr_p;
|
|
|
|
sin->sin_port = port;
|
|
|
|
|
|
|
|
return (struct sockaddr *)sin;
|
|
|
|
}
|
|
|
|
|
1997-02-18 20:46:36 +00:00
|
|
|
int
|
2007-05-11 10:20:51 +00:00
|
|
|
in_getsockaddr(struct socket *so, struct sockaddr **nam)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
2006-01-22 01:16:25 +00:00
|
|
|
struct inpcb *inp;
|
2002-08-21 11:57:12 +00:00
|
|
|
struct in_addr addr;
|
|
|
|
in_port_t port;
|
1997-12-25 06:57:36 +00:00
|
|
|
|
1997-05-19 01:28:39 +00:00
|
|
|
inp = sotoinpcb(so);
|
2007-05-11 10:20:51 +00:00
|
|
|
KASSERT(inp != NULL, ("in_getsockaddr: inp == NULL"));
|
2006-04-22 19:10:02 +00:00
|
|
|
|
2008-04-19 14:34:38 +00:00
|
|
|
INP_RLOCK(inp);
|
2002-08-21 11:57:12 +00:00
|
|
|
port = inp->inp_lport;
|
|
|
|
addr = inp->inp_laddr;
|
2008-04-19 14:34:38 +00:00
|
|
|
INP_RUNLOCK(inp);
|
1997-12-25 06:57:36 +00:00
|
|
|
|
2002-08-21 11:57:12 +00:00
|
|
|
*nam = in_sockaddr(port, &addr);
|
1997-02-18 20:46:36 +00:00
|
|
|
return 0;
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
|
1997-02-18 20:46:36 +00:00
|
|
|
int
|
2007-05-11 10:20:51 +00:00
|
|
|
in_getpeeraddr(struct socket *so, struct sockaddr **nam)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
2006-01-22 01:16:25 +00:00
|
|
|
struct inpcb *inp;
|
2002-08-21 11:57:12 +00:00
|
|
|
struct in_addr addr;
|
|
|
|
in_port_t port;
|
1997-12-25 06:57:36 +00:00
|
|
|
|
1997-05-19 01:28:39 +00:00
|
|
|
inp = sotoinpcb(so);
|
2007-05-11 10:20:51 +00:00
|
|
|
KASSERT(inp != NULL, ("in_getpeeraddr: inp == NULL"));
|
2006-04-22 19:10:02 +00:00
|
|
|
|
2008-04-19 14:34:38 +00:00
|
|
|
INP_RLOCK(inp);
|
2002-08-21 11:57:12 +00:00
|
|
|
port = inp->inp_fport;
|
|
|
|
addr = inp->inp_faddr;
|
2008-04-19 14:34:38 +00:00
|
|
|
INP_RUNLOCK(inp);
|
1997-12-25 06:57:36 +00:00
|
|
|
|
2002-08-21 11:57:12 +00:00
|
|
|
*nam = in_sockaddr(port, &addr);
|
1997-02-18 20:46:36 +00:00
|
|
|
return 0;
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
|
1994-05-25 09:21:21 +00:00
|
|
|
void
|
2006-01-22 01:16:25 +00:00
|
|
|
in_pcbnotifyall(struct inpcbinfo *pcbinfo, struct in_addr faddr, int errno,
|
|
|
|
struct inpcb *(*notify)(struct inpcb *, int))
|
2001-02-22 21:23:45 +00:00
|
|
|
{
|
2008-04-06 21:20:56 +00:00
|
|
|
struct inpcb *inp, *inp_temp;
|
2001-02-22 21:23:45 +00:00
|
|
|
|
2003-02-12 23:55:07 +00:00
|
|
|
INP_INFO_WLOCK(pcbinfo);
|
2018-06-12 22:18:20 +00:00
|
|
|
CK_LIST_FOREACH_SAFE(inp, pcbinfo->ipi_listhead, inp_list, inp_temp) {
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK(inp);
|
2001-02-22 21:23:45 +00:00
|
|
|
#ifdef INET6
|
2002-06-10 20:05:46 +00:00
|
|
|
if ((inp->inp_vflag & INP_IPV4) == 0) {
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2001-02-22 21:23:45 +00:00
|
|
|
continue;
|
2002-06-10 20:05:46 +00:00
|
|
|
}
|
2001-02-22 21:23:45 +00:00
|
|
|
#endif
|
|
|
|
if (inp->inp_faddr.s_addr != faddr.s_addr ||
|
2002-06-10 20:05:46 +00:00
|
|
|
inp->inp_socket == NULL) {
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2003-02-12 23:55:07 +00:00
|
|
|
continue;
|
2002-06-10 20:05:46 +00:00
|
|
|
}
|
2003-02-12 23:55:07 +00:00
|
|
|
if ((*notify)(inp, errno))
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2001-02-22 21:23:45 +00:00
|
|
|
}
|
2003-02-12 23:55:07 +00:00
|
|
|
INP_INFO_WUNLOCK(pcbinfo);
|
2001-02-22 21:23:45 +00:00
|
|
|
}
|
|
|
|
|
2001-08-04 17:10:14 +00:00
|
|
|
void
|
2006-01-22 01:16:25 +00:00
|
|
|
in_pcbpurgeif0(struct inpcbinfo *pcbinfo, struct ifnet *ifp)
|
2001-08-04 17:10:14 +00:00
|
|
|
{
|
|
|
|
struct inpcb *inp;
|
2019-06-25 11:54:41 +00:00
|
|
|
struct in_multi *inm;
|
|
|
|
struct in_mfilter *imf;
|
2001-08-04 17:10:14 +00:00
|
|
|
struct ip_moptions *imo;
|
|
|
|
|
2015-08-03 12:13:54 +00:00
|
|
|
INP_INFO_WLOCK(pcbinfo);
|
2018-06-12 22:18:20 +00:00
|
|
|
CK_LIST_FOREACH(inp, pcbinfo->ipi_listhead, inp_list) {
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK(inp);
|
2001-08-04 17:10:14 +00:00
|
|
|
imo = inp->inp_moptions;
|
|
|
|
if ((inp->inp_vflag & INP_IPV4) &&
|
|
|
|
imo != NULL) {
|
|
|
|
/*
|
|
|
|
* Unselect the outgoing interface if it is being
|
|
|
|
* detached.
|
|
|
|
*/
|
|
|
|
if (imo->imo_multicast_ifp == ifp)
|
|
|
|
imo->imo_multicast_ifp = NULL;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Drop multicast group membership if we joined
|
|
|
|
* through the interface being detached.
|
2018-05-20 00:22:28 +00:00
|
|
|
*
|
|
|
|
* XXX This can all be deferred to an epoch_call
|
2001-08-04 17:10:14 +00:00
|
|
|
*/
|
2019-06-25 11:54:41 +00:00
|
|
|
restart:
|
|
|
|
IP_MFILTER_FOREACH(imf, &imo->imo_head) {
|
|
|
|
if ((inm = imf->imf_inm) == NULL)
|
|
|
|
continue;
|
|
|
|
if (inm->inm_ifp != ifp)
|
|
|
|
continue;
|
|
|
|
ip_mfilter_remove(&imo->imo_head, imf);
|
|
|
|
IN_MULTI_LOCK_ASSERT();
|
|
|
|
in_leavegroup_locked(inm, NULL);
|
|
|
|
ip_mfilter_free(imf);
|
|
|
|
goto restart;
|
2001-08-04 17:10:14 +00:00
|
|
|
}
|
|
|
|
}
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2001-08-04 17:10:14 +00:00
|
|
|
}
|
2015-08-03 12:13:54 +00:00
|
|
|
INP_INFO_WUNLOCK(pcbinfo);
|
2001-08-04 17:10:14 +00:00
|
|
|
}
|
|
|
|
|
Improved connection establishment performance by doing local port lookups via
a hashed port list. In the new scheme, in_pcblookup() goes away and is
replaced by a new routine, in_pcblookup_local() for doing the local port
check. Note that this implementation is space inefficient in that the PCB
struct is now too large to fit into 128 bytes. I might deal with this in the
future by using the new zone allocator, but I wanted these changes to be
extensively tested in their current form first.
Also:
1) Fixed off-by-one errors in the port lookup loops in in_pcbbind().
2) Got rid of some unneeded rehashing. Added a new routine, in_pcbinshash(),
to do the initial hash insertion.
3) Renamed in_pcblookuphash() to in_pcblookup_hash() for easier readability.
4) Added a new routine, in_pcbremlists() to remove the PCB from the various
hash lists.
5) Added/deleted comments where appropriate.
6) Removed unnecessary splnet() locking. In general, the PCB functions should
be called at splnet()...there are unfortunately a few exceptions, however.
7) Reorganized a few structs for better cache line behavior.
8) Killed my TCP_ACK_HACK kludge. It may come back in a different form in
the future, however.
These changes have been tested on wcarchive for more than a month. In tests
done here, connection establishment overhead is reduced by a factor of more
than 50, thus getting rid of one of the major networking scalability problems.
Still to do: make tcp_fasttimo/tcp_slowtimo scale well for systems with a
large number of connections. tcp_fasttimo is easy; tcp_slowtimo is difficult.
WARNING: Anything that knows about inpcb and tcpcb structs will have to be
recompiled; at the very least, this includes netstat(1).
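The idea of a hashed port list, where the local port check visits only one hash bucket instead of every PCB, can be sketched as follows. This is a minimal illustration, assuming a fixed power-of-two table size; `struct port_entry`, `struct port_table`, `port_hash()`, `port_insert()` and `port_lookup()` are hypothetical names and do not reproduce the kernel's data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PORT_HASH_SIZE	64	/* assumed table size; must be a power of two */

/* One entry per bound local port (hypothetical layout). */
struct port_entry {
	uint16_t lport;
	struct port_entry *next;	/* chain within one bucket */
};

struct port_table {
	struct port_entry *buckets[PORT_HASH_SIZE];
};

/* Bucket index: mask the port with (size - 1). */
static unsigned
port_hash(uint16_t lport)
{
	return (lport & (PORT_HASH_SIZE - 1));
}

static void
port_insert(struct port_table *t, struct port_entry *e, uint16_t lport)
{
	unsigned i = port_hash(lport);

	e->lport = lport;
	e->next = t->buckets[i];
	t->buckets[i] = e;
}

/* Walk only the bucket the port hashes to, not the whole PCB list. */
static struct port_entry *
port_lookup(const struct port_table *t, uint16_t lport)
{
	struct port_entry *e;

	for (e = t->buckets[port_hash(lport)]; e != NULL; e = e->next)
		if (e->lport == lport)
			return (e);
	return (NULL);
}
```

With a reasonably sized table each bucket holds only a few ports, so the cost of a lookup stays roughly constant as the number of connections grows.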
1998-01-27 09:15:13 +00:00
|
|
|
/*
|
2011-05-30 09:43:55 +00:00
|
|
|
* Lookup a PCB based on the local address and port. Caller must hold the
|
|
|
|
* hash lock. No inpcb locks or references are acquired.
|
1998-01-27 09:15:13 +00:00
|
|
|
*/
|
2006-02-04 07:59:17 +00:00
|
|
|
#define INP_LOOKUP_MAPPED_PCB_COST 3
|
1994-05-24 10:09:53 +00:00
|
|
|
struct inpcb *
|
2006-01-22 01:16:25 +00:00
|
|
|
in_pcblookup_local(struct inpcbinfo *pcbinfo, struct in_addr laddr,
|
2011-05-23 15:23:18 +00:00
|
|
|
u_short lport, int lookupflags, struct ucred *cred)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
2006-01-22 01:16:25 +00:00
|
|
|
struct inpcb *inp;
|
2006-02-04 07:59:17 +00:00
|
|
|
#ifdef INET6
|
|
|
|
int matchwild = 3 + INP_LOOKUP_MAPPED_PCB_COST;
|
|
|
|
#else
|
|
|
|
int matchwild = 3;
|
|
|
|
#endif
|
|
|
|
int wildcard;
|
1995-04-10 08:52:45 +00:00
|
|
|
|
2011-05-23 15:23:18 +00:00
|
|
|
KASSERT((lookupflags & ~(INPLOOKUP_WILDCARD)) == 0,
|
|
|
|
("%s: invalid lookup flags %d", __func__, lookupflags));
|
|
|
|
|
2011-05-30 09:43:55 +00:00
|
|
|
INP_HASH_LOCK_ASSERT(pcbinfo);
|
2003-11-13 05:16:56 +00:00
|
|
|
|
2011-05-23 15:23:18 +00:00
|
|
|
if ((lookupflags & INPLOOKUP_WILDCARD) == 0) {
|
1998-01-27 09:15:13 +00:00
|
|
|
struct inpcbhead *head;
|
|
|
|
/*
|
|
|
|
* Look for an unconnected (wildcard foreign addr) PCB that
|
|
|
|
* matches the local address and port we're looking for.
|
|
|
|
*/
|
2007-04-30 23:12:05 +00:00
|
|
|
head = &pcbinfo->ipi_hashbase[INP_PCBHASH(INADDR_ANY, lport,
|
|
|
|
0, pcbinfo->ipi_hashmask)];
|
2018-06-12 22:18:20 +00:00
|
|
|
CK_LIST_FOREACH(inp, head, inp_hash) {
|
1999-12-07 17:39:16 +00:00
|
|
|
#ifdef INET6
|
MFp4:
Bring in updated jail support from bz_jail branch.
This enhances the current jail implementation to permit multiple
addresses per jail. In addition to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..
SCTP support was updated and supports IPv6 in jails as well.
Cpuset support permits jails to be bound to specific processor
sets after creation.
Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.
DDB 'show jails' command was added to aid debugging.
Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.
Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.
Bump __FreeBSD_version for the aforementioned and in kernel changes.
Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.
Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
|
|
|
/* XXX inp locking */
|
1999-12-21 11:14:12 +00:00
|
|
|
if ((inp->inp_vflag & INP_IPV4) == 0)
|
1999-12-07 17:39:16 +00:00
|
|
|
continue;
|
|
|
|
#endif
|
Improved connection establishment performance by doing local port lookups via
a hashed port list. In the new scheme, in_pcblookup() goes away and is
replaced by a new routine, in_pcblookup_local() for doing the local port
check. Note that this implementation is space inefficient in that the PCB
struct is now too large to fit into 128 bytes. I might deal with this in the
future by using the new zone allocator, but I wanted these changes to be
extensively tested in their current form first.
Also:
1) Fixed off-by-one errors in the port lookup loops in in_pcbbind().
2) Got rid of some unneeded rehashing. Adding a new routine, in_pcbinshash()
to do the initialial hash insertion.
3) Renamed in_pcblookuphash() to in_pcblookup_hash() for easier readability.
4) Added a new routine, in_pcbremlists() to remove the PCB from the various
hash lists.
5) Added/deleted comments where appropriate.
6) Removed unnecessary splnet() locking. In general, the PCB functions should
be called at splnet()...there are unfortunately a few exceptions, however.
7) Reorganized a few structs for better cache line behavior.
8) Killed my TCP_ACK_HACK kludge. It may come back in a different form in
the future, however.
These changes have been tested on wcarchive for more than a month. In tests
done here, connection establishment overhead is reduced by more than 50
times, thus getting rid of one of the major networking scalability problems.
Still to do: make tcp_fasttimo/tcp_slowtimo scale well for systems with a
large number of connections. tcp_fasttimo is easy; tcp_slowtimo is difficult.
WARNING: Anything that knows about inpcb and tcpcb structs will have to be
recompiled; at the very least, this includes netstat(1).
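
The hashed port list described above can be illustrated with a minimal
user-space sketch. This is not the kernel code: the names port_head, pcb,
port_insert(), pcb_attach(), port_find() and pcb_lookup_local(), the table
size, and the mask-based hash are all assumptions made for illustration. The
idea shown is only that a local port check first hashes the port to a bucket,
walks the short chain of per-port heads, and then scans just the PCBs bound
to that one port.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table size; a power of two so the mask hash works. */
#define PORT_HASH_SIZE	16
#define PORT_HASH(port)	((port) & (PORT_HASH_SIZE - 1))

/* Simplified stand-in for a protocol control block. */
struct pcb {
	uint32_t	laddr;	/* local address */
	uint16_t	lport;	/* local port */
	struct pcb	*next;	/* next PCB bound to the same port */
};

/* One head per local port that is in use. */
struct port_head {
	uint16_t		port;
	struct pcb		*pcbs;	/* PCBs bound to this port */
	struct port_head	*next;	/* hash chain within a bucket */
};

static struct port_head *buckets[PORT_HASH_SIZE];

/* Return the head for a port, or NULL if the port is not in use. */
static struct port_head *
port_find(uint16_t port)
{
	struct port_head *phd;

	for (phd = buckets[PORT_HASH(port)]; phd != NULL; phd = phd->next)
		if (phd->port == port)
			return (phd);
	return (NULL);
}

/* Link a caller-allocated port head into its hash bucket. */
static void
port_insert(struct port_head *phd)
{
	assert(phd != NULL);
	phd->next = buckets[PORT_HASH(phd->port)];
	buckets[PORT_HASH(phd->port)] = phd;
}

/* Attach a caller-allocated PCB to the head for its port. */
static void
pcb_attach(struct port_head *phd, struct pcb *p)
{
	assert(phd != NULL && p != NULL && p->lport == phd->port);
	p->next = phd->pcbs;
	phd->pcbs = p;
}

/* Exact-match local lookup: only PCBs on the matching port are scanned. */
static struct pcb *
pcb_lookup_local(uint32_t laddr, uint16_t lport)
{
	struct port_head *phd = port_find(lport);
	struct pcb *p;

	if (phd == NULL)
		return (NULL);
	for (p = phd->pcbs; p != NULL; p = p->next)
		if (p->laddr == laddr)
			return (p);
	return (NULL);
}
```

With such a layout the cost of a local port check depends on the number of
ports sharing a bucket and the number of PCBs bound to the port in question,
rather than on the total number of PCBs.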
1998-01-27 09:15:13 +00:00
|
|
|
if (inp->inp_faddr.s_addr == INADDR_ANY &&
|
|
|
|
inp->inp_laddr.s_addr == laddr.s_addr &&
|
|
|
|
inp->inp_lport == lport) {
|
|
|
|
/*
|
MFp4:
Bring in updated jail support from bz_jail branch.
This enhances the current jail implementation to permit multiple
addresses per jail. In addition to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking, etc.
SCTP support was updated and supports IPv6 in jails as well.
Cpuset support permits jails to be bound to specific processor
sets after creation.
Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.
DDB 'show jails' command was added to aid debugging.
Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.
Both the jail and prison versions were updated for the new features.
A gap was intentionally left, as the intermediate versions had been
used by various patches floating around over the last years.
Bump __FreeBSD_version for the aforementioned and in-kernel changes.
Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.
Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
|
|
|
* Found?
|
1998-01-27 09:15:13 +00:00
|
|
|
*/
|
2008-11-29 14:32:14 +00:00
|
|
|
if (cred == NULL ||
|
2009-05-27 14:11:23 +00:00
|
|
|
prison_equal_ip4(cred->cr_prison,
|
|
|
|
inp->inp_cred->cr_prison))
|
2008-11-29 14:32:14 +00:00
|
|
|
return (inp);
|
1998-01-27 09:15:13 +00:00
|
|
|
}
|
1995-04-09 01:29:31 +00:00
|
|
|
}
|
1998-01-27 09:15:13 +00:00
|
|
|
/*
|
|
|
|
* Not found.
|
|
|
|
*/
|
|
|
|
return (NULL);
|
|
|
|
} else {
|
|
|
|
struct inpcbporthead *porthash;
|
|
|
|
struct inpcbport *phd;
|
|
|
|
struct inpcb *match = NULL;
|
|
|
|
/*
|
|
|
|
* Best fit PCB lookup.
|
|
|
|
*
|
|
|
|
* First see if this local port is in use by looking on the
|
|
|
|
* port hash list.
|
|
|
|
*/
|
2007-04-30 23:12:05 +00:00
|
|
|
porthash = &pcbinfo->ipi_porthashbase[INP_PCBPORTHASH(lport,
|
|
|
|
pcbinfo->ipi_porthashmask)];
|
2018-06-12 22:18:20 +00:00
|
|
|
CK_LIST_FOREACH(phd, porthash, phd_hash) {
|
1998-01-27 09:15:13 +00:00
|
|
|
if (phd->phd_port == lport)
|
1994-05-24 10:09:53 +00:00
|
|
|
break;
|
1998-01-27 09:15:13 +00:00
|
|
|
}
|
|
|
|
if (phd != NULL) {
|
|
|
|
/*
|
|
|
|
* Port is in use by one or more PCBs. Look for best
|
|
|
|
* fit.
|
|
|
|
*/
|
2018-06-12 22:18:20 +00:00
|
|
|
CK_LIST_FOREACH(inp, &phd->phd_pcblist, inp_portlist) {
|
1998-01-27 09:15:13 +00:00
|
|
|
wildcard = 0;
|
2008-11-29 14:32:14 +00:00
|
|
|
if (cred != NULL &&
|
2009-05-27 14:11:23 +00:00
|
|
|
!prison_equal_ip4(inp->inp_cred->cr_prison,
|
|
|
|
cred->cr_prison))
|
2008-11-29 14:32:14 +00:00
|
|
|
continue;
|
1999-12-07 17:39:16 +00:00
|
|
|
#ifdef INET6
|
2008-11-29 14:32:14 +00:00
|
|
|
/* XXX inp locking */
|
1999-12-21 11:14:12 +00:00
|
|
|
if ((inp->inp_vflag & INP_IPV4) == 0)
|
1999-12-07 17:39:16 +00:00
|
|
|
continue;
|
2006-02-04 07:59:17 +00:00
|
|
|
/*
|
|
|
|
* We never select the PCB that has
|
|
|
|
* INP_IPV6 flag and is bound to :: if
|
|
|
|
* we have another PCB which is bound
|
|
|
|
* to 0.0.0.0. If a PCB has the
|
|
|
|
* INP_IPV6 flag, then we set its cost
|
|
|
|
* higher than IPv4 only PCBs.
|
|
|
|
*
|
|
|
|
* Note that the case only happens
|
|
|
|
* when a socket is bound to ::, under
|
|
|
|
* the condition that the use of the
|
|
|
|
* mapped address is allowed.
|
|
|
|
*/
|
|
|
|
if ((inp->inp_vflag & INP_IPV6) != 0)
|
|
|
|
wildcard += INP_LOOKUP_MAPPED_PCB_COST;
|
1999-12-07 17:39:16 +00:00
|
|
|
#endif
|
1998-01-27 09:15:13 +00:00
|
|
|
if (inp->inp_faddr.s_addr != INADDR_ANY)
|
|
|
|
wildcard++;
|
|
|
|
if (inp->inp_laddr.s_addr != INADDR_ANY) {
|
|
|
|
if (laddr.s_addr == INADDR_ANY)
|
|
|
|
wildcard++;
|
|
|
|
else if (inp->inp_laddr.s_addr != laddr.s_addr)
|
|
|
|
continue;
|
|
|
|
} else {
|
|
|
|
if (laddr.s_addr != INADDR_ANY)
|
|
|
|
wildcard++;
|
|
|
|
}
|
|
|
|
if (wildcard < matchwild) {
|
|
|
|
match = inp;
|
|
|
|
matchwild = wildcard;
|
2008-11-29 14:32:14 +00:00
|
|
|
if (matchwild == 0)
|
1998-01-27 09:15:13 +00:00
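The hashed port list described in the commit message above can be illustrated with a small stand-alone sketch. This is not the kernel code: `struct toy_pcb`, `port_hash()`, `toy_insert()` and `toy_lookup_local()` are hypothetical names, the bucket count is arbitrary, and the hash is assumed to be a plain mask of the port number. The point is only that a local port check walks one short bucket instead of every PCB.

```c
#include <stddef.h>
#include <stdint.h>

#define	TOY_NBUCKETS	16	/* assumed size; must be a power of two */

/* Hypothetical stand-in for a PCB that only records its local port. */
struct toy_pcb {
	uint16_t	 lport;	/* local port */
	struct toy_pcb	*next;	/* next PCB in the same hash bucket */
};

static struct toy_pcb *toy_buckets[TOY_NBUCKETS];

/* Map a local port to a bucket index. */
static unsigned
port_hash(uint16_t lport)
{
	return (lport & (TOY_NBUCKETS - 1));
}

/* Initial hash insertion: push the PCB onto the head of its bucket. */
static void
toy_insert(struct toy_pcb *pcb)
{
	unsigned h = port_hash(pcb->lport);

	pcb->next = toy_buckets[h];
	toy_buckets[h] = pcb;
}

/* Local port check: only PCBs in the matching bucket are examined. */
static struct toy_pcb *
toy_lookup_local(uint16_t lport)
{
	struct toy_pcb *pcb;

	for (pcb = toy_buckets[port_hash(lport)]; pcb != NULL; pcb = pcb->next)
		if (pcb->lport == lport)
			return (pcb);
	return (NULL);
}
```

Two ports that collide in the same bucket (for example 80 and 96 with 16 buckets) are still told apart by the explicit port comparison inside the loop.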
|
|
|
break;
|
|
|
|
}
|
1995-03-02 19:29:42 +00:00
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
1998-01-27 09:15:13 +00:00
|
|
|
return (match);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
}
|
2006-02-04 07:59:17 +00:00
|
|
|
#undef INP_LOOKUP_MAPPED_PCB_COST
|
1995-04-09 01:29:31 +00:00
|
|
|
|
2018-06-06 15:45:57 +00:00
|
|
|
static struct inpcb *
|
|
|
|
in_pcblookup_lbgroup(const struct inpcbinfo *pcbinfo,
|
2018-09-05 15:04:11 +00:00
|
|
|
const struct in_addr *laddr, uint16_t lport, const struct in_addr *faddr,
|
|
|
|
uint16_t fport, int lookupflags)
|
2018-06-06 15:45:57 +00:00
|
|
|
{
|
2018-09-05 15:04:11 +00:00
|
|
|
struct inpcb *local_wild;
|
2018-06-06 15:45:57 +00:00
|
|
|
const struct inpcblbgrouphead *hdr;
|
|
|
|
struct inpcblbgroup *grp;
|
2018-09-05 15:04:11 +00:00
|
|
|
uint32_t idx;
|
2018-06-06 15:45:57 +00:00
|
|
|
|
|
|
|
INP_HASH_LOCK_ASSERT(pcbinfo);
|
|
|
|
|
2018-12-05 17:06:00 +00:00
|
|
|
hdr = &pcbinfo->ipi_lbgrouphashbase[
|
|
|
|
INP_PCBPORTHASH(lport, pcbinfo->ipi_lbgrouphashmask)];
|
2018-06-06 15:45:57 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Order of socket selection:
|
|
|
|
* 1. non-wild.
|
|
|
|
* 2. wild (if lookupflags contains INPLOOKUP_WILDCARD).
|
|
|
|
*
|
|
|
|
* NOTE:
|
|
|
|
* - Load balanced group does not contain jailed sockets
|
|
|
|
* - Load balanced group does not contain IPv4 mapped INET6 wild sockets
|
|
|
|
*/
|
2018-09-05 15:04:11 +00:00
|
|
|
local_wild = NULL;
|
2018-09-10 19:00:29 +00:00
|
|
|
CK_LIST_FOREACH(grp, hdr, il_list) {
|
2018-06-06 15:45:57 +00:00
|
|
|
#ifdef INET6
|
|
|
|
if (!(grp->il_vflag & INP_IPV4))
|
|
|
|
continue;
|
|
|
|
#endif
|
2018-09-05 15:04:11 +00:00
|
|
|
if (grp->il_lport != lport)
|
|
|
|
continue;
|
2018-06-06 15:45:57 +00:00
|
|
|
|
2018-09-05 15:04:11 +00:00
|
|
|
idx = INP_PCBLBGROUP_PKTHASH(faddr->s_addr, lport, fport) %
|
|
|
|
grp->il_inpcnt;
|
|
|
|
if (grp->il_laddr.s_addr == laddr->s_addr)
|
|
|
|
return (grp->il_inp[idx]);
|
|
|
|
if (grp->il_laddr.s_addr == INADDR_ANY &&
|
|
|
|
(lookupflags & INPLOOKUP_WILDCARD) != 0)
|
|
|
|
local_wild = grp->il_inp[idx];
|
2018-06-06 15:45:57 +00:00
|
|
|
}
|
2018-09-05 15:04:11 +00:00
|
|
|
return (local_wild);
|
2018-06-06 15:45:57 +00:00
|
|
|
}
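The selection order used by in_pcblookup_lbgroup() above (exact local address first, wildcard only as a fallback, with the socket chosen by a hash of the flow) can be sketched outside the kernel as follows. All names here (`struct toy_lbgroup`, `toy_pkthash()`, `toy_lookup_lbgroup()`) are hypothetical, integer socket ids stand in for `struct inpcb *`, and the hash function is an assumption that merely plays the role of INP_PCBLBGROUP_PKTHASH.

```c
#include <stddef.h>
#include <stdint.h>

#define	TOY_ADDR_ANY	0u	/* stand-in for INADDR_ANY */

/* Hypothetical load-balancing group: sockets sharing one addr/port. */
struct toy_lbgroup {
	uint32_t	laddr;		/* bound local address or TOY_ADDR_ANY */
	uint16_t	lport;		/* bound local port */
	int		count;		/* sockets in the group, must be > 0 */
	int		socks[8];	/* socket ids (stand-in for inpcb ptrs) */
};

/* Assumed flow hash; any stable mixing function works for the sketch. */
static uint32_t
toy_pkthash(uint32_t faddr, uint16_t lport, uint16_t fport)
{
	uint32_t h = faddr ^ ((uint32_t)lport << 16) ^ fport;

	h ^= h >> 16;
	return (h * 0x9e3779b1u);
}

/* Returns a socket id, or -1 when no group matches. */
static int
toy_lookup_lbgroup(const struct toy_lbgroup *grps, size_t ngrps,
    uint32_t laddr, uint16_t lport, uint32_t faddr, uint16_t fport,
    int allow_wild)
{
	int wild = -1;

	for (size_t i = 0; i < ngrps; i++) {
		const struct toy_lbgroup *g = &grps[i];
		uint32_t idx;

		if (g->lport != lport)
			continue;
		/* The same flow always lands on the same group member. */
		idx = toy_pkthash(faddr, lport, fport) % (uint32_t)g->count;
		if (g->laddr == laddr)
			return (g->socks[idx]);	/* 1. non-wild wins at once */
		if (g->laddr == TOY_ADDR_ANY && allow_wild)
			wild = g->socks[idx];	/* 2. wild, keep scanning */
	}
	return (wild);
}
```

Because the index depends only on the flow's foreign address and the two ports, repeated packets of one flow reach the same socket as long as the group's membership does not change.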
|
|
|
|
|
Implement a CPU-affine TCP and UDP connection lookup data structure,
struct inpcbgroup. pcbgroups, or "connection groups", supplement the
existing inpcbinfo connection hash table, which when pcbgroups are
enabled, might now be thought of more usefully as a per-protocol
4-tuple reservation table.
Connections are assigned to connection groups based on a hash of their
4-tuple; wildcard sockets require special handling, and are members
of all connection groups. During a connection lookup, a
per-connection group lock is employed rather than the global pcbinfo
lock. By aligning connection groups with input path processing,
connection groups take on an effective CPU affinity, especially when
aligned with RSS work placement (see a forthcoming commit for
details). This eliminates cache line migration associated with
global, protocol-layer data structures in steady state TCP and UDP
processing (with the exception of protocol-layer statistics; further
commit to follow).
Elements of this approach were inspired by Willman, Rixner, and Cox's
2006 USENIX paper, "An Evaluation of Network Stack Parallelization
Strategies in Modern Operating Systems". However, there are also
significant differences: we maintain the inpcb lock, rather than using
the connection group lock for per-connection state.
Likewise, the focus of this implementation is alignment with NIC
packet distribution strategies such as RSS, rather than pure software
strategies. Despite that focus, software distribution is supported
through the parallel netisr implementation, and works well in
configurations where the number of hardware threads is greater than
the number of NIC input queues, such as in the RMI XLR threaded MIPS
architecture.
Another important difference is the continued maintenance of existing
hash tables as "reservation tables" -- these are useful both to
distinguish the resource allocation aspect of protocol name management
and the more common-case lookup aspect. In configurations where
connection tables are aligned with hardware hashes, it is desirable to
use the traditional lookup tables for loopback or encapsulated traffic
rather than take the expense of hardware hashes that are hard to
implement efficiently in software (such as RSS Toeplitz).
Connection group support is enabled by compiling "options PCBGROUP"
into your kernel configuration; for the time being, this is an
experimental feature, and hence is not enabled by default.
Subject to the limited MFCability of change dependencies in inpcb,
and its change to the inpcbinfo init function signature, this change
in principle could be merged to FreeBSD 8.x.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-06-06 12:55:02 +00:00
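The group assignment described in the commit message above (a connection belongs to the group selected by a hash of its 4-tuple, while a wildcard socket is a member of every group) can be sketched as follows. This is only an illustration under stated assumptions: `struct toy_tuple`, `toy_tuple_hash()`, `toy_group_for()` and `toy_group_membership()` are hypothetical names, the group count is arbitrary, and the mixing hash is a placeholder rather than the RSS Toeplitz hash the message mentions.

```c
#include <stdbool.h>
#include <stdint.h>

#define	TOY_NGROUPS	4	/* assumed number of connection groups */

/* Hypothetical 4-tuple; a foreign address of 0 marks a wildcard socket. */
struct toy_tuple {
	uint32_t	faddr, laddr;
	uint16_t	fport, lport;
};

/* Placeholder mixing hash (an assumption, not Toeplitz). */
static uint32_t
toy_tuple_hash(const struct toy_tuple *t)
{
	uint32_t h = t->faddr ^ t->laddr ^ ((uint32_t)t->fport << 16) ^ t->lport;

	h ^= h >> 16;
	h *= 0x85ebca6bu;
	h ^= h >> 13;
	return (h);
}

/* Group index for a fully specified connection. */
static unsigned
toy_group_for(const struct toy_tuple *t)
{
	return (toy_tuple_hash(t) % TOY_NGROUPS);
}

static bool
toy_is_wildcard(const struct toy_tuple *t)
{
	return (t->faddr == 0);
}

/* Bitmask of the groups a socket is a member of. */
static unsigned
toy_group_membership(const struct toy_tuple *t)
{
	if (toy_is_wildcard(t))
		return ((1u << TOY_NGROUPS) - 1);	/* all groups */
	return (1u << toy_group_for(t));		/* exactly one group */
}
```

With this layout a lookup for an established connection only needs the lock of the single group its 4-tuple hashes to, which is the property the commit message relies on for CPU affinity.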
|
|
|
#ifdef PCBGROUP
|
|
|
|
/*
|
|
|
|
* Lookup PCB in hash list, using pcbgroup tables.
|
|
|
|
*/
|
|
|
|
static struct inpcb *
|
|
|
|
in_pcblookup_group(struct inpcbinfo *pcbinfo, struct inpcbgroup *pcbgroup,
|
|
|
|
struct in_addr faddr, u_int fport_arg, struct in_addr laddr,
|
|
|
|
u_int lport_arg, int lookupflags, struct ifnet *ifp)
|
|
|
|
{
|
|
|
|
struct inpcbhead *head;
|
|
|
|
struct inpcb *inp, *tmpinp;
|
|
|
|
u_short fport = fport_arg, lport = lport_arg;
|
2018-03-21 15:54:46 +00:00
|
|
|
bool locked;
|
2011-06-06 12:55:02 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* First look for an exact match.
|
|
|
|
*/
|
|
|
|
tmpinp = NULL;
|
|
|
|
INP_GROUP_LOCK(pcbgroup);
|
|
|
|
head = &pcbgroup->ipg_hashbase[INP_PCBHASH(faddr.s_addr, lport, fport,
|
|
|
|
pcbgroup->ipg_hashmask)];
|
2018-06-13 23:19:54 +00:00
|
|
|
CK_LIST_FOREACH(inp, head, inp_pcbgrouphash) {
|
2011-06-06 12:55:02 +00:00
|
|
|
#ifdef INET6
|
|
|
|
/* XXX inp locking */
|
|
|
|
if ((inp->inp_vflag & INP_IPV4) == 0)
|
|
|
|
continue;
|
|
|
|
#endif
|
|
|
|
if (inp->inp_faddr.s_addr == faddr.s_addr &&
|
|
|
|
inp->inp_laddr.s_addr == laddr.s_addr &&
|
|
|
|
inp->inp_fport == fport &&
|
|
|
|
inp->inp_lport == lport) {
|
|
|
|
/*
|
|
|
|
* XXX We should be able to directly return
|
|
|
|
* the inp here, without any checks.
|
|
|
|
* Well unless both bound with SO_REUSEPORT?
|
|
|
|
*/
|
|
|
|
if (prison_flag(inp->inp_cred, PR_IP4))
|
|
|
|
goto found;
|
|
|
|
if (tmpinp == NULL)
|
|
|
|
tmpinp = inp;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (tmpinp != NULL) {
|
|
|
|
inp = tmpinp;
|
|
|
|
goto found;
|
|
|
|
}
|
|
|
|
|
2014-07-10 03:10:56 +00:00
|
|
|
#ifdef RSS
|
|
|
|
/*
|
|
|
|
* For incoming connections, we may wish to do a wildcard
|
|
|
|
* match for an RSS-local socket.
|
|
|
|
*/
|
|
|
|
if ((lookupflags & INPLOOKUP_WILDCARD) != 0) {
|
|
|
|
struct inpcb *local_wild = NULL, *local_exact = NULL;
|
|
|
|
#ifdef INET6
|
|
|
|
struct inpcb *local_wild_mapped = NULL;
|
|
|
|
#endif
|
|
|
|
struct inpcb *jail_wild = NULL;
|
|
|
|
struct inpcbhead *head;
|
|
|
|
int injail;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Order of socket selection - we always prefer jails.
|
|
|
|
* 1. jailed, non-wild.
|
|
|
|
* 2. jailed, wild.
|
|
|
|
* 3. non-jailed, non-wild.
|
|
|
|
* 4. non-jailed, wild.
|
|
|
|
*/
|
|
|
|
|
|
|
|
head = &pcbgroup->ipg_hashbase[INP_PCBHASH(INADDR_ANY,
|
|
|
|
lport, 0, pcbgroup->ipg_hashmask)];
|
2018-06-13 23:19:54 +00:00
|
|
|
CK_LIST_FOREACH(inp, head, inp_pcbgrouphash) {
|
2014-07-10 03:10:56 +00:00
|
|
|
#ifdef INET6
|
|
|
|
/* XXX inp locking */
|
|
|
|
if ((inp->inp_vflag & INP_IPV4) == 0)
|
|
|
|
continue;
|
|
|
|
#endif
|
|
|
|
if (inp->inp_faddr.s_addr != INADDR_ANY ||
|
|
|
|
inp->inp_lport != lport)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
injail = prison_flag(inp->inp_cred, PR_IP4);
|
|
|
|
if (injail) {
|
|
|
|
if (prison_check_ip4(inp->inp_cred,
|
|
|
|
&laddr) != 0)
|
|
|
|
continue;
|
|
|
|
} else {
|
|
|
|
if (local_exact != NULL)
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (inp->inp_laddr.s_addr == laddr.s_addr) {
|
|
|
|
if (injail)
|
|
|
|
goto found;
|
|
|
|
else
|
|
|
|
local_exact = inp;
|
|
|
|
} else if (inp->inp_laddr.s_addr == INADDR_ANY) {
|
|
|
|
#ifdef INET6
|
|
|
|
/* XXX inp locking, NULL check */
|
|
|
|
if (inp->inp_vflag & INP_IPV6PROTO)
|
|
|
|
local_wild_mapped = inp;
|
|
|
|
else
|
|
|
|
#endif
|
|
|
|
if (injail)
|
|
|
|
jail_wild = inp;
|
|
|
|
else
|
|
|
|
local_wild = inp;
|
|
|
|
}
|
|
|
|
} /* CK_LIST_FOREACH */
|
|
|
|
|
|
|
|
inp = jail_wild;
|
|
|
|
if (inp == NULL)
|
|
|
|
inp = local_exact;
|
|
|
|
if (inp == NULL)
|
|
|
|
inp = local_wild;
|
|
|
|
#ifdef INET6
|
|
|
|
if (inp == NULL)
|
|
|
|
inp = local_wild_mapped;
|
|
|
|
#endif
|
|
|
|
if (inp != NULL)
|
|
|
|
goto found;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2011-06-06 12:55:02 +00:00
|
|
|
/*
|
|
|
|
* Then look for a wildcard match, if requested.
|
|
|
|
*/
|
|
|
|
if ((lookupflags & INPLOOKUP_WILDCARD) != 0) {
|
|
|
|
struct inpcb *local_wild = NULL, *local_exact = NULL;
|
|
|
|
#ifdef INET6
|
|
|
|
struct inpcb *local_wild_mapped = NULL;
|
|
|
|
#endif
|
|
|
|
struct inpcb *jail_wild = NULL;
|
|
|
|
struct inpcbhead *head;
|
|
|
|
int injail;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Order of socket selection - we always prefer jails.
|
|
|
|
* 1. jailed, non-wild.
|
|
|
|
* 2. jailed, wild.
|
|
|
|
* 3. non-jailed, non-wild.
|
|
|
|
* 4. non-jailed, wild.
|
|
|
|
*/
|
|
|
|
head = &pcbinfo->ipi_wildbase[INP_PCBHASH(INADDR_ANY, lport,
|
|
|
|
0, pcbinfo->ipi_wildmask)];
|
2018-06-13 23:19:54 +00:00
|
|
|
CK_LIST_FOREACH(inp, head, inp_pcbgroup_wild) {
|
2011-06-06 12:55:02 +00:00
|
|
|
#ifdef INET6
|
|
|
|
/* XXX inp locking */
|
|
|
|
if ((inp->inp_vflag & INP_IPV4) == 0)
|
|
|
|
continue;
|
|
|
|
#endif
|
|
|
|
if (inp->inp_faddr.s_addr != INADDR_ANY ||
|
|
|
|
inp->inp_lport != lport)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
injail = prison_flag(inp->inp_cred, PR_IP4);
|
|
|
|
if (injail) {
|
|
|
|
if (prison_check_ip4(inp->inp_cred,
|
|
|
|
&laddr) != 0)
|
|
|
|
continue;
|
|
|
|
} else {
|
|
|
|
if (local_exact != NULL)
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (inp->inp_laddr.s_addr == laddr.s_addr) {
|
|
|
|
if (injail)
|
|
|
|
goto found;
|
|
|
|
else
|
|
|
|
local_exact = inp;
|
|
|
|
} else if (inp->inp_laddr.s_addr == INADDR_ANY) {
|
|
|
|
#ifdef INET6
|
|
|
|
/* XXX inp locking, NULL check */
|
|
|
|
if (inp->inp_vflag & INP_IPV6PROTO)
|
|
|
|
local_wild_mapped = inp;
|
|
|
|
else
|
2012-01-22 02:13:19 +00:00
|
|
|
#endif
|
2011-06-06 12:55:02 +00:00
|
|
|
if (injail)
|
|
|
|
jail_wild = inp;
|
|
|
|
else
|
|
|
|
local_wild = inp;
|
|
|
|
}
|
|
|
|
} /* CK_LIST_FOREACH */
|
|
|
|
inp = jail_wild;
|
|
|
|
if (inp == NULL)
|
|
|
|
inp = local_exact;
|
|
|
|
if (inp == NULL)
|
|
|
|
inp = local_wild;
|
|
|
|
#ifdef INET6
|
|
|
|
if (inp == NULL)
|
|
|
|
inp = local_wild_mapped;
|
2012-01-22 02:13:19 +00:00
|
|
|
#endif
|
2011-06-06 12:55:02 +00:00
|
|
|
if (inp != NULL)
|
|
|
|
goto found;
|
|
|
|
} /* if (lookupflags & INPLOOKUP_WILDCARD) */
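The "we always prefer jails" ordering used by the wildcard pass above can be reduced to a small stand-alone sketch. It is a simplification under stated assumptions: `struct toy_sock` and `toy_pick()` are hypothetical names, each candidate is assumed to have already passed the port and address checks, and the IPv4-mapped INET6 case and the prison_check_ip4() filter are left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical candidate socket that already matched the local port. */
struct toy_sock {
	bool	jailed;	/* owned by a jail restricted to IPv4 addresses */
	bool	wild;	/* bound to the wildcard local address */
};

/*
 * Pick the best candidate in this order:
 *   1. jailed, non-wild   2. jailed, wild
 *   3. non-jailed, non-wild   4. non-jailed, wild
 */
static const struct toy_sock *
toy_pick(const struct toy_sock *cands, size_t n)
{
	const struct toy_sock *jail_wild = NULL, *local_exact = NULL,
	    *local_wild = NULL;

	for (size_t i = 0; i < n; i++) {
		const struct toy_sock *s = &cands[i];

		if (s->jailed && !s->wild)
			return (s);		/* best case, stop scanning */
		if (s->jailed)
			jail_wild = s;
		else if (!s->wild) {
			if (local_exact == NULL)
				local_exact = s;	/* keep the first one */
		} else
			local_wild = s;
	}
	/* Fall back through the remaining classes in priority order. */
	if (jail_wild != NULL)
		return (jail_wild);
	if (local_exact != NULL)
		return (local_exact);
	return (local_wild);
}
```

Only the best class can end the scan early; every other class has to be remembered until the loop finishes, because a better candidate may still appear later in the bucket.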
|
|
|
|
INP_GROUP_UNLOCK(pcbgroup);
|
|
|
|
return (NULL);
|
|
|
|
|
|
|
|
found:
|
2018-03-21 15:54:46 +00:00
|
|
|
if (lookupflags & INPLOOKUP_WLOCKPCB)
|
2018-03-29 19:48:17 +00:00
|
|
|
locked = INP_TRY_WLOCK(inp);
|
2018-03-21 15:54:46 +00:00
|
|
|
else if (lookupflags & INPLOOKUP_RLOCKPCB)
|
2018-03-29 19:48:17 +00:00
|
|
|
locked = INP_TRY_RLOCK(inp);
|
2018-03-21 15:54:46 +00:00
|
|
|
else
|
2011-06-06 12:55:02 +00:00
		panic("%s: locking bug", __func__);
2018-06-13 04:23:49 +00:00
	if (__predict_false(locked && (inp->inp_flags2 & INP_FREED))) {
		if (lookupflags & INPLOOKUP_WLOCKPCB)
			INP_WUNLOCK(inp);
		else
			INP_RUNLOCK(inp);
		return (NULL);
	} else if (!locked)
2018-03-21 15:54:46 +00:00
		in_pcbref(inp);
	INP_GROUP_UNLOCK(pcbgroup);
	if (!locked) {
		if (lookupflags & INPLOOKUP_WLOCKPCB) {
			INP_WLOCK(inp);
			if (in_pcbrele_wlocked(inp))
				return (NULL);
		} else {
			INP_RLOCK(inp);
			if (in_pcbrele_rlocked(inp))
				return (NULL);
		}
	}
#ifdef INVARIANTS
	if (lookupflags & INPLOOKUP_WLOCKPCB)
		INP_WLOCK_ASSERT(inp);
	else
		INP_RLOCK_ASSERT(inp);
#endif
2011-06-06 12:55:02 +00:00
	return (inp);
}
#endif /* PCBGROUP */

1995-04-09 01:29:31 +00:00
/*
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
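The "reference, then lock" ordering described above can be sketched in
userspace C. This is a minimal sketch, not the kernel code: sk_pcb, sk_table,
sk_lookup() and sk_pcbrele_locked() are hypothetical names, pthread mutexes
stand in for the inpcb and hash locks, a single slot stands in for the hash
table, and a flag is set where the kernel would actually free the structure.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct sk_pcb {
	pthread_mutex_t	lock;		/* stands in for the inpcb lock */
	atomic_int	refcount;	/* keeps the memory stable */
	bool		reclaimed;	/* set instead of calling free() */
};

struct sk_table {
	pthread_mutex_t	hash_lock;	/* stands in for the lookup lock */
	struct sk_pcb	*entry;		/* single-slot "hash table" */
};

/*
 * Drop a reference with the pcb lock held.  Returns true if this was the
 * last reference, in which case the pcb has been unlocked and reclaimed.
 */
static bool
sk_pcbrele_locked(struct sk_pcb *pcb)
{
	if (atomic_fetch_sub(&pcb->refcount, 1) != 1)
		return (false);
	pthread_mutex_unlock(&pcb->lock);
	pcb->reclaimed = true;
	return (true);
}

/* Look up the pcb and return it locked, or NULL if absent or gone. */
static struct sk_pcb *
sk_lookup(struct sk_table *t)
{
	struct sk_pcb *pcb;

	/* Take a reference while the lookup lock keeps the entry in place. */
	pthread_mutex_lock(&t->hash_lock);
	pcb = t->entry;
	if (pcb != NULL)
		atomic_fetch_add(&pcb->refcount, 1);
	pthread_mutex_unlock(&t->hash_lock);
	if (pcb == NULL)
		return (NULL);

	/* Only now acquire the pcb lock, then trade the reference for it. */
	pthread_mutex_lock(&pcb->lock);
	if (sk_pcbrele_locked(pcb))
		return (NULL);
	return (pcb);
}
```

The temporary reference is what allows the lookup lock to be dropped before
the pcb lock is taken: the memory cannot go away in between, and the release
step tells the caller whether the connection was torn down in the meantime.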
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need to acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. The new lookup flags,
which supplement the existing INPLOOKUP_WILDCARD flag, are:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
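The "exactly one of these flags" rule can be expressed as a small check. The
sketch below is illustrative only: the SK_* constants and
sk_lookupflags_valid() are hypothetical, and the bit values are chosen for
the example rather than taken from the kernel headers.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values; the kernel's definitions may differ. */
#define	SK_INPLOOKUP_WILDCARD	0x01
#define	SK_INPLOOKUP_RLOCKPCB	0x02
#define	SK_INPLOOKUP_WLOCKPCB	0x04

#define	SK_INPLOOKUP_LOCKMASK	\
	(SK_INPLOOKUP_RLOCKPCB | SK_INPLOOKUP_WLOCKPCB)
#define	SK_INPLOOKUP_MASK	\
	(SK_INPLOOKUP_WILDCARD | SK_INPLOOKUP_LOCKMASK)

/* True if no unknown bits are set and exactly one lock flag is requested. */
static bool
sk_lookupflags_valid(int lookupflags)
{
	int lockflags = lookupflags & SK_INPLOOKUP_LOCKMASK;

	if ((lookupflags & ~SK_INPLOOKUP_MASK) != 0)
		return (false);
	return (lockflags == SK_INPLOOKUP_RLOCKPCB ||
	    lockflags == SK_INPLOOKUP_WLOCKPCB);
}
```

A caller wanting a write-locked inpcb with wildcard matching would pass both
the wildcard flag and the write-lock flag; passing neither lock flag, or
both, is rejected.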
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
 * Lookup PCB in hash list, using pcbinfo tables. This variation assumes
 * that the caller has locked the hash list, and will not perform any further
 * locking or reference operations on either the hash list or the connection.
1995-04-09 01:29:31 +00:00
 */
2011-05-30 09:43:55 +00:00
static struct inpcb *
in_pcblookup_hash_locked(struct inpcbinfo *pcbinfo, struct in_addr faddr,
2011-05-23 15:23:18 +00:00
    u_int fport_arg, struct in_addr laddr, u_int lport_arg, int lookupflags,
2006-01-22 01:16:25 +00:00
    struct ifnet *ifp)
1995-04-09 01:29:31 +00:00
{
	struct inpcbhead *head;
MFp4:
Bring in updated jail support from bz_jail branch.
This enhances the current jail implementation to permit multiple
addresses per jail. In addition to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view and no networking.
SCTP support was updated and supports IPv6 in jails as well.
Cpuset support permits jails to be bound to specific processor
sets after creation.
Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.
DDB 'show jails' command was added to aid debugging.
Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.
Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around in recent years.
Bump __FreeBSD_version for the aforementioned and in-kernel changes.
Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.
Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
	struct inpcb *inp, *tmpinp;
1995-04-09 01:29:31 +00:00
	u_short fport = fport_arg, lport = lport_arg;

2011-05-23 15:23:18 +00:00
	KASSERT((lookupflags & ~(INPLOOKUP_WILDCARD)) == 0,
	    ("%s: invalid lookup flags %d", __func__, lookupflags));
2019-11-07 20:49:56 +00:00
	INP_HASH_LOCK_ASSERT(pcbinfo);

1995-04-09 01:29:31 +00:00
	/*
	 * First look for an exact match.
	 */
2008-11-29 14:32:14 +00:00
	tmpinp = NULL;
2007-04-30 23:12:05 +00:00
	head = &pcbinfo->ipi_hashbase[INP_PCBHASH(faddr.s_addr, lport, fport,
	    pcbinfo->ipi_hashmask)];
2018-06-12 22:18:20 +00:00
	CK_LIST_FOREACH(inp, head, inp_hash) {
1999-12-07 17:39:16 +00:00
#ifdef INET6
2008-11-29 14:32:14 +00:00
		/* XXX inp locking */
1999-12-21 11:14:12 +00:00
		if ((inp->inp_vflag & INP_IPV4) == 0)
1999-12-07 17:39:16 +00:00
			continue;
#endif
1996-10-07 19:06:12 +00:00
		if (inp->inp_faddr.s_addr == faddr.s_addr &&
1997-04-03 05:14:45 +00:00
		    inp->inp_laddr.s_addr == laddr.s_addr &&
		    inp->inp_fport == fport &&
2008-11-29 14:32:14 +00:00
		    inp->inp_lport == lport) {
			/*
			 * XXX We should be able to directly return
			 * the inp here, without any checks.
			 * Well unless both bound with SO_REUSEPORT?
			 */
2009-05-27 14:11:23 +00:00
			if (prison_flag(inp->inp_cred, PR_IP4))
2008-11-29 14:32:14 +00:00
				return (inp);
			if (tmpinp == NULL)
				tmpinp = inp;
		}
1996-10-07 19:06:12 +00:00
	}
2008-11-29 14:32:14 +00:00
	if (tmpinp != NULL)
		return (tmpinp);
2006-11-30 10:54:54 +00:00

2018-06-06 15:45:57 +00:00
	/*
	 * Then look in lb group (for wildcard match).
	 */
2018-11-01 15:52:49 +00:00
	if ((lookupflags & INPLOOKUP_WILDCARD) != 0) {
2018-06-06 15:45:57 +00:00
		inp = in_pcblookup_lbgroup(pcbinfo, &laddr, lport, &faddr,
		    fport, lookupflags);
2018-11-01 15:52:49 +00:00
		if (inp != NULL)
2018-06-06 15:45:57 +00:00
			return (inp);
	}

2006-11-30 10:54:54 +00:00
	/*
	 * Then look for a wildcard match, if requested.
	 */
2011-05-23 15:23:18 +00:00
	if ((lookupflags & INPLOOKUP_WILDCARD) != 0) {
2008-11-29 14:32:14 +00:00
		struct inpcb *local_wild = NULL, *local_exact = NULL;
2006-11-30 10:54:54 +00:00
#ifdef INET6
1999-12-07 17:39:16 +00:00
		struct inpcb *local_wild_mapped = NULL;
2006-11-30 10:54:54 +00:00
#endif
2008-11-29 14:32:14 +00:00
		struct inpcb *jail_wild = NULL;
		int injail;

		/*
		 * Order of socket selection - we always prefer jails.
		 *      1. jailed, non-wild.
		 *      2. jailed, wild.
		 *      3. non-jailed, non-wild.
		 *      4. non-jailed, wild.
		 */
1996-10-07 19:06:12 +00:00

2007-04-30 23:12:05 +00:00
		head = &pcbinfo->ipi_hashbase[INP_PCBHASH(INADDR_ANY, lport,
		    0, pcbinfo->ipi_hashmask)];
2018-06-12 22:18:20 +00:00
		CK_LIST_FOREACH(inp, head, inp_hash) {
1999-12-07 17:39:16 +00:00
#ifdef INET6
MFp4:
Bring in updated jail support from bz_jail branch.
This enhances the current jail implementation to permit multiple
addresses per jail. In addition to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking, etc.
SCTP support was updated and supports IPv6 in jails as well.
Cpuset support permits jails to be bound to specific processor
sets after creation.
Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.
DDB 'show jails' command was added to aid debugging.
Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.
Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.
Bump __FreeBSD_version for the aforementioned and in-kernel changes.
Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.
Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
|
|
|
/* XXX inp locking */
|
1999-12-21 11:14:12 +00:00
|
|
|
if ((inp->inp_vflag & INP_IPV4) == 0)
|
1999-12-07 17:39:16 +00:00
|
|
|
continue;
|
|
|
|
#endif
|
2008-11-29 14:32:14 +00:00
|
|
|
if (inp->inp_faddr.s_addr != INADDR_ANY ||
|
|
|
|
inp->inp_lport != lport)
|
|
|
|
continue;
|
|
|
|
|
2009-05-27 14:11:23 +00:00
|
|
|
injail = prison_flag(inp->inp_cred, PR_IP4);
|
2008-11-29 14:32:14 +00:00
|
|
|
if (injail) {
|
2009-02-05 14:06:09 +00:00
|
|
|
if (prison_check_ip4(inp->inp_cred,
|
|
|
|
&laddr) != 0)
|
1999-12-07 17:39:16 +00:00
|
|
|
continue;
|
2008-11-29 14:32:14 +00:00
|
|
|
} else {
|
|
|
|
if (local_exact != NULL)
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (inp->inp_laddr.s_addr == laddr.s_addr) {
|
|
|
|
if (injail)
|
Improved connection establishment performance by doing local port lookups via
a hashed port list. In the new scheme, in_pcblookup() goes away and is
replaced by a new routine, in_pcblookup_local() for doing the local port
check. Note that this implementation is space inefficient in that the PCB
struct is now too large to fit into 128 bytes. I might deal with this in the
future by using the new zone allocator, but I wanted these changes to be
extensively tested in their current form first.
Also:
1) Fixed off-by-one errors in the port lookup loops in in_pcbbind().
2) Got rid of some unneeded rehashing. Added a new routine, in_pcbinshash(),
to do the initial hash insertion.
3) Renamed in_pcblookuphash() to in_pcblookup_hash() for easier readability.
4) Added a new routine, in_pcbremlists() to remove the PCB from the various
hash lists.
5) Added/deleted comments where appropriate.
6) Removed unnecessary splnet() locking. In general, the PCB functions should
be called at splnet()...there are unfortunately a few exceptions, however.
7) Reorganized a few structs for better cache line behavior.
8) Killed my TCP_ACK_HACK kludge. It may come back in a different form in
the future, however.
These changes have been tested on wcarchive for more than a month. In tests
done here, connection establishment overhead is reduced by more than 50
times, thus getting rid of one of the major networking scalability problems.
Still to do: make tcp_fasttimo/tcp_slowtimo scale well for systems with a
large number of connections. tcp_fasttimo is easy; tcp_slowtimo is difficult.
WARNING: Anything that knows about inpcb and tcpcb structs will have to be
recompiled; at the very least, this includes netstat(1).
1998-01-27 09:15:13 +00:00
|
|
|
return (inp);
|
2008-11-29 14:32:14 +00:00
|
|
|
else
|
|
|
|
local_exact = inp;
|
|
|
|
} else if (inp->inp_laddr.s_addr == INADDR_ANY) {
|
2006-11-30 10:54:54 +00:00
|
|
|
#ifdef INET6
|
2008-11-29 14:32:14 +00:00
|
|
|
/* XXX inp locking, NULL check */
|
|
|
|
if (inp->inp_vflag & INP_IPV6PROTO)
|
|
|
|
local_wild_mapped = inp;
|
|
|
|
else
|
2012-01-22 02:13:19 +00:00
|
|
|
#endif
|
2008-11-29 14:32:14 +00:00
|
|
|
if (injail)
|
|
|
|
jail_wild = inp;
|
1999-12-07 17:39:16 +00:00
|
|
|
else
|
2006-11-30 10:54:54 +00:00
|
|
|
local_wild = inp;
|
1996-10-07 19:06:12 +00:00
|
|
|
}
|
2008-11-29 14:32:14 +00:00
|
|
|
} /* CK_LIST_FOREACH */
|
|
|
|
if (jail_wild != NULL)
|
|
|
|
return (jail_wild);
|
|
|
|
if (local_exact != NULL)
|
|
|
|
return (local_exact);
|
|
|
|
if (local_wild != NULL)
|
|
|
|
return (local_wild);
|
2006-11-30 10:54:54 +00:00
|
|
|
#ifdef INET6
|
2008-11-29 14:32:14 +00:00
|
|
|
if (local_wild_mapped != NULL)
|
1999-12-07 17:39:16 +00:00
|
|
|
return (local_wild_mapped);
|
2012-01-22 02:13:19 +00:00
|
|
|
#endif
|
2011-05-23 15:23:18 +00:00
|
|
|
} /* if ((lookupflags & INPLOOKUP_WILDCARD) != 0) */
|
2008-11-29 14:32:14 +00:00
|
|
|
|
1998-01-27 09:15:13 +00:00
|
|
|
return (NULL);
|
1995-04-09 01:29:31 +00:00
|
|
|
}
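The wildcard branch above keeps up to four candidates while it walks the hash chain and then returns them in a fixed preference order. The following is a minimal user-space sketch of that ordering only, not the kernel code: `struct toy_pcb`, its fields, and `toy_pick()` are hypothetical names introduced for illustration, and the address, port, and `prison_check_ip4()` filtering is assumed to have already been applied to the candidates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for an inpcb; not the kernel struct. */
struct toy_pcb {
	bool jailed;	/* owning credential is in an IPv4-restricted jail */
	bool wild;	/* bound to INADDR_ANY rather than a specific address */
	bool mapped;	/* IPv6 socket that also accepts IPv4 (wildcard case) */
};

/*
 * Pick the preferred candidate among n already-filtered PCBs, or NULL.
 * Preference: jailed exact, jailed wild, non-jailed exact,
 * non-jailed wild, and last an IPv6-mapped wildcard.
 */
static const struct toy_pcb *
toy_pick(const struct toy_pcb *pcbs, size_t n)
{
	const struct toy_pcb *jail_wild = NULL, *local_exact = NULL;
	const struct toy_pcb *local_wild = NULL, *local_wild_mapped = NULL;

	assert(pcbs != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		const struct toy_pcb *p = &pcbs[i];

		if (!p->wild) {
			if (p->jailed)
				return (p);	/* best match, stop early */
			if (local_exact == NULL)
				local_exact = p;
		} else if (p->mapped) {
			local_wild_mapped = p;
		} else if (p->jailed) {
			jail_wild = p;
		} else {
			local_wild = p;
		}
	}
	if (jail_wild != NULL)
		return (jail_wild);
	if (local_exact != NULL)
		return (local_exact);
	if (local_wild != NULL)
		return (local_wild);
	return (local_wild_mapped);
}
```

In this sketch a jailed exact match short-circuits the walk, while the other three classes are only remembered and ranked after the loop, which is why a jailed wildcard socket can win over a non-jailed exact match.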
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. The new lookup flags,
supplementing the existing INPLOOKUP_WILDCARD flag, are:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
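The commit message above says that callers pass exactly one of the two lock flags, and that the lock flags are stripped before the hash walk. A small user-space sketch of that flag handling follows. The `TOY_*` constants, their numeric values, and both helper functions are assumptions made for illustration and are not the kernel definitions. The sketch also rejects the case where both lock bits are set, which is stricter than a first-match check.

```c
#include <assert.h>

/* Hypothetical flag values; the real INPLOOKUP_* constants may differ. */
#define	TOY_WILDCARD	0x1	/* allow wildcard sockets */
#define	TOY_RLOCKPCB	0x2	/* return the PCB read-locked */
#define	TOY_WLOCKPCB	0x4	/* return the PCB write-locked */

enum toy_lock { TOY_LOCK_INVALID, TOY_LOCK_READ, TOY_LOCK_WRITE };

/* Decide which lock to take; exactly one lock bit must be set. */
static enum toy_lock
toy_lock_mode(int lookupflags)
{
	int lockbits = lookupflags & (TOY_RLOCKPCB | TOY_WLOCKPCB);

	if (lockbits == TOY_WLOCKPCB)
		return (TOY_LOCK_WRITE);
	if (lockbits == TOY_RLOCKPCB)
		return (TOY_LOCK_READ);
	return (TOY_LOCK_INVALID);	/* neither or both: caller bug */
}

/* Flags handed to the hash walk itself carry no lock bits. */
static int
toy_strip_lock_flags(int lookupflags)
{
	return (lookupflags & ~(TOY_RLOCKPCB | TOY_WLOCKPCB));
}
```

A caller in this model would first compute the lock mode, treat `TOY_LOCK_INVALID` as a programming error, run the lookup with the stripped flags, and only then take the requested lock on the result.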
|
|
|
|
|
|
|
/*
|
|
|
|
* Lookup PCB in hash list, using pcbinfo tables. This variation locks the
|
|
|
|
* hash list lock, and will return the inpcb locked (i.e., requires
|
|
|
|
* INPLOOKUP_RLOCKPCB or INPLOOKUP_WLOCKPCB).
|
|
|
|
*/
|
|
|
|
static struct inpcb *
|
|
|
|
in_pcblookup_hash(struct inpcbinfo *pcbinfo, struct in_addr faddr,
|
|
|
|
u_int fport, struct in_addr laddr, u_int lport, int lookupflags,
|
|
|
|
struct ifnet *ifp)
|
|
|
|
{
|
|
|
|
struct inpcb *inp;
|
|
|
|
|
|
|
|
inp = in_pcblookup_hash_locked(pcbinfo, faddr, fport, laddr, lport,
|
|
|
|
(lookupflags & ~(INPLOOKUP_RLOCKPCB | INPLOOKUP_WLOCKPCB)), ifp);
|
|
|
|
if (inp != NULL) {
|
2018-06-21 18:40:15 +00:00
|
|
|
if (lookupflags & INPLOOKUP_WLOCKPCB) {
|
|
|
|
INP_WLOCK(inp);
|
|
|
|
if (__predict_false(inp->inp_flags2 & INP_FREED)) {
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
inp = NULL;
|
2018-03-21 15:54:46 +00:00
|
|
|
}
|
2018-06-21 18:40:15 +00:00
|
|
|
} else if (lookupflags & INPLOOKUP_RLOCKPCB) {
|
|
|
|
INP_RLOCK(inp);
|
|
|
|
if (__predict_false(inp->inp_flags2 & INP_FREED)) {
|
|
|
|
INP_RUNLOCK(inp);
|
|
|
|
inp = NULL;
|
|
|
|
}
|
|
|
|
} else
|
|
|
|
panic("%s: locking bug", __func__);
|
2018-03-21 15:54:46 +00:00
|
|
|
#ifdef INVARIANTS
|
2018-06-21 18:40:15 +00:00
|
|
|
if (inp != NULL) {
|
|
|
|
if (lookupflags & INPLOOKUP_WLOCKPCB)
|
|
|
|
INP_WLOCK_ASSERT(inp);
|
|
|
|
else
|
|
|
|
INP_RLOCK_ASSERT(inp);
|
|
|
|
}
|
2018-03-21 15:54:46 +00:00
|
|
|
#endif
|
2018-06-21 18:40:15 +00:00
|
|
|
}
|
2019-11-07 20:49:56 +00:00
|
|
|
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
	return (inp);
}
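
The lock-then-recheck step in in_pcblookup_hash() above can be sketched in user space as follows. This is only an illustration under stated assumptions, not kernel code: struct demo_pcb, the DEMO_* flags and demo_lock_found() are hypothetical names, the flag values are arbitrary, and the per-inpcb rwlock is modelled as a plain state field rather than a real lock.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the lookup flags; values are illustrative. */
#define DEMO_RLOCKPCB	0x1
#define DEMO_WLOCKPCB	0x2

enum demo_lock { DEMO_UNLOCKED, DEMO_RLOCKED, DEMO_WLOCKED };

struct demo_pcb {
	enum demo_lock lock;	/* models the per-inpcb rwlock state */
	int freed;		/* models INP_FREED in inp_flags2 */
};

/*
 * Lock the candidate as requested, then re-check that it was not freed
 * between the table lookup and the lock acquisition.  Returns the locked
 * pcb, or NULL with no lock held.
 */
static struct demo_pcb *
demo_lock_found(struct demo_pcb *pcb, int flags)
{
	if (pcb == NULL)
		return (NULL);
	/* The kernel code panics when neither lock flag is passed. */
	assert((flags & (DEMO_RLOCKPCB | DEMO_WLOCKPCB)) != 0);
	/* As in the original, the write-lock flag is tested first. */
	pcb->lock = (flags & DEMO_WLOCKPCB) ? DEMO_WLOCKED : DEMO_RLOCKED;
	if (pcb->freed) {
		pcb->lock = DEMO_UNLOCKED;	/* unlock before failing */
		return (NULL);
	}
	return (pcb);
}
```

The point of the second check is that the entry may be marked freed after the table lookup finds it but before the caller holds its lock, so a non-NULL result is trusted only once the lock is held.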

/*
Add _mbuf() variants of various inpcb-related interfaces, including lookup,
hash install, etc. For now, these arguments are unused, but as we add
RSS support, we will want to use hashes extracted from mbufs, rather than
manually calculated hashes of header fields, due to the expense of the
software version of Toeplitz (and similar hashes).
Add notes that it would be nice to be able to pass mbufs into lookup
routines in pf(4), optimising firewall lookup in the same way, but the
code structure there doesn't facilitate that currently.
(In principle there is no reason this couldn't be MFCed -- the change
extends rather than modifies the KBI. However, it won't be useful without
other previous possibly less MFCable changes.)
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-06-04 16:33:06 +00:00
 * Public inpcb lookup routines, accepting a 4-tuple, and optionally, an mbuf
 * from which a pre-calculated hash value may be extracted.
Implement a CPU-affine TCP and UDP connection lookup data structure,
struct inpcbgroup. pcbgroups, or "connection groups", supplement the
existing inpcbinfo connection hash table, which when pcbgroups are
enabled, might now be thought of more usefully as a per-protocol
4-tuple reservation table.
Connections are assigned to connection groups based on a hash of their
4-tuple; wildcard sockets require special handling, and are members
of all connection groups. During a connection lookup, a
per-connection group lock is employed rather than the global pcbinfo
lock. By aligning connection groups with input path processing,
connection groups take on an effective CPU affinity, especially when
aligned with RSS work placement (see a forthcoming commit for
details). This eliminates cache line migration associated with
global, protocol-layer data structures in steady state TCP and UDP
processing (with the exception of protocol-layer statistics; further
commit to follow).
Elements of this approach were inspired by Willman, Rixner, and Cox's
2006 USENIX paper, "An Evaluation of Network Stack Parallelization
Strategies in Modern Operating Systems". However, there are also
significant differences: we maintain the inpcb lock, rather than using
the connection group lock for per-connection state.
Likewise, the focus of this implementation is alignment with NIC
packet distribution strategies such as RSS, rather than pure software
strategies. Despite that focus, software distribution is supported
through the parallel netisr implementation, and works well in
configurations where the number of hardware threads is greater than
the number of NIC input queues, such as in the RMI XLR threaded MIPS
architecture.
Another important difference is the continued maintenance of existing
hash tables as "reservation tables" -- these are useful both to
distinguish the resource allocation aspect of protocol name management
and the more common-case lookup aspect. In configurations where
connection tables are aligned with hardware hashes, it is desirable to
use the traditional lookup tables for loopback or encapsulated traffic
rather than take the expense of hardware hashes that are hard to
implement efficiently in software (such as RSS Toeplitz).
Connection group support is enabled by compiling "options PCBGROUP"
into your kernel configuration; for the time being, this is an
experimental feature, and hence is not enabled by default.
Subject to the limited MFCability of change dependencies in inpcb,
and its change to the inpcbinfo init function signature, this change
in principle could be merged to FreeBSD 8.x.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-06-06 12:55:02 +00:00
 *
 * Possibly more of this logic should be in in_pcbgroup.c.
 */
struct inpcb *
in_pcblookup(struct inpcbinfo *pcbinfo, struct in_addr faddr, u_int fport,
    struct in_addr laddr, u_int lport, int lookupflags, struct ifnet *ifp)
{
Several years after initial development, merge prototype support for
linking NIC Receive Side Scaling (RSS) to the network stack's
connection-group implementation. This prototype (and derived patches)
are in use at Juniper and several other FreeBSD-using companies, so
despite some reservations about its maturity, merge the patch to the
base tree so that it can be iteratively refined in collaboration rather
than maintained as a set of gradually diverging patch sets.
(1) Merge a software implementation of the Toeplitz hash specified in
RSS implemented by David Malone. This is used to allow suitable
pcbgroup placement of connections before the first packet is
received from the NIC. Software hashing is generally avoided,
however, due to high cost of the hash on general-purpose CPUs.
(2) In in_rss.c, maintain authoritative versions of RSS state intended
to be pushed to each NIC, including keying material, hash
algorithm/configuration, and buckets. Provide software-facing
interfaces to hash 2- and 4-tuples for IPv4 and IPv6 using both
the RSS standardised Toeplitz and a 'naive' variation with a hash
efficient in software but with poor distribution properties.
Implement rss_m2cpuid() to be used by netisr and other load
balancing code to look up the CPU on which an mbuf should be
processed.
(3) In the Ethernet link layer, allow netisr distribution using RSS as
a source of policy as an alternative to source ordering; continue
to default to direct dispatch (i.e., don't try and requeue packets
for processing on the 'right' CPU if they arrive in a directly
dispatchable context).
(4) Allow RSS to control tuning of connection groups in order to align
groups with RSS buckets. If a packet arrives on a protocol using
connection groups, and contains a suitable hardware-generated
hash, use that hash value to select the connection group for pcb
lookup for both IPv4 and IPv6. If no hardware-generated Toeplitz
hash is available, we fall back on regular PCB lookup risking
contention rather than pay the cost of Toeplitz in software --
this is a less scalable but, at my last measurement, faster
approach. As core counts go up, we may want to revise this
strategy despite CPU overhead.
Where device drivers suitably configure NICs, and connection groups /
RSS are enabled, this should avoid both lock and line contention during
connection lookup for TCP. This commit does not modify any device
drivers to tune device RSS configuration to the global RSS
configuration; patches are in circulation to do this for at least
Chelsio T3 and Intel 1G/10G drivers. Currently, the KPI for device
drivers is not particularly robust, nor aware of more advanced features
such as runtime reconfiguration/rebalancing. This will hopefully prove
a useful starting point for refinement.
No MFC is scheduled as we will first want to nail down a more mature
and maintainable KPI/KBI for device drivers.
Sponsored by: Juniper Networks (original work)
Sponsored by: EMC/Isilon (patch update and merge)
2014-03-15 00:57:50 +00:00
#if defined(PCBGROUP) && !defined(RSS)
	struct inpcbgroup *pcbgroup;
#endif

	KASSERT((lookupflags & ~INPLOOKUP_MASK) == 0,
	    ("%s: invalid lookup flags %d", __func__, lookupflags));
	KASSERT((lookupflags & (INPLOOKUP_RLOCKPCB | INPLOOKUP_WLOCKPCB)) != 0,
	    ("%s: LOCKPCB not set", __func__));

	/*
	 * When not using RSS, use connection groups in preference to the
	 * reservation table when looking up 4-tuples. When using RSS, just
	 * use the reservation table, due to the cost of the Toeplitz hash
	 * in software.
	 *
	 * XXXRW: This policy belongs in the pcbgroup code, as in principle
	 * we could be doing RSS with a non-Toeplitz hash that is affordable
	 * in software.
	 */
#if defined(PCBGROUP) && !defined(RSS)
	if (in_pcbgroup_enabled(pcbinfo)) {
		pcbgroup = in_pcbgroup_bytuple(pcbinfo, laddr, lport, faddr,
		    fport);
		return (in_pcblookup_group(pcbinfo, pcbgroup, faddr, fport,
		    laddr, lport, lookupflags, ifp));
	}
#endif
	return (in_pcblookup_hash(pcbinfo, faddr, fport, laddr, lport,
	    lookupflags, ifp));
}
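
The flag checks and the choice between connection-group lookup and the reservation (hash) table in in_pcblookup() above can be sketched as plain C. This is a hedged illustration only: the TOY_* names, the flag values and both functions are hypothetical, and the compile-time PCBGROUP/RSS options are modelled here as run-time booleans so the policy can be exercised outside a kernel build.

```c
#include <assert.h>

/* Hypothetical flag values, chosen only for this sketch. */
#define TOY_WILDCARD	0x1
#define TOY_RLOCKPCB	0x2
#define TOY_WLOCKPCB	0x4
#define TOY_MASK	(TOY_WILDCARD | TOY_RLOCKPCB | TOY_WLOCKPCB)

enum toy_path { TOY_PATH_GROUP, TOY_PATH_HASH };

/* Mirrors the two KASSERTs: no unknown bits, and a lock flag present. */
static int
toy_flags_valid(int flags)
{
	if ((flags & ~TOY_MASK) != 0)
		return (0);
	return ((flags & (TOY_RLOCKPCB | TOY_WLOCKPCB)) != 0);
}

/*
 * Connection groups are used only when group support is built in, RSS
 * is not, and groups are enabled for this pcbinfo; otherwise the lookup
 * falls back to the reservation (hash) table.
 */
static enum toy_path
toy_lookup_path(int flags, int built_pcbgroup, int built_rss,
    int groups_enabled)
{
	assert(toy_flags_valid(flags));	/* stands in for KASSERT */
	if (built_pcbgroup && !built_rss && groups_enabled)
		return (TOY_PATH_GROUP);
	return (TOY_PATH_HASH);
}
```

Note that the assertions only require at least one lock flag to be present; the stricter "exactly one" rule described in the commit message is left to the caller in this sketch, as it is in the code above.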
|
Add _mbuf() variants of various inpcb-related interfaces, including lookup,
hash install, etc. For now, these arguments are unused, but as we add
RSS support, we will want to use hashes extracted from mbufs, rather than
manually calculated hashes of header fields, due to the expense of the
software version of Toeplitz (and similar hashes).
Add notes that it would be nice to be able to pass mbufs into lookup
routines in pf(4), optimising firewall lookup in the same way, but the
code structure there doesn't facilitate that currently.
(In principle there is no reason this couldn't be MFCed -- the change
extends rather than modifies the KBI. However, it won't be useful without
other previous possibly less MFCable changes.)
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-06-04 16:33:06 +00:00
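The intent described above (prefer a hash carried with the packet, otherwise compute one from header fields) might look roughly like the sketch below. All names here (pkt_sketch, tuple4_sketch, soft_hash_sketch(), lookup_hash_pkt_sketch()) are hypothetical and are not the kernel's interfaces. The sketch also assumes that the hardware and software paths compute the same hash function, which it does not enforce.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical packet descriptor: may carry a NIC-computed flow hash. */
struct pkt_sketch {
	bool		has_hw_hash;
	uint32_t	hw_hash;
};

/* Hypothetical connection 4-tuple. */
struct tuple4_sketch {
	uint32_t	faddr, laddr;
	uint16_t	fport, lport;
};

/* Cheap placeholder for a software hash over the header fields. */
uint32_t
soft_hash_sketch(const struct tuple4_sketch *t)
{
	uint32_t h;

	assert(t != NULL);
	h = t->faddr ^ t->laddr ^ (((uint32_t)t->fport << 16) | t->lport);
	h ^= h >> 16;
	return (h);
}

/*
 * "_mbuf"-style variant: takes the packet as an extra argument.  If the
 * packet carries a hash, use it; otherwise fall back to hashing the
 * header fields in software.
 */
uint32_t
lookup_hash_pkt_sketch(const struct tuple4_sketch *t,
    const struct pkt_sketch *m)
{
	if (m != NULL && m->has_hw_hash)
		return (m->hw_hash);
	return (soft_hash_sketch(t));
}
```

Passing NULL for the packet reproduces the behaviour of the non-mbuf interface, which is one way a wrapper can extend an interface without changing existing callers.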
|
|
|
|
|
|
|
struct inpcb *
|
|
|
|
in_pcblookup_mbuf(struct inpcbinfo *pcbinfo, struct in_addr faddr,
|
|
|
|
u_int fport, struct in_addr laddr, u_int lport, int lookupflags,
|
|
|
|
struct ifnet *ifp, struct mbuf *m)
|
|
|
|
{
|
Implement a CPU-affine TCP and UDP connection lookup data structure,
struct inpcbgroup. pcbgroups, or "connection groups", supplement the
existing inpcbinfo connection hash table, which when pcbgroups are
enabled, might now be thought of more usefully as a per-protocol
4-tuple reservation table.
Connections are assigned to connection groups based on a hash of their
4-tuple; wildcard sockets require special handling, and are members
of all connection groups. During a connection lookup, a
per-connection group lock is employed rather than the global pcbinfo
lock. By aligning connection groups with input path processing,
connection groups take on an effective CPU affinity, especially when
aligned with RSS work placement (see a forthcoming commit for
details). This eliminates cache line migration associated with
global, protocol-layer data structures in steady state TCP and UDP
processing (with the exception of protocol-layer statistics; further
commit to follow).
Elements of this approach were inspired by Willman, Rixner, and Cox's
2006 USENIX paper, "An Evaluation of Network Stack Parallelization
Strategies in Modern Operating Systems". However, there are also
significant differences: we maintain the inpcb lock, rather than using
the connection group lock for per-connection state.
Likewise, the focus of this implementation is alignment with NIC
packet distribution strategies such as RSS, rather than pure software
strategies. Despite that focus, software distribution is supported
through the parallel netisr implementation, and works well in
configurations where the number of hardware threads is greater than
the number of NIC input queues, such as in the RMI XLR threaded MIPS
architecture.
Another important difference is the continued maintenance of existing
hash tables as "reservation tables" -- these are useful both to
distinguish the resource allocation aspect of protocol name management
and the more common-case lookup aspect. In configurations where
connection tables are aligned with hardware hashes, it is desirable to
use the traditional lookup tables for loopback or encapsulated traffic
rather than take the expense of hardware hashes that are hard to
implement efficiently in software (such as RSS Toeplitz).
Connection group support is enabled by compiling "options PCBGROUP"
into your kernel configuration; for the time being, this is an
experimental feature, and hence is not enabled by default.
Subject to the limited MFCability of change dependencies in inpcb,
and its change to the inpcbinfo init function signature, this change
in principle could be merged to FreeBSD 8.x.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-06-06 12:55:02 +00:00
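The group assignment rule described above (connected sockets go to one group chosen by a hash of the 4-tuple, wildcard sockets are members of every group) can be sketched as follows. This is a simplified user-space illustration: NGROUPS_SKETCH, group_sketch, and the hash function are assumptions, the groups only count members instead of holding lists, and the per-group locks are omitted.

```c
#include <assert.h>
#include <stdint.h>

#define	NGROUPS_SKETCH	4	/* e.g. one group per input-processing CPU */

/* Hypothetical group: counts stand in for the real per-group tables. */
struct group_sketch {
	unsigned	nconns;	/* fully specified connections */
	unsigned	nwild;	/* wildcard (listening) sockets */
};

/* Placeholder hash; the real choice would match the NIC's distribution. */
static uint32_t
tuple_hash_sketch(uint32_t faddr, uint16_t fport, uint32_t laddr,
    uint16_t lport)
{
	uint32_t h = faddr ^ laddr ^ (((uint32_t)fport << 16) | lport);

	h ^= h >> 16;
	h *= 0x45d9f3bU;
	h ^= h >> 16;
	return (h);
}

unsigned
group_for_tuple(uint32_t faddr, uint16_t fport, uint32_t laddr,
    uint16_t lport)
{
	return (tuple_hash_sketch(faddr, fport, laddr, lport) %
	    NGROUPS_SKETCH);
}

/*
 * Wildcard sockets are added to every group so that a lookup needs to
 * consult (and lock) only the one group selected by the packet's hash.
 */
void
group_insert(struct group_sketch *groups, int wildcard, uint32_t faddr,
    uint16_t fport, uint32_t laddr, uint16_t lport)
{
	unsigned i;

	assert(groups != 0);
	if (wildcard) {
		for (i = 0; i < NGROUPS_SKETCH; i++)
			groups[i].nwild++;
		return;
	}
	groups[group_for_tuple(faddr, fport, laddr, lport)].nconns++;
}
```

The cost of this layout is that inserting or removing a wildcard socket touches every group, which is acceptable when listen sockets change rarely compared with connection lookups.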
|
|
|
#ifdef PCBGROUP
|
|
|
|
struct inpcbgroup *pcbgroup;
|
|
|
|
#endif
|
2011-06-04 16:33:06 +00:00
|
|
|
|
|
|
|
KASSERT((lookupflags & ~INPLOOKUP_MASK) == 0,
|
|
|
|
("%s: invalid lookup flags %d", __func__, lookupflags));
|
|
|
|
KASSERT((lookupflags & (INPLOOKUP_RLOCKPCB | INPLOOKUP_WLOCKPCB)) != 0,
|
|
|
|
("%s: LOCKPCB not set", __func__));
|
|
|
|
|
2011-06-06 12:55:02 +00:00
|
|
|
#ifdef PCBGROUP
|
Several years after initial development, merge prototype support for
linking NIC Receive Side Scaling (RSS) to the network stack's
connection-group implementation. This prototype (and derived patches)
are in use at Juniper and several other FreeBSD-using companies, so
despite some reservations about its maturity, merge the patch to the
base tree so that it can be iteratively refined in collaboration rather
than maintained as a set of gradually diverging patch sets.
(1) Merge a software implementation of the Toeplitz hash specified in
RSS implemented by David Malone. This is used to allow suitable
pcbgroup placement of connections before the first packet is
received from the NIC. Software hashing is generally avoided,
however, due to high cost of the hash on general-purpose CPUs.
(2) In in_rss.c, maintain authoritative versions of RSS state intended
to be pushed to each NIC, including keying material, hash
algorithm/configuration, and buckets. Provide software-facing
interfaces to hash 2- and 4-tuples for IPv4 and IPv6 using both
the RSS standardised Toeplitz and a 'naive' variation with a hash
efficient in software but with poor distribution properties.
Implement rss_m2cpuid() to be used by netisr and other load
balancing code to look up the CPU on which an mbuf should be
processed.
(3) In the Ethernet link layer, allow netisr distribution using RSS as
a source of policy as an alternative to source ordering; continue
to default to direct dispatch (i.e., don't try and requeue packets
for processing on the 'right' CPU if they arrive in a directly
dispatchable context).
(4) Allow RSS to control tuning of connection groups in order to align
groups with RSS buckets. If a packet arrives on a protocol using
connection groups, and contains a suitable hardware-generated
hash, use that hash value to select the connection group for pcb
lookup for both IPv4 and IPv6. If no hardware-generated Toeplitz
hash is available, we fall back on regular PCB lookup risking
contention rather than pay the cost of Toeplitz in software --
this is a less scalable but, at my last measurement, faster
approach. As core counts go up, we may want to revise this
strategy despite CPU overhead.
Where device drivers suitably configure NICs, and connection groups /
RSS are enabled, this should avoid both lock and line contention during
connection lookup for TCP. This commit does not modify any device
drivers to tune device RSS configuration to the global RSS
configuration; patches are in circulation to do this for at least
Chelsio T3 and Intel 1G/10G drivers. Currently, the KPI for device
drivers is not particularly robust, nor aware of more advanced features
such as runtime reconfiguration/rebalancing. This will hopefully prove
a useful starting point for refinement.
No MFC is scheduled as we will first want to nail down a more mature
and maintainable KPI/KBI for device drivers.
Sponsored by: Juniper Networks (original work)
Sponsored by: EMC/Isilon (patch update and merge)
2014-03-15 00:57:50 +00:00
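For readers unfamiliar with the Toeplitz hash mentioned in item (1), the following is a minimal bit-at-a-time sketch of the general technique, not the implementation merged in this commit. The function name toeplitz_hash_sketch() is hypothetical, and the sketch assumes the key is at least four bytes longer than the input. The per-bit loop also shows why the software version is costly on general-purpose CPUs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Toeplitz hash: for every set bit in the input (most significant bit
 * first), XOR in the 32-bit window of the key that starts at that bit
 * position.  Requires keylen >= datalen + 4.
 */
uint32_t
toeplitz_hash_sketch(const uint8_t *key, size_t keylen, const uint8_t *data,
    size_t datalen)
{
	uint32_t hash = 0;
	uint32_t window;
	size_t i;
	int b;

	assert(key != NULL && data != NULL);
	assert(keylen >= datalen + 4);

	/* The window starts as the first 32 bits of the key. */
	window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
	    ((uint32_t)key[2] << 8) | (uint32_t)key[3];

	for (i = 0; i < datalen; i++) {
		for (b = 7; b >= 0; b--) {
			if (data[i] & (1u << b))
				hash ^= window;
			/* Slide the window one bit further along the key. */
			window <<= 1;
			if (key[i + 4] & (1u << b))
				window |= 1;
		}
	}
	return (hash);
}
```

Two properties follow directly from the construction and are handy as sanity checks: an all-zero input hashes to zero, and the hash is linear over XOR, so hash(a ^ b) equals hash(a) ^ hash(b) for equal-length inputs.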
|
|
|
/*
|
|
|
|
* If we can use a hardware-generated hash to look up the connection
|
|
|
|
* group, use that connection group to find the inpcb. Otherwise
|
|
|
|
* fall back on a software hash -- or the reservation table if we're
|
|
|
|
* using RSS.
|
|
|
|
*
|
|
|
|
* XXXRW: As above, that policy belongs in the pcbgroup code.
|
|
|
|
*/
|
|
|
|
if (in_pcbgroup_enabled(pcbinfo) &&
|
|
|
|
!(M_HASHTYPE_TEST(m, M_HASHTYPE_NONE))) {
|
2011-06-06 12:55:02 +00:00
|
|
|
pcbgroup = in_pcbgroup_byhash(pcbinfo, M_HASHTYPE_GET(m),
|
|
|
|
m->m_pkthdr.flowid);
|
|
|
|
if (pcbgroup != NULL)
|
|
|
|
return (in_pcblookup_group(pcbinfo, pcbgroup, faddr,
|
|
|
|
fport, laddr, lport, lookupflags, ifp));
|
2014-03-15 00:57:50 +00:00
|
|
|
#ifndef RSS
|
2011-06-06 12:55:02 +00:00
|
|
|
pcbgroup = in_pcbgroup_bytuple(pcbinfo, laddr, lport, faddr,
|
|
|
|
fport);
|
|
|
|
return (in_pcblookup_group(pcbinfo, pcbgroup, faddr, fport,
|
|
|
|
laddr, lport, lookupflags, ifp));
|
2014-03-15 00:57:50 +00:00
|
|
|
#endif
|
2011-06-06 12:55:02 +00:00
|
|
|
}
|
|
|
|
#endif
|
2011-06-04 16:33:06 +00:00
|
|
|
return (in_pcblookup_hash(pcbinfo, faddr, fport, laddr, lport,
|
|
|
|
lookupflags, ifp));
|
|
|
|
}
|
2011-04-30 11:04:34 +00:00
|
|
|
#endif /* INET */
|
1995-04-09 01:29:31 +00:00
|
|
|
|
1995-04-10 08:52:45 +00:00
|
|
|
/*
|
Improved connection establishment performance by doing local port lookups via
a hashed port list. In the new scheme, in_pcblookup() goes away and is
replaced by a new routine, in_pcblookup_local() for doing the local port
check. Note that this implementation is space inefficient in that the PCB
struct is now too large to fit into 128 bytes. I might deal with this in the
future by using the new zone allocator, but I wanted these changes to be
extensively tested in their current form first.
Also:
1) Fixed off-by-one errors in the port lookup loops in in_pcbbind().
2) Got rid of some unneeded rehashing. Added a new routine, in_pcbinshash(),
to do the initial hash insertion.
3) Renamed in_pcblookuphash() to in_pcblookup_hash() for easier readability.
4) Added a new routine, in_pcbremlists() to remove the PCB from the various
hash lists.
5) Added/deleted comments where appropriate.
6) Removed unnecessary splnet() locking. In general, the PCB functions should
be called at splnet()...there are unfortunately a few exceptions, however.
7) Reorganized a few structs for better cache line behavior.
8) Killed my TCP_ACK_HACK kludge. It may come back in a different form in
the future, however.
These changes have been tested on wcarchive for more than a month. In tests
done here, connection establishment overhead is reduced by more than 50
times, thus getting rid of one of the major networking scalability problems.
Still to do: make tcp_fasttimo/tcp_slowtimo scale well for systems with a
large number of connections. tcp_fasttimo is easy; tcp_slowtimo is difficult.
WARNING: Anything that knows about inpcb and tcpcb structs will have to be
recompiled; at the very least, this includes netstat(1).
1998-01-27 09:15:13 +00:00
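The hashed port list idea above (find sockets bound to a local port by hashing the port instead of walking every PCB) can be sketched with a small chained hash table. The structure and function names below are hypothetical and much simpler than the kernel's; the table size is an arbitrary power of two chosen for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	PORTHASH_SIZE_SKETCH	64	/* must be a power of two */

/* Hypothetical PCB reduced to its local port and a hash chain link. */
struct portpcb_sketch {
	uint16_t		 lport;
	struct portpcb_sketch	*next;
};

struct porthash_sketch {
	struct portpcb_sketch	*bucket[PORTHASH_SIZE_SKETCH];
};

static unsigned
port_bucket_sketch(uint16_t lport)
{
	return (lport & (PORTHASH_SIZE_SKETCH - 1));
}

/* Insert at the head of the chain for the port's bucket. */
void
porthash_insert_sketch(struct porthash_sketch *h, struct portpcb_sketch *p)
{
	unsigned i;

	assert(h != NULL && p != NULL);
	i = port_bucket_sketch(p->lport);
	p->next = h->bucket[i];
	h->bucket[i] = p;
}

/* Walk only one chain, rather than the global list of all PCBs. */
struct portpcb_sketch *
porthash_lookup_local_sketch(const struct porthash_sketch *h, uint16_t lport)
{
	struct portpcb_sketch *p;

	assert(h != NULL);
	for (p = h->bucket[port_bucket_sketch(lport)]; p != NULL; p = p->next)
		if (p->lport == lport)
			return (p);
	return (NULL);
}
```

With a reasonably sized table, the local port check during bind or connect inspects a handful of entries rather than every PCB, which is the kind of saving the commit message attributes to this change.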
|
|
|
* Insert PCB onto various hash lists.
|
1995-04-10 08:52:45 +00:00
|
|
|
*/
|
2011-06-06 12:55:02 +00:00
|
|
|
static int
|
2020-01-12 17:52:32 +00:00
|
|
|
in_pcbinshash_internal(struct inpcb *inp, struct mbuf *m)
|
1995-04-09 01:29:31 +00:00
|
|
|
{
|
1998-01-27 09:15:13 +00:00
|
|
|
struct inpcbhead *pcbhash;
|
|
|
|
struct inpcbporthead *pcbporthash;
|
|
|
|
struct inpcbinfo *pcbinfo = inp->inp_pcbinfo;
|
|
|
|
struct inpcbport *phd;
|
1999-12-07 17:39:16 +00:00
|
|
|
u_int32_t hashkey_faddr;
|
2018-06-06 15:45:57 +00:00
|
|
|
int so_options;
|
1995-04-09 01:29:31 +00:00
|
|
|
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK_ASSERT(inp);
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags,
supplementing the existing INPLOOKUP_WILDCARD flag, are:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
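The rule stated in the commit message above, that callers must pass exactly one of the two lock flags, can be sketched in userspace C as follows. This is only an illustration: the `SK_` names, the flag values, and `sk_lookup_flags_valid()` are hypothetical and are not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical flag values, chosen for this sketch only. The kernel's
 * INPLOOKUP_* constants are not reproduced here.
 */
#define SK_LOOKUP_WILDCARD 0x1
#define SK_LOOKUP_RLOCKPCB 0x2
#define SK_LOOKUP_WLOCKPCB 0x4

/*
 * Return true when exactly one of the two lock flags is set.
 * Other flags, such as the wildcard flag, are ignored by the check.
 */
static bool
sk_lookup_flags_valid(int flags)
{
	int lock = flags & (SK_LOOKUP_RLOCKPCB | SK_LOOKUP_WLOCKPCB);

	/* Nonzero with a single bit set means exactly one lock flag. */
	return (lock != 0 && (lock & (lock - 1)) == 0);
}
```

A lookup routine written in this style could assert `sk_lookup_flags_valid(flags)` on entry, so that a caller passing neither or both lock flags is caught early.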
|
|
|
INP_HASH_WLOCK_ASSERT(pcbinfo);
|
|
|
|
|
2009-03-11 00:29:22 +00:00
|
|
|
KASSERT((inp->inp_flags & INP_INHASHLIST) == 0,
|
|
|
|
("in_pcbinshash: INP_INHASHLIST"));
|
2006-04-22 19:15:20 +00:00
|
|
|
|
1999-12-07 17:39:16 +00:00
|
|
|
#ifdef INET6
|
|
|
|
if (inp->inp_vflag & INP_IPV6)
|
2014-09-10 12:35:42 +00:00
|
|
|
hashkey_faddr = INP6_PCBHASHKEY(&inp->in6p_faddr);
|
1999-12-07 17:39:16 +00:00
|
|
|
else
|
2012-01-22 02:13:19 +00:00
|
|
|
#endif
|
1999-12-07 17:39:16 +00:00
|
|
|
hashkey_faddr = inp->inp_faddr.s_addr;
|
|
|
|
|
2007-04-30 23:12:05 +00:00
|
|
|
pcbhash = &pcbinfo->ipi_hashbase[INP_PCBHASH(hashkey_faddr,
|
|
|
|
inp->inp_lport, inp->inp_fport, pcbinfo->ipi_hashmask)];
|
1995-04-09 01:29:31 +00:00
|
|
|
|
2007-04-30 23:12:05 +00:00
|
|
|
pcbporthash = &pcbinfo->ipi_porthashbase[
|
|
|
|
INP_PCBPORTHASH(inp->inp_lport, pcbinfo->ipi_porthashmask)];
|
1998-01-27 09:15:13 +00:00
|
|
|
|
2018-06-06 15:45:57 +00:00
|
|
|
/*
|
|
|
|
* Add entry to load balance group.
|
|
|
|
* Only do this if SO_REUSEPORT_LB is set.
|
|
|
|
*/
|
|
|
|
so_options = inp_so_options(inp);
|
|
|
|
if (so_options & SO_REUSEPORT_LB) {
|
|
|
|
int ret = in_pcbinslbgrouphash(inp);
|
|
|
|
if (ret) {
|
|
|
|
/* pcb lb group malloc fail (ret=ENOBUFS). */
|
|
|
|
return (ret);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
1998-01-27 09:15:13 +00:00
|
|
|
/*
|
|
|
|
* Go through port list and look for a head for this lport.
|
|
|
|
*/
|
2018-06-12 22:18:20 +00:00
|
|
|
CK_LIST_FOREACH(phd, pcbporthash, phd_hash) {
|
1998-01-27 09:15:13 +00:00
|
|
|
if (phd->phd_port == inp->inp_lport)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
/*
|
|
|
|
* If none exists, malloc one and tack it on.
|
|
|
|
*/
|
|
|
|
if (phd == NULL) {
|
2008-10-23 15:53:51 +00:00
|
|
|
phd = malloc(sizeof(struct inpcbport), M_PCB, M_NOWAIT);
|
1998-01-27 09:15:13 +00:00
|
|
|
if (phd == NULL) {
|
|
|
|
return (ENOBUFS); /* XXX */
|
|
|
|
}
|
2018-06-12 22:18:27 +00:00
|
|
|
bzero(&phd->phd_epoch_ctx, sizeof(struct epoch_context));
|
1998-01-27 09:15:13 +00:00
|
|
|
phd->phd_port = inp->inp_lport;
|
2018-06-12 22:18:20 +00:00
|
|
|
CK_LIST_INIT(&phd->phd_pcblist);
|
|
|
|
CK_LIST_INSERT_HEAD(pcbporthash, phd, phd_hash);
|
1998-01-27 09:15:13 +00:00
|
|
|
}
|
|
|
|
inp->inp_phd = phd;
|
2018-06-12 22:18:20 +00:00
|
|
|
CK_LIST_INSERT_HEAD(&phd->phd_pcblist, inp, inp_portlist);
|
|
|
|
CK_LIST_INSERT_HEAD(pcbhash, inp, inp_hash);
|
2009-03-11 00:29:22 +00:00
|
|
|
inp->inp_flags |= INP_INHASHLIST;
|
Implement a CPU-affine TCP and UDP connection lookup data structure,
struct inpcbgroup. pcbgroups, or "connection groups", supplement the
existing inpcbinfo connection hash table, which when pcbgroups are
enabled, might now be thought of more usefully as a per-protocol
4-tuple reservation table.
Connections are assigned to connection groups based on a hash of their
4-tuple; wildcard sockets require special handling, and are members
of all connection groups. During a connection lookup, a
per-connection group lock is employed rather than the global pcbinfo
lock. By aligning connection groups with input path processing,
connection groups take on an effective CPU affinity, especially when
aligned with RSS work placement (see a forthcoming commit for
details). This eliminates cache line migration associated with
global, protocol-layer data structures in steady state TCP and UDP
processing (with the exception of protocol-layer statistics; further
commit to follow).
Elements of this approach were inspired by Willman, Rixner, and Cox's
2006 USENIX paper, "An Evaluation of Network Stack Parallelization
Strategies in Modern Operating Systems". However, there are also
significant differences: we maintain the inpcb lock, rather than using
the connection group lock for per-connection state.
Likewise, the focus of this implementation is alignment with NIC
packet distribution strategies such as RSS, rather than pure software
strategies. Despite that focus, software distribution is supported
through the parallel netisr implementation, and works well in
configurations where the number of hardware threads is greater than
the number of NIC input queues, such as in the RMI XLR threaded MIPS
architecture.
Another important difference is the continued maintenance of existing
hash tables as "reservation tables" -- these are useful both to
distinguish the resource allocation aspect of protocol name management
and the more common-case lookup aspect. In configurations where
connection tables are aligned with hardware hashes, it is desirable to
use the traditional lookup tables for loopback or encapsulated traffic
rather than take the expense of hardware hashes that are hard to
implement efficiently in software (such as RSS Toeplitz).
Connection group support is enabled by compiling "options PCBGROUP"
into your kernel configuration; for the time being, this is an
experimental feature, and hence is not enabled by default.
Subject to the limited MFCability of change dependencies in inpcb,
and its change to the inpcbinfo init function signature, this change
in principle could be merged to FreeBSD 8.x.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-06-06 12:55:02 +00:00
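The group assignment described in the commit message above (a hash of the 4-tuple selects one group, while wildcard sockets belong to every group) might look roughly like this in userspace C. Everything here is an assumption made for illustration: the `sk_` names, the group count, the mixing function (a stand-in, not RSS Toeplitz), and the choice to treat a zero foreign address as "wildcard".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SK_NGROUPS	4	/* hypothetical number of connection groups */
#define SK_ALL_GROUPS	(-1)	/* marker: member of every group */

/* Hypothetical 4-tuple; fields are plain integers for simplicity. */
struct sk_tuple {
	uint32_t laddr, faddr;
	uint16_t lport, fport;
};

/* Simple integer mixer standing in for a real hash. */
static uint32_t
sk_tuple_hash(const struct sk_tuple *t)
{
	uint32_t h = t->laddr ^ t->faddr;

	h ^= ((uint32_t)t->lport << 16) | t->fport;
	h ^= h >> 16;
	h *= 0x45d9f3bU;
	h ^= h >> 16;
	return (h);
}

/* Assumption for this sketch: no foreign address means wildcard. */
static bool
sk_tuple_is_wildcard(const struct sk_tuple *t)
{
	return (t->faddr == 0);
}

/* Pick a group index, or SK_ALL_GROUPS for wildcard sockets. */
static int
sk_group_for(const struct sk_tuple *t)
{
	if (sk_tuple_is_wildcard(t))
		return (SK_ALL_GROUPS);
	return ((int)(sk_tuple_hash(t) % SK_NGROUPS));
}
```

Because the group is a pure function of the 4-tuple, every packet of a given connection maps to the same group, which is what allows a per-group lock to replace a global one during lookup.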
|
|
|
#ifdef PCBGROUP
|
2020-01-12 17:52:32 +00:00
|
|
|
if (m != NULL) {
|
|
|
|
in_pcbgroup_update_mbuf(inp, m);
|
|
|
|
} else {
|
2011-06-06 12:55:02 +00:00
|
|
|
in_pcbgroup_update(inp);
|
2020-01-12 17:52:32 +00:00
|
|
|
}
|
2011-06-06 12:55:02 +00:00
|
|
|
#endif
|
1998-01-27 09:15:13 +00:00
|
|
|
return (0);
|
1995-04-09 01:29:31 +00:00
|
|
|
}
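The port-list part of the function above (walk the port hash bucket looking for a head for this local port, allocate one if none exists, then link the PCB onto it) can be sketched in userspace C. This is a simplified model, not the kernel code: it uses plain singly linked lists instead of `CK_LIST`, `calloc()` instead of the kernel `malloc()`, takes no locks, and all `sk_` names and the table size are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SK_PORTHASH_SIZE 16	/* hypothetical; must be a power of two */

struct sk_porthead;

/* Hypothetical, minimal stand-in for a protocol control block. */
struct sk_pcb {
	uint16_t lport;
	struct sk_pcb *next_on_port;	/* next PCB bound to the same port */
	struct sk_porthead *phd;	/* back pointer to the port head */
};

/* One head per distinct local port within a hash bucket. */
struct sk_porthead {
	uint16_t port;
	struct sk_pcb *pcbs;
	struct sk_porthead *next;	/* next head in the same bucket */
};

static struct sk_porthead *sk_porthash[SK_PORTHASH_SIZE];

/* Walk the bucket for lport; return its head, or NULL if none exists. */
static struct sk_porthead *
sk_port_lookup(uint16_t lport)
{
	struct sk_porthead *phd;

	for (phd = sk_porthash[lport & (SK_PORTHASH_SIZE - 1)]; phd != NULL;
	    phd = phd->next)
		if (phd->port == lport)
			break;
	return (phd);
}

/* Insert pcb under its port head, creating the head on first use. */
static int
sk_port_insert(struct sk_pcb *pcb)
{
	struct sk_porthead **bucket, *phd;

	assert(pcb->phd == NULL);	/* must not already be inserted */
	bucket = &sk_porthash[pcb->lport & (SK_PORTHASH_SIZE - 1)];
	phd = sk_port_lookup(pcb->lport);
	if (phd == NULL) {
		phd = calloc(1, sizeof(*phd));
		if (phd == NULL)
			return (-1);	/* allocation failure */
		phd->port = pcb->lport;
		phd->next = *bucket;
		*bucket = phd;
	}
	pcb->phd = phd;
	pcb->next_on_port = phd->pcbs;
	phd->pcbs = pcb;
	return (0);
}
```

Two PCBs bound to the same port end up sharing one head, while a different port that falls into the same bucket gets its own head, so a local port check only has to scan the PCBs under a single head.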
|
|
|
|
|
2011-06-06 12:55:02 +00:00
|
|
|
int
|
|
|
|
in_pcbinshash(struct inpcb *inp)
|
|
|
|
{
|
|
|
|
|
2020-01-12 17:52:32 +00:00
|
|
|
return (in_pcbinshash_internal(inp, NULL));
|
2011-06-06 12:55:02 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
int
|
2020-01-12 17:52:32 +00:00
|
|
|
in_pcbinshash_mbuf(struct inpcb *inp, struct mbuf *m)
|
2011-06-06 12:55:02 +00:00
|
|
|
{
|
|
|
|
|
2020-01-12 17:52:32 +00:00
|
|
|
return (in_pcbinshash_internal(inp, m));
|
2011-06-06 12:55:02 +00:00
|
|
|
}
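As a rough illustration of the connection-group assignment described in the PCBGROUP commit message, the sketch below maps a fully specified 4-tuple to a group index by hashing it and reducing modulo the number of groups. The names (struct four_tuple, toy_hash, pcbgroup_index) and the mixing function are hypothetical stand-ins, not the kernel's in_pcbgroup code; a real RSS-aligned configuration would presumably reuse the hash supplied by the NIC instead of this toy hash.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 4-tuple; field names are illustrative only. */
struct four_tuple {
	uint32_t laddr, faddr;
	uint16_t lport, fport;
};

/* Toy mixing hash; this is NOT Toeplitz and not the kernel's hash. */
static uint32_t
toy_hash(const struct four_tuple *t)
{
	uint32_t h = t->laddr ^ t->faddr;

	h ^= ((uint32_t)t->lport << 16) | t->fport;
	h ^= h >> 16;
	h *= 0x45d9f3bU;
	h ^= h >> 16;
	return (h);
}

/* Pick the connection group for a fully specified 4-tuple. */
static unsigned
pcbgroup_index(const struct four_tuple *t, unsigned ngroups)
{

	assert(ngroups > 0);
	return (toy_hash(t) % ngroups);
}
```

Wildcard sockets have no complete 4-tuple, so under this model they would be linked into every group rather than indexed through pcbgroup_index().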
|
|
|
|
|
Improved connection establishment performance by doing local port lookups via
a hashed port list. In the new scheme, in_pcblookup() goes away and is
replaced by a new routine, in_pcblookup_local() for doing the local port
check. Note that this implementation is space inefficient in that the PCB
struct is now too large to fit into 128 bytes. I might deal with this in the
future by using the new zone allocator, but I wanted these changes to be
extensively tested in their current form first.
Also:
1) Fixed off-by-one errors in the port lookup loops in in_pcbbind().
2) Got rid of some unneeded rehashing. Added a new routine, in_pcbinshash(),
to do the initial hash insertion.
3) Renamed in_pcblookuphash() to in_pcblookup_hash() for easier readability.
4) Added a new routine, in_pcbremlists() to remove the PCB from the various
hash lists.
5) Added/deleted comments where appropriate.
6) Removed unnecessary splnet() locking. In general, the PCB functions should
be called at splnet()...there are unfortunately a few exceptions, however.
7) Reorganized a few structs for better cache line behavior.
8) Killed my TCP_ACK_HACK kludge. It may come back in a different form in
the future, however.
These changes have been tested on wcarchive for more than a month. In tests
done here, connection establishment overhead is reduced by more than a
factor of 50, thus getting rid of one of the major networking scalability
problems.
Still to do: make tcp_fasttimo/tcp_slowtimo scale well for systems with a
large number of connections. tcp_fasttimo is easy; tcp_slowtimo is difficult.
WARNING: Anything that knows about inpcb and tcpcb structs will have to be
recompiled; at the very least, this includes netstat(1).
1998-01-27 09:15:13 +00:00
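The hashed port list mentioned in the commit message above can be pictured with a minimal user-space sketch: local ports hash into a small bucket array, and a local-port check only walks the one bucket the port maps to. All names here (port_head, port_table, port_lookup, PORT_HASH_SIZE) are assumptions made for illustration and do not reproduce the kernel's struct inpcbport or in_pcblookup_local().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PORT_HASH_SIZE 64	/* assumed power of two */

/* One entry per bound local port (hypothetical layout). */
struct port_head {
	uint16_t port;
	struct port_head *next;	/* next entry in the same bucket */
};

struct port_table {
	struct port_head *bucket[PORT_HASH_SIZE];
};

static unsigned
port_bucket(uint16_t port)
{

	return (port & (PORT_HASH_SIZE - 1));
}

/* Link a port entry at the head of its bucket. */
static void
port_insert(struct port_table *pt, struct port_head *ph)
{
	unsigned b;

	assert(ph != NULL);
	b = port_bucket(ph->port);
	ph->next = pt->bucket[b];
	pt->bucket[b] = ph;
}

/* Return the entry for a local port, or NULL if the port is unused. */
static struct port_head *
port_lookup(const struct port_table *pt, uint16_t port)
{
	struct port_head *ph;

	for (ph = pt->bucket[port_bucket(port)]; ph != NULL; ph = ph->next)
		if (ph->port == port)
			return (ph);
	return (NULL);
}
```

The point of the design is that a bind-time port check touches only the entries sharing one bucket instead of every PCB in the protocol.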
|
|
|
/*
 * Move PCB to the proper hash bucket when { faddr, fport } have been
 * changed. NOTE: This does not handle the case of the lport changing (the
 * hashed port list would have to be updated as well), so the lport must
 * not change after in_pcbinshash() has been called.
 */
|
1995-04-09 01:29:31 +00:00
|
|
|
void
|
Add _mbuf() variants of various inpcb-related interfaces, including lookup,
hash install, etc. For now, these arguments are unused, but as we add
RSS support, we will want to use hashes extracted from mbufs, rather than
manually calculated hashes of header fields, due to the expense of the
software version of Toeplitz (and similar hashes).
Add notes that it would be nice to be able to pass mbufs into lookup
routines in pf(4), optimising firewall lookup in the same way, but the
code structure there doesn't facilitate that currently.
(In principle there is no reason this couldn't be MFCed -- the change
extends rather than modifies the KBI. However, it won't be useful without
other previous possibly less MFCable changes.)
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-06-04 16:33:06 +00:00
|
|
|
in_pcbrehash_mbuf(struct inpcb *inp, struct mbuf *m)
|
1995-04-09 01:29:31 +00:00
|
|
|
{
|
2003-11-08 23:02:36 +00:00
|
|
|
struct inpcbinfo *pcbinfo = inp->inp_pcbinfo;
|
1995-04-09 01:29:31 +00:00
|
|
|
struct inpcbhead *head;
|
1999-12-07 17:39:16 +00:00
|
|
|
u_int32_t hashkey_faddr;
|
|
|
|
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK_ASSERT(inp);
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. The new lookup flags,
which supplement the existing INPLOOKUP_WILDCARD flag, are:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
|
|
|
INP_HASH_WLOCK_ASSERT(pcbinfo);
|
|
|
|
|
2009-03-11 00:29:22 +00:00
|
|
|
	KASSERT(inp->inp_flags & INP_INHASHLIST,
	    ("in_pcbrehash: !INP_INHASHLIST"));
|
2006-04-22 19:15:20 +00:00
|
|
|
|
1999-12-07 17:39:16 +00:00
|
|
|
#ifdef INET6
|
|
|
|
if (inp->inp_vflag & INP_IPV6)
|
2014-09-10 12:35:42 +00:00
|
|
|
hashkey_faddr = INP6_PCBHASHKEY(&inp->in6p_faddr);
|
1999-12-07 17:39:16 +00:00
|
|
|
else
|
2012-01-22 02:13:19 +00:00
|
|
|
#endif
|
1999-12-07 17:39:16 +00:00
|
|
|
hashkey_faddr = inp->inp_faddr.s_addr;
|
1995-04-09 01:29:31 +00:00
|
|
|
|
2007-04-30 23:12:05 +00:00
|
|
|
	head = &pcbinfo->ipi_hashbase[INP_PCBHASH(hashkey_faddr,
		inp->inp_lport, inp->inp_fport, pcbinfo->ipi_hashmask)];
|
1995-04-09 01:29:31 +00:00
|
|
|
|
2018-06-12 22:18:20 +00:00
|
|
|
	CK_LIST_REMOVE(inp, inp_hash);
	CK_LIST_INSERT_HEAD(head, inp, inp_hash);
|
2011-06-06 12:55:02 +00:00
|
|
|
|
|
|
|
#ifdef PCBGROUP
	if (m != NULL)
		in_pcbgroup_update_mbuf(inp, m);
	else
		in_pcbgroup_update(inp);
#endif
|
1998-01-27 09:15:13 +00:00
|
|
|
}
|
|
|
|
|
2011-06-04 16:33:06 +00:00
|
|
|
void
in_pcbrehash(struct inpcb *inp)
{

	in_pcbrehash_mbuf(inp, NULL);
}
|
|
|
|
|
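The rehash step performed by in_pcbrehash_mbuf() (unlink the PCB from its current bucket, recompute the bucket from the updated foreign address and port, and insert at the head of the new bucket) can be sketched in user space as follows. This is a simplified model under stated assumptions: struct pcb, pcb_bucket() and the XOR-based bucket function are hypothetical, the list is a hand-rolled equivalent of the queue(3) LIST macros rather than CK_LIST, and no locking is modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 8	/* assumed power of two */

/* Hypothetical, much-reduced PCB. */
struct pcb {
	uint32_t faddr;
	uint16_t lport, fport;
	int in_hash;		/* stands in for INP_INHASHLIST */
	struct pcb *next;
	struct pcb **prevp;	/* address of the pointer that points at us */
};

struct pcb_table {
	struct pcb *bucket[NBUCKETS];
};

/* Toy bucket function; not the kernel's INP_PCBHASH. */
static unsigned
pcb_bucket(const struct pcb *p)
{

	return ((p->faddr ^ p->lport ^ p->fport) & (NBUCKETS - 1));
}

static void
pcb_insert_head(struct pcb_table *t, struct pcb *p)
{
	struct pcb **head = &t->bucket[pcb_bucket(p)];

	p->next = *head;
	if (*head != NULL)
		(*head)->prevp = &p->next;
	*head = p;
	p->prevp = head;
	p->in_hash = 1;
}

/* Unlink without knowing the old bucket, as LIST_REMOVE does. */
static void
pcb_remove(struct pcb *p)
{

	if (p->next != NULL)
		p->next->prevp = p->prevp;
	*p->prevp = p->next;
}

/* Call after faddr/fport changed; lport must stay the same. */
static void
pcb_rehash(struct pcb_table *t, struct pcb *p)
{

	assert(p->in_hash);
	pcb_remove(p);
	pcb_insert_head(t, p);
}
```

Because removal goes through the back pointer, the caller may change faddr and fport before calling pcb_rehash() without first remembering the old bucket, which mirrors how the kernel routine is used.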
1998-01-27 09:15:13 +00:00
|
|
|
/*
 * Remove PCB from various lists.
 */
|
2009-05-14 20:59:36 +00:00
|
|
|
static void
|
2006-01-22 01:16:25 +00:00
|
|
|
in_pcbremlists(struct inpcb *inp)
|
1998-01-27 09:15:13 +00:00
|
|
|
{
|
2003-11-08 23:02:36 +00:00
|
|
|
struct inpcbinfo *pcbinfo = inp->inp_pcbinfo;
|
|
|
|
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK_ASSERT(inp);
|
2015-08-03 12:13:54 +00:00
|
|
|
INP_LIST_WLOCK_ASSERT(pcbinfo);
|
2003-11-08 23:02:36 +00:00
|
|
|
|
|
|
|
inp->inp_gencnt = ++pcbinfo->ipi_gencnt;
|
2009-03-11 00:29:22 +00:00
|
|
|
if (inp->inp_flags & INP_INHASHLIST) {
|
1998-01-27 09:15:13 +00:00
|
|
|
struct inpcbport *phd = inp->inp_phd;
|
|
|
|
|
2011-05-30 09:43:55 +00:00
|
|
|
INP_HASH_WLOCK(pcbinfo);
|
2018-06-06 15:45:57 +00:00
|
|
|
|
|
|
|
		/* XXX: Only do if SO_REUSEPORT_LB set? */
		in_pcbremlbgrouphash(inp);
|
|
|
|
|
2018-06-12 22:18:20 +00:00
|
|
|
		CK_LIST_REMOVE(inp, inp_hash);
		CK_LIST_REMOVE(inp, inp_portlist);
		if (CK_LIST_FIRST(&phd->phd_pcblist) == NULL) {
			CK_LIST_REMOVE(phd, phd_hash);
|
2020-01-15 06:05:20 +00:00
|
|
|
NET_EPOCH_CALL(inpcbport_free, &phd->phd_epoch_ctx);
|
Improved connection establishment performance by doing local port lookups via
a hashed port list. In the new scheme, in_pcblookup() goes away and is
replaced by a new routine, in_pcblookup_local() for doing the local port
check. Note that this implementation is space inefficient in that the PCB
struct is now too large to fit into 128 bytes. I might deal with this in the
future by using the new zone allocator, but I wanted these changes to be
extensively tested in their current form first.
Also:
1) Fixed off-by-one errors in the port lookup loops in in_pcbbind().
2) Got rid of some unneeded rehashing. Adding a new routine, in_pcbinshash()
to do the initialial hash insertion.
3) Renamed in_pcblookuphash() to in_pcblookup_hash() for easier readability.
4) Added a new routine, in_pcbremlists() to remove the PCB from the various
hash lists.
5) Added/deleted comments where appropriate.
6) Removed unnecessary splnet() locking. In general, the PCB functions should
be called at splnet()...there are unfortunately a few exceptions, however.
7) Reorganized a few structs for better cache line behavior.
8) Killed my TCP_ACK_HACK kludge. It may come back in a different form in
the future, however.
These changes have been tested on wcarchive for more than a month. In tests
done here, connection establishment overhead is reduced by more than 50
times, thus getting rid of one of the major networking scalability problems.
Still to do: make tcp_fasttimo/tcp_slowtimo scale well for systems with a
large number of connections. tcp_fasttimo is easy; tcp_slowtimo is difficult.
WARNING: Anything that knows about inpcb and tcpcb structs will have to be
recompiled; at the very least, this includes netstat(1).
1998-01-27 09:15:13 +00:00
|
|
|
}
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
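The commit message above says callers must pass exactly one of INPLOOKUP_RLOCKPCB and INPLOOKUP_WLOCKPCB. A minimal userland sketch of such an "exactly one bit set" check follows. The LOOKUP_* names, their bit values, and lookup_flags_valid() are hypothetical stand-ins chosen for illustration, not the kernel's definitions.

```c
/* Hypothetical bit values; the real definitions live in the kernel headers. */
#define	LOOKUP_WILDCARD	0x01
#define	LOOKUP_RLOCKPCB	0x02
#define	LOOKUP_WLOCKPCB	0x04
#define	LOOKUP_LOCKMASK	(LOOKUP_RLOCKPCB | LOOKUP_WLOCKPCB)

/*
 * Return 1 when exactly one of the two lock flags is set, 0 otherwise.
 * Other bits, such as the wildcard flag, are ignored by the check.
 */
int
lookup_flags_valid(int flags)
{
	int lock = flags & LOOKUP_LOCKMASK;

	/* A value with a single bit set satisfies (v & (v - 1)) == 0. */
	return (lock != 0 && (lock & (lock - 1)) == 0);
}
```

A lookup routine could presumably assert this condition on entry, so that a caller passing neither or both lock flags is caught early rather than returning an inpcb in an unexpected lock state.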
|
|
|
INP_HASH_WUNLOCK(pcbinfo);
|
2009-03-11 00:29:22 +00:00
|
|
|
inp->inp_flags &= ~INP_INHASHLIST;
|
1998-01-27 09:15:13 +00:00
|
|
|
}
|
2018-06-12 22:18:20 +00:00
|
|
|
CK_LIST_REMOVE(inp, inp_list);
|
2003-11-08 23:02:36 +00:00
|
|
|
pcbinfo->ipi_count--;
|
Implement a CPU-affine TCP and UDP connection lookup data structure,
struct inpcbgroup. pcbgroups, or "connection groups", supplement the
existing inpcbinfo connection hash table, which when pcbgroups are
enabled, might now be thought of more usefully as a per-protocol
4-tuple reservation table.
Connections are assigned to connection groups based on a hash of their
4-tuple; wildcard sockets require special handling, and are members
of all connection groups. During a connection lookup, a
per-connection group lock is employed rather than the global pcbinfo
lock. By aligning connection groups with input path processing,
connection groups take on an effective CPU affinity, especially when
aligned with RSS work placement (see a forthcoming commit for
details). This eliminates cache line migration associated with
global, protocol-layer data structures in steady state TCP and UDP
processing (with the exception of protocol-layer statistics; further
commit to follow).
Elements of this approach were inspired by Willman, Rixner, and Cox's
2006 USENIX paper, "An Evaluation of Network Stack Parallelization
Strategies in Modern Operating Systems". However, there are also
significant differences: we maintain the inpcb lock, rather than using
the connection group lock for per-connection state.
Likewise, the focus of this implementation is alignment with NIC
packet distribution strategies such as RSS, rather than pure software
strategies. Despite that focus, software distribution is supported
through the parallel netisr implementation, and works well in
configurations where the number of hardware threads is greater than
the number of NIC input queues, such as in the RMI XLR threaded MIPS
architecture.
Another important difference is the continued maintenance of existing
hash tables as "reservation tables" -- these are useful both to
distinguish the resource allocation aspect of protocol name management
and the more common-case lookup aspect. In configurations where
connection tables are aligned with hardware hashes, it is desirable to
use the traditional lookup tables for loopback or encapsulated traffic
rather than take the expense of hardware hashes that are hard to
implement efficiently in software (such as RSS Toeplitz).
Connection group support is enabled by compiling "options PCBGROUP"
into your kernel configuration; for the time being, this is an
experimental feature, and hence is not enabled by default.
Subject to the limited MFCability of change dependencies in inpcb,
and its change to the inpcbinfo init function signature, this change
in principle could be merged to FreeBSD 8.x.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-06-06 12:55:02 +00:00
|
|
|
#ifdef PCBGROUP
|
|
|
|
in_pcbgroup_remove(inp);
|
|
|
|
#endif
|
1995-04-09 01:29:31 +00:00
|
|
|
}
|
This implements the mumbled-about "Jail" feature.
This is a seriously beefed up chroot kind of thing. The process
is jailed along the same lines as a chroot does it, but with
additional tough restrictions imposed on what the superuser can do.
For all I know, it is safe to hand over the root bit inside a
prison to the customer living in that prison, this is what
it was developed for in fact: "real virtual servers".
Each prison has an ip number associated with it, which all IP
communications will be coerced to use and each prison has its own
hostname.
Needless to say, you need more RAM this way, but the advantage is
that each customer can run their own particular version of apache
and not stomp on the toes of their neighbors.
It generally does what one would expect, but setting up a jail
still takes a little knowledge.
A few notes:
I have no scripts for setting up a jail, don't ask me for them.
The IP number should be an alias on one of the interfaces.
mount a /proc in each jail, it will make ps more useable.
/proc/<pid>/status tells the hostname of the prison for
jailed processes.
Quotas are only sensible if you have a mountpoint per prison.
There are no provisions for stopping resource-hogging.
Some "#ifdef INET" and similar may be missing (send patches!)
If somebody wants to take it from here and develop it into
more of a "virtual machine" they should be most welcome!
Tools, comments, patches & documentation most welcome.
Have fun...
Sponsored by: http://www.rndassociates.com/
Run for almost a year by: http://www.servetheweb.com/
1999-04-28 11:38:52 +00:00
|
|
|
|
2016-03-24 07:54:56 +00:00
|
|
|
/*
|
|
|
|
* Check for alternatives when higher level complains
|
|
|
|
* about service problems. For now, invalidate cached
|
|
|
|
* routing information. If the route was created dynamically
|
|
|
|
* (by a redirect), time to try a default gateway again.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
in_losing(struct inpcb *inp)
|
|
|
|
{
|
|
|
|
|
2018-01-23 03:15:39 +00:00
|
|
|
RO_INVALIDATE_CACHE(&inp->inp_route);
|
2016-03-24 07:54:56 +00:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
Introduce a MAC label reference in 'struct inpcb', which caches
the MAC label referenced from 'struct socket' in the IPv4 and
IPv6-based protocols. This permits MAC labels to be checked during
network delivery operations without dereferencing inp->inp_socket
to get to so->so_label, which will eventually avoid our having to
grab the socket lock during delivery at the network layer.
This change introduces 'struct inpcb' as a labeled object to the
MAC Framework, along with the normal circus of entry points:
initialization, creation from socket, destruction, as well as a
delivery access control check.
For most policies, the inpcb label will simply be a cache of the
socket label, so a new protocol switch method is introduced,
pr_sosetlabel() to notify protocols that the socket layer label
has been updated so that the cache can be updated while holding
appropriate locks. Most protocols implement this using
pru_sosetlabel_null(), but IPv4/IPv6 protocols using inpcbs use
the worker function in_pcbsosetlabel(), which calls into the
MAC Framework to perform a cache update.
Biba, LOMAC, and MLS implement these entry points, as do the stub
policy, and test policy.
Reviewed by: sam, bms
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
2003-11-18 00:39:07 +00:00
|
|
|
/*
|
|
|
|
* A set label operation has occurred at the socket layer, propagate the
|
|
|
|
* label change into the in_pcb for the socket.
|
|
|
|
*/
|
|
|
|
void
|
2006-01-22 01:16:25 +00:00
|
|
|
in_pcbsosetlabel(struct socket *so)
|
2003-11-18 00:39:07 +00:00
|
|
|
{
|
|
|
|
#ifdef MAC
|
|
|
|
struct inpcb *inp;
|
|
|
|
|
2006-04-01 16:04:42 +00:00
|
|
|
inp = sotoinpcb(so);
|
|
|
|
KASSERT(inp != NULL, ("in_pcbsosetlabel: so->so_pcb == NULL"));
|
2006-04-22 19:15:20 +00:00
|
|
|
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK(inp);
|
2004-06-13 02:50:07 +00:00
|
|
|
SOCK_LOCK(so);
|
2003-11-18 00:39:07 +00:00
|
|
|
mac_inpcb_sosetlabel(so, inp);
|
2004-06-13 02:50:07 +00:00
|
|
|
SOCK_UNLOCK(so);
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2003-11-18 00:39:07 +00:00
|
|
|
#endif
|
|
|
|
}
|
2005-01-02 01:50:57 +00:00
|
|
|
|
|
|
|
/*
|
2006-06-02 08:18:27 +00:00
|
|
|
* ipport_tick runs once per second, determining if random port allocation
|
|
|
|
* should be continued. If more than ipport_randomcps ports have been
|
|
|
|
* allocated in the last second, then we return to sequential port
|
|
|
|
* allocation. We return to random allocation only once we drop below
|
|
|
|
* ipport_randomcps for at least ipport_randomtime seconds.
|
2005-01-02 01:50:57 +00:00
|
|
|
*/
|
2011-04-20 08:00:29 +00:00
|
|
|
static void
|
2006-01-22 01:16:25 +00:00
|
|
|
ipport_tick(void *xtp)
|
2005-01-02 01:50:57 +00:00
|
|
|
{
|
Step 1.5 of importing the network stack virtualization infrastructure
from the vimage project, as per plan established at devsummit 08/08:
http://wiki.freebsd.org/Image/Notes200808DevSummit
Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator
macros, and CURVNET_SET() context setting macros, all currently
resolving to NOPs.
Prepare for virtualization of selected SYSCTL objects by introducing a
family of SYSCTL_V_*() macros, currently resolving to their global
counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT().
Move selected #defines from sys/sys/vimage.h to newly introduced header
files specific to virtualized subsystems (sys/net/vnet.h,
sys/netinet/vinet.h etc.).
All the changes are verified to have zero functional impact at this
point in time by doing MD5 comparison between pre- and post-change
object files(*).
(*) netipsec/keysock.c did not validate depending on compile time options.
Implemented by: julian, bz, brooks, zec
Reviewed by: julian, bz, brooks, kris, rwatson, ...
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation
2008-10-02 15:37:58 +00:00
|
|
|
VNET_ITERATOR_DECL(vnet_iter);
|
|
|
|
|
2009-07-19 14:20:53 +00:00
|
|
|
VNET_LIST_RLOCK_NOSLEEP();
|
2008-10-02 15:37:58 +00:00
|
|
|
VNET_FOREACH(vnet_iter) {
|
|
|
|
CURVNET_SET(vnet_iter); /* XXX appease INVARIANTS here */
|
|
|
|
if (V_ipport_tcpallocs <=
|
|
|
|
V_ipport_tcplastcount + V_ipport_randomcps) {
|
|
|
|
if (V_ipport_stoprandom > 0)
|
|
|
|
V_ipport_stoprandom--;
|
|
|
|
} else
|
|
|
|
V_ipport_stoprandom = V_ipport_randomtime;
|
|
|
|
V_ipport_tcplastcount = V_ipport_tcpallocs;
|
|
|
|
CURVNET_RESTORE();
|
|
|
|
}
|
2009-07-19 14:20:53 +00:00
|
|
|
VNET_LIST_RUNLOCK_NOSLEEP();
|
2005-01-02 01:50:57 +00:00
|
|
|
callout_reset(&ipport_tick_callout, hz, ipport_tick, NULL);
|
|
|
|
}
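The hysteresis that ipport_tick() applies can be tried outside the kernel. The sketch below mirrors the body of the VNET_FOREACH loop on a plain structure. struct port_alloc_state and both functions are hypothetical userland stand-ins for the V_ipport_* variables, and treating stoprandom == 0 as "random allocation allowed" is an assumption about how the allocator consumes the counter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userland stand-in for the per-vnet V_ipport_* variables. */
struct port_alloc_state {
	int tcpallocs;		/* total ports allocated so far */
	int tcplastcount;	/* value of tcpallocs at the previous tick */
	int randomcps;		/* allocations per second tolerated in random mode */
	int randomtime;		/* quiet seconds needed before random mode resumes */
	int stoprandom;		/* > 0 while sequential allocation is in force */
};

/* One tick (nominally one second), following the logic of ipport_tick(). */
void
port_alloc_tick(struct port_alloc_state *st)
{
	assert(st != NULL);
	if (st->tcpallocs <= st->tcplastcount + st->randomcps) {
		/* Quiet second: count down towards random allocation. */
		if (st->stoprandom > 0)
			st->stoprandom--;
	} else {
		/* Busy second: restart the full quiet period. */
		st->stoprandom = st->randomtime;
	}
	st->tcplastcount = st->tcpallocs;
}

/* Assumption: random allocation is permitted only once the counter hits 0. */
int
port_alloc_is_random(const struct port_alloc_state *st)
{
	return (st->stoprandom == 0);
}
```

With randomcps = 10 and randomtime = 3, a single second with more than 10 allocations forces sequential allocation, and three consecutive quiet ticks are needed before random allocation resumes.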
|
2007-02-17 21:02:38 +00:00
|
|
|
|
2011-04-20 08:00:29 +00:00
|
|
|
static void
|
|
|
|
ip_fini(void *xtp)
|
|
|
|
{
|
|
|
|
|
|
|
|
callout_stop(&ipport_tick_callout);
|
|
|
|
}
|
|
|
|
|
2020-02-12 13:31:36 +00:00
|
|
|
/*
|
2011-04-20 08:00:29 +00:00
|
|
|
* The ipport_callout should start running at about the time we attach the
|
|
|
|
* inet or inet6 domains.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
ipport_tick_init(const void *unused __unused)
|
|
|
|
{
|
|
|
|
|
|
|
|
/* Start ipport_tick. */
|
2015-05-22 17:05:21 +00:00
|
|
|
callout_init(&ipport_tick_callout, 1);
|
2011-04-20 08:00:29 +00:00
|
|
|
callout_reset(&ipport_tick_callout, 1, ipport_tick, NULL);
|
|
|
|
EVENTHANDLER_REGISTER(shutdown_pre_sync, ip_fini, NULL,
|
|
|
|
SHUTDOWN_PRI_DEFAULT);
|
|
|
|
}
|
2020-02-12 13:31:36 +00:00
|
|
|
SYSINIT(ipport_tick_init, SI_SUB_PROTO_DOMAIN, SI_ORDER_MIDDLE,
|
2011-04-20 08:00:29 +00:00
|
|
|
ipport_tick_init, NULL);
|
|
|
|
|
2008-03-23 22:34:16 +00:00
|
|
|
void
|
|
|
|
inp_wlock(struct inpcb *inp)
|
|
|
|
{
|
|
|
|
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK(inp);
|
2008-03-23 22:34:16 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
inp_wunlock(struct inpcb *inp)
|
|
|
|
{
|
|
|
|
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2008-03-23 22:34:16 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
inp_rlock(struct inpcb *inp)
|
|
|
|
{
|
|
|
|
|
2008-04-19 14:34:38 +00:00
|
|
|
INP_RLOCK(inp);
|
2008-03-23 22:34:16 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
inp_runlock(struct inpcb *inp)
|
|
|
|
{
|
|
|
|
|
2008-04-19 14:34:38 +00:00
|
|
|
INP_RUNLOCK(inp);
|
2008-03-23 22:34:16 +00:00
|
|
|
}
|
|
|
|
|
2017-03-09 00:55:19 +00:00
|
|
|
#ifdef INVARIANT_SUPPORT
|
2008-03-23 22:34:16 +00:00
|
|
|
void
|
2008-03-24 20:24:04 +00:00
|
|
|
inp_lock_assert(struct inpcb *inp)
|
2008-03-23 22:34:16 +00:00
|
|
|
{
|
|
|
|
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK_ASSERT(inp);
|
2008-03-23 22:34:16 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
void
|
2008-03-24 20:24:04 +00:00
|
|
|
inp_unlock_assert(struct inpcb *inp)
|
2008-03-23 22:34:16 +00:00
|
|
|
{
|
|
|
|
|
|
|
|
INP_UNLOCK_ASSERT(inp);
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2008-07-21 00:08:34 +00:00
|
|
|
void
|
|
|
|
inp_apply_all(void (*func)(struct inpcb *, void *), void *arg)
|
|
|
|
{
|
|
|
|
struct inpcb *inp;
|
|
|
|
|
2015-08-03 12:13:54 +00:00
|
|
|
INP_INFO_WLOCK(&V_tcbinfo);
|
2018-06-12 22:18:20 +00:00
|
|
|
CK_LIST_FOREACH(inp, V_tcbinfo.ipi_listhead, inp_list) {
|
2008-07-21 00:08:34 +00:00
|
|
|
INP_WLOCK(inp);
|
|
|
|
func(inp, arg);
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
}
|
2015-08-03 12:13:54 +00:00
|
|
|
INP_INFO_WUNLOCK(&V_tcbinfo);
|
2008-07-21 00:08:34 +00:00
|
|
|
}
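inp_apply_all() is an instance of the "apply a callback to every element, passing an opaque argument" pattern. A minimal userland sketch of that pattern follows, with locking reduced to comments. struct conn, conn_apply_all() and conn_sum_id() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct inpcb; only the list linkage matters. */
struct conn {
	int		 id;
	struct conn	*next;
};

/* Visit every element, handing each one to func together with arg. */
void
conn_apply_all(struct conn *head, void (*func)(struct conn *, void *),
    void *arg)
{
	struct conn *c;

	assert(func != NULL);
	/* The kernel routine holds the global list lock around this loop. */
	for (c = head; c != NULL; c = c->next) {
		/* ...and write-locks each element around the callback. */
		func(c, arg);
	}
}

/* Example callback: accumulate the ids into the int that arg points to. */
void
conn_sum_id(struct conn *c, void *arg)
{
	*(int *)arg += c->id;
}
```

The void *arg parameter lets one iterator serve many callers: each callback casts it back to whatever context it expects, here an int accumulator.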
|
|
|
|
|
|
|
|
struct socket *
|
|
|
|
inp_inpcbtosocket(struct inpcb *inp)
|
|
|
|
{
|
|
|
|
|
|
|
|
INP_WLOCK_ASSERT(inp);
|
|
|
|
return (inp->inp_socket);
|
|
|
|
}
|
|
|
|
|
|
|
|
struct tcpcb *
|
|
|
|
inp_inpcbtotcpcb(struct inpcb *inp)
|
|
|
|
{
|
|
|
|
|
|
|
|
INP_WLOCK_ASSERT(inp);
|
|
|
|
return ((struct tcpcb *)inp->inp_ppcb);
|
|
|
|
}
|
|
|
|
|
|
|
|
int
|
|
|
|
inp_ip_tos_get(const struct inpcb *inp)
|
|
|
|
{
|
|
|
|
|
|
|
|
return (inp->inp_ip_tos);
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
inp_ip_tos_set(struct inpcb *inp, int val)
|
|
|
|
{
|
|
|
|
|
|
|
|
inp->inp_ip_tos = val;
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
2008-07-22 04:23:57 +00:00
|
|
|
inp_4tuple_get(struct inpcb *inp, uint32_t *laddr, uint16_t *lp,
|
2008-07-21 22:11:39 +00:00
|
|
|
uint32_t *faddr, uint16_t *fp)
|
2008-07-21 00:08:34 +00:00
|
|
|
{
|
|
|
|
|
2008-07-21 22:11:39 +00:00
|
|
|
INP_LOCK_ASSERT(inp);
|
2008-07-22 04:23:57 +00:00
|
|
|
*laddr = inp->inp_laddr.s_addr;
|
|
|
|
*faddr = inp->inp_faddr.s_addr;
|
2008-07-21 00:08:34 +00:00
|
|
|
*lp = inp->inp_lport;
|
|
|
|
*fp = inp->inp_fport;
|
|
|
|
}
|
|
|
|
|
2008-07-21 00:49:34 +00:00
|
|
|
struct inpcb *
|
|
|
|
so_sotoinpcb(struct socket *so)
|
|
|
|
{
|
|
|
|
|
|
|
|
return (sotoinpcb(so));
|
|
|
|
}
|
|
|
|
|
|
|
|
struct tcpcb *
|
|
|
|
so_sototcpcb(struct socket *so)
|
|
|
|
{
|
|
|
|
|
|
|
|
return (sototcpcb(so));
|
|
|
|
}
|
|
|
|
|
Hide struct inpcb, struct tcpcb from the userland.
This is a painful change, but it is needed. On the one hand, we avoid
modifying them, and this slows down some ideas; on the other hand, we still
eventually modify them, and tools like netstat(1) never work on the next
version of FreeBSD. We maintain a ton of spares in them, and we already got
some ifdef hell at the end of tcpcb.
Details:
- Hide struct inpcb, struct tcpcb under _KERNEL || _WANT_FOO.
- Make struct xinpcb, struct xtcpcb pure API structures, not including
kernel structures inpcb and tcpcb inside. Export into these structures
the fields from inpcb and tcpcb that are known to be used, and put there
a ton of spare space.
- Make kernel and userland utilities compilable after these changes.
- Bump __FreeBSD_version.
Reviewed by: rrs, gnn
Differential Revision: D10018
2017-03-21 06:39:49 +00:00
|
|
|
/*
|
|
|
|
* Create an external-format (``xinpcb'') structure using the information in
|
|
|
|
* the kernel-format in_pcb structure pointed to by inp. This is done to
|
|
|
|
* reduce the spew of irrelevant information over this interface, to isolate
|
|
|
|
* user code from changes in the kernel structure, and potentially to provide
|
|
|
|
* information-hiding if we decide that some of this information should be
|
|
|
|
* hidden from users.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
in_pcbtoxinpcb(const struct inpcb *inp, struct xinpcb *xi)
|
|
|
|
{
|
|
|
|
|
2018-11-22 20:49:41 +00:00
|
|
|
bzero(xi, sizeof(*xi));
|
2017-03-21 06:39:49 +00:00
|
|
|
xi->xi_len = sizeof(struct xinpcb);
|
|
|
|
if (inp->inp_socket)
|
|
|
|
sotoxsocket(inp->inp_socket, &xi->xi_socket);
|
|
|
|
bcopy(&inp->inp_inc, &xi->inp_inc, sizeof(struct in_conninfo));
|
|
|
|
xi->inp_gencnt = inp->inp_gencnt;
|
2018-07-10 13:03:06 +00:00
|
|
|
xi->inp_ppcb = (uintptr_t)inp->inp_ppcb;
|
2017-03-21 06:39:49 +00:00
|
|
|
xi->inp_flow = inp->inp_flow;
|
|
|
|
xi->inp_flowid = inp->inp_flowid;
|
|
|
|
xi->inp_flowtype = inp->inp_flowtype;
|
|
|
|
xi->inp_flags = inp->inp_flags;
|
|
|
|
xi->inp_flags2 = inp->inp_flags2;
|
|
|
|
xi->inp_rss_listen_bucket = inp->inp_rss_listen_bucket;
|
|
|
|
xi->in6p_cksum = inp->in6p_cksum;
|
|
|
|
xi->in6p_hops = inp->in6p_hops;
|
|
|
|
xi->inp_ip_tos = inp->inp_ip_tos;
|
|
|
|
xi->inp_vflag = inp->inp_vflag;
|
|
|
|
xi->inp_ip_ttl = inp->inp_ip_ttl;
|
|
|
|
xi->inp_ip_p = inp->inp_ip_p;
|
|
|
|
xi->inp_ip_minttl = inp->inp_ip_minttl;
|
|
|
|
}
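The export pattern used by in_pcbtoxinpcb() is: zero the external structure, record its length, then copy only the chosen fields. It can be sketched in userland as below. struct kobj, struct xkobj and kobj_to_xkobj() are hypothetical and far smaller than the real inpcb and xinpcb; they only illustrate the shape of the technique.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical kernel-side structure. */
struct kobj {
	void		*k_private;
	uint32_t	 k_flags;
	uint8_t		 k_ttl;
};

/* Hypothetical external structure with a length field and spare room. */
struct xkobj {
	size_t		 x_len;
	uint64_t	 x_private;	/* pointer exported as an integer */
	uint32_t	 x_flags;
	uint8_t		 x_ttl;
	uint8_t		 x_spare[32];
};

void
kobj_to_xkobj(const struct kobj *k, struct xkobj *x)
{
	assert(k != NULL && x != NULL);
	/* Zero first so padding and spares never carry stale memory. */
	memset(x, 0, sizeof(*x));
	/* The length lets a consumer detect a layout it does not know. */
	x->x_len = sizeof(*x);
	x->x_private = (uint64_t)(uintptr_t)k->k_private;
	x->x_flags = k->k_flags;
	x->x_ttl = k->k_ttl;
}
```

Zeroing before the field copies means spare bytes and padding reach the consumer as zeros, which is likely the reason the kernel function starts with bzero().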
|
|
|
|
|
2007-02-17 21:02:38 +00:00
|
|
|
#ifdef DDB
|
|
|
|
static void
|
|
|
|
db_print_indent(int indent)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
|
|
|
for (i = 0; i < indent; i++)
|
|
|
|
db_printf(" ");
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
db_print_inconninfo(struct in_conninfo *inc, const char *name, int indent)
|
|
|
|
{
|
|
|
|
char faddr_str[48], laddr_str[48];
|
|
|
|
|
|
|
|
db_print_indent(indent);
|
|
|
|
db_printf("%s at %p\n", name, inc);
|
|
|
|
|
|
|
|
indent += 2;
|
|
|
|
|
2007-02-18 08:57:23 +00:00
|
|
|
#ifdef INET6
|
2008-12-17 12:52:34 +00:00
|
|
|
if (inc->inc_flags & INC_ISIPV6) {
|
2007-02-17 21:02:38 +00:00
|
|
|
/* IPv6. */
|
|
|
|
ip6_sprintf(laddr_str, &inc->inc6_laddr);
|
|
|
|
ip6_sprintf(faddr_str, &inc->inc6_faddr);
|
2012-01-22 02:13:19 +00:00
|
|
|
} else
|
2007-02-18 08:57:23 +00:00
|
|
|
#endif
|
2012-01-22 02:13:19 +00:00
|
|
|
{
|
2007-02-17 21:02:38 +00:00
|
|
|
/* IPv4. */
|
|
|
|
inet_ntoa_r(inc->inc_laddr, laddr_str);
|
|
|
|
inet_ntoa_r(inc->inc_faddr, faddr_str);
|
|
|
|
}
|
|
|
|
db_print_indent(indent);
|
|
|
|
db_printf("inc_laddr %s inc_lport %u\n", laddr_str,
|
|
|
|
ntohs(inc->inc_lport));
|
|
|
|
db_print_indent(indent);
|
|
|
|
db_printf("inc_faddr %s inc_fport %u\n", faddr_str,
|
|
|
|
ntohs(inc->inc_fport));
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
db_print_inpflags(int inp_flags)
|
|
|
|
{
|
|
|
|
int comma;
|
|
|
|
|
|
|
|
comma = 0;
|
|
|
|
if (inp_flags & INP_RECVOPTS) {
|
|
|
|
db_printf("%sINP_RECVOPTS", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
if (inp_flags & INP_RECVRETOPTS) {
|
|
|
|
db_printf("%sINP_RECVRETOPTS", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
if (inp_flags & INP_RECVDSTADDR) {
|
|
|
|
db_printf("%sINP_RECVDSTADDR", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
2017-03-06 04:01:58 +00:00
|
|
|
if (inp_flags & INP_ORIGDSTADDR) {
|
|
|
|
db_printf("%sINP_ORIGDSTADDR", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
2007-02-17 21:02:38 +00:00
|
|
|
if (inp_flags & INP_HDRINCL) {
|
|
|
|
db_printf("%sINP_HDRINCL", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
if (inp_flags & INP_HIGHPORT) {
|
|
|
|
db_printf("%sINP_HIGHPORT", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
if (inp_flags & INP_LOWPORT) {
|
|
|
|
db_printf("%sINP_LOWPORT", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
if (inp_flags & INP_ANONPORT) {
|
|
|
|
db_printf("%sINP_ANONPORT", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
if (inp_flags & INP_RECVIF) {
|
|
|
|
db_printf("%sINP_RECVIF", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
if (inp_flags & INP_MTUDISC) {
|
|
|
|
db_printf("%sINP_MTUDISC", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
if (inp_flags & INP_RECVTTL) {
|
|
|
|
db_printf("%sINP_RECVTTL", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
if (inp_flags & INP_DONTFRAG) {
|
|
|
|
db_printf("%sINP_DONTFRAG", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
2012-06-12 14:02:38 +00:00
|
|
|
if (inp_flags & INP_RECVTOS) {
|
|
|
|
db_printf("%sINP_RECVTOS", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
2007-02-17 21:02:38 +00:00
|
|
|
if (inp_flags & IN6P_IPV6_V6ONLY) {
|
|
|
|
db_printf("%sIN6P_IPV6_V6ONLY", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
if (inp_flags & IN6P_PKTINFO) {
|
|
|
|
db_printf("%sIN6P_PKTINFO", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
if (inp_flags & IN6P_HOPLIMIT) {
|
|
|
|
db_printf("%sIN6P_HOPLIMIT", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
if (inp_flags & IN6P_HOPOPTS) {
|
|
|
|
db_printf("%sIN6P_HOPOPTS", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
if (inp_flags & IN6P_DSTOPTS) {
|
|
|
|
db_printf("%sIN6P_DSTOPTS", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
if (inp_flags & IN6P_RTHDR) {
|
|
|
|
db_printf("%sIN6P_RTHDR", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
if (inp_flags & IN6P_RTHDRDSTOPTS) {
|
|
|
|
db_printf("%sIN6P_RTHDRDSTOPTS", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
if (inp_flags & IN6P_TCLASS) {
|
|
|
|
db_printf("%sIN6P_TCLASS", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
if (inp_flags & IN6P_AUTOFLOWLABEL) {
|
|
|
|
db_printf("%sIN6P_AUTOFLOWLABEL", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
2009-03-15 09:58:31 +00:00
|
|
|
if (inp_flags & INP_TIMEWAIT) {
|
|
|
|
db_printf("%sINP_TIMEWAIT", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
if (inp_flags & INP_ONESBCAST) {
|
|
|
|
db_printf("%sINP_ONESBCAST", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
if (inp_flags & INP_DROPPED) {
|
|
|
|
db_printf("%sINP_DROPPED", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
if (inp_flags & INP_SOCKREF) {
|
|
|
|
db_printf("%sINP_SOCKREF", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
2007-02-17 21:02:38 +00:00
|
|
|
if (inp_flags & IN6P_RFC2292) {
|
|
|
|
db_printf("%sIN6P_RFC2292", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
if (inp_flags & IN6P_MTU) {
|
|
|
|
db_printf("%sIN6P_MTU", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
}
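The comma-separated flag printing in db_print_inpflags() can also be written as a table-driven loop. The following is only a minimal sketch of that idiom, not code from this file: the struct, the table, the `DEMO_*` names and their bit values are all hypothetical, and it formats into a buffer with snprintf() rather than calling db_printf().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical flag table; names and bit values are for illustration only. */
struct flag_name {
	unsigned int bit;
	const char *name;
};

static const struct flag_name demo_flags[] = {
	{ 0x1, "DEMO_RECVOPTS" },
	{ 0x2, "DEMO_HDRINCL" },
	{ 0x4, "DEMO_MTU" },
};

/*
 * Write the names of all set flags into buf, separated by ", ".
 * The separator is emitted before every name except the first, which is
 * the same "comma" idiom used by the hand-unrolled version.
 * Returns the number of characters written (excluding the NUL).
 */
static size_t
format_flags(unsigned int flags, char *buf, size_t len)
{
	size_t i, used = 0;
	int comma = 0, n;

	if (len == 0)
		return (0);
	buf[0] = '\0';
	for (i = 0; i < sizeof(demo_flags) / sizeof(demo_flags[0]); i++) {
		if ((flags & demo_flags[i].bit) == 0)
			continue;
		n = snprintf(buf + used, len - used, "%s%s",
		    comma ? ", " : "", demo_flags[i].name);
		if (n < 0 || (size_t)n >= len - used) {
			buf[used] = '\0';	/* drop the partial entry */
			break;
		}
		used += (size_t)n;
		comma = 1;
	}
	return (used);
}
```

Keeping the separator in front of the name in a single format string avoids the kind of slip where one entry prints its separator after the name instead of before it.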
|
|
|
|
|
|
|
|
static void
|
|
|
|
db_print_inpvflag(u_char inp_vflag)
|
|
|
|
{
|
|
|
|
int comma;
|
|
|
|
|
|
|
|
comma = 0;
|
|
|
|
if (inp_vflag & INP_IPV4) {
|
|
|
|
db_printf("%sINP_IPV4", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
if (inp_vflag & INP_IPV6) {
|
|
|
|
db_printf("%sINP_IPV6", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
if (inp_vflag & INP_IPV6PROTO) {
|
|
|
|
db_printf("%sINP_IPV6PROTO", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2009-05-14 20:59:36 +00:00
|
|
|
static void
|
2007-02-17 21:02:38 +00:00
|
|
|
db_print_inpcb(struct inpcb *inp, const char *name, int indent)
|
|
|
|
{
|
|
|
|
|
|
|
|
db_print_indent(indent);
|
|
|
|
db_printf("%s at %p\n", name, inp);
|
|
|
|
|
|
|
|
indent += 2;
|
|
|
|
|
|
|
|
db_print_indent(indent);
|
|
|
|
db_printf("inp_flow: 0x%x\n", inp->inp_flow);
|
|
|
|
|
|
|
|
db_print_inconninfo(&inp->inp_inc, "inp_conninfo", indent);
|
|
|
|
|
|
|
|
db_print_indent(indent);
|
|
|
|
db_printf("inp_ppcb: %p inp_pcbinfo: %p inp_socket: %p\n",
|
|
|
|
inp->inp_ppcb, inp->inp_pcbinfo, inp->inp_socket);
|
|
|
|
|
|
|
|
db_print_indent(indent);
|
|
|
|
db_printf("inp_label: %p inp_flags: 0x%x (",
|
|
|
|
inp->inp_label, inp->inp_flags);
|
|
|
|
db_print_inpflags(inp->inp_flags);
|
|
|
|
db_printf(")\n");
|
|
|
|
|
|
|
|
db_print_indent(indent);
|
|
|
|
db_printf("inp_sp: %p inp_vflag: 0x%x (", inp->inp_sp,
|
|
|
|
inp->inp_vflag);
|
|
|
|
db_print_inpvflag(inp->inp_vflag);
|
|
|
|
db_printf(")\n");
|
|
|
|
|
|
|
|
db_print_indent(indent);
|
|
|
|
db_printf("inp_ip_ttl: %d inp_ip_p: %d inp_ip_minttl: %d\n",
|
|
|
|
inp->inp_ip_ttl, inp->inp_ip_p, inp->inp_ip_minttl);
|
|
|
|
|
|
|
|
db_print_indent(indent);
|
|
|
|
#ifdef INET6
|
|
|
|
if (inp->inp_vflag & INP_IPV6) {
|
|
|
|
db_printf("in6p_options: %p in6p_outputopts: %p "
|
|
|
|
"in6p_moptions: %p\n", inp->in6p_options,
|
|
|
|
inp->in6p_outputopts, inp->in6p_moptions);
|
|
|
|
db_printf("in6p_icmp6filt: %p in6p_cksum %d "
|
|
|
|
"in6p_hops %u\n", inp->in6p_icmp6filt, inp->in6p_cksum,
|
|
|
|
inp->in6p_hops);
|
|
|
|
} else
|
|
|
|
#endif
|
|
|
|
{
|
|
|
|
db_printf("inp_ip_tos: %d inp_ip_options: %p "
|
|
|
|
"inp_ip_moptions: %p\n", inp->inp_ip_tos,
|
|
|
|
inp->inp_options, inp->inp_moptions);
|
|
|
|
}
|
|
|
|
|
|
|
|
db_print_indent(indent);
|
|
|
|
db_printf("inp_phd: %p inp_gencnt: %ju\n", inp->inp_phd,
|
|
|
|
(uintmax_t)inp->inp_gencnt);
|
|
|
|
}
|
|
|
|
|
|
|
|
DB_SHOW_COMMAND(inpcb, db_show_inpcb)
|
|
|
|
{
|
|
|
|
struct inpcb *inp;
|
|
|
|
|
|
|
|
if (!have_addr) {
|
|
|
|
db_printf("usage: show inpcb <addr>\n");
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
inp = (struct inpcb *)addr;
|
|
|
|
|
|
|
|
db_print_inpcb(inp, "inpcb", 0);
|
|
|
|
}
|
2012-01-22 02:13:19 +00:00
|
|
|
#endif /* DDB */
|
Implement kernel support for hardware rate limited sockets.
- Add RATELIMIT kernel configuration keyword which must be set to
enable the new functionality.
- Add support for hardware driven, Receive Side Scaling, RSS aware, rate
limited sendqueues and expose the functionality through the already
established SO_MAX_PACING_RATE setsockopt(). The API supports rates in
the range from 1 to 4Gbytes/s which are suitable for regular TCP and
UDP streams. The setsockopt(2) manual page has been updated.
- Add rate limit function callback API to "struct ifnet" which supports
the following operations: if_snd_tag_alloc(), if_snd_tag_modify(),
if_snd_tag_query() and if_snd_tag_free().
- Add support to ifconfig to view, set and clear the IFCAP_TXRTLMT
flag, which tells if a network driver supports rate limiting or not.
- This patch also adds support for rate limiting through VLAN and LAGG
intermediate network devices.
- How rate limiting works:
1) The userspace application calls setsockopt() after accepting or
making a new connection to set the rate which is then stored in the
socket structure in the kernel. Later on when packets are transmitted
a check is made in the transmit path for rate changes. A rate change
implies a non-blocking ifp->if_snd_tag_alloc() call will be made to the
destination network interface, which then sets up a custom sendqueue
with the given rate limitation parameter. A "struct m_snd_tag" pointer is
returned which serves as a "snd_tag" hint in the m_pkthdr for the
subsequently transmitted mbufs.
2) When the network driver sees that "m->m_pkthdr.snd_tag" is not
NULL, it moves the packets into the designated rate-limited sendqueue
given by the snd_tag pointer. How the traffic is actually rate limited
is up to the individual driver.
3) Route changes are detected by the NIC drivers in the ifp->if_transmit()
routine when the ifnet pointer in the incoming snd_tag does not match
that of the network interface. The network adapter frees the mbuf and
returns EAGAIN, which causes ip_output() to release and clear the send
tag. On the next ip_output() call, allocation of a new "snd_tag" is attempted.
4) When the PCB is detached the custom sendqueue will be released by a
non-blocking ifp->if_snd_tag_free() call to the currently bound network
interface.
Reviewed by: wblock (manpages), adrian, gallatin, scottl (network)
Differential Revision: https://reviews.freebsd.org/D3687
Sponsored by: Mellanox Technologies
MFC after: 3 months
2017-01-18 13:31:17 +00:00
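Step 1 of the commit message above can be sketched from userspace roughly as follows. This is a hedged illustration, not code from the tree: the helper names `pacing_bytes_per_sec()` and `set_pacing_rate()` are hypothetical, the rate is assumed to be an unsigned 32-bit value in bytes per second, and the call is compiled only where the headers define SO_MAX_PACING_RATE.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: convert bits per second into a 32-bit
 * bytes-per-second value. The result saturates at UINT32_MAX - 1 so it
 * stays below -1U, which in_pcbattach_txrtlmt() in this file treats as
 * the "unlimited" tag type.
 */
static uint32_t
pacing_bytes_per_sec(uint64_t bits_per_sec)
{
	uint64_t bytes = bits_per_sec / 8;

	if (bytes > UINT32_MAX - 1)
		return (UINT32_MAX - 1);
	return ((uint32_t)bytes);
}

/*
 * Hypothetical wrapper: request a pacing rate on an existing socket.
 * Returns 0 on success or an errno value on failure.
 */
static int
set_pacing_rate(int fd, uint64_t bits_per_sec)
{
#ifdef SO_MAX_PACING_RATE
	uint32_t rate = pacing_bytes_per_sec(bits_per_sec);

	if (setsockopt(fd, SOL_SOCKET, SO_MAX_PACING_RATE,
	    &rate, sizeof(rate)) == -1)
		return (errno);
	return (0);
#else
	(void)fd;
	(void)bits_per_sec;
	return (ENOPROTOOPT);	/* option not known to these headers */
#endif
}
```

Whether the request reaches hardware depends on the kernel configuration and on the driver, so a caller would normally treat a nonzero return as "pacing not applied" rather than as a fatal error.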
|
|
|
|
|
|
|
#ifdef RATELIMIT
|
|
|
|
/*
|
|
|
|
* Modify TX rate limit based on the existing "inp->inp_snd_tag",
|
|
|
|
* if any.
|
|
|
|
*/
|
|
|
|
int
|
|
|
|
in_pcbmodify_txrtlmt(struct inpcb *inp, uint32_t max_pacing_rate)
|
|
|
|
{
|
|
|
|
union if_snd_tag_modify_params params = {
|
|
|
|
.rate_limit.max_rate = max_pacing_rate,
|
2019-08-01 14:17:31 +00:00
|
|
|
.rate_limit.flags = M_NOWAIT,
|
2017-01-18 13:31:17 +00:00
|
|
|
};
|
|
|
|
struct m_snd_tag *mst;
|
|
|
|
struct ifnet *ifp;
|
|
|
|
int error;
|
|
|
|
|
|
|
|
mst = inp->inp_snd_tag;
|
|
|
|
if (mst == NULL)
|
|
|
|
return (EINVAL);
|
|
|
|
|
|
|
|
ifp = mst->ifp;
|
|
|
|
if (ifp == NULL)
|
|
|
|
return (EINVAL);
|
|
|
|
|
|
|
|
if (ifp->if_snd_tag_modify == NULL) {
|
|
|
|
error = EOPNOTSUPP;
|
|
|
|
} else {
|
|
|
|
error = ifp->if_snd_tag_modify(mst, &params);
|
|
|
|
}
|
|
|
|
return (error);
|
|
|
|
}
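The check order used by in_pcbmodify_txrtlmt() (no tag, no interface, no driver callback, then dispatch) can be exercised outside the kernel with stand-in types. The sketch below is an assumption-laden model: `struct mock_ifnet`, `struct mock_tag`, `mock_driver_modify()` and `mock_modify_txrtlmt()` are hypothetical names, and the callback takes a plain rate instead of the kernel's parameter union.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
struct mock_tag;

struct mock_ifnet {
	/* Optional driver callback; NULL means "not supported". */
	int (*snd_tag_modify)(struct mock_tag *, uint32_t);
};

struct mock_tag {
	struct mock_ifnet *ifp;		/* interface that owns the tag */
	uint32_t max_rate;		/* last rate accepted by the driver */
};

/* Hypothetical driver callback that simply records the new rate. */
static int
mock_driver_modify(struct mock_tag *mst, uint32_t max_rate)
{
	mst->max_rate = max_rate;
	return (0);
}

/*
 * Same check order as the function above: a missing tag or interface
 * yields EINVAL, a missing callback yields EOPNOTSUPP, and otherwise
 * the driver's return value is passed through unchanged.
 */
static int
mock_modify_txrtlmt(struct mock_tag *mst, uint32_t max_rate)
{
	if (mst == NULL || mst->ifp == NULL)
		return (EINVAL);
	if (mst->ifp->snd_tag_modify == NULL)
		return (EOPNOTSUPP);
	return (mst->ifp->snd_tag_modify(mst, max_rate));
}
```

Distinguishing EINVAL from EOPNOTSUPP lets a caller tell "nothing is attached yet" apart from "this interface cannot change the rate".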
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Query existing TX rate limit based on the existing
|
|
|
|
* "inp->inp_snd_tag", if any.
|
|
|
|
*/
|
|
|
|
int
|
|
|
|
in_pcbquery_txrtlmt(struct inpcb *inp, uint32_t *p_max_pacing_rate)
|
|
|
|
{
|
|
|
|
union if_snd_tag_query_params params = { };
|
|
|
|
struct m_snd_tag *mst;
|
|
|
|
struct ifnet *ifp;
|
|
|
|
int error;
|
|
|
|
|
|
|
|
mst = inp->inp_snd_tag;
|
|
|
|
if (mst == NULL)
|
|
|
|
return (EINVAL);
|
|
|
|
|
|
|
|
ifp = mst->ifp;
|
|
|
|
if (ifp == NULL)
|
|
|
|
return (EINVAL);
|
|
|
|
|
|
|
|
if (ifp->if_snd_tag_query == NULL) {
|
|
|
|
error = EOPNOTSUPP;
|
|
|
|
} else {
|
|
|
|
error = ifp->if_snd_tag_query(mst, &params);
|
|
|
|
if (error == 0 && p_max_pacing_rate != NULL)
|
|
|
|
*p_max_pacing_rate = params.rate_limit.max_rate;
|
|
|
|
}
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
2017-09-06 13:56:18 +00:00
|
|
|
/*
|
|
|
|
* Query existing TX queue level based on the existing
|
|
|
|
* "inp->inp_snd_tag", if any.
|
|
|
|
*/
|
|
|
|
int
|
|
|
|
in_pcbquery_txrlevel(struct inpcb *inp, uint32_t *p_txqueue_level)
|
|
|
|
{
|
|
|
|
union if_snd_tag_query_params params = { };
|
|
|
|
struct m_snd_tag *mst;
|
|
|
|
struct ifnet *ifp;
|
|
|
|
int error;
|
|
|
|
|
|
|
|
mst = inp->inp_snd_tag;
|
|
|
|
if (mst == NULL)
|
|
|
|
return (EINVAL);
|
|
|
|
|
|
|
|
ifp = mst->ifp;
|
|
|
|
if (ifp == NULL)
|
|
|
|
return (EINVAL);
|
|
|
|
|
|
|
|
if (ifp->if_snd_tag_query == NULL)
|
|
|
|
return (EOPNOTSUPP);
|
|
|
|
|
|
|
|
error = ifp->if_snd_tag_query(mst, &params);
|
|
|
|
if (error == 0 && p_txqueue_level != NULL)
|
|
|
|
*p_txqueue_level = params.rate_limit.queue_level;
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
2017-01-18 13:31:17 +00:00
|
|
|
/*
|
|
|
|
* Allocate a new TX rate limit send tag from the network interface
|
|
|
|
* given by the "ifp" argument and save it in "inp->inp_snd_tag":
|
|
|
|
*/
|
|
|
|
int
|
|
|
|
in_pcbattach_txrtlmt(struct inpcb *inp, struct ifnet *ifp,
|
2019-08-01 14:17:31 +00:00
|
|
|
uint32_t flowtype, uint32_t flowid, uint32_t max_pacing_rate, struct m_snd_tag **st)
|
|
|
|
|
2017-01-18 13:31:17 +00:00
|
|
|
{
|
|
|
|
union if_snd_tag_alloc_params params = {
|
2017-09-06 13:56:18 +00:00
|
|
|
.rate_limit.hdr.type = (max_pacing_rate == -1U) ?
|
|
|
|
IF_SND_TAG_TYPE_UNLIMITED : IF_SND_TAG_TYPE_RATE_LIMIT,
|
2017-01-18 13:31:17 +00:00
|
|
|
.rate_limit.hdr.flowid = flowid,
|
|
|
|
.rate_limit.hdr.flowtype = flowtype,
|
2020-03-09 13:44:51 +00:00
|
|
|
.rate_limit.hdr.numa_domain = inp->inp_numa_domain,
|
2017-01-18 13:31:17 +00:00
|
|
|
.rate_limit.max_rate = max_pacing_rate,
|
2019-08-01 14:17:31 +00:00
|
|
|
.rate_limit.flags = M_NOWAIT,
|
2017-01-18 13:31:17 +00:00
|
|
|
};
|
|
|
|
int error;
|
|
|
|
|
|
|
|
INP_WLOCK_ASSERT(inp);
|
|
|
|
|
2019-08-01 14:17:31 +00:00
|
|
|
if (*st != NULL)
|
2017-01-18 13:31:17 +00:00
|
|
|
return (EINVAL);
|
|
|
|
|
|
|
|
if (ifp->if_snd_tag_alloc == NULL) {
|
|
|
|
error = EOPNOTSUPP;
|
|
|
|
} else {
|
|
|
|
error = ifp->if_snd_tag_alloc(ifp, &params, &inp->inp_snd_tag);
|
2019-08-01 14:17:31 +00:00
|
|
|
|
2019-08-02 22:43:09 +00:00
|
|
|
#ifdef INET
|
2019-08-01 14:17:31 +00:00
|
|
|
if (error == 0) {
|
|
|
|
counter_u64_add(rate_limit_set_ok, 1);
|
|
|
|
counter_u64_add(rate_limit_active, 1);
|
|
|
|
} else
|
|
|
|
counter_u64_add(rate_limit_alloc_fail, 1);
|
2019-08-02 22:43:09 +00:00
|
|
|
#endif
|
2017-01-18 13:31:17 +00:00
|
|
|
}
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
2019-08-01 14:17:31 +00:00
void
in_pcbdetach_tag(struct ifnet *ifp, struct m_snd_tag *mst)
{
	if (ifp == NULL)
		return;

	/*
	 * If the device was detached while we still had reference(s)
	 * on the ifp, we assume if_snd_tag_free() was replaced with
	 * stubs.
	 */
	ifp->if_snd_tag_free(mst);

	/* release reference count on network interface */
	if_rele(ifp);
2019-08-02 22:43:09 +00:00
#ifdef INET
2019-08-01 14:17:31 +00:00
	counter_u64_add(rate_limit_active, -1);
2019-08-02 22:43:09 +00:00
#endif
2019-08-01 14:17:31 +00:00
}

Implement kernel support for hardware rate limited sockets.
- Add RATELIMIT kernel configuration keyword which must be set to
enable the new functionality.
- Add support for hardware driven, Receive Side Scaling, RSS aware, rate
limited sendqueues and expose the functionality through the already
established SO_MAX_PACING_RATE setsockopt(). The API supports rates in
the range from 1 to 4Gbytes/s which are suitable for regular TCP and
UDP streams. The setsockopt(2) manual page has been updated.
- Add rate limit function callback API to "struct ifnet" which supports
the following operations: if_snd_tag_alloc(), if_snd_tag_modify(),
if_snd_tag_query() and if_snd_tag_free().
- Add support to ifconfig to view, set and clear the IFCAP_TXRTLMT
flag, which tells if a network driver supports rate limiting or not.
- This patch also adds support for rate limiting through VLAN and LAGG
intermediate network devices.
- How rate limiting works:
1) The userspace application calls setsockopt() after accepting or
making a new connection to set the rate which is then stored in the
socket structure in the kernel. Later on when packets are transmitted
a check is made in the transmit path for rate changes. A rate change
implies a non-blocking ifp->if_snd_tag_alloc() call will be made to the
destination network interface, which then sets up a custom sendqueue
with the given rate limitation parameter. A "struct m_snd_tag" pointer is
returned which serves as a "snd_tag" hint in the m_pkthdr for the
subsequently transmitted mbufs.
2) When the network driver sees the "m->m_pkthdr.snd_tag" different
from NULL, it will move the packets into a designated rate limited sendqueue
given by the snd_tag pointer. It is up to the individual drivers how the rate
limited traffic will be rate limited.
3) Route changes are detected by the NIC drivers in the ifp->if_transmit()
routine when the ifnet pointer in the incoming snd_tag mismatches the
one of the network interface. The network adapter frees the mbuf and
returns EAGAIN, which causes ip_output() to release and clear the send
tag. On the next ip_output() call, allocation of a new "snd_tag" is attempted.
4) When the PCB is detached the custom sendqueue will be released by a
non-blocking ifp->if_snd_tag_free() call to the currently bound network
interface.
Reviewed by: wblock (manpages), adrian, gallatin, scottl (network)
Differential Revision: https://reviews.freebsd.org/D3687
Sponsored by: Mellanox Technologies
MFC after: 3 months
2017-01-18 13:31:17 +00:00
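The userspace side of step 1 above can be sketched as follows. This is a minimal illustration, not code from the commit: the helper name set_max_pacing_rate() is hypothetical, and it assumes the option value is a 32-bit rate in bytes per second, which matches the uint32_t max_pacing_rate used by the kernel code in this file. The #else branch is an assumption for platforms that do not define SO_MAX_PACING_RATE.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/socket.h>

/*
 * Hypothetical helper (not part of the kernel change): ask the kernel to
 * pace transmissions on socket "fd" at no more than "bytes_per_sec".
 * Returns 0 on success and -1 with errno set on failure.
 */
int
set_max_pacing_rate(int fd, uint32_t bytes_per_sec)
{
#ifdef SO_MAX_PACING_RATE
	return (setsockopt(fd, SOL_SOCKET, SO_MAX_PACING_RATE,
	    &bytes_per_sec, sizeof(bytes_per_sec)));
#else
	/* Assumption: without the option, report it as unsupported. */
	(void)fd;
	(void)bytes_per_sec;
	errno = ENOPROTOOPT;
	return (-1);
#endif
}
```

A caller would typically invoke this right after connect() or accept() returns, since the rate is stored in the socket and only acted upon later in the transmit path.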
/*
 * Free an existing TX rate limit tag based on the "inp->inp_snd_tag",
 * if any:
 */
void
in_pcbdetach_txrtlmt(struct inpcb *inp)
{
	struct m_snd_tag *mst;

	INP_WLOCK_ASSERT(inp);

	mst = inp->inp_snd_tag;
	inp->inp_snd_tag = NULL;

	if (mst == NULL)
		return;

Restructure mbuf send tags to provide stronger guarantees.
- Perform ifp mismatch checks (to determine if a send tag is allocated
for a different ifp than the one the packet is being output on), in
ip_output() and ip6_output(). This avoids sending packets with send
tags to ifnet drivers that don't support send tags.
Since we are now checking for ifp mismatches before invoking
if_output, we can now try to allocate a new tag before invoking
if_output sending the original packet on the new tag if allocation
succeeds.
To avoid code duplication for the fragment and unfragmented cases,
add ip_output_send() and ip6_output_send() as wrappers around
if_output and nd6_output_ifp, respectively. All of the logic for
setting send tags and dealing with send tag-related errors is done
in these wrapper functions.
For pseudo interfaces that wrap other network interfaces (vlan and
lagg), wrapper send tags are now allocated so that ip*_output see
the wrapper ifp as the ifp in the send tag. The if_transmit
routines rewrite the send tags after performing an ifp mismatch
check. If an ifp mismatch is detected, the transmit routines fail
with EAGAIN.
- To provide clearer life cycle management of send tags, especially
in the presence of vlan and lagg wrapper tags, add a reference count
to send tags managed via m_snd_tag_ref() and m_snd_tag_rele().
Provide a helper function (m_snd_tag_init()) for use by drivers
supporting send tags. m_snd_tag_init() takes care of the if_ref
on the ifp meaning that code allocating send tags via if_snd_tag_alloc
no longer has to manage that manually. Similarly, m_snd_tag_rele
drops the refcount on the ifp after invoking if_snd_tag_free when
the last reference to a send tag is dropped.
This also closes use after free races if there are pending packets in
driver tx rings after the socket is closed (e.g. from tcpdrop).
In order for m_free to work reliably, add a new CSUM_SND_TAG flag in
csum_flags to indicate 'snd_tag' is set (rather than 'rcvif').
Drivers now also check this flag instead of checking snd_tag against
NULL. This avoids false positive matches when a forwarded packet
has a non-NULL rcvif that was treated as a send tag.
- cxgbe was relying on snd_tag_free being called when the inp was
detached so that it could kick the firmware to flush any pending
work on the flow. This is because the driver doesn't require ACK
messages from the firmware for every request, but instead does a
kind of manual interrupt coalescing by only setting a flag to
request a completion on a subset of requests. If all of the
in-flight requests don't have the flag when the tag is detached from
the inp, the flow might never return the credits. The current
snd_tag_free command issues a flush command to force the credits to
return. However, the credit return is what also frees the mbufs,
and since those mbufs now hold references on the tag, this meant
that snd_tag_free would never be called.
To fix, explicitly drop the mbuf's reference on the snd tag when the
mbuf is queued in the firmware work queue. This means that once the
inp's reference on the tag goes away and all in-flight mbufs have
been queued to the firmware, tag's refcount will drop to zero and
snd_tag_free will kick in and send the flush request. Note that we
need to avoid doing this in the middle of ethofld_tx(), so the
driver grabs a temporary reference on the tag around that loop to
defer the free to the end of the function in case it sends the last
mbuf to the queue after the inp has dropped its reference on the
tag.
- mlx5 preallocates send tags and was using the ifp pointer even when
the send tag wasn't in use. Explicitly use the ifp from other data
structures instead.
- Sprinkle some assertions in various places to assert that received
packets don't have a send tag, and that other places that overwrite
rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer.
Reviewed by: gallatin, hselasky, rgrimes, ae
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20117
2019-05-24 22:30:40 +00:00
	m_snd_tag_rele(mst);
2017-01-18 13:31:17 +00:00
}

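The send tag life cycle described in the notes above (m_snd_tag_init() taking the ifp reference, m_snd_tag_rele() freeing the tag and dropping the ifp reference on the last release, and the PCB clearing its pointer before releasing) can be modelled in userspace roughly as follows. This is a single-threaded sketch under stated assumptions: every toy_* name is hypothetical, plain integers stand in for the kernel's atomic reference counts, and a counter stands in for the driver's if_snd_tag_free() callback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct ifnet. */
struct toy_ifnet {
	int refs;       /* references held on the interface */
	int tags_freed; /* counts calls to the "free" callback */
};

/* Hypothetical stand-in for struct m_snd_tag. */
struct toy_snd_tag {
	struct toy_ifnet *ifp;
	unsigned refcount;
};

/* Hypothetical stand-in for the inpcb's send tag pointer. */
struct toy_pcb {
	struct toy_snd_tag *snd_tag;
};

/* Initialise a tag: one reference for the creator, one on the ifp. */
void
toy_snd_tag_init(struct toy_snd_tag *mst, struct toy_ifnet *ifp)
{
	ifp->refs++;
	mst->ifp = ifp;
	mst->refcount = 1;
}

/* Take an extra reference, e.g. for an mbuf that carries the tag. */
void
toy_snd_tag_ref(struct toy_snd_tag *mst)
{
	assert(mst->refcount > 0);
	mst->refcount++;
}

/* Drop a reference; returns true if this was the last one. */
bool
toy_snd_tag_rele(struct toy_snd_tag *mst)
{
	struct toy_ifnet *ifp;

	assert(mst->refcount > 0);
	if (--mst->refcount != 0)
		return (false);
	ifp = mst->ifp;
	ifp->tags_freed++; /* stands in for if_snd_tag_free() */
	mst->ifp = NULL;
	ifp->refs--;       /* stands in for if_rele() */
	return (true);
}

/* Detach: clear the PCB's pointer first, then release the reference. */
void
toy_pcb_detach(struct toy_pcb *pcb)
{
	struct toy_snd_tag *mst = pcb->snd_tag;

	pcb->snd_tag = NULL;
	if (mst == NULL)
		return;
	(void)toy_snd_tag_rele(mst);
}
```

In this model a tag that is still referenced by an in-flight packet survives toy_pcb_detach(); the free callback only runs once the packet's reference is dropped as well.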
2019-08-01 14:17:31 +00:00
int
in_pcboutput_txrtlmt_locked(struct inpcb *inp, struct ifnet *ifp, struct mbuf *mb, uint32_t max_pacing_rate)
2017-01-18 13:31:17 +00:00
{
	int error;

2019-05-24 22:30:40 +00:00
	/*
	 * If the existing send tag is for the wrong interface due to
	 * a route change, first drop the existing tag. Set the
	 * CHANGED flag so that we will keep trying to allocate a new
	 * tag if we fail to allocate one this time.
	 */
	if (inp->inp_snd_tag != NULL && inp->inp_snd_tag->ifp != ifp) {
		in_pcbdetach_txrtlmt(inp);
		inp->inp_flags2 |= INP_RATE_LIMIT_CHANGED;
	}

2017-01-18 13:31:17 +00:00
	/*
	 * NOTE: When attaching to a network interface a reference is
	 * made to ensure the network interface doesn't go away until
	 * all ratelimit connections are gone. The network interface
	 * pointers compared below represent valid network interfaces,
	 * except when comparing towards NULL.
	 */
	if (max_pacing_rate == 0 && inp->inp_snd_tag == NULL) {
		error = 0;
	} else if (!(ifp->if_capenable & IFCAP_TXRTLMT)) {
		if (inp->inp_snd_tag != NULL)
			in_pcbdetach_txrtlmt(inp);
		error = 0;
	} else if (inp->inp_snd_tag == NULL) {
		/*
		 * In order to utilize packet pacing with RSS, we need
		 * to wait until there is a valid RSS hash before we
		 * can proceed:
		 */
		if (M_HASHTYPE_GET(mb) == M_HASHTYPE_NONE) {
			error = EAGAIN;
		} else {
			error = in_pcbattach_txrtlmt(inp, ifp, M_HASHTYPE_GET(mb),
2019-08-01 14:17:31 +00:00
			    mb->m_pkthdr.flowid, max_pacing_rate, &inp->inp_snd_tag);
2017-01-18 13:31:17 +00:00
		}
	} else {
		error = in_pcbmodify_txrtlmt(inp, max_pacing_rate);
	}
	if (error == 0 || error == EOPNOTSUPP)
		inp->inp_flags2 &= ~INP_RATE_LIMIT_CHANGED;
2019-08-01 14:17:31 +00:00

	return (error);
}
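The branch order in in_pcboutput_txrtlmt_locked() can be summarised as a small decision function. The sketch below is a userspace model only: the toy_* names and the toy_state structure are hypothetical, and the state is assumed to be taken after the route-change check at the top of the function has already dropped a tag bound to the wrong interface.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical labels for the five outcomes of the if/else chain. */
enum toy_action {
	TOY_NOTHING,       /* error = 0, nothing to do */
	TOY_DETACH,        /* drop the tag, error = 0 */
	TOY_WAIT_FOR_HASH, /* error = EAGAIN until an RSS hash exists */
	TOY_ATTACH,        /* allocate a new send tag */
	TOY_MODIFY         /* update the rate of the existing tag */
};

/* Hypothetical snapshot of the inputs the kernel function inspects. */
struct toy_state {
	uint32_t max_pacing_rate;  /* socket's requested rate */
	bool has_tag;              /* inp->inp_snd_tag != NULL */
	bool ifp_supports_txrtlmt; /* IFCAP_TXRTLMT enabled on ifp */
	bool mbuf_has_hash;        /* hash type is not M_HASHTYPE_NONE */
};

/* Mirror the order in which the conditions are tested. */
enum toy_action
toy_choose_action(const struct toy_state *s)
{
	if (s->max_pacing_rate == 0 && !s->has_tag)
		return (TOY_NOTHING);
	if (!s->ifp_supports_txrtlmt)
		return (s->has_tag ? TOY_DETACH : TOY_NOTHING);
	if (!s->has_tag)
		return (s->mbuf_has_hash ? TOY_ATTACH : TOY_WAIT_FOR_HASH);
	return (TOY_MODIFY);
}

/* The CHANGED flag is cleared on success or when unsupported. */
bool
toy_clears_changed_flag(int error)
{
	return (error == 0 || error == EOPNOTSUPP);
}
```

Note that EAGAIN leaves the flag set, so the attach is retried on a later packet once a hash is available.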

/*
 * This function should be called when the INP_RATE_LIMIT_CHANGED flag
 * is set in the fast path and will attach/detach/modify the TX rate
 * limit send tag based on the socket's so_max_pacing_rate value.
 */
void
in_pcboutput_txrtlmt(struct inpcb *inp, struct ifnet *ifp, struct mbuf *mb)
{
	struct socket *socket;
	uint32_t max_pacing_rate;
	bool did_upgrade;
	int error;

	if (inp == NULL)
		return;

	socket = inp->inp_socket;
	if (socket == NULL)
		return;

	if (!INP_WLOCKED(inp)) {
		/*
		 * NOTE: If the write locking fails, we need to bail
		 * out and use the non-ratelimited ring for the
		 * transmit until there is a new chance to get the
		 * write lock.
		 */
		if (!INP_TRY_UPGRADE(inp))
			return;
		did_upgrade = 1;
	} else {
		did_upgrade = 0;
	}

	/*
	 * NOTE: The so_max_pacing_rate value is read unlocked,
	 * because atomic updates are not required since the variable
	 * is checked at every mbuf we send. It is assumed that the
	 * variable read itself will be atomic.
	 */
	max_pacing_rate = socket->so_max_pacing_rate;

	error = in_pcboutput_txrtlmt_locked(inp, ifp, mb, max_pacing_rate);

2017-01-18 13:31:17 +00:00
	if (did_upgrade)
		INP_DOWNGRADE(inp);
}
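The lock handling in in_pcboutput_txrtlmt() follows a try-upgrade, work, downgrade pattern: if the caller holds only a read lock, the function attempts a non-blocking upgrade and gives up if that fails. A single-threaded userspace model of the control flow might look like this. The toy_rwlock type and all toy_* functions are hypothetical, and the rule that an upgrade succeeds only for a sole reader is an assumption of the model, not a statement about the kernel's lock implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, non-concurrent model of a reader/writer lock. */
struct toy_rwlock {
	int readers; /* shared holders, including the caller */
	bool writer; /* true while held exclusively by the caller */
};

/* Non-blocking upgrade: assumed to succeed only for a sole reader. */
bool
toy_try_upgrade(struct toy_rwlock *lk)
{
	if (lk->writer || lk->readers != 1)
		return (false);
	lk->readers = 0;
	lk->writer = true;
	return (true);
}

/* Return from exclusive to shared ownership. */
void
toy_downgrade(struct toy_rwlock *lk)
{
	assert(lk->writer);
	lk->writer = false;
	lk->readers = 1;
}

/*
 * Update *state under the write lock. Returns false, leaving *state
 * untouched, if the caller held a read lock that could not be upgraded.
 * The lock is left in the mode the caller entered with.
 */
bool
toy_update_with_upgrade(struct toy_rwlock *lk, int *state, int new_value)
{
	bool did_upgrade;

	if (!lk->writer) {
		if (!toy_try_upgrade(lk))
			return (false);
		did_upgrade = true;
	} else {
		did_upgrade = false;
	}

	*state = new_value;

	if (did_upgrade)
		toy_downgrade(lk);
	return (true);
}
```

Bailing out instead of blocking keeps the transmit path moving; the update is simply attempted again on a later packet.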

/*
 * Track route changes for TX rate limiting.
 */
void
in_pcboutput_eagain(struct inpcb *inp)
{
	bool did_upgrade;

	if (inp == NULL)
		return;

	if (inp->inp_snd_tag == NULL)
		return;

	if (!INP_WLOCKED(inp)) {
		/*
		 * NOTE: If the write locking fails, we need to bail
		 * out and use the non-ratelimited ring for the
		 * transmit until there is a new chance to get the
		 * write lock.
		 */
		if (!INP_TRY_UPGRADE(inp))
			return;
		did_upgrade = 1;
	} else {
		did_upgrade = 0;
	}

	/* detach rate limiting */
	in_pcbdetach_txrtlmt(inp);

	/* make sure new mbuf send tag allocation is made */
	inp->inp_flags2 |= INP_RATE_LIMIT_CHANGED;

	if (did_upgrade)
		INP_DOWNGRADE(inp);
}
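The route-change handling that in_pcboutput_eagain() takes part in can be illustrated with a toy model: a transmit attempt on the wrong interface returns EAGAIN, the tag is dropped and the CHANGED flag is set, and the next output attempt allocates a fresh tag for the new interface. Everything below is a hypothetical userspace sketch. Interfaces are plain integer IDs, the tag lives in storage embedded in the toy PCB, and locking and the rate itself are omitted.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical send tag: only remembers which interface it belongs to. */
struct toy_tag {
	int ifp_id;
};

/* Hypothetical PCB with an optional tag and the CHANGED flag. */
struct toy_pcb {
	struct toy_tag *tag;
	struct toy_tag tag_storage;
	bool rate_limit_changed;
};

/* Model of the transmit routine: reject tags for another interface. */
int
toy_if_transmit(const struct toy_pcb *pcb, int ifp_id)
{
	if (pcb->tag != NULL && pcb->tag->ifp_id != ifp_id)
		return (EAGAIN);
	return (0);
}

/* Model of the EAGAIN handler: drop the tag, request a new one. */
void
toy_output_eagain(struct toy_pcb *pcb)
{
	if (pcb->tag == NULL)
		return;
	pcb->tag = NULL;
	pcb->rate_limit_changed = true;
}

/* Model of one output attempt on interface "ifp_id". */
int
toy_output(struct toy_pcb *pcb, int ifp_id)
{
	int error;

	if (pcb->rate_limit_changed && pcb->tag == NULL) {
		/* Allocate a new tag bound to the current interface. */
		pcb->tag_storage.ifp_id = ifp_id;
		pcb->tag = &pcb->tag_storage;
		pcb->rate_limit_changed = false;
	}
	error = toy_if_transmit(pcb, ifp_id);
	if (error == EAGAIN)
		toy_output_eagain(pcb);
	return (error);
}
```

In the model, the first output after a route change fails with EAGAIN and the second succeeds with a tag for the new interface.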
2019-08-01 14:17:31 +00:00

2019-08-02 22:43:09 +00:00
#ifdef INET
2019-08-01 14:17:31 +00:00
static void
rl_init(void *st)
{
	rate_limit_active = counter_u64_alloc(M_WAITOK);
	rate_limit_alloc_fail = counter_u64_alloc(M_WAITOK);
	rate_limit_set_ok = counter_u64_alloc(M_WAITOK);
}

SYSINIT(rl, SI_SUB_PROTO_DOMAININIT, SI_ORDER_ANY, rl_init, NULL);
2019-08-02 22:43:09 +00:00
#endif
2017-01-18 13:31:17 +00:00
#endif /* RATELIMIT */