2005-01-07 01:45:51 +00:00
|
|
|
/*-
|
1994-05-24 10:09:53 +00:00
|
|
|
* Copyright (c) 1982, 1986, 1990, 1993
|
2008-12-09 10:21:38 +00:00
|
|
|
* The Regents of the University of California.
|
2011-05-23 13:51:57 +00:00
|
|
|
* Copyright (c) 2010-2011 Juniper Networks, Inc.
|
2008-12-09 10:21:38 +00:00
|
|
|
* All rights reserved.
|
1994-05-24 10:09:53 +00:00
|
|
|
*
|
2011-05-23 13:51:57 +00:00
|
|
|
* Portions of this software were developed by Robert N. M. Watson under
|
|
|
|
* contract to Juniper Networks, Inc.
|
|
|
|
*
|
1994-05-24 10:09:53 +00:00
|
|
|
* Redistribution and use in source and binary forms, with or without
|
|
|
|
* modification, are permitted provided that the following conditions
|
|
|
|
* are met:
|
|
|
|
* 1. Redistributions of source code must retain the above copyright
|
|
|
|
* notice, this list of conditions and the following disclaimer.
|
|
|
|
* 2. Redistributions in binary form must reproduce the above copyright
|
|
|
|
* notice, this list of conditions and the following disclaimer in the
|
|
|
|
* documentation and/or other materials provided with the distribution.
|
|
|
|
* 4. Neither the name of the University nor the names of its contributors
|
|
|
|
* may be used to endorse or promote products derived from this software
|
|
|
|
* without specific prior written permission.
|
|
|
|
*
|
|
|
|
* THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
|
|
|
|
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
|
|
|
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
|
|
|
* ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
|
|
|
|
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
|
|
|
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
|
|
|
|
* OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
|
|
|
|
* HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
|
|
|
|
* LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
|
|
|
|
* OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
|
|
|
|
* SUCH DAMAGE.
|
|
|
|
*
|
|
|
|
* @(#)in_pcb.h 8.1 (Berkeley) 6/10/93
|
1999-08-28 01:08:13 +00:00
|
|
|
* $FreeBSD$
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
|
|
|
|
1994-08-21 05:27:42 +00:00
|
|
|
#ifndef _NETINET_IN_PCB_H_
|
|
|
|
#define _NETINET_IN_PCB_H_
|
|
|
|
|
1995-12-05 21:26:34 +00:00
|
|
|
#include <sys/queue.h>
|
2002-09-05 19:48:52 +00:00
|
|
|
#include <sys/_lock.h>
|
|
|
|
#include <sys/_mutex.h>
|
2008-04-17 21:38:18 +00:00
|
|
|
#include <sys/_rwlock.h>
|
1995-12-05 21:26:34 +00:00
|
|
|
|
2008-04-17 21:38:18 +00:00
|
|
|
#ifdef _KERNEL
|
2011-06-06 21:45:32 +00:00
|
|
|
#include <sys/lock.h>
|
2008-04-17 21:38:18 +00:00
|
|
|
#include <sys/rwlock.h>
|
Build on Jeff Roberson's linker-set based dynamic per-CPU allocator
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator. Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...). This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.
Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack. Virtualized global variables are
tagged at compile-time, placing them in a special linker set, which is
loaded into a contiguous region of kernel memory. Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of the kernel linker.
Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy. Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address. When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.
This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.
Bump __FreeBSD_version and update UPDATING.
Portions submitted by: bz
Reviewed by: bz, zec
Discussed with: gnn, jamie, jeff, jhb, julian, sam
Suggested by: peter
Approved by: re (kensmith)
2009-07-14 22:48:30 +00:00
|
|
|
#include <net/vnet.h>
|
2010-03-14 18:59:11 +00:00
|
|
|
#include <vm/uma.h>
|
2008-04-17 21:38:18 +00:00
|
|
|
#endif
|
|
|
|
|
1999-11-22 02:45:11 +00:00
|
|
|
#define in6pcb inpcb /* for KAME src sync over BSD*'s */
|
|
|
|
#define in6p_sp inp_sp /* for KAME src sync over BSD*'s */
|
2002-10-16 02:25:05 +00:00
|
|
|
struct inpcbpolicy;
|
1999-11-22 02:45:11 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
2008-09-29 13:48:48 +00:00
|
|
|
* struct inpcb is the common protocol control block structure used in most
|
|
|
|
* IP transport protocols.
|
2007-04-30 23:12:05 +00:00
|
|
|
*
|
|
|
|
* Pointers to local and foreign host table entries, local and foreign socket
|
|
|
|
* numbers, and pointers up (to a socket structure) and down (to a
|
|
|
|
* protocol-specific control block) are stored here.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
2000-05-26 02:09:24 +00:00
|
|
|
LIST_HEAD(inpcbhead, inpcb);
|
|
|
|
LIST_HEAD(inpcbporthead, inpcbport);
|
1998-05-15 20:11:40 +00:00
|
|
|
typedef u_quad_t inp_gen_t;
|
1995-04-09 01:29:31 +00:00
|
|
|
|
1999-11-22 02:45:11 +00:00
|
|
|
/*
|
|
|
|
* PCB with AF_INET6 null bind'ed laddr can receive AF_INET input packet.
|
2007-04-30 23:12:05 +00:00
|
|
|
* So, AF_INET6 null laddr is also used as AF_INET null laddr, by utilizing
|
|
|
|
* the following structure.
|
1999-11-22 02:45:11 +00:00
|
|
|
*/
|
|
|
|
struct in_addr_4in6 {
|
|
|
|
u_int32_t ia46_pad32[3];
|
|
|
|
struct in_addr ia46_addr4;
|
|
|
|
};
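A minimal sketch of why the padding is three 32-bit words: it places the IPv4 address over the last four bytes of a 16-byte IPv6 address, so an all-zero IPv6 local address also reads as an all-zero IPv4 address. The `demo_` types and the helper below are hypothetical user-space stand-ins, not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for struct in_addr and struct in6_addr. */
struct demo_in_addr { uint32_t addr; };
struct demo_in6_addr { uint8_t bytes[16]; };

/* Same shape as struct in_addr_4in6: 12 bytes of padding, then IPv4. */
struct demo_in_addr_4in6 {
	uint32_t		ia46_pad32[3];
	struct demo_in_addr	ia46_addr4;
};

/* One piece of storage, viewed either as IPv6 or as padded IPv4. */
union demo_addr {
	struct demo_in_addr_4in6 a46;
	struct demo_in6_addr	 a6;
};

/* Returns 1 if the IPv4 view of the given IPv6 address is the null address. */
static int
demo_v4_view_is_null(const struct demo_in6_addr *a6)
{
	union demo_addr u;

	memset(&u, 0, sizeof(u));
	u.a6 = *a6;
	return (u.a46.ia46_addr4.addr == 0);
}
```

The IPv4 view only inspects the final four bytes, so an address that is null as IPv6 is necessarily null as IPv4, but not the other way round.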
|
|
|
|
|
2001-11-22 04:50:44 +00:00
|
|
|
/*
|
2007-04-30 23:12:05 +00:00
|
|
|
* NOTE: ipv6 addrs should be 64-bit aligned, per RFC 2553. in_conninfo has
|
|
|
|
* some extra padding to accomplish this.
|
Use Jenkins hash for TCP syncache.
o Unlike xor, in Jenkins hash every bit of input affects virtually
every bit of output, thus salting the hash actually works. With
xor salting only provides a false sense of security, since if
hash(x) collides with hash(y), then of course, hash(x) ^ salt
would also collide with hash(y) ^ salt. [1]
o Jenkins provides much better distribution than xor, very close to
ideal.
TCP connection setup/teardown benchmark has shown a 10% increase
with default hash size, and with bigger hashes that still provide
possibility for collisions. With enormous hash size, when dataset is
by an order of magnitude smaller than hash size, the benchmark has
shown a 4% decrease in performance, which is expected and
acceptable.
Noticed by: Jeffrey Knockel <jeffk cs.unm.edu> [1]
Benchmarks by: jch
Reviewed by: jch, pkelsey, delphij
Security: strengthens protection against hash collision DoS
Sponsored by: Nginx, Inc.
2015-09-05 10:15:19 +00:00
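The point about xor salting in the commit message above can be checked with a toy example. This sketch uses a deliberately trivial xor hash over two words; it is not the syncache code and all `demo_` names are hypothetical. It only shows that a salt applied by xor after hashing cannot separate inputs that already collide.

```c
#include <assert.h>
#include <stdint.h>

/* Toy xor-folding hash over two 32-bit words (illustration only). */
static uint32_t
demo_xor_hash(uint32_t a, uint32_t b)
{
	return (a ^ b);
}

/* Salt applied by xor after hashing: hash(x) ^ salt. */
static uint32_t
demo_salted(uint32_t a, uint32_t b, uint32_t salt)
{
	return (demo_xor_hash(a, b) ^ salt);
}

/* Returns 1 if inputs (a1, b1) and (a2, b2) collide under the given salt. */
static int
demo_collides(uint32_t a1, uint32_t b1, uint32_t a2, uint32_t b2,
    uint32_t salt)
{
	return (demo_salted(a1, b1, salt) == demo_salted(a2, b2, salt));
}
```

Because the salt cancels out of the comparison, whether two inputs collide is independent of the salt value, which is the weakness the commit message describes.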
|
|
|
* NOTE 2: tcp_syncache.c uses first 5 32-bit words, which identify fport,
|
|
|
|
* lport, faddr to generate hash, so these fields shouldn't be moved.
|
2001-11-22 04:50:44 +00:00
|
|
|
*/
|
|
|
|
struct in_endpoints {
|
|
|
|
u_int16_t ie_fport; /* foreign port */
|
|
|
|
u_int16_t ie_lport; /* local port */
|
|
|
|
/* protocol dependent part, local and foreign addr */
|
|
|
|
union {
|
|
|
|
/* foreign host table entry */
|
|
|
|
struct in_addr_4in6 ie46_foreign;
|
|
|
|
struct in6_addr ie6_foreign;
|
|
|
|
} ie_dependfaddr;
|
|
|
|
union {
|
|
|
|
/* local host table entry */
|
|
|
|
struct in_addr_4in6 ie46_local;
|
|
|
|
struct in6_addr ie6_local;
|
|
|
|
} ie_dependladdr;
|
2014-09-10 16:26:18 +00:00
|
|
|
u_int32_t ie6_zoneid; /* scope zone id */
|
2008-12-09 10:21:38 +00:00
|
|
|
};
|
2001-11-22 04:50:44 +00:00
|
|
|
#define ie_faddr ie_dependfaddr.ie46_foreign.ia46_addr4
|
|
|
|
#define ie_laddr ie_dependladdr.ie46_local.ia46_addr4
|
|
|
|
#define ie6_faddr ie_dependfaddr.ie6_foreign
|
|
|
|
#define ie6_laddr ie_dependladdr.ie6_local
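As a rough illustration of the layout these macros rely on, the sketch below rebuilds the structure in user space and checks that the two ports plus the foreign address occupy exactly the first five 32-bit words mentioned in NOTE 2. The `demo_` types and helper functions are hypothetical replicas written for this example; only the member and macro names are taken from the header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel address types. */
struct demo_in_addr { uint32_t addr; };
struct demo_in6_addr { uint8_t bytes[16]; };
struct demo_in_addr_4in6 {
	uint32_t		ia46_pad32[3];
	struct demo_in_addr	ia46_addr4;
};

/* User-space replica of struct in_endpoints. */
struct demo_in_endpoints {
	uint16_t	ie_fport;	/* foreign port */
	uint16_t	ie_lport;	/* local port */
	union {
		struct demo_in_addr_4in6 ie46_foreign;
		struct demo_in6_addr	 ie6_foreign;
	} ie_dependfaddr;
	union {
		struct demo_in_addr_4in6 ie46_local;
		struct demo_in6_addr	 ie6_local;
	} ie_dependladdr;
	uint32_t	ie6_zoneid;	/* scope zone id */
};

/* Same accessor macros as in the header. */
#define ie_faddr	ie_dependfaddr.ie46_foreign.ia46_addr4
#define ie_laddr	ie_dependladdr.ie46_local.ia46_addr4

/* Bytes covered by fport, lport and the foreign address. */
static size_t
demo_hashed_prefix_len(void)
{
	return (offsetof(struct demo_in_endpoints, ie_dependladdr));
}

/* Stores an IPv4 foreign address through the accessor macro. */
static void
demo_set_v4_foreign(struct demo_in_endpoints *ie, uint32_t addr)
{
	ie->ie_faddr.addr = addr;
}
```

Moving or reordering the leading members would change `demo_hashed_prefix_len()`, which is why the comment asks that these fields stay where they are.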
|
|
|
|
|
|
|
|
/*
|
2007-04-30 23:12:05 +00:00
|
|
|
* XXX The defines for inc_* are hacks and should be changed to direct
|
|
|
|
* references.
|
2001-11-22 04:50:44 +00:00
|
|
|
*/
|
|
|
|
struct in_conninfo {
|
|
|
|
u_int8_t inc_flags;
|
|
|
|
u_int8_t inc_len;
|
Add code to allow the system to handle multiple routing tables.
This particular implementation is designed to be fully backwards compatible
and to be MFC-able to 7.x (and 6.x)
Currently the only protocol that can make use of the multiple tables is IPv4
Similar functionality exists in OpenBSD and Linux.
From my notes:
-----
One thing where FreeBSD has been falling behind, and which by chance I
have some time to work on is "policy based routing", which allows
different
packet streams to be routed by more than just the destination address.
Constraints:
------------
I want to make some form of this available in the 6.x tree
(and by extension 7.x) , but FreeBSD in general needs it so I might as
well do it in -current and back port the portions I need.
One of the ways that this can be done is to have the ability to
instantiate multiple kernel routing tables (which I will now
refer to as "Forwarding Information Bases" or "FIBs" for political
correctness reasons). Which FIB a particular packet uses to make
the next hop decision can be decided by a number of mechanisms.
The policies these mechanisms implement are the "Policies" referred
to in "Policy based routing".
One of the constraints I have if I try to back port this work to
6.x is that it must be implemented as an EXTENSION to the existing
ABIs in 6.x so that third party applications do not need to be
recompiled in the timespan of the branch.
This first version will not have some of the bells and whistles that
will come with later versions. It will, for example, be limited to 16
tables in the first commit.
Implementation method, Compatible version. (part 1)
-------------------------------
For this reason I have implemented a "sufficient subset" of a
multiple routing table solution in Perforce, and back-ported it
to 6.x. (also in Perforce though not always caught up with what I
have done in -current/P4). The subset allows a number of FIBs
to be defined at compile time (8 is sufficient for my purposes in 6.x)
and implements the changes needed to allow IPV4 to use them. I have not
done the changes for ipv6 simply because I do not need it, and I do not
have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it.
Other protocol families are left untouched and should there be
users with proprietary protocol families, they should continue to work
and be oblivious to the existence of the extra FIBs.
To understand how this is done, one must know that the current FIB
code starts everything off with a single dimensional array of
pointers to FIB head structures (One per protocol family), each of
which in turn points to the trie of routes available to that family.
The basic change in the ABI compatible version of the change is to
extend that array to be a 2 dimensional array, so that
instead of protocol family X looking at rt_tables[X] for the
table it needs, it looks at rt_tables[Y][X], where for all
protocol families except ipv4 Y is always 0.
Code that is unaware of the change always just sees the first row
of the table, which of course looks just like the one dimensional
array that existed before.
The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
are all maintained, but refer only to the first row of the array,
so that existing callers in proprietary protocols can continue to
do the "right thing".
Some new entry points are added, for the exclusive use of ipv4 code
called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(),
which have an extra argument which refers the code to the correct row.
In addition, there are some new entry points (currently called
rtalloc_fib() and friends) that check the Address family being
looked up and call either rtalloc() (and friends) if the protocol
is not IPv4 forcing the action to row 0 or to the appropriate row
if it IS IPv4 (and that info is available). These are for calling
from code that is not specific to any particular protocol. The way
these are implemented would change in the non ABI preserving code
to be added later.
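The two-dimensional table and the two kinds of entry point described above can be sketched as follows. This is a simplified user-space model, not the routing code: the table sizes, the `DEMO_AF_INET` index, and every `demo_` name are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NUMFIBS	4	/* assumed number of FIBs */
#define DEMO_AF_MAX	8	/* assumed number of protocol families */
#define DEMO_AF_INET	2	/* assumed index of the IPv4 family */

/* Stand-in for a per-family routing table head. */
struct demo_rib_head {
	int	fib;
	int	family;
};

static struct demo_rib_head demo_heads[DEMO_NUMFIBS][DEMO_AF_MAX];
static struct demo_rib_head *demo_rt_tables[DEMO_NUMFIBS][DEMO_AF_MAX];

/* Populate every slot so that lookups can report which row they hit. */
static void
demo_init(void)
{
	for (int fib = 0; fib < DEMO_NUMFIBS; fib++) {
		for (int af = 0; af < DEMO_AF_MAX; af++) {
			demo_heads[fib][af].fib = fib;
			demo_heads[fib][af].family = af;
			demo_rt_tables[fib][af] = &demo_heads[fib][af];
		}
	}
}

/* Legacy-style entry point: unaware of FIBs, always uses row 0. */
static struct demo_rib_head *
demo_table(int family)
{
	if (family < 0 || family >= DEMO_AF_MAX)
		return (NULL);
	return (demo_rt_tables[0][family]);
}

/* FIB-aware entry point: only IPv4 honours the requested row. */
static struct demo_rib_head *
demo_table_fib(int family, int fib)
{
	if (family < 0 || family >= DEMO_AF_MAX ||
	    fib < 0 || fib >= DEMO_NUMFIBS)
		return (NULL);
	if (family != DEMO_AF_INET)
		fib = 0;	/* other families are forced to row 0 */
	return (demo_rt_tables[fib][family]);
}
```

Callers that only know the old interface keep seeing row 0, which is what lets the change stay ABI compatible.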
One feature of the first version of the code is that for ipv4,
the interface routes show up automatically on all the FIBs, so
that no matter what FIB you select you always have the basic
direct attached hosts available to you. (rtinit() does this
automatically).
You CAN delete an interface route from one FIB should you want
to but by default it's there. ARP information is also available
in each FIB. It's assumed that the same machine would have the
same MAC address, regardless of which FIB you are using to get
to it.
This brings us as to how the correct FIB is selected for an outgoing
IPV4 packet.
Firstly, all packets have a FIB associated with them. If nothing
has been done to change it, it will be FIB 0. The FIB is changed
in the following ways.
Packets fall into one of a number of classes.
1/ locally generated packets, coming from a socket/PCB.
Such packets select a FIB from a number associated with the
socket/PCB. This in turn is inherited from the process,
but can be changed by a socket option. The process in turn
inherits it on fork. I have written a utility called setfib
that acts a bit like nice.
setfib -3 ping target.example.com # will use fib 3 for ping.
It is an obvious extension to make it a property of a jail
but I have not done so. It can be achieved by combining the setfib and
jail commands.
2/ packets received on an interface for forwarding.
By default these packets would use table 0,
(or possibly a number settable in a sysctl(not yet)).
but prior to routing the firewall can inspect them (see below).
(possibly in the future you may be able to associate a FIB
with packets received on an interface.. An ifconfig arg, but not yet.)
3/ packets inspected by a packet classifier, which can arbitrarily
associate a fib with it on a packet by packet basis.
A fib assigned to a packet by a packet classifier
(such as ipfw) would over-ride a fib associated by
a more default source. (such as cases 1 or 2).
4/ a tcp listen socket associated with a fib will generate
accept sockets that are associated with that same fib.
5/ Packets generated in response to some other packet (e.g. reset
or icmp packets). These should use the FIB associated with the
packet being responded to.
6/ Packets generated during encapsulation.
gif, tun and other tunnel interfaces will encapsulate using the FIB
that was in effect with the process that set up the tunnel.
thus setfib 1 ifconfig gif0 [tunnel instructions]
will set the fib for the tunnel to use to be fib 1.
Routing messages would be associated with their
process, and thus select one FIB or another.
messages from the kernel would be associated with the fib they
refer to and would only be received by a routing socket associated
with that fib. (not yet implemented)
In addition Netstat has been edited to be able to cope with the
fact that the array is now 2 dimensional. (It looks in system
memory using libkvm (!)). Old versions of netstat see only the first FIB.
In addition two sysctls are added to give:
a) the number of FIBs compiled in (active)
b) the default FIB of the calling process.
Early testing experience:
-------------------------
Basically our (IronPort's) appliance does this functionality already
using ipfw fwd but that method has some drawbacks.
For example,
It can't fully simulate a routing table because it can't influence the
socket's choice of local address when a connect() is done.
Testing during the generating of these changes has been
remarkably smooth so far. Multiple tables have co-existed
with no notable side effects, and packets have been routed
accordingly.
ipfw has grown 2 new keywords:
setfib N ip from any to any
count ip from any to any fib N
In pf there seems to be a requirement to be able to give symbolic names to the
fibs but I do not have that capacity. I am not sure if it is required.
SCTP has interestingly enough built in support for this, called VRFs
in Cisco parlance. It will be interesting to see how that handles it
when it suddenly actually does something.
Where to next:
--------------------
After committing the ABI compatible version and MFCing it, I'd
like to proceed in a forward direction in -current. This will
result in some roto-tilling in the routing code.
Firstly: the current code's idea of having a separate tree per
protocol family, all of the same format, and pointed to by the
1 dimensional array is a bit silly. Especially when one considers that
there is code that makes assumptions about every protocol having the
same internal structures there. Some protocols don't WANT that
sort of structure. (for example the whole idea of a netmask is foreign
to appletalk). This needs to be made opaque to the external code.
My suggested first change is to add routing method pointers to the
'domain' structure, along with information pointing to the data.
Instead of having an array of pointers to uniform structures,
there would be an array pointing to the 'domain' structures
for each protocol address domain (protocol family),
and the methods this reached would be called. The methods would have
an argument that gives FIB number, but the protocol would be free
to ignore it.
When the ABI can be changed it raises the possibility of the
addition of a fib entry into the "struct route". Currently,
the structure contains the sockaddr of the destination, and the resulting
fib entry. To make this work fully, one could add a fib number
so that given an address and a fib, one can find the third element, the
fib entry.
Interaction with the ARP layer/ LL layer would need to be
revisited as well. Qing Li has been working on this already.
This work was sponsored by Ironport Systems/Cisco
Reviewed by: several including rwatson, bz and mlair (parts each)
Obtained from: Ironport systems/Cisco
2008-05-09 23:03:00 +00:00
|
|
|
u_int16_t inc_fibnum; /* XXX was pad, 16 bits is plenty */
|
2003-11-20 20:07:39 +00:00
|
|
|
/* protocol dependent part */
|
2001-11-22 04:50:44 +00:00
|
|
|
struct in_endpoints inc_ie;
|
|
|
|
};
|
2008-12-17 12:52:34 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Flags for inc_flags.
|
|
|
|
*/
|
|
|
|
#define INC_ISIPV6 0x01
|
|
|
|
|
2001-11-22 04:50:44 +00:00
|
|
|
#define inc_isipv6 inc_flags /* temp compatibility */
|
|
|
|
#define inc_fport inc_ie.ie_fport
|
|
|
|
#define inc_lport inc_ie.ie_lport
|
|
|
|
#define inc_faddr inc_ie.ie_faddr
|
|
|
|
#define inc_laddr inc_ie.ie_laddr
|
|
|
|
#define inc6_faddr inc_ie.ie6_faddr
|
|
|
|
#define inc6_laddr inc_ie.ie6_laddr
|
2014-09-10 16:26:18 +00:00
|
|
|
#define inc6_zoneid inc_ie.ie6_zoneid
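A brief sketch of how a consumer might use `inc_flags` to decide which view of the address union is valid. The structure below is a flattened, hypothetical stand-in for struct in_conninfo (the real one nests struct in_endpoints), and `demo_copy_faddr()` is an invented helper, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_INC_ISIPV6	0x01	/* mirrors INC_ISIPV6 */

/* Flattened, hypothetical stand-in for struct in_conninfo. */
struct demo_conninfo {
	uint8_t		inc_flags;
	uint8_t		inc_len;
	uint16_t	inc_fibnum;
	uint16_t	fport;
	uint16_t	lport;
	union {
		struct {
			uint32_t pad32[3];
			uint32_t addr4;	/* IPv4 view, last 4 bytes */
		} v46;
		uint8_t	addr6[16];	/* IPv6 view, all 16 bytes */
	} faddr;
};

/*
 * Copies whichever view of the foreign address is valid into out,
 * which must hold at least 16 bytes, and returns the number of
 * bytes copied.
 */
static size_t
demo_copy_faddr(const struct demo_conninfo *inc, uint8_t *out)
{
	if (inc->inc_flags & DEMO_INC_ISIPV6) {
		memcpy(out, inc->faddr.addr6, sizeof(inc->faddr.addr6));
		return (sizeof(inc->faddr.addr6));
	}
	memcpy(out, &inc->faddr.v46.addr4, sizeof(inc->faddr.v46.addr4));
	return (sizeof(inc->faddr.v46.addr4));
}
```

Testing the flag with a bit mask, as here, stays correct if further flag bits are added to `inc_flags` later.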
|
2001-11-22 04:50:44 +00:00
|
|
|
|
1999-11-22 02:45:11 +00:00
|
|
|
struct icmp6_filter;
|
|
|
|
|
2008-07-08 17:22:59 +00:00
|
|
|
/*-
|
2015-08-03 12:13:54 +00:00
|
|
|
* struct inpcb captures the network layer state for TCP, UDP, and raw IPv4 and
|
|
|
|
* IPv6 sockets. In the case of TCP and UDP, further per-connection state is
|
2008-07-08 17:22:59 +00:00
|
|
|
* hung off of inp_ppcb most of the time. Almost all fields of struct inpcb
|
|
|
|
* are static after creation or protected by a per-inpcb rwlock, inp_lock. A
|
2015-08-03 12:13:54 +00:00
|
|
|
* few fields are protected by multiple locks as indicated in the locking notes
|
|
|
|
* below. For these fields, all of the listed locks must be write-locked for
|
|
|
|
* any modifications. However, these fields can be safely read while any one of
|
|
|
|
* the listed locks is read-locked. This model can permit greater concurrency
|
|
|
|
* for read operations. For example, connections can be looked up while only
|
|
|
|
* holding a read lock on the global pcblist lock. This is important for
|
|
|
|
* performance when attempting to find the connection for a packet given its IP
|
|
|
|
* and port tuple.
|
|
|
|
*
|
|
|
|
* One noteworthy exception is that the global pcbinfo lock follows a different
|
|
|
|
* set of rules in relation to the inp_list field. Rather than being
|
|
|
|
* write-locked for modifications and read-locked for list iterations, it must
|
|
|
|
* be read-locked during modifications and write-locked during list iterations.
|
|
|
|
* This ensures that the relatively rare global list iterations safely walk a
|
|
|
|
* stable snapshot of connections while allowing more common list modifications
|
|
|
|
* to safely grab the pcblist lock just while adding or removing a connection
|
|
|
|
* from the global list.
|
2008-07-08 17:22:59 +00:00
|
|
|
*
|
|
|
|
* Key:
|
|
|
|
* (c) - Constant after initialization
|
Implement a CPU-affine TCP and UDP connection lookup data structure,
struct inpcbgroup. pcbgroups, or "connection groups", supplement the
existing inpcbinfo connection hash table, which when pcbgroups are
enabled, might now be thought of more usefully as a per-protocol
4-tuple reservation table.
Connections are assigned to connection groups based on a hash of their
4-tuple; wildcard sockets require special handling, and are members
of all connection groups. During a connection lookup, a
per-connection group lock is employed rather than the global pcbinfo
lock. By aligning connection groups with input path processing,
connection groups take on an effective CPU affinity, especially when
aligned with RSS work placement (see a forthcoming commit for
details). This eliminates cache line migration associated with
global, protocol-layer data structures in steady state TCP and UDP
processing (with the exception of protocol-layer statistics; further
commit to follow).
Elements of this approach were inspired by Willman, Rixner, and Cox's
2006 USENIX paper, "An Evaluation of Network Stack Parallelization
Strategies in Modern Operating Systems". However, there are also
significant differences: we maintain the inpcb lock, rather than using
the connection group lock for per-connection state.
Likewise, the focus of this implementation is alignment with NIC
packet distribution strategies such as RSS, rather than pure software
strategies. Despite that focus, software distribution is supported
through the parallel netisr implementation, and works well in
configurations where the number of hardware threads is greater than
the number of NIC input queues, such as in the RMI XLR threaded MIPS
architecture.
Another important difference is the continued maintenance of existing
hash tables as "reservation tables" -- these are useful both to
distinguish the resource allocation aspect of protocol name management
and the more common-case lookup aspect. In configurations where
connection tables are aligned with hardware hashes, it is desirable to
use the traditional lookup tables for loopback or encapsulated traffic
rather than take the expense of hardware hashes that are hard to
implement efficiently in software (such as RSS Toeplitz).
Connection group support is enabled by compiling "options PCBGROUP"
into your kernel configuration; for the time being, this is an
experimental feature, and hence is not enabled by default.
Subject to the limited MFCability of change dependencies in inpcb,
and its change to the inpcbinfo init function signature, this change
in principle could be merged to FreeBSD 8.x.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-06-06 12:55:02 +00:00
|
|
|
* (g) - Protected by the pcbgroup lock
|
2008-07-08 17:22:59 +00:00
|
|
|
* (i) - Protected by the inpcb lock
|
|
|
|
* (p) - Protected by the pcbinfo lock for the inpcb
|
2015-08-03 12:13:54 +00:00
|
|
|
* (l) - Protected by the pcblist lock for the inpcb
|
|
|
|
* (h) - Protected by the pcbhash lock for the inpcb
|
2008-07-08 17:22:59 +00:00
|
|
|
* (s) - Protected by another subsystem's locks
|
|
|
|
* (x) - Undefined locking
|
|
|
|
*
|
|
|
|
* A few other notes:
|
|
|
|
*
|
|
|
|
* When a read lock is held, stability of the field is guaranteed; to write
|
|
|
|
* to a field, a write lock must generally be held.
|
|
|
|
*
|
|
|
|
* netinet/netinet6-layer code should not assume that the inp_socket pointer
|
|
|
|
* is safe to dereference without inp_lock being held, even for protocols
|
|
|
|
* other than TCP (where the inpcb persists during TIMEWAIT even after the
|
|
|
|
* socket has been freed), or there may be close(2)-related races.
|
|
|
|
*
|
|
|
|
* The inp_vflag field is overloaded, and would otherwise ideally be (c).
|
2015-08-03 12:13:54 +00:00
|
|
|
*
|
|
|
|
* TODO: Currently only the TCP stack is leveraging the global pcbinfo lock
|
|
|
|
* read-lock usage during modification; this model can be applied to other
|
|
|
|
* protocols (especially SCTP).
|
2008-07-08 17:22:59 +00:00
|
|
|
*/
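The rule for fields annotated with more than one lock, such as (h/i), can be modelled with a small single-threaded sketch. This is not real locking: it only records a lock state per lock and checks the rule stated in the comment above. It assumes that holding a lock for writing is at least as strong as holding it for reading. All `demo_` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy lock state; a model for checking the rule, not a real lock. */
enum demo_lock_state { DEMO_UNLOCKED, DEMO_RLOCKED, DEMO_WLOCKED };

/* The locks listed for one field, e.g. an "(h/i)" annotation. */
struct demo_field_locks {
	enum demo_lock_state	hash_lock;	/* stands in for (h) */
	enum demo_lock_state	inp_lock;	/* stands in for (i) */
};

/* Modification requires every listed lock to be write-locked. */
static bool
demo_may_write(const struct demo_field_locks *l)
{
	return (l->hash_lock == DEMO_WLOCKED && l->inp_lock == DEMO_WLOCKED);
}

/* Reading is safe while any one listed lock is held. */
static bool
demo_may_read(const struct demo_field_locks *l)
{
	return (l->hash_lock != DEMO_UNLOCKED || l->inp_lock != DEMO_UNLOCKED);
}
```

Since a writer must hold all of the listed locks, a reader holding any single one of them excludes every writer, which is where the extra read concurrency comes from.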
|
1994-05-24 10:09:53 +00:00
|
|
|
struct inpcb {
|
2015-08-03 12:13:54 +00:00
|
|
|
LIST_ENTRY(inpcb) inp_hash; /* (h/i) hash list */
|
2011-06-06 12:55:02 +00:00
|
|
|
LIST_ENTRY(inpcb) inp_pcbgrouphash; /* (g/i) hash list */
|
2015-08-03 12:13:54 +00:00
|
|
|
LIST_ENTRY(inpcb) inp_list; /* (p/l) list for all PCBs for proto */
|
|
|
|
/* (p[w]) for list iteration */
|
|
|
|
/* (p[r]/l) for addition/removal */
|
2008-07-08 17:22:59 +00:00
|
|
|
void *inp_ppcb; /* (i) pointer to per-protocol pcb */
|
|
|
|
struct inpcbinfo *inp_pcbinfo; /* (c) PCB list info */
|
2011-06-06 12:55:02 +00:00
|
|
|
struct inpcbgroup *inp_pcbgroup; /* (g/i) PCB group list */
|
2015-08-03 12:13:54 +00:00
|
|
|
LIST_ENTRY(inpcb) inp_pcbgroup_wild; /* (g/i/h) group wildcard entry */
|
2008-12-09 10:21:38 +00:00
|
|
|
struct socket *inp_socket; /* (i) back pointer to socket */
|
2008-10-04 15:06:34 +00:00
|
|
|
struct ucred *inp_cred; /* (c) cache of socket cred */
|
2008-12-09 10:21:38 +00:00
|
|
|
u_int32_t inp_flow; /* (i) IPv6 flow information */
|
2008-07-08 17:22:59 +00:00
|
|
|
int inp_flags; /* (i) generic IP/datagram flags */
|
2009-04-15 22:09:42 +00:00
|
|
|
int inp_flags2; /* (i) generic IP/datagram flags #2*/
|
2008-07-08 17:22:59 +00:00
|
|
|
u_char inp_vflag; /* (i) IP version flag (v4/v6) */
|
|
|
|
u_char inp_ip_ttl; /* (i) time to live proto */
|
|
|
|
u_char inp_ip_p; /* (c) protocol proto */
|
|
|
|
u_char inp_ip_minttl; /* (i) minimum TTL or drop */
|
2009-04-10 06:16:14 +00:00
|
|
|
uint32_t inp_flowid; /* (x) flow id / queue id */
|
Add a reference count to struct inpcb, which may be explicitly
incremented using in_pcbref(), and decremented using in_pcbfree()
or in_pcbrele(). Protocols using only current in_pcballoc() and
in_pcbfree() calls will see the same semantics, but it is now
possible for TCP to call in_pcbref() and in_pcbrele() to prevent
an inpcb from being freed when both tcbinfo and per-inpcb locks
are released. This makes it possible to safely transition from
holding only the inpcb lock to both tcbinfo and inpcb lock
without re-looking up a connection in the input path, timer
path, etc.
Notice that in_pcbrele() does not unlock the connection after
decrementing the refcount, if the connection remains, so that
the caller can continue to use it; in_pcbrele() returns a flag
indicating whether or not the inpcb pointer is still valid, and
in_pcbfree() is now a simple wrapper around in_pcbrele().
MFC after: 1 month
Discussed with: bz, kmacy
Reviewed by: bz, gnn, kmacy
Tested by: kmacy
2008-12-08 20:18:50 +00:00
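The reference-count behaviour described in this commit message can be illustrated with a small user-space sketch. It is not the kernel code: `struct pcb`, `pcb_alloc()`, `pcb_ref()` and `pcb_rele()` are hypothetical stand-ins, there is no locking or atomicity, and the convention that the release routine returns true when the object was freed is an assumption made for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct inpcb; only the refcount is modelled. */
struct pcb {
	unsigned int refcount;	/* starts at 1, owned by the allocator */
};

static struct pcb *
pcb_alloc(void)
{
	struct pcb *p = calloc(1, sizeof(*p));

	if (p != NULL)
		p->refcount = 1;
	return (p);
}

/* Take an extra reference so the object survives while locks are dropped. */
static void
pcb_ref(struct pcb *p)
{
	assert(p->refcount > 0);
	p->refcount++;
}

/*
 * Drop one reference.  Returns true if this was the last reference and the
 * object was freed, meaning the caller's pointer is no longer valid.
 */
static bool
pcb_rele(struct pcb *p)
{
	assert(p->refcount > 0);
	if (--p->refcount > 0)
		return (false);
	free(p);
	return (true);
}
```

A caller that needs to drop and re-take locks would call `pcb_ref()` first, release its locks, re-acquire them in the required order, and then call `pcb_rele()`, using the object again only if the return value says it is still valid.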
|
|
|
u_int inp_refcount; /* (i) refcount */
|
2011-07-17 21:15:20 +00:00
|
|
|
void *inp_pspare[5]; /* (x) route caching / general use */
|
2014-05-18 22:30:12 +00:00
|
|
|
uint32_t inp_flowtype; /* (x) M_HASHTYPE value */
|
2014-07-10 03:10:56 +00:00
|
|
|
uint32_t inp_rss_listen_bucket; /* (x) overridden RSS listen bucket */
|
|
|
|
u_int inp_ispare[4]; /* (x) route caching / user cookie /
|
2011-07-17 21:15:20 +00:00
|
|
|
* general use */
|
2007-12-07 01:46:13 +00:00
|
|
|
|
|
|
|
/* Local and foreign ports, local and foreign addr. */
|
2015-08-03 12:13:54 +00:00
|
|
|
struct in_conninfo inp_inc; /* (i) list for PCB's local port */
|
2007-12-07 01:46:13 +00:00
|
|
|
|
2008-12-09 10:21:38 +00:00
|
|
|
/* MAC and IPSEC policy information. */
|
2008-07-08 17:22:59 +00:00
|
|
|
struct label *inp_label; /* (i) MAC label */
|
|
|
|
struct inpcbpolicy *inp_sp; /* (s) for IPSEC */
|
1999-11-22 02:45:11 +00:00
|
|
|
|
2007-04-30 23:12:05 +00:00
|
|
|
/* Protocol-dependent part; options. */
|
1999-11-22 02:45:11 +00:00
|
|
|
struct {
|
2008-07-08 17:22:59 +00:00
|
|
|
u_char inp4_ip_tos; /* (i) type of service proto */
|
|
|
|
struct mbuf *inp4_options; /* (i) IP options */
|
2008-12-09 10:21:38 +00:00
|
|
|
struct ip_moptions *inp4_moptions; /* (i) IP mcast options */
|
1999-11-22 02:45:11 +00:00
|
|
|
} inp_depend4;
|
|
|
|
struct {
|
2008-07-08 17:22:59 +00:00
|
|
|
/* (i) IP options */
|
1999-11-22 02:45:11 +00:00
|
|
|
struct mbuf *inp6_options;
|
2008-07-08 17:22:59 +00:00
|
|
|
/* (i) IP6 options for outgoing packets */
|
1999-11-22 02:45:11 +00:00
|
|
|
struct ip6_pktopts *inp6_outputopts;
|
2008-07-08 17:22:59 +00:00
|
|
|
/* (i) IP multicast options */
|
1999-11-22 02:45:11 +00:00
|
|
|
struct ip6_moptions *inp6_moptions;
|
2008-07-08 17:22:59 +00:00
|
|
|
/* (i) ICMPv6 code type filter */
|
1999-11-22 02:45:11 +00:00
|
|
|
struct icmp6_filter *inp6_icmp6filt;
|
2008-07-08 17:22:59 +00:00
|
|
|
/* (i) IPV6_CHECKSUM setsockopt */
|
1999-11-22 02:45:11 +00:00
|
|
|
int inp6_cksum;
|
|
|
|
short inp6_hops;
|
|
|
|
} inp_depend6;
|
2015-08-03 12:13:54 +00:00
|
|
|
LIST_ENTRY(inpcb) inp_portlist; /* (i/h) */
|
|
|
|
struct inpcbport *inp_phd; /* (i/h) head of this list */
|
2006-07-18 22:34:27 +00:00
|
|
|
#define inp_zero_size offsetof(struct inpcb, inp_gencnt)
|
2008-12-09 10:21:38 +00:00
|
|
|
inp_gen_t inp_gencnt; /* (c) generation count */
|
2009-04-16 22:47:43 +00:00
|
|
|
struct llentry *inp_lle; /* cached L2 information */
|
|
|
|
struct rtentry *inp_rt; /* cached L3 information */
|
2008-04-17 21:38:18 +00:00
|
|
|
struct rwlock inp_lock;
|
2008-12-09 10:21:38 +00:00
|
|
|
};
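The `inp_zero_size` macro defined inside the structure gives the offset of `inp_gencnt`, which suggests that everything before that field can be cleared in one step while later fields (generation count, cached entries, lock) persist. A minimal sketch of that idea, using a hypothetical reduced structure rather than the real `struct inpcb`:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical, reduced control block: fields before gencnt are reset on
 * reuse, fields from gencnt onwards are preserved.
 */
struct pcb {
	int			flags;
	unsigned short		lport;
	unsigned long long	gencnt;		/* first preserved field */
	int			lock_state;	/* stand-in for a lock */
};

#define pcb_zero_size	offsetof(struct pcb, gencnt)

/* Clear only the leading part of the structure. */
static void
pcb_reset(struct pcb *p)
{
	memset(p, 0, pcb_zero_size);
}
```

Placing the preserved fields at the end of the structure is what makes a single `memset()` sufficient; the field order is therefore part of the contract.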
|
|
|
|
#define inp_fport inp_inc.inc_fport
|
|
|
|
#define inp_lport inp_inc.inc_lport
|
|
|
|
#define inp_faddr inp_inc.inc_faddr
|
|
|
|
#define inp_laddr inp_inc.inc_laddr
|
|
|
|
#define inp_ip_tos inp_depend4.inp4_ip_tos
|
|
|
|
#define inp_options inp_depend4.inp4_options
|
|
|
|
#define inp_moptions inp_depend4.inp4_moptions
|
2002-06-10 20:05:46 +00:00
|
|
|
|
2001-11-22 04:50:44 +00:00
|
|
|
#define in6p_faddr inp_inc.inc6_faddr
|
|
|
|
#define in6p_laddr inp_inc.inc6_laddr
|
2014-09-10 16:26:18 +00:00
|
|
|
#define in6p_zoneid inp_inc.inc6_zoneid
|
1999-11-22 02:45:11 +00:00
|
|
|
#define in6p_hops inp_depend6.inp6_hops /* default hop limit */
|
|
|
|
#define in6p_flowinfo inp_flow
|
|
|
|
#define in6p_options inp_depend6.inp6_options
|
|
|
|
#define in6p_outputopts inp_depend6.inp6_outputopts
|
|
|
|
#define in6p_moptions inp_depend6.inp6_moptions
|
|
|
|
#define in6p_icmp6filt inp_depend6.inp6_icmp6filt
|
|
|
|
#define in6p_cksum inp_depend6.inp6_cksum
|
2008-12-09 10:21:38 +00:00
|
|
|
|
Permit building kernels with options VIMAGE, restricted to only a single
active network stack instance. Turning on options VIMAGE at compile
time yields the following changes relative to default kernel build:
1) V_ accessor macros for virtualized variables resolve to structure
fields via base pointers, instead of being resolved as fields in global
structs or plain global variables. As an example, V_ifnet becomes:
options VIMAGE: ((struct vnet_net *) vnet_net)->_ifnet
default build: vnet_net_0._ifnet
options VIMAGE_GLOBALS: ifnet
2) INIT_VNET_* macros will declare and set up base pointers to be used
by V_ accessor macros, instead of resolving to whitespace:
INIT_VNET_NET(ifp->if_vnet); becomes
struct vnet_net *vnet_net = (ifp->if_vnet)->mod_data[VNET_MOD_NET];
3) Memory for vnet modules registered via vnet_mod_register() is now
allocated at run time in sys/kern/kern_vimage.c, instead of per vnet
module structs being declared as globals. If required, vnet modules
can now request the framework to provide them with allocated bzeroed
memory by filling in the vmi_size field in their vmi_modinfo structures.
4) structs socket, ifnet, inpcbinfo, tcpcb and syncache_head are
extended to hold a pointer to the parent vnet. options VIMAGE builds
will fill in those fields as required.
5) curvnet is introduced as a new global variable in options VIMAGE
builds, always pointing to the default and only struct vnet.
6) struct sysctl_oid has been extended with additional two fields to
store major and minor virtualization module identifiers, oid_v_subs and
oid_v_mod. SYSCTL_V_* family of macros will fill in those fields
accordingly, and store the offset in the appropriate vnet container
struct in oid_arg1.
In sysctl handlers dealing with virtualized sysctls, the
SYSCTL_RESOLVE_V_ARG1() macro will compute the address of the target
variable and make it available in arg1 variable for further processing.
Unused fields in structs vnet_inet, vnet_inet6 and vnet_ipfw have
been deleted.
Reviewed by: bz, rwatson
Approved by: julian (mentor)
2009-04-30 13:36:26 +00:00
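Points 1) and 2) of this commit message describe accessors that resolve through a base pointer set up by an INIT_VNET_* macro. The following is a reduced sketch of that pattern only: `_ifcount`, `V_ifcount`, `bump_ifcount()` and the one-slot `mod_data` array are hypothetical names chosen for illustration, not the kernel's definitions.

```c
#include <assert.h>

#define VNET_MOD_NET	0	/* hypothetical module index */
#define VNET_MOD_MAX	1

/* Container for the virtualized variables of one module. */
struct vnet_net {
	int _ifcount;		/* stand-in for a virtualized variable */
};

/* One network stack instance; holds a pointer per registered module. */
struct vnet {
	void *mod_data[VNET_MOD_MAX];
};

/* Declare and set up the base pointer used by the V_ accessor below. */
#define INIT_VNET_NET(vnet) \
	struct vnet_net *vnet_net = (vnet)->mod_data[VNET_MOD_NET]

/* The accessor resolves to a field reached through the base pointer. */
#define V_ifcount	(vnet_net->_ifcount)

static int
bump_ifcount(struct vnet *vnet)
{
	INIT_VNET_NET(vnet);

	return (++V_ifcount);
}
```

Because the accessor goes through a local pointer, the same function body operates on whichever instance the caller passes in, which is the property a multi-instance stack would need.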
|
|
|
#define inp_vnet inp_pcbinfo->ipi_vnet
|
|
|
|
|
1998-03-24 18:06:34 +00:00
|
|
|
/*
|
2007-04-30 23:12:05 +00:00
|
|
|
* The range of the generation count, as used in this implementation, is 9e18.
|
|
|
|
* We would have to create 300 billion connections per second for this number
|
|
|
|
* to roll over in a year. This seems sufficiently unlikely that we simply
|
|
|
|
* don't concern ourselves with that possibility.
|
1998-03-24 18:06:34 +00:00
|
|
|
*/
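The rollover estimate in the comment can be checked with a few lines of arithmetic. This sketch assumes a 365-day year and treats the counter as having 2^63 (about 9.2e18) usable values, which is the range that reproduces the figure of roughly 300 billion connections per second; the full unsigned 64-bit range would double that rate.

```c
#include <assert.h>

/* Seconds in a 365-day year. */
#define SECS_PER_YEAR	(365.0 * 24 * 60 * 60)

/*
 * Rate (events per second) needed to exhaust `range` counter values in
 * exactly one year.
 */
static double
rollover_rate(double range)
{
	return (range / SECS_PER_YEAR);
}
```

For example, `rollover_rate(9223372036854775808.0)` is about 2.9e11, that is, a little under 300 billion allocations or frees per second sustained for a year.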
|
Improved connection establishment performance by doing local port lookups via
a hashed port list. In the new scheme, in_pcblookup() goes away and is
replaced by a new routine, in_pcblookup_local() for doing the local port
check. Note that this implementation is space inefficient in that the PCB
struct is now too large to fit into 128 bytes. I might deal with this in the
future by using the new zone allocator, but I wanted these changes to be
extensively tested in their current form first.
Also:
1) Fixed off-by-one errors in the port lookup loops in in_pcbbind().
2) Got rid of some unneeded rehashing. Added a new routine, in_pcbinshash(),
to do the initial hash insertion.
3) Renamed in_pcblookuphash() to in_pcblookup_hash() for easier readability.
4) Added a new routine, in_pcbremlists() to remove the PCB from the various
hash lists.
5) Added/deleted comments where appropriate.
6) Removed unnecessary splnet() locking. In general, the PCB functions should
be called at splnet()...there are unfortunately a few exceptions, however.
7) Reorganized a few structs for better cache line behavior.
8) Killed my TCP_ACK_HACK kludge. It may come back in a different form in
the future, however.
These changes have been tested on wcarchive for more than a month. In tests
done here, connection establishment overhead is reduced by more than 50
times, thus getting rid of one of the major networking scalability problems.
Still to do: make tcp_fasttimo/tcp_slowtimo scale well for systems with a
large number of connections. tcp_fasttimo is easy; tcp_slowtimo is difficult.
WARNING: Anything that knows about inpcb and tcpcb structs will have to be
recompiled; at the very least, this includes netstat(1).
1998-01-27 09:15:13 +00:00
|
|
|
|
1998-05-15 20:11:40 +00:00
|
|
|
/*
|
2007-04-30 23:12:05 +00:00
|
|
|
* Interface exported to userland by various protocols which use inpcbs. Hack
|
|
|
|
* alert -- only define if struct xsocket is in scope.
|
1998-05-15 20:11:40 +00:00
|
|
|
*/
|
|
|
|
#ifdef _SYS_SOCKETVAR_H_
|
|
|
|
struct xinpcb {
|
|
|
|
size_t xi_len; /* length of this structure */
|
|
|
|
struct inpcb xi_inp;
|
|
|
|
struct xsocket xi_socket;
|
|
|
|
u_quad_t xi_alignment_hack;
|
|
|
|
};
|
|
|
|
|
|
|
|
struct xinpgen {
|
|
|
|
size_t xig_len; /* length of this structure */
|
|
|
|
u_int xig_count; /* number of PCBs at this time */
|
|
|
|
inp_gen_t xig_gen; /* generation count at this time */
|
|
|
|
so_gen_t xig_sogen; /* socket generation count at this time */
|
|
|
|
};
|
|
|
|
#endif /* _SYS_SOCKETVAR_H_ */
|
|
|
|
|
1998-01-27 09:15:13 +00:00
|
|
|
struct inpcbport {
|
2000-05-26 02:09:24 +00:00
|
|
|
LIST_ENTRY(inpcbport) phd_hash;
|
1998-01-27 09:15:13 +00:00
|
|
|
struct inpcbhead phd_pcblist;
|
|
|
|
u_short phd_port;
|
1994-05-24 10:09:53 +00:00
|
|
|
};
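`struct inpcbport` is the per-port list head used by the hashed local port lookup. The idea can be sketched in user space as follows. Everything here is a simplification under stated assumptions: hand-rolled singly linked lists replace the `LIST_*` macros, `PORTHASH_SIZE`, `port_bucket()`, `port_insert()` and `lookup_local()` are hypothetical names, and byte order and wildcard matching rules are ignored.

```c
#include <assert.h>
#include <stddef.h>

#define PORTHASH_SIZE	8	/* hypothetical; must be a power of two */

struct pcb {
	struct pcb	*next;		/* next pcb bound to the same port */
	unsigned int	 laddr;		/* local address, 0 = wildcard */
};

struct pcbport {
	struct pcbport	*hash_next;	/* next port head in this bucket */
	struct pcb	*pcblist;	/* pcbs bound to phd_port */
	unsigned short	 phd_port;
};

static struct pcbport *porthash[PORTHASH_SIZE];

static unsigned int
port_bucket(unsigned short port)
{
	return (port & (PORTHASH_SIZE - 1));
}

/* Link a port head into its hash bucket. */
static void
port_insert(struct pcbport *phd)
{
	unsigned int b = port_bucket(phd->phd_port);

	phd->hash_next = porthash[b];
	porthash[b] = phd;
}

/* Find the pcb bound to (laddr, lport), visiting only one hash bucket. */
static struct pcb *
lookup_local(unsigned int laddr, unsigned short lport)
{
	struct pcbport *phd;
	struct pcb *p;

	for (phd = porthash[port_bucket(lport)]; phd != NULL;
	    phd = phd->hash_next) {
		if (phd->phd_port != lport)
			continue;
		for (p = phd->pcblist; p != NULL; p = p->next)
			if (p->laddr == laddr)
				return (p);
		return (NULL);
	}
	return (NULL);
}
```

The cost of a local port check then depends on the number of ports sharing a bucket and the number of sockets on one port, rather than on the total number of connections.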
|
|
|
|
|
2011-05-23 13:51:57 +00:00
|
|
|
/*-
|
2007-04-30 23:12:05 +00:00
|
|
|
* Global data structure for each high-level protocol (UDP, TCP, ...) in both
|
|
|
|
* IPv4 and IPv6. Holds inpcb lists and information for managing them.
|
2011-05-23 13:51:57 +00:00
|
|
|
*
|
2015-08-03 12:13:54 +00:00
|
|
|
* Each pcbinfo is protected by three locks: ipi_lock, ipi_hash_lock and
|
|
|
|
* ipi_list_lock:
|
|
|
|
* - ipi_lock covering the global pcb list stability during loop iteration,
|
|
|
|
* - ipi_hash_lock covering the hashed lookup tables,
|
|
|
|
* - ipi_list_lock covering mutable global fields (such as the global
|
|
|
|
* pcb list)
|
|
|
|
*
|
|
|
|
* The lock order is:
|
2011-05-23 13:51:57 +00:00
|
|
|
*
|
2015-08-03 12:13:54 +00:00
|
|
|
* ipi_lock (before)
|
|
|
|
* inpcb locks (before)
|
|
|
|
* ipi_list locks (before)
|
|
|
|
* {ipi_hash_lock, pcbgroup locks}
|
2011-05-23 13:51:57 +00:00
|
|
|
*
|
|
|
|
* Locking key:
|
|
|
|
*
|
|
|
|
* (c) Constant or nearly constant after initialisation
|
|
|
|
* (g) Locked by ipi_lock
|
2015-08-03 12:13:54 +00:00
|
|
|
* (l) Locked by ipi_list_lock
|
Implement a CPU-affine TCP and UDP connection lookup data structure,
struct inpcbgroup. pcbgroups, or "connection groups", supplement the
existing inpcbinfo connection hash table, which when pcbgroups are
enabled, might now be thought of more usefully as a per-protocol
4-tuple reservation table.
Connections are assigned to connection groups based on a hash of their
4-tuple; wildcard sockets require special handling, and are members
of all connection groups. During a connection lookup, a
per-connection group lock is employed rather than the global pcbinfo
lock. By aligning connection groups with input path processing,
connection groups take on an effective CPU affinity, especially when
aligned with RSS work placement (see a forthcoming commit for
details). This eliminates cache line migration associated with
global, protocol-layer data structures in steady state TCP and UDP
processing (with the exception of protocol-layer statistics; further
commit to follow).
Elements of this approach were inspired by Willman, Rixner, and Cox's
2006 USENIX paper, "An Evaluation of Network Stack Parallelization
Strategies in Modern Operating Systems". However, there are also
significant differences: we maintain the inpcb lock, rather than using
the connection group lock for per-connection state.
Likewise, the focus of this implementation is alignment with NIC
packet distribution strategies such as RSS, rather than pure software
strategies. Despite that focus, software distribution is supported
through the parallel netisr implementation, and works well in
configurations where the number of hardware threads is greater than
the number of NIC input queues, such as in the RMI XLR threaded MIPS
architecture.
Another important difference is the continued maintenance of existing
hash tables as "reservation tables" -- these are useful both to
distinguish the resource allocation aspect of protocol name management
and the more common-case lookup aspect. In configurations where
connection tables are aligned with hardware hashes, it is desirable to
use the traditional lookup tables for loopback or encapsulated traffic
rather than take the expense of hardware hashes that are hard to
implement efficiently in software (such as RSS Toeplitz).
Connection group support is enabled by compiling "options PCBGROUP"
into your kernel configuration; for the time being, this is an
experimental feature, and hence is not enabled by default.
Subject to the limited MFCability of change dependencies in inpcb,
and its change to the inpcbinfo init function signature, this change
in principle could be merged to FreeBSD 8.x.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-06-06 12:55:02 +00:00
|
|
|
* (h) Read using either ipi_hash_lock or inpcb lock; write requires both
|
|
|
|
* (p) Protected by one or more pcbgroup locks
|
2011-05-23 13:51:57 +00:00
|
|
|
* (x) Synchronisation properties poorly defined
|
2007-04-30 23:12:05 +00:00
|
|
|
*/
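The lock order stated in the comment above can be expressed as a rank check. The sketch below is a user-space illustration only: it does not take real locks, the `lock_rank` values and helper names are hypothetical, and it simplifies by giving `ipi_hash_lock` and pcbgroup locks a single shared rank, so it rejects holding both at once.

```c
#include <assert.h>
#include <stdbool.h>

/* Ranks follow the documented order: lower ranks are acquired first. */
enum lock_rank {
	RANK_NONE = 0,
	RANK_IPI_LOCK,		/* ipi_lock */
	RANK_INPCB,		/* per-inpcb lock */
	RANK_IPI_LIST,		/* ipi_list_lock */
	RANK_HASH_OR_GROUP	/* ipi_hash_lock or a pcbgroup lock */
};

#define MAX_HELD	8

/* Locks currently held by one thread, most recent on top. */
struct held_locks {
	enum lock_rank	stack[MAX_HELD];
	int		depth;
};

/* True if acquiring `next` respects the order given what is held. */
static bool
order_ok(const struct held_locks *h, enum lock_rank next)
{
	if (h->depth == 0)
		return (true);
	return (h->stack[h->depth - 1] < next);
}

/* Record an acquisition; refuses out-of-order requests. */
static bool
acquire(struct held_locks *h, enum lock_rank next)
{
	if (h->depth == MAX_HELD || !order_ok(h, next))
		return (false);
	h->stack[h->depth++] = next;
	return (true);
}

/* Release the most recently acquired lock (LIFO only in this sketch). */
static void
release(struct held_locks *h)
{
	assert(h->depth > 0);
	h->depth--;
}
```

With this model, taking the inpcb lock and then the hash lock is accepted, while taking `ipi_list_lock` after the hash lock is refused.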
|
|
|
|
struct inpcbinfo {
|
|
|
|
/*
|
2015-08-03 12:13:54 +00:00
|
|
|
* Global lock protecting full inpcb list traversal
|
2007-04-30 23:12:05 +00:00
|
|
|
*/
|
2011-05-23 13:51:57 +00:00
|
|
|
struct rwlock ipi_lock;
|
2007-04-30 23:12:05 +00:00
|
|
|
|
|
|
|
/*
|
2011-05-23 13:51:57 +00:00
|
|
|
* Global list of inpcbs on the protocol.
|
2007-04-30 23:12:05 +00:00
|
|
|
*/
|
2015-08-03 12:13:54 +00:00
|
|
|
struct inpcbhead *ipi_listhead; /* (g/l) */
|
|
|
|
u_int ipi_count; /* (l) */
|
2007-04-30 23:12:05 +00:00
|
|
|
|
|
|
|
/*
|
2011-05-23 13:51:57 +00:00
|
|
|
* Generation count -- incremented each time a connection is allocated
|
|
|
|
* or freed.
|
2007-04-30 23:12:05 +00:00
|
|
|
*/
|
2015-08-03 12:13:54 +00:00
|
|
|
u_quad_t ipi_gencnt; /* (l) */
|
2007-04-30 23:12:05 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Fields associated with port lookup and allocation.
|
|
|
|
*/
|
2011-05-23 13:51:57 +00:00
|
|
|
u_short ipi_lastport; /* (x) */
|
|
|
|
u_short ipi_lastlow; /* (x) */
|
|
|
|
u_short ipi_lasthi; /* (x) */
|
2007-04-30 23:12:05 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* UMA zone from which inpcbs are allocated for this protocol.
|
|
|
|
*/
|
2011-05-23 13:51:57 +00:00
|
|
|
struct uma_zone *ipi_zone; /* (c) */
|
2007-04-30 23:12:05 +00:00
|
|
|
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need to acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. Two new lookup flags
supplement the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
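The rule that callers must pass exactly one of the two lock flags can be checked with a simple mask test. In this sketch the numeric flag values and the `INPLOOKUP_LOCKMASK` and `lookup_flags_valid()` names are assumptions made for illustration; only the three flag names come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values are assumptions chosen for illustration. */
#define INPLOOKUP_WILDCARD	0x00000001
#define INPLOOKUP_RLOCKPCB	0x00000002
#define INPLOOKUP_WLOCKPCB	0x00000004

/* Hypothetical helper mask covering the two lock-mode flags. */
#define INPLOOKUP_LOCKMASK	(INPLOOKUP_RLOCKPCB | INPLOOKUP_WLOCKPCB)

/* True if exactly one of the two lock flags is set. */
static bool
lookup_flags_valid(int flags)
{
	int lock = flags & INPLOOKUP_LOCKMASK;

	return (lock == INPLOOKUP_RLOCKPCB || lock == INPLOOKUP_WLOCKPCB);
}
```

A lookup routine could assert this condition on entry, so that a caller passing neither flag, or both, fails immediately instead of receiving an inpcb in an unexpected lock state.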
|
|
|
/*
|
2011-06-06 12:55:02 +00:00
|
|
|
* Connection groups associated with this protocol. These fields are
|
|
|
|
* constant, but pcbgroup structures themselves are protected by
|
|
|
|
* per-pcbgroup locks.
|
|
|
|
*/
|
|
|
|
struct inpcbgroup *ipi_pcbgroups; /* (c) */
|
|
|
|
u_int ipi_npcbgroups; /* (c) */
|
|
|
|
u_int ipi_hashfields; /* (c) */
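Given the `ipi_pcbgroups` array and its length `ipi_npcbgroups`, one plausible way to map a connection to a group is to reduce a hash of its 4-tuple modulo the group count. The sketch below assumes exactly that; `pcbgroup_byhash()` and `tuple_hash()` are hypothetical helpers, `ipg_cpu` is an illustrative field, and the FNV-1a hash is only a placeholder for whatever hash the NIC or stack actually uses (it is not Toeplitz).

```c
#include <assert.h>
#include <stdint.h>

struct pcbgroup {
	unsigned int ipg_cpu;	/* hypothetical: CPU this group is aligned with */
};

struct pcbinfo {
	struct pcbgroup	*ipi_pcbgroups;		/* array of groups */
	unsigned int	 ipi_npcbgroups;	/* number of groups */
};

/* Placeholder software hash (FNV-1a over the 4-tuple bytes). */
static uint32_t
tuple_hash(uint32_t laddr, uint16_t lport, uint32_t faddr, uint16_t fport)
{
	const uint32_t words[3] = {
		laddr, faddr, ((uint32_t)lport << 16) | fport
	};
	uint32_t h = 2166136261u;	/* FNV offset basis */
	int i, shift;

	for (i = 0; i < 3; i++) {
		for (shift = 0; shift < 32; shift += 8) {
			h ^= (words[i] >> shift) & 0xff;
			h *= 16777619u;	/* FNV prime */
		}
	}
	return (h);
}

/* Map a hash value to one of the protocol's connection groups. */
static struct pcbgroup *
pcbgroup_byhash(struct pcbinfo *pi, uint32_t hash)
{
	assert(pi->ipi_npcbgroups > 0);
	return (&pi->ipi_pcbgroups[hash % pi->ipi_npcbgroups]);
}
```

The same 4-tuple always lands in the same group, so if the hash matches the one used to steer packets to CPUs, lookups for a connection stay on the CPU that receives its packets.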
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Global lock protecting non-pcbgroup hash lookup tables.
|
2011-05-30 09:43:55 +00:00
|
|
|
*/
|
2013-05-06 16:42:18 +00:00
|
|
|
struct rwlock ipi_hash_lock;
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
|
|
|
|
2007-04-30 23:12:05 +00:00
|
|
|
/*
|
2011-05-23 13:51:57 +00:00
|
|
|
* Global hash of inpcbs, hashed by local and foreign addresses and
|
|
|
|
* port numbers.
|
2007-04-30 23:12:05 +00:00
|
|
|
*/
|
2011-05-30 09:43:55 +00:00
|
|
|
struct inpcbhead *ipi_hashbase; /* (h) */
|
|
|
|
u_long ipi_hashmask; /* (h) */
|
2011-05-23 13:51:57 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Global hash of inpcbs, hashed by only local port number.
|
|
|
|
*/
|
2011-05-30 09:43:55 +00:00
|
|
|
struct inpcbporthead *ipi_porthashbase; /* (h) */
|
|
|
|
u_long ipi_porthashmask; /* (h) */
|
2007-12-07 01:46:13 +00:00
|
|
|
|
Implement a CPU-affine TCP and UDP connection lookup data structure,
struct inpcbgroup. pcbgroups, or "connection groups", supplement the
existing inpcbinfo connection hash table, which when pcbgroups are
enabled, might now be thought of more usefully as a per-protocol
4-tuple reservation table.
Connections are assigned to connection groups based on a hash of their
4-tuple; wildcard sockets require special handling, and are members
of all connection groups. During a connection lookup, a
per-connection group lock is employed rather than the global pcbinfo
lock. By aligning connection groups with input path processing,
connection groups take on an effective CPU affinity, especially when
aligned with RSS work placement (see a forthcoming commit for
details). This eliminates cache line migration associated with
global, protocol-layer data structures in steady state TCP and UDP
processing (with the exception of protocol-layer statistics; further
commit to follow).
Elements of this approach were inspired by Willman, Rixner, and Cox's
2006 USENIX paper, "An Evaluation of Network Stack Parallelization
Strategies in Modern Operating Systems". However, there are also
significant differences: we maintain the inpcb lock, rather than using
the connection group lock for per-connection state.
Likewise, the focus of this implementation is alignment with NIC
packet distribution strategies such as RSS, rather than pure software
strategies. Despite that focus, software distribution is supported
through the parallel netisr implementation, and works well in
configurations where the number of hardware threads is greater than
the number of NIC input queues, such as in the RMI XLR threaded MIPS
architecture.
Another important difference is the continued maintenance of existing
hash tables as "reservation tables" -- these are useful both to
distinguish the resource allocation aspect of protocol name management
and the more common-case lookup aspect. In configurations where
connection tables are aligned with hardware hashes, it is desirable to
use the traditional lookup tables for loopback or encapsulated traffic
rather than take the expense of hardware hashes that are hard to
implement efficiently in software (such as RSS Toeplitz).
Connection group support is enabled by compiling "options PCBGROUP"
into your kernel configuration; for the time being, this is an
experimental feature, and hence is not enabled by default.
Subject to the limited MFCability of change dependencies in inpcb,
and its change to the inpcbinfo init function signature, this change
in principle could be merged to FreeBSD 8.x.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-06-06 12:55:02 +00:00
|
|
|
/*
|
|
|
|
* List of wildcard inpcbs for use with pcbgroups. In the past, was
|
|
|
|
* per-pcbgroup but is now global. All pcbgroup locks must be held
|
|
|
|
* to modify the list, so any is sufficient to read it.
|
|
|
|
*/
|
|
|
|
struct inpcbhead *ipi_wildbase; /* (p) */
|
|
|
|
u_long ipi_wildmask; /* (p) */
|
|
|
|
|
2007-12-07 01:46:13 +00:00
|
|
|
/*
|
Permit building kernels with options VIMAGE, restricted to only a single
active network stack instance. Turning on options VIMAGE at compile
time yields the following changes relative to default kernel build:
1) V_ accessor macros for virtualized variables resolve to structure
fields via base pointers, instead of being resolved as fields in global
structs or plain global variables. As an example, V_ifnet becomes:
options VIMAGE: ((struct vnet_net *) vnet_net)->_ifnet
default build: vnet_net_0._ifnet
options VIMAGE_GLOBALS: ifnet
2) INIT_VNET_* macros will declare and set up base pointers to be used
by V_ accessor macros, instead of resolving to whitespace:
INIT_VNET_NET(ifp->if_vnet); becomes
struct vnet_net *vnet_net = (ifp->if_vnet)->mod_data[VNET_MOD_NET];
3) Memory for vnet modules registered via vnet_mod_register() is now
allocated at run time in sys/kern/kern_vimage.c, instead of per vnet
module structs being declared as globals. If required, vnet modules
can now request the framework to provide them with allocated bzeroed
memory by filling in the vmi_size field in their vmi_modinfo structures.
4) structs socket, ifnet, inpcbinfo, tcpcb and syncache_head are
extended to hold a pointer to the parent vnet. options VIMAGE builds
will fill in those fields as required.
5) curvnet is introduced as a new global variable in options VIMAGE
builds, always pointing to the default and only struct vnet.
6) struct sysctl_oid has been extended with two additional fields to
store major and minor virtualization module identifiers, oid_v_subs and
oid_v_mod. SYSCTL_V_* family of macros will fill in those fields
accordingly, and store the offset in the appropriate vnet container
struct in oid_arg1.
In sysctl handlers dealing with virtualized sysctls, the
SYSCTL_RESOLVE_V_ARG1() macro will compute the address of the target
variable and make it available in arg1 variable for further processing.
Unused fields in structs vnet_inet, vnet_inet6 and vnet_ipfw have
been deleted.
Reviewed by: bz, rwatson
Approved by: julian (mentor)
2009-04-30 13:36:26 +00:00
|
|
|
* Pointer to network stack instance
|
|
|
|
*/
|
2011-05-23 13:51:57 +00:00
|
|
|
struct vnet *ipi_vnet; /* (c) */
|
2009-04-30 13:36:26 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* general use 2
|
2007-12-07 01:46:13 +00:00
|
|
|
*/
|
2008-08-07 09:06:04 +00:00
|
|
|
void *ipi_pspare[2];
|
2015-08-03 12:13:54 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Global lock protecting global inpcb list, inpcb count, etc.
|
|
|
|
*/
|
|
|
|
struct rwlock ipi_list_lock;
|
1995-04-09 01:29:31 +00:00
|
|
|
};
|
|
|
|
|
2012-03-17 21:51:39 +00:00
|
|
|
#ifdef _KERNEL
|
2011-06-06 12:55:02 +00:00
|
|
|
/*
|
|
|
|
* Connection groups hold sets of connections that have similar CPU/thread
|
|
|
|
* affinity. Each connection belongs to exactly one connection group.
|
|
|
|
*/
|
|
|
|
struct inpcbgroup {
|
|
|
|
/*
|
|
|
|
* Per-connection group hash of inpcbs, hashed by local and foreign
|
|
|
|
* addresses and port numbers.
|
|
|
|
*/
|
|
|
|
struct inpcbhead *ipg_hashbase; /* (c) */
|
|
|
|
u_long ipg_hashmask; /* (c) */
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Notional affinity of this pcbgroup.
|
|
|
|
*/
|
|
|
|
u_int ipg_cpu; /* (p) */
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Per-connection group lock, not to be confused with ipi_lock.
|
|
|
|
* Protects the hash table hung off the group, but also the global
|
|
|
|
* wildcard list in inpcbinfo.
|
|
|
|
*/
|
|
|
|
struct mtx ipg_lock;
|
|
|
|
} __aligned(CACHE_LINE_SIZE);
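Two rules stated in the comments above lend themselves to a short sketch: a connection is placed in a group chosen by hashing its 4-tuple, and the global wildcard list may be modified only with every group lock held, so holding any single group lock is enough to read it. The code below is a userland illustration under stated assumptions: pthread mutexes stand in for ipg_lock, a counter stands in for the wildcard list, the hash is a toy mix (not RSS Toeplitz), and all sk_* names and the group count are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define SK_NGROUPS	4	/* hypothetical number of connection groups */

struct sk_group {
	pthread_mutex_t	lock;		/* plays the role of ipg_lock */
};

struct sk_groups {
	struct sk_group	group[SK_NGROUPS];
	int		nwild;		/* stand-in for the global wildcard list */
};

static void
sk_groups_init(struct sk_groups *g)
{
	for (unsigned i = 0; i < SK_NGROUPS; i++)
		pthread_mutex_init(&g->group[i].lock, NULL);
	g->nwild = 0;
}

/* Toy 4-tuple hash for illustration only; this is not RSS Toeplitz. */
static uint32_t
sk_hash4(uint32_t laddr, uint16_t lport, uint32_t faddr, uint16_t fport)
{
	uint32_t h = laddr ^ faddr ^ (((uint32_t)lport << 16) | fport);

	h ^= h >> 16;
	return (h * 0x9e3779b1u);
}

/* Each fully specified connection maps to exactly one group. */
static unsigned
sk_group_index(uint32_t laddr, uint16_t lport, uint32_t faddr, uint16_t fport)
{
	return (sk_hash4(laddr, lport, faddr, fport) % SK_NGROUPS);
}

/* Writers take every group lock, in ascending order to avoid deadlock. */
static void
sk_wild_add(struct sk_groups *g)
{
	for (unsigned i = 0; i < SK_NGROUPS; i++)
		pthread_mutex_lock(&g->group[i].lock);
	g->nwild++;
	for (unsigned i = SK_NGROUPS; i-- > 0;)
		pthread_mutex_unlock(&g->group[i].lock);
}

/* Readers need only one group lock, since writers hold all of them. */
static int
sk_wild_count(struct sk_groups *g, unsigned idx)
{
	int n;

	assert(idx < SK_NGROUPS);
	pthread_mutex_lock(&g->group[idx].lock);
	n = g->nwild;
	pthread_mutex_unlock(&g->group[idx].lock);
	return (n);
}
```

The trade-off in this arrangement is that wildcard updates become expensive (one lock per group) while the common read path touches only the lock of the group it is already working in.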
|
|
|
|
|
2003-11-26 01:40:44 +00:00
|
|
|
#define INP_LOCK_INIT(inp, d, t) \
|
2008-04-17 21:38:18 +00:00
|
|
|
rw_init_flags(&(inp)->inp_lock, (t), RW_RECURSE | RW_DUPOK)
|
|
|
|
#define INP_LOCK_DESTROY(inp) rw_destroy(&(inp)->inp_lock)
|
|
|
|
#define INP_RLOCK(inp) rw_rlock(&(inp)->inp_lock)
|
|
|
|
#define INP_WLOCK(inp) rw_wlock(&(inp)->inp_lock)
|
Merge last of a series of rwlock conversion changes to UDP, which
completes the move to a fully parallel UDP transmit path by using
global read, rather than write, locking of inpcbinfo in further
semi-connected cases:
- Add macros to allow try-locking of inpcb and inpcbinfo.
- Always acquire an inpcb read lock in udp_output(), which stabilizes the
local inpcb address and port bindings in order to determine what further
locking is required:
- If the inpcb is currently not bound (at all) and we are implicitly
connecting, we require inpcbinfo and inpcb write locks, so drop the
read lock and re-acquire.
- If the inpcb is bound for at least one of the port or address, but an
explicit source or destination is requested, trylock the inpcbinfo
lock, and if that fails, drop the inpcb lock, lock the global lock,
and relock the inpcb lock.
- Otherwise, no further locking is required (common case).
- Update comments.
In practice, this means that the vast majority of consumers of UDP sockets
will not acquire any exclusive locks at the socket or UDP levels of the
network stack. This leads to a marked performance improvement in several
important workloads, including BIND, nsd, and memcached over UDP, as well
as significant improvements in pps microbenchmarks.
The plan is to MFC all of the rwlock changes to RELENG_7 once they have
settled for a few weeks in the tree.
Tested by: ps, kris (older revision), bde
MFC after: 3 weeks
2008-07-15 15:38:47 +00:00
|
|
|
#define INP_TRY_RLOCK(inp) rw_try_rlock(&(inp)->inp_lock)
|
|
|
|
#define INP_TRY_WLOCK(inp) rw_try_wlock(&(inp)->inp_lock)
|
2008-04-17 21:38:18 +00:00
|
|
|
#define INP_RUNLOCK(inp) rw_runlock(&(inp)->inp_lock)
|
|
|
|
#define INP_WUNLOCK(inp) rw_wunlock(&(inp)->inp_lock)
|
2009-04-15 21:39:56 +00:00
|
|
|
#define INP_TRY_UPGRADE(inp) rw_try_upgrade(&(inp)->inp_lock)
|
|
|
|
#define INP_DOWNGRADE(inp) rw_downgrade(&(inp)->inp_lock)
|
|
|
|
#define INP_WLOCKED(inp) rw_wowned(&(inp)->inp_lock)
|
|
|
|
#define INP_LOCK_ASSERT(inp) rw_assert(&(inp)->inp_lock, RA_LOCKED)
|
2008-04-17 21:38:18 +00:00
|
|
|
#define INP_RLOCK_ASSERT(inp) rw_assert(&(inp)->inp_lock, RA_RLOCKED)
|
|
|
|
#define INP_WLOCK_ASSERT(inp) rw_assert(&(inp)->inp_lock, RA_WLOCKED)
|
|
|
|
#define INP_UNLOCK_ASSERT(inp) rw_assert(&(inp)->inp_lock, RA_UNLOCKED)
|
2002-06-10 20:05:46 +00:00
|
|
|
|
2008-03-23 22:34:16 +00:00
|
|
|
/*
|
2008-08-07 09:06:04 +00:00
|
|
|
* These locking functions are for inpcb consumers outside of sys/netinet,
|
2008-03-23 22:34:16 +00:00
|
|
|
* more specifically, they were added for the benefit of TOE drivers. The
|
|
|
|
* macros are reserved for use by the stack.
|
|
|
|
*/
|
|
|
|
void inp_wlock(struct inpcb *);
|
|
|
|
void inp_wunlock(struct inpcb *);
|
|
|
|
void inp_rlock(struct inpcb *);
|
|
|
|
void inp_runlock(struct inpcb *);
|
|
|
|
|
|
|
|
#ifdef INVARIANTS
|
2008-03-24 20:24:04 +00:00
|
|
|
void inp_lock_assert(struct inpcb *);
|
|
|
|
void inp_unlock_assert(struct inpcb *);
|
2008-03-23 22:34:16 +00:00
|
|
|
#else
|
|
|
|
static __inline void
|
2008-03-24 20:24:04 +00:00
|
|
|
inp_lock_assert(struct inpcb *inp __unused)
|
2008-03-23 22:34:16 +00:00
|
|
|
{
|
|
|
|
}
|
|
|
|
|
|
|
|
static __inline void
|
2008-03-24 20:24:04 +00:00
|
|
|
inp_unlock_assert(struct inpcb *inp __unused)
|
2008-03-23 22:34:16 +00:00
|
|
|
{
|
|
|
|
}
|
2008-03-24 20:24:04 +00:00
|
|
|
|
2008-03-23 22:34:16 +00:00
|
|
|
#endif
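The #ifdef INVARIANTS block above follows a common pattern: lock assertions are real functions in debugging kernels and empty inline functions otherwise, so callers can invoke them unconditionally at no cost in production builds. A small userland sketch of the same idea is shown below. The SK_INVARIANTS switch, the sk_* names, and the boolean standing in for lock state are all hypothetical, and both variants are defined inline here, whereas the header above only declares the checked versions.

```c
#include <assert.h>
#include <stdbool.h>

struct sk_pcb {
	bool	locked;		/* stand-in for real lock state */
};

#ifdef SK_INVARIANTS
/* Debug build: the assertion really checks the lock state. */
static inline void
sk_lock_assert(struct sk_pcb *pcb)
{
	assert(pcb->locked);
}
#else
/* Production build: same signature, empty body, optimised away. */
static inline void
sk_lock_assert(struct sk_pcb *pcb)
{
	(void)pcb;
}
#endif
```

Compiling with -DSK_INVARIANTS switches on the checked variant without any change at the call sites.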
|
2008-07-21 00:08:34 +00:00
|
|
|
|
2008-07-21 22:11:39 +00:00
|
|
|
void inp_apply_all(void (*func)(struct inpcb *, void *), void *arg);
|
|
|
|
int inp_ip_tos_get(const struct inpcb *inp);
|
|
|
|
void inp_ip_tos_set(struct inpcb *inp, int val);
|
|
|
|
struct socket *
|
|
|
|
inp_inpcbtosocket(struct inpcb *inp);
|
|
|
|
struct tcpcb *
|
|
|
|
inp_inpcbtotcpcb(struct inpcb *inp);
|
2008-08-07 09:06:04 +00:00
|
|
|
void inp_4tuple_get(struct inpcb *inp, uint32_t *laddr, uint16_t *lp,
|
2008-07-21 22:11:39 +00:00
|
|
|
uint32_t *faddr, uint16_t *fp);
|
2013-07-04 18:38:00 +00:00
|
|
|
short inp_so_options(const struct inpcb *inp);
|
2008-07-21 00:08:34 +00:00
|
|
|
|
2008-03-24 20:24:04 +00:00
|
|
|
#endif /* _KERNEL */
|
|
|
|
|
2002-06-10 20:05:46 +00:00
|
|
|
#define INP_INFO_LOCK_INIT(ipi, d) \
|
2008-04-17 21:38:18 +00:00
|
|
|
rw_init_flags(&(ipi)->ipi_lock, (d), RW_RECURSE)
|
|
|
|
#define INP_INFO_LOCK_DESTROY(ipi) rw_destroy(&(ipi)->ipi_lock)
|
|
|
|
#define INP_INFO_RLOCK(ipi) rw_rlock(&(ipi)->ipi_lock)
|
|
|
|
#define INP_INFO_WLOCK(ipi) rw_wlock(&(ipi)->ipi_lock)
|
2008-07-15 15:38:47 +00:00
|
|
|
#define INP_INFO_TRY_RLOCK(ipi) rw_try_rlock(&(ipi)->ipi_lock)
|
|
|
|
#define INP_INFO_TRY_WLOCK(ipi) rw_try_wlock(&(ipi)->ipi_lock)
|
2010-03-06 21:24:11 +00:00
|
|
|
#define INP_INFO_TRY_UPGRADE(ipi) rw_try_upgrade(&(ipi)->ipi_lock)
|
2015-08-08 08:40:36 +00:00
|
|
|
#define INP_INFO_WLOCKED(ipi) rw_wowned(&(ipi)->ipi_lock)
|
2008-04-17 21:38:18 +00:00
|
|
|
#define INP_INFO_RUNLOCK(ipi) rw_runlock(&(ipi)->ipi_lock)
|
|
|
|
#define INP_INFO_WUNLOCK(ipi) rw_wunlock(&(ipi)->ipi_lock)
|
|
|
|
#define INP_INFO_LOCK_ASSERT(ipi) rw_assert(&(ipi)->ipi_lock, RA_LOCKED)
|
|
|
|
#define INP_INFO_RLOCK_ASSERT(ipi) rw_assert(&(ipi)->ipi_lock, RA_RLOCKED)
|
|
|
|
#define INP_INFO_WLOCK_ASSERT(ipi) rw_assert(&(ipi)->ipi_lock, RA_WLOCKED)
|
|
|
|
#define INP_INFO_UNLOCK_ASSERT(ipi) rw_assert(&(ipi)->ipi_lock, RA_UNLOCKED)
|
2002-06-10 20:05:46 +00:00
|
|
|
|
2015-08-03 12:13:54 +00:00
|
|
|
#define INP_LIST_LOCK_INIT(ipi, d) \
|
|
|
|
rw_init_flags(&(ipi)->ipi_list_lock, (d), 0)
|
|
|
|
#define INP_LIST_LOCK_DESTROY(ipi) rw_destroy(&(ipi)->ipi_list_lock)
|
|
|
|
#define INP_LIST_RLOCK(ipi) rw_rlock(&(ipi)->ipi_list_lock)
|
|
|
|
#define INP_LIST_WLOCK(ipi) rw_wlock(&(ipi)->ipi_list_lock)
|
|
|
|
#define INP_LIST_TRY_RLOCK(ipi) rw_try_rlock(&(ipi)->ipi_list_lock)
|
|
|
|
#define INP_LIST_TRY_WLOCK(ipi) rw_try_wlock(&(ipi)->ipi_list_lock)
|
|
|
|
#define INP_LIST_TRY_UPGRADE(ipi) rw_try_upgrade(&(ipi)->ipi_list_lock)
|
|
|
|
#define INP_LIST_RUNLOCK(ipi) rw_runlock(&(ipi)->ipi_list_lock)
|
|
|
|
#define INP_LIST_WUNLOCK(ipi) rw_wunlock(&(ipi)->ipi_list_lock)
|
|
|
|
#define INP_LIST_LOCK_ASSERT(ipi) \
|
|
|
|
rw_assert(&(ipi)->ipi_list_lock, RA_LOCKED)
|
|
|
|
#define INP_LIST_RLOCK_ASSERT(ipi) \
|
|
|
|
rw_assert(&(ipi)->ipi_list_lock, RA_RLOCKED)
|
|
|
|
#define INP_LIST_WLOCK_ASSERT(ipi) \
|
|
|
|
rw_assert(&(ipi)->ipi_list_lock, RA_WLOCKED)
|
|
|
|
#define INP_LIST_UNLOCK_ASSERT(ipi) \
|
|
|
|
rw_assert(&(ipi)->ipi_list_lock, RA_UNLOCKED)
|
|
|
|
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
#define INP_HASH_LOCK_INIT(ipi, d) \
	rw_init_flags(&(ipi)->ipi_hash_lock, (d), 0)
#define INP_HASH_LOCK_DESTROY(ipi)	rw_destroy(&(ipi)->ipi_hash_lock)
#define INP_HASH_RLOCK(ipi)		rw_rlock(&(ipi)->ipi_hash_lock)
#define INP_HASH_WLOCK(ipi)		rw_wlock(&(ipi)->ipi_hash_lock)
#define INP_HASH_RUNLOCK(ipi)		rw_runlock(&(ipi)->ipi_hash_lock)
#define INP_HASH_WUNLOCK(ipi)		rw_wunlock(&(ipi)->ipi_hash_lock)
#define INP_HASH_LOCK_ASSERT(ipi)	rw_assert(&(ipi)->ipi_hash_lock, \
					RA_LOCKED)
#define INP_HASH_WLOCK_ASSERT(ipi)	rw_assert(&(ipi)->ipi_hash_lock, \
					RA_WLOCKED)
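The lock order described in the commit message (inpcb lock first, hash lock second) means a lookup cannot simply take the inpcb lock while still holding the hash lock. The following is a minimal single-threaded model of the reference-then-lock sequence, not the kernel implementation: every `model_*` name is hypothetical, the "locks" are plain counters guarded by assertions, and the check for a pcb freed in the meantime is omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these names come from in_pcb.h. */
enum model_lookup_lock { MODEL_RLOCKPCB, MODEL_WLOCKPCB };

struct model_pcb {
	unsigned short	lport;		/* key used for the toy lookup */
	int		refcount;	/* references keeping the pcb stable */
	int		readers;	/* read holds on the pcb lock */
	int		writer;		/* 1 while write-locked */
};

struct model_pcbinfo {
	struct model_pcb *table[4];	/* toy "hash table" */
	int		hash_held;	/* 1 while the hash lock is held */
};

static void
model_pcb_lock(struct model_pcbinfo *pi, struct model_pcb *p,
    enum model_lookup_lock how)
{
	/* Lock order: the pcb lock must not be taken under the hash lock. */
	assert(!pi->hash_held);
	if (how == MODEL_WLOCKPCB) {
		assert(p->readers == 0 && !p->writer);
		p->writer = 1;
	} else {
		assert(!p->writer);
		p->readers++;
	}
}

/* Look up by local port; return the pcb locked as requested, or NULL. */
static struct model_pcb *
model_pcblookup(struct model_pcbinfo *pi, unsigned short lport,
    enum model_lookup_lock how)
{
	struct model_pcb *p = NULL;
	size_t i;

	pi->hash_held = 1;		/* models INP_HASH_RLOCK() */
	for (i = 0; i < 4; i++) {
		if (pi->table[i] != NULL && pi->table[i]->lport == lport) {
			p = pi->table[i];
			p->refcount++;	/* keep it stable across the gap */
			break;
		}
	}
	pi->hash_held = 0;		/* models INP_HASH_RUNLOCK() */
	if (p == NULL)
		return (NULL);
	model_pcb_lock(pi, p, how);
	p->refcount--;			/* the pcb lock now keeps it stable */
	return (p);
}
```

The point of the sketch is the ordering: the temporary reference bridges the window between dropping the hash lock and acquiring the pcb lock.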
Implement a CPU-affine TCP and UDP connection lookup data structure,
struct inpcbgroup. pcbgroups, or "connection groups", supplement the
existing inpcbinfo connection hash table, which when pcbgroups are
enabled, might now be thought of more usefully as a per-protocol
4-tuple reservation table.
Connections are assigned to connection groups based on a hash of their
4-tuple; wildcard sockets require special handling, and are members
of all connection groups. During a connection lookup, a
per-connection group lock is employed rather than the global pcbinfo
lock. By aligning connection groups with input path processing,
connection groups take on an effective CPU affinity, especially when
aligned with RSS work placement (see a forthcoming commit for
details). This eliminates cache line migration associated with
global, protocol-layer data structures in steady state TCP and UDP
processing (with the exception of protocol-layer statistics; further
commit to follow).
Elements of this approach were inspired by Willman, Rixner, and Cox's
2006 USENIX paper, "An Evaluation of Network Stack Parallelization
Strategies in Modern Operating Systems". However, there are also
significant differences: we maintain the inpcb lock, rather than using
the connection group lock for per-connection state.
Likewise, the focus of this implementation is alignment with NIC
packet distribution strategies such as RSS, rather than pure software
strategies. Despite that focus, software distribution is supported
through the parallel netisr implementation, and works well in
configurations where the number of hardware threads is greater than
the number of NIC input queues, such as in the RMI XLR threaded MIPS
architecture.
Another important difference is the continued maintenance of existing
hash tables as "reservation tables" -- these are useful both to
distinguish the resource allocation aspect of protocol name management
and the more common-case lookup aspect. In configurations where
connection tables are aligned with hardware hashes, it is desirable to
use the traditional lookup tables for loopback or encapsulated traffic
rather than take the expense of hardware hashes that are hard to
implement efficiently in software (such as RSS Toeplitz).
Connection group support is enabled by compiling "options PCBGROUP"
into your kernel configuration; for the time being, this is an
experimental feature, and hence is not enabled by default.
Subject to the limited MFCability of change dependencies in inpcb,
and its change to the inpcbinfo init function signature, this change
in principle could be merged to FreeBSD 8.x.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-06-06 12:55:02 +00:00
#define INP_GROUP_LOCK_INIT(ipg, d)	mtx_init(&(ipg)->ipg_lock, (d), NULL, \
					MTX_DEF | MTX_DUPOK)
#define INP_GROUP_LOCK_DESTROY(ipg)	mtx_destroy(&(ipg)->ipg_lock)

#define INP_GROUP_LOCK(ipg)		mtx_lock(&(ipg)->ipg_lock)
#define INP_GROUP_LOCK_ASSERT(ipg)	mtx_assert(&(ipg)->ipg_lock, MA_OWNED)
#define INP_GROUP_UNLOCK(ipg)		mtx_unlock(&(ipg)->ipg_lock)
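The commit message says connections are assigned to a connection group by a hash of their 4-tuple, while wildcard sockets belong to every group. A rough sketch of that assignment rule follows. The struct, the mixing function, and the `MODEL_GROUP_ALL` marker are all assumptions made for illustration; the real code may hash differently (for example with an RSS-compatible hash).

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_GROUP_ALL	(-1)	/* hypothetical marker: member of every group */

/* Hypothetical 4-tuple; field names are illustrative only. */
struct model_tuple {
	uint32_t laddr, faddr;
	uint16_t lport, fport;
	int	 wildcard;	/* nonzero for a wildcard (listening) socket */
};

/* Toy mixing function standing in for whatever hash the stack or NIC uses. */
static uint32_t
model_tuple_hash(const struct model_tuple *t)
{
	uint32_t h = t->laddr ^ t->faddr;

	h ^= ((uint32_t)t->lport << 16) | t->fport;
	h ^= h >> 16;
	return (h);
}

/* Return the group index for a connection, or MODEL_GROUP_ALL. */
static int
model_pick_group(const struct model_tuple *t, unsigned int ngroups)
{
	assert(ngroups > 0);
	if (t->wildcard)
		return (MODEL_GROUP_ALL);
	return ((int)(model_tuple_hash(t) % ngroups));
}
```

Because the group depends only on the 4-tuple, every packet of a given connection maps to the same group, which is what allows a per-group lock to replace the global one during lookup.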
1997-03-03 09:23:37 +00:00
|
|
|
#define INP_PCBHASH(faddr, lport, fport, mask) \
|
Improved connection establishment performance by doing local port lookups via
a hashed port list. In the new scheme, in_pcblookup() goes away and is
replaced by a new routine, in_pcblookup_local() for doing the local port
check. Note that this implementation is space inefficient in that the PCB
struct is now too large to fit into 128 bytes. I might deal with this in the
future by using the new zone allocator, but I wanted these changes to be
extensively tested in their current form first.
Also:
1) Fixed off-by-one errors in the port lookup loops in in_pcbbind().
2) Got rid of some unneeded rehashing. Added a new routine, in_pcbinshash(),
to do the initial hash insertion.
3) Renamed in_pcblookuphash() to in_pcblookup_hash() for easier readability.
4) Added a new routine, in_pcbremlists() to remove the PCB from the various
hash lists.
5) Added/deleted comments where appropriate.
6) Removed unnecessary splnet() locking. In general, the PCB functions should
be called at splnet()...there are unfortunately a few exceptions, however.
7) Reorganized a few structs for better cache line behavior.
8) Killed my TCP_ACK_HACK kludge. It may come back in a different form in
the future, however.
These changes have been tested on wcarchive for more than a month. In tests
done here, connection establishment overhead is reduced by more than 50
times, thus getting rid of one of the major networking scalability problems.
Still to do: make tcp_fasttimo/tcp_slowtimo scale well for systems with a
large number of connections. tcp_fasttimo is easy; tcp_slowtimo is difficult.
WARNING: Anything that knows about inpcb and tcpcb structs will have to be
recompiled; at the very least, this includes netstat(1).
1998-01-27 09:15:13 +00:00
	(((faddr) ^ ((faddr) >> 16) ^ ntohs((lport) ^ (fport))) & (mask))
#define INP_PCBPORTHASH(lport, mask) \
	(ntohs((lport)) & (mask))
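To see what the two hash macros compute, the user-space sketch below copies them verbatim and wraps them in small helper functions. The `model_*` helper names are hypothetical, and the mask is assumed to be a power-of-two table size minus one.

```c
#include <arpa/inet.h>	/* htons(), ntohs() */
#include <assert.h>
#include <stdint.h>

/* Copied from the header above. */
#define INP_PCBHASH(faddr, lport, fport, mask) \
	(((faddr) ^ ((faddr) >> 16) ^ ntohs((lport) ^ (fport))) & (mask))
#define INP_PCBPORTHASH(lport, mask) \
	(ntohs((lport)) & (mask))

/* Bucket for a connection; ports are passed in network byte order. */
static uint32_t
model_conn_bucket(uint32_t faddr, uint16_t lport, uint16_t fport,
    uint32_t mask)
{
	return (INP_PCBHASH(faddr, lport, fport, mask));
}

/* Bucket for a bound local port, in network byte order. */
static uint32_t
model_port_bucket(uint16_t lport, uint32_t mask)
{
	return (INP_PCBPORTHASH(lport, mask));
}
```

For example, `model_port_bucket(htons(80), 0xff)` yields bucket 80: the port is converted to host order and masked, so ports that differ only above the mask bits share a bucket.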
2014-09-10 12:35:42 +00:00
|
|
|
#define INP6_PCBHASHKEY(faddr) ((faddr)->s6_addr32[3])
|
1997-03-03 09:23:37 +00:00
|
|
|
|
2008-12-09 10:21:38 +00:00
|
|
|
/*
|
2009-04-15 22:09:42 +00:00
|
|
|
* Flags for inp_vflags -- historically version flags only
|
2008-12-09 10:21:38 +00:00
|
|
|
*/
|
|
|
|
#define INP_IPV4 0x1
|
|
|
|
#define INP_IPV6 0x2
|
|
|
|
#define INP_IPV6PROTO 0x4 /* opened under IPv6 protocol */
|
|
|
|
|
|
|
|
/*
|
2009-04-15 22:09:42 +00:00
|
|
|
* Flags for inp_flags.
|
2008-12-09 10:21:38 +00:00
|
|
|
*/
|
2009-03-15 09:58:31 +00:00
#define INP_RECVOPTS 0x00000001 /* receive incoming IP options */
#define INP_RECVRETOPTS 0x00000002 /* receive IP options for reply */
#define INP_RECVDSTADDR 0x00000004 /* receive IP dst address */
#define INP_HDRINCL 0x00000008 /* user supplies entire IP header */
#define INP_HIGHPORT 0x00000010 /* user wants "high" port binding */
#define INP_LOWPORT 0x00000020 /* user wants "low" port binding */
#define INP_ANONPORT 0x00000040 /* port chosen for user */
#define INP_RECVIF 0x00000080 /* receive incoming interface */
#define INP_MTUDISC 0x00000100 /* user can do MTU discovery */
2014-11-09 21:33:01 +00:00
|
|
|
/* 0x000200 unused: was INP_FAITH */
|
2009-03-15 09:58:31 +00:00
|
|
|
#define INP_RECVTTL 0x00000400 /* receive incoming IP TTL */
|
|
|
|
#define INP_DONTFRAG 0x00000800 /* don't fragment packet */
|
2009-06-01 10:30:00 +00:00
|
|
|
#define INP_BINDANY 0x00001000 /* allow bind to any address */
|
2009-03-15 09:58:31 +00:00
|
|
|
#define INP_INHASHLIST 0x00002000 /* in_pcbinshash() has been called */
|
2012-06-12 14:02:38 +00:00
|
|
|
#define INP_RECVTOS 0x00004000 /* receive incoming IP TOS */
|
2009-03-15 09:58:31 +00:00
#define IN6P_IPV6_V6ONLY 0x00008000 /* restrict AF_INET6 socket for v6 */
#define IN6P_PKTINFO 0x00010000 /* receive IP6 dst and I/F */
#define IN6P_HOPLIMIT 0x00020000 /* receive hoplimit */
#define IN6P_HOPOPTS 0x00040000 /* receive hop-by-hop options */
#define IN6P_DSTOPTS 0x00080000 /* receive dst options after rthdr */
#define IN6P_RTHDR 0x00100000 /* receive routing header */
#define IN6P_RTHDRDSTOPTS 0x00200000 /* receive dstoptions before rthdr */
#define IN6P_TCLASS 0x00400000 /* receive traffic class value */
#define IN6P_AUTOFLOWLABEL 0x00800000 /* attach flowlabel automatically */
#define INP_TIMEWAIT 0x01000000 /* in TIMEWAIT, ppcb is tcptw */
#define INP_ONESBCAST 0x02000000 /* send all-ones broadcast */
#define INP_DROPPED 0x04000000 /* protocol drop flag */
#define INP_SOCKREF 0x08000000 /* strong socket reference */
2014-12-01 11:45:24 +00:00
|
|
|
#define INP_RESERVED_0 0x10000000 /* reserved field */
|
|
|
|
#define INP_RESERVED_1 0x20000000 /* reserved field */
|
2003-10-24 19:51:49 +00:00
|
|
|
#define IN6P_RFC2292 0x40000000 /* used RFC2292 API on the socket */
|
|
|
|
#define IN6P_MTU 0x80000000 /* receive path MTU */
|
2001-06-11 12:39:29 +00:00
|
|
|
|
1996-11-11 04:56:32 +00:00
|
|
|
#define INP_CONTROLOPTS (INP_RECVOPTS|INP_RECVRETOPTS|INP_RECVDSTADDR|\
|
2012-06-12 14:02:38 +00:00
|
|
|
INP_RECVIF|INP_RECVTTL|INP_RECVTOS|\
|
2001-06-11 12:39:29 +00:00
|
|
|
IN6P_PKTINFO|IN6P_HOPLIMIT|IN6P_HOPOPTS|\
|
|
|
|
IN6P_DSTOPTS|IN6P_RTHDR|IN6P_RTHDRDSTOPTS|\
|
2003-10-24 18:26:30 +00:00
|
|
|
IN6P_TCLASS|IN6P_AUTOFLOWLABEL|IN6P_RFC2292|\
|
|
|
|
IN6P_MTU)
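INP_CONTROLOPTS is a bitwise OR of the per-socket "receive ..." flags, so a single mask test can tell whether any of them is set. The sketch below shows that test with a reduced stand-in mask. `MODEL_CONTROLOPTS` and `model_wants_control()` are hypothetical names, and only a handful of the flag values are copied from the header.

```c
#include <assert.h>
#include <stdint.h>

/* A subset of the flag values, copied from the header above. */
#define INP_RECVOPTS	0x00000001
#define INP_HDRINCL	0x00000008
#define INP_RECVTTL	0x00000400
#define IN6P_PKTINFO	0x00010000
#define IN6P_MTU	0x80000000

/* Reduced stand-in for INP_CONTROLOPTS, limited to the flags defined here. */
#define MODEL_CONTROLOPTS \
	(INP_RECVOPTS | INP_RECVTTL | IN6P_PKTINFO | IN6P_MTU)

/* Nonzero if at least one control-option flag is set in inp_flags. */
static int
model_wants_control(uint32_t inp_flags)
{
	return ((inp_flags & MODEL_CONTROLOPTS) != 0);
}
```

A socket with only INP_HDRINCL set would fail the test, while one with INP_RECVTTL set would pass it, regardless of which other bits are present.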
|
1999-11-22 02:45:11 +00:00
|
|
|
|
2009-04-15 22:09:42 +00:00
|
|
|
/*
|
|
|
|
* Flags for inp_flags2.
|
|
|
|
*/
|
2009-04-15 22:22:00 +00:00
|
|
|
#define INP_LLE_VALID 0x00000001 /* cached lle is valid */
|
|
|
|
#define INP_RT_VALID 0x00000002 /* cached rtentry is valid */
|
2011-06-06 12:55:02 +00:00
|
|
|
#define INP_PCBGROUPWILD 0x00000004 /* in pcbgroup wildcard list */
|
2011-11-06 10:47:20 +00:00
|
|
|
#define INP_REUSEPORT 0x00000008 /* SO_REUSEPORT option is set */
|
There is a complex race in in_pcblookup_hash() and in_pcblookup_group().
Both functions need to obtain a lock on the found PCB, and they can't do
a classic interlock with the PCB hash lock, due to lock order reversal.
To keep the PCB stable, these functions put a reference on it and drop it
after the PCB lock is acquired. If the reference was the last one, this
means we've raced with in_pcbfree() and the PCB is no longer valid.
This approach works only if we are acquiring a writer lock on the PCB.
In the case of a reader lock, the following scenario can happen:
- Two threads locate the pcb and do in_pcbref() on it.
- These two threads drop the inp hash lock.
- Another thread comes to delete the pcb via in_pcbfree(): it obtains the
hash lock, does in_pcbremlists(), drops the hash lock, and runs
in_pcbrele_wlocked(), which doesn't free the pcb due to the two references
on it. Then it unlocks the pcb.
- The two aforementioned threads acquire the reader lock on the pcb and run
in_pcbrele_rlocked(). One gets 1 from in_pcbrele_rlocked() and continues,
the second gets 0 and considers the pcb freed, and returns.
- The thread that got 1 continues working with the detached pcb, which later
leads to a panic in the underlying protocol level.
To plug this hole, an additional INPCB flag is introduced: INP_FREED. We
check for that flag in in_pcbrele_rlocked() and, if it is set, we pretend
that this was the last reference.
Discussed with: rwatson, jhb
Reported by: Vladimir Medvedkin <medved rambler-co.ru>
2012-10-02 12:03:02 +00:00
|
|
|
#define INP_FREED 0x00000010 /* inp itself is not valid */
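The fix described in the commit message can be modelled in a few lines. This is a simplified single-threaded sketch, not the kernel code: `struct model_inpcb` and the `model_*` functions are hypothetical, locking is left out, and the return convention (1 means "treat the pcb as gone") is an assumption of the sketch.

```c
#include <assert.h>

#define INP_FREED	0x00000010	/* value copied from the header above */

/* Hypothetical reduced inpcb: only the fields this sketch needs. */
struct model_inpcb {
	int	refcount;
	int	flags2;
};

/* Model of in_pcbfree(): mark the pcb freed, drop the owner's reference. */
static void
model_pcbfree(struct model_inpcb *inp)
{
	inp->flags2 |= INP_FREED;
	inp->refcount--;
}

/*
 * Model of a read-locked release.  Returns 1 when the caller must treat the
 * pcb as gone, 0 when it may keep using it.
 */
static int
model_pcbrele_rlocked(struct model_inpcb *inp)
{
	assert(inp->refcount > 0);
	if (--inp->refcount > 0) {
		/* Not the last reference, but a freed pcb is still unusable. */
		return ((inp->flags2 & INP_FREED) != 0);
	}
	return (1);	/* last reference: the real code would free here */
}
```

Replaying the scenario from the commit message (two lookup references, then a free, then two releases), both releasing threads now see 1. Without the flag check, the first of them would see 0 and carry on with a detached pcb.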
|
2013-07-04 18:38:00 +00:00
|
|
|
#define INP_REUSEADDR 0x00000020 /* SO_REUSEADDR option is set */
|
2014-07-10 03:10:56 +00:00
|
|
|
#define INP_BINDMULTI 0x00000040 /* IP_BINDMULTI option is set */
|
|
|
|
#define INP_RSS_BUCKET_SET 0x00000080 /* IP_RSS_LISTEN_BUCKET is set */
|
2014-09-09 01:45:39 +00:00
|
|
|
#define INP_RECVFLOWID 0x00000100 /* populate recv datagram with flow info */
|
|
|
|
#define INP_RECVRSSBUCKETID 0x00000200 /* populate recv datagram with bucket id */
|
2009-04-15 22:09:42 +00:00
|
|
|
|
2011-05-30 09:43:55 +00:00
/*
 * Flags passed to in_pcblookup*() functions.
 */
#define INPLOOKUP_WILDCARD 0x00000001 /* Allow wildcard sockets. */
#define INPLOOKUP_RLOCKPCB 0x00000002 /* Return inpcb read-locked. */
#define INPLOOKUP_WLOCKPCB 0x00000004 /* Return inpcb write-locked. */

#define INPLOOKUP_MASK (INPLOOKUP_WILDCARD | INPLOOKUP_RLOCKPCB | \
	INPLOOKUP_WLOCKPCB)
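The commit message that introduced these flags states that callers must pass exactly one of the two locking flags. A check for that rule might look like the sketch below. `model_lookupflags_ok()` is a hypothetical helper, not a function from the header, and the kernel may enforce the rule differently (for example with KASSERT).

```c
#include <assert.h>

/* Values copied from the header above. */
#define INPLOOKUP_WILDCARD	0x00000001
#define INPLOOKUP_RLOCKPCB	0x00000002
#define INPLOOKUP_WLOCKPCB	0x00000004
#define INPLOOKUP_MASK	(INPLOOKUP_WILDCARD | INPLOOKUP_RLOCKPCB | \
			    INPLOOKUP_WLOCKPCB)

/* Nonzero if lookupflags has no unknown bits and exactly one lock flag. */
static int
model_lookupflags_ok(int lookupflags)
{
	int lock = lookupflags & (INPLOOKUP_RLOCKPCB | INPLOOKUP_WLOCKPCB);

	if ((lookupflags & ~INPLOOKUP_MASK) != 0)
		return (0);
	return (lock == INPLOOKUP_RLOCKPCB || lock == INPLOOKUP_WLOCKPCB);
}
```

INPLOOKUP_WILDCARD is independent of the locking choice, so it may be combined with either lock flag.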
1994-05-24 10:09:53 +00:00
|
|
|
#define sotoinpcb(so) ((struct inpcb *)(so)->so_pcb)
|
1999-11-22 02:45:11 +00:00
|
|
|
#define sotoin6pcb(so) sotoinpcb(so) /* for KAME src sync over BSD*'s */
|
|
|
|
|
|
|
|
#define INP_SOCKAF(so) so->so_proto->pr_domain->dom_family
|
|
|
|
|
2004-08-16 18:32:07 +00:00
|
|
|
#define INP_CHECK_SOCKAF(so, af) (INP_SOCKAF(so) == af)
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2011-06-06 12:55:02 +00:00
/*
 * Constants for pcbinfo.ipi_hashfields.
 */
#define IPI_HASHFIELDS_NONE 0
#define IPI_HASHFIELDS_2TUPLE 1
#define IPI_HASHFIELDS_4TUPLE 2
1999-12-29 04:46:21 +00:00
|
|
|
#ifdef _KERNEL
|
Build on Jeff Roberson's linker-set based dynamic per-CPU allocator
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator. Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...). This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.
Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack. Virtualized global variables are
tagged at compile-time, placing them in a special linker set, which is
loaded into a contiguous region of kernel memory. Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of the kernel linker.
Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy. Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address. When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.
This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.
Bump __FreeBSD_version and update UPDATING.
Portions submitted by: bz
Reviewed by: bz, zec
Discussed with: gnn, jamie, jeff, jhb, julian, sam
Suggested by: peter
Approved by: re (kensmith)
2009-07-14 22:48:30 +00:00
VNET_DECLARE(int, ipport_reservedhigh);
VNET_DECLARE(int, ipport_reservedlow);
VNET_DECLARE(int, ipport_lowfirstauto);
VNET_DECLARE(int, ipport_lowlastauto);
VNET_DECLARE(int, ipport_firstauto);
VNET_DECLARE(int, ipport_lastauto);
VNET_DECLARE(int, ipport_hifirstauto);
VNET_DECLARE(int, ipport_hilastauto);
VNET_DECLARE(int, ipport_randomized);
VNET_DECLARE(int, ipport_randomcps);
VNET_DECLARE(int, ipport_randomtime);
VNET_DECLARE(int, ipport_stoprandom);
VNET_DECLARE(int, ipport_tcpallocs);
2009-07-16 21:13:04 +00:00
#define V_ipport_reservedhigh VNET(ipport_reservedhigh)
#define V_ipport_reservedlow VNET(ipport_reservedlow)
#define V_ipport_lowfirstauto VNET(ipport_lowfirstauto)
#define V_ipport_lowlastauto VNET(ipport_lowlastauto)
#define V_ipport_firstauto VNET(ipport_firstauto)
#define V_ipport_lastauto VNET(ipport_lastauto)
#define V_ipport_hifirstauto VNET(ipport_hifirstauto)
#define V_ipport_hilastauto VNET(ipport_hilastauto)
#define V_ipport_randomized VNET(ipport_randomized)
#define V_ipport_randomcps VNET(ipport_randomcps)
#define V_ipport_randomtime VNET(ipport_randomtime)
#define V_ipport_stoprandom VNET(ipport_stoprandom)
#define V_ipport_tcpallocs VNET(ipport_tcpallocs)
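The vnet commit message describes each virtualized global as a reference copy plus one private copy per network stack, reached through an accessor macro. The user-space model below illustrates that idea only. A struct stands in for the linker-set region, `MODEL_VNET()` and the `model_*` names are hypothetical, and the initial values are arbitrary illustrations rather than kernel defaults.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the linker-set region holding the reference copies. */
struct model_vnet_set {
	int	ipport_firstauto;
	int	ipport_lastauto;
};
static const struct model_vnet_set model_reference = { 10000, 65535 };

/* One virtual network stack: a private copy of the whole region. */
struct model_vnet {
	unsigned char	*data;
};

/* Accessor: this vnet's base address plus the variable's region offset. */
#define MODEL_VNET(vnet, name) \
	(*(int *)((vnet)->data + offsetof(struct model_vnet_set, name)))

/* Create a stack instance, initialized from the reference copy. */
static int
model_vnet_init(struct model_vnet *vnet)
{
	vnet->data = malloc(sizeof(model_reference));
	if (vnet->data == NULL)
		return (-1);
	memcpy(vnet->data, &model_reference, sizeof(model_reference));
	return (0);
}

static void
model_vnet_destroy(struct model_vnet *vnet)
{
	free(vnet->data);
	vnet->data = NULL;
}
```

Writing through `MODEL_VNET(&a, ipport_firstauto)` changes only stack `a`; a second stack initialized from the same reference copy keeps its original value.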
2009-07-14 22:48:30 +00:00
|
|
|
|
2010-03-14 18:59:11 +00:00
|
|
|
void in_pcbinfo_destroy(struct inpcbinfo *);
|
|
|
|
void in_pcbinfo_init(struct inpcbinfo *, const char *, struct inpcbhead *,
|
2011-06-06 12:55:02 +00:00
|
|
|
int, int, char *, uma_init, uma_fini, uint32_t, u_int);
|
|
|
|
|
2014-07-12 05:40:13 +00:00
|
|
|
int in_pcbbind_check_bindmulti(const struct inpcb *ni,
|
|
|
|
const struct inpcb *oi);
|
|
|
|
|
Implement a CPU-affine TCP and UDP connection lookup data structure,
struct inpcbgroup. pcbgroups, or "connection groups", supplement the
existing inpcbinfo connection hash table, which when pcbgroups are
enabled, might now be thought of more usefully as a per-protocol
4-tuple reservation table.
Connections are assigned to connection groups based on a hash of their
4-tuple; wildcard sockets require special handling, and are members
of all connection groups. During a connection lookup, a
per-connection group lock is employed rather than the global pcbinfo
lock. By aligning connection groups with input path processing,
connection groups take on an effective CPU affinity, especially when
aligned with RSS work placement (see a forthcoming commit for
details). This eliminates cache line migration associated with
global, protocol-layer data structures in steady state TCP and UDP
processing (with the exception of protocol-layer statistics; further
commit to follow).
Elements of this approach were inspired by Willman, Rixner, and Cox's
2006 USENIX paper, "An Evaluation of Network Stack Parallelization
Strategies in Modern Operating Systems". However, there are also
significant differences: we maintain the inpcb lock, rather than using
the connection group lock for per-connection state.
Likewise, the focus of this implementation is alignment with NIC
packet distribution strategies such as RSS, rather than pure software
strategies. Despite that focus, software distribution is supported
through the parallel netisr implementation, and works well in
configurations where the number of hardware threads is greater than
the number of NIC input queues, such as in the RMI XLR threaded MIPS
architecture.
Another important difference is the continued maintenance of existing
hash tables as "reservation tables" -- these are useful to
distinguish the resource allocation aspect of protocol name management
from the more common-case lookup aspect. In configurations where
connection tables are aligned with hardware hashes, it is desirable to
use the traditional lookup tables for loopback or encapsulated traffic
rather than take the expense of hardware hashes that are hard to
implement efficiently in software (such as RSS Toeplitz).
Connection group support is enabled by compiling "options PCBGROUP"
into your kernel configuration; for the time being, this is an
experimental feature, and hence is not enabled by default.
Subject to the limited MFCability of change dependencies in inpcb,
and its change to the inpcbinfo init function signature, this change
in principle could be merged to FreeBSD 8.x.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-06-06 12:55:02 +00:00
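The group assignment described above can be pictured with a small userspace sketch: hash the 4-tuple, then reduce the hash to a group index. The struct tuple4 type, the helper names tuple_hash() and group_index(), and the use of FNV-1a are all assumptions made for illustration; this is not the kernel's in_pcbgroup_bytuple(), and a real configuration would use the NIC's hash (for example RSS Toeplitz) so that software and hardware agree.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 4-tuple; the kernel keeps these fields inside struct inpcb. */
struct tuple4 {
	uint32_t laddr, faddr;	/* local and foreign IPv4 addresses */
	uint16_t lport, fport;	/* local and foreign ports */
};

/* Fold one 32-bit word into an FNV-1a hash, one byte at a time. */
static uint32_t
fnv1a_word(uint32_t h, uint32_t w)
{
	for (int i = 0; i < 4; i++) {
		h ^= (w >> (8 * i)) & 0xff;
		h *= 16777619u;
	}
	return (h);
}

/* Illustrative software hash of the whole 4-tuple (not Toeplitz). */
static uint32_t
tuple_hash(const struct tuple4 *t)
{
	uint32_t h = 2166136261u;

	h = fnv1a_word(h, t->laddr);
	h = fnv1a_word(h, t->faddr);
	h = fnv1a_word(h, ((uint32_t)t->lport << 16) | t->fport);
	return (h);
}

/* Map a connection to one of ngroups connection groups. */
static unsigned int
group_index(const struct tuple4 *t, unsigned int ngroups)
{
	assert(ngroups > 0);
	return (tuple_hash(t) % ngroups);
}
```

Because the index depends only on the 4-tuple, every packet of a given connection selects the same group, which is what lets a group stay on one CPU. Wildcard sockets have no complete 4-tuple and so cannot be placed this way, which is why the message has them join all groups.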
|
|
|
struct inpcbgroup *
|
|
|
|
in_pcbgroup_byhash(struct inpcbinfo *, u_int, uint32_t);
|
|
|
|
struct inpcbgroup *
|
|
|
|
in_pcbgroup_byinpcb(struct inpcb *);
|
|
|
|
struct inpcbgroup *
|
|
|
|
in_pcbgroup_bytuple(struct inpcbinfo *, struct in_addr, u_short,
|
|
|
|
struct in_addr, u_short);
|
|
|
|
void in_pcbgroup_destroy(struct inpcbinfo *);
|
|
|
|
int in_pcbgroup_enabled(struct inpcbinfo *);
|
|
|
|
void in_pcbgroup_init(struct inpcbinfo *, u_int, int);
|
|
|
|
void in_pcbgroup_remove(struct inpcb *);
|
|
|
|
void in_pcbgroup_update(struct inpcb *);
|
|
|
|
void in_pcbgroup_update_mbuf(struct inpcb *, struct mbuf *);
|
2010-03-14 18:59:11 +00:00
|
|
|
|
2002-06-10 20:05:46 +00:00
|
|
|
void in_pcbpurgeif0(struct inpcbinfo *, struct ifnet *);
|
2006-07-18 22:34:27 +00:00
|
|
|
int in_pcballoc(struct socket *, struct inpcbinfo *);
|
2004-03-27 21:05:46 +00:00
|
|
|
int in_pcbbind(struct inpcb *, struct sockaddr *, struct ucred *);
|
2011-03-12 21:46:37 +00:00
|
|
|
int in_pcb_lport(struct inpcb *, struct in_addr *, u_short *,
|
|
|
|
struct ucred *, int);
|
2002-10-20 21:44:31 +00:00
|
|
|
int in_pcbbind_setup(struct inpcb *, struct sockaddr *, in_addr_t *,
|
2004-03-27 21:05:46 +00:00
|
|
|
u_short *, struct ucred *);
|
|
|
|
int in_pcbconnect(struct inpcb *, struct sockaddr *, struct ucred *);
|
Add _mbuf() variants of various inpcb-related interfaces, including lookup,
hash install, etc. For now, these arguments are unused, but as we add
RSS support, we will want to use hashes extracted from mbufs, rather than
manually calculated hashes of header fields, due to the expense of the
software version of Toeplitz (and similar hashes).
Add notes that it would be nice to be able to pass mbufs into lookup
routines in pf(4), optimising firewall lookup in the same way, but the
code structure there doesn't facilitate that currently.
(In principle there is no reason this couldn't be MFCed -- the change
extends rather than modifies the KBI. However, it won't be useful without
other previous possibly less MFCable changes.)
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-06-04 16:33:06 +00:00
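The motivation for the _mbuf() variants can be sketched as follows: if the packet already carries a hash computed by the NIC, use it, and fall back to a software hash only when it does not. The struct pkt type, its fields, and both function names are hypothetical stand-ins (the real struct mbuf is not modelled here), and software_hash() is a cheap placeholder rather than Toeplitz.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an mbuf that may carry a NIC-computed hash. */
struct pkt {
	bool	 has_hw_hash;	/* true if the NIC supplied a flow hash */
	uint32_t hw_hash;	/* the NIC-supplied value, if any */
};

/* Placeholder for an expensive software hash of the header fields. */
static uint32_t
software_hash(uint32_t laddr, uint32_t faddr, uint16_t lport, uint16_t fport)
{
	return (((laddr ^ faddr) * 2654435761u) ^
	    (((uint32_t)lport << 16) | fport));
}

/* Prefer the hash that arrived with the packet; otherwise compute one. */
static uint32_t
flow_hash(const struct pkt *p, uint32_t laddr, uint32_t faddr,
    uint16_t lport, uint16_t fport)
{
	if (p != NULL && p->has_hw_hash)
		return (p->hw_hash);
	return (software_hash(laddr, faddr, lport, fport));
}
```

Passing NULL for the packet mirrors callers that have no mbuf to hand, such as the non-_mbuf() entry points.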
|
|
|
int in_pcbconnect_mbuf(struct inpcb *, struct sockaddr *, struct ucred *,
|
|
|
|
struct mbuf *);
|
2002-10-21 13:55:50 +00:00
|
|
|
int in_pcbconnect_setup(struct inpcb *, struct sockaddr *, in_addr_t *,
|
|
|
|
u_short *, in_addr_t *, u_short *, struct inpcb **,
|
2004-03-27 21:05:46 +00:00
|
|
|
struct ucred *);
|
2002-03-19 21:25:46 +00:00
|
|
|
void in_pcbdetach(struct inpcb *);
|
|
|
|
void in_pcbdisconnect(struct inpcb *);
|
2006-04-25 11:17:35 +00:00
|
|
|
void in_pcbdrop(struct inpcb *);
|
2006-04-01 16:04:42 +00:00
|
|
|
void in_pcbfree(struct inpcb *);
|
2002-03-19 21:25:46 +00:00
|
|
|
int in_pcbinshash(struct inpcb *);
|
2011-06-06 12:55:02 +00:00
|
|
|
int in_pcbinshash_nopcbgroup(struct inpcb *);
|
2014-04-24 12:52:31 +00:00
|
|
|
int in_pcbladdr(struct inpcb *, struct in_addr *, struct in_addr *,
|
|
|
|
struct ucred *);
|
1994-05-24 10:09:53 +00:00
|
|
|
struct inpcb *
|
2002-03-19 21:25:46 +00:00
|
|
|
in_pcblookup_local(struct inpcbinfo *,
|
2008-07-10 13:31:11 +00:00
|
|
|
struct in_addr, u_short, int, struct ucred *);
|
1995-04-09 01:29:31 +00:00
|
|
|
struct inpcb *
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags,
supplementing the existing INPLOOKUP_WILDCARD flag, are:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
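The rule that callers pass exactly one of the two lock flags can be expressed as a small check. The flag names below come from the message above, but the numeric values, the INPLOOKUP_LOCKMASK macro, and the lookup_flags_valid() helper are assumptions for illustration only.

```c
#include <stdbool.h>

/* Flag names follow the commit message; the values here are assumed. */
#define	INPLOOKUP_WILDCARD	0x01
#define	INPLOOKUP_RLOCKPCB	0x02
#define	INPLOOKUP_WLOCKPCB	0x04

/* Hypothetical mask covering the two mutually exclusive lock flags. */
#define	INPLOOKUP_LOCKMASK	(INPLOOKUP_RLOCKPCB | INPLOOKUP_WLOCKPCB)

/* True when exactly one of the lock flags is set; other bits are ignored. */
static bool
lookup_flags_valid(int flags)
{
	int lock = flags & INPLOOKUP_LOCKMASK;

	return (lock == INPLOOKUP_RLOCKPCB || lock == INPLOOKUP_WLOCKPCB);
}
```

A lookup routine could assert this on entry, so that a caller passing neither flag or both flags fails early instead of getting back an inpcb in an unexpected lock state.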
|
|
|
in_pcblookup(struct inpcbinfo *, struct in_addr, u_int,
|
2002-03-24 10:19:10 +00:00
|
|
|
struct in_addr, u_int, int, struct ifnet *);
|
2011-06-04 16:33:06 +00:00
|
|
|
struct inpcb *
|
|
|
|
in_pcblookup_mbuf(struct inpcbinfo *, struct in_addr, u_int,
|
|
|
|
struct in_addr, u_int, int, struct ifnet *, struct mbuf *);
|
2002-06-10 20:05:46 +00:00
|
|
|
void in_pcbnotifyall(struct inpcbinfo *pcbinfo, struct in_addr,
|
2002-06-14 08:35:21 +00:00
|
|
|
int, struct inpcb *(*)(struct inpcb *, int));
|
Add a reference count to struct inpcb, which may be explicitly
incremented using in_pcbref(), and decremented using in_pcbfree()
or in_pcbrele(). Protocols using only current in_pcballoc() and
in_pcbfree() calls will see the same semantics, but it is now
possible for TCP to call in_pcbref() and in_pcbrele() to prevent
an inpcb from being freed when both tcbinfo and per-inpcb locks
are released. This makes it possible to safely transition from
holding only the inpcb lock to both tcbinfo and inpcb lock
without re-looking up a connection in the input path, timer
path, etc.
Notice that in_pcbrele() does not unlock the connection after
decrementing the refcount, if the connection remains, so that
the caller can continue to use it; in_pcbrele() returns a flag
indicating whether or not the inpcb pointer is still valid, and
in_pcbfree() is now a simple wrapper around in_pcbrele().
MFC after: 1 month
Discussed with: bz, kmacy
Reviewed by: bz, gnn, kmacy
Tested by: kmacy
2008-12-08 20:18:50 +00:00
|
|
|
void in_pcbref(struct inpcb *);
|
2002-03-19 21:25:46 +00:00
|
|
|
void in_pcbrehash(struct inpcb *);
|
2011-06-04 16:33:06 +00:00
|
|
|
void in_pcbrehash_mbuf(struct inpcb *, struct mbuf *);
|
2008-12-08 20:18:50 +00:00
|
|
|
int in_pcbrele(struct inpcb *);
|
Continue to refine inpcb reference counting and locking, in preparation for
reworking of inpcbinfo locking:
(1) Convert inpcb reference counting from manually manipulated integers to
the refcount(9) KPI. This allows the refcount to be managed atomically
with an inpcb read lock rather than write lock, or even with no inpcb
lock at all. As a result, in_pcbref() also no longer requires an inpcb
lock, so can be performed solely using the lock used to look up an
inpcb.
(2) Shift more inpcb freeing activity from the in_pcbrele() context (via
in_pcbfree_internal) to the explicit in_pcbfree() context. This means
that the inpcb refcount is increasingly used only to maintain memory
stability, not actually defer the clean up of inpcb protocol parts.
This is desirable as many of those protocol parts required the pcbinfo
lock, which we'd like not to acquire in in_pcbrele() contexts. Document
this in comments better.
(3) Introduce new read-locked and write-locked in_pcbrele() variations,
in_pcbrele_rlocked() and in_pcbrele_wlocked(), which allow the inpcb to
be properly unlocked as needed. in_pcbrele() is a wrapper around the
latter, and should probably go away at some point. This makes it
easier to use this weak reference model when holding only a read lock,
as will happen in the future.
This may well be safe to MFC, but some more KBI analysis is required.
Reviewed by: bz
MFC after: 3 weeks
Sponsored by: Juniper Networks, Inc.
2011-05-23 19:32:02 +00:00
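Item (1) above, an atomically managed reference count that needs no object lock, can be approximated in userspace with C11 atomics standing in for refcount(9). The struct obj type and the obj_alloc(), obj_ref(), and obj_rele() helpers are hypothetical, and the return convention (true when the object was freed) is an assumption chosen for this sketch.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical reference-counted object standing in for struct inpcb. */
struct obj {
	atomic_uint refcount;
};

/* Allocate an object holding a single reference owned by the caller. */
static struct obj *
obj_alloc(void)
{
	struct obj *o = malloc(sizeof(*o));

	if (o != NULL)
		atomic_init(&o->refcount, 1);
	return (o);
}

/* Take an extra reference; no lock on the object itself is required. */
static void
obj_ref(struct obj *o)
{
	atomic_fetch_add(&o->refcount, 1);
}

/*
 * Drop a reference.  Returns true if this was the last reference and the
 * object has been freed, in which case the pointer must not be used again.
 */
static bool
obj_rele(struct obj *o)
{
	if (atomic_fetch_sub(&o->refcount, 1) != 1)
		return (false);
	free(o);
	return (true);
}
```

The returned flag plays the role described for in_pcbrele(): it tells the caller whether the pointer is still usable after the release.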
|
|
|
int in_pcbrele_rlocked(struct inpcb *);
|
|
|
|
int in_pcbrele_wlocked(struct inpcb *);
|
Introduce a MAC label reference in 'struct inpcb', which caches
the MAC label referenced from 'struct socket' in the IPv4 and
IPv6-based protocols. This permits MAC labels to be checked during
network delivery operations without dereferencing inp->inp_socket
to get to so->so_label, which will eventually avoid our having to
grab the socket lock during delivery at the network layer.
This change introduces 'struct inpcb' as a labeled object to the
MAC Framework, along with the normal circus of entry points:
initialization, creation from socket, destruction, as well as a
delivery access control check.
For most policies, the inpcb label will simply be a cache of the
socket label, so a new protocol switch method is introduced,
pr_sosetlabel() to notify protocols that the socket layer label
has been updated so that the cache can be updated while holding
appropriate locks. Most protocols implement this using
pru_sosetlabel_null(), but IPv4/IPv6 protocols using inpcbs use
the worker function in_pcbsosetlabel(), which calls into the
MAC Framework to perform a cache update.
Biba, LOMAC, and MLS implement these entry points, as do the stub
policy, and test policy.
Reviewed by: sam, bms
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
2003-11-18 00:39:07 +00:00
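The caching arrangement described above can be sketched with toy types: the socket owns the label, the pcb keeps a copy, and a notification hook refreshes the copy whenever the socket label changes. Every name below (struct label, struct sock, struct pcb, pcb_sosetlabel(), sock_setlabel(), deliver_check()) is hypothetical, and the integer "level" comparison is a made-up policy, not how Biba, LOMAC, or MLS evaluate labels.

```c
#include <stddef.h>

/* Toy label: a single integer level stands in for a real MAC label. */
struct label {
	int level;
};

struct pcb;

/* Hypothetical socket: owns the authoritative label. */
struct sock {
	struct label	 so_label;
	struct pcb	*so_pcb;
};

/* Hypothetical pcb: keeps a cached copy of the socket's label. */
struct pcb {
	struct label	 inp_label;
	struct sock	*inp_socket;
};

/* Notification hook: refresh the pcb's cached label from the socket. */
static void
pcb_sosetlabel(struct sock *so)
{
	if (so->so_pcb != NULL)
		so->so_pcb->inp_label = so->so_label;
}

/* Relabel the socket, then notify the protocol so the cache follows. */
static void
sock_setlabel(struct sock *so, int level)
{
	so->so_label.level = level;
	pcb_sosetlabel(so);
}

/* Delivery check that reads only the cached label, never the socket. */
static int
deliver_check(const struct pcb *inp, int pkt_level)
{
	return (pkt_level <= inp->inp_label.level);
}
```

The point of the arrangement is visible in deliver_check(): it never follows inp_socket, so the delivery path does not need the socket or its lock.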
|
|
|
void in_pcbsetsolabel(struct socket *so);
|
2007-05-11 10:20:51 +00:00
|
|
|
int in_getpeeraddr(struct socket *so, struct sockaddr **nam);
|
|
|
|
int in_getsockaddr(struct socket *so, struct sockaddr **nam);
|
2002-08-21 11:57:12 +00:00
|
|
|
struct sockaddr *
|
|
|
|
in_sockaddr(in_port_t port, struct in_addr *addr);
|
2003-11-18 00:39:07 +00:00
|
|
|
void in_pcbsosetlabel(struct socket *so);
|
1999-12-29 04:46:21 +00:00
|
|
|
#endif /* _KERNEL */
|
1998-03-24 18:06:34 +00:00
|
|
|
|
1998-03-28 10:18:26 +00:00
|
|
|
#endif /* !_NETINET_IN_PCB_H_ */
|