2014-07-21 04:48:02 +00:00
|
|
|
.\" Copyright (c) 2014 Adrian Chadd
|
|
|
|
.\" All rights reserved.
|
|
|
|
.\"
|
|
|
|
.\" Redistribution and use in source and binary forms, with or without
|
|
|
|
.\" modification, are permitted provided that the following conditions
|
|
|
|
.\" are met:
|
|
|
|
.\" 1. Redistributions of source code must retain the above copyright
|
|
|
|
.\" notice, this list of conditions and the following disclaimer.
|
|
|
|
.\" 2. Redistributions in binary form must reproduce the above copyright
|
|
|
|
.\" notice, this list of conditions and the following disclaimer in the
|
|
|
|
.\" documentation and/or other materials provided with the distribution.
|
|
|
|
.\" 3. The name of the author may not be used to endorse or promote products
|
|
|
|
.\" derived from this software without specific prior written permission.
|
|
|
|
.\"
|
|
|
|
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
|
|
|
|
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
|
|
|
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
|
|
|
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
|
|
|
|
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
|
|
|
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
|
|
|
|
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
|
|
|
|
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
|
|
|
|
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
|
|
|
|
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
|
|
|
|
.\" SUCH DAMAGE.
|
|
|
|
.\"
|
|
|
|
.\" $FreeBSD$
|
|
|
|
.\"
|
|
|
|
.Dd July 18, 2014
|
2014-07-22 22:16:23 +00:00
|
|
|
.Dt PCBGROUP 9
|
2014-07-21 04:48:02 +00:00
|
|
|
.Os
|
|
|
|
.Sh NAME
|
2014-07-22 22:16:23 +00:00
|
|
|
.Nm PCBGROUP
|
2014-07-21 04:48:02 +00:00
|
|
|
.Nd Distributed Protocol Control Block Groups
|
|
|
|
.Sh SYNOPSIS
|
|
|
|
.Ft void
|
2014-07-21 08:42:35 +00:00
|
|
|
.Fo in_pcbgroup_init
|
|
|
|
.Fa "struct inpcbinfo *pcbinfo" "u_int hashfields" "int hash_nelements"
|
|
|
|
.Fc
|
2014-07-21 04:48:02 +00:00
|
|
|
.Ft void
|
2014-07-21 08:42:35 +00:00
|
|
|
.Fn in_pcbgroup_destroy "struct inpcbinfo *pcbinfo"
|
2014-07-21 04:48:02 +00:00
|
|
|
.Ft struct inpcbgroup *
|
2014-07-21 08:42:35 +00:00
|
|
|
.Fo in_pcbgroup_byhash
|
|
|
|
.Fa "struct inpcbinfo *pcbinfo" "u_int hashtype" "uint32_t hash"
|
|
|
|
.Fc
|
2014-07-21 04:48:02 +00:00
|
|
|
.Ft struct inpcbgroup *
|
2014-07-21 08:42:35 +00:00
|
|
|
.Fn in_pcbgroup_byinpcb "struct inpcb *inp"
|
2014-07-21 04:48:02 +00:00
|
|
|
.Ft void
|
2014-07-21 08:42:35 +00:00
|
|
|
.Fn in_pcbgroup_update "struct inpcb *inp"
|
2014-07-21 04:48:02 +00:00
|
|
|
.Ft void
|
2014-07-21 08:42:35 +00:00
|
|
|
.Fn in_pcbgroup_update_mbuf "struct inpcb *inp" "struct mbuf *m"
|
2014-07-21 04:48:02 +00:00
|
|
|
.Ft void
|
2014-07-21 08:42:35 +00:00
|
|
|
.Fn in_pcbgroup_remove "struct inpcb *inp"
|
2014-07-21 04:48:02 +00:00
|
|
|
.Ft int
|
2014-07-21 08:42:35 +00:00
|
|
|
.Fn in_pcbgroup_enabled "struct inpcbinfo *pcbinfo"
|
2014-07-21 04:48:02 +00:00
|
|
|
.Ft struct inpcbgroup *
|
2014-07-21 08:42:35 +00:00
|
|
|
.Fo in6_pcbgroup_byhash
|
|
|
|
.Fa "struct inpcbinfo *pcbinfo" "u_int hashtype" "uint32_t hash"
|
|
|
|
.Fc
|
2014-07-21 04:48:02 +00:00
|
|
|
.Pp
|
2014-07-22 22:16:23 +00:00
|
|
|
.Cd "options PCBGROUP"
|
2014-07-21 04:48:02 +00:00
|
|
|
.Sh DESCRIPTION
|
2014-07-22 22:16:23 +00:00
|
|
|
PCBGROUP, or "connection groups", are based on Willman, Rixner, and Cox's
|
2014-07-21 04:48:02 +00:00
|
|
|
2006 USENIX paper,
|
|
|
|
.Qo
|
|
|
|
An Evaluation of Network Stack Parallelization Strategies in Modern
|
|
|
|
Operating Systems
|
|
|
|
.Qc .
|
|
|
|
.Pp
|
2014-07-22 22:16:23 +00:00
|
|
|
The PCBGROUP paper describes two main kind of connection groups.
|
2014-07-21 04:48:02 +00:00
|
|
|
The first, called ConnP-T, uses a pool of worker threads which
|
|
|
|
implement the network stack.
|
|
|
|
Serialization occurs when queuing work into and completing work from
|
|
|
|
the network stack.
|
|
|
|
No locking is required inside each worker thread.
|
|
|
|
.Pp
|
|
|
|
The second type of connection group, called ConnP-L, uses an array
|
|
|
|
of PCB groups rather than a single list.
|
|
|
|
Each PCB group is protected by its own lock.
|
|
|
|
.Pp
|
|
|
|
This implementation differs significantly from that described in the
|
|
|
|
paper, in that it attempts to introduce not just notions of affinity
|
|
|
|
for connections and distribute work so as to reduce lock contention,
|
|
|
|
but also align those notions with hardware work distribution strategies
|
|
|
|
such as RSS.
|
|
|
|
In this construction, connection groups supplement, rather than replace,
|
|
|
|
existing reservation tables for protocol 4-tuples, offering CPU-affine
|
|
|
|
lookup tables with minimal cache line migration and lock contention
|
|
|
|
during steady state operation.
|
|
|
|
.Pp
|
|
|
|
Internet protocols like UDP and TCP register to use connection groups
|
|
|
|
by providing an ipi_hashfields value other than IPI_HASHFIELDS_NONE.
|
|
|
|
This indicates to the connection group code whether a 2-tuple or
|
|
|
|
4-tuple is used as an argument to hashes that assign a connection to
|
|
|
|
a particular group.
|
|
|
|
This must be aligned with any hardware-offloaded distribution model,
|
|
|
|
such as RSS or similar approaches taken in embedded network boards.
|
|
|
|
Wildcard sockets require special handling, as in Willman 2006, and
|
|
|
|
are shared between connection groups while being protected by
|
|
|
|
group-local locks.
|
|
|
|
Connection establishment and teardown can be signficantly more
|
|
|
|
expensive than without connection groups, but that steady-state
|
|
|
|
processing can be significantly faster.
|
|
|
|
.Pp
|
2014-07-22 22:16:23 +00:00
|
|
|
Enabling PCBGROUP in the kernel only provides the infrastructure
|
2014-07-21 04:48:02 +00:00
|
|
|
required to create and manage multiple PCB groups.
|
|
|
|
An implementation needs to fill in a few functions to provide PCB
|
|
|
|
group hash information in order for PCBs to be placed in a PCB group.
|
|
|
|
.Ss Operation
|
|
|
|
By default, each PCB info block (struct pcbinfo) has a single hash for
|
|
|
|
all PCB entries for the given protocol with a single lock protecting it.
|
|
|
|
This can be a significant source of lock contention on SMP hardware.
|
|
|
|
When a PCBGROUP is created, an array of separate hash tables are
|
|
|
|
created, each with its own lock.
|
|
|
|
A separate table for wildcard PCBs is provided.
|
|
|
|
By default, a PCBGROUP table is created for each available CPU.
|
|
|
|
The PCBGROUP code attempts to calculate a hash value from the given
|
|
|
|
PCB or mbuf when looking up a PCBGROUP.
|
|
|
|
While processing a received frame,
|
2014-07-21 08:42:35 +00:00
|
|
|
.Fn in_pcbgroup_byhash
|
2014-07-21 04:48:02 +00:00
|
|
|
can be used in conjunction with either a hardware-provided hash
|
|
|
|
value
|
|
|
|
.Po
|
|
|
|
eg the
|
|
|
|
.Xr RSS 9
|
|
|
|
calculated hash value provided by some NICs
|
|
|
|
.Pc
|
|
|
|
or a software-provided hash value in order to choose a PCBGROUP
|
|
|
|
table to query.
|
|
|
|
A single table lock is held while performing a wildcard match.
|
|
|
|
However, all of the table locks are acquired before modifying the
|
|
|
|
wildcard table.
|
|
|
|
The PCBGROUP tables operate in conjunction with the normal single PCB list
|
|
|
|
in a PCB info block.
|
|
|
|
Thus, inserting and removing a PCB will still incur the same costs
|
2014-07-22 22:16:23 +00:00
|
|
|
as without PCBGROUP.
|
|
|
|
A protocol which uses PCBGROUP should fall back to the normal PCB list
|
|
|
|
lookup if a call to the PCBGROUP layer does not yield a lookup hit.
|
2014-07-21 04:48:02 +00:00
|
|
|
.Ss Usage
|
|
|
|
Initialize a PCBGROUP in a PCB info block
|
|
|
|
.Pq Vt "struct pcbinfo"
|
|
|
|
by calling
|
|
|
|
.Fn in_pcbgroup_init .
|
|
|
|
.Pp
|
|
|
|
Add a connection to a PCBGROUP with
|
|
|
|
.Fn in_pcbgroup_update .
|
|
|
|
Connections are removed by with
|
|
|
|
.Fn in_pcbgroup_remove .
|
|
|
|
These in turn will determine which PCBGROUP bucket the given PCB
|
|
|
|
is placed into and calculate the hash value appropriately.
|
|
|
|
.Pp
|
|
|
|
Wildcard PCBs are hashed differently and placed in a single wildcard
|
|
|
|
PCB list.
|
|
|
|
If
|
|
|
|
.Xr RSS 9
|
|
|
|
is enabled and in use, RSS-aware wildcard PCBs are placed in a single
|
|
|
|
PCBGROUP based on RSS information.
|
|
|
|
Protocols may look up the PCB entry in a PCBGROUP by using the lookup
|
|
|
|
functions
|
|
|
|
.Fn in_pcbgroup_byhash
|
|
|
|
and
|
|
|
|
.Fn in_pcbgroup_byinpcb .
|
|
|
|
.Sh IMPLEMENTATION NOTES
|
|
|
|
The PCB code in
|
|
|
|
.Pa sys/netinet
|
|
|
|
and
|
|
|
|
.Pa sys/netinet6
|
2014-07-22 22:16:23 +00:00
|
|
|
is aware of PCBGROUP and will call into the PCBGROUP code to do
|
2014-07-21 04:48:02 +00:00
|
|
|
PCBGROUP assignment and lookup, preferring a PCBGROUP lookup to the
|
|
|
|
default global PCB info table.
|
|
|
|
.Pp
|
|
|
|
An implementor wishing to experiment or modify the PCBGROUP assignment
|
|
|
|
should modify this set of functions:
|
|
|
|
.Bl -tag -width "12345678" -offset indent
|
|
|
|
.It Fn in_pcbgroup_getbucket No and Fn in6_pcbgroup_getbucket
|
|
|
|
Map a given 32 bit hash value to a PCBGROUP.
|
|
|
|
By default this is hash % number_of_pcbgroups.
|
|
|
|
However, this distribution may not align with NIC receive queues or
|
|
|
|
the
|
|
|
|
.Xr netisr 9
|
|
|
|
configuration.
|
|
|
|
.It Fn in_pcbgroup_byhash No and Fn in6_pcbgroup_byhash
|
|
|
|
Map a 32 bit hash value and a hash type identifier to a PCBGROUP.
|
|
|
|
By default, this simply returns NULL.
|
|
|
|
This function is used by the
|
|
|
|
.Xr mbuf 9
|
|
|
|
receive path in
|
|
|
|
.Pa sys/netinet/in_pcb.c
|
|
|
|
to map an mbuf to a PCBGROUP.
|
|
|
|
.It Fn in_pcbgroup_bytuple No and Fn in6_pcbgroup_bytuple
|
|
|
|
Map the source and destination address and port details to a PCBGROUP.
|
|
|
|
By default, this does a very simple XOR hash.
|
|
|
|
This function is used by both the PCB lookup code and as a fallback in
|
|
|
|
the
|
|
|
|
.Xr mbuf 9
|
|
|
|
receive path in
|
|
|
|
.Pa sys/netinet/in_pcb.c .
|
|
|
|
.El
|
|
|
|
.Sh SEE ALSO
|
|
|
|
.Xr mbuf 9 ,
|
2014-07-21 08:42:35 +00:00
|
|
|
.Xr netisr 9 ,
|
|
|
|
.Xr RSS 9
|
2014-07-21 04:48:02 +00:00
|
|
|
.Sh HISTORY
|
2014-07-22 22:16:23 +00:00
|
|
|
PCBGROUP first appeared in
|
2014-07-21 08:47:54 +00:00
|
|
|
.Fx 9.0 .
|
2014-07-21 04:48:02 +00:00
|
|
|
.Pp
|
2014-07-22 22:16:23 +00:00
|
|
|
The PCBGROUP implementation is inspired by Willman, Rixner, and Cox's
|
2014-07-21 04:48:02 +00:00
|
|
|
2006 USENIX paper,
|
|
|
|
.Qo
|
|
|
|
An Evaluation of Network Stack Parallelization Strategies in Modern
|
|
|
|
Operating Systems
|
|
|
|
.Qc :
|
|
|
|
.Li http://www.ece.rice.edu/~willmann/pubs/paranet_usenix.pdf
|
|
|
|
.Sh AUTHORS
|
|
|
|
.An -nosplit
|
2014-07-22 22:16:23 +00:00
|
|
|
The PCBGROUP implementation was written by
|
2014-07-21 04:48:02 +00:00
|
|
|
.An Robert N. M. Watson Aq Mt rwatson@FreeBSD.org
|
|
|
|
under contract to Juniper Networks, Inc.
|
|
|
|
.Pp
|
|
|
|
This manual page written by
|
|
|
|
.An Adrian Chadd Aq Mt adrian@FreeBSD.org .
|
|
|
|
.Sh NOTES
|
|
|
|
The
|
|
|
|
.Xr RSS 9
|
|
|
|
implementation currently uses
|
|
|
|
.Ic #ifdef
|
2014-07-22 22:16:23 +00:00
|
|
|
blocks to tie into PCBGROUP.
|
2014-07-21 04:48:02 +00:00
|
|
|
This is a sign that a more abstract programming API is needed.
|
|
|
|
.Pp
|
2014-07-22 22:16:23 +00:00
|
|
|
There is currently no support for re-balancing the PCBGROUP assignment,
|
2014-07-21 04:48:02 +00:00
|
|
|
nor is there any support for overriding which PCBGROUP a socket/PCB
|
|
|
|
should be in.
|
|
|
|
.Pp
|
2014-07-22 22:16:23 +00:00
|
|
|
No statistics are kept to indicate how often PCBGROUP lookups
|
2014-07-21 04:48:02 +00:00
|
|
|
succeed or fail.
|