Several years after initial development, merge prototype support for
linking NIC Receive Side Scaling (RSS) to the network stack's
connection-group implementation. This prototype (and derived patches)
is in use at Juniper and several other FreeBSD-using companies, so
despite some reservations about its maturity, merge the patch to the
base tree so that it can be iteratively refined in collaboration rather
than maintained as a set of gradually diverging patch sets.

(1) Merge a software implementation of the Toeplitz hash specified in
    RSS, implemented by David Malone. This is used to allow suitable
    pcbgroup placement of connections before the first packet is
    received from the NIC. Software hashing is generally avoided,
    however, due to the high cost of the hash on general-purpose CPUs.

(2) In in_rss.c, maintain authoritative versions of RSS state intended
    to be pushed to each NIC, including keying material, hash
    algorithm/configuration, and buckets. Provide software-facing
    interfaces to hash 2- and 4-tuples for IPv4 and IPv6 using both
    the RSS-standardised Toeplitz and a 'naive' variation with a hash
    efficient in software but with poor distribution properties.
    Implement rss_m2cpuid() to be used by netisr and other load
    balancing code to look up the CPU on which an mbuf should be
    processed.

(3) In the Ethernet link layer, allow netisr distribution using RSS as
    a source of policy as an alternative to source ordering; continue
    to default to direct dispatch (i.e., don't try to requeue packets
    for processing on the 'right' CPU if they arrive in a directly
    dispatchable context).

(4) Allow RSS to control tuning of connection groups in order to align
    groups with RSS buckets. If a packet arrives on a protocol using
    connection groups, and contains a suitable hardware-generated
    hash, use that hash value to select the connection group for pcb
    lookup for both IPv4 and IPv6. If no hardware-generated Toeplitz
    hash is available, we fall back on regular PCB lookup, risking
    contention, rather than pay the cost of Toeplitz in software;
    this is a less scalable but, at my last measurement, faster
    approach. As core counts go up, we may want to revise this
    strategy despite CPU overhead.

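As a rough illustration of item (1) and the tuple-hashing interfaces in
item (2), the following is a minimal userspace sketch of a Toeplitz hash
over an IPv4 4-tuple. It is not the code merged by this commit: the names
toeplitz_hash_sketch() and hash_ip4_4tuple_sketch() are hypothetical, and
the field order (source address, destination address, source port,
destination port, each big-endian) is assumed from the usual RSS
convention.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Toeplitz hash sketch (hypothetical helper, not the kernel routine).
 * For every set bit of the input, XOR in the 32-bit window of the key
 * that starts at that bit position.
 */
static uint32_t
toeplitz_hash_sketch(const uint8_t *key, size_t keylen,
    const uint8_t *data, size_t datalen)
{
	uint32_t hash = 0, window;
	size_t i;
	int b;

	assert(keylen >= 4);
	window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
	    ((uint32_t)key[2] << 8) | (uint32_t)key[3];
	for (i = 0; i < datalen; i++) {
		for (b = 7; b >= 0; b--) {
			if (data[i] & (1u << b))
				hash ^= window;
			/* Slide the window one bit along the key. */
			window <<= 1;
			if (i + 4 < keylen && (key[i + 4] & (1u << b)))
				window |= 1;
		}
	}
	return (hash);
}

/*
 * Hash an IPv4 4-tuple given in host byte order (assumed layout:
 * src addr, dst addr, src port, dst port, all serialised big-endian).
 */
static uint32_t
hash_ip4_4tuple_sketch(const uint8_t *key, size_t keylen,
    uint32_t src, uint16_t srcport, uint32_t dst, uint16_t dstport)
{
	uint8_t data[12];

	data[0] = (uint8_t)(src >> 24);
	data[1] = (uint8_t)(src >> 16);
	data[2] = (uint8_t)(src >> 8);
	data[3] = (uint8_t)src;
	data[4] = (uint8_t)(dst >> 24);
	data[5] = (uint8_t)(dst >> 16);
	data[6] = (uint8_t)(dst >> 8);
	data[7] = (uint8_t)dst;
	data[8] = (uint8_t)(srcport >> 8);
	data[9] = (uint8_t)srcport;
	data[10] = (uint8_t)(dstport >> 8);
	data[11] = (uint8_t)dstport;
	return (toeplitz_hash_sketch(key, keylen, data, sizeof(data)));
}
```

The inner loop performs one shift and one conditional XOR per input bit,
which gives a feel for why the text above treats software Toeplitz as
expensive compared with a hash the NIC has already computed.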
Where device drivers suitably configure NICs, and connection groups /
RSS are enabled, this should avoid both lock and cache-line contention during
connection lookup for TCP. This commit does not modify any device
drivers to tune device RSS configuration to the global RSS
configuration; patches are in circulation to do this for at least
Chelsio T3 and Intel 1G/10G drivers. Currently, the KPI for device
drivers is not particularly robust, nor aware of more advanced features
such as runtime reconfiguration/rebalancing. This will hopefully prove
a useful starting point for refinement.
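
To make the driver-facing model more concrete, here is a small userspace
sketch of how a hash might be reduced to a bucket and then to a CPU, which
is the mapping a driver would need to mirror in its NIC's indirection
table. The structure, the function names, and the round-robin
bucket-to-CPU assignment are assumptions made for illustration; they are
not taken from in_rss.c.

```c
#include <stdint.h>

#define	SKETCH_MAXBITS		7	/* Same limit as RSS_MAXBITS. */
#define	SKETCH_TABLE_MAXLEN	(1u << SKETCH_MAXBITS)

/* Hypothetical model of the global RSS state. */
struct rss_sketch {
	unsigned int	bits;		/* log2(number of buckets) */
	unsigned int	ncpus;		/* CPUs taking part in RSS */
	unsigned int	table[SKETCH_TABLE_MAXLEN]; /* bucket -> CPU */
};

/* Populate the table round-robin (an assumed policy). */
static void
rss_sketch_init(struct rss_sketch *rs, unsigned int bits,
    unsigned int ncpus)
{
	unsigned int i;

	if (bits > SKETCH_MAXBITS)
		bits = SKETCH_MAXBITS;
	if (ncpus == 0)
		ncpus = 1;
	rs->bits = bits;
	rs->ncpus = ncpus;
	for (i = 0; i < (1u << bits); i++)
		rs->table[i] = i % ncpus;
}

/* Keep the low-order bits of the hash as the bucket index. */
static unsigned int
rss_sketch_bucket(const struct rss_sketch *rs, uint32_t hash)
{

	return (hash & ((1u << rs->bits) - 1));
}

/* Look up the CPU assigned to the bucket a hash falls into. */
static unsigned int
rss_sketch_cpu(const struct rss_sketch *rs, uint32_t hash)
{

	return (rs->table[rss_sketch_bucket(rs, hash)]);
}
```

Under this model, a driver would program the same bucket count and table
into the hardware, so that the queue a packet arrives on and the CPU chosen
by the stack agree.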
No MFC is scheduled as we will first want to nail down a more mature
and maintainable KPI/KBI for device drivers.
Sponsored by: Juniper Networks (original work)
Sponsored by: EMC/Isilon (patch update and merge)
2014-03-15 00:57:50 +00:00
/*-
 * Copyright (c) 2010-2011 Juniper Networks, Inc.
 * All rights reserved.
 *
 * This software was developed by Robert N. M. Watson under contract
 * to Juniper Networks, Inc.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 *
 * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
 * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 *
 * $FreeBSD$
 */

#ifndef _NETINET_IN_RSS_H_
#define _NETINET_IN_RSS_H_

#include <netinet/in.h> /* in_addr_t */

/*
 * Supported RSS hash functions.
 */
#define RSS_HASH_NAIVE 0x00000001 /* Poor but fast hash. */
#define RSS_HASH_TOEPLITZ 0x00000002 /* Required by RSS. */
#define RSS_HASH_CRC32 0x00000004 /* Future; some NICs do it. */

#define RSS_HASH_MASK (RSS_HASH_NAIVE | RSS_HASH_TOEPLITZ)

/*
 * Instances of struct inpcbinfo declare an RSS hash type indicating what
 * header fields are covered.
 */
#define RSS_HASHFIELDS_NONE 0
#define RSS_HASHFIELDS_4TUPLE 1
#define RSS_HASHFIELDS_2TUPLE 2

/*
 * Compile-time limits on the size of the indirection table.
 */
#define RSS_MAXBITS 7
#define RSS_TABLE_MAXLEN (1 << RSS_MAXBITS)

/*
 * Maximum key size used throughout.  It's OK for hardware to use only the
 * first 16 bytes, which is all that's required for IPv4.
 */
#define RSS_KEYSIZE 40

/*
 * Device driver interfaces to query RSS properties that must be programmed
 * into hardware.
 */
u_int rss_getbits(void);
u_int rss_getbucket(u_int hash);
u_int rss_getcpu(u_int bucket);
void rss_getkey(uint8_t *key);
u_int rss_gethashalgo(void);
u_int rss_getnumbuckets(void);
u_int rss_getnumcpus(void);

/*
 * Network stack interface to generate a hash for a protocol tuple.
 */
uint32_t rss_hash_ip4_4tuple(struct in_addr src, u_short srcport,
    struct in_addr dst, u_short dstport);
uint32_t rss_hash_ip4_2tuple(struct in_addr src, struct in_addr dst);
uint32_t rss_hash_ip6_4tuple(struct in6_addr src, u_short srcport,
    struct in6_addr dst, u_short dstport);
uint32_t rss_hash_ip6_2tuple(struct in6_addr src,
    struct in6_addr dst);

/*
 * Network stack interface to query desired CPU affinity of a packet.
 */
struct mbuf *rss_m2cpuid(struct mbuf *m, uintptr_t source, u_int *cpuid);
2014-05-18 22:32:04 +00:00
u_int rss_hash2cpuid(uint32_t hash_val, uint32_t hash_type);
2014-05-27 08:06:20 +00:00
int rss_hash2bucket(uint32_t hash_val, uint32_t hash_type,
    uint32_t *bucket_id);
int rss_m2bucket(struct mbuf *m, uint32_t *bucket_id);

#endif /* !_NETINET_IN_RSS_H_ */