/*-
 * Copyright (c) 1982, 1989, 1993
 *	The Regents of the University of California.  All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 * 4. Neither the name of the University nor the names of its contributors
 *    may be used to endorse or promote products derived from this software
 *    without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
 * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 *
 *	@(#)if_ethersubr.c	8.1 (Berkeley) 6/10/93
 * $FreeBSD$
 */

#include "opt_inet.h"
#include "opt_inet6.h"
#include "opt_netgraph.h"
#include "opt_mbuf_profiling.h"
Several years after initial development, merge prototype support for
linking NIC Receive Side Scaling (RSS) to the network stack's
connection-group implementation. This prototype (and derived patches)
are in use at Juniper and several other FreeBSD-using companies, so
despite some reservations about its maturity, merge the patch to the
base tree so that it can be iteratively refined in collaboration rather
than maintained as a set of gradually diverging patch sets.
(1) Merge a software implementation of the Toeplitz hash specified in
    RSS, implemented by David Malone. This is used to allow suitable
    pcbgroup placement of connections before the first packet is
    received from the NIC. Software hashing is generally avoided,
    however, due to the high cost of the hash on general-purpose CPUs.
(2) In in_rss.c, maintain authoritative versions of RSS state intended
    to be pushed to each NIC, including keying material, hash
    algorithm/configuration, and buckets. Provide software-facing
    interfaces to hash 2- and 4-tuples for IPv4 and IPv6 using both
    the RSS-standardised Toeplitz and a 'naive' variation with a hash
    efficient in software but with poor distribution properties.
    Implement rss_m2cpuid() to be used by netisr and other load
    balancing code to look up the CPU on which an mbuf should be
    processed.
(3) In the Ethernet link layer, allow netisr distribution using RSS as
    a source of policy as an alternative to source ordering; continue
    to default to direct dispatch (i.e., don't try to requeue packets
    for processing on the 'right' CPU if they arrive in a directly
    dispatchable context).
(4) Allow RSS to control tuning of connection groups in order to align
    groups with RSS buckets. If a packet arrives on a protocol using
    connection groups, and contains a suitable hardware-generated
    hash, use that hash value to select the connection group for pcb
    lookup for both IPv4 and IPv6. If no hardware-generated Toeplitz
    hash is available, we fall back on regular PCB lookup, risking
    contention rather than paying the cost of Toeplitz in software --
    this is a less scalable but, at my last measurement, faster
    approach. As core counts go up, we may want to revise this
    strategy despite the CPU overhead.
Where device drivers suitably configure NICs, and connection groups /
RSS are enabled, this should avoid both lock and line contention during
connection lookup for TCP. This commit does not modify any device
drivers to tune device RSS configuration to the global RSS
configuration; patches are in circulation to do this for at least
Chelsio T3 and Intel 1G/10G drivers. Currently, the KPI for device
drivers is not particularly robust, nor aware of more advanced features
such as runtime reconfiguration/rebalancing. This will hopefully prove
a useful starting point for refinement.
No MFC is scheduled as we will first want to nail down a more mature
and maintainable KPI/KBI for device drivers.
Sponsored by:	Juniper Networks (original work)
Sponsored by:	EMC/Isilon (patch update and merge)
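The software Toeplitz hash mentioned in item (1) can be illustrated with a minimal userland sketch. `toeplitz_hash()` below is a hypothetical stand-alone helper, not the kernel's in_rss.c code: the key is consumed as a sliding 32-bit window, and the hash XORs in that window for every set bit of the input tuple.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Toeplitz hash as specified for RSS.  'key' must be at least
 * datalen + 4 bytes long (the standard RSS key is 40 bytes).
 */
static uint32_t
toeplitz_hash(const uint8_t *key, size_t keylen,
    const uint8_t *data, size_t datalen)
{
	uint32_t hash = 0, v;
	size_t i;
	int b;

	/* v holds the current 32-bit window of the key. */
	v = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
	    ((uint32_t)key[2] << 8) | key[3];
	for (i = 0; i < datalen; i++) {
		for (b = 7; b >= 0; b--) {
			/* XOR in the key window for each set input bit. */
			if (data[i] & (1 << b))
				hash ^= v;
			/* Slide the window left by one key bit. */
			v <<= 1;
			if (4 + i < keylen && (key[4 + i] & (1 << b)))
				v |= 1;
		}
	}
	return (hash);
}
```

Fed the standard Microsoft RSS verification key and an IPv4 source/destination 2-tuple, this reproduces the published test vectors, which is a convenient sanity check for any reimplementation.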
#include "opt_rss.h"

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
Conditionally compile out V_ globals while instantiating the appropriate
container structures, depending on the VIMAGE_GLOBALS compile-time option.
Make VIMAGE_GLOBALS a new compile-time option, which by default will not
be defined, resulting in instantiations of global variables selected for
V_irtualization (enclosed in #ifdef VIMAGE_GLOBALS blocks) being
effectively compiled out. Instantiate new global container structures
to hold V_irtualized variables: vnet_net_0, vnet_inet_0, vnet_inet6_0,
vnet_ipsec_0, vnet_netgraph_0, and vnet_gif_0.
Update the VSYM() macro so that depending on VIMAGE_GLOBALS the V_
macros resolve either to the original globals, or to fields inside
container structures, i.e. effectively
	#ifdef VIMAGE_GLOBALS
	#define V_rt_tables rt_tables
	#else
	#define V_rt_tables vnet_net_0._rt_tables
	#endif
Update SYSCTL_V_*() macros to operate either on globals or on fields
inside container structs.
Extend the internal kldsym() lookups with the ability to resolve
selected fields inside the virtualization container structs. This
applies only to the fields which are explicitly registered for kldsym()
visibility via VNET_MOD_DECLARE() and vnet_mod_register(); currently
this is done only in sys/net/if.c.
Fix a few broken instances of MODULE_GLOBAL() macro use in SCTP code,
and modify the MODULE_GLOBAL() macro to resolve to V_ macros, which in
turn result in proper code being generated depending on VIMAGE_GLOBALS.
De-virtualize local static variables in sys/contrib/pf/net/pf_subr.c
which were prematurely V_irtualized by automated V_ prepending scripts
during earlier merging steps. PF virtualization will be done
separately, most probably after the next PF import.
Convert a few variable initializations at instantiation to
initialization in init functions, most notably in ipfw. Also convert
TUNABLE_INT() initializers for V_ variables to TUNABLE_FETCH_INT() in
initializer functions.
Discussed at:	devsummit Strassburg
Reviewed by:	bz, julian
Approved by:	julian (mentor)
Obtained from:	//depot/projects/vimage-commit2/...
X-MFC after:	never
Sponsored by:	NLnet Foundation, The FreeBSD Foundation
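The container/accessor pattern this change describes can be reduced to a small self-contained sketch. The names here (`struct vnet_net`, `_if_index`, `alloc_if_index()`) are hypothetical stand-ins chosen for illustration; only the shape of the `V_` macro matches the commit message.

```c
#include <assert.h>

/* A per-vnet container struct whose fields replace former globals. */
struct vnet_net {
	int	_if_index;	/* formerly a bare global 'if_index' */
};

static struct vnet_net vnet_net_0;	/* the default instance */

/*
 * The V_ accessor resolves either to the original global or to the
 * container field, depending on VIMAGE_GLOBALS, exactly as in the
 * #ifdef example quoted above.
 */
#ifdef VIMAGE_GLOBALS
static int if_index;
#define	V_if_index	if_index
#else
#define	V_if_index	vnet_net_0._if_index
#endif

static int
alloc_if_index(void)
{
	/* Client code uses V_if_index; the macro picks the storage. */
	return (++V_if_index);
}
```

Because all consumers go through the macro, the same source compiles in both configurations; only the storage location of the variable changes.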
#include <sys/lock.h>
#include <sys/malloc.h>
#include <sys/module.h>
#include <sys/mbuf.h>
#include <sys/random.h>
#include <sys/socket.h>
#include <sys/sockio.h>
#include <sys/sysctl.h>
#include <sys/uuid.h>

#include <net/if.h>
#include <net/if_var.h>
#include <net/if_arp.h>
#include <net/netisr.h>
#include <net/route.h>
#include <net/if_llc.h>
#include <net/if_dl.h>
#include <net/if_types.h>
#include <net/bpf.h>
#include <net/ethernet.h>
#include <net/if_bridgevar.h>
#include <net/if_vlan_var.h>
The main goals of this project are:
1. separating the L2 tables (ARP, NDP) from the L3 routing tables
2. removing as many locking dependencies among these layers as
   possible to allow for some parallelism in the search operations
3. simplifying the logic in the routing code.
The most notable end result is the obsolescence of the route
cloning (RTF_CLONING) concept, which translated into code reduction
in both IPv4 ARP and IPv6 NDP related modules, and size reduction in
struct rtentry{}. The change in design obsoletes the semantics of the
RTF_CLONING, RTF_WASCLONE and RTF_LLINFO routing flags. Userland
applications such as "arp" and "ndp" have been modified to reflect
those changes. The output from "netstat -r" shows only the routing
entries.
Quite a few developers have contributed to this project in the
past: Glebius Smirnoff, Luigi Rizzo, Alessandro Cerri, and
Andre Oppermann. And most recently:
- Kip Macy revised the locking code completely, thus completing
  the last piece of the puzzle; Kip has also been conducting
  active functional testing
- Sam Leffler has helped me improve and refactor the code, and
  provided valuable reviews
- Julian Elischer set up the perforce tree for me and helped
  me maintain that branch before the svn conversion
#include <net/if_llatbl.h>
#include <net/pfil.h>
#include <net/rss_config.h>
#include <net/vnet.h>

#include <netpfil/pf/pf_mtag.h>

#if defined(INET) || defined(INET6)
#include <netinet/in.h>
#include <netinet/in_var.h>
#include <netinet/if_ether.h>
#include <netinet/ip_carp.h>
#include <netinet/ip_var.h>
#endif
#ifdef INET6
#include <netinet6/nd6.h>
#endif
#include <security/mac/mac_framework.h>

#ifdef CTASSERT
CTASSERT(sizeof (struct ether_header) == ETHER_ADDR_LEN * 2 + 2);
CTASSERT(sizeof (struct ether_addr) == ETHER_ADDR_LEN);
#endif

VNET_DEFINE(struct pfil_head, link_pfil_hook);	/* Packet filter hooks */

/* netgraph node hooks for ng_ether(4) */
void	(*ng_ether_input_p)(struct ifnet *ifp, struct mbuf **mp);
void	(*ng_ether_input_orphan_p)(struct ifnet *ifp, struct mbuf *m);
int	(*ng_ether_output_p)(struct ifnet *ifp, struct mbuf **mp);
void	(*ng_ether_attach_p)(struct ifnet *ifp);
void	(*ng_ether_detach_p)(struct ifnet *ifp);

void	(*vlan_input_p)(struct ifnet *, struct mbuf *);

/* if_bridge(4) support */
struct mbuf *(*bridge_input_p)(struct ifnet *, struct mbuf *);
int	(*bridge_output_p)(struct ifnet *, struct mbuf *,
	    struct sockaddr *, struct rtentry *);
void	(*bridge_dn_p)(struct mbuf *, struct ifnet *);

/* if_lagg(4) support */
struct mbuf *(*lagg_input_p)(struct ifnet *, struct mbuf *);

static const u_char etherbroadcastaddr[ETHER_ADDR_LEN] =
			{ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };

static	int ether_resolvemulti(struct ifnet *, struct sockaddr **,
	    struct sockaddr *);
#ifdef VIMAGE
static	void ether_reassign(struct ifnet *, struct vnet *, char *);
#endif
static	int ether_requestencap(struct ifnet *, struct if_encap_req *);

#define	ETHER_IS_BROADCAST(addr) \
	(bcmp(etherbroadcastaddr, (addr), ETHER_ADDR_LEN) == 0)

#define	senderr(e) do { error = (e); goto bad; } while (0)
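The `ETHER_IS_BROADCAST()` macro above is just a `bcmp()` of the destination against the all-ones address. A userland sketch of the same check (hypothetical `ether_is_broadcast()` helper, with `memcmp()` standing in for the kernel's `bcmp()`):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ETHER_ADDR_LEN	6

static const uint8_t bcast_addr[ETHER_ADDR_LEN] =
	{ 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };

/* Same test ETHER_IS_BROADCAST() performs, written as a function. */
static int
ether_is_broadcast(const uint8_t *addr)
{
	return (memcmp(bcast_addr, addr, ETHER_ADDR_LEN) == 0);
}
```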

static void
update_mbuf_csumflags(struct mbuf *src, struct mbuf *dst)
{
	int csum_flags = 0;

	if (src->m_pkthdr.csum_flags & CSUM_IP)
		csum_flags |= (CSUM_IP_CHECKED|CSUM_IP_VALID);
	if (src->m_pkthdr.csum_flags & CSUM_DELAY_DATA)
		csum_flags |= (CSUM_DATA_VALID|CSUM_PSEUDO_HDR);
	if (src->m_pkthdr.csum_flags & CSUM_SCTP)
		csum_flags |= CSUM_SCTP_VALID;
	dst->m_pkthdr.csum_flags |= csum_flags;
	if (csum_flags & CSUM_DATA_VALID)
		dst->m_pkthdr.csum_data = 0xffff;
}
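The translation `update_mbuf_csumflags()` performs — turning the transmit-side "checksum requested" flags into the receive-side "checksum already verified" flags for a locally looped-back packet — can be modelled in userland. The flag values below are illustrative stand-ins, not the real `sys/mbuf.h` constants.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative flag bits; the real values live in sys/mbuf.h. */
#define CSUM_IP		0x0001	/* tx: IP header csum requested */
#define CSUM_DELAY_DATA	0x0002	/* tx: data csum requested */
#define CSUM_SCTP	0x0004	/* tx: SCTP csum requested */
#define CSUM_IP_CHECKED	0x0100	/* rx: IP header was checked... */
#define CSUM_IP_VALID	0x0200	/* rx: ...and is valid */
#define CSUM_DATA_VALID	0x0400	/* rx: data csum is valid */
#define CSUM_PSEUDO_HDR	0x0800	/* rx: pseudo-header included */
#define CSUM_SCTP_VALID	0x1000	/* rx: SCTP csum is valid */

struct pkthdr {
	uint32_t csum_flags;
	uint32_t csum_data;
};

/* Mirrors the mapping done by update_mbuf_csumflags() above. */
static void
map_tx_to_rx_csum(const struct pkthdr *src, struct pkthdr *dst)
{
	uint32_t csum_flags = 0;

	if (src->csum_flags & CSUM_IP)
		csum_flags |= (CSUM_IP_CHECKED | CSUM_IP_VALID);
	if (src->csum_flags & CSUM_DELAY_DATA)
		csum_flags |= (CSUM_DATA_VALID | CSUM_PSEUDO_HDR);
	if (src->csum_flags & CSUM_SCTP)
		csum_flags |= CSUM_SCTP_VALID;
	dst->csum_flags |= csum_flags;
	if (csum_flags & CSUM_DATA_VALID)
		dst->csum_data = 0xffff;
}
```

The point of the mapping is that a looped-back packet never crossed real hardware, so the stack asserts the checksums as valid instead of recomputing them on input.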

/*
 * Handle link-layer encapsulation requests.
 */
static int
ether_requestencap(struct ifnet *ifp, struct if_encap_req *req)
{
	struct ether_header *eh;
	struct arphdr *ah;
	uint16_t etype;
	const u_char *lladdr;

	if (req->rtype != IFENCAP_LL)
		return (EOPNOTSUPP);

	if (req->bufsize < ETHER_HDR_LEN)
		return (ENOMEM);

	eh = (struct ether_header *)req->buf;
	lladdr = req->lladdr;
	req->lladdr_off = 0;

	switch (req->family) {
	case AF_INET:
		etype = htons(ETHERTYPE_IP);
		break;
	case AF_INET6:
		etype = htons(ETHERTYPE_IPV6);
		break;
	case AF_ARP:
		ah = (struct arphdr *)req->hdata;
		ah->ar_hrd = htons(ARPHRD_ETHER);

		switch (ntohs(ah->ar_op)) {
		case ARPOP_REVREQUEST:
		case ARPOP_REVREPLY:
			etype = htons(ETHERTYPE_REVARP);
			break;
		case ARPOP_REQUEST:
		case ARPOP_REPLY:
		default:
			etype = htons(ETHERTYPE_ARP);
			break;
		}

		if (req->flags & IFENCAP_FLAG_BROADCAST)
			lladdr = ifp->if_broadcastaddr;
		break;
	default:
		return (EAFNOSUPPORT);
	}

	memcpy(&eh->ether_type, &etype, sizeof(eh->ether_type));
	memcpy(eh->ether_dhost, lladdr, ETHER_ADDR_LEN);
	memcpy(eh->ether_shost, IF_LLADDR(ifp), ETHER_ADDR_LEN);
	req->bufsize = sizeof(struct ether_header);

	return (0);
}
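The core of `ether_requestencap()` — mapping an address family to an EtherType and laying out the 14-byte header — can be sketched as a self-contained userland function. The `XAF_*` constants and `build_ether_header()` are hypothetical stand-ins so the sketch compiles without kernel headers; the EtherType values are the standard ones.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ETHER_ADDR_LEN	6
#define ETHER_HDR_LEN	14
#define ETHERTYPE_IP	0x0800
#define ETHERTYPE_IPV6	0x86dd
#define ETHERTYPE_ARP	0x0806

/* AF_* stand-ins so the sketch is self-contained. */
enum { XAF_INET = 2, XAF_INET6 = 28, XAF_ARP = 98 };

/*
 * Fill a 14-byte Ethernet header for the given family, mirroring the
 * switch in ether_requestencap().  Returns 0, or -1 for an
 * unsupported family (the kernel returns EAFNOSUPPORT).
 */
static int
build_ether_header(uint8_t *buf, int family,
    const uint8_t dhost[ETHER_ADDR_LEN],
    const uint8_t shost[ETHER_ADDR_LEN])
{
	uint16_t etype;

	switch (family) {
	case XAF_INET:
		etype = ETHERTYPE_IP;
		break;
	case XAF_INET6:
		etype = ETHERTYPE_IPV6;
		break;
	case XAF_ARP:
		etype = ETHERTYPE_ARP;
		break;
	default:
		return (-1);
	}
	memcpy(buf, dhost, ETHER_ADDR_LEN);
	memcpy(buf + ETHER_ADDR_LEN, shost, ETHER_ADDR_LEN);
	/* ether_type is stored in network (big-endian) byte order. */
	buf[12] = etype >> 8;
	buf[13] = etype & 0xff;
	return (0);
}
```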

static int
ether_resolve_addr(struct ifnet *ifp, struct mbuf *m,
    const struct sockaddr *dst, struct route *ro, u_char *phdr,
    uint32_t *pflags)
{
	struct ether_header *eh;
	uint32_t lleflags = 0;
	int error = 0;
#if defined(INET) || defined(INET6)
	uint16_t etype;
#endif

	eh = (struct ether_header *)phdr;

	switch (dst->sa_family) {
#ifdef INET
	case AF_INET:
		if ((m->m_flags & (M_BCAST | M_MCAST)) == 0)
			error = arpresolve(ifp, 0, m, dst, phdr, &lleflags);
		else {
			if (m->m_flags & M_BCAST)
				memcpy(eh->ether_dhost, ifp->if_broadcastaddr,
				    ETHER_ADDR_LEN);
			else {
				const struct in_addr *a;

				a = &(((const struct sockaddr_in *)dst)->sin_addr);
				ETHER_MAP_IP_MULTICAST(a, eh->ether_dhost);
			}
			etype = htons(ETHERTYPE_IP);
			memcpy(&eh->ether_type, &etype, sizeof(etype));
			memcpy(eh->ether_shost, IF_LLADDR(ifp), ETHER_ADDR_LEN);
		}
		break;
#endif
#ifdef INET6
	case AF_INET6:
		if ((m->m_flags & M_MCAST) == 0)
			error = nd6_resolve(ifp, 0, m, dst, phdr, &lleflags);
		else {
			const struct in6_addr *a6;

			a6 = &(((const struct sockaddr_in6 *)dst)->sin6_addr);
			ETHER_MAP_IPV6_MULTICAST(a6, eh->ether_dhost);
			etype = htons(ETHERTYPE_IPV6);
			memcpy(&eh->ether_type, &etype, sizeof(etype));
			memcpy(eh->ether_shost, IF_LLADDR(ifp), ETHER_ADDR_LEN);
		}
		break;
#endif
	default:
		if_printf(ifp, "can't handle af%d\n", dst->sa_family);
		if (m != NULL)
			m_freem(m);
		return (EAFNOSUPPORT);
	}

	if (error == EHOSTDOWN) {
		if (ro != NULL && (ro->ro_flags & RT_HAS_GW) != 0)
			error = EHOSTUNREACH;
	}

	if (error != 0)
		return (error);

	*pflags = RT_MAY_LOOP;
	if (lleflags & LLE_IFADDR)
		*pflags |= RT_L2_ME;

	return (0);
}
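The `ETHER_MAP_IP_MULTICAST()` step above maps an IPv4 multicast group straight to a link-layer address, no ARP needed: the MAC is the fixed `01:00:5e` prefix with the low 23 bits of the group address copied in (the 24th bit is dropped, which is why 32 groups can alias one MAC). A hypothetical userland helper, not the kernel macro:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Map an IPv4 multicast group (host byte order) to its Ethernet
 * multicast address: 01:00:5e + low 23 bits of the group.
 */
static void
map_ip_multicast(uint32_t group, uint8_t mac[6])
{
	mac[0] = 0x01;
	mac[1] = 0x00;
	mac[2] = 0x5e;
	mac[3] = (group >> 16) & 0x7f;	/* top bit of the 24 is dropped */
	mac[4] = (group >> 8) & 0xff;
	mac[5] = group & 0xff;
}
```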

/*
 * Ethernet output routine.
 * Encapsulate a packet of type family for the local net.
 * Use trailer local net encapsulation if enough data in first
 * packet leaves a multiple of 512 bytes of data in remainder.
 */
int
ether_output(struct ifnet *ifp, struct mbuf *m,
	const struct sockaddr *dst, struct route *ro)
{
	int error = 0;
	char linkhdr[ETHER_HDR_LEN], *phdr;
	struct ether_header *eh;
	struct pf_mtag *t;
	int loop_copy = 1;
	int hlen;	/* link layer header length */
	uint32_t pflags;

	phdr = NULL;
	pflags = 0;
	if (ro != NULL) {
		phdr = ro->ro_prepend;
		hlen = ro->ro_plen;
		pflags = ro->ro_flags;
	}
#ifdef MAC
	error = mac_ifnet_check_transmit(ifp, m);
	if (error)
		senderr(error);
#endif

	M_PROFILE(m);
	if (ifp->if_flags & IFF_MONITOR)
		senderr(ENETDOWN);
	if (!((ifp->if_flags & IFF_UP) &&
	    (ifp->if_drv_flags & IFF_DRV_RUNNING)))
		senderr(ENETDOWN);

	if (phdr == NULL) {
		/* No prepend data supplied. Try to calculate ourselves. */
		phdr = linkhdr;
		hlen = ETHER_HDR_LEN;
		error = ether_resolve_addr(ifp, m, dst, ro, phdr, &pflags);
		if (error != 0)
			return (error == EWOULDBLOCK ? 0 : error);
	}

	if ((pflags & RT_L2_ME) != 0) {
		update_mbuf_csumflags(m, m);
		return (if_simloop(ifp, m, dst->sa_family, 0));
	}
	loop_copy = pflags & RT_MAY_LOOP;

	/*
	 * Add local net header. If no space in first mbuf,
	 * allocate another.
	 *
	 * Note that we do prepend regardless of RT_HAS_HEADER flag.
	 * This is done because BPF code shifts m_data pointer
	 * to the end of ethernet header prior to calling if_output().
	 */
	M_PREPEND(m, hlen, M_NOWAIT);
	if (m == NULL)
		senderr(ENOBUFS);
	if ((pflags & RT_HAS_HEADER) == 0) {
		eh = mtod(m, struct ether_header *);
		memcpy(eh, phdr, hlen);
	}

	/*
	 * If a simplex interface, and the packet is being sent to our
	 * Ethernet address or a broadcast address, loopback a copy.
	 * XXX To make a simplex device behave exactly like a duplex
	 * device, we should copy in the case of sending to our own
	 * ethernet address (thus letting the original actually appear
	 * on the wire). However, we don't do that here for security
	 * reasons and compatibility with the original behavior.
	 */
	if ((m->m_flags & M_BCAST) && loop_copy &&
	    (ifp->if_flags & IFF_SIMPLEX) &&
	    ((t = pf_find_mtag(m)) == NULL || !t->routed)) {
		struct mbuf *n;

		/*
		 * Because if_simloop() modifies the packet, we need a
		 * writable copy through m_dup() instead of a readonly
		 * one as m_copy[m] would give us. The alternative would
		 * be to modify if_simloop() to handle the readonly mbuf,
		 * but performance-wise it is mostly equivalent (trading
		 * extra data copying vs. extra locking).
		 *
		 * XXX This is a local workaround. A number of less
		 * often used kernel parts suffer from the same bug.
		 * See PR kern/105943 for a proposed general solution.
		 */
		if ((n = m_dup(m, M_NOWAIT)) != NULL) {
			update_mbuf_csumflags(m, n);
			(void)if_simloop(ifp, n, dst->sa_family, hlen);
		} else
			if_inc_counter(ifp, IFCOUNTER_IQDROPS, 1);
	}

	/*
	 * Bridges require special output handling.
	 */
	if (ifp->if_bridge) {
		BRIDGE_OUTPUT(ifp, m, error);
		return (error);
	}

A major overhaul of the CARP implementation. The ip_carp.c was started
from scratch, copying needed functionality from the old implementation
on demand, with a thorough review of all code. The main change is that
the interface layer has been removed from CARP. Now redundant addresses
are configured exactly on the interfaces they run on.
The CARP configuration itself is, as before, configured and read via
SIOCSVH/SIOCGVH ioctls. A new prefix created with SIOCAIFADDR or
SIOCAIFADDR_IN6 may now be configured to a particular virtual host id,
which makes the prefix redundant.
ifconfig(8) semantics have been changed too: now one doesn't need
to clone a carpXX interface; he/she should directly configure a vhid
on an Ethernet interface.
To supply vhid data from the kernel to an application, the getifaddrs(3)
function has been changed to pass ifam_data with each address. [1]
The new implementation definitely closes all PRs related to carp(4)
being an interface, and may close several others. It also allows
running a single redundant IP per interface.
Big thanks to Bjoern Zeeb for his help with the inet6 part of the patch,
for the idea of using ifam_data and for several rounds of reviewing!
PR:		kern/117000, kern/126945, kern/126714, kern/120130, kern/117448
Reviewed by:	bz
Submitted by:	bz [1]

#if defined(INET) || defined(INET6)
	if (ifp->if_carp &&
	    (error = (*carp_output_p)(ifp, m, dst)))
		goto bad;
#endif

	/* Handle ng_ether(4) processing, if any */
	if (ifp->if_l2com != NULL) {
		KASSERT(ng_ether_output_p != NULL,
		    ("ng_ether_output_p is NULL"));
		if ((error = (*ng_ether_output_p)(ifp, &m)) != 0) {
bad:			if (m != NULL)
				m_freem(m);
			return (error);
		}
		if (m == NULL)
			return (0);
	}

	/* Continue with link-layer output */
	return ether_output_frame(ifp, m);
}

/*
 * Ethernet link layer output routine to send a raw frame to the device.
 *
 * This assumes that the 14 byte Ethernet header is present and contiguous
 * in the first mbuf (if BRIDGE'ing).
 */
int
ether_output_frame(struct ifnet *ifp, struct mbuf *m)
{
	int i;
Remove (almost all) global variables that were used to hold
packet forwarding state ("annotations") during ip processing.
The code is considerably cleaner now.
The variables removed by this change are:
	ip_divert_cookie	used by divert sockets
	ip_fw_fwd_addr		used for transparent ip redirection
	last_pkt		used by dynamic pipes in dummynet
Removal of the first two has been done by carrying the annotations
into volatile structs prepended to the mbuf chains, and adding
appropriate code to add/remove annotations in the routines which
make use of them, i.e. ip_input(), ip_output(), tcp_input(),
bdg_forward(), ether_demux(), ether_output_frame(), div_output().
In passing, remove a bug in divert handling of fragmented packets.
Now it is the fragment at offset 0 which sets the divert status of
the whole packet, whereas formerly it was the last incoming fragment
to decide.
Removal of last_pkt required a change in the interface of ip_fw_chk()
and dummynet_io(). In passing, use the same mechanism for dummynet
annotations and for divert/forward annotations.
option IPFIREWALL_FORWARD is effectively useless: the code to
implement it is very small and is now in by default to avoid the
obfuscation of conditionally compiled code.
NOTES:
* there is at least one global variable left, sro_fwd, in ip_output().
  I am not sure if/how this can be removed.
* I have deliberately avoided gratuitous style changes in this commit
  to avoid cluttering the diffs. Minor style cleanup will likely be
  necessary.
* this commit only focused on the IP layer. I am sure there are a
  number of global variables used in the TCP and maybe UDP stack.
* despite the number of files touched, there are absolutely no APIs
  or data structures changed by this commit (except the interfaces of
  ip_fw_chk() and dummynet_io(), which are internal anyway), so
  an MFC is quite safe and unintrusive (and desirable, given the
  improved readability of the code).
MFC after:	10 days

	if (PFIL_HOOKED(&V_link_pfil_hook)) {
		i = pfil_run_hooks(&V_link_pfil_hook, &m, ifp, PFIL_OUT, NULL);

		if (i != 0)
			return (EACCES);

		if (m == NULL)
			return (0);
	}
Remove (almost all) global variables that were used to hold
packet forwarding state ("annotations") during ip processing.
The code is considerably cleaner now.
The variables removed by this change are:
ip_divert_cookie used by divert sockets
ip_fw_fwd_addr used for transparent ip redirection
last_pkt used by dynamic pipes in dummynet
Removal of the first two has been done by carrying the annotations
into volatile structs prepended to the mbuf chains, and adding
appropriate code to add/remove annotations in the routines which
make use of them, i.e. ip_input(), ip_output(), tcp_input(),
bdg_forward(), ether_demux(), ether_output_frame(), div_output().
On passing, remove a bug in divert handling of fragmented packet.
Now it is the fragment at offset 0 which sets the divert status of
the whole packet, whereas formerly it was the last incoming fragment
to decide.
Removal of last_pkt required a change in the interface of ip_fw_chk()
and dummynet_io(). On passing, use the same mechanism for dummynet
annotations and for divert/forward annotations.
option IPFIREWALL_FORWARD is effectively useless, the code to
implement it is very small and is now in by default to avoid the
obfuscation of conditionally compiled code.
NOTES:
* there is at least one global variable left, sro_fwd, in ip_output().
I am not sure if/how this can be removed.
* I have deliberately avoided gratuitous style changes in this commit
to avoid cluttering the diffs. Minor stule cleanup will likely be
necessary
* this commit only focused on the IP layer. I am sure there is a
number of global variables used in the TCP and maybe UDP stack.
* despite the number of files touched, there are absolutely no API's
or data structures changed by this commit (except the interfaces of
ip_fw_chk() and dummynet_io(), which are internal anyways), so
an MFC is quite safe and unintrusive (and desirable, given the
improved readability of the code).
MFC after: 10 days
2002-06-22 11:51:02 +00:00
/*
Lock down the network interface queues. The queue mutex must be obtained
before adding/removing packets from the queue. Also, the if_obytes and
if_omcasts fields should only be manipulated under protection of the mutex.
IF_ENQUEUE, IF_PREPEND, and IF_DEQUEUE perform all necessary locking on
the queue. An IF_LOCK macro is provided, as well as the old (mutex-less)
versions of the macros in the form _IF_ENQUEUE, _IF_QFULL, for code which
needs them, but their use is discouraged.
Two new macros are introduced: IF_DRAIN() to drain a queue, and IF_HANDOFF,
which takes care of locking/enqueue, and also statistics updating/start
if necessary.
2000-11-25 07:35:38 +00:00
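The locked-handoff pattern described above can be sketched in userspace C. This is a minimal illustration only, assuming simplified types and hypothetical names; a pthread mutex stands in for the kernel's queue mutex:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Minimal sketch of a locked packet queue, loosely mirroring the
 * IF_ENQUEUE/IF_HANDOFF idea: the mutex must be held while the queue
 * and its byte counters are touched.  Names are illustrative, not
 * the kernel's. */
struct pkt {
	struct pkt *next;
	size_t len;
};

struct ifqueue {
	pthread_mutex_t lock;
	struct pkt *head, *tail;
	int qlen, maxlen;
	size_t obytes;		/* if_obytes-style stat, updated under the lock */
};

/* Returns 1 on success, 0 if the queue is full (caller drops). */
int
if_handoff(struct ifqueue *q, struct pkt *p)
{
	pthread_mutex_lock(&q->lock);
	if (q->qlen >= q->maxlen) {
		pthread_mutex_unlock(&q->lock);
		return (0);
	}
	p->next = NULL;
	if (q->tail == NULL)
		q->head = p;
	else
		q->tail->next = p;
	q->tail = p;
	q->qlen++;
	q->obytes += p->len;
	pthread_mutex_unlock(&q->lock);
	return (1);		/* caller would now start output */
}
```

The point of the sketch is that enqueueing and statistics updating happen atomically under one lock, which is what IF_HANDOFF packages up for callers.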
* Queue message on interface, update output statistics if
* successful, and start output if interface not yet active.
*/
return ((ifp->if_transmit)(ifp, m));
}
#if defined(INET) || defined(INET6)
#endif

/*
* Process a received Ethernet packet; the packet is in the
* mbuf chain m with the ethernet header at the front.
*/
static void
Add an optional netisr dispatch point at ether_input(), but set the
default dispatch method to NETISR_DISPATCH_DIRECT in order to force
direct dispatch. This adds a fairly negligible overhead without
changing default behavior, but in the future will allow deferred or
hybrid dispatch to other worker threads before link layer processing
has taken place.
For example, this could allow redistribution using RSS hashes
without ethernet header cache line hits, if the NIC was unable to
adequately implement load balancing to too small a number of input
queues -- perhaps due to hard queueset counts of 1, 3, or 8, but in
a modern system with 16-128 threads. This can happen on highly
threaded systems, where you want an ithread per core,
redistributing work to other queues, but also on virtualised systems
where hardware hashing is (or is not) available, but only a single
queue has been directed to one VCPU on a VM.
Note: this adds a previously non-present assertion about the
equivalence of the ifnet from which the packet is received, and the
ifnet stamped in the mbuf header. I believe this assertion to
generally be true, but we'll find out soon -- if it's not, we might
have to add additional overhead in some cases to add an m_tag with
the originating ifnet pointer stored in it.
Reviewed by: bz
MFC after: 3 weeks
Sponsored by: Juniper Networks, Inc.
2011-06-01 20:00:25 +00:00
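The direct-versus-deferred choice described in the commit message above can be sketched as follows. This is an illustrative model only (the real mechanism is netisr with NETISR_DISPATCH_DIRECT; all names here are made up):

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of a dispatch point that either processes a frame inline
 * ("direct dispatch", the default) or queues it for a worker thread
 * ("deferred dispatch").  Purely illustrative. */
enum dispatch { DISPATCH_DIRECT, DISPATCH_DEFERRED };

#define QMAX 8
static const char *queue[QMAX];
static int qlen;
static int processed;

static void
process_frame(const char *frame)
{
	(void)frame;
	processed++;		/* stands in for link-layer processing */
}

/* Returns 1 if handled inline, 0 if deferred to the queue. */
int
dispatch_frame(enum dispatch policy, const char *frame)
{
	if (policy == DISPATCH_DIRECT) {
		process_frame(frame);
		return (1);
	}
	if (qlen < QMAX)
		queue[qlen++] = frame;	/* a worker thread would drain this */
	return (0);
}
```

Defaulting the policy to direct keeps the common path cheap while leaving the deferred path available for load redistribution.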
ether_input_internal(struct ifnet *ifp, struct mbuf *m)
{
struct ether_header *eh;
u_short etype;

if ((ifp->if_flags & IFF_UP) == 0) {
m_freem(m);
return;
}
#ifdef DIAGNOSTIC
if ((ifp->if_drv_flags & IFF_DRV_RUNNING) == 0) {
if_printf(ifp, "discard frame at !IFF_DRV_RUNNING\n");
m_freem(m);
return;
}
#endif
if (m->m_len < ETHER_HDR_LEN) {
/* XXX maybe should pullup? */
if_printf(ifp, "discard frame w/o leading ethernet "
"header (len %u pkt len %u)\n",
m->m_len, m->m_pkthdr.len);
if_inc_counter(ifp, IFCOUNTER_IERRORS, 1);
m_freem(m);
return;
}
eh = mtod(m, struct ether_header *);
etype = ntohs(eh->ether_type);
Huge cleanup of random(4) code.
* GENERAL
- Update copyright.
- Make kernel options for RANDOM_YARROW and RANDOM_DUMMY. Set
neither to ON, which means we want Fortuna
- If there is no 'device random' in the kernel, there will be NO
random(4) device in the kernel, and the KERN_ARND sysctl will
return nothing. With RANDOM_DUMMY there will be a random(4) that
always blocks.
- Repair kern.arandom (KERN_ARND sysctl). The old version went
through arc4random(9) and was a bit weird.
- Adjust arc4random stirring a bit - the existing code looks a little
suspect.
- Fix the nasty pre- and post-read overloading by providing explicit
functions to do these tasks.
- Redo read_random(9) so as to duplicate random(4)'s read internals.
This makes it a first-class citizen rather than a hack.
- Move stuff out of locked regions when it does not need to be
there.
- Trim RANDOM_DEBUG printfs. Some are excess to requirement, some
behind boot verbose.
- Use SYSINIT to sequence the startup.
- Fix init/deinit sysctl stuff.
- Make relevant sysctls also tunables.
- Add different harvesting "styles" to allow for different requirements
(direct, queue, fast).
- Add harvesting of FFS atime events. This needs to be checked for
weighing down the FS code.
- Add harvesting of slab allocator events. This needs to be checked for
weighing down the allocator code.
- Fix the random(9) manpage.
- Loadable modules are not present for now. These will be re-engineered
when the dust settles.
- Use macros for locks.
- Fix comments.
* src/share/man/...
- Update the man pages.
* src/etc/...
- The startup/shutdown work is done in D2924.
* src/UPDATING
- Add UPDATING announcement.
* src/sys/dev/random/build.sh
- Add copyright.
- Add libz for unit tests.
* src/sys/dev/random/dummy.c
- Remove; no longer needed. Functionality incorporated into randomdev.*.
* live_entropy_sources.c live_entropy_sources.h
- Remove; content moved.
- move content to randomdev.[ch] and optimise.
* src/sys/dev/random/random_adaptors.c src/sys/dev/random/random_adaptors.h
- Remove; pluggability is no longer used. Compile-time algorithm
selection is the way to go.
* src/sys/dev/random/random_harvestq.c src/sys/dev/random/random_harvestq.h
- Add early (re)boot-time randomness caching.
* src/sys/dev/random/randomdev_soft.c src/sys/dev/random/randomdev_soft.h
- Remove; no longer needed.
* src/sys/dev/random/uint128.h
- Provide a fake uint128_t; if a real one ever arrives, we can use
that instead. All that is needed here is N=0, N++, N==0, and some
localised trickery is used to manufacture a 128-bit 0ULLL.
* src/sys/dev/random/unit_test.c src/sys/dev/random/unit_test.h
- Improve unit tests; previously the testing human needed clairvoyance;
now the test will do a basic check of compressibility. Clairvoyant
talent is still a good idea.
- This is still a long way off a proper unit test.
* src/sys/dev/random/fortuna.c src/sys/dev/random/fortuna.h
- Improve messy union to just uint128_t.
- Remove unneeded 'static struct fortuna_start_cache'.
- Tighten up arithmetic.
- Provide a method to allow external junk to be introduced; harden
it against blatant injection by compressing/hashing.
- Assert that locks are held correctly.
- Fix the nasty pre- and post-read overloading by providing explicit
functions to do these tasks.
- Turn into self-sufficient module (no longer requires randomdev_soft.[ch])
* src/sys/dev/random/yarrow.c src/sys/dev/random/yarrow.h
- Improve messy union to just uint128_t.
- Remove unneeded 'static struct start_cache'.
- Tighten up arithmetic.
- Provide a method to allow external junk to be introduced; harden
it against blatant injection by compressing/hashing.
- Assert that locks are held correctly.
- Fix the nasty pre- and post-read overloading by providing explicit
functions to do these tasks.
- Turn into self-sufficient module (no longer requires randomdev_soft.[ch])
- Fix some magic numbers elsewhere used as FAST and SLOW.
Differential Revision: https://reviews.freebsd.org/D2025
Reviewed by: vsevolod,delphij,rwatson,trasz,jmg
Approved by: so (delphij)
2015-06-30 17:00:45 +00:00
random_harvest_queue(m, sizeof(*m), 2, RANDOM_NET_ETHER);

Change the curvnet variable from a global const struct vnet *,
previously always pointing to the default vnet context, to a
dynamically changing thread-local one. The curvnet context
should be set on entry to networking code via CURVNET_SET() macros,
and reverted to previous state via CURVNET_RESTORE(). Recursions
on curvnet are permitted, though strongly discouraged.
This change should have no functional impact on nooptions VIMAGE
kernel builds, where CURVNET_* macros expand to whitespace.
The curthread->td_vnet (aka curvnet) variable's purpose is to be an
indicator of the vnet context in which the current network-related
operation takes place, in case we cannot deduce the current vnet
context from any other source, such as by looking at an mbuf's
m->m_pkthdr.rcvif->if_vnet, a socket's so->so_vnet, etc. Moreover, so
far curvnet has turned out to be an invaluable consistency checking
aid: it helps to catch cases when sockets, ifnets or any other
vnet-aware structures may have leaked from one vnet to another.
The exact placement of the CURVNET_SET() / CURVNET_RESTORE() macros
was a result of an empirical iterative process, with an aim to
reduce recursions on CURVNET_SET() to a minimum, while still reducing
the scope of CURVNET_SET() to networking only operations - the
alternative would be calling CURVNET_SET() on each system call entry.
In general, curvnet has to be set in three typical cases: when
processing socket-related requests from userspace or from within the
kernel; when processing inbound traffic flowing from device drivers
to upper layers of the networking stack, and when executing
timer-driven networking functions.
This change also introduces a DDB subcommand to show the list of all
vnet instances.
Approved by: julian (mentor)
2009-05-05 10:56:12 +00:00
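The save/set/restore discipline the commit message describes can be sketched in plain C. These are not the kernel's definitions (the real macros compile away on nooptions VIMAGE kernels); the names below are illustrative:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the CURVNET_SET()/CURVNET_RESTORE() pattern: SET saves the
 * previous context in a block-local variable and RESTORE puts it back,
 * so recursive set/restore pairs nest correctly. */
struct vnet { int id; };

static struct vnet *curvnet;	/* in the kernel: curthread->td_vnet */

#define CURVNET_SET(v)						\
	do {							\
		struct vnet *saved_vnet = curvnet;		\
		curvnet = (v);

#define CURVNET_RESTORE()					\
		curvnet = saved_vnet;				\
	} while (0)

/* Demonstrates nested set/restore; returns 1 if every step saw the
 * expected context. */
int
vnet_demo(struct vnet *a, struct vnet *b)
{
	int ok = 1;

	CURVNET_SET(a);
	CURVNET_SET(b);			/* recursion: b shadows a */
	ok &= (curvnet == b);
	CURVNET_RESTORE();		/* back to a */
	ok &= (curvnet == a);
	CURVNET_RESTORE();		/* back to the original context */
	ok &= (curvnet == NULL);
	return (ok);
}
```

Keeping the saved pointer in a block-scoped local is what makes recursion safe: each RESTORE can only ever undo its matching SET.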
CURVNET_SET_QUIET(ifp->if_vnet);

if (ETHER_IS_MULTICAST(eh->ether_dhost)) {
if (ETHER_IS_BROADCAST(eh->ether_dhost))
m->m_flags |= M_BCAST;
else
m->m_flags |= M_MCAST;
if_inc_counter(ifp, IFCOUNTER_IMCASTS, 1);
}

#ifdef MAC
/*
* Tag the mbuf with an appropriate MAC label before any other
* consumers can get to it.
*/
mac_ifnet_create_mbuf(ifp, m);
#endif

/*
* Give bpf a chance at the packet.
*/
ETHER_BPF_MTAP(ifp, m);

/*
* If the CRC is still on the packet, trim it off. We do this once
* and once only in case we are re-entered. Nothing else on the
* Ethernet receive path expects to see the FCS.
*/
if (m->m_flags & M_HASFCS) {
m_adj(m, -ETHER_CRC_LEN);
m->m_flags &= ~M_HASFCS;
}

if (!(ifp->if_capenable & IFCAP_HWSTATS))
if_inc_counter(ifp, IFCOUNTER_IBYTES, m->m_pkthdr.len);

/* Allow monitor mode to claim this frame, after stats are updated. */
if (ifp->if_flags & IFF_MONITOR) {
m_freem(m);
CURVNET_RESTORE();
return;
}

/* Handle input from a lagg(4) port */
if (ifp->if_type == IFT_IEEE8023ADLAG) {
KASSERT(lagg_input_p != NULL,
("%s: if_lagg not loaded!", __func__));
m = (*lagg_input_p)(ifp, m);
if (m != NULL)
ifp = m->m_pkthdr.rcvif;
else {
CURVNET_RESTORE();
return;
}
}

/*
* If the hardware did not process an 802.1Q tag, do this now,
* to allow 802.1P priority frames to be passed to the main input
* path correctly.
* TODO: Deal with Q-in-Q frames, but not arbitrary nesting levels.
*/
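The in-place decapsulation performed below (slide the MAC addresses forward over the 4-byte tag, then trim the front) can be sketched on a flat byte buffer. This is a userspace illustration with made-up names, not the mbuf-based kernel code:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ETH_HDR_LEN	14	/* dst(6) + src(6) + type(2) */
#define VLAN_ENCAP_LEN	4	/* TPID(2) + TCI(2) */

/* Sketch of stripping an 802.1Q tag in place: extract the 16-bit TCI
 * that follows the 0x8100 TPID, slide the 12 address bytes forward by
 * four, and report the new length.  The decapped frame then begins
 * VLAN_ENCAP_LEN bytes into the buffer, mirroring the bcopy()/m_adj()
 * step in the code below.  Returns 0 on success, -1 if untagged. */
int
vlan_decap(uint8_t *frame, size_t len, uint16_t *tci, size_t *newlen)
{
	if (len < ETH_HDR_LEN + VLAN_ENCAP_LEN)
		return (-1);
	if (frame[12] != 0x81 || frame[13] != 0x00)
		return (-1);	/* not a C-VLAN (0x8100) tagged frame */
	*tci = (uint16_t)(frame[14] << 8 | frame[15]);
	/* Slide dst+src (12 bytes) over the tag; regions overlap, so
	 * memmove rather than memcpy. */
	memmove(frame + VLAN_ENCAP_LEN, frame, ETH_HDR_LEN - 2);
	*newlen = len - VLAN_ENCAP_LEN;
	return (0);
}
```

After the move, the inner EtherType already sits in the right place, so the header starting at `frame + VLAN_ENCAP_LEN` is an ordinary 14-byte Ethernet header.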
if ((m->m_flags & M_VLANTAG) == 0 && etype == ETHERTYPE_VLAN) {
struct ether_vlan_header *evl;

if (m->m_len < sizeof(*evl) &&
(m = m_pullup(m, sizeof(*evl))) == NULL) {
#ifdef DIAGNOSTIC
if_printf(ifp, "cannot pullup VLAN header\n");
#endif
if_inc_counter(ifp, IFCOUNTER_IERRORS, 1);
CURVNET_RESTORE();
return;
}

evl = mtod(m, struct ether_vlan_header *);
m->m_pkthdr.ether_vtag = ntohs(evl->evl_tag);
m->m_flags |= M_VLANTAG;

bcopy((char *)evl, (char *)evl + ETHER_VLAN_ENCAP_LEN,
ETHER_HDR_LEN - ETHER_TYPE_LEN);
m_adj(m, ETHER_VLAN_ENCAP_LEN);
eh = mtod(m, struct ether_header *);
}

M_SETFIB(m, ifp->if_fib);

/* Allow ng_ether(4) to claim this frame. */
if (ifp->if_l2com != NULL) {
KASSERT(ng_ether_input_p != NULL,
("%s: ng_ether_input_p is NULL", __func__));
m->m_flags &= ~M_PROMISC;
(*ng_ether_input_p)(ifp, &m);
if (m == NULL) {
CURVNET_RESTORE();
return;
}
eh = mtod(m, struct ether_header *);
}

/*
* Allow if_bridge(4) to claim this frame.
* The BRIDGE_INPUT() macro will update ifp if the bridge changed it
* and the frame should be delivered locally.
*/
if (ifp->if_bridge != NULL) {
m->m_flags &= ~M_PROMISC;
BRIDGE_INPUT(ifp, m);
if (m == NULL) {
CURVNET_RESTORE();
return;
}
eh = mtod(m, struct ether_header *);
}

#if defined(INET) || defined(INET6)
/*
* Clear M_PROMISC on frame so that carp(4) will see it when the
* mbuf flows up to Layer 3.
* FreeBSD's implementation of carp(4) uses the inprotosw
* to dispatch IPPROTO_CARP. carp(4) also allocates its own
* Ethernet addresses of the form 00:00:5e:00:01:xx, which
* is outside the scope of the M_PROMISC test below.
* TODO: Maintain a hash table of ethernet addresses other than
* ether_dhost which may be active on this ifp.
*/
if (ifp->if_carp && (*carp_forus_p)(ifp, eh->ether_dhost)) {
m->m_flags &= ~M_PROMISC;
} else
#endif
{
/*
* If the frame received was not for our MAC address, set the
* M_PROMISC flag on the mbuf chain. The frame may need to
* be seen by the rest of the Ethernet input path in case of
* re-entry (e.g. bridge, vlan, netgraph) but should not be
* seen by upper protocol layers.
*/
if (!ETHER_IS_MULTICAST(eh->ether_dhost) &&
bcmp(IF_LLADDR(ifp), eh->ether_dhost, ETHER_ADDR_LEN) != 0)
m->m_flags |= M_PROMISC;
}
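The acceptance test above (multicast bit first, then an exact match against the interface address) can be sketched with a small helper. The helper name and flat-array types are illustrative, not the kernel's:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN 6

/* Sketch of the unicast-acceptance check above: a frame gets the
 * "promiscuous" marking only if its destination is neither
 * multicast/broadcast (low bit of the first address byte set) nor
 * equal to our own interface address. */
int
frame_is_promisc(const uint8_t dhost[ETH_ALEN],
    const uint8_t lladdr[ETH_ALEN])
{
	if (dhost[0] & 0x01)	/* multicast or broadcast */
		return (0);
	return (memcmp(dhost, lladdr, ETH_ALEN) != 0);
}
```

Checking the group bit before the full compare mirrors the code above: multicast frames were already classified earlier, so only unicast frames for a foreign address get flagged.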
ether_demux(ifp, m);
CURVNET_RESTORE();
}

/*
* Ethernet input dispatch; by default, direct dispatch here regardless of
Several years after initial development, merge prototype support for
linking NIC Receive Side Scaling (RSS) to the network stack's
connection-group implementation. This prototype (and derived patches)
are in use at Juniper and several other FreeBSD-using companies, so
despite some reservations about its maturity, merge the patch to the
base tree so that it can be iteratively refined in collaboration rather
than maintained as a set of gradually diverging patch sets.
(1) Merge a software implementation of the Toeplitz hash specified in
RSS implemented by David Malone. This is used to allow suitable
pcbgroup placement of connections before the first packet is
received from the NIC. Software hashing is generally avoided,
however, due to high cost of the hash on general-purpose CPUs.
(2) In in_rss.c, maintain authoritative versions of RSS state intended
to be pushed to each NIC, including keying material, hash
algorithm/configuration, and buckets. Provide software-facing
interfaces to hash 2- and 4-tuples for IPv4 and IPv6 using both
the RSS standardised Toeplitz and a 'naive' variation with a hash
efficient in software but with poor distribution properties.
Implement rss_m2cpuid() to be used by netisr and other load
balancing code to look up the CPU on which an mbuf should be
processed.
(3) In the Ethernet link layer, allow netisr distribution using RSS as
a source of policy as an alternative to source ordering; continue
to default to direct dispatch (i.e., don't try and requeue packets
for processing on the 'right' CPU if they arrive in a directly
dispatchable context).
(4) Allow RSS to control tuning of connection groups in order to align
groups with RSS buckets. If a packet arrives on a protocol using
connection groups, and contains a suitable hardware-generated
hash, use that hash value to select the connection group for pcb
lookup for both IPv4 and IPv6. If no hardware-generated Toeplitz
hash is available, we fall back on regular PCB lookup risking
contention rather than pay the cost of Toeplitz in software --
this is a less scalable but, at my last measurement, faster
approach. As core counts go up, we may want to revise this
strategy despite CPU overhead.
Where device drivers suitably configure NICs, and connection groups /
RSS are enabled, this should avoid both lock and line contention during
connection lookup for TCP. This commit does not modify any device
drivers to tune device RSS configuration to the global RSS
configuration; patches are in circulation to do this for at least
Chelsio T3 and Intel 1G/10G drivers. Currently, the KPI for device
drivers is not particularly robust, nor aware of more advanced features
such as runtime reconfiguration/rebalancing. This will hopefully prove
a useful starting point for refinement.
No MFC is scheduled as we will first want to nail down a more mature
and maintainable KPI/KBI for device drivers.
Sponsored by: Juniper Networks (original work)
Sponsored by: EMC/Isilon (patch update and merge)
2014-03-15 00:57:50 +00:00
* global configuration. However, if RSS is enabled, hook up RSS affinity
* so that when deferred or hybrid dispatch is enabled, we can redistribute
* load based on RSS.
*
* XXXRW: Would be nice if the ifnet passed up a flag indicating whether or
* not it had already done work distribution via multi-queue. Then we could
* direct dispatch in the event load balancing was already complete and
* handle the case of interfaces with different capabilities better.
*
* XXXRW: Sort of want an M_DISTRIBUTED flag to avoid multiple distributions
* at multiple layers?
*
* XXXRW: For now, enable all this only if RSS is compiled in, although it
* works fine without RSS. Need to characterise the performance overhead
* of the detour through the netisr code in the event the result is always
* direct dispatch.
*/
|
|
|
|
static void
ether_nh_input(struct mbuf *m)
{

	M_ASSERTPKTHDR(m);
	KASSERT(m->m_pkthdr.rcvif != NULL,
	    ("%s: NULL interface pointer", __func__));
	ether_input_internal(m->m_pkthdr.rcvif, m);
}

static struct netisr_handler ether_nh = {
	.nh_name = "ether",
	.nh_handler = ether_nh_input,
	.nh_proto = NETISR_ETHER,
#ifdef RSS
	.nh_policy = NETISR_POLICY_CPU,
	.nh_dispatch = NETISR_DISPATCH_DIRECT,
	.nh_m2cpuid = rss_m2cpuid,
#else
	.nh_policy = NETISR_POLICY_SOURCE,
	.nh_dispatch = NETISR_DISPATCH_DIRECT,
#endif
};

static void
ether_init(__unused void *arg)
{

	netisr_register(&ether_nh);
}
SYSINIT(ether, SI_SUB_INIT_IF, SI_ORDER_ANY, ether_init, NULL);

static void
vnet_ether_init(__unused void *arg)
{
	int i;

	/* Initialize packet filter hooks. */
	V_link_pfil_hook.ph_type = PFIL_TYPE_AF;
	V_link_pfil_hook.ph_af = AF_LINK;
	if ((i = pfil_head_register(&V_link_pfil_hook)) != 0)
		printf("%s: WARNING: unable to register pfil link hook, "
		    "error %d\n", __func__, i);
}
VNET_SYSINIT(vnet_ether_init, SI_SUB_PROTO_IF, SI_ORDER_ANY,
    vnet_ether_init, NULL);

static void
vnet_ether_destroy(__unused void *arg)
{
	int i;

	if ((i = pfil_head_unregister(&V_link_pfil_hook)) != 0)
		printf("%s: WARNING: unable to unregister pfil link hook, "
		    "error %d\n", __func__, i);
}
VNET_SYSUNINIT(vnet_ether_uninit, SI_SUB_PROTO_IF, SI_ORDER_ANY,
    vnet_ether_destroy, NULL);

static void
ether_input(struct ifnet *ifp, struct mbuf *m)
{
	struct mbuf *mn;

	/*
	 * The drivers are allowed to pass in a chain of packets linked with
	 * m_nextpkt. We split them up into separate packets here and pass
	 * them up. This allows the drivers to amortize the receive lock.
	 */
	while (m) {
		mn = m->m_nextpkt;
		m->m_nextpkt = NULL;

		/*
		 * We will rely on rcvif being set properly in the deferred
		 * context, so assert it is correct here.
		 */
		KASSERT(m->m_pkthdr.rcvif == ifp,
		    ("%s: ifnet mismatch", __func__));
		netisr_dispatch(NETISR_ETHER, m);
		m = mn;
	}
}

/*
 * Upper layer processing for a received Ethernet packet.
 */
void
ether_demux(struct ifnet *ifp, struct mbuf *m)
{
	struct ether_header *eh;
	int i, isr;
	u_short ether_type;

	KASSERT(ifp != NULL, ("%s: NULL interface pointer", __func__));

	/* Do not grab PROMISC frames in case we are re-entered. */
	if (PFIL_HOOKED(&V_link_pfil_hook) && !(m->m_flags & M_PROMISC)) {
		i = pfil_run_hooks(&V_link_pfil_hook, &m, ifp, PFIL_IN, NULL);

		if (i != 0 || m == NULL)
			return;
	}

	eh = mtod(m, struct ether_header *);
	ether_type = ntohs(eh->ether_type);

	/*
	 * If this frame has a VLAN tag other than 0, call vlan_input()
	 * if its module is loaded. Otherwise, drop.
	 */
	if ((m->m_flags & M_VLANTAG) &&
	    EVL_VLANOFTAG(m->m_pkthdr.ether_vtag) != 0) {
		if (ifp->if_vlantrunk == NULL) {
			if_inc_counter(ifp, IFCOUNTER_NOPROTO, 1);
			m_freem(m);
			return;
		}
		KASSERT(vlan_input_p != NULL, ("%s: VLAN not loaded!",
		    __func__));
		/* Clear before possibly re-entering ether_input(). */
		m->m_flags &= ~M_PROMISC;
		(*vlan_input_p)(ifp, m);
		return;
	}

	/*
	 * Pass promiscuously received frames to the upper layer if the user
	 * requested this by setting IFF_PPROMISC. Otherwise, drop them.
	 */
	if ((ifp->if_flags & IFF_PPROMISC) == 0 && (m->m_flags & M_PROMISC)) {
		m_freem(m);
		return;
	}

	/*
	 * Reset layer specific mbuf flags to avoid confusing upper layers.
	 * Strip off Ethernet header.
	 */
	m->m_flags &= ~M_VLANTAG;
	m_clrprotoflags(m);
	m_adj(m, ETHER_HDR_LEN);

	/*
	 * Dispatch frame to upper layer.
	 */
	switch (ether_type) {
#ifdef INET
	case ETHERTYPE_IP:
		isr = NETISR_IP;
		break;

	case ETHERTYPE_ARP:
		if (ifp->if_flags & IFF_NOARP) {
			/* Discard packet if ARP is disabled on interface */
			m_freem(m);
			return;
		}
		isr = NETISR_ARP;
		break;
#endif
#ifdef INET6
	case ETHERTYPE_IPV6:
		isr = NETISR_IPV6;
		break;
#endif
	default:
		goto discard;
	}
	netisr_dispatch(isr, m);
	return;

discard:
	/*
	 * Packet is to be discarded. If netgraph is present,
	 * hand the packet to it for last chance processing;
	 * otherwise dispose of it.
	 */
	if (ifp->if_l2com != NULL) {
		KASSERT(ng_ether_input_orphan_p != NULL,
		    ("ng_ether_input_orphan_p is NULL"));
		/*
		 * Put back the ethernet header so netgraph has a
		 * consistent view of inbound packets.
		 */
		M_PREPEND(m, ETHER_HDR_LEN, M_NOWAIT);
		(*ng_ether_input_orphan_p)(ifp, m);
		return;
	}
	m_freem(m);
}

/*
 * Convert Ethernet address to printable (loggable) representation.
 * This routine is for compatibility; it's better to just use
 *
 *	printf("%6D", <pointer to address>, ":");
 *
 * since there's no static buffer involved.
 */
char *
ether_sprintf(const u_char *ap)
{
	static char etherbuf[18];

	snprintf(etherbuf, sizeof (etherbuf), "%6D", ap, ":");
	return (etherbuf);
}

/*
 * Perform common duties while attaching to interface list.
 */
void
ether_ifattach(struct ifnet *ifp, const u_int8_t *lla)
{
	int i;
	struct ifaddr *ifa;
	struct sockaddr_dl *sdl;

	ifp->if_addrlen = ETHER_ADDR_LEN;
	ifp->if_hdrlen = ETHER_HDR_LEN;
	if_attach(ifp);
	ifp->if_mtu = ETHERMTU;
	ifp->if_output = ether_output;
	ifp->if_input = ether_input;
	ifp->if_resolvemulti = ether_resolvemulti;
	ifp->if_requestencap = ether_requestencap;
#ifdef VIMAGE
	ifp->if_reassign = ether_reassign;
#endif
	if (ifp->if_baudrate == 0)
		ifp->if_baudrate = IF_Mbps(10);	/* just a default */
	ifp->if_broadcastaddr = etherbroadcastaddr;

	ifa = ifp->if_addr;
	KASSERT(ifa != NULL, ("%s: no lladdr!\n", __func__));
	sdl = (struct sockaddr_dl *)ifa->ifa_addr;
	sdl->sdl_type = IFT_ETHER;
	sdl->sdl_alen = ifp->if_addrlen;
	bcopy(lla, LLADDR(sdl), ifp->if_addrlen);

	bpfattach(ifp, DLT_EN10MB, ETHER_HDR_LEN);
	if (ng_ether_attach_p != NULL)
		(*ng_ether_attach_p)(ifp);

	/* Announce Ethernet MAC address if non-zero. */
	for (i = 0; i < ifp->if_addrlen; i++)
		if (lla[i] != 0)
			break;
	if (i != ifp->if_addrlen)
		if_printf(ifp, "Ethernet address: %6D\n", lla, ":");

	uuid_ether_add(LLADDR(sdl));
}

/*
 * Perform common duties while detaching an Ethernet interface.
 */
void
ether_ifdetach(struct ifnet *ifp)
{
	struct sockaddr_dl *sdl;

	sdl = (struct sockaddr_dl *)(ifp->if_addr->ifa_addr);
	uuid_ether_del(LLADDR(sdl));

	if (ifp->if_l2com != NULL) {
		KASSERT(ng_ether_detach_p != NULL,
		    ("ng_ether_detach_p is NULL"));
		(*ng_ether_detach_p)(ifp);
	}

	bpfdetach(ifp);
	if_detach(ifp);
}

#ifdef VIMAGE
void
ether_reassign(struct ifnet *ifp, struct vnet *new_vnet, char *unused __unused)
{

	if (ifp->if_l2com != NULL) {
		KASSERT(ng_ether_detach_p != NULL,
		    ("ng_ether_detach_p is NULL"));
		(*ng_ether_detach_p)(ifp);
	}

	if (ng_ether_attach_p != NULL) {
		CURVNET_SET_QUIET(new_vnet);
		(*ng_ether_attach_p)(ifp);
		CURVNET_RESTORE();
	}
}
#endif


SYSCTL_DECL(_net_link);
SYSCTL_NODE(_net_link, IFT_ETHER, ether, CTLFLAG_RW, 0, "Ethernet");

#if 0
/*
 * This is for reference.  We have a table-driven version
 * of the little-endian crc32 generator, which is faster
 * than the double-loop.
 */
uint32_t
ether_crc32_le(const uint8_t *buf, size_t len)
{
	size_t i;
	uint32_t crc, carry;
	int bit;
	uint8_t data;

	crc = 0xffffffff;	/* initial value */

	for (i = 0; i < len; i++) {
		for (data = *buf++, bit = 0; bit < 8; bit++, data >>= 1) {
			carry = (crc ^ data) & 1;
			crc >>= 1;
			if (carry)
				crc = (crc ^ ETHER_CRC_POLY_LE);
		}
	}

	return (crc);
}
#else
uint32_t
ether_crc32_le(const uint8_t *buf, size_t len)
{
	static const uint32_t crctab[] = {
		0x00000000, 0x1db71064, 0x3b6e20c8, 0x26d930ac,
		0x76dc4190, 0x6b6b51f4, 0x4db26158, 0x5005713c,
		0xedb88320, 0xf00f9344, 0xd6d6a3e8, 0xcb61b38c,
		0x9b64c2b0, 0x86d3d2d4, 0xa00ae278, 0xbdbdf21c
	};
	size_t i;
	uint32_t crc;

	crc = 0xffffffff;	/* initial value */

	for (i = 0; i < len; i++) {
		crc ^= buf[i];
		crc = (crc >> 4) ^ crctab[crc & 0xf];
		crc = (crc >> 4) ^ crctab[crc & 0xf];
	}

	return (crc);
}
#endif

uint32_t
ether_crc32_be(const uint8_t *buf, size_t len)
{
	size_t i;
	uint32_t crc, carry;
	int bit;
	uint8_t data;

	crc = 0xffffffff;	/* initial value */

	for (i = 0; i < len; i++) {
		for (data = *buf++, bit = 0; bit < 8; bit++, data >>= 1) {
			carry = ((crc & 0x80000000) ? 1 : 0) ^ (data & 0x01);
			crc <<= 1;
			if (carry)
				crc = (crc ^ ETHER_CRC_POLY_BE) | carry;
		}
	}

	return (crc);
}

int
ether_ioctl(struct ifnet *ifp, u_long command, caddr_t data)
{
	struct ifaddr *ifa = (struct ifaddr *) data;
	struct ifreq *ifr = (struct ifreq *) data;
	int error = 0;

	switch (command) {
	case SIOCSIFADDR:
		ifp->if_flags |= IFF_UP;

		switch (ifa->ifa_addr->sa_family) {
#ifdef INET
		case AF_INET:
			ifp->if_init(ifp->if_softc);	/* before arpwhohas */
			arp_ifinit(ifp, ifa);
			break;
#endif
		default:
			ifp->if_init(ifp->if_softc);
			break;
		}
		break;

	case SIOCGIFADDR:
		{
			struct sockaddr *sa;

			sa = (struct sockaddr *) & ifr->ifr_data;
			bcopy(IF_LLADDR(ifp),
			    (caddr_t) sa->sa_data, ETHER_ADDR_LEN);
		}
		break;

	case SIOCSIFMTU:
		/*
		 * Set the interface MTU.
		 */
		if (ifr->ifr_mtu > ETHERMTU) {
			error = EINVAL;
		} else {
			ifp->if_mtu = ifr->ifr_mtu;
		}
		break;
	default:
		error = EINVAL;			/* XXX netbsd has ENOTTY??? */
		break;
	}
	return (error);
}

static int
ether_resolvemulti(struct ifnet *ifp, struct sockaddr **llsa,
	struct sockaddr *sa)
{
	struct sockaddr_dl *sdl;
#ifdef INET
	struct sockaddr_in *sin;
#endif
#ifdef INET6
	struct sockaddr_in6 *sin6;
#endif
	u_char *e_addr;

	switch(sa->sa_family) {
	case AF_LINK:
		/*
		 * No mapping needed. Just check that it's a valid MC address.
		 */
		sdl = (struct sockaddr_dl *)sa;
		e_addr = LLADDR(sdl);
		if (!ETHER_IS_MULTICAST(e_addr))
			return EADDRNOTAVAIL;
		*llsa = 0;
		return 0;

#ifdef INET
	case AF_INET:
		sin = (struct sockaddr_in *)sa;
		if (!IN_MULTICAST(ntohl(sin->sin_addr.s_addr)))
			return EADDRNOTAVAIL;
		sdl = link_init_sdl(ifp, *llsa, IFT_ETHER);
		sdl->sdl_alen = ETHER_ADDR_LEN;
		e_addr = LLADDR(sdl);
		ETHER_MAP_IP_MULTICAST(&sin->sin_addr, e_addr);
		*llsa = (struct sockaddr *)sdl;
		return 0;
#endif
#ifdef INET6
	case AF_INET6:
		sin6 = (struct sockaddr_in6 *)sa;
		if (IN6_IS_ADDR_UNSPECIFIED(&sin6->sin6_addr)) {
			/*
			 * An IP6 address of 0 means listen to all
			 * of the Ethernet multicast address used for IP6.
			 * (This is used for multicast routers.)
			 */
			ifp->if_flags |= IFF_ALLMULTI;
			*llsa = 0;
			return 0;
		}
		if (!IN6_IS_ADDR_MULTICAST(&sin6->sin6_addr))
			return EADDRNOTAVAIL;
		sdl = link_init_sdl(ifp, *llsa, IFT_ETHER);
		sdl->sdl_alen = ETHER_ADDR_LEN;
		e_addr = LLADDR(sdl);
		ETHER_MAP_IPV6_MULTICAST(&sin6->sin6_addr, e_addr);
		*llsa = (struct sockaddr *)sdl;
		return 0;
#endif

	default:
		/*
		 * Well, the text isn't quite right, but it's the name
		 * that counts...
		 */
		return EAFNOSUPPORT;
	}
}

static moduledata_t ether_mod = {
	.name = "ether",
};

void
ether_vlan_mtap(struct bpf_if *bp, struct mbuf *m, void *data, u_int dlen)
{
	struct ether_vlan_header vlan;
	struct mbuf mv, mb;

	KASSERT((m->m_flags & M_VLANTAG) != 0,
	    ("%s: vlan information not present", __func__));
	KASSERT(m->m_len >= sizeof(struct ether_header),
	    ("%s: mbuf not large enough for header", __func__));
	bcopy(mtod(m, char *), &vlan, sizeof(struct ether_header));
	vlan.evl_proto = vlan.evl_encap_proto;
	vlan.evl_encap_proto = htons(ETHERTYPE_VLAN);
	vlan.evl_tag = htons(m->m_pkthdr.ether_vtag);
	m->m_len -= sizeof(struct ether_header);
	m->m_data += sizeof(struct ether_header);
	/*
	 * If a data link has been supplied by the caller, then we will need to
	 * re-create a stack allocated mbuf chain with the following structure:
	 *
	 * (1) mbuf #1 will contain the supplied data link
	 * (2) mbuf #2 will contain the vlan header
	 * (3) mbuf #3 will contain the original mbuf's packet data
	 *
	 * Otherwise, submit the packet and vlan header via bpf_mtap2().
	 */
	if (data != NULL) {
		mv.m_next = m;
		mv.m_data = (caddr_t)&vlan;
		mv.m_len = sizeof(vlan);
		mb.m_next = &mv;
		mb.m_data = data;
		mb.m_len = dlen;
		bpf_mtap(bp, &mb);
	} else
		bpf_mtap2(bp, &vlan, sizeof(vlan), m);
	m->m_len += sizeof(struct ether_header);
	m->m_data -= sizeof(struct ether_header);
}

struct mbuf *
ether_vlanencap(struct mbuf *m, uint16_t tag)
{
	struct ether_vlan_header *evl;

	M_PREPEND(m, ETHER_VLAN_ENCAP_LEN, M_NOWAIT);
	if (m == NULL)
		return (NULL);
	/* M_PREPEND takes care of m_len, m_pkthdr.len for us */

	if (m->m_len < sizeof(*evl)) {
		m = m_pullup(m, sizeof(*evl));
		if (m == NULL)
			return (NULL);
	}

	/*
	 * Transform the Ethernet header into an Ethernet header
	 * with 802.1Q encapsulation.
	 */
	evl = mtod(m, struct ether_vlan_header *);
	bcopy((char *)evl + ETHER_VLAN_ENCAP_LEN,
	    (char *)evl, ETHER_HDR_LEN - ETHER_TYPE_LEN);
	evl->evl_encap_proto = htons(ETHERTYPE_VLAN);
	evl->evl_tag = htons(tag);
	return (m);
}

DECLARE_MODULE(ether, ether_mod, SI_SUB_INIT_IF, SI_ORDER_ANY);
MODULE_VERSION(ether, 1);