2005-01-07 01:45:51 +00:00
|
|
|
/*-
|
2017-11-20 19:43:44 +00:00
|
|
|
* SPDX-License-Identifier: BSD-3-Clause
|
|
|
|
*
|
1994-05-24 10:09:53 +00:00
|
|
|
* Copyright (c) 1982, 1986, 1988, 1990, 1993
|
|
|
|
* The Regents of the University of California. All rights reserved.
|
|
|
|
*
|
|
|
|
* Redistribution and use in source and binary forms, with or without
|
|
|
|
* modification, are permitted provided that the following conditions
|
|
|
|
* are met:
|
|
|
|
* 1. Redistributions of source code must retain the above copyright
|
|
|
|
* notice, this list of conditions and the following disclaimer.
|
|
|
|
* 2. Redistributions in binary form must reproduce the above copyright
|
|
|
|
* notice, this list of conditions and the following disclaimer in the
|
|
|
|
* documentation and/or other materials provided with the distribution.
|
2017-02-28 23:42:47 +00:00
|
|
|
* 3. Neither the name of the University nor the names of its contributors
|
1994-05-24 10:09:53 +00:00
|
|
|
* may be used to endorse or promote products derived from this software
|
|
|
|
* without specific prior written permission.
|
|
|
|
*
|
|
|
|
* THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
|
|
|
|
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
|
|
|
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
|
|
|
* ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
|
|
|
|
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
|
|
|
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
|
|
|
|
* OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
|
|
|
|
* HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
|
|
|
|
* LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
|
|
|
|
* OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
|
|
|
|
* SUCH DAMAGE.
|
|
|
|
*
|
|
|
|
* @(#)ip_output.c 8.3 (Berkeley) 1/21/94
|
|
|
|
*/
|
|
|
|
|
2007-10-07 20:44:24 +00:00
|
|
|
#include <sys/cdefs.h>
|
|
|
|
__FBSDID("$FreeBSD$");
|
|
|
|
|
2014-02-07 15:18:23 +00:00
|
|
|
#include "opt_inet.h"
|
1999-12-22 19:13:38 +00:00
|
|
|
#include "opt_ipsec.h"
|
Add kernel-side support for in-kernel TLS.
KTLS adds support for in-kernel framing and encryption of Transport
Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports
offload of TLS for transmitted data. Key negotation must still be
performed in userland. Once completed, transmit session keys for a
connection are provided to the kernel via a new TCP_TXTLS_ENABLE
socket option. All subsequent data transmitted on the socket is
placed into TLS frames and encrypted using the supplied keys.
Any data written to a KTLS-enabled socket via write(2), aio_write(2),
or sendfile(2) is assumed to be application data and is encoded in TLS
frames with an application data type. Individual records can be sent
with a custom type (e.g. handshake messages) via sendmsg(2) with a new
control message (TLS_SET_RECORD_TYPE) specifying the record type.
At present, rekeying is not supported though the in-kernel framework
should support rekeying.
KTLS makes use of the recently added unmapped mbufs to store TLS
frames in the socket buffer. Each TLS frame is described by a single
ext_pgs mbuf. The ext_pgs structure contains the header of the TLS
record (and trailer for encrypted records) as well as references to
the associated TLS session.
KTLS supports two primary methods of encrypting TLS frames: software
TLS and ifnet TLS.
Software TLS marks mbufs holding socket data as not ready via
M_NOTREADY similar to sendfile(2) when TLS framing information is
added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then
called to schedule TLS frames for encryption. In the case of
sendfile_iodone() calls ktls_enqueue() instead of pru_ready() leaving
the mbufs marked M_NOTREADY until encryption is completed. For other
writes (vn_sendfile when pages are available, write(2), etc.), the
PRUS_NOTREADY is set when invoking pru_send() along with invoking
ktls_enqueue().
A pool of worker threads (the "KTLS" kernel process) encrypts TLS
frames queued via ktls_enqueue(). Each TLS frame is temporarily
mapped using the direct map and passed to a software encryption
backend to perform the actual encryption.
(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if
someone wished to make this work on architectures without a direct
map.)
KTLS supports pluggable software encryption backends. Internally,
Netflix uses proprietary pure-software backends. This commit includes
a simple backend in a new ktls_ocf.ko module that uses the kernel's
OpenCrypto framework to provide AES-GCM encryption of TLS frames. As
a result, software TLS is now a bit of a misnomer as it can make use
of hardware crypto accelerators.
Once software encryption has finished, the TLS frame mbufs are marked
ready via pru_ready(). At this point, the encrypted data appears as
regular payload to the TCP stack stored in unmapped mbufs.
ifnet TLS permits a NIC to offload the TLS encryption and TCP
segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS)
is allocated on the interface a socket is routed over and associated
with a TLS session. TLS records for a TLS session using ifnet TLS are
not marked M_NOTREADY but are passed down the stack unencrypted. The
ip_output_send() and ip6_output_send() helper functions that apply
send tags to outbound IP packets verify that the send tag of the TLS
record matches the outbound interface. If so, the packet is tagged
with the TLS send tag and sent to the interface. The NIC device
driver must recognize packets with the TLS send tag and schedule them
for TLS encryption and TCP segmentation. If the the outbound
interface does not match the interface in the TLS send tag, the packet
is dropped. In addition, a task is scheduled to refresh the TLS send
tag for the TLS session. If a new TLS send tag cannot be allocated,
the connection is dropped. If a new TLS send tag is allocated,
however, subsequent packets will be tagged with the correct TLS send
tag. (This latter case has been tested by configuring both ports of a
Chelsio T6 in a lagg and failing over from one port to another. As
the connections migrated to the new port, new TLS send tags were
allocated for the new port and connections resumed without being
dropped.)
ifnet TLS can be enabled and disabled on supported network interfaces
via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported
across both vlan devices and lagg interfaces using failover, lacp with
flowid enabled, or lacp with flowid enabled.
Applications may request the current KTLS mode of a connection via a
new TCP_TXTLS_MODE socket option. They can also use this socket
option to toggle between software and ifnet TLS modes.
In addition, a testing tool is available in tools/tools/switch_tls.
This is modeled on tcpdrop and uses similar syntax. However, instead
of dropping connections, -s is used to force KTLS connections to
switch to software TLS and -i is used to switch to ifnet TLS.
Various sysctls and counters are available under the kern.ipc.tls
sysctl node. The kern.ipc.tls.enable node must be set to true to
enable KTLS (it is off by default). The use of unmapped mbufs must
also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.
KTLS is enabled via the KERN_TLS kernel option.
This patch is the culmination of years of work by several folks
including Scott Long and Randall Stewart for the original design and
implementation; Drew Gallatin for several optimizations including the
use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records
awaiting software encryption, and pluggable software crypto backends;
and John Baldwin for modifications to support hardware TLS offload.
Reviewed by: gallatin, hselasky, rrs
Obtained from: Netflix
Sponsored by: Netflix, Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D21277
2019-08-27 00:01:56 +00:00
|
|
|
#include "opt_kern_tls.h"
|
2003-04-12 06:11:46 +00:00
|
|
|
#include "opt_mbuf_stress_test.h"
|
This patch provides the back end support for equal-cost multi-path
(ECMP) for both IPv4 and IPv6. Previously, multipath route insertion
is disallowed. For example,
route add -net 192.103.54.0/24 10.9.44.1
route add -net 192.103.54.0/24 10.9.44.2
The second route insertion will trigger an error message of
"add net 192.103.54.0/24: gateway 10.2.5.2: route already in table"
Multiple default routes can also be inserted. Here is the netstat
output:
default 10.2.5.1 UGS 0 3074 bge0 =>
default 10.2.5.2 UGS 0 0 bge0
When multipath routes exist, the "route delete" command requires
a specific gateway to be specified or else an error message would
be displayed. For example,
route delete default
would fail and trigger the following error message:
"route: writing to routing socket: No such process"
"delete net default: not in table"
On the other hand,
route delete default 10.2.5.2
would be successful: "delete net default: gateway 10.2.5.2"
One does not have to specify a gateway if there is only a single
route for a particular destination.
I need to perform more testings on address aliases and multiple
interfaces that have the same IP prefixes. This patch as it
stands today is not yet ready for prime time. Therefore, the ECMP
code fragments are fully guarded by the RADIX_MPATH macro.
Include the "options RADIX_MPATH" in the kernel configuration
to enable this feature.
Reviewed by: robert, sam, gnn, julian, kmacy
2008-04-13 05:45:14 +00:00
|
|
|
#include "opt_mpath.h"
|
2019-06-11 22:07:39 +00:00
|
|
|
#include "opt_ratelimit.h"
|
2013-08-25 21:54:41 +00:00
|
|
|
#include "opt_route.h"
|
2014-05-18 22:37:31 +00:00
|
|
|
#include "opt_rss.h"
|
2019-06-11 22:07:39 +00:00
|
|
|
#include "opt_sctp.h"
|
1997-11-05 20:17:23 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
#include <sys/param.h>
|
1994-05-25 09:21:21 +00:00
|
|
|
#include <sys/systm.h>
|
1998-12-21 21:36:40 +00:00
|
|
|
#include <sys/kernel.h>
|
Add kernel-side support for in-kernel TLS.
KTLS adds support for in-kernel framing and encryption of Transport
Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports
offload of TLS for transmitted data. Key negotation must still be
performed in userland. Once completed, transmit session keys for a
connection are provided to the kernel via a new TCP_TXTLS_ENABLE
socket option. All subsequent data transmitted on the socket is
placed into TLS frames and encrypted using the supplied keys.
Any data written to a KTLS-enabled socket via write(2), aio_write(2),
or sendfile(2) is assumed to be application data and is encoded in TLS
frames with an application data type. Individual records can be sent
with a custom type (e.g. handshake messages) via sendmsg(2) with a new
control message (TLS_SET_RECORD_TYPE) specifying the record type.
At present, rekeying is not supported though the in-kernel framework
should support rekeying.
KTLS makes use of the recently added unmapped mbufs to store TLS
frames in the socket buffer. Each TLS frame is described by a single
ext_pgs mbuf. The ext_pgs structure contains the header of the TLS
record (and trailer for encrypted records) as well as references to
the associated TLS session.
KTLS supports two primary methods of encrypting TLS frames: software
TLS and ifnet TLS.
Software TLS marks mbufs holding socket data as not ready via
M_NOTREADY similar to sendfile(2) when TLS framing information is
added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then
called to schedule TLS frames for encryption. In the case of
sendfile_iodone() calls ktls_enqueue() instead of pru_ready() leaving
the mbufs marked M_NOTREADY until encryption is completed. For other
writes (vn_sendfile when pages are available, write(2), etc.), the
PRUS_NOTREADY is set when invoking pru_send() along with invoking
ktls_enqueue().
A pool of worker threads (the "KTLS" kernel process) encrypts TLS
frames queued via ktls_enqueue(). Each TLS frame is temporarily
mapped using the direct map and passed to a software encryption
backend to perform the actual encryption.
(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if
someone wished to make this work on architectures without a direct
map.)
KTLS supports pluggable software encryption backends. Internally,
Netflix uses proprietary pure-software backends. This commit includes
a simple backend in a new ktls_ocf.ko module that uses the kernel's
OpenCrypto framework to provide AES-GCM encryption of TLS frames. As
a result, software TLS is now a bit of a misnomer as it can make use
of hardware crypto accelerators.
Once software encryption has finished, the TLS frame mbufs are marked
ready via pru_ready(). At this point, the encrypted data appears as
regular payload to the TCP stack stored in unmapped mbufs.
ifnet TLS permits a NIC to offload the TLS encryption and TCP
segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS)
is allocated on the interface a socket is routed over and associated
with a TLS session. TLS records for a TLS session using ifnet TLS are
not marked M_NOTREADY but are passed down the stack unencrypted. The
ip_output_send() and ip6_output_send() helper functions that apply
send tags to outbound IP packets verify that the send tag of the TLS
record matches the outbound interface. If so, the packet is tagged
with the TLS send tag and sent to the interface. The NIC device
driver must recognize packets with the TLS send tag and schedule them
for TLS encryption and TCP segmentation. If the the outbound
interface does not match the interface in the TLS send tag, the packet
is dropped. In addition, a task is scheduled to refresh the TLS send
tag for the TLS session. If a new TLS send tag cannot be allocated,
the connection is dropped. If a new TLS send tag is allocated,
however, subsequent packets will be tagged with the correct TLS send
tag. (This latter case has been tested by configuring both ports of a
Chelsio T6 in a lagg and failing over from one port to another. As
the connections migrated to the new port, new TLS send tags were
allocated for the new port and connections resumed without being
dropped.)
ifnet TLS can be enabled and disabled on supported network interfaces
via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported
across both vlan devices and lagg interfaces using failover, lacp with
flowid enabled, or lacp with flowid enabled.
Applications may request the current KTLS mode of a connection via a
new TCP_TXTLS_MODE socket option. They can also use this socket
option to toggle between software and ifnet TLS modes.
In addition, a testing tool is available in tools/tools/switch_tls.
This is modeled on tcpdrop and uses similar syntax. However, instead
of dropping connections, -s is used to force KTLS connections to
switch to software TLS and -i is used to switch to ifnet TLS.
Various sysctls and counters are available under the kern.ipc.tls
sysctl node. The kern.ipc.tls.enable node must be set to true to
enable KTLS (it is off by default). The use of unmapped mbufs must
also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.
KTLS is enabled via the KERN_TLS kernel option.
This patch is the culmination of years of work by several folks
including Scott Long and Randall Stewart for the original design and
implementation; Drew Gallatin for several optimizations including the
use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records
awaiting software encryption, and pluggable software crypto backends;
and John Baldwin for modifications to support hardware TLS offload.
Reviewed by: gallatin, hselasky, rrs
Obtained from: Netflix
Sponsored by: Netflix, Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D21277
2019-08-27 00:01:56 +00:00
|
|
|
#include <sys/ktls.h>
|
2015-07-29 08:12:05 +00:00
|
|
|
#include <sys/lock.h>
|
1994-05-24 10:09:53 +00:00
|
|
|
#include <sys/malloc.h>
|
|
|
|
#include <sys/mbuf.h>
|
2006-11-06 13:42:10 +00:00
|
|
|
#include <sys/priv.h>
|
2008-02-02 14:11:31 +00:00
|
|
|
#include <sys/proc.h>
|
1994-05-24 10:09:53 +00:00
|
|
|
#include <sys/protosw.h>
|
2015-07-29 08:12:05 +00:00
|
|
|
#include <sys/rmlock.h>
|
2013-08-25 21:54:41 +00:00
|
|
|
#include <sys/sdt.h>
|
1994-05-24 10:09:53 +00:00
|
|
|
#include <sys/socket.h>
|
|
|
|
#include <sys/socketvar.h>
|
2003-03-25 05:45:05 +00:00
|
|
|
#include <sys/sysctl.h>
|
2008-02-02 14:11:31 +00:00
|
|
|
#include <sys/ucred.h>
|
1994-05-24 10:09:53 +00:00
|
|
|
|
|
|
|
#include <net/if.h>
|
2013-10-26 17:58:36 +00:00
|
|
|
#include <net/if_var.h>
|
2009-08-14 23:44:59 +00:00
|
|
|
#include <net/if_llatbl.h>
|
2004-08-17 22:05:54 +00:00
|
|
|
#include <net/netisr.h>
|
2004-08-27 15:16:24 +00:00
|
|
|
#include <net/pfil.h>
|
1994-05-24 10:09:53 +00:00
|
|
|
#include <net/route.h>
|
This patch provides the back end support for equal-cost multi-path
(ECMP) for both IPv4 and IPv6. Previously, multipath route insertion
is disallowed. For example,
route add -net 192.103.54.0/24 10.9.44.1
route add -net 192.103.54.0/24 10.9.44.2
The second route insertion will trigger an error message of
"add net 192.103.54.0/24: gateway 10.2.5.2: route already in table"
Multiple default routes can also be inserted. Here is the netstat
output:
default 10.2.5.1 UGS 0 3074 bge0 =>
default 10.2.5.2 UGS 0 0 bge0
When multipath routes exist, the "route delete" command requires
a specific gateway to be specified or else an error message would
be displayed. For example,
route delete default
would fail and trigger the following error message:
"route: writing to routing socket: No such process"
"delete net default: not in table"
On the other hand,
route delete default 10.2.5.2
would be successful: "delete net default: gateway 10.2.5.2"
One does not have to specify a gateway if there is only a single
route for a particular destination.
I need to perform more testings on address aliases and multiple
interfaces that have the same IP prefixes. This patch as it
stands today is not yet ready for prime time. Therefore, the ECMP
code fragments are fully guarded by the RADIX_MPATH macro.
Include the "options RADIX_MPATH" in the kernel configuration
to enable this feature.
Reviewed by: robert, sam, gnn, julian, kmacy
2008-04-13 05:45:14 +00:00
|
|
|
#ifdef RADIX_MPATH
|
|
|
|
#include <net/radix_mpath.h>
|
|
|
|
#endif
|
2015-01-18 18:06:40 +00:00
|
|
|
#include <net/rss_config.h>
|
2008-12-02 21:37:28 +00:00
|
|
|
#include <net/vnet.h>
|
1994-05-24 10:09:53 +00:00
|
|
|
|
|
|
|
#include <netinet/in.h>
|
2019-05-08 23:39:24 +00:00
|
|
|
#include <netinet/in_fib.h>
|
2013-08-25 21:54:41 +00:00
|
|
|
#include <netinet/in_kdtrace.h>
|
1994-05-24 10:09:53 +00:00
|
|
|
#include <netinet/in_systm.h>
|
|
|
|
#include <netinet/ip.h>
|
|
|
|
#include <netinet/in_pcb.h>
|
2014-05-18 22:37:31 +00:00
|
|
|
#include <netinet/in_rss.h>
|
1994-05-24 10:09:53 +00:00
|
|
|
#include <netinet/in_var.h>
|
|
|
|
#include <netinet/ip_var.h>
|
2005-11-18 20:12:40 +00:00
|
|
|
#include <netinet/ip_options.h>
|
2018-06-06 07:04:40 +00:00
|
|
|
|
|
|
|
#include <netinet/udp.h>
|
|
|
|
#include <netinet/udp_var.h>
|
|
|
|
|
2009-02-03 11:00:43 +00:00
|
|
|
#ifdef SCTP
|
|
|
|
#include <netinet/sctp.h>
|
|
|
|
#include <netinet/sctp_crc32.h>
|
|
|
|
#endif
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2017-02-06 08:49:57 +00:00
|
|
|
#include <netipsec/ipsec_support.h>
|
2006-02-01 13:55:03 +00:00
|
|
|
|
|
|
|
#include <machine/in_cksum.h>
|
|
|
|
|
2006-10-22 11:52:19 +00:00
|
|
|
#include <security/mac/mac_framework.h>
|
|
|
|
|
2003-04-12 06:11:46 +00:00
|
|
|
#ifdef MBUF_STRESS_TEST
|
2011-04-14 09:47:09 +00:00
|
|
|
static int mbuf_frag_size = 0;
|
2003-03-25 05:45:05 +00:00
|
|
|
SYSCTL_INT(_net_inet_ip, OID_AUTO, mbuf_frag_size, CTLFLAG_RW,
|
|
|
|
&mbuf_frag_size, 0, "Fragment outgoing mbufs to this size");
|
|
|
|
#endif
|
|
|
|
|
2015-08-08 15:58:35 +00:00
|
|
|
static void ip_mloopback(struct ifnet *, const struct mbuf *, int);
|
1997-02-10 11:45:37 +00:00
|
|
|
|
|
|
|
|
2009-03-04 03:45:34 +00:00
|
|
|
extern int in_mcast_loop;
|
1996-07-10 19:44:30 +00:00
|
|
|
extern struct protosw inetsw[];
|
|
|
|
|
2015-07-29 18:04:01 +00:00
|
|
|
static inline int
|
2019-06-21 07:58:08 +00:00
|
|
|
ip_output_pfil(struct mbuf **mp, struct ifnet *ifp, int flags,
|
|
|
|
struct inpcb *inp, struct sockaddr_in *dst, int *fibnum, int *error)
|
2015-07-29 18:04:01 +00:00
|
|
|
{
|
|
|
|
struct m_tag *fwd_tag = NULL;
|
2015-08-03 17:47:02 +00:00
|
|
|
struct mbuf *m;
|
2015-07-29 18:04:01 +00:00
|
|
|
struct in_addr odst;
|
|
|
|
struct ip *ip;
|
2019-06-21 07:58:08 +00:00
|
|
|
int pflags = PFIL_OUT;
|
|
|
|
|
|
|
|
if (flags & IP_FORWARDING)
|
|
|
|
pflags |= PFIL_FWD;
|
2015-07-29 18:04:01 +00:00
|
|
|
|
2015-08-03 17:47:02 +00:00
|
|
|
m = *mp;
|
2015-07-29 18:04:01 +00:00
|
|
|
ip = mtod(m, struct ip *);
|
|
|
|
|
|
|
|
/* Run through list of hooks for output packets. */
|
|
|
|
odst.s_addr = ip->ip_dst.s_addr;
|
2019-06-21 07:58:08 +00:00
|
|
|
switch (pfil_run_hooks(V_inet_pfil_head, mp, ifp, pflags, inp)) {
|
New pfil(9) KPI together with newborn pfil API and control utility.
The KPI have been reviewed and cleansed of features that were planned
back 20 years ago and never implemented. The pfil(9) internals have
been made opaque to protocols with only returned types and function
declarations exposed. The KPI is made more strict, but at the same time
more extensible, as kernel uses same command structures that userland
ioctl uses.
In nutshell [KA]PI is about declaring filtering points, declaring
filters and linking and unlinking them together.
New [KA]PI makes it possible to reconfigure pfil(9) configuration:
change order of hooks, rehook filter from one filtering point to a
different one, disconnect a hook on output leaving it on input only,
prepend/append a filter to existing list of filters.
Now it possible for a single packet filter to provide multiple rulesets
that may be linked to different points. Think of per-interface ACLs in
Cisco or Juniper. None of existing packet filters yet support that,
however limited usage is already possible, e.g. default ruleset can
be moved to single interface, as soon as interface would pride their
filtering points.
Another future feature is possiblity to create pfil heads, that provide
not an mbuf pointer but just a memory pointer with length. That would
allow filtering at very early stages of a packet lifecycle, e.g. when
packet has just been received by a NIC and no mbuf was yet allocated.
Differential Revision: https://reviews.freebsd.org/D18951
2019-01-31 23:01:03 +00:00
|
|
|
case PFIL_DROPPED:
|
|
|
|
*error = EPERM;
|
|
|
|
/* FALLTHROUGH */
|
|
|
|
case PFIL_CONSUMED:
|
2015-07-29 18:04:01 +00:00
|
|
|
return 1; /* Finished */
|
New pfil(9) KPI together with newborn pfil API and control utility.
The KPI have been reviewed and cleansed of features that were planned
back 20 years ago and never implemented. The pfil(9) internals have
been made opaque to protocols with only returned types and function
declarations exposed. The KPI is made more strict, but at the same time
more extensible, as kernel uses same command structures that userland
ioctl uses.
In nutshell [KA]PI is about declaring filtering points, declaring
filters and linking and unlinking them together.
New [KA]PI makes it possible to reconfigure pfil(9) configuration:
change order of hooks, rehook filter from one filtering point to a
different one, disconnect a hook on output leaving it on input only,
prepend/append a filter to existing list of filters.
Now it possible for a single packet filter to provide multiple rulesets
that may be linked to different points. Think of per-interface ACLs in
Cisco or Juniper. None of existing packet filters yet support that,
however limited usage is already possible, e.g. default ruleset can
be moved to single interface, as soon as interface would pride their
filtering points.
Another future feature is possiblity to create pfil heads, that provide
not an mbuf pointer but just a memory pointer with length. That would
allow filtering at very early stages of a packet lifecycle, e.g. when
packet has just been received by a NIC and no mbuf was yet allocated.
Differential Revision: https://reviews.freebsd.org/D18951
2019-01-31 23:01:03 +00:00
|
|
|
case PFIL_PASS:
|
|
|
|
*error = 0;
|
|
|
|
}
|
|
|
|
m = *mp;
|
2015-07-29 18:04:01 +00:00
|
|
|
ip = mtod(m, struct ip *);
|
|
|
|
|
|
|
|
/* See if destination IP address was changed by packet filter. */
|
|
|
|
if (odst.s_addr != ip->ip_dst.s_addr) {
|
|
|
|
m->m_flags |= M_SKIP_FIREWALL;
|
|
|
|
/* If destination is now ourself drop to ip_input(). */
|
|
|
|
if (in_localip(ip->ip_dst)) {
|
|
|
|
m->m_flags |= M_FASTFWD_OURS;
|
|
|
|
if (m->m_pkthdr.rcvif == NULL)
|
|
|
|
m->m_pkthdr.rcvif = V_loif;
|
|
|
|
if (m->m_pkthdr.csum_flags & CSUM_DELAY_DATA) {
|
|
|
|
m->m_pkthdr.csum_flags |=
|
|
|
|
CSUM_DATA_VALID | CSUM_PSEUDO_HDR;
|
|
|
|
m->m_pkthdr.csum_data = 0xffff;
|
|
|
|
}
|
|
|
|
m->m_pkthdr.csum_flags |=
|
|
|
|
CSUM_IP_CHECKED | CSUM_IP_VALID;
|
|
|
|
#ifdef SCTP
|
|
|
|
if (m->m_pkthdr.csum_flags & CSUM_SCTP)
|
|
|
|
m->m_pkthdr.csum_flags |= CSUM_SCTP_VALID;
|
|
|
|
#endif
|
|
|
|
*error = netisr_queue(NETISR_IP, m);
|
|
|
|
return 1; /* Finished */
|
|
|
|
}
|
|
|
|
|
|
|
|
bzero(dst, sizeof(*dst));
|
|
|
|
dst->sin_family = AF_INET;
|
|
|
|
dst->sin_len = sizeof(*dst);
|
|
|
|
dst->sin_addr = ip->ip_dst;
|
|
|
|
|
|
|
|
return -1; /* Reloop */
|
|
|
|
}
|
|
|
|
/* See if fib was changed by packet filter. */
|
|
|
|
if ((*fibnum) != M_GETFIB(m)) {
|
|
|
|
m->m_flags |= M_SKIP_FIREWALL;
|
|
|
|
*fibnum = M_GETFIB(m);
|
|
|
|
return -1; /* Reloop for FIB change */
|
|
|
|
}
|
|
|
|
|
|
|
|
/* See if local, if yes, send it to netisr with IP_FASTFWD_OURS. */
|
|
|
|
if (m->m_flags & M_FASTFWD_OURS) {
|
|
|
|
if (m->m_pkthdr.rcvif == NULL)
|
|
|
|
m->m_pkthdr.rcvif = V_loif;
|
|
|
|
if (m->m_pkthdr.csum_flags & CSUM_DELAY_DATA) {
|
|
|
|
m->m_pkthdr.csum_flags |=
|
|
|
|
CSUM_DATA_VALID | CSUM_PSEUDO_HDR;
|
|
|
|
m->m_pkthdr.csum_data = 0xffff;
|
|
|
|
}
|
|
|
|
#ifdef SCTP
|
|
|
|
if (m->m_pkthdr.csum_flags & CSUM_SCTP)
|
|
|
|
m->m_pkthdr.csum_flags |= CSUM_SCTP_VALID;
|
|
|
|
#endif
|
|
|
|
m->m_pkthdr.csum_flags |=
|
|
|
|
CSUM_IP_CHECKED | CSUM_IP_VALID;
|
|
|
|
|
|
|
|
*error = netisr_queue(NETISR_IP, m);
|
|
|
|
return 1; /* Finished */
|
|
|
|
}
|
|
|
|
/* Or forward to some other address? */
|
|
|
|
if ((m->m_flags & M_IP_NEXTHOP) &&
|
|
|
|
((fwd_tag = m_tag_find(m, PACKET_TAG_IPFORWARD, NULL)) != NULL)) {
|
|
|
|
bcopy((fwd_tag+1), dst, sizeof(struct sockaddr_in));
|
|
|
|
m->m_flags |= M_SKIP_FIREWALL;
|
|
|
|
m->m_flags &= ~M_IP_NEXTHOP;
|
|
|
|
m_tag_delete(m, fwd_tag);
|
|
|
|
|
|
|
|
return -1; /* Reloop for CHANGE of dst */
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
Restructure mbuf send tags to provide stronger guarantees.
- Perform ifp mismatch checks (to determine if a send tag is allocated
for a different ifp than the one the packet is being output on), in
ip_output() and ip6_output(). This avoids sending packets with send
tags to ifnet drivers that don't support send tags.
Since we are now checking for ifp mismatches before invoking
if_output, we can now try to allocate a new tag before invoking
if_output sending the original packet on the new tag if allocation
succeeds.
To avoid code duplication for the fragment and unfragmented cases,
add ip_output_send() and ip6_output_send() as wrappers around
if_output and nd6_output_ifp, respectively. All of the logic for
setting send tags and dealing with send tag-related errors is done
in these wrapper functions.
For pseudo interfaces that wrap other network interfaces (vlan and
lagg), wrapper send tags are now allocated so that ip*_output see
the wrapper ifp as the ifp in the send tag. The if_transmit
routines rewrite the send tags after performing an ifp mismatch
check. If an ifp mismatch is detected, the transmit routines fail
with EAGAIN.
- To provide clearer life cycle management of send tags, especially
in the presence of vlan and lagg wrapper tags, add a reference count
to send tags managed via m_snd_tag_ref() and m_snd_tag_rele().
Provide a helper function (m_snd_tag_init()) for use by drivers
supporting send tags. m_snd_tag_init() takes care of the if_ref
on the ifp meaning that code alloating send tags via if_snd_tag_alloc
no longer has to manage that manually. Similarly, m_snd_tag_rele
drops the refcount on the ifp after invoking if_snd_tag_free when
the last reference to a send tag is dropped.
This also closes use after free races if there are pending packets in
driver tx rings after the socket is closed (e.g. from tcpdrop).
In order for m_free to work reliably, add a new CSUM_SND_TAG flag in
csum_flags to indicate 'snd_tag' is set (rather than 'rcvif').
Drivers now also check this flag instead of checking snd_tag against
NULL. This avoids false positive matches when a forwarded packet
has a non-NULL rcvif that was treated as a send tag.
- cxgbe was relying on snd_tag_free being called when the inp was
detached so that it could kick the firmware to flush any pending
work on the flow. This is because the driver doesn't require ACK
messages from the firmware for every request, but instead does a
kind of manual interrupt coalescing by only setting a flag to
request a completion on a subset of requests. If all of the
in-flight requests don't have the flag when the tag is detached from
the inp, the flow might never return the credits. The current
snd_tag_free command issues a flush command to force the credits to
return. However, the credit return is what also frees the mbufs,
and since those mbufs now hold references on the tag, this meant
that snd_tag_free would never be called.
To fix, explicitly drop the mbuf's reference on the snd tag when the
mbuf is queued in the firmware work queue. This means that once the
inp's reference on the tag goes away and all in-flight mbufs have
been queued to the firmware, tag's refcount will drop to zero and
snd_tag_free will kick in and send the flush request. Note that we
need to avoid doing this in the middle of ethofld_tx(), so the
driver grabs a temporary reference on the tag around that loop to
defer the free to the end of the function in case it sends the last
mbuf to the queue after the inp has dropped its reference on the
tag.
- mlx5 preallocates send tags and was using the ifp pointer even when
the send tag wasn't in use. Explicitly use the ifp from other data
structures instead.
- Sprinkle some assertions in various places to assert that received
packets don't have a send tag, and that other places that overwrite
rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer.
Reviewed by: gallatin, hselasky, rgrimes, ae
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20117
2019-05-24 22:30:40 +00:00
|
|
|
static int
|
|
|
|
ip_output_send(struct inpcb *inp, struct ifnet *ifp, struct mbuf *m,
|
2019-09-24 18:18:11 +00:00
|
|
|
const struct sockaddr_in *gw, struct route *ro, bool stamp_tag)
|
Restructure mbuf send tags to provide stronger guarantees.
- Perform ifp mismatch checks (to determine if a send tag is allocated
for a different ifp than the one the packet is being output on), in
ip_output() and ip6_output(). This avoids sending packets with send
tags to ifnet drivers that don't support send tags.
Since we are now checking for ifp mismatches before invoking
if_output, we can now try to allocate a new tag before invoking
if_output sending the original packet on the new tag if allocation
succeeds.
To avoid code duplication for the fragment and unfragmented cases,
add ip_output_send() and ip6_output_send() as wrappers around
if_output and nd6_output_ifp, respectively. All of the logic for
setting send tags and dealing with send tag-related errors is done
in these wrapper functions.
For pseudo interfaces that wrap other network interfaces (vlan and
lagg), wrapper send tags are now allocated so that ip*_output see
the wrapper ifp as the ifp in the send tag. The if_transmit
routines rewrite the send tags after performing an ifp mismatch
check. If an ifp mismatch is detected, the transmit routines fail
with EAGAIN.
- To provide clearer life cycle management of send tags, especially
in the presence of vlan and lagg wrapper tags, add a reference count
to send tags managed via m_snd_tag_ref() and m_snd_tag_rele().
Provide a helper function (m_snd_tag_init()) for use by drivers
supporting send tags. m_snd_tag_init() takes care of the if_ref
on the ifp meaning that code alloating send tags via if_snd_tag_alloc
no longer has to manage that manually. Similarly, m_snd_tag_rele
drops the refcount on the ifp after invoking if_snd_tag_free when
the last reference to a send tag is dropped.
This also closes use after free races if there are pending packets in
driver tx rings after the socket is closed (e.g. from tcpdrop).
In order for m_free to work reliably, add a new CSUM_SND_TAG flag in
csum_flags to indicate 'snd_tag' is set (rather than 'rcvif').
Drivers now also check this flag instead of checking snd_tag against
NULL. This avoids false positive matches when a forwarded packet
has a non-NULL rcvif that was treated as a send tag.
- cxgbe was relying on snd_tag_free being called when the inp was
detached so that it could kick the firmware to flush any pending
work on the flow. This is because the driver doesn't require ACK
messages from the firmware for every request, but instead does a
kind of manual interrupt coalescing by only setting a flag to
request a completion on a subset of requests. If all of the
in-flight requests don't have the flag when the tag is detached from
the inp, the flow might never return the credits. The current
snd_tag_free command issues a flush command to force the credits to
return. However, the credit return is what also frees the mbufs,
and since those mbufs now hold references on the tag, this meant
that snd_tag_free would never be called.
To fix, explicitly drop the mbuf's reference on the snd tag when the
mbuf is queued in the firmware work queue. This means that once the
inp's reference on the tag goes away and all in-flight mbufs have
been queued to the firmware, tag's refcount will drop to zero and
snd_tag_free will kick in and send the flush request. Note that we
need to avoid doing this in the middle of ethofld_tx(), so the
driver grabs a temporary reference on the tag around that loop to
defer the free to the end of the function in case it sends the last
mbuf to the queue after the inp has dropped its reference on the
tag.
- mlx5 preallocates send tags and was using the ifp pointer even when
the send tag wasn't in use. Explicitly use the ifp from other data
structures instead.
- Sprinkle some assertions in various places to assert that received
packets don't have a send tag, and that other places that overwrite
rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer.
Reviewed by: gallatin, hselasky, rgrimes, ae
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20117
2019-05-24 22:30:40 +00:00
|
|
|
{
|
Add kernel-side support for in-kernel TLS.
KTLS adds support for in-kernel framing and encryption of Transport
Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports
offload of TLS for transmitted data. Key negotation must still be
performed in userland. Once completed, transmit session keys for a
connection are provided to the kernel via a new TCP_TXTLS_ENABLE
socket option. All subsequent data transmitted on the socket is
placed into TLS frames and encrypted using the supplied keys.
Any data written to a KTLS-enabled socket via write(2), aio_write(2),
or sendfile(2) is assumed to be application data and is encoded in TLS
frames with an application data type. Individual records can be sent
with a custom type (e.g. handshake messages) via sendmsg(2) with a new
control message (TLS_SET_RECORD_TYPE) specifying the record type.
At present, rekeying is not supported though the in-kernel framework
should support rekeying.
KTLS makes use of the recently added unmapped mbufs to store TLS
frames in the socket buffer. Each TLS frame is described by a single
ext_pgs mbuf. The ext_pgs structure contains the header of the TLS
record (and trailer for encrypted records) as well as references to
the associated TLS session.
KTLS supports two primary methods of encrypting TLS frames: software
TLS and ifnet TLS.
Software TLS marks mbufs holding socket data as not ready via
M_NOTREADY similar to sendfile(2) when TLS framing information is
added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then
called to schedule TLS frames for encryption. In the case of
sendfile_iodone() calls ktls_enqueue() instead of pru_ready() leaving
the mbufs marked M_NOTREADY until encryption is completed. For other
writes (vn_sendfile when pages are available, write(2), etc.), the
PRUS_NOTREADY is set when invoking pru_send() along with invoking
ktls_enqueue().
A pool of worker threads (the "KTLS" kernel process) encrypts TLS
frames queued via ktls_enqueue(). Each TLS frame is temporarily
mapped using the direct map and passed to a software encryption
backend to perform the actual encryption.
(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if
someone wished to make this work on architectures without a direct
map.)
KTLS supports pluggable software encryption backends. Internally,
Netflix uses proprietary pure-software backends. This commit includes
a simple backend in a new ktls_ocf.ko module that uses the kernel's
OpenCrypto framework to provide AES-GCM encryption of TLS frames. As
a result, software TLS is now a bit of a misnomer as it can make use
of hardware crypto accelerators.
Once software encryption has finished, the TLS frame mbufs are marked
ready via pru_ready(). At this point, the encrypted data appears as
regular payload to the TCP stack stored in unmapped mbufs.
ifnet TLS permits a NIC to offload the TLS encryption and TCP
segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS)
is allocated on the interface a socket is routed over and associated
with a TLS session. TLS records for a TLS session using ifnet TLS are
not marked M_NOTREADY but are passed down the stack unencrypted. The
ip_output_send() and ip6_output_send() helper functions that apply
send tags to outbound IP packets verify that the send tag of the TLS
record matches the outbound interface. If so, the packet is tagged
with the TLS send tag and sent to the interface. The NIC device
driver must recognize packets with the TLS send tag and schedule them
for TLS encryption and TCP segmentation. If the the outbound
interface does not match the interface in the TLS send tag, the packet
is dropped. In addition, a task is scheduled to refresh the TLS send
tag for the TLS session. If a new TLS send tag cannot be allocated,
the connection is dropped. If a new TLS send tag is allocated,
however, subsequent packets will be tagged with the correct TLS send
tag. (This latter case has been tested by configuring both ports of a
Chelsio T6 in a lagg and failing over from one port to another. As
the connections migrated to the new port, new TLS send tags were
allocated for the new port and connections resumed without being
dropped.)
ifnet TLS can be enabled and disabled on supported network interfaces
via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported
across both vlan devices and lagg interfaces using failover, lacp with
flowid enabled, or lacp with flowid enabled.
Applications may request the current KTLS mode of a connection via a
new TCP_TXTLS_MODE socket option. They can also use this socket
option to toggle between software and ifnet TLS modes.
In addition, a testing tool is available in tools/tools/switch_tls.
This is modeled on tcpdrop and uses similar syntax. However, instead
of dropping connections, -s is used to force KTLS connections to
switch to software TLS and -i is used to switch to ifnet TLS.
Various sysctls and counters are available under the kern.ipc.tls
sysctl node. The kern.ipc.tls.enable node must be set to true to
enable KTLS (it is off by default). The use of unmapped mbufs must
also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.
KTLS is enabled via the KERN_TLS kernel option.
This patch is the culmination of years of work by several folks
including Scott Long and Randall Stewart for the original design and
implementation; Drew Gallatin for several optimizations including the
use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records
awaiting software encryption, and pluggable software crypto backends;
and John Baldwin for modifications to support hardware TLS offload.
Reviewed by: gallatin, hselasky, rrs
Obtained from: Netflix
Sponsored by: Netflix, Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D21277
2019-08-27 00:01:56 +00:00
|
|
|
#ifdef KERN_TLS
|
|
|
|
struct ktls_session *tls = NULL;
|
|
|
|
#endif
|
Restructure mbuf send tags to provide stronger guarantees.
- Perform ifp mismatch checks (to determine if a send tag is allocated
for a different ifp than the one the packet is being output on), in
ip_output() and ip6_output(). This avoids sending packets with send
tags to ifnet drivers that don't support send tags.
Since we are now checking for ifp mismatches before invoking
if_output, we can now try to allocate a new tag before invoking
if_output sending the original packet on the new tag if allocation
succeeds.
To avoid code duplication for the fragment and unfragmented cases,
add ip_output_send() and ip6_output_send() as wrappers around
if_output and nd6_output_ifp, respectively. All of the logic for
setting send tags and dealing with send tag-related errors is done
in these wrapper functions.
For pseudo interfaces that wrap other network interfaces (vlan and
lagg), wrapper send tags are now allocated so that ip*_output see
the wrapper ifp as the ifp in the send tag. The if_transmit
routines rewrite the send tags after performing an ifp mismatch
check. If an ifp mismatch is detected, the transmit routines fail
with EAGAIN.
- To provide clearer life cycle management of send tags, especially
in the presence of vlan and lagg wrapper tags, add a reference count
to send tags managed via m_snd_tag_ref() and m_snd_tag_rele().
Provide a helper function (m_snd_tag_init()) for use by drivers
supporting send tags. m_snd_tag_init() takes care of the if_ref
on the ifp meaning that code alloating send tags via if_snd_tag_alloc
no longer has to manage that manually. Similarly, m_snd_tag_rele
drops the refcount on the ifp after invoking if_snd_tag_free when
the last reference to a send tag is dropped.
This also closes use after free races if there are pending packets in
driver tx rings after the socket is closed (e.g. from tcpdrop).
In order for m_free to work reliably, add a new CSUM_SND_TAG flag in
csum_flags to indicate 'snd_tag' is set (rather than 'rcvif').
Drivers now also check this flag instead of checking snd_tag against
NULL. This avoids false positive matches when a forwarded packet
has a non-NULL rcvif that was treated as a send tag.
- cxgbe was relying on snd_tag_free being called when the inp was
detached so that it could kick the firmware to flush any pending
work on the flow. This is because the driver doesn't require ACK
messages from the firmware for every request, but instead does a
kind of manual interrupt coalescing by only setting a flag to
request a completion on a subset of requests. If all of the
in-flight requests don't have the flag when the tag is detached from
the inp, the flow might never return the credits. The current
snd_tag_free command issues a flush command to force the credits to
return. However, the credit return is what also frees the mbufs,
and since those mbufs now hold references on the tag, this meant
that snd_tag_free would never be called.
To fix, explicitly drop the mbuf's reference on the snd tag when the
mbuf is queued in the firmware work queue. This means that once the
inp's reference on the tag goes away and all in-flight mbufs have
been queued to the firmware, tag's refcount will drop to zero and
snd_tag_free will kick in and send the flush request. Note that we
need to avoid doing this in the middle of ethofld_tx(), so the
driver grabs a temporary reference on the tag around that loop to
defer the free to the end of the function in case it sends the last
mbuf to the queue after the inp has dropped its reference on the
tag.
- mlx5 preallocates send tags and was using the ifp pointer even when
the send tag wasn't in use. Explicitly use the ifp from other data
structures instead.
- Sprinkle some assertions in various places to assert that received
packets don't have a send tag, and that other places that overwrite
rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer.
Reviewed by: gallatin, hselasky, rgrimes, ae
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20117
2019-05-24 22:30:40 +00:00
|
|
|
struct m_snd_tag *mst;
|
|
|
|
int error;
|
|
|
|
|
|
|
|
MPASS((m->m_pkthdr.csum_flags & CSUM_SND_TAG) == 0);
|
|
|
|
mst = NULL;
|
|
|
|
|
Add kernel-side support for in-kernel TLS.
KTLS adds support for in-kernel framing and encryption of Transport
Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports
offload of TLS for transmitted data. Key negotation must still be
performed in userland. Once completed, transmit session keys for a
connection are provided to the kernel via a new TCP_TXTLS_ENABLE
socket option. All subsequent data transmitted on the socket is
placed into TLS frames and encrypted using the supplied keys.
Any data written to a KTLS-enabled socket via write(2), aio_write(2),
or sendfile(2) is assumed to be application data and is encoded in TLS
frames with an application data type. Individual records can be sent
with a custom type (e.g. handshake messages) via sendmsg(2) with a new
control message (TLS_SET_RECORD_TYPE) specifying the record type.
At present, rekeying is not supported though the in-kernel framework
should support rekeying.
KTLS makes use of the recently added unmapped mbufs to store TLS
frames in the socket buffer. Each TLS frame is described by a single
ext_pgs mbuf. The ext_pgs structure contains the header of the TLS
record (and trailer for encrypted records) as well as references to
the associated TLS session.
KTLS supports two primary methods of encrypting TLS frames: software
TLS and ifnet TLS.
Software TLS marks mbufs holding socket data as not ready via
M_NOTREADY similar to sendfile(2) when TLS framing information is
added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then
called to schedule TLS frames for encryption. In the case of
sendfile_iodone() calls ktls_enqueue() instead of pru_ready() leaving
the mbufs marked M_NOTREADY until encryption is completed. For other
writes (vn_sendfile when pages are available, write(2), etc.), the
PRUS_NOTREADY is set when invoking pru_send() along with invoking
ktls_enqueue().
A pool of worker threads (the "KTLS" kernel process) encrypts TLS
frames queued via ktls_enqueue(). Each TLS frame is temporarily
mapped using the direct map and passed to a software encryption
backend to perform the actual encryption.
(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if
someone wished to make this work on architectures without a direct
map.)
KTLS supports pluggable software encryption backends. Internally,
Netflix uses proprietary pure-software backends. This commit includes
a simple backend in a new ktls_ocf.ko module that uses the kernel's
OpenCrypto framework to provide AES-GCM encryption of TLS frames. As
a result, software TLS is now a bit of a misnomer as it can make use
of hardware crypto accelerators.
Once software encryption has finished, the TLS frame mbufs are marked
ready via pru_ready(). At this point, the encrypted data appears as
regular payload to the TCP stack stored in unmapped mbufs.
ifnet TLS permits a NIC to offload the TLS encryption and TCP
segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS)
is allocated on the interface a socket is routed over and associated
with a TLS session. TLS records for a TLS session using ifnet TLS are
not marked M_NOTREADY but are passed down the stack unencrypted. The
ip_output_send() and ip6_output_send() helper functions that apply
send tags to outbound IP packets verify that the send tag of the TLS
record matches the outbound interface. If so, the packet is tagged
with the TLS send tag and sent to the interface. The NIC device
driver must recognize packets with the TLS send tag and schedule them
for TLS encryption and TCP segmentation. If the the outbound
interface does not match the interface in the TLS send tag, the packet
is dropped. In addition, a task is scheduled to refresh the TLS send
tag for the TLS session. If a new TLS send tag cannot be allocated,
the connection is dropped. If a new TLS send tag is allocated,
however, subsequent packets will be tagged with the correct TLS send
tag. (This latter case has been tested by configuring both ports of a
Chelsio T6 in a lagg and failing over from one port to another. As
the connections migrated to the new port, new TLS send tags were
allocated for the new port and connections resumed without being
dropped.)
ifnet TLS can be enabled and disabled on supported network interfaces
via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported
across both vlan devices and lagg interfaces using failover, lacp with
flowid enabled, or lacp with flowid enabled.
Applications may request the current KTLS mode of a connection via a
new TCP_TXTLS_MODE socket option. They can also use this socket
option to toggle between software and ifnet TLS modes.
In addition, a testing tool is available in tools/tools/switch_tls.
This is modeled on tcpdrop and uses similar syntax. However, instead
of dropping connections, -s is used to force KTLS connections to
switch to software TLS and -i is used to switch to ifnet TLS.
Various sysctls and counters are available under the kern.ipc.tls
sysctl node. The kern.ipc.tls.enable node must be set to true to
enable KTLS (it is off by default). The use of unmapped mbufs must
also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.
KTLS is enabled via the KERN_TLS kernel option.
This patch is the culmination of years of work by several folks
including Scott Long and Randall Stewart for the original design and
implementation; Drew Gallatin for several optimizations including the
use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records
awaiting software encryption, and pluggable software crypto backends;
and John Baldwin for modifications to support hardware TLS offload.
Reviewed by: gallatin, hselasky, rrs
Obtained from: Netflix
Sponsored by: Netflix, Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D21277
2019-08-27 00:01:56 +00:00
|
|
|
#ifdef KERN_TLS
|
|
|
|
/*
|
|
|
|
* If this is an unencrypted TLS record, save a reference to
|
|
|
|
* the record. This local reference is used to call
|
|
|
|
* ktls_output_eagain after the mbuf has been freed (thus
|
|
|
|
* dropping the mbuf's reference) in if_output.
|
|
|
|
*/
|
|
|
|
if (m->m_next != NULL && mbuf_has_tls_session(m->m_next)) {
|
|
|
|
tls = ktls_hold(m->m_next->m_ext.ext_pgs->tls);
|
|
|
|
mst = tls->snd_tag;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If a TLS session doesn't have a valid tag, it must
|
|
|
|
* have had an earlier ifp mismatch, so drop this
|
|
|
|
* packet.
|
|
|
|
*/
|
|
|
|
if (mst == NULL) {
|
|
|
|
error = EAGAIN;
|
|
|
|
goto done;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
#endif
|
Restructure mbuf send tags to provide stronger guarantees.
- Perform ifp mismatch checks (to determine if a send tag is allocated
for a different ifp than the one the packet is being output on), in
ip_output() and ip6_output(). This avoids sending packets with send
tags to ifnet drivers that don't support send tags.
Since we are now checking for ifp mismatches before invoking
if_output, we can now try to allocate a new tag before invoking
if_output sending the original packet on the new tag if allocation
succeeds.
To avoid code duplication for the fragment and unfragmented cases,
add ip_output_send() and ip6_output_send() as wrappers around
if_output and nd6_output_ifp, respectively. All of the logic for
setting send tags and dealing with send tag-related errors is done
in these wrapper functions.
For pseudo interfaces that wrap other network interfaces (vlan and
lagg), wrapper send tags are now allocated so that ip*_output see
the wrapper ifp as the ifp in the send tag. The if_transmit
routines rewrite the send tags after performing an ifp mismatch
check. If an ifp mismatch is detected, the transmit routines fail
with EAGAIN.
- To provide clearer life cycle management of send tags, especially
in the presence of vlan and lagg wrapper tags, add a reference count
to send tags managed via m_snd_tag_ref() and m_snd_tag_rele().
Provide a helper function (m_snd_tag_init()) for use by drivers
supporting send tags. m_snd_tag_init() takes care of the if_ref
on the ifp meaning that code alloating send tags via if_snd_tag_alloc
no longer has to manage that manually. Similarly, m_snd_tag_rele
drops the refcount on the ifp after invoking if_snd_tag_free when
the last reference to a send tag is dropped.
This also closes use after free races if there are pending packets in
driver tx rings after the socket is closed (e.g. from tcpdrop).
In order for m_free to work reliably, add a new CSUM_SND_TAG flag in
csum_flags to indicate 'snd_tag' is set (rather than 'rcvif').
Drivers now also check this flag instead of checking snd_tag against
NULL. This avoids false positive matches when a forwarded packet
has a non-NULL rcvif that was treated as a send tag.
- cxgbe was relying on snd_tag_free being called when the inp was
detached so that it could kick the firmware to flush any pending
work on the flow. This is because the driver doesn't require ACK
messages from the firmware for every request, but instead does a
kind of manual interrupt coalescing by only setting a flag to
request a completion on a subset of requests. If all of the
in-flight requests don't have the flag when the tag is detached from
the inp, the flow might never return the credits. The current
snd_tag_free command issues a flush command to force the credits to
return. However, the credit return is what also frees the mbufs,
and since those mbufs now hold references on the tag, this meant
that snd_tag_free would never be called.
To fix, explicitly drop the mbuf's reference on the snd tag when the
mbuf is queued in the firmware work queue. This means that once the
inp's reference on the tag goes away and all in-flight mbufs have
been queued to the firmware, tag's refcount will drop to zero and
snd_tag_free will kick in and send the flush request. Note that we
need to avoid doing this in the middle of ethofld_tx(), so the
driver grabs a temporary reference on the tag around that loop to
defer the free to the end of the function in case it sends the last
mbuf to the queue after the inp has dropped its reference on the
tag.
- mlx5 preallocates send tags and was using the ifp pointer even when
the send tag wasn't in use. Explicitly use the ifp from other data
structures instead.
- Sprinkle some assertions in various places to assert that received
packets don't have a send tag, and that other places that overwrite
rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer.
Reviewed by: gallatin, hselasky, rgrimes, ae
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20117
2019-05-24 22:30:40 +00:00
|
|
|
#ifdef RATELIMIT
|
Add kernel-side support for in-kernel TLS.
KTLS adds support for in-kernel framing and encryption of Transport
Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports
offload of TLS for transmitted data. Key negotation must still be
performed in userland. Once completed, transmit session keys for a
connection are provided to the kernel via a new TCP_TXTLS_ENABLE
socket option. All subsequent data transmitted on the socket is
placed into TLS frames and encrypted using the supplied keys.
Any data written to a KTLS-enabled socket via write(2), aio_write(2),
or sendfile(2) is assumed to be application data and is encoded in TLS
frames with an application data type. Individual records can be sent
with a custom type (e.g. handshake messages) via sendmsg(2) with a new
control message (TLS_SET_RECORD_TYPE) specifying the record type.
At present, rekeying is not supported though the in-kernel framework
should support rekeying.
KTLS makes use of the recently added unmapped mbufs to store TLS
frames in the socket buffer. Each TLS frame is described by a single
ext_pgs mbuf. The ext_pgs structure contains the header of the TLS
record (and trailer for encrypted records) as well as references to
the associated TLS session.
KTLS supports two primary methods of encrypting TLS frames: software
TLS and ifnet TLS.
Software TLS marks mbufs holding socket data as not ready via
M_NOTREADY similar to sendfile(2) when TLS framing information is
added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then
called to schedule TLS frames for encryption. In the case of
sendfile_iodone() calls ktls_enqueue() instead of pru_ready() leaving
the mbufs marked M_NOTREADY until encryption is completed. For other
writes (vn_sendfile when pages are available, write(2), etc.), the
PRUS_NOTREADY is set when invoking pru_send() along with invoking
ktls_enqueue().
A pool of worker threads (the "KTLS" kernel process) encrypts TLS
frames queued via ktls_enqueue(). Each TLS frame is temporarily
mapped using the direct map and passed to a software encryption
backend to perform the actual encryption.
(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if
someone wished to make this work on architectures without a direct
map.)
KTLS supports pluggable software encryption backends. Internally,
Netflix uses proprietary pure-software backends. This commit includes
a simple backend in a new ktls_ocf.ko module that uses the kernel's
OpenCrypto framework to provide AES-GCM encryption of TLS frames. As
a result, software TLS is now a bit of a misnomer as it can make use
of hardware crypto accelerators.
Once software encryption has finished, the TLS frame mbufs are marked
ready via pru_ready(). At this point, the encrypted data appears as
regular payload to the TCP stack stored in unmapped mbufs.
ifnet TLS permits a NIC to offload the TLS encryption and TCP
segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS)
is allocated on the interface a socket is routed over and associated
with a TLS session. TLS records for a TLS session using ifnet TLS are
not marked M_NOTREADY but are passed down the stack unencrypted. The
ip_output_send() and ip6_output_send() helper functions that apply
send tags to outbound IP packets verify that the send tag of the TLS
record matches the outbound interface. If so, the packet is tagged
with the TLS send tag and sent to the interface. The NIC device
driver must recognize packets with the TLS send tag and schedule them
for TLS encryption and TCP segmentation. If the the outbound
interface does not match the interface in the TLS send tag, the packet
is dropped. In addition, a task is scheduled to refresh the TLS send
tag for the TLS session. If a new TLS send tag cannot be allocated,
the connection is dropped. If a new TLS send tag is allocated,
however, subsequent packets will be tagged with the correct TLS send
tag. (This latter case has been tested by configuring both ports of a
Chelsio T6 in a lagg and failing over from one port to another. As
the connections migrated to the new port, new TLS send tags were
allocated for the new port and connections resumed without being
dropped.)
ifnet TLS can be enabled and disabled on supported network interfaces
via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported
across both vlan devices and lagg interfaces using failover, lacp with
flowid enabled, or lacp with flowid enabled.
Applications may request the current KTLS mode of a connection via a
new TCP_TXTLS_MODE socket option. They can also use this socket
option to toggle between software and ifnet TLS modes.
In addition, a testing tool is available in tools/tools/switch_tls.
This is modeled on tcpdrop and uses similar syntax. However, instead
of dropping connections, -s is used to force KTLS connections to
switch to software TLS and -i is used to switch to ifnet TLS.
Various sysctls and counters are available under the kern.ipc.tls
sysctl node. The kern.ipc.tls.enable node must be set to true to
enable KTLS (it is off by default). The use of unmapped mbufs must
also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.
KTLS is enabled via the KERN_TLS kernel option.
This patch is the culmination of years of work by several folks
including Scott Long and Randall Stewart for the original design and
implementation; Drew Gallatin for several optimizations including the
use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records
awaiting software encryption, and pluggable software crypto backends;
and John Baldwin for modifications to support hardware TLS offload.
Reviewed by: gallatin, hselasky, rrs
Obtained from: Netflix
Sponsored by: Netflix, Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D21277
2019-08-27 00:01:56 +00:00
|
|
|
if (inp != NULL && mst == NULL) {
|
Restructure mbuf send tags to provide stronger guarantees.
- Perform ifp mismatch checks (to determine if a send tag is allocated
for a different ifp than the one the packet is being output on), in
ip_output() and ip6_output(). This avoids sending packets with send
tags to ifnet drivers that don't support send tags.
Since we are now checking for ifp mismatches before invoking
if_output, we can now try to allocate a new tag before invoking
if_output sending the original packet on the new tag if allocation
succeeds.
To avoid code duplication for the fragment and unfragmented cases,
add ip_output_send() and ip6_output_send() as wrappers around
if_output and nd6_output_ifp, respectively. All of the logic for
setting send tags and dealing with send tag-related errors is done
in these wrapper functions.
For pseudo interfaces that wrap other network interfaces (vlan and
lagg), wrapper send tags are now allocated so that ip*_output see
the wrapper ifp as the ifp in the send tag. The if_transmit
routines rewrite the send tags after performing an ifp mismatch
check. If an ifp mismatch is detected, the transmit routines fail
with EAGAIN.
- To provide clearer life cycle management of send tags, especially
in the presence of vlan and lagg wrapper tags, add a reference count
to send tags managed via m_snd_tag_ref() and m_snd_tag_rele().
Provide a helper function (m_snd_tag_init()) for use by drivers
supporting send tags. m_snd_tag_init() takes care of the if_ref
on the ifp meaning that code alloating send tags via if_snd_tag_alloc
no longer has to manage that manually. Similarly, m_snd_tag_rele
drops the refcount on the ifp after invoking if_snd_tag_free when
the last reference to a send tag is dropped.
This also closes use after free races if there are pending packets in
driver tx rings after the socket is closed (e.g. from tcpdrop).
In order for m_free to work reliably, add a new CSUM_SND_TAG flag in
csum_flags to indicate 'snd_tag' is set (rather than 'rcvif').
Drivers now also check this flag instead of checking snd_tag against
NULL. This avoids false positive matches when a forwarded packet
has a non-NULL rcvif that was treated as a send tag.
- cxgbe was relying on snd_tag_free being called when the inp was
detached so that it could kick the firmware to flush any pending
work on the flow. This is because the driver doesn't require ACK
messages from the firmware for every request, but instead does a
kind of manual interrupt coalescing by only setting a flag to
request a completion on a subset of requests. If all of the
in-flight requests don't have the flag when the tag is detached from
the inp, the flow might never return the credits. The current
snd_tag_free command issues a flush command to force the credits to
return. However, the credit return is what also frees the mbufs,
and since those mbufs now hold references on the tag, this meant
that snd_tag_free would never be called.
To fix, explicitly drop the mbuf's reference on the snd tag when the
mbuf is queued in the firmware work queue. This means that once the
inp's reference on the tag goes away and all in-flight mbufs have
been queued to the firmware, tag's refcount will drop to zero and
snd_tag_free will kick in and send the flush request. Note that we
need to avoid doing this in the middle of ethofld_tx(), so the
driver grabs a temporary reference on the tag around that loop to
defer the free to the end of the function in case it sends the last
mbuf to the queue after the inp has dropped its reference on the
tag.
- mlx5 preallocates send tags and was using the ifp pointer even when
the send tag wasn't in use. Explicitly use the ifp from other data
structures instead.
- Sprinkle some assertions in various places to assert that received
packets don't have a send tag, and that other places that overwrite
rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer.
Reviewed by: gallatin, hselasky, rgrimes, ae
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20117
2019-05-24 22:30:40 +00:00
|
|
|
if ((inp->inp_flags2 & INP_RATE_LIMIT_CHANGED) != 0 ||
|
|
|
|
(inp->inp_snd_tag != NULL &&
|
|
|
|
inp->inp_snd_tag->ifp != ifp))
|
|
|
|
in_pcboutput_txrtlmt(inp, ifp, m);
|
|
|
|
|
|
|
|
if (inp->inp_snd_tag != NULL)
|
|
|
|
mst = inp->inp_snd_tag;
|
|
|
|
}
|
|
|
|
#endif
|
2019-09-24 18:18:11 +00:00
|
|
|
if (stamp_tag && mst != NULL) {
|
Restructure mbuf send tags to provide stronger guarantees.
- Perform ifp mismatch checks (to determine if a send tag is allocated
for a different ifp than the one the packet is being output on), in
ip_output() and ip6_output(). This avoids sending packets with send
tags to ifnet drivers that don't support send tags.
Since we are now checking for ifp mismatches before invoking
if_output, we can now try to allocate a new tag before invoking
if_output sending the original packet on the new tag if allocation
succeeds.
To avoid code duplication for the fragment and unfragmented cases,
add ip_output_send() and ip6_output_send() as wrappers around
if_output and nd6_output_ifp, respectively. All of the logic for
setting send tags and dealing with send tag-related errors is done
in these wrapper functions.
For pseudo interfaces that wrap other network interfaces (vlan and
lagg), wrapper send tags are now allocated so that ip*_output see
the wrapper ifp as the ifp in the send tag. The if_transmit
routines rewrite the send tags after performing an ifp mismatch
check. If an ifp mismatch is detected, the transmit routines fail
with EAGAIN.
- To provide clearer life cycle management of send tags, especially
in the presence of vlan and lagg wrapper tags, add a reference count
to send tags managed via m_snd_tag_ref() and m_snd_tag_rele().
Provide a helper function (m_snd_tag_init()) for use by drivers
supporting send tags. m_snd_tag_init() takes care of the if_ref
on the ifp meaning that code alloating send tags via if_snd_tag_alloc
no longer has to manage that manually. Similarly, m_snd_tag_rele
drops the refcount on the ifp after invoking if_snd_tag_free when
the last reference to a send tag is dropped.
This also closes use after free races if there are pending packets in
driver tx rings after the socket is closed (e.g. from tcpdrop).
In order for m_free to work reliably, add a new CSUM_SND_TAG flag in
csum_flags to indicate 'snd_tag' is set (rather than 'rcvif').
Drivers now also check this flag instead of checking snd_tag against
NULL. This avoids false positive matches when a forwarded packet
has a non-NULL rcvif that was treated as a send tag.
- cxgbe was relying on snd_tag_free being called when the inp was
detached so that it could kick the firmware to flush any pending
work on the flow. This is because the driver doesn't require ACK
messages from the firmware for every request, but instead does a
kind of manual interrupt coalescing by only setting a flag to
request a completion on a subset of requests. If all of the
in-flight requests don't have the flag when the tag is detached from
the inp, the flow might never return the credits. The current
snd_tag_free command issues a flush command to force the credits to
return. However, the credit return is what also frees the mbufs,
and since those mbufs now hold references on the tag, this meant
that snd_tag_free would never be called.
To fix, explicitly drop the mbuf's reference on the snd tag when the
mbuf is queued in the firmware work queue. This means that once the
inp's reference on the tag goes away and all in-flight mbufs have
been queued to the firmware, tag's refcount will drop to zero and
snd_tag_free will kick in and send the flush request. Note that we
need to avoid doing this in the middle of ethofld_tx(), so the
driver grabs a temporary reference on the tag around that loop to
defer the free to the end of the function in case it sends the last
mbuf to the queue after the inp has dropped its reference on the
tag.
- mlx5 preallocates send tags and was using the ifp pointer even when
the send tag wasn't in use. Explicitly use the ifp from other data
structures instead.
- Sprinkle some assertions in various places to assert that received
packets don't have a send tag, and that other places that overwrite
rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer.
Reviewed by: gallatin, hselasky, rgrimes, ae
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20117
2019-05-24 22:30:40 +00:00
|
|
|
KASSERT(m->m_pkthdr.rcvif == NULL,
|
|
|
|
("trying to add a send tag to a forwarded packet"));
|
|
|
|
if (mst->ifp != ifp) {
|
|
|
|
error = EAGAIN;
|
|
|
|
goto done;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* stamp send tag on mbuf */
|
|
|
|
m->m_pkthdr.snd_tag = m_snd_tag_ref(mst);
|
|
|
|
m->m_pkthdr.csum_flags |= CSUM_SND_TAG;
|
|
|
|
}
|
|
|
|
|
|
|
|
error = (*ifp->if_output)(ifp, m, (const struct sockaddr *)gw, ro);
|
|
|
|
|
|
|
|
done:
|
|
|
|
/* Check for route change invalidating send tags. */
|
Add kernel-side support for in-kernel TLS.
KTLS adds support for in-kernel framing and encryption of Transport
Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports
offload of TLS for transmitted data. Key negotation must still be
performed in userland. Once completed, transmit session keys for a
connection are provided to the kernel via a new TCP_TXTLS_ENABLE
socket option. All subsequent data transmitted on the socket is
placed into TLS frames and encrypted using the supplied keys.
Any data written to a KTLS-enabled socket via write(2), aio_write(2),
or sendfile(2) is assumed to be application data and is encoded in TLS
frames with an application data type. Individual records can be sent
with a custom type (e.g. handshake messages) via sendmsg(2) with a new
control message (TLS_SET_RECORD_TYPE) specifying the record type.
At present, rekeying is not supported though the in-kernel framework
should support rekeying.
KTLS makes use of the recently added unmapped mbufs to store TLS
frames in the socket buffer. Each TLS frame is described by a single
ext_pgs mbuf. The ext_pgs structure contains the header of the TLS
record (and trailer for encrypted records) as well as references to
the associated TLS session.
KTLS supports two primary methods of encrypting TLS frames: software
TLS and ifnet TLS.
Software TLS marks mbufs holding socket data as not ready via
M_NOTREADY similar to sendfile(2) when TLS framing information is
added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then
called to schedule TLS frames for encryption. In the case of
sendfile_iodone() calls ktls_enqueue() instead of pru_ready() leaving
the mbufs marked M_NOTREADY until encryption is completed. For other
writes (vn_sendfile when pages are available, write(2), etc.), the
PRUS_NOTREADY is set when invoking pru_send() along with invoking
ktls_enqueue().
A pool of worker threads (the "KTLS" kernel process) encrypts TLS
frames queued via ktls_enqueue(). Each TLS frame is temporarily
mapped using the direct map and passed to a software encryption
backend to perform the actual encryption.
(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if
someone wished to make this work on architectures without a direct
map.)
KTLS supports pluggable software encryption backends. Internally,
Netflix uses proprietary pure-software backends. This commit includes
a simple backend in a new ktls_ocf.ko module that uses the kernel's
OpenCrypto framework to provide AES-GCM encryption of TLS frames. As
a result, software TLS is now a bit of a misnomer as it can make use
of hardware crypto accelerators.
Once software encryption has finished, the TLS frame mbufs are marked
ready via pru_ready(). At this point, the encrypted data appears as
regular payload to the TCP stack stored in unmapped mbufs.
ifnet TLS permits a NIC to offload the TLS encryption and TCP
segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS)
is allocated on the interface a socket is routed over and associated
with a TLS session. TLS records for a TLS session using ifnet TLS are
not marked M_NOTREADY but are passed down the stack unencrypted. The
ip_output_send() and ip6_output_send() helper functions that apply
send tags to outbound IP packets verify that the send tag of the TLS
record matches the outbound interface. If so, the packet is tagged
with the TLS send tag and sent to the interface. The NIC device
driver must recognize packets with the TLS send tag and schedule them
for TLS encryption and TCP segmentation. If the the outbound
interface does not match the interface in the TLS send tag, the packet
is dropped. In addition, a task is scheduled to refresh the TLS send
tag for the TLS session. If a new TLS send tag cannot be allocated,
the connection is dropped. If a new TLS send tag is allocated,
however, subsequent packets will be tagged with the correct TLS send
tag. (This latter case has been tested by configuring both ports of a
Chelsio T6 in a lagg and failing over from one port to another. As
the connections migrated to the new port, new TLS send tags were
allocated for the new port and connections resumed without being
dropped.)
ifnet TLS can be enabled and disabled on supported network interfaces
via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported
across both vlan devices and lagg interfaces using failover, lacp with
flowid enabled, or lacp with flowid enabled.
Applications may request the current KTLS mode of a connection via a
new TCP_TXTLS_MODE socket option. They can also use this socket
option to toggle between software and ifnet TLS modes.
In addition, a testing tool is available in tools/tools/switch_tls.
This is modeled on tcpdrop and uses similar syntax. However, instead
of dropping connections, -s is used to force KTLS connections to
switch to software TLS and -i is used to switch to ifnet TLS.
Various sysctls and counters are available under the kern.ipc.tls
sysctl node. The kern.ipc.tls.enable node must be set to true to
enable KTLS (it is off by default). The use of unmapped mbufs must
also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.
KTLS is enabled via the KERN_TLS kernel option.
This patch is the culmination of years of work by several folks
including Scott Long and Randall Stewart for the original design and
implementation; Drew Gallatin for several optimizations including the
use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records
awaiting software encryption, and pluggable software crypto backends;
and John Baldwin for modifications to support hardware TLS offload.
Reviewed by: gallatin, hselasky, rrs
Obtained from: Netflix
Sponsored by: Netflix, Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D21277
2019-08-27 00:01:56 +00:00
|
|
|
#ifdef KERN_TLS
|
|
|
|
if (tls != NULL) {
|
|
|
|
if (error == EAGAIN)
|
|
|
|
error = ktls_output_eagain(inp, tls);
|
|
|
|
ktls_free(tls);
|
|
|
|
}
|
|
|
|
#endif
|
Restructure mbuf send tags to provide stronger guarantees.
- Perform ifp mismatch checks (to determine if a send tag is allocated
for a different ifp than the one the packet is being output on), in
ip_output() and ip6_output(). This avoids sending packets with send
tags to ifnet drivers that don't support send tags.
Since we are now checking for ifp mismatches before invoking
if_output, we can now try to allocate a new tag before invoking
if_output sending the original packet on the new tag if allocation
succeeds.
To avoid code duplication for the fragment and unfragmented cases,
add ip_output_send() and ip6_output_send() as wrappers around
if_output and nd6_output_ifp, respectively. All of the logic for
setting send tags and dealing with send tag-related errors is done
in these wrapper functions.
For pseudo interfaces that wrap other network interfaces (vlan and
lagg), wrapper send tags are now allocated so that ip*_output see
the wrapper ifp as the ifp in the send tag. The if_transmit
routines rewrite the send tags after performing an ifp mismatch
check. If an ifp mismatch is detected, the transmit routines fail
with EAGAIN.
- To provide clearer life cycle management of send tags, especially
in the presence of vlan and lagg wrapper tags, add a reference count
to send tags managed via m_snd_tag_ref() and m_snd_tag_rele().
Provide a helper function (m_snd_tag_init()) for use by drivers
supporting send tags. m_snd_tag_init() takes care of the if_ref
on the ifp meaning that code alloating send tags via if_snd_tag_alloc
no longer has to manage that manually. Similarly, m_snd_tag_rele
drops the refcount on the ifp after invoking if_snd_tag_free when
the last reference to a send tag is dropped.
This also closes use after free races if there are pending packets in
driver tx rings after the socket is closed (e.g. from tcpdrop).
In order for m_free to work reliably, add a new CSUM_SND_TAG flag in
csum_flags to indicate 'snd_tag' is set (rather than 'rcvif').
Drivers now also check this flag instead of checking snd_tag against
NULL. This avoids false positive matches when a forwarded packet
has a non-NULL rcvif that was treated as a send tag.
- cxgbe was relying on snd_tag_free being called when the inp was
detached so that it could kick the firmware to flush any pending
work on the flow. This is because the driver doesn't require ACK
messages from the firmware for every request, but instead does a
kind of manual interrupt coalescing by only setting a flag to
request a completion on a subset of requests. If all of the
in-flight requests don't have the flag when the tag is detached from
the inp, the flow might never return the credits. The current
snd_tag_free command issues a flush command to force the credits to
return. However, the credit return is what also frees the mbufs,
and since those mbufs now hold references on the tag, this meant
that snd_tag_free would never be called.
To fix, explicitly drop the mbuf's reference on the snd tag when the
mbuf is queued in the firmware work queue. This means that once the
inp's reference on the tag goes away and all in-flight mbufs have
been queued to the firmware, tag's refcount will drop to zero and
snd_tag_free will kick in and send the flush request. Note that we
need to avoid doing this in the middle of ethofld_tx(), so the
driver grabs a temporary reference on the tag around that loop to
defer the free to the end of the function in case it sends the last
mbuf to the queue after the inp has dropped its reference on the
tag.
- mlx5 preallocates send tags and was using the ifp pointer even when
the send tag wasn't in use. Explicitly use the ifp from other data
structures instead.
- Sprinkle some assertions in various places to assert that received
packets don't have a send tag, and that other places that overwrite
rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer.
Reviewed by: gallatin, hselasky, rgrimes, ae
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20117
2019-05-24 22:30:40 +00:00
|
|
|
#ifdef RATELIMIT
|
|
|
|
if (error == EAGAIN)
|
|
|
|
in_pcboutput_eagain(inp);
|
|
|
|
#endif
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
|
|
|
* IP output. The packet in mbuf chain m contains a skeletal IP
|
|
|
|
* header (with len, off, ttl, proto, tos, src, dst).
|
|
|
|
* The mbuf chain containing the packet will be freed.
|
|
|
|
* The mbuf opt, if present, will not be freed.
|
2012-07-04 07:37:53 +00:00
|
|
|
* If route ro is present and has ro_rt initialized, route lookup would be
|
|
|
|
* skipped and ro->ro_rt would be used. If ro is present but ro->ro_rt is NULL,
|
|
|
|
* then result of route lookup is stored in ro->ro_rt.
|
|
|
|
*
|
2003-11-03 18:03:05 +00:00
|
|
|
* In the IP forwarding case, the packet will arrive with options already
|
|
|
|
* inserted, so must have a NULL opt pointer.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
|
|
|
int
|
2007-05-10 15:58:48 +00:00
|
|
|
ip_output(struct mbuf *m, struct mbuf *opt, struct route *ro, int flags,
|
|
|
|
struct ip_moptions *imo, struct inpcb *inp)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
2015-07-29 08:12:05 +00:00
|
|
|
struct rm_priotracker in_ifa_tracker;
|
2019-01-09 01:11:19 +00:00
|
|
|
struct epoch_tracker et;
|
2003-08-07 18:16:59 +00:00
|
|
|
struct ip *ip;
|
2009-12-28 21:14:18 +00:00
|
|
|
struct ifnet *ifp = NULL; /* keep compiler happy */
|
2004-02-25 19:55:29 +00:00
|
|
|
struct mbuf *m0;
|
1996-04-03 13:52:20 +00:00
|
|
|
int hlen = sizeof (struct ip);
|
2006-09-10 17:49:09 +00:00
|
|
|
int mtu;
|
2009-12-28 14:09:46 +00:00
|
|
|
int error = 0;
|
2019-05-08 23:39:24 +00:00
|
|
|
struct sockaddr_in *dst, sin;
|
2013-04-25 12:42:09 +00:00
|
|
|
const struct sockaddr_in *gw;
|
2012-07-18 08:58:30 +00:00
|
|
|
struct in_ifaddr *ia;
|
2019-05-08 23:39:24 +00:00
|
|
|
struct in_addr src;
|
2012-10-06 10:02:11 +00:00
|
|
|
int isbroadcast;
|
2012-10-26 21:06:33 +00:00
|
|
|
uint16_t ip_len, ip_off;
|
2014-10-02 00:25:57 +00:00
|
|
|
uint32_t fibnum;
|
2017-02-06 08:49:57 +00:00
|
|
|
#if defined(IPSEC) || defined(IPSEC_SUPPORT)
|
2009-04-28 11:10:33 +00:00
|
|
|
int no_route_but_check_spd = 0;
|
2004-08-17 22:05:54 +00:00
|
|
|
#endif
|
2019-06-11 22:07:39 +00:00
|
|
|
|
2003-04-08 14:25:47 +00:00
|
|
|
M_ASSERTPKTHDR(m);
|
2005-09-26 20:25:16 +00:00
|
|
|
|
2008-11-19 19:19:30 +00:00
|
|
|
if (inp != NULL) {
|
2008-04-19 14:35:17 +00:00
|
|
|
INP_LOCK_ASSERT(inp);
|
2009-05-21 09:45:47 +00:00
|
|
|
M_SETFIB(m, inp->inp_inc.inc_fibnum);
|
2014-12-01 11:45:24 +00:00
|
|
|
if ((flags & IP_NODEFAULTFLOWID) == 0) {
|
2009-04-10 06:16:14 +00:00
|
|
|
m->m_pkthdr.flowid = inp->inp_flowid;
|
2014-05-18 22:37:31 +00:00
|
|
|
M_HASHTYPE_SET(m, inp->inp_flowtype);
|
2009-04-10 06:16:14 +00:00
|
|
|
}
|
2019-04-25 15:37:28 +00:00
|
|
|
#ifdef NUMA
|
|
|
|
m->m_pkthdr.numa_domain = inp->inp_numa_domain;
|
|
|
|
#endif
|
2008-11-19 19:19:30 +00:00
|
|
|
}
|
2009-05-21 09:45:47 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
if (opt) {
|
2009-12-28 14:09:46 +00:00
|
|
|
int len = 0;
|
1994-05-24 10:09:53 +00:00
|
|
|
m = ip_insertoptions(m, opt, &len);
|
2002-09-23 08:56:24 +00:00
|
|
|
if (len != 0)
|
2009-12-28 14:09:46 +00:00
|
|
|
hlen = len; /* ip->ip_hl is updated above */
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
ip = mtod(m, struct ip *);
|
2012-10-22 21:09:03 +00:00
|
|
|
ip_len = ntohs(ip->ip_len);
|
|
|
|
ip_off = ntohs(ip->ip_off);
|
2001-12-28 21:21:57 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
if ((flags & (IP_FORWARDING|IP_RAWOUTPUT)) == 0) {
|
2002-10-20 22:52:07 +00:00
|
|
|
ip->ip_v = IPVERSION;
|
|
|
|
ip->ip_hl = hlen >> 2;
|
2015-04-01 22:26:39 +00:00
|
|
|
ip_fillid(ip);
|
1994-05-24 10:09:53 +00:00
|
|
|
} else {
|
2009-12-28 14:09:46 +00:00
|
|
|
/* Header already set, fetch hlen from there */
|
2002-10-20 22:52:07 +00:00
|
|
|
hlen = ip->ip_hl << 2;
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
2018-10-07 11:26:15 +00:00
|
|
|
if ((flags & IP_FORWARDING) == 0)
|
|
|
|
IPSTAT_INC(ips_localout);
|
1996-04-18 15:49:06 +00:00
|
|
|
|
2014-01-16 11:50:00 +00:00
|
|
|
/*
|
|
|
|
* dst/gw handling:
|
|
|
|
*
|
2014-01-16 12:58:03 +00:00
|
|
|
* gw is readonly but can point either to dst OR rt_gateway,
|
|
|
|
* therefore we need restore gw if we're redoing lookup.
|
2014-01-16 11:50:00 +00:00
|
|
|
*/
|
2014-10-02 00:25:57 +00:00
|
|
|
fibnum = (inp != NULL) ? inp->inp_inc.inc_fibnum : M_GETFIB(m);
|
2019-05-08 23:39:24 +00:00
|
|
|
if (ro != NULL)
|
|
|
|
dst = (struct sockaddr_in *)&ro->ro_dst;
|
|
|
|
else
|
|
|
|
dst = &sin;
|
|
|
|
if (ro == NULL || ro->ro_rt == NULL) {
|
2002-01-21 20:04:22 +00:00
|
|
|
bzero(dst, sizeof(*dst));
|
1994-05-24 10:09:53 +00:00
|
|
|
dst->sin_family = AF_INET;
|
|
|
|
dst->sin_len = sizeof(*dst);
|
2004-08-17 22:05:54 +00:00
|
|
|
dst->sin_addr = ip->ip_dst;
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
2019-05-08 23:39:24 +00:00
|
|
|
gw = dst;
|
2019-01-09 01:11:19 +00:00
|
|
|
NET_EPOCH_ENTER(et);
|
2015-07-29 18:04:01 +00:00
|
|
|
again:
|
2016-03-24 07:54:56 +00:00
|
|
|
/*
|
|
|
|
* Validate route against routing table additions;
|
|
|
|
* a better/more specific route might have been added.
|
|
|
|
*/
|
2019-05-08 23:39:24 +00:00
|
|
|
if (inp != NULL && ro != NULL && ro->ro_rt != NULL)
|
2016-03-24 07:54:56 +00:00
|
|
|
RT_VALIDATE(ro, &inp->inp_rt_cookie, fibnum);
|
|
|
|
/*
|
|
|
|
* If there is a cached route,
|
|
|
|
* check that it is to the same destination
|
|
|
|
* and is still up. If not, free it and try again.
|
|
|
|
* The address family should also be checked in case of sharing the
|
|
|
|
* cache with IPv6.
|
|
|
|
* Also check whether routing cache needs invalidation.
|
|
|
|
*/
|
2019-05-08 23:39:24 +00:00
|
|
|
if (ro != NULL && ro->ro_rt != NULL &&
|
|
|
|
((ro->ro_rt->rt_flags & RTF_UP) == 0 ||
|
|
|
|
ro->ro_rt->rt_ifp == NULL || !RT_LINK_IS_UP(ro->ro_rt->rt_ifp) ||
|
|
|
|
dst->sin_family != AF_INET ||
|
|
|
|
dst->sin_addr.s_addr != ip->ip_dst.s_addr))
|
2018-01-23 03:15:39 +00:00
|
|
|
RO_INVALIDATE_CACHE(ro);
|
2015-07-29 18:04:01 +00:00
|
|
|
ia = NULL;
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
2007-03-01 13:29:30 +00:00
|
|
|
* If routing to interface only, short circuit routing lookup.
|
|
|
|
* The use of an all-ones broadcast address implies this; an
|
|
|
|
* interface is specified by the broadcast address of an interface,
|
|
|
|
* or the destination address of a ptp interface.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
2007-03-01 13:29:30 +00:00
|
|
|
if (flags & IP_SENDONES) {
|
2014-09-11 20:21:03 +00:00
|
|
|
if ((ia = ifatoia(ifa_ifwithbroadaddr(sintosa(dst),
|
2014-09-16 15:28:19 +00:00
|
|
|
M_GETFIB(m)))) == NULL &&
|
2014-09-11 20:21:03 +00:00
|
|
|
(ia = ifatoia(ifa_ifwithdstaddr(sintosa(dst),
|
2014-09-16 15:28:19 +00:00
|
|
|
M_GETFIB(m)))) == NULL) {
|
2009-04-11 23:35:20 +00:00
|
|
|
IPSTAT_INC(ips_noroute);
|
1994-05-24 10:09:53 +00:00
|
|
|
error = ENETUNREACH;
|
|
|
|
goto bad;
|
|
|
|
}
|
2007-03-01 13:29:30 +00:00
|
|
|
ip->ip_dst.s_addr = INADDR_BROADCAST;
|
|
|
|
dst->sin_addr = ip->ip_dst;
|
1994-05-24 10:09:53 +00:00
|
|
|
ifp = ia->ia_ifp;
|
2019-05-08 23:39:24 +00:00
|
|
|
mtu = ifp->if_mtu;
|
1994-05-24 10:09:53 +00:00
|
|
|
ip->ip_ttl = 1;
|
2007-03-01 13:29:30 +00:00
|
|
|
isbroadcast = 1;
|
2019-05-08 23:39:24 +00:00
|
|
|
src = IA_SIN(ia)->sin_addr;
|
2007-03-01 13:29:30 +00:00
|
|
|
} else if (flags & IP_ROUTETOIF) {
|
2014-09-11 20:21:03 +00:00
|
|
|
if ((ia = ifatoia(ifa_ifwithdstaddr(sintosa(dst),
|
2014-09-16 15:28:19 +00:00
|
|
|
M_GETFIB(m)))) == NULL &&
|
2014-09-11 20:21:03 +00:00
|
|
|
(ia = ifatoia(ifa_ifwithnet(sintosa(dst), 0,
|
2014-09-16 15:28:19 +00:00
|
|
|
M_GETFIB(m)))) == NULL) {
|
2009-04-11 23:35:20 +00:00
|
|
|
IPSTAT_INC(ips_noroute);
|
2006-09-06 17:12:10 +00:00
|
|
|
error = ENETUNREACH;
|
|
|
|
goto bad;
|
|
|
|
}
|
|
|
|
ifp = ia->ia_ifp;
|
2019-05-08 23:39:24 +00:00
|
|
|
mtu = ifp->if_mtu;
|
2006-09-06 17:12:10 +00:00
|
|
|
ip->ip_ttl = 1;
|
2016-10-24 22:11:33 +00:00
|
|
|
isbroadcast = ifp->if_flags & IFF_BROADCAST ?
|
|
|
|
in_ifaddr_broadcast(dst->sin_addr, ia) : 0;
|
2019-05-08 23:39:24 +00:00
|
|
|
src = IA_SIN(ia)->sin_addr;
|
2001-07-17 18:47:48 +00:00
|
|
|
} else if (IN_MULTICAST(ntohl(ip->ip_dst.s_addr)) &&
|
2001-07-23 16:50:01 +00:00
|
|
|
imo != NULL && imo->imo_multicast_ifp != NULL) {
|
2001-07-17 18:47:48 +00:00
|
|
|
/*
|
2001-07-23 16:50:01 +00:00
|
|
|
* Bypass the normal routing lookup for multicast
|
|
|
|
* packets if the interface is specified.
|
2001-07-17 18:47:48 +00:00
|
|
|
*/
|
2001-07-23 16:50:01 +00:00
|
|
|
ifp = imo->imo_multicast_ifp;
|
2019-05-08 23:39:24 +00:00
|
|
|
mtu = ifp->if_mtu;
|
2015-07-29 08:12:05 +00:00
|
|
|
IFP_TO_IA(ifp, ia, &in_ifa_tracker);
|
2001-07-23 16:50:01 +00:00
|
|
|
isbroadcast = 0; /* fool gcc */
|
2019-05-10 21:51:17 +00:00
|
|
|
/* Interface may have no addresses. */
|
|
|
|
if (ia != NULL)
|
|
|
|
src = IA_SIN(ia)->sin_addr;
|
|
|
|
else
|
|
|
|
src.s_addr = INADDR_ANY;
|
2019-05-08 23:39:24 +00:00
|
|
|
} else if (ro != NULL) {
|
|
|
|
if (ro->ro_rt == NULL) {
|
|
|
|
/*
|
|
|
|
* We want to do any cloning requested by the link
|
|
|
|
* layer, as this is probably required in all cases
|
|
|
|
* for correct operation (as it is for ARP).
|
|
|
|
*/
|
This patch provides the back end support for equal-cost multi-path
(ECMP) for both IPv4 and IPv6. Previously, multipath route insertion
is disallowed. For example,
route add -net 192.103.54.0/24 10.9.44.1
route add -net 192.103.54.0/24 10.9.44.2
The second route insertion will trigger an error message of
"add net 192.103.54.0/24: gateway 10.2.5.2: route already in table"
Multiple default routes can also be inserted. Here is the netstat
output:
default 10.2.5.1 UGS 0 3074 bge0 =>
default 10.2.5.2 UGS 0 0 bge0
When multipath routes exist, the "route delete" command requires
a specific gateway to be specified or else an error message would
be displayed. For example,
route delete default
would fail and trigger the following error message:
"route: writing to routing socket: No such process"
"delete net default: not in table"
On the other hand,
route delete default 10.2.5.2
would be successful: "delete net default: gateway 10.2.5.2"
One does not have to specify a gateway if there is only a single
route for a particular destination.
I need to perform more testings on address aliases and multiple
interfaces that have the same IP prefixes. This patch as it
stands today is not yet ready for prime time. Therefore, the ECMP
code fragments are fully guarded by the RADIX_MPATH macro.
Include the "options RADIX_MPATH" in the kernel configuration
to enable this feature.
Reviewed by: robert, sam, gnn, julian, kmacy
2008-04-13 05:45:14 +00:00
|
|
|
#ifdef RADIX_MPATH
|
Add code to allow the system to handle multiple routing tables.
This particular implementation is designed to be fully backwards compatible
and to be MFC-able to 7.x (and 6.x)
Currently the only protocol that can make use of the multiple tables is IPv4
Similar functionality exists in OpenBSD and Linux.
From my notes:
-----
One thing where FreeBSD has been falling behind, and which by chance I
have some time to work on is "policy based routing", which allows
different
packet streams to be routed by more than just the destination address.
Constraints:
------------
I want to make some form of this available in the 6.x tree
(and by extension 7.x) , but FreeBSD in general needs it so I might as
well do it in -current and back port the portions I need.
One of the ways that this can be done is to have the ability to
instantiate multiple kernel routing tables (which I will now
refer to as "Forwarding Information Bases" or "FIBs" for political
correctness reasons). Which FIB a particular packet uses to make
the next hop decision can be decided by a number of mechanisms.
The policies these mechanisms implement are the "Policies" referred
to in "Policy based routing".
One of the constraints I have if I try to back port this work to
6.x is that it must be implemented as a EXTENSION to the existing
ABIs in 6.x so that third party applications do not need to be
recompiled in timespan of the branch.
This first version will not have some of the bells and whistles that
will come with later versions. It will, for example, be limited to 16
tables in the first commit.
Implementation method, Compatible version. (part 1)
-------------------------------
For this reason I have implemented a "sufficient subset" of a
multiple routing table solution in Perforce, and back-ported it
to 6.x. (also in Perforce though not always caught up with what I
have done in -current/P4). The subset allows a number of FIBs
to be defined at compile time (8 is sufficient for my purposes in 6.x)
and implements the changes needed to allow IPV4 to use them. I have not
done the changes for ipv6 simply because I do not need it, and I do not
have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it.
Other protocol families are left untouched and should there be
users with proprietary protocol families, they should continue to work
and be oblivious to the existence of the extra FIBs.
To understand how this is done, one must know that the current FIB
code starts everything off with a single dimensional array of
pointers to FIB head structures (One per protocol family), each of
which in turn points to the trie of routes available to that family.
The basic change in the ABI compatible version of the change is to
extent that array to be a 2 dimensional array, so that
instead of protocol family X looking at rt_tables[X] for the
table it needs, it looks at rt_tables[Y][X] when for all
protocol families except ipv4 Y is always 0.
Code that is unaware of the change always just sees the first row
of the table, which of course looks just like the one dimensional
array that existed before.
The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
are all maintained, but refer only to the first row of the array,
so that existing callers in proprietary protocols can continue to
do the "right thing".
Some new entry points are added, for the exclusive use of ipv4 code
called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(),
which have an extra argument which refers the code to the correct row.
In addition, there are some new entry points (currently called
rtalloc_fib() and friends) that check the Address family being
looked up and call either rtalloc() (and friends) if the protocol
is not IPv4 forcing the action to row 0 or to the appropriate row
if it IS IPv4 (and that info is available). These are for calling
from code that is not specific to any particular protocol. The way
these are implemented would change in the non ABI preserving code
to be added later.
One feature of the first version of the code is that for ipv4,
the interface routes show up automatically on all the FIBs, so
that no matter what FIB you select you always have the basic
direct attached hosts available to you. (rtinit() does this
automatically).
You CAN delete an interface route from one FIB should you want
to but by default it's there. ARP information is also available
in each FIB. It's assumed that the same machine would have the
same MAC address, regardless of which FIB you are using to get
to it.
This brings us as to how the correct FIB is selected for an outgoing
IPV4 packet.
Firstly, all packets have a FIB associated with them. if nothing
has been done to change it, it will be FIB 0. The FIB is changed
in the following ways.
Packets fall into one of a number of classes.
1/ locally generated packets, coming from a socket/PCB.
Such packets select a FIB from a number associated with the
socket/PCB. This in turn is inherited from the process,
but can be changed by a socket option. The process in turn
inherits it on fork. I have written a utility call setfib
that acts a bit like nice..
setfib -3 ping target.example.com # will use fib 3 for ping.
It is an obvious extension to make it a property of a jail
but I have not done so. It can be achieved by combining the setfib and
jail commands.
2/ packets received on an interface for forwarding.
By default these packets would use table 0,
(or possibly a number settable in a sysctl(not yet)).
but prior to routing the firewall can inspect them (see below).
(possibly in the future you may be able to associate a FIB
with packets received on an interface.. An ifconfig arg, but not yet.)
3/ packets inspected by a packet classifier, which can arbitrarily
associate a fib with it on a packet by packet basis.
A fib assigned to a packet by a packet classifier
(such as ipfw) would over-ride a fib associated by
a more default source. (such as cases 1 or 2).
4/ a tcp listen socket associated with a fib will generate
accept sockets that are associated with that same fib.
5/ Packets generated in response to some other packet (e.g. reset
or icmp packets). These should use the FIB associated with the
packet being reponded to.
6/ Packets generated during encapsulation.
gif, tun and other tunnel interfaces will encapsulate using the FIB
that was in effect withthe proces that set up the tunnel.
thus setfib 1 ifconfig gif0 [tunnel instructions]
will set the fib for the tunnel to use to be fib 1.
Routing messages would be associated with their
process, and thus select one FIB or another.
messages from the kernel would be associated with the fib they
refer to and would only be received by a routing socket associated
with that fib. (not yet implemented)
In addition Netstat has been edited to be able to cope with the
fact that the array is now 2 dimensional. (It looks in system
memory using libkvm (!)). Old versions of netstat see only the first FIB.
In addition two sysctls are added to give:
a) the number of FIBs compiled in (active)
b) the default FIB of the calling process.
Early testing experience:
-------------------------
Basically our (IronPort's) appliance does this functionality already
using ipfw fwd but that method has some drawbacks.
For example,
It can't fully simulate a routing table because it can't influence the
socket's choice of local address when a connect() is done.
Testing during the generating of these changes has been
remarkably smooth so far. Multiple tables have co-existed
with no notable side effects, and packets have been routes
accordingly.
ipfw has grown 2 new keywords:
setfib N ip from anay to any
count ip from any to any fib N
In pf there seems to be a requirement to be able to give symbolic names to the
fibs but I do not have that capacity. I am not sure if it is required.
SCTP has interestingly enough built in support for this, called VRFs
in Cisco parlance. it will be interesting to see how that handles it
when it suddenly actually does something.
Where to next:
--------------------
After committing the ABI compatible version and MFCing it, I'd
like to proceed in a forward direction in -current. this will
result in some roto-tilling in the routing code.
Firstly: the current code's idea of having a separate tree per
protocol family, all of the same format, and pointed to by the
1 dimensional array is a bit silly. Especially when one considers that
there is code that makes assumptions about every protocol having the
same internal structures there. Some protocols don't WANT that
sort of structure. (for example the whole idea of a netmask is foreign
to appletalk). This needs to be made opaque to the external code.
My suggested first change is to add routing method pointers to the
'domain' structure, along with information pointing the data.
instead of having an array of pointers to uniform structures,
there would be an array pointing to the 'domain' structures
for each protocol address domain (protocol family),
and the methods this reached would be called. The methods would have
an argument that gives FIB number, but the protocol would be free
to ignore it.
When the ABI can be changed it raises the possibilty of the
addition of a fib entry into the "struct route". Currently,
the structure contains the sockaddr of the desination, and the resulting
fib entry. To make this work fully, one could add a fib number
so that given an address and a fib, one can find the third element, the
fib entry.
Interaction with the ARP layer/ LL layer would need to be
revisited as well. Qing Li has been working on this already.
This work was sponsored by Ironport Systems/Cisco
Reviewed by: several including rwatson, bz and mlair (parts each)
Obtained from: Ironport systems/Cisco
2008-05-09 23:03:00 +00:00
|
|
|
rtalloc_mpath_fib(ro,
|
|
|
|
ntohl(ip->ip_src.s_addr ^ ip->ip_dst.s_addr),
|
2014-10-02 00:25:57 +00:00
|
|
|
fibnum);
|
This patch provides the back end support for equal-cost multi-path
(ECMP) for both IPv4 and IPv6. Previously, multipath route insertion
is disallowed. For example,
route add -net 192.103.54.0/24 10.9.44.1
route add -net 192.103.54.0/24 10.9.44.2
The second route insertion will trigger an error message of
"add net 192.103.54.0/24: gateway 10.2.5.2: route already in table"
Multiple default routes can also be inserted. Here is the netstat
output:
default 10.2.5.1 UGS 0 3074 bge0 =>
default 10.2.5.2 UGS 0 0 bge0
When multipath routes exist, the "route delete" command requires
a specific gateway to be specified or else an error message would
be displayed. For example,
route delete default
would fail and trigger the following error message:
"route: writing to routing socket: No such process"
"delete net default: not in table"
On the other hand,
route delete default 10.2.5.2
would be successful: "delete net default: gateway 10.2.5.2"
One does not have to specify a gateway if there is only a single
route for a particular destination.
I need to perform more testings on address aliases and multiple
interfaces that have the same IP prefixes. This patch as it
stands today is not yet ready for prime time. Therefore, the ECMP
code fragments are fully guarded by the RADIX_MPATH macro.
Include the "options RADIX_MPATH" in the kernel configuration
to enable this feature.
Reviewed by: robert, sam, gnn, julian, kmacy
2008-04-13 05:45:14 +00:00
|
|
|
#else
|
2014-10-02 00:25:57 +00:00
|
|
|
in_rtalloc_ign(ro, 0, fibnum);
|
This patch provides the back end support for equal-cost multi-path
(ECMP) for both IPv4 and IPv6. Previously, multipath route insertion
is disallowed. For example,
route add -net 192.103.54.0/24 10.9.44.1
route add -net 192.103.54.0/24 10.9.44.2
The second route insertion will trigger an error message of
"add net 192.103.54.0/24: gateway 10.2.5.2: route already in table"
Multiple default routes can also be inserted. Here is the netstat
output:
default 10.2.5.1 UGS 0 3074 bge0 =>
default 10.2.5.2 UGS 0 0 bge0
When multipath routes exist, the "route delete" command requires
a specific gateway to be specified or else an error message would
be displayed. For example,
route delete default
would fail and trigger the following error message:
"route: writing to routing socket: No such process"
"delete net default: not in table"
On the other hand,
route delete default 10.2.5.2
would be successful: "delete net default: gateway 10.2.5.2"
One does not have to specify a gateway if there is only a single
route for a particular destination.
I need to perform more testings on address aliases and multiple
interfaces that have the same IP prefixes. This patch as it
stands today is not yet ready for prime time. Therefore, the ECMP
code fragments are fully guarded by the RADIX_MPATH macro.
Include the "options RADIX_MPATH" in the kernel configuration
to enable this feature.
Reviewed by: robert, sam, gnn, julian, kmacy
2008-04-13 05:45:14 +00:00
|
|
|
#endif
|
2019-05-08 23:39:24 +00:00
|
|
|
if (ro->ro_rt == NULL ||
|
|
|
|
(ro->ro_rt->rt_flags & RTF_UP) == 0 ||
|
|
|
|
ro->ro_rt->rt_ifp == NULL ||
|
|
|
|
!RT_LINK_IS_UP(ro->ro_rt->rt_ifp)) {
|
|
|
|
#if defined(IPSEC) || defined(IPSEC_SUPPORT)
|
|
|
|
/*
|
|
|
|
* There is no route for this packet, but it is
|
|
|
|
* possible that a matching SPD entry exists.
|
|
|
|
*/
|
|
|
|
no_route_but_check_spd = 1;
|
|
|
|
mtu = 0; /* Silence GCC warning. */
|
|
|
|
goto sendit;
|
|
|
|
#endif
|
|
|
|
IPSTAT_INC(ips_noroute);
|
|
|
|
error = EHOSTUNREACH;
|
|
|
|
goto bad;
|
|
|
|
}
|
2009-12-28 14:48:32 +00:00
|
|
|
}
|
2019-05-08 23:39:24 +00:00
|
|
|
ia = ifatoia(ro->ro_rt->rt_ifa);
|
|
|
|
ifp = ro->ro_rt->rt_ifp;
|
|
|
|
counter_u64_add(ro->ro_rt->rt_pksent, 1);
|
|
|
|
rt_update_ro_flags(ro);
|
|
|
|
if (ro->ro_rt->rt_flags & RTF_GATEWAY)
|
|
|
|
gw = (struct sockaddr_in *)ro->ro_rt->rt_gateway;
|
|
|
|
if (ro->ro_rt->rt_flags & RTF_HOST)
|
|
|
|
isbroadcast = (ro->ro_rt->rt_flags & RTF_BROADCAST);
|
|
|
|
else if (ifp->if_flags & IFF_BROADCAST)
|
|
|
|
isbroadcast = in_ifaddr_broadcast(gw->sin_addr, ia);
|
|
|
|
else
|
|
|
|
isbroadcast = 0;
|
|
|
|
if (ro->ro_rt->rt_flags & RTF_HOST)
|
|
|
|
mtu = ro->ro_rt->rt_mtu;
|
|
|
|
else
|
|
|
|
mtu = ifp->if_mtu;
|
|
|
|
src = IA_SIN(ia)->sin_addr;
|
|
|
|
} else {
|
|
|
|
struct nhop4_extended nh;
|
|
|
|
|
|
|
|
bzero(&nh, sizeof(nh));
|
|
|
|
if (fib4_lookup_nh_ext(M_GETFIB(m), ip->ip_dst, 0, 0, &nh) !=
|
|
|
|
0) {
|
2017-02-06 08:49:57 +00:00
|
|
|
#if defined(IPSEC) || defined(IPSEC_SUPPORT)
|
2009-04-28 11:10:33 +00:00
|
|
|
/*
|
|
|
|
* There is no route for this packet, but it is
|
|
|
|
* possible that a matching SPD entry exists.
|
|
|
|
*/
|
|
|
|
no_route_but_check_spd = 1;
|
|
|
|
mtu = 0; /* Silence GCC warning. */
|
|
|
|
goto sendit;
|
|
|
|
#endif
|
2009-04-11 23:35:20 +00:00
|
|
|
IPSTAT_INC(ips_noroute);
|
1994-05-24 10:09:53 +00:00
|
|
|
error = EHOSTUNREACH;
|
|
|
|
goto bad;
|
|
|
|
}
|
2019-05-08 23:39:24 +00:00
|
|
|
ifp = nh.nh_ifp;
|
|
|
|
mtu = nh.nh_mtu;
|
|
|
|
/*
|
|
|
|
* We are rewriting here dst to be gw actually, contradicting
|
|
|
|
* comment at the beginning of the function. However, in this
|
|
|
|
* case we are always dealing with on stack dst.
|
|
|
|
* In case if pfil(9) sends us back to beginning of the
|
|
|
|
* function, the dst would be rewritten by ip_output_pfil().
|
|
|
|
*/
|
|
|
|
MPASS(dst == &sin);
|
|
|
|
dst->sin_addr = nh.nh_addr;
|
|
|
|
ia = nh.nh_ia;
|
|
|
|
src = nh.nh_src;
|
|
|
|
isbroadcast = (((nh.nh_flags & (NHF_HOST | NHF_BROADCAST)) ==
|
|
|
|
(NHF_HOST | NHF_BROADCAST)) ||
|
|
|
|
((ifp->if_flags & IFF_BROADCAST) &&
|
|
|
|
in_ifaddr_broadcast(dst->sin_addr, ia)));
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
2015-07-29 18:04:01 +00:00
|
|
|
|
2010-12-31 21:47:11 +00:00
|
|
|
/* Catch a possible divide by zero later. */
|
2019-05-08 23:39:24 +00:00
|
|
|
KASSERT(mtu > 0, ("%s: mtu %d <= 0, ro=%p (rt_flags=0x%08x) ifp=%p",
|
|
|
|
__func__, mtu, ro,
|
|
|
|
(ro != NULL && ro->ro_rt != NULL) ? ro->ro_rt->rt_flags : 0, ifp));
|
2015-07-29 18:04:01 +00:00
|
|
|
|
2004-08-17 22:05:54 +00:00
|
|
|
if (IN_MULTICAST(ntohl(ip->ip_dst.s_addr))) {
|
1994-05-24 10:09:53 +00:00
|
|
|
m->m_flags |= M_MCAST;
|
2014-01-02 10:18:39 +00:00
|
|
|
/*
|
|
|
|
* IP destination address is multicast. Make sure "gw"
|
|
|
|
* still points to the address in "ro". (It may have been
|
|
|
|
* changed to point to a gateway address, above.)
|
|
|
|
*/
|
|
|
|
gw = dst;
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
|
|
|
* See if the caller provided any multicast options
|
|
|
|
*/
|
|
|
|
if (imo != NULL) {
|
|
|
|
ip->ip_ttl = imo->imo_multicast_ttl;
|
1995-06-13 17:51:16 +00:00
|
|
|
if (imo->imo_multicast_vif != -1)
|
|
|
|
ip->ip_src.s_addr =
|
Massive cleanup of the ip_mroute code.
No functional changes, but:
+ the mrouting module now should behave the same as the compiled-in
version (it did not before, some of the rsvp code was not loaded
properly);
+ netinet/ip_mroute.c is now truly optional;
+ removed some redundant/unused code;
+ changed many instances of '0' to NULL and INADDR_ANY as appropriate;
+ removed several static variables to make the code more SMP-friendly;
+ fixed some minor bugs in the mrouting code (mostly, incorrect return
values from functions).
This commit is also a prerequisite to the addition of support for PIM,
which i would like to put in before DP2 (it does not change any of
the existing APIs, anyways).
Note, in the process we found out that some device drivers fail to
properly handle changes in IFF_ALLMULTI, leading to interesting
behaviour when a multicast router is started. This bug is not
corrected by this commit, and will be fixed with a separate commit.
Detailed changes:
--------------------
netinet/ip_mroute.c all the above.
conf/files make ip_mroute.c optional
net/route.c fix mrt_ioctl hook
netinet/ip_input.c fix ip_mforward hook, move rsvp_input() here
together with other rsvp code, and a couple
of indentation fixes.
netinet/ip_output.c fix ip_mforward and ip_mcast_src hooks
netinet/ip_var.h rsvp function hooks
netinet/raw_ip.c hooks for mrouting and rsvp functions, plus
interface cleanup.
netinet/ip_mroute.h remove an unused and optional field from a struct
Most of the code is from Pavlin Radoslavov and the XORP project
Reviewed by: sam
MFC after: 1 week
2002-11-15 22:53:53 +00:00
|
|
|
ip_mcast_src ?
|
|
|
|
ip_mcast_src(imo->imo_multicast_vif) :
|
|
|
|
INADDR_ANY;
|
1994-05-24 10:09:53 +00:00
|
|
|
} else
|
|
|
|
ip->ip_ttl = IP_DEFAULT_MULTICAST_TTL;
|
|
|
|
/*
|
|
|
|
* Confirm that the outgoing interface supports multicast.
|
|
|
|
*/
|
1995-06-13 17:51:16 +00:00
|
|
|
if ((imo == NULL) || (imo->imo_multicast_vif == -1)) {
|
|
|
|
if ((ifp->if_flags & IFF_MULTICAST) == 0) {
|
2009-04-11 23:35:20 +00:00
|
|
|
IPSTAT_INC(ips_noroute);
|
1995-06-13 17:51:16 +00:00
|
|
|
error = ENETUNREACH;
|
|
|
|
goto bad;
|
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
/*
|
|
|
|
* If source address not specified yet, use address
|
|
|
|
* of outgoing interface.
|
|
|
|
*/
|
2019-05-08 23:39:24 +00:00
|
|
|
if (ip->ip_src.s_addr == INADDR_ANY)
|
|
|
|
ip->ip_src = src;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2009-03-04 03:45:34 +00:00
|
|
|
if ((imo == NULL && in_mcast_loop) ||
|
|
|
|
(imo && imo->imo_multicast_loop)) {
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
2009-03-04 03:45:34 +00:00
|
|
|
* Loop back multicast datagram if not expressly
|
|
|
|
* forbidden to do so, even if we are not a member
|
|
|
|
* of the group; ip_input() will filter it later,
|
|
|
|
* thus deferring a hash lookup and mutex acquisition
|
|
|
|
* at the expense of a cheap copy using m_copym().
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
2015-08-08 15:58:35 +00:00
|
|
|
ip_mloopback(ifp, m, hlen);
|
2009-03-04 03:45:34 +00:00
|
|
|
} else {
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
|
|
|
* If we are acting as a multicast router, perform
|
|
|
|
* multicast forwarding as if the packet had just
|
|
|
|
* arrived on the interface to which we are about
|
|
|
|
* to send. The multicast forwarding function
|
|
|
|
* recursively calls this function, using the
|
|
|
|
* IP_FORWARDING flag to prevent infinite recursion.
|
|
|
|
*
|
|
|
|
* Multicasts that are looped back by ip_mloopback(),
|
|
|
|
* above, will be forwarded by the ip_input() routine,
|
|
|
|
* if necessary.
|
|
|
|
*/
|
Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).
This is the first in a series of commits over the course
of the next few weeks.
Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.
We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.
Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch
2008-08-17 23:27:27 +00:00
|
|
|
if (V_ip_mrouter && (flags & IP_FORWARDING) == 0) {
|
1994-09-06 22:42:31 +00:00
|
|
|
/*
|
Massive cleanup of the ip_mroute code.
No functional changes, but:
+ the mrouting module now should behave the same as the compiled-in
version (it did not before, some of the rsvp code was not loaded
properly);
+ netinet/ip_mroute.c is now truly optional;
+ removed some redundant/unused code;
+ changed many instances of '0' to NULL and INADDR_ANY as appropriate;
+ removed several static variables to make the code more SMP-friendly;
+ fixed some minor bugs in the mrouting code (mostly, incorrect return
values from functions).
This commit is also a prerequisite to the addition of support for PIM,
which i would like to put in before DP2 (it does not change any of
the existing APIs, anyways).
Note, in the process we found out that some device drivers fail to
properly handle changes in IFF_ALLMULTI, leading to interesting
behaviour when a multicast router is started. This bug is not
corrected by this commit, and will be fixed with a separate commit.
Detailed changes:
--------------------
netinet/ip_mroute.c all the above.
conf/files make ip_mroute.c optional
net/route.c fix mrt_ioctl hook
netinet/ip_input.c fix ip_mforward hook, move rsvp_input() here
together with other rsvp code, and a couple
of indentation fixes.
netinet/ip_output.c fix ip_mforward and ip_mcast_src hooks
netinet/ip_var.h rsvp function hooks
netinet/raw_ip.c hooks for mrouting and rsvp functions, plus
interface cleanup.
netinet/ip_mroute.h remove an unused and optional field from a struct
Most of the code is from Pavlin Radoslavov and the XORP project
Reviewed by: sam
MFC after: 1 week
2002-11-15 22:53:53 +00:00
|
|
|
* If rsvp daemon is not running, do not
|
1994-09-06 22:42:31 +00:00
|
|
|
* set ip_moptions. This ensures that the packet
|
|
|
|
* is multicast and not just sent down one link
|
|
|
|
* as prescribed by rsvpd.
|
|
|
|
*/
|
Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).
This is the first in a series of commits over the course
of the next few weeks.
Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.
We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.
Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch
2008-08-17 23:27:27 +00:00
|
|
|
if (!V_rsvp_on)
|
Massive cleanup of the ip_mroute code.
No functional changes, but:
+ the mrouting module now should behave the same as the compiled-in
version (it did not before, some of the rsvp code was not loaded
properly);
+ netinet/ip_mroute.c is now truly optional;
+ removed some redundant/unused code;
+ changed many instances of '0' to NULL and INADDR_ANY as appropriate;
+ removed several static variables to make the code more SMP-friendly;
+ fixed some minor bugs in the mrouting code (mostly, incorrect return
values from functions).
This commit is also a prerequisite to the addition of support for PIM,
which i would like to put in before DP2 (it does not change any of
the existing APIs, anyways).
Note, in the process we found out that some device drivers fail to
properly handle changes in IFF_ALLMULTI, leading to interesting
behaviour when a multicast router is started. This bug is not
corrected by this commit, and will be fixed with a separate commit.
Detailed changes:
--------------------
netinet/ip_mroute.c all the above.
conf/files make ip_mroute.c optional
net/route.c fix mrt_ioctl hook
netinet/ip_input.c fix ip_mforward hook, move rsvp_input() here
together with other rsvp code, and a couple
of indentation fixes.
netinet/ip_output.c fix ip_mforward and ip_mcast_src hooks
netinet/ip_var.h rsvp function hooks
netinet/raw_ip.c hooks for mrouting and rsvp functions, plus
interface cleanup.
netinet/ip_mroute.h remove an unused and optional field from a struct
Most of the code is from Pavlin Radoslavov and the XORP project
Reviewed by: sam
MFC after: 1 week
2002-11-15 22:53:53 +00:00
|
|
|
imo = NULL;
|
|
|
|
if (ip_mforward &&
|
|
|
|
ip_mforward(ip, ifp, m, imo) != 0) {
|
1994-05-24 10:09:53 +00:00
|
|
|
m_freem(m);
|
|
|
|
goto done;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
1994-09-14 03:10:15 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
|
|
|
* Multicasts with a time-to-live of zero may be looped-
|
|
|
|
* back, above, but must not be transmitted on a network.
|
|
|
|
* Also, multicasts addressed to the loopback interface
|
|
|
|
* are not sent -- the above call to ip_mloopback() will
|
2009-03-04 03:45:34 +00:00
|
|
|
* loop back a copy. ip_input() will drop the copy if
|
|
|
|
* this host does not belong to the destination group on
|
|
|
|
* the loopback interface.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
1995-04-26 18:10:58 +00:00
|
|
|
if (ip->ip_ttl == 0 || ifp->if_flags & IFF_LOOPBACK) {
|
1994-05-24 10:09:53 +00:00
|
|
|
m_freem(m);
|
|
|
|
goto done;
|
|
|
|
}
|
|
|
|
|
|
|
|
goto sendit;
|
|
|
|
}
|
2006-09-29 16:44:45 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
Remove (almost all) global variables that were used to hold
packet forwarding state ("annotations") during ip processing.
The code is considerably cleaner now.
The variables removed by this change are:
ip_divert_cookie used by divert sockets
ip_fw_fwd_addr used for transparent ip redirection
last_pkt used by dynamic pipes in dummynet
Removal of the first two has been done by carrying the annotations
into volatile structs prepended to the mbuf chains, and adding
appropriate code to add/remove annotations in the routines which
make use of them, i.e. ip_input(), ip_output(), tcp_input(),
bdg_forward(), ether_demux(), ether_output_frame(), div_output().
On passing, remove a bug in divert handling of fragmented packet.
Now it is the fragment at offset 0 which sets the divert status of
the whole packet, whereas formerly it was the last incoming fragment
to decide.
Removal of last_pkt required a change in the interface of ip_fw_chk()
and dummynet_io(). On passing, use the same mechanism for dummynet
annotations and for divert/forward annotations.
option IPFIREWALL_FORWARD is effectively useless, the code to
implement it is very small and is now in by default to avoid the
obfuscation of conditionally compiled code.
NOTES:
* there is at least one global variable left, sro_fwd, in ip_output().
I am not sure if/how this can be removed.
* I have deliberately avoided gratuitous style changes in this commit
to avoid cluttering the diffs. Minor stule cleanup will likely be
necessary
* this commit only focused on the IP layer. I am sure there is a
number of global variables used in the TCP and maybe UDP stack.
* despite the number of files touched, there are absolutely no API's
or data structures changed by this commit (except the interfaces of
ip_fw_chk() and dummynet_io(), which are internal anyways), so
an MFC is quite safe and unintrusive (and desirable, given the
improved readability of the code).
MFC after: 10 days
2002-06-22 11:51:02 +00:00
|
|
|
* If the source address is not specified yet, use the address
|
2004-09-06 15:48:38 +00:00
|
|
|
* of the outoing interface.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
2019-05-08 23:39:24 +00:00
|
|
|
if (ip->ip_src.s_addr == INADDR_ANY)
|
|
|
|
ip->ip_src = src;
|
2006-09-29 16:44:45 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
|
|
|
* Look for broadcast address and
|
2002-01-21 13:59:42 +00:00
|
|
|
* verify user is allowed to send
|
1994-05-24 10:09:53 +00:00
|
|
|
* such a packet.
|
|
|
|
*/
|
1996-05-06 17:42:13 +00:00
|
|
|
if (isbroadcast) {
|
1994-05-24 10:09:53 +00:00
|
|
|
if ((ifp->if_flags & IFF_BROADCAST) == 0) {
|
|
|
|
error = EADDRNOTAVAIL;
|
|
|
|
goto bad;
|
|
|
|
}
|
|
|
|
if ((flags & IP_ALLOWBROADCAST) == 0) {
|
|
|
|
error = EACCES;
|
|
|
|
goto bad;
|
|
|
|
}
|
|
|
|
/* don't allow broadcast messages to be fragmented */
|
2012-10-22 21:09:03 +00:00
|
|
|
if (ip_len > mtu) {
|
1994-05-24 10:09:53 +00:00
|
|
|
error = EMSGSIZE;
|
|
|
|
goto bad;
|
|
|
|
}
|
|
|
|
m->m_flags |= M_BCAST;
|
1996-05-06 17:42:13 +00:00
|
|
|
} else {
|
1994-05-24 10:09:53 +00:00
|
|
|
m->m_flags &= ~M_BCAST;
|
1996-05-06 17:42:13 +00:00
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
|
1997-02-19 14:02:27 +00:00
|
|
|
sendit:
|
2017-02-06 08:49:57 +00:00
|
|
|
#if defined(IPSEC) || defined(IPSEC_SUPPORT)
|
|
|
|
if (IPSEC_ENABLED(ipv4)) {
|
|
|
|
if ((error = IPSEC_OUTPUT(ipv4, m, inp)) != 0) {
|
|
|
|
if (error == EINPROGRESS)
|
|
|
|
error = 0;
|
|
|
|
goto done;
|
|
|
|
}
|
2001-06-11 12:39:29 +00:00
|
|
|
}
|
2009-04-28 11:10:33 +00:00
|
|
|
/*
|
|
|
|
* Check if there was a route for this packet; return error if not.
|
|
|
|
*/
|
|
|
|
if (no_route_but_check_spd) {
|
|
|
|
IPSTAT_INC(ips_noroute);
|
|
|
|
error = EHOSTUNREACH;
|
|
|
|
goto bad;
|
|
|
|
}
|
2006-02-01 13:55:03 +00:00
|
|
|
/* Update variables that are affected by ipsec4_output(). */
|
2001-06-11 12:39:29 +00:00
|
|
|
ip = mtod(m, struct ip *);
|
|
|
|
hlen = ip->ip_hl << 2;
|
2007-07-03 12:13:45 +00:00
|
|
|
#endif /* IPSEC */
|
2001-06-11 12:39:29 +00:00
|
|
|
|
2004-08-27 15:16:24 +00:00
|
|
|
/* Jump over all PFIL processing if hooks are not active. */
|
New pfil(9) KPI together with newborn pfil API and control utility.
The KPI have been reviewed and cleansed of features that were planned
back 20 years ago and never implemented. The pfil(9) internals have
been made opaque to protocols with only returned types and function
declarations exposed. The KPI is made more strict, but at the same time
more extensible, as kernel uses same command structures that userland
ioctl uses.
In nutshell [KA]PI is about declaring filtering points, declaring
filters and linking and unlinking them together.
New [KA]PI makes it possible to reconfigure pfil(9) configuration:
change order of hooks, rehook filter from one filtering point to a
different one, disconnect a hook on output leaving it on input only,
prepend/append a filter to existing list of filters.
Now it possible for a single packet filter to provide multiple rulesets
that may be linked to different points. Think of per-interface ACLs in
Cisco or Juniper. None of existing packet filters yet support that,
however limited usage is already possible, e.g. default ruleset can
be moved to single interface, as soon as interface would pride their
filtering points.
Another future feature is possiblity to create pfil heads, that provide
not an mbuf pointer but just a memory pointer with length. That would
allow filtering at very early stages of a packet lifecycle, e.g. when
packet has just been received by a NIC and no mbuf was yet allocated.
Differential Revision: https://reviews.freebsd.org/D18951
2019-01-31 23:01:03 +00:00
|
|
|
if (PFIL_HOOKED_OUT(V_inet_pfil_head)) {
|
2019-06-21 07:58:08 +00:00
|
|
|
switch (ip_output_pfil(&m, ifp, flags, inp, dst, &fibnum,
|
|
|
|
&error)) {
|
2015-07-29 18:04:01 +00:00
|
|
|
case 1: /* Finished */
|
|
|
|
goto done;
|
1996-08-21 21:37:07 +00:00
|
|
|
|
2015-07-29 18:04:01 +00:00
|
|
|
case 0: /* Continue normally */
|
|
|
|
ip = mtod(m, struct ip *);
|
|
|
|
break;
|
1998-12-14 18:09:13 +00:00
|
|
|
|
2015-07-29 18:04:01 +00:00
|
|
|
case -1: /* Need to try again */
|
|
|
|
/* Reset everything for a new round */
|
2019-05-08 23:39:24 +00:00
|
|
|
if (ro != NULL) {
|
|
|
|
RO_RTFREE(ro);
|
|
|
|
ro->ro_prepend = NULL;
|
|
|
|
}
|
2015-07-29 18:04:01 +00:00
|
|
|
gw = dst;
|
|
|
|
ip = mtod(m, struct ip *);
|
|
|
|
goto again;
|
Remove (almost all) global variables that were used to hold
packet forwarding state ("annotations") during ip processing.
The code is considerably cleaner now.
The variables removed by this change are:
ip_divert_cookie used by divert sockets
ip_fw_fwd_addr used for transparent ip redirection
last_pkt used by dynamic pipes in dummynet
Removal of the first two has been done by carrying the annotations
into volatile structs prepended to the mbuf chains, and adding
appropriate code to add/remove annotations in the routines which
make use of them, i.e. ip_input(), ip_output(), tcp_input(),
bdg_forward(), ether_demux(), ether_output_frame(), div_output().
On passing, remove a bug in divert handling of fragmented packet.
Now it is the fragment at offset 0 which sets the divert status of
the whole packet, whereas formerly it was the last incoming fragment
to decide.
Removal of last_pkt required a change in the interface of ip_fw_chk()
and dummynet_io(). On passing, use the same mechanism for dummynet
annotations and for divert/forward annotations.
option IPFIREWALL_FORWARD is effectively useless, the code to
implement it is very small and is now in by default to avoid the
obfuscation of conditionally compiled code.
NOTES:
* there is at least one global variable left, sro_fwd, in ip_output().
I am not sure if/how this can be removed.
* I have deliberately avoided gratuitous style changes in this commit
to avoid cluttering the diffs. Minor stule cleanup will likely be
necessary
* this commit only focused on the IP layer. I am sure there is a
number of global variables used in the TCP and maybe UDP stack.
* despite the number of files touched, there are absolutely no API's
or data structures changed by this commit (except the interfaces of
ip_fw_chk() and dummynet_io(), which are internal anyways), so
an MFC is quite safe and unintrusive (and desirable, given the
improved readability of the code).
MFC after: 10 days
2002-06-22 11:51:02 +00:00
|
|
|
|
1997-06-02 05:02:37 +00:00
|
|
|
}
|
2005-02-22 17:40:40 +00:00
|
|
|
}
|
1998-12-14 18:09:13 +00:00
|
|
|
|
2019-04-04 19:01:13 +00:00
|
|
|
/* IN_LOOPBACK must not appear on the wire - RFC1122. */
|
|
|
|
if (IN_LOOPBACK(ntohl(ip->ip_dst.s_addr)) ||
|
|
|
|
IN_LOOPBACK(ntohl(ip->ip_src.s_addr))) {
|
2002-02-15 12:19:03 +00:00
|
|
|
if ((ifp->if_flags & IFF_LOOPBACK) == 0) {
|
2009-04-11 23:35:20 +00:00
|
|
|
IPSTAT_INC(ips_badaddr);
|
2002-02-15 12:19:03 +00:00
|
|
|
error = EADDRNOTAVAIL;
|
|
|
|
goto bad;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
RFC768 (UDP) requires that "if the computed checksum is zero, it
is transmitted as all ones". This got broken after introduction
of delayed checksums as follows. Some guys (including Jonathan)
think that it is allowed to transmit all ones in place of a zero
checksum for TCP the same way as for UDP. (The discussion still
takes place on -net.) Thus, the 0 -> 0xffff checksum fixup was
first moved from udp_output() (see udp_usrreq.c, 1.64 -> 1.65)
to in_cksum_skip() (see sys/i386/i386/in_cksum.c, 1.17 -> 1.18,
INVERT expression). Besides that I disagree that it is valid for
TCP, there was no real problem until in_cksum.c,v 1.20, where the
in_cksum() was made just a special version of in_cksum_skip().
The side effect was that now every incoming IP datagram failed to
pass the checksum test (in_cksum() returned 0xffff when it should
actually return zero). It was fixed next day in revision 1.21,
by removing the INVERT expression. The latter also broke the
0 -> 0xffff fixup for UDP checksums.
Before this change:
: tcpdump: listening on lo0
: 127.0.0.1.33005 > 127.0.0.1.33006: udp 0 (ttl 64, id 1)
: 4500 001c 0001 0000 4011 7cce 7f00 0001
: 7f00 0001 80ed 80ee 0008 0000
After this change:
: tcpdump: listening on lo0
: 127.0.0.1.33005 > 127.0.0.1.33006: udp 0 (ttl 64, id 1)
: 4500 001c 0001 0000 4011 7cce 7f00 0001
: 7f00 0001 80ed 80ee 0008 ffff
2001-03-13 17:07:06 +00:00
|
|
|
m->m_pkthdr.csum_flags |= CSUM_IP;
|
2012-10-26 21:06:33 +00:00
|
|
|
if (m->m_pkthdr.csum_flags & CSUM_DELAY_DATA & ~ifp->if_hwassist) {
|
Add an external mbuf buffer type that holds multiple unmapped pages.
Unmapped mbufs allow sendfile to carry multiple pages of data in a
single mbuf, without mapping those pages. It is a requirement for
Netflix's in-kernel TLS, and provides a 5-10% CPU savings on heavy web
serving workloads when used by sendfile, due to effectively
compressing socket buffers by an order of magnitude, and hence
reducing cache misses.
For this new external mbuf buffer type (EXT_PGS), the ext_buf pointer
now points to a struct mbuf_ext_pgs structure instead of a data
buffer. This structure contains an array of physical addresses (this
reduces cache misses compared to an earlier version that stored an
array of vm_page_t pointers). It also stores additional fields needed
for in-kernel TLS such as the TLS header and trailer data that are
currently unused. To more easily detect these mbufs, the M_NOMAP flag
is set in m_flags in addition to M_EXT.
Various functions like m_copydata() have been updated to safely access
packet contents (using uiomove_fromphys()), to make things like BPF
safe.
NIC drivers advertise support for unmapped mbufs on transmit via a new
IFCAP_NOMAP capability. This capability can be toggled via the new
'nomap' and '-nomap' ifconfig(8) commands. For NIC drivers that only
transmit packet contents via DMA and use bus_dma, adding the
capability to if_capabilities and if_capenable should be all that is
required.
If a NIC does not support unmapped mbufs, they are converted to a
chain of mapped mbufs (using sf_bufs to provide the mapping) in
ip_output or ip6_output. If an unmapped mbuf requires software
checksums, it is also converted to a chain of mapped mbufs before
computing the checksum.
Submitted by: gallatin (earlier version)
Reviewed by: gallatin, hselasky, rrs
Discussed with: ae, kp (firewalls)
Relnotes: yes
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20616
2019-06-29 00:48:33 +00:00
|
|
|
m = mb_unmapped_to_ext(m);
|
|
|
|
if (m == NULL) {
|
|
|
|
IPSTAT_INC(ips_odropped);
|
|
|
|
error = ENOBUFS;
|
|
|
|
goto bad;
|
|
|
|
}
|
2000-03-27 19:14:27 +00:00
|
|
|
in_delayed_cksum(m);
|
2012-10-26 21:06:33 +00:00
|
|
|
m->m_pkthdr.csum_flags &= ~CSUM_DELAY_DATA;
|
Add an external mbuf buffer type that holds multiple unmapped pages.
Unmapped mbufs allow sendfile to carry multiple pages of data in a
single mbuf, without mapping those pages. It is a requirement for
Netflix's in-kernel TLS, and provides a 5-10% CPU savings on heavy web
serving workloads when used by sendfile, due to effectively
compressing socket buffers by an order of magnitude, and hence
reducing cache misses.
For this new external mbuf buffer type (EXT_PGS), the ext_buf pointer
now points to a struct mbuf_ext_pgs structure instead of a data
buffer. This structure contains an array of physical addresses (this
reduces cache misses compared to an earlier version that stored an
array of vm_page_t pointers). It also stores additional fields needed
for in-kernel TLS such as the TLS header and trailer data that are
currently unused. To more easily detect these mbufs, the M_NOMAP flag
is set in m_flags in addition to M_EXT.
Various functions like m_copydata() have been updated to safely access
packet contents (using uiomove_fromphys()), to make things like BPF
safe.
NIC drivers advertise support for unmapped mbufs on transmit via a new
IFCAP_NOMAP capability. This capability can be toggled via the new
'nomap' and '-nomap' ifconfig(8) commands. For NIC drivers that only
transmit packet contents via DMA and use bus_dma, adding the
capability to if_capabilities and if_capenable should be all that is
required.
If a NIC does not support unmapped mbufs, they are converted to a
chain of mapped mbufs (using sf_bufs to provide the mapping) in
ip_output or ip6_output. If an unmapped mbuf requires software
checksums, it is also converted to a chain of mapped mbufs before
computing the checksum.
Submitted by: gallatin (earlier version)
Reviewed by: gallatin, hselasky, rrs
Discussed with: ae, kp (firewalls)
Relnotes: yes
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20616
2019-06-29 00:48:33 +00:00
|
|
|
} else if ((ifp->if_capenable & IFCAP_NOMAP) == 0) {
|
|
|
|
m = mb_unmapped_to_ext(m);
|
|
|
|
if (m == NULL) {
|
|
|
|
IPSTAT_INC(ips_odropped);
|
|
|
|
error = ENOBUFS;
|
|
|
|
goto bad;
|
|
|
|
}
|
2000-03-27 19:14:27 +00:00
|
|
|
}
|
2009-02-03 11:00:43 +00:00
|
|
|
#ifdef SCTP
|
2012-10-26 21:06:33 +00:00
|
|
|
if (m->m_pkthdr.csum_flags & CSUM_SCTP & ~ifp->if_hwassist) {
|
Add an external mbuf buffer type that holds multiple unmapped pages.
Unmapped mbufs allow sendfile to carry multiple pages of data in a
single mbuf, without mapping those pages. It is a requirement for
Netflix's in-kernel TLS, and provides a 5-10% CPU savings on heavy web
serving workloads when used by sendfile, due to effectively
compressing socket buffers by an order of magnitude, and hence
reducing cache misses.
For this new external mbuf buffer type (EXT_PGS), the ext_buf pointer
now points to a struct mbuf_ext_pgs structure instead of a data
buffer. This structure contains an array of physical addresses (this
reduces cache misses compared to an earlier version that stored an
array of vm_page_t pointers). It also stores additional fields needed
for in-kernel TLS such as the TLS header and trailer data that are
currently unused. To more easily detect these mbufs, the M_NOMAP flag
is set in m_flags in addition to M_EXT.
Various functions like m_copydata() have been updated to safely access
packet contents (using uiomove_fromphys()), to make things like BPF
safe.
NIC drivers advertise support for unmapped mbufs on transmit via a new
IFCAP_NOMAP capability. This capability can be toggled via the new
'nomap' and '-nomap' ifconfig(8) commands. For NIC drivers that only
transmit packet contents via DMA and use bus_dma, adding the
capability to if_capabilities and if_capenable should be all that is
required.
If a NIC does not support unmapped mbufs, they are converted to a
chain of mapped mbufs (using sf_bufs to provide the mapping) in
ip_output or ip6_output. If an unmapped mbuf requires software
checksums, it is also converted to a chain of mapped mbufs before
computing the checksum.
Submitted by: gallatin (earlier version)
Reviewed by: gallatin, hselasky, rrs
Discussed with: ae, kp (firewalls)
Relnotes: yes
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20616
2019-06-29 00:48:33 +00:00
|
|
|
m = mb_unmapped_to_ext(m);
|
|
|
|
if (m == NULL) {
|
|
|
|
IPSTAT_INC(ips_odropped);
|
|
|
|
error = ENOBUFS;
|
|
|
|
goto bad;
|
|
|
|
}
|
2010-03-12 22:58:52 +00:00
|
|
|
sctp_delayed_cksum(m, (uint32_t)(ip->ip_hl << 2));
|
2012-10-26 21:06:33 +00:00
|
|
|
m->m_pkthdr.csum_flags &= ~CSUM_SCTP;
|
2009-02-03 11:00:43 +00:00
|
|
|
}
|
|
|
|
#endif
|
2000-03-27 19:14:27 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
2000-03-27 19:14:27 +00:00
|
|
|
* If small enough for interface, or the interface will take
|
2006-09-06 21:51:59 +00:00
|
|
|
* care of the fragmentation for us, we can just send directly.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
2012-10-06 10:02:11 +00:00
|
|
|
if (ip_len <= mtu ||
|
2014-09-03 08:30:18 +00:00
|
|
|
(m->m_pkthdr.csum_flags & ifp->if_hwassist & CSUM_TSO) != 0) {
|
1994-05-24 10:09:53 +00:00
|
|
|
ip->ip_sum = 0;
|
2012-10-26 21:06:33 +00:00
|
|
|
if (m->m_pkthdr.csum_flags & CSUM_IP & ~ifp->if_hwassist) {
|
2002-10-20 22:52:07 +00:00
|
|
|
ip->ip_sum = in_cksum(m, hlen);
|
2012-10-26 21:06:33 +00:00
|
|
|
m->m_pkthdr.csum_flags &= ~CSUM_IP;
|
|
|
|
}
|
2000-10-19 23:15:54 +00:00
|
|
|
|
2006-09-06 21:51:59 +00:00
|
|
|
/*
|
|
|
|
* Record statistics for this interface address.
|
|
|
|
* With CSUM_TSO the byte/packet count will be slightly
|
|
|
|
* incorrect because we count the IP+TCP headers only
|
|
|
|
* once instead of for every generated packet.
|
|
|
|
*/
|
2001-07-23 16:50:01 +00:00
|
|
|
if (!(flags & IP_FORWARDING) && ia) {
|
2006-09-06 21:51:59 +00:00
|
|
|
if (m->m_pkthdr.csum_flags & CSUM_TSO)
|
2013-10-15 11:37:57 +00:00
|
|
|
counter_u64_add(ia->ia_ifa.ifa_opackets,
|
|
|
|
m->m_pkthdr.len / m->m_pkthdr.tso_segsz);
|
2006-09-06 21:51:59 +00:00
|
|
|
else
|
2013-10-15 11:37:57 +00:00
|
|
|
counter_u64_add(ia->ia_ifa.ifa_opackets, 1);
|
|
|
|
|
|
|
|
counter_u64_add(ia->ia_ifa.ifa_obytes, m->m_pkthdr.len);
|
2000-10-19 23:15:54 +00:00
|
|
|
}
|
2003-04-12 06:11:46 +00:00
|
|
|
#ifdef MBUF_STRESS_TEST
|
2003-09-01 05:55:37 +00:00
|
|
|
if (mbuf_frag_size && m->m_pkthdr.len > mbuf_frag_size)
|
2012-12-05 08:04:20 +00:00
|
|
|
m = m_fragment(m, M_NOWAIT, mbuf_frag_size);
|
2003-03-25 05:45:05 +00:00
|
|
|
#endif
|
2005-11-18 16:23:26 +00:00
|
|
|
/*
|
|
|
|
* Reset layer specific mbuf flags
|
|
|
|
* to avoid confusing lower layers.
|
|
|
|
*/
|
2013-08-19 13:27:32 +00:00
|
|
|
m_clrprotoflags(m);
|
2013-08-25 21:54:41 +00:00
|
|
|
IP_PROBE(send, NULL, NULL, ip, ifp, ip, NULL);
|
2019-09-24 18:18:11 +00:00
|
|
|
error = ip_output_send(inp, ifp, m, gw, ro,
|
|
|
|
(flags & IP_NO_SND_TAG_RL) ? false : true);
|
1994-05-24 10:09:53 +00:00
|
|
|
goto done;
|
|
|
|
}
|
2003-08-07 18:16:59 +00:00
|
|
|
|
2006-09-06 21:51:59 +00:00
|
|
|
/* Balk when DF bit is set or the interface didn't support TSO. */
|
2012-10-06 10:02:11 +00:00
|
|
|
if ((ip_off & IP_DF) || (m->m_pkthdr.csum_flags & CSUM_TSO)) {
|
1994-05-24 10:09:53 +00:00
|
|
|
error = EMSGSIZE;
|
2009-04-11 23:35:20 +00:00
|
|
|
IPSTAT_INC(ips_cantfrag);
|
1994-05-24 10:09:53 +00:00
|
|
|
goto bad;
|
|
|
|
}
|
2003-08-07 18:16:59 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Too large for interface; fragment if possible. If successful,
|
|
|
|
* on return, m will point to a list of packets to be sent.
|
|
|
|
*/
|
2012-10-26 21:06:33 +00:00
|
|
|
error = ip_fragment(ip, &m, mtu, ifp->if_hwassist);
|
2003-08-07 18:16:59 +00:00
|
|
|
if (error)
|
1994-05-24 10:09:53 +00:00
|
|
|
goto bad;
|
2003-08-07 18:16:59 +00:00
|
|
|
for (; m; m = m0) {
|
|
|
|
m0 = m->m_nextpkt;
|
|
|
|
m->m_nextpkt = 0;
|
|
|
|
if (error == 0) {
|
|
|
|
/* Record statistics for this interface address. */
|
|
|
|
if (ia != NULL) {
|
2013-10-15 11:37:57 +00:00
|
|
|
counter_u64_add(ia->ia_ifa.ifa_opackets, 1);
|
|
|
|
counter_u64_add(ia->ia_ifa.ifa_obytes,
|
|
|
|
m->m_pkthdr.len);
|
2003-08-07 18:16:59 +00:00
|
|
|
}
|
2005-11-18 16:23:26 +00:00
|
|
|
/*
|
|
|
|
* Reset layer specific mbuf flags
|
|
|
|
* to avoid confusing upper layers.
|
|
|
|
*/
|
2013-08-19 13:27:32 +00:00
|
|
|
m_clrprotoflags(m);
|
2005-11-18 16:23:26 +00:00
|
|
|
|
2016-12-29 19:57:46 +00:00
|
|
|
IP_PROBE(send, NULL, NULL, mtod(m, struct ip *), ifp,
|
|
|
|
mtod(m, struct ip *), NULL);
|
2019-09-24 18:18:11 +00:00
|
|
|
error = ip_output_send(inp, ifp, m, gw, ro, true);
|
2003-08-07 18:16:59 +00:00
|
|
|
} else
|
|
|
|
m_freem(m);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
|
2003-08-07 18:16:59 +00:00
|
|
|
if (error == 0)
|
2009-04-11 23:35:20 +00:00
|
|
|
IPSTAT_INC(ips_fragmented);
|
2003-08-07 18:16:59 +00:00
|
|
|
|
|
|
|
done:
|
2019-01-09 01:11:19 +00:00
|
|
|
NET_EPOCH_EXIT(et);
|
2003-08-07 18:16:59 +00:00
|
|
|
return (error);
|
2018-05-23 21:02:14 +00:00
|
|
|
bad:
|
2003-08-07 18:16:59 +00:00
|
|
|
m_freem(m);
|
|
|
|
goto done;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Create a chain of fragments which fit the given mtu. m_frag points to the
|
|
|
|
* mbuf to be fragmented; on return it points to the chain with the fragments.
|
|
|
|
* Return 0 if no error. If error, m_frag may contain a partially built
|
|
|
|
* chain of fragments that should be freed by the caller.
|
|
|
|
*
|
|
|
|
* if_hwassist_flags is the hw offload capabilities (see if_data.ifi_hwassist)
|
|
|
|
*/
|
|
|
|
int
|
|
|
|
ip_fragment(struct ip *ip, struct mbuf **m_frag, int mtu,
|
2012-10-26 21:06:33 +00:00
|
|
|
u_long if_hwassist_flags)
|
2003-08-07 18:16:59 +00:00
|
|
|
{
|
|
|
|
int error = 0;
|
|
|
|
int hlen = ip->ip_hl << 2;
|
|
|
|
int len = (mtu - hlen) & ~7; /* size of payload in each fragment */
|
|
|
|
int off;
|
|
|
|
struct mbuf *m0 = *m_frag; /* the original packet */
|
|
|
|
int firstlen;
|
|
|
|
struct mbuf **mnext;
|
|
|
|
int nfrags;
|
2012-10-06 10:02:11 +00:00
|
|
|
uint16_t ip_len, ip_off;
|
|
|
|
|
|
|
|
ip_len = ntohs(ip->ip_len);
|
|
|
|
ip_off = ntohs(ip->ip_off);
|
2003-08-07 18:16:59 +00:00
|
|
|
|
2012-10-06 10:02:11 +00:00
|
|
|
if (ip_off & IP_DF) { /* Fragmentation not allowed */
|
2009-04-11 23:35:20 +00:00
|
|
|
IPSTAT_INC(ips_cantfrag);
|
2003-08-07 18:16:59 +00:00
|
|
|
return EMSGSIZE;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Must be able to put at least 8 bytes per fragment.
|
|
|
|
*/
|
|
|
|
if (len < 8)
|
|
|
|
return EMSGSIZE;
|
|
|
|
|
2000-03-27 19:14:27 +00:00
|
|
|
/*
|
2003-08-07 18:16:59 +00:00
|
|
|
* If the interface will not calculate checksums on
|
2000-03-27 19:14:27 +00:00
|
|
|
* fragmented packets, then do it here.
|
|
|
|
*/
|
2012-11-27 19:31:49 +00:00
|
|
|
if (m0->m_pkthdr.csum_flags & CSUM_DELAY_DATA) {
|
Add an external mbuf buffer type that holds multiple unmapped pages.
Unmapped mbufs allow sendfile to carry multiple pages of data in a
single mbuf, without mapping those pages. It is a requirement for
Netflix's in-kernel TLS, and provides a 5-10% CPU savings on heavy web
serving workloads when used by sendfile, due to effectively
compressing socket buffers by an order of magnitude, and hence
reducing cache misses.
For this new external mbuf buffer type (EXT_PGS), the ext_buf pointer
now points to a struct mbuf_ext_pgs structure instead of a data
buffer. This structure contains an array of physical addresses (this
reduces cache misses compared to an earlier version that stored an
array of vm_page_t pointers). It also stores additional fields needed
for in-kernel TLS such as the TLS header and trailer data that are
currently unused. To more easily detect these mbufs, the M_NOMAP flag
is set in m_flags in addition to M_EXT.
Various functions like m_copydata() have been updated to safely access
packet contents (using uiomove_fromphys()), to make things like BPF
safe.
NIC drivers advertise support for unmapped mbufs on transmit via a new
IFCAP_NOMAP capability. This capability can be toggled via the new
'nomap' and '-nomap' ifconfig(8) commands. For NIC drivers that only
transmit packet contents via DMA and use bus_dma, adding the
capability to if_capabilities and if_capenable should be all that is
required.
If a NIC does not support unmapped mbufs, they are converted to a
chain of mapped mbufs (using sf_bufs to provide the mapping) in
ip_output or ip6_output. If an unmapped mbuf requires software
checksums, it is also converted to a chain of mapped mbufs before
computing the checksum.
Submitted by: gallatin (earlier version)
Reviewed by: gallatin, hselasky, rrs
Discussed with: ae, kp (firewalls)
Relnotes: yes
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20616
2019-06-29 00:48:33 +00:00
|
|
|
m0 = mb_unmapped_to_ext(m0);
|
|
|
|
if (m0 == NULL) {
|
|
|
|
error = ENOBUFS;
|
|
|
|
IPSTAT_INC(ips_odropped);
|
|
|
|
goto done;
|
|
|
|
}
|
2003-08-07 18:16:59 +00:00
|
|
|
in_delayed_cksum(m0);
|
|
|
|
m0->m_pkthdr.csum_flags &= ~CSUM_DELAY_DATA;
|
2000-03-27 19:14:27 +00:00
|
|
|
}
|
2009-02-03 11:00:43 +00:00
|
|
|
#ifdef SCTP
|
2012-11-27 19:31:49 +00:00
|
|
|
if (m0->m_pkthdr.csum_flags & CSUM_SCTP) {
|
Add an external mbuf buffer type that holds multiple unmapped pages.
Unmapped mbufs allow sendfile to carry multiple pages of data in a
single mbuf, without mapping those pages. It is a requirement for
Netflix's in-kernel TLS, and provides a 5-10% CPU savings on heavy web
serving workloads when used by sendfile, due to effectively
compressing socket buffers by an order of magnitude, and hence
reducing cache misses.
For this new external mbuf buffer type (EXT_PGS), the ext_buf pointer
now points to a struct mbuf_ext_pgs structure instead of a data
buffer. This structure contains an array of physical addresses (this
reduces cache misses compared to an earlier version that stored an
array of vm_page_t pointers). It also stores additional fields needed
for in-kernel TLS such as the TLS header and trailer data that are
currently unused. To more easily detect these mbufs, the M_NOMAP flag
is set in m_flags in addition to M_EXT.
Various functions like m_copydata() have been updated to safely access
packet contents (using uiomove_fromphys()), to make things like BPF
safe.
NIC drivers advertise support for unmapped mbufs on transmit via a new
IFCAP_NOMAP capability. This capability can be toggled via the new
'nomap' and '-nomap' ifconfig(8) commands. For NIC drivers that only
transmit packet contents via DMA and use bus_dma, adding the
capability to if_capabilities and if_capenable should be all that is
required.
If a NIC does not support unmapped mbufs, they are converted to a
chain of mapped mbufs (using sf_bufs to provide the mapping) in
ip_output or ip6_output. If an unmapped mbuf requires software
checksums, it is also converted to a chain of mapped mbufs before
computing the checksum.
Submitted by: gallatin (earlier version)
Reviewed by: gallatin, hselasky, rrs
Discussed with: ae, kp (firewalls)
Relnotes: yes
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20616
2019-06-29 00:48:33 +00:00
|
|
|
m0 = mb_unmapped_to_ext(m0);
|
|
|
|
if (m0 == NULL) {
|
|
|
|
error = ENOBUFS;
|
|
|
|
IPSTAT_INC(ips_odropped);
|
|
|
|
goto done;
|
|
|
|
}
|
2010-03-12 22:58:52 +00:00
|
|
|
sctp_delayed_cksum(m0, hlen);
|
2009-02-03 11:00:43 +00:00
|
|
|
m0->m_pkthdr.csum_flags &= ~CSUM_SCTP;
|
|
|
|
}
|
|
|
|
#endif
|
At long last, commit the zero copy sockets code.
MAKEDEV: Add MAKEDEV glue for the ti(4) device nodes.
ti.4: Update the ti(4) man page to include information on the
TI_JUMBO_HDRSPLIT and TI_PRIVATE_JUMBOS kernel options,
and also include information about the new character
device interface and the associated ioctls.
man9/Makefile: Add jumbo.9 and zero_copy.9 man pages and associated
links.
jumbo.9: New man page describing the jumbo buffer allocator
interface and operation.
zero_copy.9: New man page describing the general characteristics of
the zero copy send and receive code, and what an
application author should do to take advantage of the
zero copy functionality.
NOTES: Add entries for ZERO_COPY_SOCKETS, TI_PRIVATE_JUMBOS,
TI_JUMBO_HDRSPLIT, MSIZE, and MCLSHIFT.
conf/files: Add uipc_jumbo.c and uipc_cow.c.
conf/options: Add the 5 options mentioned above.
kern_subr.c: Receive side zero copy implementation. This takes
"disposable" pages attached to an mbuf, gives them to
a user process, and then recycles the user's page.
This is only active when ZERO_COPY_SOCKETS is turned on
and the kern.ipc.zero_copy.receive sysctl variable is
set to 1.
uipc_cow.c: Send side zero copy functions. Takes a page written
by the user and maps it copy on write and assigns it
kernel virtual address space. Removes copy on write
mapping once the buffer has been freed by the network
stack.
uipc_jumbo.c: Jumbo disposable page allocator code. This allocates
(optionally) disposable pages for network drivers that
want to give the user the option of doing zero copy
receive.
uipc_socket.c: Add kern.ipc.zero_copy.{send,receive} sysctls that are
enabled if ZERO_COPY_SOCKETS is turned on.
Add zero copy send support to sosend() -- pages get
mapped into the kernel instead of getting copied if
they meet size and alignment restrictions.
uipc_syscalls.c:Un-staticize some of the sf* functions so that they
can be used elsewhere. (uipc_cow.c)
if_media.c: In the SIOCGIFMEDIA ioctl in ifmedia_ioctl(), avoid
calling malloc() with M_WAITOK. Return an error if
the M_NOWAIT malloc fails.
The ti(4) driver and the wi(4) driver, at least, call
this with a mutex held. This causes witness warnings
for 'ifconfig -a' with a wi(4) or ti(4) board in the
system. (I've only verified for ti(4)).
ip_output.c: Fragment large datagrams so that each segment contains
a multiple of PAGE_SIZE amount of data plus headers.
This allows the receiver to potentially do page
flipping on receives.
if_ti.c: Add zero copy receive support to the ti(4) driver. If
TI_PRIVATE_JUMBOS is not defined, it now uses the
jumbo(9) buffer allocator for jumbo receive buffers.
Add a new character device interface for the ti(4)
driver for the new debugging interface. This allows
(a patched version of) gdb to talk to the Tigon board
and debug the firmware. There are also a few additional
debugging ioctls available through this interface.
Add header splitting support to the ti(4) driver.
Tweak some of the default interrupt coalescing
parameters to more useful defaults.
Add hooks for supporting transmit flow control, but
leave it turned off with a comment describing why it
is turned off.
if_tireg.h: Change the firmware rev to 12.4.11, since we're really
at 12.4.11 plus fixes from 12.4.13.
Add defines needed for debugging.
Remove the ti_stats structure, it is now defined in
sys/tiio.h.
ti_fw.h: 12.4.11 firmware.
ti_fw2.h: 12.4.11 firmware, plus selected fixes from 12.4.13,
and my header splitting patches. Revision 12.4.13
doesn't handle 10/100 negotiation properly. (This
firmware is the same as what was in the tree previously,
with the addition of header splitting support.)
sys/jumbo.h: Jumbo buffer allocator interface.
sys/mbuf.h: Add a new external mbuf type, EXT_DISPOSABLE, to
indicate that the payload buffer can be thrown away /
flipped to a userland process.
socketvar.h: Add prototype for socow_setup.
tiio.h: ioctl interface to the character portion of the ti(4)
driver, plus associated structure/type definitions.
uio.h: Change prototype for uiomoveco() so that we'll know
whether the source page is disposable.
ufs_readwrite.c:Update for new prototype of uiomoveco().
vm_fault.c: In vm_fault(), check to see whether we need to do a page
based copy on write fault.
vm_object.c: Add a new function, vm_object_allocate_wait(). This
does the same thing that vm_object allocate does, except
that it gives the caller the opportunity to specify whether
it should wait on the uma_zalloc() of the object structre.
This allows vm objects to be allocated while holding a
mutex. (Without generating WITNESS warnings.)
vm_object_allocate() is implemented as a call to
vm_object_allocate_wait() with the malloc flag set to
M_WAITOK.
vm_object.h: Add prototype for vm_object_allocate_wait().
vm_page.c: Add page-based copy on write setup, clear and fault
routines.
vm_page.h: Add page based COW function prototypes and variable in
the vm_page structure.
Many thanks to Drew Gallatin, who wrote the zero copy send and receive
code, and to all the other folks who have tested and reviewed this code
over the years.
2002-06-26 03:37:47 +00:00
|
|
|
if (len > PAGE_SIZE) {
|
2014-01-16 12:58:03 +00:00
|
|
|
/*
|
|
|
|
* Fragment large datagrams such that each segment
|
|
|
|
* contains a multiple of PAGE_SIZE amount of data,
|
|
|
|
* plus headers. This enables a receiver to perform
|
At long last, commit the zero copy sockets code.
MAKEDEV: Add MAKEDEV glue for the ti(4) device nodes.
ti.4: Update the ti(4) man page to include information on the
TI_JUMBO_HDRSPLIT and TI_PRIVATE_JUMBOS kernel options,
and also include information about the new character
device interface and the associated ioctls.
man9/Makefile: Add jumbo.9 and zero_copy.9 man pages and associated
links.
jumbo.9: New man page describing the jumbo buffer allocator
interface and operation.
zero_copy.9: New man page describing the general characteristics of
the zero copy send and receive code, and what an
application author should do to take advantage of the
zero copy functionality.
NOTES: Add entries for ZERO_COPY_SOCKETS, TI_PRIVATE_JUMBOS,
TI_JUMBO_HDRSPLIT, MSIZE, and MCLSHIFT.
conf/files: Add uipc_jumbo.c and uipc_cow.c.
conf/options: Add the 5 options mentioned above.
kern_subr.c: Receive side zero copy implementation. This takes
"disposable" pages attached to an mbuf, gives them to
a user process, and then recycles the user's page.
This is only active when ZERO_COPY_SOCKETS is turned on
and the kern.ipc.zero_copy.receive sysctl variable is
set to 1.
uipc_cow.c: Send side zero copy functions. Takes a page written
by the user and maps it copy on write and assigns it
kernel virtual address space. Removes copy on write
mapping once the buffer has been freed by the network
stack.
uipc_jumbo.c: Jumbo disposable page allocator code. This allocates
(optionally) disposable pages for network drivers that
want to give the user the option of doing zero copy
receive.
uipc_socket.c: Add kern.ipc.zero_copy.{send,receive} sysctls that are
enabled if ZERO_COPY_SOCKETS is turned on.
Add zero copy send support to sosend() -- pages get
mapped into the kernel instead of getting copied if
they meet size and alignment restrictions.
uipc_syscalls.c:Un-staticize some of the sf* functions so that they
can be used elsewhere. (uipc_cow.c)
if_media.c: In the SIOCGIFMEDIA ioctl in ifmedia_ioctl(), avoid
calling malloc() with M_WAITOK. Return an error if
the M_NOWAIT malloc fails.
The ti(4) driver and the wi(4) driver, at least, call
this with a mutex held. This causes witness warnings
for 'ifconfig -a' with a wi(4) or ti(4) board in the
system. (I've only verified for ti(4)).
ip_output.c: Fragment large datagrams so that each segment contains
a multiple of PAGE_SIZE amount of data plus headers.
This allows the receiver to potentially do page
flipping on receives.
if_ti.c: Add zero copy receive support to the ti(4) driver. If
TI_PRIVATE_JUMBOS is not defined, it now uses the
jumbo(9) buffer allocator for jumbo receive buffers.
Add a new character device interface for the ti(4)
driver for the new debugging interface. This allows
(a patched version of) gdb to talk to the Tigon board
and debug the firmware. There are also a few additional
debugging ioctls available through this interface.
Add header splitting support to the ti(4) driver.
Tweak some of the default interrupt coalescing
parameters to more useful defaults.
Add hooks for supporting transmit flow control, but
leave it turned off with a comment describing why it
is turned off.
if_tireg.h: Change the firmware rev to 12.4.11, since we're really
at 12.4.11 plus fixes from 12.4.13.
Add defines needed for debugging.
Remove the ti_stats structure, it is now defined in
sys/tiio.h.
ti_fw.h: 12.4.11 firmware.
ti_fw2.h: 12.4.11 firmware, plus selected fixes from 12.4.13,
and my header splitting patches. Revision 12.4.13
doesn't handle 10/100 negotiation properly. (This
firmware is the same as what was in the tree previously,
with the addition of header splitting support.)
sys/jumbo.h: Jumbo buffer allocator interface.
sys/mbuf.h: Add a new external mbuf type, EXT_DISPOSABLE, to
indicate that the payload buffer can be thrown away /
flipped to a userland process.
socketvar.h: Add prototype for socow_setup.
tiio.h: ioctl interface to the character portion of the ti(4)
driver, plus associated structure/type definitions.
uio.h: Change prototype for uiomoveco() so that we'll know
whether the source page is disposable.
ufs_readwrite.c:Update for new prototype of uiomoveco().
vm_fault.c: In vm_fault(), check to see whether we need to do a page
based copy on write fault.
vm_object.c: Add a new function, vm_object_allocate_wait(). This
does the same thing that vm_object allocate does, except
that it gives the caller the opportunity to specify whether
it should wait on the uma_zalloc() of the object structre.
This allows vm objects to be allocated while holding a
mutex. (Without generating WITNESS warnings.)
vm_object_allocate() is implemented as a call to
vm_object_allocate_wait() with the malloc flag set to
M_WAITOK.
vm_object.h: Add prototype for vm_object_allocate_wait().
vm_page.c: Add page-based copy on write setup, clear and fault
routines.
vm_page.h: Add page based COW function prototypes and variable in
the vm_page structure.
Many thanks to Drew Gallatin, who wrote the zero copy send and receive
code, and to all the other folks who have tested and reviewed this code
over the years.
2002-06-26 03:37:47 +00:00
|
|
|
* page-flipping zero-copy optimizations.
|
2003-08-07 18:16:59 +00:00
|
|
|
*
|
|
|
|
* XXX When does this help given that sender and receiver
|
|
|
|
* could have different page sizes, and also mtu could
|
|
|
|
* be less than the receiver's page size ?
|
At long last, commit the zero copy sockets code.
MAKEDEV: Add MAKEDEV glue for the ti(4) device nodes.
ti.4: Update the ti(4) man page to include information on the
TI_JUMBO_HDRSPLIT and TI_PRIVATE_JUMBOS kernel options,
and also include information about the new character
device interface and the associated ioctls.
man9/Makefile: Add jumbo.9 and zero_copy.9 man pages and associated
links.
jumbo.9: New man page describing the jumbo buffer allocator
interface and operation.
zero_copy.9: New man page describing the general characteristics of
the zero copy send and receive code, and what an
application author should do to take advantage of the
zero copy functionality.
NOTES: Add entries for ZERO_COPY_SOCKETS, TI_PRIVATE_JUMBOS,
TI_JUMBO_HDRSPLIT, MSIZE, and MCLSHIFT.
conf/files: Add uipc_jumbo.c and uipc_cow.c.
conf/options: Add the 5 options mentioned above.
kern_subr.c: Receive side zero copy implementation. This takes
"disposable" pages attached to an mbuf, gives them to
a user process, and then recycles the user's page.
This is only active when ZERO_COPY_SOCKETS is turned on
and the kern.ipc.zero_copy.receive sysctl variable is
set to 1.
uipc_cow.c: Send side zero copy functions. Takes a page written
by the user and maps it copy on write and assigns it
kernel virtual address space. Removes copy on write
mapping once the buffer has been freed by the network
stack.
uipc_jumbo.c: Jumbo disposable page allocator code. This allocates
(optionally) disposable pages for network drivers that
want to give the user the option of doing zero copy
receive.
uipc_socket.c: Add kern.ipc.zero_copy.{send,receive} sysctls that are
enabled if ZERO_COPY_SOCKETS is turned on.
Add zero copy send support to sosend() -- pages get
mapped into the kernel instead of getting copied if
they meet size and alignment restrictions.
uipc_syscalls.c:Un-staticize some of the sf* functions so that they
can be used elsewhere. (uipc_cow.c)
if_media.c: In the SIOCGIFMEDIA ioctl in ifmedia_ioctl(), avoid
calling malloc() with M_WAITOK. Return an error if
the M_NOWAIT malloc fails.
The ti(4) driver and the wi(4) driver, at least, call
this with a mutex held. This causes witness warnings
for 'ifconfig -a' with a wi(4) or ti(4) board in the
system. (I've only verified for ti(4)).
ip_output.c: Fragment large datagrams so that each segment contains
a multiple of PAGE_SIZE amount of data plus headers.
This allows the receiver to potentially do page
flipping on receives.
if_ti.c: Add zero copy receive support to the ti(4) driver. If
TI_PRIVATE_JUMBOS is not defined, it now uses the
jumbo(9) buffer allocator for jumbo receive buffers.
Add a new character device interface for the ti(4)
driver for the new debugging interface. This allows
(a patched version of) gdb to talk to the Tigon board
and debug the firmware. There are also a few additional
debugging ioctls available through this interface.
Add header splitting support to the ti(4) driver.
Tweak some of the default interrupt coalescing
parameters to more useful defaults.
Add hooks for supporting transmit flow control, but
leave it turned off with a comment describing why it
is turned off.
if_tireg.h: Change the firmware rev to 12.4.11, since we're really
at 12.4.11 plus fixes from 12.4.13.
Add defines needed for debugging.
Remove the ti_stats structure, it is now defined in
sys/tiio.h.
ti_fw.h: 12.4.11 firmware.
ti_fw2.h: 12.4.11 firmware, plus selected fixes from 12.4.13,
and my header splitting patches. Revision 12.4.13
doesn't handle 10/100 negotiation properly. (This
firmware is the same as what was in the tree previously,
with the addition of header splitting support.)
sys/jumbo.h: Jumbo buffer allocator interface.
sys/mbuf.h: Add a new external mbuf type, EXT_DISPOSABLE, to
indicate that the payload buffer can be thrown away /
flipped to a userland process.
socketvar.h: Add prototype for socow_setup.
tiio.h: ioctl interface to the character portion of the ti(4)
driver, plus associated structure/type definitions.
uio.h: Change prototype for uiomoveco() so that we'll know
whether the source page is disposable.
ufs_readwrite.c:Update for new prototype of uiomoveco().
vm_fault.c: In vm_fault(), check to see whether we need to do a page
based copy on write fault.
vm_object.c: Add a new function, vm_object_allocate_wait(). This
does the same thing that vm_object allocate does, except
that it gives the caller the opportunity to specify whether
it should wait on the uma_zalloc() of the object structre.
This allows vm objects to be allocated while holding a
mutex. (Without generating WITNESS warnings.)
vm_object_allocate() is implemented as a call to
vm_object_allocate_wait() with the malloc flag set to
M_WAITOK.
vm_object.h: Add prototype for vm_object_allocate_wait().
vm_page.c: Add page-based copy on write setup, clear and fault
routines.
vm_page.h: Add page based COW function prototypes and variable in
the vm_page structure.
Many thanks to Drew Gallatin, who wrote the zero copy send and receive
code, and to all the other folks who have tested and reviewed this code
over the years.
2002-06-26 03:37:47 +00:00
|
|
|
*/
|
|
|
|
int newlen;
|
2003-08-07 18:16:59 +00:00
|
|
|
|
2015-02-25 13:58:43 +00:00
|
|
|
off = MIN(mtu, m0->m_pkthdr.len);
|
At long last, commit the zero copy sockets code.
MAKEDEV: Add MAKEDEV glue for the ti(4) device nodes.
ti.4: Update the ti(4) man page to include information on the
TI_JUMBO_HDRSPLIT and TI_PRIVATE_JUMBOS kernel options,
and also include information about the new character
device interface and the associated ioctls.
man9/Makefile: Add jumbo.9 and zero_copy.9 man pages and associated
links.
jumbo.9: New man page describing the jumbo buffer allocator
interface and operation.
zero_copy.9: New man page describing the general characteristics of
the zero copy send and receive code, and what an
application author should do to take advantage of the
zero copy functionality.
NOTES: Add entries for ZERO_COPY_SOCKETS, TI_PRIVATE_JUMBOS,
TI_JUMBO_HDRSPLIT, MSIZE, and MCLSHIFT.
conf/files: Add uipc_jumbo.c and uipc_cow.c.
conf/options: Add the 5 options mentioned above.
kern_subr.c: Receive side zero copy implementation. This takes
"disposable" pages attached to an mbuf, gives them to
a user process, and then recycles the user's page.
This is only active when ZERO_COPY_SOCKETS is turned on
and the kern.ipc.zero_copy.receive sysctl variable is
set to 1.
uipc_cow.c: Send side zero copy functions. Takes a page written
by the user and maps it copy on write and assigns it
kernel virtual address space. Removes copy on write
mapping once the buffer has been freed by the network
stack.
uipc_jumbo.c: Jumbo disposable page allocator code. This allocates
(optionally) disposable pages for network drivers that
want to give the user the option of doing zero copy
receive.
uipc_socket.c: Add kern.ipc.zero_copy.{send,receive} sysctls that are
enabled if ZERO_COPY_SOCKETS is turned on.
Add zero copy send support to sosend() -- pages get
mapped into the kernel instead of getting copied if
they meet size and alignment restrictions.
uipc_syscalls.c:Un-staticize some of the sf* functions so that they
can be used elsewhere. (uipc_cow.c)
if_media.c: In the SIOCGIFMEDIA ioctl in ifmedia_ioctl(), avoid
calling malloc() with M_WAITOK. Return an error if
the M_NOWAIT malloc fails.
The ti(4) driver and the wi(4) driver, at least, call
this with a mutex held. This causes witness warnings
for 'ifconfig -a' with a wi(4) or ti(4) board in the
system. (I've only verified for ti(4)).
ip_output.c: Fragment large datagrams so that each segment contains
a multiple of PAGE_SIZE amount of data plus headers.
This allows the receiver to potentially do page
flipping on receives.
if_ti.c: Add zero copy receive support to the ti(4) driver. If
TI_PRIVATE_JUMBOS is not defined, it now uses the
jumbo(9) buffer allocator for jumbo receive buffers.
Add a new character device interface for the ti(4)
driver for the new debugging interface. This allows
(a patched version of) gdb to talk to the Tigon board
and debug the firmware. There are also a few additional
debugging ioctls available through this interface.
Add header splitting support to the ti(4) driver.
Tweak some of the default interrupt coalescing
parameters to more useful defaults.
Add hooks for supporting transmit flow control, but
leave it turned off with a comment describing why it
is turned off.
if_tireg.h: Change the firmware rev to 12.4.11, since we're really
at 12.4.11 plus fixes from 12.4.13.
Add defines needed for debugging.
Remove the ti_stats structure, it is now defined in
sys/tiio.h.
ti_fw.h: 12.4.11 firmware.
ti_fw2.h: 12.4.11 firmware, plus selected fixes from 12.4.13,
and my header splitting patches. Revision 12.4.13
doesn't handle 10/100 negotiation properly. (This
firmware is the same as what was in the tree previously,
with the addition of header splitting support.)
sys/jumbo.h: Jumbo buffer allocator interface.
sys/mbuf.h: Add a new external mbuf type, EXT_DISPOSABLE, to
indicate that the payload buffer can be thrown away /
flipped to a userland process.
socketvar.h: Add prototype for socow_setup.
tiio.h: ioctl interface to the character portion of the ti(4)
driver, plus associated structure/type definitions.
uio.h: Change prototype for uiomoveco() so that we'll know
whether the source page is disposable.
ufs_readwrite.c:Update for new prototype of uiomoveco().
vm_fault.c: In vm_fault(), check to see whether we need to do a page
based copy on write fault.
vm_object.c: Add a new function, vm_object_allocate_wait(). This
does the same thing that vm_object allocate does, except
that it gives the caller the opportunity to specify whether
it should wait on the uma_zalloc() of the object structre.
This allows vm objects to be allocated while holding a
mutex. (Without generating WITNESS warnings.)
vm_object_allocate() is implemented as a call to
vm_object_allocate_wait() with the malloc flag set to
M_WAITOK.
vm_object.h: Add prototype for vm_object_allocate_wait().
vm_page.c: Add page-based copy on write setup, clear and fault
routines.
vm_page.h: Add page based COW function prototypes and variable in
the vm_page structure.
Many thanks to Drew Gallatin, who wrote the zero copy send and receive
code, and to all the other folks who have tested and reviewed this code
over the years.
2002-06-26 03:37:47 +00:00
|
|
|
|
|
|
|
/*
|
2014-01-16 12:58:03 +00:00
|
|
|
* firstlen (off - hlen) must be aligned on an
|
At long last, commit the zero copy sockets code.
MAKEDEV: Add MAKEDEV glue for the ti(4) device nodes.
ti.4: Update the ti(4) man page to include information on the
TI_JUMBO_HDRSPLIT and TI_PRIVATE_JUMBOS kernel options,
and also include information about the new character
device interface and the associated ioctls.
man9/Makefile: Add jumbo.9 and zero_copy.9 man pages and associated
links.
jumbo.9: New man page describing the jumbo buffer allocator
interface and operation.
zero_copy.9: New man page describing the general characteristics of
the zero copy send and receive code, and what an
application author should do to take advantage of the
zero copy functionality.
NOTES: Add entries for ZERO_COPY_SOCKETS, TI_PRIVATE_JUMBOS,
TI_JUMBO_HDRSPLIT, MSIZE, and MCLSHIFT.
conf/files: Add uipc_jumbo.c and uipc_cow.c.
conf/options: Add the 5 options mentioned above.
kern_subr.c: Receive side zero copy implementation. This takes
"disposable" pages attached to an mbuf, gives them to
a user process, and then recycles the user's page.
This is only active when ZERO_COPY_SOCKETS is turned on
and the kern.ipc.zero_copy.receive sysctl variable is
set to 1.
uipc_cow.c: Send side zero copy functions. Takes a page written
by the user and maps it copy on write and assigns it
kernel virtual address space. Removes copy on write
mapping once the buffer has been freed by the network
stack.
uipc_jumbo.c: Jumbo disposable page allocator code. This allocates
(optionally) disposable pages for network drivers that
want to give the user the option of doing zero copy
receive.
uipc_socket.c: Add kern.ipc.zero_copy.{send,receive} sysctls that are
enabled if ZERO_COPY_SOCKETS is turned on.
Add zero copy send support to sosend() -- pages get
mapped into the kernel instead of getting copied if
they meet size and alignment restrictions.
uipc_syscalls.c:Un-staticize some of the sf* functions so that they
can be used elsewhere. (uipc_cow.c)
if_media.c: In the SIOCGIFMEDIA ioctl in ifmedia_ioctl(), avoid
calling malloc() with M_WAITOK. Return an error if
the M_NOWAIT malloc fails.
The ti(4) driver and the wi(4) driver, at least, call
this with a mutex held. This causes witness warnings
for 'ifconfig -a' with a wi(4) or ti(4) board in the
system. (I've only verified for ti(4)).
ip_output.c: Fragment large datagrams so that each segment contains
a multiple of PAGE_SIZE amount of data plus headers.
This allows the receiver to potentially do page
flipping on receives.
if_ti.c: Add zero copy receive support to the ti(4) driver. If
TI_PRIVATE_JUMBOS is not defined, it now uses the
jumbo(9) buffer allocator for jumbo receive buffers.
Add a new character device interface for the ti(4)
driver for the new debugging interface. This allows
(a patched version of) gdb to talk to the Tigon board
and debug the firmware. There are also a few additional
debugging ioctls available through this interface.
Add header splitting support to the ti(4) driver.
Tweak some of the default interrupt coalescing
parameters to more useful defaults.
Add hooks for supporting transmit flow control, but
leave it turned off with a comment describing why it
is turned off.
if_tireg.h: Change the firmware rev to 12.4.11, since we're really
at 12.4.11 plus fixes from 12.4.13.
Add defines needed for debugging.
Remove the ti_stats structure, it is now defined in
sys/tiio.h.
ti_fw.h: 12.4.11 firmware.
ti_fw2.h: 12.4.11 firmware, plus selected fixes from 12.4.13,
and my header splitting patches. Revision 12.4.13
doesn't handle 10/100 negotiation properly. (This
firmware is the same as what was in the tree previously,
with the addition of header splitting support.)
sys/jumbo.h: Jumbo buffer allocator interface.
sys/mbuf.h: Add a new external mbuf type, EXT_DISPOSABLE, to
indicate that the payload buffer can be thrown away /
flipped to a userland process.
socketvar.h: Add prototype for socow_setup.
tiio.h: ioctl interface to the character portion of the ti(4)
driver, plus associated structure/type definitions.
uio.h: Change prototype for uiomoveco() so that we'll know
whether the source page is disposable.
ufs_readwrite.c:Update for new prototype of uiomoveco().
vm_fault.c: In vm_fault(), check to see whether we need to do a page
based copy on write fault.
vm_object.c: Add a new function, vm_object_allocate_wait(). This
does the same thing that vm_object allocate does, except
that it gives the caller the opportunity to specify whether
it should wait on the uma_zalloc() of the object structre.
This allows vm objects to be allocated while holding a
mutex. (Without generating WITNESS warnings.)
vm_object_allocate() is implemented as a call to
vm_object_allocate_wait() with the malloc flag set to
M_WAITOK.
vm_object.h: Add prototype for vm_object_allocate_wait().
vm_page.c: Add page-based copy on write setup, clear and fault
routines.
vm_page.h: Add page based COW function prototypes and variable in
the vm_page structure.
Many thanks to Drew Gallatin, who wrote the zero copy send and receive
code, and to all the other folks who have tested and reviewed this code
over the years.
2002-06-26 03:37:47 +00:00
|
|
|
* 8-byte boundary
|
|
|
|
*/
|
|
|
|
if (off < hlen)
|
|
|
|
goto smart_frag_failure;
|
|
|
|
off = ((off - hlen) & ~7) + hlen;
|
2003-08-07 18:16:59 +00:00
|
|
|
newlen = (~PAGE_MASK) & mtu;
|
|
|
|
if ((newlen + sizeof (struct ip)) > mtu) {
|
At long last, commit the zero copy sockets code.
MAKEDEV: Add MAKEDEV glue for the ti(4) device nodes.
ti.4: Update the ti(4) man page to include information on the
TI_JUMBO_HDRSPLIT and TI_PRIVATE_JUMBOS kernel options,
and also include information about the new character
device interface and the associated ioctls.
man9/Makefile: Add jumbo.9 and zero_copy.9 man pages and associated
links.
jumbo.9: New man page describing the jumbo buffer allocator
interface and operation.
zero_copy.9: New man page describing the general characteristics of
the zero copy send and receive code, and what an
application author should do to take advantage of the
zero copy functionality.
NOTES: Add entries for ZERO_COPY_SOCKETS, TI_PRIVATE_JUMBOS,
TI_JUMBO_HDRSPLIT, MSIZE, and MCLSHIFT.
conf/files: Add uipc_jumbo.c and uipc_cow.c.
conf/options: Add the 5 options mentioned above.
kern_subr.c: Receive side zero copy implementation. This takes
"disposable" pages attached to an mbuf, gives them to
a user process, and then recycles the user's page.
This is only active when ZERO_COPY_SOCKETS is turned on
and the kern.ipc.zero_copy.receive sysctl variable is
set to 1.
uipc_cow.c: Send side zero copy functions. Takes a page written
by the user and maps it copy on write and assigns it
kernel virtual address space. Removes copy on write
mapping once the buffer has been freed by the network
stack.
uipc_jumbo.c: Jumbo disposable page allocator code. This allocates
(optionally) disposable pages for network drivers that
want to give the user the option of doing zero copy
receive.
uipc_socket.c: Add kern.ipc.zero_copy.{send,receive} sysctls that are
enabled if ZERO_COPY_SOCKETS is turned on.
Add zero copy send support to sosend() -- pages get
mapped into the kernel instead of getting copied if
they meet size and alignment restrictions.
uipc_syscalls.c:Un-staticize some of the sf* functions so that they
can be used elsewhere. (uipc_cow.c)
if_media.c: In the SIOCGIFMEDIA ioctl in ifmedia_ioctl(), avoid
calling malloc() with M_WAITOK. Return an error if
the M_NOWAIT malloc fails.
The ti(4) driver and the wi(4) driver, at least, call
this with a mutex held. This causes witness warnings
for 'ifconfig -a' with a wi(4) or ti(4) board in the
system. (I've only verified for ti(4)).
ip_output.c: Fragment large datagrams so that each segment contains
a multiple of PAGE_SIZE amount of data plus headers.
This allows the receiver to potentially do page
flipping on receives.
if_ti.c: Add zero copy receive support to the ti(4) driver. If
TI_PRIVATE_JUMBOS is not defined, it now uses the
jumbo(9) buffer allocator for jumbo receive buffers.
Add a new character device interface for the ti(4)
driver for the new debugging interface. This allows
(a patched version of) gdb to talk to the Tigon board
and debug the firmware. There are also a few additional
debugging ioctls available through this interface.
Add header splitting support to the ti(4) driver.
Tweak some of the default interrupt coalescing
parameters to more useful defaults.
Add hooks for supporting transmit flow control, but
leave it turned off with a comment describing why it
is turned off.
if_tireg.h: Change the firmware rev to 12.4.11, since we're really
at 12.4.11 plus fixes from 12.4.13.
Add defines needed for debugging.
Remove the ti_stats structure, it is now defined in
sys/tiio.h.
ti_fw.h: 12.4.11 firmware.
ti_fw2.h: 12.4.11 firmware, plus selected fixes from 12.4.13,
and my header splitting patches. Revision 12.4.13
doesn't handle 10/100 negotiation properly. (This
firmware is the same as what was in the tree previously,
with the addition of header splitting support.)
sys/jumbo.h: Jumbo buffer allocator interface.
sys/mbuf.h: Add a new external mbuf type, EXT_DISPOSABLE, to
indicate that the payload buffer can be thrown away /
flipped to a userland process.
socketvar.h: Add prototype for socow_setup.
tiio.h: ioctl interface to the character portion of the ti(4)
driver, plus associated structure/type definitions.
uio.h: Change prototype for uiomoveco() so that we'll know
whether the source page is disposable.
ufs_readwrite.c:Update for new prototype of uiomoveco().
vm_fault.c: In vm_fault(), check to see whether we need to do a page
based copy on write fault.
vm_object.c: Add a new function, vm_object_allocate_wait(). This
does the same thing that vm_object allocate does, except
that it gives the caller the opportunity to specify whether
it should wait on the uma_zalloc() of the object structre.
This allows vm objects to be allocated while holding a
mutex. (Without generating WITNESS warnings.)
vm_object_allocate() is implemented as a call to
vm_object_allocate_wait() with the malloc flag set to
M_WAITOK.
vm_object.h: Add prototype for vm_object_allocate_wait().
vm_page.c: Add page-based copy on write setup, clear and fault
routines.
vm_page.h: Add page based COW function prototypes and variable in
the vm_page structure.
Many thanks to Drew Gallatin, who wrote the zero copy send and receive
code, and to all the other folks who have tested and reviewed this code
over the years.
2002-06-26 03:37:47 +00:00
|
|
|
/* we failed, go back the default */
|
|
|
|
smart_frag_failure:
|
|
|
|
newlen = len;
|
|
|
|
off = hlen + len;
|
|
|
|
}
|
|
|
|
len = newlen;
|
|
|
|
|
|
|
|
} else {
|
|
|
|
off = hlen + len;
|
|
|
|
}
|
|
|
|
|
2003-08-07 18:16:59 +00:00
|
|
|
firstlen = off - hlen;
|
|
|
|
mnext = &m0->m_nextpkt; /* pointer to next packet */
|
1994-05-24 10:09:53 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Loop through length of segment after first fragment,
|
|
|
|
* make new header and copy data of each part and link onto chain.
|
2003-08-07 18:16:59 +00:00
|
|
|
* Here, m0 is the original packet, m is the fragment being created.
|
|
|
|
* The fragments are linked off the m_nextpkt of the original
|
|
|
|
* packet, which after processing serves as the first fragment.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
2012-10-06 10:02:11 +00:00
|
|
|
for (nfrags = 1; off < ip_len; off += len, nfrags++) {
|
2003-08-07 18:16:59 +00:00
|
|
|
struct ip *mhip; /* ip header on the fragment */
|
|
|
|
struct mbuf *m;
|
|
|
|
int mhlen = sizeof (struct ip);
|
|
|
|
|
2013-03-15 12:55:30 +00:00
|
|
|
m = m_gethdr(M_NOWAIT, MT_DATA);
|
2004-08-11 10:46:15 +00:00
|
|
|
if (m == NULL) {
|
1994-05-24 10:09:53 +00:00
|
|
|
error = ENOBUFS;
|
2009-04-11 23:35:20 +00:00
|
|
|
IPSTAT_INC(ips_odropped);
|
2003-08-07 18:16:59 +00:00
|
|
|
goto done;
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
2015-04-02 15:47:37 +00:00
|
|
|
/*
|
|
|
|
* Make sure the complete packet header gets copied
|
|
|
|
* from the originating mbuf to the newly created
|
|
|
|
* mbuf. This also ensures that existing firewall
|
|
|
|
* classification(s), VLAN tags and so on get copied
|
|
|
|
* to the resulting fragmented packet(s):
|
|
|
|
*/
|
|
|
|
if (m_dup_pkthdr(m, m0, M_NOWAIT) == 0) {
|
|
|
|
m_free(m);
|
|
|
|
error = ENOBUFS;
|
|
|
|
IPSTAT_INC(ips_odropped);
|
|
|
|
goto done;
|
|
|
|
}
|
2003-08-07 18:16:59 +00:00
|
|
|
/*
|
|
|
|
* In the first mbuf, leave room for the link header, then
|
|
|
|
* copy the original IP header including options. The payload
|
2009-03-04 03:45:34 +00:00
|
|
|
* goes into an additional mbuf chain returned by m_copym().
|
2003-08-07 18:16:59 +00:00
|
|
|
*/
|
1994-05-24 10:09:53 +00:00
|
|
|
m->m_data += max_linkhdr;
|
|
|
|
mhip = mtod(m, struct ip *);
|
|
|
|
*mhip = *ip;
|
|
|
|
if (hlen > sizeof (struct ip)) {
|
|
|
|
mhlen = ip_optcopy(ip, mhip) + sizeof (struct ip);
|
2002-10-20 22:52:07 +00:00
|
|
|
mhip->ip_v = IPVERSION;
|
|
|
|
mhip->ip_hl = mhlen >> 2;
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
m->m_len = mhlen;
|
2012-10-06 10:02:11 +00:00
|
|
|
/* XXX do we need to add ip_off below ? */
|
|
|
|
mhip->ip_off = ((off - hlen) >> 3) + ip_off;
|
2013-08-19 10:30:15 +00:00
|
|
|
if (off + len >= ip_len)
|
2012-10-06 10:02:11 +00:00
|
|
|
len = ip_len - off;
|
2013-08-19 10:30:15 +00:00
|
|
|
else
|
1994-05-24 10:09:53 +00:00
|
|
|
mhip->ip_off |= IP_MF;
|
|
|
|
mhip->ip_len = htons((u_short)(len + mhlen));
|
2012-12-05 08:04:20 +00:00
|
|
|
m->m_next = m_copym(m0, off, len, M_NOWAIT);
|
2004-08-11 10:46:15 +00:00
|
|
|
if (m->m_next == NULL) { /* copy failed */
|
2003-08-07 18:16:59 +00:00
|
|
|
m_free(m);
|
1994-05-24 10:09:53 +00:00
|
|
|
error = ENOBUFS; /* ??? */
|
2009-04-11 23:35:20 +00:00
|
|
|
IPSTAT_INC(ips_odropped);
|
2003-08-07 18:16:59 +00:00
|
|
|
goto done;
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
m->m_pkthdr.len = mhlen + len;
|
2002-07-31 17:21:01 +00:00
|
|
|
#ifdef MAC
|
2007-10-24 19:04:04 +00:00
|
|
|
mac_netinet_fragment(m0, m);
|
2002-07-31 17:21:01 +00:00
|
|
|
#endif
|
2002-02-18 20:35:27 +00:00
|
|
|
mhip->ip_off = htons(mhip->ip_off);
|
1994-05-24 10:09:53 +00:00
|
|
|
mhip->ip_sum = 0;
|
2012-10-26 21:06:33 +00:00
|
|
|
if (m->m_pkthdr.csum_flags & CSUM_IP & ~if_hwassist_flags) {
|
2002-10-20 22:52:07 +00:00
|
|
|
mhip->ip_sum = in_cksum(m, mhlen);
|
2012-10-26 21:06:33 +00:00
|
|
|
m->m_pkthdr.csum_flags &= ~CSUM_IP;
|
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
*mnext = m;
|
|
|
|
mnext = &m->m_nextpkt;
|
|
|
|
}
|
2009-04-11 23:35:20 +00:00
|
|
|
IPSTAT_ADD(ips_ofragments, nfrags);
|
2000-03-27 19:14:27 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
2002-11-20 18:56:25 +00:00
|
|
|
* Update first fragment by trimming what's been copied out
|
2003-08-07 18:16:59 +00:00
|
|
|
* and updating header.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
2012-10-06 10:02:11 +00:00
|
|
|
m_adj(m0, hlen + firstlen - ip_len);
|
2003-08-07 18:16:59 +00:00
|
|
|
m0->m_pkthdr.len = hlen + firstlen;
|
|
|
|
ip->ip_len = htons((u_short)m0->m_pkthdr.len);
|
2012-10-06 10:02:11 +00:00
|
|
|
ip->ip_off = htons(ip_off | IP_MF);
|
1994-05-24 10:09:53 +00:00
|
|
|
ip->ip_sum = 0;
|
2012-10-26 21:06:33 +00:00
|
|
|
if (m0->m_pkthdr.csum_flags & CSUM_IP & ~if_hwassist_flags) {
|
2003-08-07 18:16:59 +00:00
|
|
|
ip->ip_sum = in_cksum(m0, hlen);
|
2012-10-26 21:06:33 +00:00
|
|
|
m0->m_pkthdr.csum_flags &= ~CSUM_IP;
|
|
|
|
}
|
2002-11-20 18:56:25 +00:00
|
|
|
|
|
|
|
done:
|
2003-08-07 18:16:59 +00:00
|
|
|
*m_frag = m0;
|
|
|
|
return error;
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
|
2000-05-21 21:26:06 +00:00
|
|
|
void
|
2000-03-27 19:14:27 +00:00
|
|
|
in_delayed_cksum(struct mbuf *m)
|
|
|
|
{
|
|
|
|
struct ip *ip;
|
2018-06-06 07:04:40 +00:00
|
|
|
struct udphdr *uh;
|
|
|
|
uint16_t cklen, csum, offset;
|
2000-03-27 19:14:27 +00:00
|
|
|
|
|
|
|
ip = mtod(m, struct ip *);
|
2002-10-20 22:52:07 +00:00
|
|
|
offset = ip->ip_hl << 2 ;
|
2018-06-06 07:04:40 +00:00
|
|
|
|
|
|
|
if (m->m_pkthdr.csum_flags & CSUM_UDP) {
|
|
|
|
/* if udp header is not in the first mbuf copy udplen */
|
2018-10-05 12:51:30 +00:00
|
|
|
if (offset + sizeof(struct udphdr) > m->m_len) {
|
2018-06-06 13:01:53 +00:00
|
|
|
m_copydata(m, offset + offsetof(struct udphdr,
|
|
|
|
uh_ulen), sizeof(cklen), (caddr_t)&cklen);
|
2018-10-05 12:51:30 +00:00
|
|
|
cklen = ntohs(cklen);
|
|
|
|
} else {
|
2018-06-06 13:01:53 +00:00
|
|
|
uh = (struct udphdr *)mtodo(m, offset);
|
2018-06-06 07:04:40 +00:00
|
|
|
cklen = ntohs(uh->uh_ulen);
|
|
|
|
}
|
|
|
|
csum = in_cksum_skip(m, cklen + offset, offset);
|
|
|
|
if (csum == 0)
|
|
|
|
csum = 0xffff;
|
|
|
|
} else {
|
|
|
|
cklen = ntohs(ip->ip_len);
|
|
|
|
csum = in_cksum_skip(m, cklen, offset);
|
|
|
|
}
|
2000-03-27 19:14:27 +00:00
|
|
|
offset += m->m_pkthdr.csum_data; /* checksum offset */
|
|
|
|
|
2018-06-06 13:01:53 +00:00
|
|
|
if (offset + sizeof(csum) > m->m_len)
|
|
|
|
m_copyback(m, offset, sizeof(csum), (caddr_t)&csum);
|
|
|
|
else
|
|
|
|
*(u_short *)mtodo(m, offset) = csum;
|
2000-03-27 19:14:27 +00:00
|
|
|
}
|
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
|
|
|
* IP socket option processing.
|
|
|
|
*/
|
|
|
|
int
|
2007-05-10 15:58:48 +00:00
|
|
|
ip_ctloutput(struct socket *so, struct sockopt *sopt)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
Widen NET_EPOCH coverage.
When epoch(9) was introduced to network stack, it was basically
dropped in place of existing locking, which was mutexes and
rwlocks. For the sake of performance mutex covered areas were
as small as possible, so became epoch covered areas.
However, epoch doesn't introduce any contention, it just delays
memory reclaim. So, there is no point to minimise epoch covered
areas in sense of performance. Meanwhile entering/exiting epoch
also has non-zero CPU usage, so doing this less often is a win.
Not the least is also code maintainability. In the new paradigm
we can assume that at any stage of processing a packet, we are
inside network epoch. This makes coding both input and output
path way easier.
On output path we already enter epoch quite early - in the
ip_output(), in the ip6_output().
This patch does the same for the input path. All ISR processing,
network related callouts, other ways of packet injection to the
network stack shall be performed in net_epoch. Any leaf function
that walks network configuration now asserts epoch.
Tricky part is configuration code paths - ioctls, sysctls. They
also call into leaf functions, so some need to be changed.
This patch would introduce more epoch recursions (see EPOCH_TRACE)
than we had before. They will be cleaned up separately, as several
of them aren't trivial. Note, that unlike a lock recursion the
epoch recursion is safe and just wastes a bit of resources.
Reviewed by: gallatin, hselasky, cy, adrian, kristof
Differential Revision: https://reviews.freebsd.org/D19111
2019-10-07 22:40:05 +00:00
|
|
|
struct inpcb *inp = sotoinpcb(so);
|
1998-08-23 03:07:17 +00:00
|
|
|
int error, optval;
|
2014-06-27 19:07:00 +00:00
|
|
|
#ifdef RSS
|
|
|
|
uint32_t rss_bucket;
|
|
|
|
int retval;
|
|
|
|
#endif
|
1994-05-24 10:09:53 +00:00
|
|
|
|
1998-08-23 03:07:17 +00:00
|
|
|
error = optval = 0;
|
|
|
|
if (sopt->sopt_level != IPPROTO_IP) {
|
2011-11-06 10:47:20 +00:00
|
|
|
error = EINVAL;
|
|
|
|
|
|
|
|
if (sopt->sopt_level == SOL_SOCKET &&
|
|
|
|
sopt->sopt_dir == SOPT_SET) {
|
|
|
|
switch (sopt->sopt_name) {
|
|
|
|
case SO_REUSEADDR:
|
|
|
|
INP_WLOCK(inp);
|
2013-07-04 18:38:00 +00:00
|
|
|
if ((so->so_options & SO_REUSEADDR) != 0)
|
|
|
|
inp->inp_flags2 |= INP_REUSEADDR;
|
|
|
|
else
|
|
|
|
inp->inp_flags2 &= ~INP_REUSEADDR;
|
2011-11-06 10:47:20 +00:00
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
error = 0;
|
|
|
|
break;
|
|
|
|
case SO_REUSEPORT:
|
|
|
|
INP_WLOCK(inp);
|
|
|
|
if ((so->so_options & SO_REUSEPORT) != 0)
|
|
|
|
inp->inp_flags2 |= INP_REUSEPORT;
|
|
|
|
else
|
|
|
|
inp->inp_flags2 &= ~INP_REUSEPORT;
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
error = 0;
|
|
|
|
break;
|
2018-06-06 15:45:57 +00:00
|
|
|
case SO_REUSEPORT_LB:
|
|
|
|
INP_WLOCK(inp);
|
|
|
|
if ((so->so_options & SO_REUSEPORT_LB) != 0)
|
|
|
|
inp->inp_flags2 |= INP_REUSEPORT_LB;
|
|
|
|
else
|
|
|
|
inp->inp_flags2 &= ~INP_REUSEPORT_LB;
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
error = 0;
|
|
|
|
break;
|
2011-11-06 10:47:20 +00:00
|
|
|
case SO_SETFIB:
|
|
|
|
INP_WLOCK(inp);
|
|
|
|
inp->inp_inc.inc_fibnum = so->so_fibnum;
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
error = 0;
|
|
|
|
break;
|
Implement kernel support for hardware rate limited sockets.
- Add RATELIMIT kernel configuration keyword which must be set to
enable the new functionality.
- Add support for hardware driven, Receive Side Scaling, RSS aware, rate
limited sendqueues and expose the functionality through the already
established SO_MAX_PACING_RATE setsockopt(). The API support rates in
the range from 1 to 4Gbytes/s which are suitable for regular TCP and
UDP streams. The setsockopt(2) manual page has been updated.
- Add rate limit function callback API to "struct ifnet" which supports
the following operations: if_snd_tag_alloc(), if_snd_tag_modify(),
if_snd_tag_query() and if_snd_tag_free().
- Add support to ifconfig to view, set and clear the IFCAP_TXRTLMT
flag, which tells if a network driver supports rate limiting or not.
- This patch also adds support for rate limiting through VLAN and LAGG
intermediate network devices.
- How rate limiting works:
1) The userspace application calls setsockopt() after accepting or
making a new connection to set the rate which is then stored in the
socket structure in the kernel. Later on when packets are transmitted
a check is made in the transmit path for rate changes. A rate change
implies a non-blocking ifp->if_snd_tag_alloc() call will be made to the
destination network interface, which then sets up a custom sendqueue
with the given rate limitation parameter. A "struct m_snd_tag" pointer is
returned which serves as a "snd_tag" hint in the m_pkthdr for the
subsequently transmitted mbufs.
2) When the network driver sees the "m->m_pkthdr.snd_tag" different
from NULL, it will move the packets into a designated rate limited sendqueue
given by the snd_tag pointer. It is up to the individual drivers how the rate
limited traffic will be rate limited.
3) Route changes are detected by the NIC drivers in the ifp->if_transmit()
routine when the ifnet pointer in the incoming snd_tag mismatches the
one of the network interface. The network adapter frees the mbuf and
returns EAGAIN which causes the ip_output() to release and clear the send
tag. Upon next ip_output() a new "snd_tag" will be tried allocated.
4) When the PCB is detached the custom sendqueue will be released by a
non-blocking ifp->if_snd_tag_free() call to the currently bound network
interface.
Reviewed by: wblock (manpages), adrian, gallatin, scottl (network)
Differential Revision: https://reviews.freebsd.org/D3687
Sponsored by: Mellanox Technologies
MFC after: 3 months
2017-01-18 13:31:17 +00:00
|
|
|
case SO_MAX_PACING_RATE:
|
|
|
|
#ifdef RATELIMIT
|
|
|
|
INP_WLOCK(inp);
|
|
|
|
inp->inp_flags2 |= INP_RATE_LIMIT_CHANGED;
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
error = 0;
|
|
|
|
#else
|
|
|
|
error = EOPNOTSUPP;
|
|
|
|
#endif
|
|
|
|
break;
|
2011-11-06 10:47:20 +00:00
|
|
|
default:
|
|
|
|
break;
|
|
|
|
}
|
2008-11-19 19:19:30 +00:00
|
|
|
}
|
2011-11-06 10:47:20 +00:00
|
|
|
return (error);
|
1998-08-23 03:07:17 +00:00
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
|
1998-08-23 03:07:17 +00:00
|
|
|
switch (sopt->sopt_dir) {
|
|
|
|
case SOPT_SET:
|
|
|
|
switch (sopt->sopt_name) {
|
1994-05-24 10:09:53 +00:00
|
|
|
case IP_OPTIONS:
|
|
|
|
#ifdef notyet
|
|
|
|
case IP_RETOPTS:
|
|
|
|
#endif
|
1998-08-23 03:07:17 +00:00
|
|
|
{
|
|
|
|
struct mbuf *m;
|
|
|
|
if (sopt->sopt_valsize > MLEN) {
|
|
|
|
error = EMSGSIZE;
|
|
|
|
break;
|
|
|
|
}
|
2013-03-15 12:55:30 +00:00
|
|
|
m = m_get(sopt->sopt_td ? M_WAITOK : M_NOWAIT, MT_DATA);
|
2004-08-11 10:46:15 +00:00
|
|
|
if (m == NULL) {
|
1998-08-23 03:07:17 +00:00
|
|
|
error = ENOBUFS;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
m->m_len = sopt->sopt_valsize;
|
|
|
|
error = sooptcopyin(sopt, mtod(m, char *), m->m_len,
|
|
|
|
m->m_len);
|
2006-05-21 17:52:08 +00:00
|
|
|
if (error) {
|
|
|
|
m_free(m);
|
|
|
|
break;
|
|
|
|
}
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK(inp);
|
2004-12-05 19:11:09 +00:00
|
|
|
error = ip_pcbopts(inp, sopt->sopt_name, m);
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2004-12-05 19:11:09 +00:00
|
|
|
return (error);
|
1998-08-23 03:07:17 +00:00
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2009-06-01 10:30:00 +00:00
|
|
|
case IP_BINDANY:
|
|
|
|
if (sopt->sopt_td != NULL) {
|
|
|
|
error = priv_check(sopt->sopt_td,
|
|
|
|
PRIV_NETINET_BINDANY);
|
|
|
|
if (error)
|
|
|
|
break;
|
2009-01-09 17:21:22 +00:00
|
|
|
}
|
|
|
|
/* FALLTHROUGH */
|
2014-07-10 03:10:56 +00:00
|
|
|
case IP_BINDMULTI:
|
|
|
|
#ifdef RSS
|
|
|
|
case IP_RSS_LISTEN_BUCKET:
|
|
|
|
#endif
|
1994-05-24 10:09:53 +00:00
|
|
|
case IP_TOS:
|
|
|
|
case IP_TTL:
|
2005-08-22 16:13:08 +00:00
|
|
|
case IP_MINTTL:
|
1994-05-24 10:09:53 +00:00
|
|
|
case IP_RECVOPTS:
|
|
|
|
case IP_RECVRETOPTS:
|
2017-03-06 04:01:58 +00:00
|
|
|
case IP_ORIGDSTADDR:
|
1994-05-24 10:09:53 +00:00
|
|
|
case IP_RECVDSTADDR:
|
2003-04-29 21:36:18 +00:00
|
|
|
case IP_RECVTTL:
|
1996-11-11 04:56:32 +00:00
|
|
|
case IP_RECVIF:
|
2003-08-20 14:46:40 +00:00
|
|
|
case IP_ONESBCAST:
|
2005-09-26 20:25:16 +00:00
|
|
|
case IP_DONTFRAG:
|
2012-06-12 14:02:38 +00:00
|
|
|
case IP_RECVTOS:
|
2014-09-09 01:45:39 +00:00
|
|
|
case IP_RECVFLOWID:
|
|
|
|
#ifdef RSS
|
|
|
|
case IP_RECVRSSBUCKETID:
|
|
|
|
#endif
|
1998-08-23 03:07:17 +00:00
|
|
|
error = sooptcopyin(sopt, &optval, sizeof optval,
|
|
|
|
sizeof optval);
|
|
|
|
if (error)
|
|
|
|
break;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
1998-08-23 03:07:17 +00:00
|
|
|
switch (sopt->sopt_name) {
|
|
|
|
case IP_TOS:
|
|
|
|
inp->inp_ip_tos = optval;
|
|
|
|
break;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
1998-08-23 03:07:17 +00:00
|
|
|
case IP_TTL:
|
|
|
|
inp->inp_ip_ttl = optval;
|
|
|
|
break;
|
2005-08-22 16:13:08 +00:00
|
|
|
|
|
|
|
case IP_MINTTL:
|
2009-01-03 11:35:31 +00:00
|
|
|
if (optval >= 0 && optval <= MAXTTL)
|
2005-08-22 16:13:08 +00:00
|
|
|
inp->inp_ip_minttl = optval;
|
|
|
|
else
|
|
|
|
error = EINVAL;
|
|
|
|
break;
|
|
|
|
|
2004-06-24 02:05:47 +00:00
|
|
|
#define OPTSET(bit) do { \
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK(inp); \
|
2004-06-24 02:05:47 +00:00
|
|
|
if (optval) \
|
|
|
|
inp->inp_flags |= bit; \
|
|
|
|
else \
|
|
|
|
inp->inp_flags &= ~bit; \
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp); \
|
2004-06-24 02:05:47 +00:00
|
|
|
} while (0)
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2014-07-10 03:10:56 +00:00
|
|
|
#define OPTSET2(bit, val) do { \
|
|
|
|
INP_WLOCK(inp); \
|
|
|
|
if (val) \
|
|
|
|
inp->inp_flags2 |= bit; \
|
|
|
|
else \
|
|
|
|
inp->inp_flags2 &= ~bit; \
|
|
|
|
INP_WUNLOCK(inp); \
|
|
|
|
} while (0)
|
|
|
|
|
1998-08-23 03:07:17 +00:00
|
|
|
case IP_RECVOPTS:
|
|
|
|
OPTSET(INP_RECVOPTS);
|
|
|
|
break;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
1998-08-23 03:07:17 +00:00
|
|
|
case IP_RECVRETOPTS:
|
|
|
|
OPTSET(INP_RECVRETOPTS);
|
|
|
|
break;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
1998-08-23 03:07:17 +00:00
|
|
|
case IP_RECVDSTADDR:
|
|
|
|
OPTSET(INP_RECVDSTADDR);
|
|
|
|
break;
|
1996-11-11 04:56:32 +00:00
|
|
|
|
2017-03-06 04:01:58 +00:00
|
|
|
case IP_ORIGDSTADDR:
|
|
|
|
OPTSET2(INP_ORIGDSTADDR, optval);
|
|
|
|
break;
|
|
|
|
|
2003-04-29 21:36:18 +00:00
|
|
|
case IP_RECVTTL:
|
|
|
|
OPTSET(INP_RECVTTL);
|
|
|
|
break;
|
|
|
|
|
1998-08-23 03:07:17 +00:00
|
|
|
case IP_RECVIF:
|
|
|
|
OPTSET(INP_RECVIF);
|
|
|
|
break;
|
1999-12-22 19:13:38 +00:00
|
|
|
|
2003-08-20 14:46:40 +00:00
|
|
|
case IP_ONESBCAST:
|
|
|
|
OPTSET(INP_ONESBCAST);
|
|
|
|
break;
|
2005-09-26 20:25:16 +00:00
|
|
|
case IP_DONTFRAG:
|
|
|
|
OPTSET(INP_DONTFRAG);
|
|
|
|
break;
|
2009-06-01 10:30:00 +00:00
|
|
|
case IP_BINDANY:
|
|
|
|
OPTSET(INP_BINDANY);
|
2009-01-09 16:02:19 +00:00
|
|
|
break;
|
2012-06-12 14:02:38 +00:00
|
|
|
case IP_RECVTOS:
|
|
|
|
OPTSET(INP_RECVTOS);
|
|
|
|
break;
|
2014-07-10 03:10:56 +00:00
|
|
|
case IP_BINDMULTI:
|
|
|
|
OPTSET2(INP_BINDMULTI, optval);
|
|
|
|
break;
|
2014-09-09 01:45:39 +00:00
|
|
|
case IP_RECVFLOWID:
|
|
|
|
OPTSET2(INP_RECVFLOWID, optval);
|
|
|
|
break;
|
2014-07-10 03:10:56 +00:00
|
|
|
#ifdef RSS
|
|
|
|
case IP_RSS_LISTEN_BUCKET:
|
|
|
|
if ((optval >= 0) &&
|
|
|
|
(optval < rss_getnumbuckets())) {
|
|
|
|
inp->inp_rss_listen_bucket = optval;
|
|
|
|
OPTSET2(INP_RSS_BUCKET_SET, 1);
|
|
|
|
} else {
|
|
|
|
error = EINVAL;
|
|
|
|
}
|
|
|
|
break;
|
2014-09-09 01:45:39 +00:00
|
|
|
case IP_RECVRSSBUCKETID:
|
|
|
|
OPTSET2(INP_RECVRSSBUCKETID, optval);
|
|
|
|
break;
|
2014-07-10 03:10:56 +00:00
|
|
|
#endif
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
break;
|
|
|
|
#undef OPTSET
|
2014-07-10 03:10:56 +00:00
|
|
|
#undef OPTSET2
|
1994-05-24 10:09:53 +00:00
|
|
|
|
Import rewrite of IPv4 socket multicast layer to support source-specific
and protocol-independent host mode multicast. The code is written to
accomodate IPv6, IGMPv3 and MLDv2 with only a little additional work.
This change only pertains to FreeBSD's use as a multicast end-station and
does not concern multicast routing; for an IGMPv3/MLDv2 router
implementation, consider the XORP project.
The work is based on Wilbert de Graaf's IGMPv3 code drop for FreeBSD 4.6,
which is available at: http://www.kloosterhof.com/wilbert/igmpv3.html
Summary
* IPv4 multicast socket processing is now moved out of ip_output.c
into a new module, in_mcast.c.
* The in_mcast.c module implements the IPv4 legacy any-source API in
terms of the protocol-independent source-specific API.
* Source filters are lazy allocated as the common case does not use them.
They are part of per inpcb state and are covered by the inpcb lock.
* struct ip_mreqn is now supported to allow applications to specify
multicast joins by interface index in the legacy IPv4 any-source API.
* In UDP, an incoming multicast datagram only requires that the source
port matches the 4-tuple if the socket was already bound by source port.
An unbound socket SHOULD be able to receive multicasts sent from an
ephemeral source port.
* The UDP socket multicast filter mode defaults to exclusive, that is,
sources present in the per-socket list will be blocked from delivery.
* The RFC 3678 userland functions have been added to libc: setsourcefilter,
getsourcefilter, setipv4sourcefilter, getipv4sourcefilter.
* Definitions for IGMPv3 are merged but not yet used.
* struct sockaddr_storage is now referenced from <netinet/in.h>. It
is therefore defined there if not already declared in the same way
as for the C99 types.
* The RFC 1724 hack (specify 0.0.0.0/8 addresses to IP_MULTICAST_IF
which are then interpreted as interface indexes) is now deprecated.
* A patch for the Rhyolite.com routed in the FreeBSD base system
is available in the -net archives. This only affects individuals
running RIPv1 or RIPv2 via point-to-point and/or unnumbered interfaces.
* Make IPv6 detach path similar to IPv4's in code flow; functionally same.
* Bump __FreeBSD_version to 700048; see UPDATING.
This work was financially supported by another FreeBSD committer.
Obtained from: p4://bms_netdev
Submitted by: Wilbert de Graaf (original work)
Reviewed by: rwatson (locking), silence from fenner,
net@ (but with encouragement)
2007-06-12 16:24:56 +00:00
|
|
|
/*
|
|
|
|
* Multicast socket options are processed by the in_mcast
|
|
|
|
* module.
|
|
|
|
*/
|
1994-05-24 10:09:53 +00:00
|
|
|
case IP_MULTICAST_IF:
|
1994-09-06 22:42:31 +00:00
|
|
|
case IP_MULTICAST_VIF:
|
1994-05-24 10:09:53 +00:00
|
|
|
case IP_MULTICAST_TTL:
|
|
|
|
case IP_MULTICAST_LOOP:
|
|
|
|
case IP_ADD_MEMBERSHIP:
|
|
|
|
case IP_DROP_MEMBERSHIP:
|
Import rewrite of IPv4 socket multicast layer to support source-specific
and protocol-independent host mode multicast. The code is written to
accomodate IPv6, IGMPv3 and MLDv2 with only a little additional work.
This change only pertains to FreeBSD's use as a multicast end-station and
does not concern multicast routing; for an IGMPv3/MLDv2 router
implementation, consider the XORP project.
The work is based on Wilbert de Graaf's IGMPv3 code drop for FreeBSD 4.6,
which is available at: http://www.kloosterhof.com/wilbert/igmpv3.html
Summary
* IPv4 multicast socket processing is now moved out of ip_output.c
into a new module, in_mcast.c.
* The in_mcast.c module implements the IPv4 legacy any-source API in
terms of the protocol-independent source-specific API.
* Source filters are lazy allocated as the common case does not use them.
They are part of per inpcb state and are covered by the inpcb lock.
* struct ip_mreqn is now supported to allow applications to specify
multicast joins by interface index in the legacy IPv4 any-source API.
* In UDP, an incoming multicast datagram only requires that the source
port matches the 4-tuple if the socket was already bound by source port.
An unbound socket SHOULD be able to receive multicasts sent from an
ephemeral source port.
* The UDP socket multicast filter mode defaults to exclusive, that is,
sources present in the per-socket list will be blocked from delivery.
* The RFC 3678 userland functions have been added to libc: setsourcefilter,
getsourcefilter, setipv4sourcefilter, getipv4sourcefilter.
* Definitions for IGMPv3 are merged but not yet used.
* struct sockaddr_storage is now referenced from <netinet/in.h>. It
is therefore defined there if not already declared in the same way
as for the C99 types.
* The RFC 1724 hack (specify 0.0.0.0/8 addresses to IP_MULTICAST_IF
which are then interpreted as interface indexes) is now deprecated.
* A patch for the Rhyolite.com routed in the FreeBSD base system
is available in the -net archives. This only affects individuals
running RIPv1 or RIPv2 via point-to-point and/or unnumbered interfaces.
* Make IPv6 detach path similar to IPv4's in code flow; functionally same.
* Bump __FreeBSD_version to 700048; see UPDATING.
This work was financially supported by another FreeBSD committer.
Obtained from: p4://bms_netdev
Submitted by: Wilbert de Graaf (original work)
Reviewed by: rwatson (locking), silence from fenner,
net@ (but with encouragement)
2007-06-12 16:24:56 +00:00
|
|
|
case IP_ADD_SOURCE_MEMBERSHIP:
|
|
|
|
case IP_DROP_SOURCE_MEMBERSHIP:
|
|
|
|
case IP_BLOCK_SOURCE:
|
|
|
|
case IP_UNBLOCK_SOURCE:
|
|
|
|
case IP_MSFILTER:
|
|
|
|
case MCAST_JOIN_GROUP:
|
|
|
|
case MCAST_LEAVE_GROUP:
|
|
|
|
case MCAST_JOIN_SOURCE_GROUP:
|
|
|
|
case MCAST_LEAVE_SOURCE_GROUP:
|
|
|
|
case MCAST_BLOCK_SOURCE:
|
|
|
|
case MCAST_UNBLOCK_SOURCE:
|
|
|
|
error = inp_setmoptions(inp, sopt);
|
1994-05-24 10:09:53 +00:00
|
|
|
break;
|
|
|
|
|
1996-02-22 21:32:23 +00:00
|
|
|
case IP_PORTRANGE:
|
1998-08-23 03:07:17 +00:00
|
|
|
error = sooptcopyin(sopt, &optval, sizeof optval,
|
|
|
|
sizeof optval);
|
|
|
|
if (error)
|
|
|
|
break;
|
1996-02-22 21:32:23 +00:00
|
|
|
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK(inp);
|
1998-08-23 03:07:17 +00:00
|
|
|
switch (optval) {
|
|
|
|
case IP_PORTRANGE_DEFAULT:
|
|
|
|
inp->inp_flags &= ~(INP_LOWPORT);
|
|
|
|
inp->inp_flags &= ~(INP_HIGHPORT);
|
|
|
|
break;
|
1996-02-22 21:32:23 +00:00
|
|
|
|
1998-08-23 03:07:17 +00:00
|
|
|
case IP_PORTRANGE_HIGH:
|
|
|
|
inp->inp_flags &= ~(INP_LOWPORT);
|
|
|
|
inp->inp_flags |= INP_HIGHPORT;
|
|
|
|
break;
|
1996-02-22 21:32:23 +00:00
|
|
|
|
1998-08-23 03:07:17 +00:00
|
|
|
case IP_PORTRANGE_LOW:
|
|
|
|
inp->inp_flags &= ~(INP_HIGHPORT);
|
|
|
|
inp->inp_flags |= INP_LOWPORT;
|
|
|
|
break;
|
1996-02-22 21:32:23 +00:00
|
|
|
|
1998-08-23 03:07:17 +00:00
|
|
|
default:
|
|
|
|
error = EINVAL;
|
|
|
|
break;
|
1996-02-22 21:32:23 +00:00
|
|
|
}
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp);
|
1996-05-21 20:47:31 +00:00
|
|
|
break;
|
1996-02-22 21:32:23 +00:00
|
|
|
|
2017-02-06 08:49:57 +00:00
|
|
|
#if defined(IPSEC) || defined(IPSEC_SUPPORT)
|
1999-12-22 19:13:38 +00:00
|
|
|
case IP_IPSEC_POLICY:
|
2017-02-06 08:49:57 +00:00
|
|
|
if (IPSEC_ENABLED(ipv4)) {
|
|
|
|
error = IPSEC_PCBCTL(ipv4, inp, sopt);
|
1999-12-22 19:13:38 +00:00
|
|
|
break;
|
2017-02-06 08:49:57 +00:00
|
|
|
}
|
|
|
|
/* FALLTHROUGH */
|
2007-07-03 12:13:45 +00:00
|
|
|
#endif /* IPSEC */
|
1999-12-22 19:13:38 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
default:
|
|
|
|
error = ENOPROTOOPT;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
|
1998-08-23 03:07:17 +00:00
|
|
|
case SOPT_GET:
|
|
|
|
switch (sopt->sopt_name) {
|
1994-05-24 10:09:53 +00:00
|
|
|
case IP_OPTIONS:
|
|
|
|
case IP_RETOPTS:
|
2018-07-22 20:02:14 +00:00
|
|
|
INP_RLOCK(inp);
|
|
|
|
if (inp->inp_options) {
|
|
|
|
struct mbuf *options;
|
|
|
|
|
2019-01-09 06:36:57 +00:00
|
|
|
options = m_copym(inp->inp_options, 0,
|
|
|
|
M_COPYALL, M_NOWAIT);
|
2018-07-22 20:02:14 +00:00
|
|
|
INP_RUNLOCK(inp);
|
|
|
|
if (options != NULL) {
|
|
|
|
error = sooptcopyout(sopt,
|
|
|
|
mtod(options, char *),
|
|
|
|
options->m_len);
|
|
|
|
m_freem(options);
|
|
|
|
} else
|
|
|
|
error = ENOMEM;
|
|
|
|
} else {
|
|
|
|
INP_RUNLOCK(inp);
|
1998-08-23 03:07:17 +00:00
|
|
|
sopt->sopt_valsize = 0;
|
2018-07-22 20:02:14 +00:00
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
break;
|
|
|
|
|
|
|
|
case IP_TOS:
|
|
|
|
case IP_TTL:
|
2005-08-22 16:13:08 +00:00
|
|
|
case IP_MINTTL:
|
1994-05-24 10:09:53 +00:00
|
|
|
case IP_RECVOPTS:
|
|
|
|
case IP_RECVRETOPTS:
|
2017-03-06 04:01:58 +00:00
|
|
|
case IP_ORIGDSTADDR:
|
1994-05-24 10:09:53 +00:00
|
|
|
case IP_RECVDSTADDR:
|
2003-04-29 21:36:18 +00:00
|
|
|
case IP_RECVTTL:
|
1996-11-11 04:56:32 +00:00
|
|
|
case IP_RECVIF:
|
1998-08-23 03:07:17 +00:00
|
|
|
case IP_PORTRANGE:
|
2003-08-20 14:46:40 +00:00
|
|
|
case IP_ONESBCAST:
|
2005-09-26 20:25:16 +00:00
|
|
|
case IP_DONTFRAG:
|
2010-09-24 14:38:54 +00:00
|
|
|
case IP_BINDANY:
|
2012-06-12 14:02:38 +00:00
|
|
|
case IP_RECVTOS:
|
2014-07-10 03:10:56 +00:00
|
|
|
case IP_BINDMULTI:
|
2014-05-18 22:37:31 +00:00
|
|
|
case IP_FLOWID:
|
|
|
|
case IP_FLOWTYPE:
|
2014-09-09 01:45:39 +00:00
|
|
|
case IP_RECVFLOWID:
|
2014-07-10 03:10:56 +00:00
|
|
|
#ifdef RSS
|
|
|
|
case IP_RSSBUCKETID:
|
2014-09-09 01:45:39 +00:00
|
|
|
case IP_RECVRSSBUCKETID:
|
2014-07-10 03:10:56 +00:00
|
|
|
#endif
|
1998-08-23 03:07:17 +00:00
|
|
|
switch (sopt->sopt_name) {
|
1994-05-24 10:09:53 +00:00
|
|
|
|
|
|
|
case IP_TOS:
|
1997-04-03 05:14:45 +00:00
|
|
|
optval = inp->inp_ip_tos;
|
1994-05-24 10:09:53 +00:00
|
|
|
break;
|
|
|
|
|
|
|
|
case IP_TTL:
|
1997-04-03 05:14:45 +00:00
|
|
|
optval = inp->inp_ip_ttl;
|
1994-05-24 10:09:53 +00:00
|
|
|
break;
|
|
|
|
|
2005-08-22 16:13:08 +00:00
|
|
|
case IP_MINTTL:
|
|
|
|
optval = inp->inp_ip_minttl;
|
|
|
|
break;
|
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
#define OPTBIT(bit) (inp->inp_flags & bit ? 1 : 0)
|
2014-07-10 03:10:56 +00:00
|
|
|
#define OPTBIT2(bit) (inp->inp_flags2 & bit ? 1 : 0)
|
1994-05-24 10:09:53 +00:00
|
|
|
|
|
|
|
case IP_RECVOPTS:
|
|
|
|
optval = OPTBIT(INP_RECVOPTS);
|
|
|
|
break;
|
|
|
|
|
|
|
|
case IP_RECVRETOPTS:
|
|
|
|
optval = OPTBIT(INP_RECVRETOPTS);
|
|
|
|
break;
|
|
|
|
|
|
|
|
case IP_RECVDSTADDR:
|
|
|
|
optval = OPTBIT(INP_RECVDSTADDR);
|
|
|
|
break;
|
1996-11-11 04:56:32 +00:00
|
|
|
|
2017-03-06 04:01:58 +00:00
|
|
|
case IP_ORIGDSTADDR:
|
|
|
|
optval = OPTBIT2(INP_ORIGDSTADDR);
|
|
|
|
break;
|
|
|
|
|
2003-04-29 21:36:18 +00:00
|
|
|
case IP_RECVTTL:
|
|
|
|
optval = OPTBIT(INP_RECVTTL);
|
|
|
|
break;
|
|
|
|
|
1996-11-11 04:56:32 +00:00
|
|
|
case IP_RECVIF:
|
|
|
|
optval = OPTBIT(INP_RECVIF);
|
|
|
|
break;
|
1998-08-23 03:07:17 +00:00
|
|
|
|
|
|
|
case IP_PORTRANGE:
|
|
|
|
if (inp->inp_flags & INP_HIGHPORT)
|
|
|
|
optval = IP_PORTRANGE_HIGH;
|
|
|
|
else if (inp->inp_flags & INP_LOWPORT)
|
|
|
|
optval = IP_PORTRANGE_LOW;
|
|
|
|
else
|
|
|
|
optval = 0;
|
|
|
|
break;
|
1999-12-22 19:13:38 +00:00
|
|
|
|
2003-08-20 14:46:40 +00:00
|
|
|
case IP_ONESBCAST:
|
|
|
|
optval = OPTBIT(INP_ONESBCAST);
|
|
|
|
break;
|
2005-09-26 20:25:16 +00:00
|
|
|
case IP_DONTFRAG:
|
|
|
|
optval = OPTBIT(INP_DONTFRAG);
|
|
|
|
break;
|
2010-09-24 14:38:54 +00:00
|
|
|
case IP_BINDANY:
|
|
|
|
optval = OPTBIT(INP_BINDANY);
|
|
|
|
break;
|
2012-06-12 14:02:38 +00:00
|
|
|
case IP_RECVTOS:
|
|
|
|
optval = OPTBIT(INP_RECVTOS);
|
|
|
|
break;
|
2014-05-18 22:37:31 +00:00
|
|
|
case IP_FLOWID:
|
|
|
|
optval = inp->inp_flowid;
|
|
|
|
break;
|
|
|
|
case IP_FLOWTYPE:
|
|
|
|
optval = inp->inp_flowtype;
|
|
|
|
break;
|
2014-09-09 01:45:39 +00:00
|
|
|
case IP_RECVFLOWID:
|
|
|
|
optval = OPTBIT2(INP_RECVFLOWID);
|
|
|
|
break;
|
2014-05-18 22:37:31 +00:00
|
|
|
#ifdef RSS
|
2014-06-26 04:12:41 +00:00
|
|
|
case IP_RSSBUCKETID:
|
|
|
|
retval = rss_hash2bucket(inp->inp_flowid,
|
|
|
|
inp->inp_flowtype,
|
|
|
|
&rss_bucket);
|
|
|
|
if (retval == 0)
|
|
|
|
optval = rss_bucket;
|
|
|
|
else
|
|
|
|
error = EINVAL;
|
2014-05-18 22:37:31 +00:00
|
|
|
break;
|
2014-09-09 01:45:39 +00:00
|
|
|
case IP_RECVRSSBUCKETID:
|
|
|
|
optval = OPTBIT2(INP_RECVRSSBUCKETID);
|
|
|
|
break;
|
2014-05-18 22:37:31 +00:00
|
|
|
#endif
|
2014-07-10 03:10:56 +00:00
|
|
|
case IP_BINDMULTI:
|
|
|
|
optval = OPTBIT2(INP_BINDMULTI);
|
|
|
|
break;
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
1998-08-23 03:07:17 +00:00
|
|
|
error = sooptcopyout(sopt, &optval, sizeof optval);
|
1994-05-24 10:09:53 +00:00
|
|
|
break;
|
|
|
|
|
Import rewrite of IPv4 socket multicast layer to support source-specific
and protocol-independent host mode multicast. The code is written to
accomodate IPv6, IGMPv3 and MLDv2 with only a little additional work.
This change only pertains to FreeBSD's use as a multicast end-station and
does not concern multicast routing; for an IGMPv3/MLDv2 router
implementation, consider the XORP project.
The work is based on Wilbert de Graaf's IGMPv3 code drop for FreeBSD 4.6,
which is available at: http://www.kloosterhof.com/wilbert/igmpv3.html
Summary
* IPv4 multicast socket processing is now moved out of ip_output.c
into a new module, in_mcast.c.
* The in_mcast.c module implements the IPv4 legacy any-source API in
terms of the protocol-independent source-specific API.
* Source filters are lazy allocated as the common case does not use them.
They are part of per inpcb state and are covered by the inpcb lock.
* struct ip_mreqn is now supported to allow applications to specify
multicast joins by interface index in the legacy IPv4 any-source API.
* In UDP, an incoming multicast datagram only requires that the source
port matches the 4-tuple if the socket was already bound by source port.
An unbound socket SHOULD be able to receive multicasts sent from an
ephemeral source port.
* The UDP socket multicast filter mode defaults to exclusive, that is,
sources present in the per-socket list will be blocked from delivery.
* The RFC 3678 userland functions have been added to libc: setsourcefilter,
getsourcefilter, setipv4sourcefilter, getipv4sourcefilter.
* Definitions for IGMPv3 are merged but not yet used.
* struct sockaddr_storage is now referenced from <netinet/in.h>. It
is therefore defined there if not already declared in the same way
as for the C99 types.
* The RFC 1724 hack (specify 0.0.0.0/8 addresses to IP_MULTICAST_IF
which are then interpreted as interface indexes) is now deprecated.
* A patch for the Rhyolite.com routed in the FreeBSD base system
is available in the -net archives. This only affects individuals
running RIPv1 or RIPv2 via point-to-point and/or unnumbered interfaces.
* Make IPv6 detach path similar to IPv4's in code flow; functionally same.
* Bump __FreeBSD_version to 700048; see UPDATING.
This work was financially supported by another FreeBSD committer.
Obtained from: p4://bms_netdev
Submitted by: Wilbert de Graaf (original work)
Reviewed by: rwatson (locking), silence from fenner,
net@ (but with encouragement)
2007-06-12 16:24:56 +00:00
|
|
|
/*
|
|
|
|
* Multicast socket options are processed by the in_mcast
|
|
|
|
* module.
|
|
|
|
*/
|
1994-05-24 10:09:53 +00:00
|
|
|
case IP_MULTICAST_IF:
|
1994-09-06 22:42:31 +00:00
|
|
|
case IP_MULTICAST_VIF:
|
1994-05-24 10:09:53 +00:00
|
|
|
case IP_MULTICAST_TTL:
|
|
|
|
case IP_MULTICAST_LOOP:
|
Import rewrite of IPv4 socket multicast layer to support source-specific
and protocol-independent host mode multicast. The code is written to
accomodate IPv6, IGMPv3 and MLDv2 with only a little additional work.
This change only pertains to FreeBSD's use as a multicast end-station and
does not concern multicast routing; for an IGMPv3/MLDv2 router
implementation, consider the XORP project.
The work is based on Wilbert de Graaf's IGMPv3 code drop for FreeBSD 4.6,
which is available at: http://www.kloosterhof.com/wilbert/igmpv3.html
Summary
* IPv4 multicast socket processing is now moved out of ip_output.c
into a new module, in_mcast.c.
* The in_mcast.c module implements the IPv4 legacy any-source API in
terms of the protocol-independent source-specific API.
* Source filters are lazy allocated as the common case does not use them.
They are part of per inpcb state and are covered by the inpcb lock.
* struct ip_mreqn is now supported to allow applications to specify
multicast joins by interface index in the legacy IPv4 any-source API.
* In UDP, an incoming multicast datagram only requires that the source
port matches the 4-tuple if the socket was already bound by source port.
An unbound socket SHOULD be able to receive multicasts sent from an
ephemeral source port.
* The UDP socket multicast filter mode defaults to exclusive, that is,
sources present in the per-socket list will be blocked from delivery.
* The RFC 3678 userland functions have been added to libc: setsourcefilter,
getsourcefilter, setipv4sourcefilter, getipv4sourcefilter.
* Definitions for IGMPv3 are merged but not yet used.
* struct sockaddr_storage is now referenced from <netinet/in.h>. It
is therefore defined there if not already declared in the same way
as for the C99 types.
* The RFC 1724 hack (specify 0.0.0.0/8 addresses to IP_MULTICAST_IF
which are then interpreted as interface indexes) is now deprecated.
* A patch for the Rhyolite.com routed in the FreeBSD base system
is available in the -net archives. This only affects individuals
running RIPv1 or RIPv2 via point-to-point and/or unnumbered interfaces.
* Make IPv6 detach path similar to IPv4's in code flow; functionally same.
* Bump __FreeBSD_version to 700048; see UPDATING.
This work was financially supported by another FreeBSD committer.
Obtained from: p4://bms_netdev
Submitted by: Wilbert de Graaf (original work)
Reviewed by: rwatson (locking), silence from fenner,
net@ (but with encouragement)
2007-06-12 16:24:56 +00:00
|
|
|
case IP_MSFILTER:
|
|
|
|
error = inp_getmoptions(inp, sopt);
|
1996-02-22 21:32:23 +00:00
|
|
|
break;
|
|
|
|
|
2017-02-06 08:49:57 +00:00
|
|
|
#if defined(IPSEC) || defined(IPSEC_SUPPORT)
|
1999-12-22 19:13:38 +00:00
|
|
|
case IP_IPSEC_POLICY:
|
2017-02-06 08:49:57 +00:00
|
|
|
if (IPSEC_ENABLED(ipv4)) {
|
|
|
|
error = IPSEC_PCBCTL(ipv4, inp, sopt);
|
|
|
|
break;
|
2000-07-04 16:35:15 +00:00
|
|
|
}
|
2017-02-06 08:49:57 +00:00
|
|
|
/* FALLTHROUGH */
|
2007-07-03 12:13:45 +00:00
|
|
|
#endif /* IPSEC */
|
1999-12-22 19:13:38 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
default:
|
|
|
|
error = ENOPROTOOPT;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Routine called from ip_output() to loop back a copy of an IP multicast
|
|
|
|
* packet to the input queue of a specified interface. Note that this
|
|
|
|
* calls the output routine of the loopback "driver", but with an interface
|
1995-04-26 18:10:58 +00:00
|
|
|
* pointer that might NOT be a loopback interface -- evil, but easier than
|
|
|
|
* replicating that code here.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
|
|
|
static void
|
2015-08-08 15:58:35 +00:00
|
|
|
ip_mloopback(struct ifnet *ifp, const struct mbuf *m, int hlen)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
2015-08-08 15:58:35 +00:00
|
|
|
struct ip *ip;
|
1994-05-24 10:09:53 +00:00
|
|
|
struct mbuf *copym;
|
|
|
|
|
2008-08-29 20:42:58 +00:00
|
|
|
/*
|
|
|
|
* Make a deep copy of the packet because we're going to
|
|
|
|
* modify the pack in order to generate checksums.
|
|
|
|
*/
|
2012-12-05 08:04:20 +00:00
|
|
|
copym = m_dup(m, M_NOWAIT);
|
2014-10-12 15:49:52 +00:00
|
|
|
if (copym != NULL && (!M_WRITABLE(copym) || copym->m_len < hlen))
|
1997-05-06 21:22:04 +00:00
|
|
|
copym = m_pullup(copym, hlen);
|
1994-05-24 10:09:53 +00:00
|
|
|
if (copym != NULL) {
|
2004-04-07 10:01:39 +00:00
|
|
|
/* If needed, compute the checksum and mark it as valid. */
|
|
|
|
if (copym->m_pkthdr.csum_flags & CSUM_DELAY_DATA) {
|
|
|
|
in_delayed_cksum(copym);
|
|
|
|
copym->m_pkthdr.csum_flags &= ~CSUM_DELAY_DATA;
|
|
|
|
copym->m_pkthdr.csum_flags |=
|
|
|
|
CSUM_DATA_VALID | CSUM_PSEUDO_HDR;
|
|
|
|
copym->m_pkthdr.csum_data = 0xffff;
|
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
|
|
|
* We don't bother to fragment if the IP length is greater
|
|
|
|
* than the interface's MTU. Can this possibly matter?
|
|
|
|
*/
|
2012-10-22 21:09:03 +00:00
|
|
|
ip = mtod(copym, struct ip *);
|
1994-05-24 10:09:53 +00:00
|
|
|
ip->ip_sum = 0;
|
2002-10-20 22:52:07 +00:00
|
|
|
ip->ip_sum = in_cksum(copym, hlen);
|
2015-08-08 15:58:35 +00:00
|
|
|
if_simloop(ifp, copym, AF_INET, 0);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
}
|