2005-01-07 02:30:35 +00:00
|
|
|
/*-
|
2017-11-20 19:43:44 +00:00
|
|
|
* SPDX-License-Identifier: BSD-3-Clause
|
|
|
|
*
|
1999-11-22 02:45:11 +00:00
|
|
|
* Copyright (C) 1995, 1996, 1997, and 1998 WIDE Project.
|
|
|
|
* All rights reserved.
|
|
|
|
*
|
|
|
|
* Redistribution and use in source and binary forms, with or without
|
|
|
|
* modification, are permitted provided that the following conditions
|
|
|
|
* are met:
|
|
|
|
* 1. Redistributions of source code must retain the above copyright
|
|
|
|
* notice, this list of conditions and the following disclaimer.
|
|
|
|
* 2. Redistributions in binary form must reproduce the above copyright
|
|
|
|
* notice, this list of conditions and the following disclaimer in the
|
|
|
|
* documentation and/or other materials provided with the distribution.
|
|
|
|
* 3. Neither the name of the project nor the names of its contributors
|
|
|
|
* may be used to endorse or promote products derived from this software
|
|
|
|
* without specific prior written permission.
|
|
|
|
*
|
|
|
|
* THIS SOFTWARE IS PROVIDED BY THE PROJECT AND CONTRIBUTORS ``AS IS'' AND
|
|
|
|
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
|
|
|
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
|
|
|
* ARE DISCLAIMED. IN NO EVENT SHALL THE PROJECT OR CONTRIBUTORS BE LIABLE
|
|
|
|
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
|
|
|
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
|
|
|
|
* OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
|
|
|
|
* HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
|
|
|
|
* LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
|
|
|
|
* OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
|
|
|
|
* SUCH DAMAGE.
|
2007-12-10 16:03:40 +00:00
|
|
|
*
|
|
|
|
* $KAME: ip6_output.c,v 1.279 2002/01/26 06:12:30 jinmei Exp $
|
1999-11-22 02:45:11 +00:00
|
|
|
*/
|
|
|
|
|
2005-01-07 02:30:35 +00:00
|
|
|
/*-
|
1999-11-22 02:45:11 +00:00
|
|
|
* Copyright (c) 1982, 1986, 1988, 1990, 1993
|
|
|
|
* The Regents of the University of California. All rights reserved.
|
|
|
|
*
|
|
|
|
* Redistribution and use in source and binary forms, with or without
|
|
|
|
* modification, are permitted provided that the following conditions
|
|
|
|
* are met:
|
|
|
|
* 1. Redistributions of source code must retain the above copyright
|
|
|
|
* notice, this list of conditions and the following disclaimer.
|
|
|
|
* 2. Redistributions in binary form must reproduce the above copyright
|
|
|
|
* notice, this list of conditions and the following disclaimer in the
|
|
|
|
* documentation and/or other materials provided with the distribution.
|
2017-02-28 23:42:47 +00:00
|
|
|
* 3. Neither the name of the University nor the names of its contributors
|
1999-11-22 02:45:11 +00:00
|
|
|
* may be used to endorse or promote products derived from this software
|
|
|
|
* without specific prior written permission.
|
|
|
|
*
|
|
|
|
* THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
|
|
|
|
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
|
|
|
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
|
|
|
* ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
|
|
|
|
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
|
|
|
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
|
|
|
|
* OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
|
|
|
|
* HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
|
|
|
|
* LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
|
|
|
|
* OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
|
|
|
|
* SUCH DAMAGE.
|
|
|
|
*
|
|
|
|
* @(#)ip_output.c 8.3 (Berkeley) 1/21/94
|
|
|
|
*/
|
|
|
|
|
2007-12-10 16:03:40 +00:00
|
|
|
#include <sys/cdefs.h>
|
|
|
|
__FBSDID("$FreeBSD$");
|
|
|
|
|
2000-07-04 16:35:15 +00:00
|
|
|
#include "opt_inet.h"
|
|
|
|
#include "opt_inet6.h"
|
|
|
|
#include "opt_ipsec.h"
|
Add kernel-side support for in-kernel TLS.
KTLS adds support for in-kernel framing and encryption of Transport
Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports
offload of TLS for transmitted data. Key negotiation must still be
performed in userland. Once completed, transmit session keys for a
connection are provided to the kernel via a new TCP_TXTLS_ENABLE
socket option. All subsequent data transmitted on the socket is
placed into TLS frames and encrypted using the supplied keys.
Any data written to a KTLS-enabled socket via write(2), aio_write(2),
or sendfile(2) is assumed to be application data and is encoded in TLS
frames with an application data type. Individual records can be sent
with a custom type (e.g. handshake messages) via sendmsg(2) with a new
control message (TLS_SET_RECORD_TYPE) specifying the record type.
At present, rekeying is not supported though the in-kernel framework
should support rekeying.
KTLS makes use of the recently added unmapped mbufs to store TLS
frames in the socket buffer. Each TLS frame is described by a single
ext_pgs mbuf. The ext_pgs structure contains the header of the TLS
record (and trailer for encrypted records) as well as references to
the associated TLS session.
KTLS supports two primary methods of encrypting TLS frames: software
TLS and ifnet TLS.
Software TLS marks mbufs holding socket data as not ready via
M_NOTREADY similar to sendfile(2) when TLS framing information is
added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then
called to schedule TLS frames for encryption. In the case of sendfile(2),
sendfile_iodone() calls ktls_enqueue() instead of pru_ready(), leaving
the mbufs marked M_NOTREADY until encryption is completed. For other
writes (vn_sendfile when pages are available, write(2), etc.), the
PRUS_NOTREADY is set when invoking pru_send() along with invoking
ktls_enqueue().
A pool of worker threads (the "KTLS" kernel process) encrypts TLS
frames queued via ktls_enqueue(). Each TLS frame is temporarily
mapped using the direct map and passed to a software encryption
backend to perform the actual encryption.
(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if
someone wished to make this work on architectures without a direct
map.)
KTLS supports pluggable software encryption backends. Internally,
Netflix uses proprietary pure-software backends. This commit includes
a simple backend in a new ktls_ocf.ko module that uses the kernel's
OpenCrypto framework to provide AES-GCM encryption of TLS frames. As
a result, software TLS is now a bit of a misnomer as it can make use
of hardware crypto accelerators.
Once software encryption has finished, the TLS frame mbufs are marked
ready via pru_ready(). At this point, the encrypted data appears as
regular payload to the TCP stack stored in unmapped mbufs.
ifnet TLS permits a NIC to offload the TLS encryption and TCP
segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS)
is allocated on the interface a socket is routed over and associated
with a TLS session. TLS records for a TLS session using ifnet TLS are
not marked M_NOTREADY but are passed down the stack unencrypted. The
ip_output_send() and ip6_output_send() helper functions that apply
send tags to outbound IP packets verify that the send tag of the TLS
record matches the outbound interface. If so, the packet is tagged
with the TLS send tag and sent to the interface. The NIC device
driver must recognize packets with the TLS send tag and schedule them
for TLS encryption and TCP segmentation. If the outbound
interface does not match the interface in the TLS send tag, the packet
is dropped. In addition, a task is scheduled to refresh the TLS send
tag for the TLS session. If a new TLS send tag cannot be allocated,
the connection is dropped. If a new TLS send tag is allocated,
however, subsequent packets will be tagged with the correct TLS send
tag. (This latter case has been tested by configuring both ports of a
Chelsio T6 in a lagg and failing over from one port to another. As
the connections migrated to the new port, new TLS send tags were
allocated for the new port and connections resumed without being
dropped.)
ifnet TLS can be enabled and disabled on supported network interfaces
via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported
across both vlan devices and lagg interfaces using failover or lacp with
flowid enabled.
Applications may request the current KTLS mode of a connection via a
new TCP_TXTLS_MODE socket option. They can also use this socket
option to toggle between software and ifnet TLS modes.
In addition, a testing tool is available in tools/tools/switch_tls.
This is modeled on tcpdrop and uses similar syntax. However, instead
of dropping connections, -s is used to force KTLS connections to
switch to software TLS and -i is used to switch to ifnet TLS.
Various sysctls and counters are available under the kern.ipc.tls
sysctl node. The kern.ipc.tls.enable node must be set to true to
enable KTLS (it is off by default). The use of unmapped mbufs must
also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.
KTLS is enabled via the KERN_TLS kernel option.
This patch is the culmination of years of work by several folks
including Scott Long and Randall Stewart for the original design and
implementation; Drew Gallatin for several optimizations including the
use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records
awaiting software encryption, and pluggable software crypto backends;
and John Baldwin for modifications to support hardware TLS offload.
Reviewed by: gallatin, hselasky, rrs
Obtained from: Netflix
Sponsored by: Netflix, Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D21277
2019-08-27 00:01:56 +00:00
|
|
|
#include "opt_kern_tls.h"
|
2019-06-11 22:07:39 +00:00
|
|
|
#include "opt_ratelimit.h"
|
2010-05-09 20:32:00 +00:00
|
|
|
#include "opt_route.h"
|
2014-07-12 05:46:33 +00:00
|
|
|
#include "opt_rss.h"
|
2019-06-11 22:07:39 +00:00
|
|
|
#include "opt_sctp.h"
|
1999-11-22 02:45:11 +00:00
|
|
|
|
|
|
|
#include <sys/param.h>
|
2008-02-02 14:11:31 +00:00
|
|
|
#include <sys/kernel.h>
|
2019-08-27 00:01:56 +00:00
|
|
|
#include <sys/ktls.h>
|
1999-11-22 02:45:11 +00:00
|
|
|
#include <sys/malloc.h>
|
|
|
|
#include <sys/mbuf.h>
|
|
|
|
#include <sys/errno.h>
|
2007-06-13 22:42:43 +00:00
|
|
|
#include <sys/priv.h>
|
2008-02-02 14:11:31 +00:00
|
|
|
#include <sys/proc.h>
|
1999-11-22 02:45:11 +00:00
|
|
|
#include <sys/protosw.h>
|
|
|
|
#include <sys/socket.h>
|
|
|
|
#include <sys/socketvar.h>
|
2009-03-03 13:12:12 +00:00
|
|
|
#include <sys/syslog.h>
|
2008-02-02 14:11:31 +00:00
|
|
|
#include <sys/ucred.h>
|
1999-11-22 02:45:11 +00:00
|
|
|
|
2012-05-25 02:17:16 +00:00
|
|
|
#include <machine/in_cksum.h>
|
|
|
|
|
1999-11-22 02:45:11 +00:00
|
|
|
#include <net/if.h>
|
2013-10-26 17:58:36 +00:00
|
|
|
#include <net/if_var.h>
|
2016-08-24 00:52:30 +00:00
|
|
|
#include <net/if_llatbl.h>
|
2005-04-18 18:35:05 +00:00
|
|
|
#include <net/netisr.h>
|
1999-11-22 02:45:11 +00:00
|
|
|
#include <net/route.h>
|
2000-07-31 13:11:42 +00:00
|
|
|
#include <net/pfil.h>
|
2015-01-18 18:06:40 +00:00
|
|
|
#include <net/rss_config.h>
|
2008-12-02 21:37:28 +00:00
|
|
|
#include <net/vnet.h>
|
1999-11-22 02:45:11 +00:00
|
|
|
|
|
|
|
#include <netinet/in.h>
|
|
|
|
#include <netinet/in_var.h>
|
2011-08-20 17:05:11 +00:00
|
|
|
#include <netinet/ip_var.h>
|
2016-01-04 18:32:24 +00:00
|
|
|
#include <netinet6/in6_fib.h>
|
2001-06-11 12:39:29 +00:00
|
|
|
#include <netinet6/in6_var.h>
|
2000-07-04 16:35:15 +00:00
|
|
|
#include <netinet/ip6.h>
|
|
|
|
#include <netinet/icmp6.h>
|
1999-11-22 02:45:11 +00:00
|
|
|
#include <netinet6/ip6_var.h>
|
2000-07-04 16:35:15 +00:00
|
|
|
#include <netinet/in_pcb.h>
|
2003-11-20 20:07:39 +00:00
|
|
|
#include <netinet/tcp_var.h>
|
1999-11-22 02:45:11 +00:00
|
|
|
#include <netinet6/nd6.h>
|
2015-01-18 18:06:40 +00:00
|
|
|
#include <netinet6/in6_rss.h>
|
1999-11-22 02:45:11 +00:00
|
|
|
|
2017-02-06 08:49:57 +00:00
|
|
|
#include <netipsec/ipsec_support.h>
|
2010-03-12 08:10:30 +00:00
|
|
|
#ifdef SCTP
|
|
|
|
#include <netinet/sctp.h>
|
|
|
|
#include <netinet/sctp_crc32.h>
|
|
|
|
#endif
|
2002-10-16 02:25:05 +00:00
|
|
|
|
2000-08-12 18:14:13 +00:00
|
|
|
#include <netinet6/ip6protosw.h>
|
2005-07-25 12:31:43 +00:00
|
|
|
#include <netinet6/scope6_var.h>
|
2000-08-12 18:14:13 +00:00
|
|
|
|
Bite the bullet, and make the IPv6 SSM and MLDv2 mega-commit:
import from p4 bms_netdev. Summary of changes:
* Connect netinet6/in6_mcast.c to build.
The legacy KAME KPIs are mostly preserved.
* Eliminate now dead code from ip6_output.c.
Don't do mbuf bingo, we are not going to do RFC 2292 style
CMSG tricks for multicast options as they are not required
by any current IPv6 normative reference.
* Refactor transports (UDP, raw_ip6) to do own mcast filtering.
SCTP, TCP unaffected by this change.
* Add ip6_msource, in6_msource structs to in6_var.h.
* Hookup mld_ifinfo state to in6_ifextra, allocate from
domifattach path.
* Eliminate IN6_LOOKUP_MULTI(), it is no longer referenced.
Kernel consumers which need this should use in6m_lookup().
* Refactor IPv6 socket group memberships to use a vector (like IPv4).
* Update ifmcstat(8) for IPv6 SSM.
* Add witness lock order for IN6_MULTI_LOCK.
* Move IN6_MULTI_LOCK out of lower ip6_output()/ip6_input() paths.
* Introduce IP6STAT_ADD/SUB/INC/DEC as per rwatson's IPv4 cleanup.
* Update carp(4) for new IPv6 SSM KPIs.
* Virtualize ip6_mrouter socket.
Changes mostly localized to IPv6 MROUTING.
* Don't do a local group lookup in MROUTING.
* Kill unused KAME prototypes in6_purgemkludge(), in6_restoremkludge().
* Preserve KAME DAD timer jitter behaviour in MLDv1 compatibility mode.
* Bump __FreeBSD_version to 800084.
* Update UPDATING.
NOTE WELL:
* This code hasn't been tested against real MLDv2 queriers
(yet), although the on-wire protocol has been verified in Wireshark.
* There are a few unresolved issues in the socket layer APIs to
do with scope ID propagation.
* There is a LOR present in ip6_output()'s use of
in6_setscope() which needs to be resolved. See comments in mld6.c.
This is believed to be benign and can't be avoided for the moment
without re-introducing an indirect netisr.
This work was mostly derived from the IGMPv3 implementation, and
has been sponsored by a third party.
2009-04-29 19:19:13 +00:00
|
|
|
extern int in6_mcast_loop;
|
1999-11-22 02:45:11 +00:00
|
|
|
|
|
|
|
struct ip6_exthdrs {
|
2000-07-04 16:35:15 +00:00
|
|
|
struct mbuf *ip6e_ip6;
|
|
|
|
struct mbuf *ip6e_hbh;
|
|
|
|
struct mbuf *ip6e_dest1;
|
|
|
|
struct mbuf *ip6e_rthdr;
|
|
|
|
struct mbuf *ip6e_dest2;
|
1999-11-22 02:45:11 +00:00
|
|
|
};
|
|
|
|
|
2016-05-20 04:45:08 +00:00
|
|
|
static MALLOC_DEFINE(M_IP6OPT, "ip6opt", "IPv6 options");
|
|
|
|
|
2012-10-22 21:49:56 +00:00
|
|
|
static int ip6_pcbopt(int, u_char *, int, struct ip6_pktopts **,
|
|
|
|
struct ucred *, int);
|
|
|
|
static int ip6_pcbopts(struct ip6_pktopts **, struct mbuf *,
|
|
|
|
struct socket *, struct sockopt *);
|
2018-03-22 23:34:48 +00:00
|
|
|
static int ip6_getpcbopt(struct inpcb *, int, struct sockopt *);
|
2012-10-22 21:49:56 +00:00
|
|
|
static int ip6_setpktopt(int, u_char *, int, struct ip6_pktopts *,
|
|
|
|
struct ucred *, int, int, int);
|
2003-10-24 18:26:30 +00:00
|
|
|
|
2008-01-08 19:08:58 +00:00
|
|
|
static int ip6_copyexthdr(struct mbuf **, caddr_t, int);
|
2012-10-22 21:49:56 +00:00
|
|
|
static int ip6_insertfraghdr(struct mbuf *, struct mbuf *, int,
|
|
|
|
struct ip6_frag **);
|
2008-01-08 19:08:58 +00:00
|
|
|
static int ip6_insert_jumboopt(struct ip6_exthdrs *, u_int32_t);
|
|
|
|
static int ip6_splithdr(struct mbuf *, struct ip6_exthdrs *);
|
2016-01-03 09:54:03 +00:00
|
|
|
static int ip6_getpmtu(struct route_in6 *, int,
|
2016-08-01 17:02:21 +00:00
|
|
|
struct ifnet *, const struct in6_addr *, u_long *, int *, u_int,
|
|
|
|
u_int);
|
2016-01-03 09:54:03 +00:00
|
|
|
static int ip6_calcmtu(struct ifnet *, const struct in6_addr *, u_long,
|
2016-08-01 17:02:21 +00:00
|
|
|
u_long *, int *, u_int);
|
2016-05-19 12:45:20 +00:00
|
|
|
static int ip6_getpmtu_ctl(u_int, const struct in6_addr *, u_long *);
|
2008-01-08 19:08:58 +00:00
|
|
|
static int copypktopts(struct ip6_pktopts *, struct ip6_pktopts *, int);
|
1999-11-22 02:45:11 +00:00
|
|
|
|
2003-10-08 18:26:08 +00:00
|
|
|
|
2007-07-01 11:41:27 +00:00
|
|
|
/*
|
|
|
|
* Make an extension header from option data. hp is the source, and
|
|
|
|
* mp is the destination.
|
|
|
|
*/
|
|
|
|
#define MAKE_EXTHDR(hp, mp) \
|
|
|
|
do { \
|
|
|
|
if (hp) { \
|
|
|
|
struct ip6_ext *eh = (struct ip6_ext *)(hp); \
|
|
|
|
error = ip6_copyexthdr((mp), (caddr_t)(hp), \
|
|
|
|
((eh)->ip6e_len + 1) << 3); \
|
|
|
|
if (error) \
|
|
|
|
goto freehdrs; \
|
|
|
|
} \
|
|
|
|
} while (/*CONSTCOND*/ 0)
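The length expression in MAKE_EXTHDR, `((eh)->ip6e_len + 1) << 3`, follows from the IPv6 rule that an extension header's length field counts 8-octet units and excludes the first 8 octets. A minimal userland sketch of that arithmetic is below; `struct ext_hdr_prefix` and `ext_hdr_bytes()` are hypothetical names for illustration, not kernel symbols.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the first two bytes of an IPv6 extension header. */
struct ext_hdr_prefix {
	uint8_t nxt;	/* next header value */
	uint8_t len;	/* length in 8-octet units, excluding the first 8 octets */
};

/* Total header size in bytes, mirroring ((eh)->ip6e_len + 1) << 3. */
static size_t
ext_hdr_bytes(const struct ext_hdr_prefix *eh)
{
	assert(eh != NULL);
	return (((size_t)eh->len + 1) << 3);
}
```

A length field of 0 therefore describes an 8-byte header, and the largest value (255) describes a 2048-byte header.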
|
|
|
|
|
|
|
|
/*
|
2007-07-05 16:29:40 +00:00
|
|
|
* Form a chain of extension headers.
|
2007-07-01 11:41:27 +00:00
|
|
|
* m is the extension header mbuf
|
|
|
|
* mp is the previous mbuf in the chain
|
|
|
|
* p is the next header
|
|
|
|
* i is the type of option.
|
|
|
|
*/
|
|
|
|
#define MAKE_CHAIN(m, mp, p, i)\
|
|
|
|
do {\
|
|
|
|
if (m) {\
|
|
|
|
if (!hdrsplit) \
|
|
|
|
panic("assumption failed: hdr not split"); \
|
|
|
|
*mtod((m), u_char *) = *(p);\
|
|
|
|
*(p) = (i);\
|
|
|
|
p = mtod((m), u_char *);\
|
|
|
|
(m)->m_next = (mp)->m_next;\
|
|
|
|
(mp)->m_next = (m);\
|
|
|
|
(mp) = (m);\
|
|
|
|
}\
|
|
|
|
} while (/*CONSTCOND*/ 0)
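The pointer shuffling in MAKE_CHAIN can be easier to follow outside of macro form. The sketch below reproduces the same steps on a simplified userland list, assuming each node's first byte is its next-header field. `struct hdr_node` and `chain_insert()` are hypothetical illustrations, not part of the kernel source, and the sketch omits the `hdrsplit` panic check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an mbuf holding one header. */
struct hdr_node {
	uint8_t nxt;		/* first byte of the header: next header value */
	struct hdr_node *next;	/* stand-in for m_next */
};

/*
 * Link header m after *mp.
 * *p points at the next-header byte of the previous header in the chain;
 * type is the protocol number that identifies m itself.
 */
static void
chain_insert(struct hdr_node *m, struct hdr_node **mp, uint8_t **p,
    uint8_t type)
{
	if (m == NULL)
		return;
	assert(mp != NULL && *mp != NULL && p != NULL && *p != NULL);
	m->nxt = **p;		/* new header inherits the old next-header value */
	**p = type;		/* predecessor now announces the new header */
	*p = &m->nxt;		/* later insertions patch the new header */
	m->next = (*mp)->next;
	(*mp)->next = m;
	*mp = m;
}
```

Starting from a base header whose next-header value is 17 (UDP), inserting a hop-by-hop header (type 0) and then a routing header (type 43) leaves the chain reading 0, 43, 17 in order.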
|
|
|
|
|
2014-05-28 12:45:27 +00:00
|
|
|
void
|
2012-05-25 02:17:16 +00:00
|
|
|
in6_delayed_cksum(struct mbuf *m, uint32_t plen, u_short offset)
|
|
|
|
{
|
|
|
|
u_short csum;
|
|
|
|
|
2012-05-26 23:58:51 +00:00
|
|
|
csum = in_cksum_skip(m, offset + plen, offset);
|
It turns out that too many drivers are not only parsing the L2/3/4
headers for TSO but also for generic checksum offloading. Ideally we
would only have one common function shared amongst all drivers, and
perhaps when updating them for IPv6 we should introduce that.
Eventually we should provide the meta information along with mbufs to
avoid (re-)parsing entirely.
To not break IPv6 (checksums and offload) and to be able to MFC the
changes without risking to hurt 3rd party drivers, duplicate the v4
framework, as other OSes have done as well.
Introduce interface capability flags for TX/RX checksum offload with
IPv6, to allow independent toggling (where possible). Add CSUM_*_IPV6
flags for UDP/TCP over IPv6, and reserve further for SCTP, and IPv6
fragmentation. Define CSUM_DELAY_DATA_IPV6 as we do for legacy IP and
add an alias for CSUM_DATA_VALID_IPV6.
This pretty much brings IPv6 handling in line with IPv4.
TSO is still handled in a different way and not via if_hwassist.
Update ifconfig to allow (un)setting of the new capability flags.
Update loopback to announce the new capabilities and if_hwassist flags.
Individual driver updates will have to follow, as will SCTP.
Reported by: gallatin, dim, ..
Reviewed by: gallatin (glanced at?)
MFC after: 3 days
X-MFC with: r235961,235959,235958
2012-05-28 09:30:13 +00:00
|
|
|
if (m->m_pkthdr.csum_flags & CSUM_UDP_IPV6 && csum == 0)
|
2012-05-25 02:17:16 +00:00
|
|
|
csum = 0xffff;
|
|
|
|
offset += m->m_pkthdr.csum_data; /* checksum offset */
|
|
|
|
|
2018-06-06 10:46:24 +00:00
|
|
|
if (offset + sizeof(csum) > m->m_len)
|
|
|
|
m_copyback(m, offset, sizeof(csum), (caddr_t)&csum);
|
|
|
|
else
|
|
|
|
*(u_short *)mtodo(m, offset) = csum;
|
2012-05-25 02:17:16 +00:00
|
|
|
}
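The `csum == 0` test in in6_delayed_cksum() reflects the rule that a UDP checksum of zero is not valid over IPv6, so a computed zero is transmitted as 0xffff. The following is a simplified userland sketch of that rule over a flat buffer. It assumes the buffer already includes whatever pseudo-header contribution applies, and `inet_cksum()` and `udp6_final_cksum()` are hypothetical helpers rather than the kernel's in_cksum_skip().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One's-complement Internet checksum over a flat buffer (big-endian words). */
static uint16_t
inet_cksum(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	assert(buf != NULL || len == 0);
	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
	if (i < len)			/* odd trailing byte, padded with zero */
		sum += (uint32_t)buf[i] << 8;
	while (sum >> 16)		/* fold carries back into 16 bits */
		sum = (sum & 0xffff) + (sum >> 16);
	return ((uint16_t)~sum);
}

/* A computed checksum of zero is sent as 0xffff for UDP over IPv6. */
static uint16_t
udp6_final_cksum(uint16_t csum)
{
	return (csum == 0 ? 0xffff : csum);
}
```

Only the UDP case needs the substitution, which is why the kernel code checks CSUM_UDP_IPV6 before applying it.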
|
|
|
|
|
2015-02-16 06:30:27 +00:00
|
|
|
int
|
|
|
|
ip6_fragment(struct ifnet *ifp, struct mbuf *m0, int hlen, u_char nextproto,
|
2017-04-22 13:04:36 +00:00
|
|
|
int fraglen, uint32_t id)
|
2015-02-16 06:30:27 +00:00
|
|
|
{
|
|
|
|
struct mbuf *m, **mnext, *m_frgpart;
|
|
|
|
struct ip6_hdr *ip6, *mhip6;
|
|
|
|
struct ip6_frag *ip6f;
|
|
|
|
int off;
|
|
|
|
int error;
|
|
|
|
int tlen = m0->m_pkthdr.len;
|
|
|
|
|
2017-04-22 13:04:36 +00:00
|
|
|
KASSERT((fraglen % 8 == 0), ("Fragment length must be a multiple of 8"));
|
2017-04-20 09:05:53 +00:00
|
|
|
|
2015-02-16 06:30:27 +00:00
|
|
|
m = m0;
|
|
|
|
ip6 = mtod(m, struct ip6_hdr *);
|
|
|
|
mnext = &m->m_nextpkt;
|
|
|
|
|
2017-04-22 13:04:36 +00:00
|
|
|
for (off = hlen; off < tlen; off += fraglen) {
|
2015-02-16 06:30:27 +00:00
|
|
|
m = m_gethdr(M_NOWAIT, MT_DATA);
|
|
|
|
if (!m) {
|
|
|
|
IP6STAT_INC(ip6s_odropped);
|
|
|
|
return (ENOBUFS);
|
|
|
|
}
|
2019-05-10 20:15:40 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Make sure the complete packet header gets copied
|
|
|
|
* from the originating mbuf to the newly created
|
|
|
|
* mbuf. This also ensures that existing firewall
|
|
|
|
* classification(s), VLAN tags and so on get copied
|
|
|
|
* to the resulting fragmented packet(s):
|
|
|
|
*/
|
|
|
|
if (m_dup_pkthdr(m, m0, M_NOWAIT) == 0) {
|
|
|
|
m_free(m);
|
|
|
|
IP6STAT_INC(ip6s_odropped);
|
|
|
|
return (ENOBUFS);
|
|
|
|
}
|
|
|
|
|
2015-02-16 06:30:27 +00:00
|
|
|
*mnext = m;
|
|
|
|
mnext = &m->m_nextpkt;
|
|
|
|
m->m_data += max_linkhdr;
|
|
|
|
mhip6 = mtod(m, struct ip6_hdr *);
|
|
|
|
*mhip6 = *ip6;
|
|
|
|
m->m_len = sizeof(*mhip6);
|
|
|
|
error = ip6_insertfraghdr(m0, m, hlen, &ip6f);
|
|
|
|
if (error) {
|
|
|
|
IP6STAT_INC(ip6s_odropped);
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
ip6f->ip6f_offlg = htons((u_short)((off - hlen) & ~7));
|
2017-04-22 13:04:36 +00:00
|
|
|
if (off + fraglen >= tlen)
|
|
|
|
fraglen = tlen - off;
|
2015-02-16 06:30:27 +00:00
|
|
|
else
|
|
|
|
ip6f->ip6f_offlg |= IP6F_MORE_FRAG;
|
2017-04-22 13:04:36 +00:00
|
|
|
mhip6->ip6_plen = htons((u_short)(fraglen + hlen +
|
2015-02-16 06:30:27 +00:00
|
|
|
sizeof(*ip6f) - sizeof(struct ip6_hdr)));
|
2017-04-22 13:04:36 +00:00
|
|
|
if ((m_frgpart = m_copym(m0, off, fraglen, M_NOWAIT)) == NULL) {
|
2015-02-16 06:30:27 +00:00
|
|
|
IP6STAT_INC(ip6s_odropped);
|
|
|
|
return (ENOBUFS);
|
|
|
|
}
|
|
|
|
m_cat(m, m_frgpart);
|
2017-04-22 13:04:36 +00:00
|
|
|
m->m_pkthdr.len = fraglen + hlen + sizeof(*ip6f);
|
2015-02-16 06:30:27 +00:00
|
|
|
ip6f->ip6f_reserved = 0;
|
|
|
|
ip6f->ip6f_ident = id;
|
|
|
|
ip6f->ip6f_nxt = nextproto;
|
|
|
|
IP6STAT_INC(ip6s_ofragments);
|
|
|
|
in6_ifstat_inc(ifp, ifs6_out_fragcreat);
|
|
|
|
}
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
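The loop in ip6_fragment() walks the payload in steps of `fraglen`, shrinking the final step to whatever remains and setting the more-fragments flag on every fragment but the last. The sketch below isolates that arithmetic without mbufs. `struct frag_plan` and `plan_fragments()` are hypothetical names, and the offset is kept as a plain byte count rather than the network-order `ip6f_offlg` field.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one fragment. */
struct frag_plan {
	int offset;	/* byte offset of this fragment within the payload */
	int len;	/* payload bytes carried by this fragment */
	int more;	/* nonzero if further fragments follow */
};

/*
 * Fill out[] with the fragments of a packet of total length tlen whose
 * unfragmentable part is hlen bytes.  Returns the fragment count, or -1
 * if out[] is too small.
 */
static int
plan_fragments(int hlen, int tlen, int fraglen, struct frag_plan *out,
    int max)
{
	int off, n = 0;

	/* Same precondition as the KASSERT in ip6_fragment(). */
	assert(fraglen > 0 && fraglen % 8 == 0);
	for (off = hlen; off < tlen; off += fraglen) {
		if (n == max)
			return (-1);
		out[n].offset = (off - hlen) & ~7;
		if (off + fraglen >= tlen) {
			fraglen = tlen - off;	/* last fragment: remainder */
			out[n].more = 0;
		} else
			out[n].more = 1;
		out[n].len = fraglen;
		n++;
	}
	return (n);
}
```

For example, a 40-byte header followed by 3000 bytes of payload with `fraglen` 1232 yields three fragments of 1232, 1232, and 536 bytes at offsets 0, 1232, and 2464.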
|
|
|
|
|
Restructure mbuf send tags to provide stronger guarantees.
- Perform ifp mismatch checks (to determine if a send tag is allocated
for a different ifp than the one the packet is being output on), in
ip_output() and ip6_output(). This avoids sending packets with send
tags to ifnet drivers that don't support send tags.
Since we are now checking for ifp mismatches before invoking
if_output, we can now try to allocate a new tag before invoking
if_output sending the original packet on the new tag if allocation
succeeds.
To avoid code duplication for the fragment and unfragmented cases,
add ip_output_send() and ip6_output_send() as wrappers around
if_output and nd6_output_ifp, respectively. All of the logic for
setting send tags and dealing with send tag-related errors is done
in these wrapper functions.
For pseudo interfaces that wrap other network interfaces (vlan and
lagg), wrapper send tags are now allocated so that ip*_output see
the wrapper ifp as the ifp in the send tag. The if_transmit
routines rewrite the send tags after performing an ifp mismatch
check. If an ifp mismatch is detected, the transmit routines fail
with EAGAIN.
- To provide clearer life cycle management of send tags, especially
in the presence of vlan and lagg wrapper tags, add a reference count
to send tags managed via m_snd_tag_ref() and m_snd_tag_rele().
Provide a helper function (m_snd_tag_init()) for use by drivers
supporting send tags. m_snd_tag_init() takes care of the if_ref
on the ifp meaning that code allocating send tags via if_snd_tag_alloc
no longer has to manage that manually. Similarly, m_snd_tag_rele
drops the refcount on the ifp after invoking if_snd_tag_free when
the last reference to a send tag is dropped.
This also closes use after free races if there are pending packets in
driver tx rings after the socket is closed (e.g. from tcpdrop).
In order for m_free to work reliably, add a new CSUM_SND_TAG flag in
csum_flags to indicate 'snd_tag' is set (rather than 'rcvif').
Drivers now also check this flag instead of checking snd_tag against
NULL. This avoids false positive matches when a forwarded packet
has a non-NULL rcvif that was treated as a send tag.
- cxgbe was relying on snd_tag_free being called when the inp was
detached so that it could kick the firmware to flush any pending
work on the flow. This is because the driver doesn't require ACK
messages from the firmware for every request, but instead does a
kind of manual interrupt coalescing by only setting a flag to
request a completion on a subset of requests. If all of the
in-flight requests don't have the flag when the tag is detached from
the inp, the flow might never return the credits. The current
snd_tag_free command issues a flush command to force the credits to
return. However, the credit return is what also frees the mbufs,
and since those mbufs now hold references on the tag, this meant
that snd_tag_free would never be called.
To fix, explicitly drop the mbuf's reference on the snd tag when the
mbuf is queued in the firmware work queue. This means that once the
inp's reference on the tag goes away and all in-flight mbufs have
been queued to the firmware, tag's refcount will drop to zero and
snd_tag_free will kick in and send the flush request. Note that we
need to avoid doing this in the middle of ethofld_tx(), so the
driver grabs a temporary reference on the tag around that loop to
defer the free to the end of the function in case it sends the last
mbuf to the queue after the inp has dropped its reference on the
tag.
- mlx5 preallocates send tags and was using the ifp pointer even when
the send tag wasn't in use. Explicitly use the ifp from other data
structures instead.
- Sprinkle some assertions in various places to assert that received
packets don't have a send tag, and that other places that overwrite
rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer.
Reviewed by: gallatin, hselasky, rgrimes, ae
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20117
2019-05-24 22:30:40 +00:00
static int
ip6_output_send(struct inpcb *inp, struct ifnet *ifp, struct ifnet *origifp,
    struct mbuf *m, struct sockaddr_in6 *dst, struct route_in6 *ro)
{
Add kernel-side support for in-kernel TLS.
KTLS adds support for in-kernel framing and encryption of Transport
Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports
offload of TLS for transmitted data. Key negotiation must still be
performed in userland. Once completed, transmit session keys for a
connection are provided to the kernel via a new TCP_TXTLS_ENABLE
socket option. All subsequent data transmitted on the socket is
placed into TLS frames and encrypted using the supplied keys.
Any data written to a KTLS-enabled socket via write(2), aio_write(2),
or sendfile(2) is assumed to be application data and is encoded in TLS
frames with an application data type. Individual records can be sent
with a custom type (e.g. handshake messages) via sendmsg(2) with a new
control message (TLS_SET_RECORD_TYPE) specifying the record type.
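As a rough illustration of the framing described above, the sketch below
writes the 5-byte header that precedes every TLS record (content type,
protocol version, payload length). It is a userland model, not the
in-kernel ktls_frame() code; sketch_tls_header() and sketch_demo() are
hypothetical names, and only the header layout and the content type
values (22 for handshake, 23 for application data) come from the TLS
record protocol itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Content types defined by the TLS record protocol. */
enum {
	SKETCH_TLS_RT_HANDSHAKE = 22,
	SKETCH_TLS_RT_APPLICATION_DATA = 23
};

/*
 * Hypothetical helper (not a KTLS API): write the 5-byte TLS record
 * header into 'out' and return the number of bytes written.
 */
static size_t
sketch_tls_header(uint8_t *out, uint8_t type, uint8_t vmajor, uint8_t vminor,
    uint16_t payload_len)
{
	out[0] = type;				/* record (content) type */
	out[1] = vmajor;			/* protocol version, major */
	out[2] = vminor;			/* protocol version, minor */
	out[3] = (uint8_t)(payload_len >> 8);	/* length, big endian */
	out[4] = (uint8_t)(payload_len & 0xff);
	return (5);
}

/* Usage example: frame a 256-byte application data record for TLS 1.2. */
static int
sketch_demo(void)
{
	uint8_t hdr[5];

	assert(sketch_tls_header(hdr, SKETCH_TLS_RT_APPLICATION_DATA,
	    3, 3, 256) == 5);
	return (hdr[0] == 23 && hdr[3] == 1 && hdr[4] == 0 ? 0 : -1);
}
```

Data written with write(2) would correspond to the application data
type here, while a record sent with TLS_SET_RECORD_TYPE would carry
whatever type the control message supplies.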
At present, rekeying is not supported though the in-kernel framework
should support rekeying.
KTLS makes use of the recently added unmapped mbufs to store TLS
frames in the socket buffer. Each TLS frame is described by a single
ext_pgs mbuf. The ext_pgs structure contains the header of the TLS
record (and trailer for encrypted records) as well as references to
the associated TLS session.
KTLS supports two primary methods of encrypting TLS frames: software
TLS and ifnet TLS.
Software TLS marks mbufs holding socket data as not ready via
M_NOTREADY similar to sendfile(2) when TLS framing information is
added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then
called to schedule TLS frames for encryption. In the
sendfile_iodone() case, ktls_enqueue() is called instead of
pru_ready(), leaving the mbufs marked M_NOTREADY until encryption is
completed. For other writes (vn_sendfile when pages are available,
write(2), etc.), the PRUS_NOTREADY flag is set when invoking
pru_send(), and ktls_enqueue() is invoked as well.
A pool of worker threads (the "KTLS" kernel process) encrypts TLS
frames queued via ktls_enqueue(). Each TLS frame is temporarily
mapped using the direct map and passed to a software encryption
backend to perform the actual encryption.
(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if
someone wished to make this work on architectures without a direct
map.)
KTLS supports pluggable software encryption backends. Internally,
Netflix uses proprietary pure-software backends. This commit includes
a simple backend in a new ktls_ocf.ko module that uses the kernel's
OpenCrypto framework to provide AES-GCM encryption of TLS frames. As
a result, software TLS is now a bit of a misnomer as it can make use
of hardware crypto accelerators.
Once software encryption has finished, the TLS frame mbufs are marked
ready via pru_ready(). At this point, the encrypted data appears as
regular payload to the TCP stack stored in unmapped mbufs.
ifnet TLS permits a NIC to offload the TLS encryption and TCP
segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS)
is allocated on the interface a socket is routed over and associated
with a TLS session. TLS records for a TLS session using ifnet TLS are
not marked M_NOTREADY but are passed down the stack unencrypted. The
ip_output_send() and ip6_output_send() helper functions that apply
send tags to outbound IP packets verify that the send tag of the TLS
record matches the outbound interface. If so, the packet is tagged
with the TLS send tag and sent to the interface. The NIC device
driver must recognize packets with the TLS send tag and schedule them
for TLS encryption and TCP segmentation. If the outbound
interface does not match the interface in the TLS send tag, the packet
is dropped. In addition, a task is scheduled to refresh the TLS send
tag for the TLS session. If a new TLS send tag cannot be allocated,
the connection is dropped. If a new TLS send tag is allocated,
however, subsequent packets will be tagged with the correct TLS send
tag. (This latter case has been tested by configuring both ports of a
Chelsio T6 in a lagg and failing over from one port to another. As
the connections migrated to the new port, new TLS send tags were
allocated for the new port and connections resumed without being
dropped.)
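The send-tag check described above can be modeled in a few lines of
userland C. This is only a sketch under stated assumptions: the mm_*
types and mm_output_send() are hypothetical stand-ins for the kernel
structures, and the refresh_scheduled field merely represents the task
that refreshes the TLS send tag after a mismatch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for ifnet, m_snd_tag, mbuf and ktls_session. */
struct mm_ifnet { const char *name; };
struct mm_snd_tag { struct mm_ifnet *ifp; };
struct mm_pkt { struct mm_snd_tag *snd_tag; };
struct mm_session {
	struct mm_snd_tag *snd_tag;	/* NULL after an earlier mismatch */
	int refresh_scheduled;		/* models the tag refresh task */
};

/*
 * Tag the packet if the session's send tag belongs to the outbound
 * interface.  Otherwise report EAGAIN (the packet is dropped) and
 * note that a new send tag should be allocated.
 */
static int
mm_output_send(struct mm_session *tls, struct mm_ifnet *ifp,
    struct mm_pkt *pkt)
{
	struct mm_snd_tag *mst = tls->snd_tag;

	if (mst == NULL || mst->ifp != ifp) {
		tls->refresh_scheduled = 1;
		return (EAGAIN);
	}
	pkt->snd_tag = mst;	/* stamp send tag on the packet */
	return (0);
}

/* Usage example: one matching send, then a failover to another port. */
static int
mm_demo(void)
{
	struct mm_ifnet port0 = { "port0" }, port1 = { "port1" };
	struct mm_snd_tag tag = { &port0 };
	struct mm_session tls = { &tag, 0 };
	struct mm_pkt pkt = { NULL };

	assert(mm_output_send(&tls, &port0, &pkt) == 0);
	assert(mm_output_send(&tls, &port1, &pkt) == EAGAIN);
	return (tls.refresh_scheduled ? 0 : -1);
}
```

In the model, as in the description, the caller sees EAGAIN for the
mismatched packet and later packets succeed only once a tag for the new
interface has been installed.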
ifnet TLS can be enabled and disabled on supported network interfaces
via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported
across both vlan devices and lagg interfaces using either failover or
lacp with flowid enabled.
Applications may request the current KTLS mode of a connection via a
new TCP_TXTLS_MODE socket option. They can also use this socket
option to toggle between software and ifnet TLS modes.
In addition, a testing tool is available in tools/tools/switch_tls.
This is modeled on tcpdrop and uses similar syntax. However, instead
of dropping connections, -s is used to force KTLS connections to
switch to software TLS and -i is used to switch to ifnet TLS.
Various sysctls and counters are available under the kern.ipc.tls
sysctl node. The kern.ipc.tls.enable node must be set to true to
enable KTLS (it is off by default). The use of unmapped mbufs must
also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.
KTLS is enabled via the KERN_TLS kernel option.
This patch is the culmination of years of work by several folks
including Scott Long and Randall Stewart for the original design and
implementation; Drew Gallatin for several optimizations including the
use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records
awaiting software encryption, and pluggable software crypto backends;
and John Baldwin for modifications to support hardware TLS offload.
Reviewed by: gallatin, hselasky, rrs
Obtained from: Netflix
Sponsored by: Netflix, Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D21277
2019-08-27 00:01:56 +00:00
#ifdef KERN_TLS
	struct ktls_session *tls = NULL;
#endif
Restructure mbuf send tags to provide stronger guarantees.
- Perform ifp mismatch checks (to determine if a send tag is allocated
for a different ifp than the one the packet is being output on), in
ip_output() and ip6_output(). This avoids sending packets with send
tags to ifnet drivers that don't support send tags.
Since we are now checking for ifp mismatches before invoking
if_output, we can now try to allocate a new tag before invoking
if_output, sending the original packet on the new tag if allocation
succeeds.
To avoid code duplication for the fragment and unfragmented cases,
add ip_output_send() and ip6_output_send() as wrappers around
if_output and nd6_output_ifp, respectively. All of the logic for
setting send tags and dealing with send tag-related errors is done
in these wrapper functions.
For pseudo interfaces that wrap other network interfaces (vlan and
lagg), wrapper send tags are now allocated so that ip*_output see
the wrapper ifp as the ifp in the send tag. The if_transmit
routines rewrite the send tags after performing an ifp mismatch
check. If an ifp mismatch is detected, the transmit routines fail
with EAGAIN.
- To provide clearer life cycle management of send tags, especially
in the presence of vlan and lagg wrapper tags, add a reference count
to send tags managed via m_snd_tag_ref() and m_snd_tag_rele().
Provide a helper function (m_snd_tag_init()) for use by drivers
supporting send tags. m_snd_tag_init() takes care of the if_ref
on the ifp, meaning that code allocating send tags via if_snd_tag_alloc
no longer has to manage that manually. Similarly, m_snd_tag_rele
drops the refcount on the ifp after invoking if_snd_tag_free when
the last reference to a send tag is dropped.
This also closes use after free races if there are pending packets in
driver tx rings after the socket is closed (e.g. from tcpdrop).
In order for m_free to work reliably, add a new CSUM_SND_TAG flag in
csum_flags to indicate 'snd_tag' is set (rather than 'rcvif').
Drivers now also check this flag instead of checking snd_tag against
NULL. This avoids false positive matches when a forwarded packet
has a non-NULL rcvif that was treated as a send tag.
- cxgbe was relying on snd_tag_free being called when the inp was
detached so that it could kick the firmware to flush any pending
work on the flow. This is because the driver doesn't require ACK
messages from the firmware for every request, but instead does a
kind of manual interrupt coalescing by only setting a flag to
request a completion on a subset of requests. If all of the
in-flight requests don't have the flag when the tag is detached from
the inp, the flow might never return the credits. The current
snd_tag_free command issues a flush command to force the credits to
return. However, the credit return is what also frees the mbufs,
and since those mbufs now hold references on the tag, this meant
that snd_tag_free would never be called.
To fix, explicitly drop the mbuf's reference on the snd tag when the
mbuf is queued in the firmware work queue. This means that once the
inp's reference on the tag goes away and all in-flight mbufs have
been queued to the firmware, tag's refcount will drop to zero and
snd_tag_free will kick in and send the flush request. Note that we
need to avoid doing this in the middle of ethofld_tx(), so the
driver grabs a temporary reference on the tag around that loop to
defer the free to the end of the function in case it sends the last
mbuf to the queue after the inp has dropped its reference on the
tag.
- mlx5 preallocates send tags and was using the ifp pointer even when
the send tag wasn't in use. Explicitly use the ifp from other data
structures instead.
- Sprinkle some assertions in various places to assert that received
packets don't have a send tag, and that other places that overwrite
rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer.
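The reference counting life cycle described in the list above can be
sketched as a small userland model. It is an illustration only: the
rc_* names are hypothetical, plain ints replace the kernel's atomic
reference counts, and the 'freed' field stands in for the call to
if_snd_tag_free.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct ifnet and struct m_snd_tag. */
struct rc_ifnet { int refs; };
struct rc_snd_tag {
	struct rc_ifnet *ifp;
	int refcount;
	int freed;		/* set where if_snd_tag_free would run */
};

/* Models m_snd_tag_init(): take the ifp reference on the tag's behalf. */
static void
rc_tag_init(struct rc_snd_tag *mst, struct rc_ifnet *ifp)
{
	ifp->refs++;
	mst->ifp = ifp;
	mst->refcount = 1;
	mst->freed = 0;
}

/* Models m_snd_tag_ref(): returns the tag so it can be assigned. */
static struct rc_snd_tag *
rc_tag_ref(struct rc_snd_tag *mst)
{
	mst->refcount++;
	return (mst);
}

/* Models m_snd_tag_rele(): free on last reference, then drop the ifp. */
static void
rc_tag_rele(struct rc_snd_tag *mst)
{
	if (--mst->refcount == 0) {
		mst->freed = 1;
		mst->ifp->refs--;
	}
}

/* Usage example: a packet outlives the socket that owned the tag. */
static int
rc_demo(void)
{
	struct rc_ifnet ifp = { 0 };
	struct rc_snd_tag tag;
	struct rc_snd_tag *pkt_tag;

	rc_tag_init(&tag, &ifp);	/* reference held by the inp */
	pkt_tag = rc_tag_ref(&tag);	/* reference held by an mbuf */
	rc_tag_rele(&tag);		/* socket closed, inp drops its ref */
	assert(tag.freed == 0);		/* tag survives while the mbuf lives */
	rc_tag_rele(pkt_tag);		/* mbuf freed by the driver */
	return (tag.freed == 1 && ifp.refs == 0 ? 0 : -1);
}
```

The point of the model is the ordering in rc_demo(): the tag is not
freed when the socket goes away but when the last packet referencing it
is released, which is what closes the use after free window.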
Reviewed by: gallatin, hselasky, rgrimes, ae
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20117
2019-05-24 22:30:40 +00:00
	struct m_snd_tag *mst;
	int error;

	MPASS((m->m_pkthdr.csum_flags & CSUM_SND_TAG) == 0);
	mst = NULL;

2019-08-27 00:01:56 +00:00
#ifdef KERN_TLS
	/*
	 * If this is an unencrypted TLS record, save a reference to
	 * the record. This local reference is used to call
	 * ktls_output_eagain after the mbuf has been freed (thus
	 * dropping the mbuf's reference) in if_output.
	 */
	if (m->m_next != NULL && mbuf_has_tls_session(m->m_next)) {
		tls = ktls_hold(m->m_next->m_ext.ext_pgs->tls);
		mst = tls->snd_tag;

		/*
		 * If a TLS session doesn't have a valid tag, it must
		 * have had an earlier ifp mismatch, so drop this
		 * packet.
		 */
		if (mst == NULL) {
			error = EAGAIN;
			goto done;
		}
	}
#endif
2019-05-24 22:30:40 +00:00
#ifdef RATELIMIT
2019-08-27 00:01:56 +00:00
	if (inp != NULL && mst == NULL) {
2019-05-24 22:30:40 +00:00
		if ((inp->inp_flags2 & INP_RATE_LIMIT_CHANGED) != 0 ||
		    (inp->inp_snd_tag != NULL &&
		    inp->inp_snd_tag->ifp != ifp))
			in_pcboutput_txrtlmt(inp, ifp, m);

		if (inp->inp_snd_tag != NULL)
			mst = inp->inp_snd_tag;
	}
#endif
	if (mst != NULL) {
		KASSERT(m->m_pkthdr.rcvif == NULL,
		    ("trying to add a send tag to a forwarded packet"));
		if (mst->ifp != ifp) {
			error = EAGAIN;
			goto done;
		}

		/* stamp send tag on mbuf */
		m->m_pkthdr.snd_tag = m_snd_tag_ref(mst);
		m->m_pkthdr.csum_flags |= CSUM_SND_TAG;
	}

	error = nd6_output_ifp(ifp, origifp, m, dst, (struct route *)ro);

done:
	/* Check for route change invalidating send tags. */
Add kernel-side support for in-kernel TLS.
KTLS adds support for in-kernel framing and encryption of Transport
Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports
offload of TLS for transmitted data. Key negotiation must still be
performed in userland. Once completed, transmit session keys for a
connection are provided to the kernel via a new TCP_TXTLS_ENABLE
socket option. All subsequent data transmitted on the socket is
placed into TLS frames and encrypted using the supplied keys.
Any data written to a KTLS-enabled socket via write(2), aio_write(2),
or sendfile(2) is assumed to be application data and is encoded in TLS
frames with an application data type. Individual records can be sent
with a custom type (e.g. handshake messages) via sendmsg(2) with a new
control message (TLS_SET_RECORD_TYPE) specifying the record type.
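For readers unfamiliar with TLS framing, the following is a minimal user-space sketch of the 5-byte header that precedes each plaintext TLS record: one byte of content type, two bytes of protocol version, and a two-byte payload length in network byte order. The names toy_tls_write_header and TOY_TLS_* are hypothetical and exist only for this illustration; they are not KTLS kernel interfaces. The content type values (22 for handshake, 23 for application data) and the 2^14 byte plaintext limit come from the TLS record protocol.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Record content types defined by the TLS record protocol. */
enum {
	TOY_TLS_TYPE_HANDSHAKE = 22,
	TOY_TLS_TYPE_APP_DATA = 23
};

#define	TOY_TLS_HDR_LEN		5	/* type + version + length */
#define	TOY_TLS_MAX_PLAINTEXT	16384	/* 2^14 bytes per plaintext record */

/*
 * Hypothetical helper: fill in the record header for a plaintext record.
 * Returns 0 on success, or -1 if the payload does not fit in one record.
 */
static int
toy_tls_write_header(uint8_t hdr[TOY_TLS_HDR_LEN], uint8_t type,
    uint8_t vers_major, uint8_t vers_minor, size_t payload_len)
{
	assert(hdr != NULL);
	if (payload_len > TOY_TLS_MAX_PLAINTEXT)
		return (-1);
	hdr[0] = type;
	hdr[1] = vers_major;
	hdr[2] = vers_minor;
	/* The length field is big-endian on the wire. */
	hdr[3] = (uint8_t)(payload_len >> 8);
	hdr[4] = (uint8_t)(payload_len & 0xff);
	return (0);
}
```

In this sketch, data from an ordinary write would be framed with TOY_TLS_TYPE_APP_DATA, while a caller supplying a custom record type would pass that type instead (for example TOY_TLS_TYPE_HANDSHAKE). TLS 1.2 uses version bytes 3 and 3.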
At present, rekeying is not supported though the in-kernel framework
should support rekeying.
KTLS makes use of the recently added unmapped mbufs to store TLS
frames in the socket buffer. Each TLS frame is described by a single
ext_pgs mbuf. The ext_pgs structure contains the header of the TLS
record (and trailer for encrypted records) as well as references to
the associated TLS session.
KTLS supports two primary methods of encrypting TLS frames: software
TLS and ifnet TLS.
Software TLS marks mbufs holding socket data as not ready via
M_NOTREADY similar to sendfile(2) when TLS framing information is
added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then
called to schedule TLS frames for encryption. In the sendfile(2) case,
sendfile_iodone() calls ktls_enqueue() instead of pru_ready(), leaving
the mbufs marked M_NOTREADY until encryption is completed. For other
writes (vn_sendfile when pages are available, write(2), etc.),
PRUS_NOTREADY is set when invoking pru_send(), and ktls_enqueue() is
invoked as well.
A pool of worker threads (the "KTLS" kernel process) encrypts TLS
frames queued via ktls_enqueue(). Each TLS frame is temporarily
mapped using the direct map and passed to a software encryption
backend to perform the actual encryption.
(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if
someone wished to make this work on architectures without a direct
map.)
KTLS supports pluggable software encryption backends. Internally,
Netflix uses proprietary pure-software backends. This commit includes
a simple backend in a new ktls_ocf.ko module that uses the kernel's
OpenCrypto framework to provide AES-GCM encryption of TLS frames. As
a result, software TLS is now a bit of a misnomer as it can make use
of hardware crypto accelerators.
Once software encryption has finished, the TLS frame mbufs are marked
ready via pru_ready(). At this point, the encrypted data appears as
regular payload to the TCP stack stored in unmapped mbufs.
ifnet TLS permits a NIC to offload the TLS encryption and TCP
segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS)
is allocated on the interface a socket is routed over and associated
with a TLS session. TLS records for a TLS session using ifnet TLS are
not marked M_NOTREADY but are passed down the stack unencrypted. The
ip_output_send() and ip6_output_send() helper functions that apply
send tags to outbound IP packets verify that the send tag of the TLS
record matches the outbound interface. If so, the packet is tagged
with the TLS send tag and sent to the interface. The NIC device
driver must recognize packets with the TLS send tag and schedule them
for TLS encryption and TCP segmentation. If the outbound
interface does not match the interface in the TLS send tag, the packet
is dropped. In addition, a task is scheduled to refresh the TLS send
tag for the TLS session. If a new TLS send tag cannot be allocated,
the connection is dropped. If a new TLS send tag is allocated,
however, subsequent packets will be tagged with the correct TLS send
tag. (This latter case has been tested by configuring both ports of a
Chelsio T6 in a lagg and failing over from one port to another. As
the connections migrated to the new port, new TLS send tags were
allocated for the new port and connections resumed without being
dropped.)
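The interface check described above can be sketched in user space as follows. This is a toy model under stated assumptions, not the kernel code: every toy_* name is hypothetical, and the tag refresh happens synchronously here whereas the text above describes a scheduled task. It only illustrates the control flow: a packet is tagged when the tag's interface matches the outbound interface, and otherwise the send fails with EAGAIN so that later packets can use a refreshed tag.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for an interface, a send tag and a packet. */
struct toy_ifnet { const char *name; };
struct toy_snd_tag { struct toy_ifnet *ifp; };
struct toy_pkt { struct toy_snd_tag *snd_tag; };
struct toy_session { struct toy_snd_tag tag; };

/*
 * Stamp the send tag on the packet if the tag was allocated for the
 * outbound interface; report EAGAIN on a mismatch and leave the packet
 * untagged.
 */
static int
toy_output_send(struct toy_ifnet *ifp, struct toy_pkt *pkt,
    struct toy_snd_tag *mst)
{
	assert(ifp != NULL && pkt != NULL);
	if (mst != NULL) {
		if (mst->ifp != ifp)
			return (EAGAIN);
		pkt->snd_tag = mst;
	}
	return (0);
}

/*
 * Send one packet for a session. On EAGAIN the packet is not sent, and
 * the session's tag is re-pointed at the new interface (standing in for
 * allocating a new tag) so that the next packet can go out.
 */
static int
toy_session_send(struct toy_session *s, struct toy_ifnet *ifp,
    struct toy_pkt *pkt)
{
	int error;

	error = toy_output_send(ifp, pkt, &s->tag);
	if (error == EAGAIN)
		s->tag.ifp = ifp;
	return (error);
}
```

With a session whose tag was set up for one port, a send on a second port returns EAGAIN once, and the following send on that second port succeeds, which mirrors the failover behaviour described in the parenthetical above.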
ifnet TLS can be enabled and disabled on supported network interfaces
via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported
across both vlan devices and lagg interfaces using failover or lacp with
flowid enabled.
Applications may request the current KTLS mode of a connection via a
new TCP_TXTLS_MODE socket option. They can also use this socket
option to toggle between software and ifnet TLS modes.
In addition, a testing tool is available in tools/tools/switch_tls.
This is modeled on tcpdrop and uses similar syntax. However, instead
of dropping connections, -s is used to force KTLS connections to
switch to software TLS and -i is used to switch to ifnet TLS.
Various sysctls and counters are available under the kern.ipc.tls
sysctl node. The kern.ipc.tls.enable node must be set to true to
enable KTLS (it is off by default). The use of unmapped mbufs must
also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.
KTLS is enabled via the KERN_TLS kernel option.
This patch is the culmination of years of work by several folks
including Scott Long and Randall Stewart for the original design and
implementation; Drew Gallatin for several optimizations including the
use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records
awaiting software encryption, and pluggable software crypto backends;
and John Baldwin for modifications to support hardware TLS offload.
Reviewed by: gallatin, hselasky, rrs
Obtained from: Netflix
Sponsored by: Netflix, Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D21277
2019-08-27 00:01:56 +00:00
|
|
|
#ifdef KERN_TLS
|
|
|
|
if (tls != NULL) {
|
|
|
|
if (error == EAGAIN)
|
|
|
|
error = ktls_output_eagain(inp, tls);
|
|
|
|
ktls_free(tls);
|
|
|
|
}
|
|
|
|
#endif
|
Restructure mbuf send tags to provide stronger guarantees.
- Perform ifp mismatch checks (to determine if a send tag is allocated
for a different ifp than the one the packet is being output on), in
ip_output() and ip6_output(). This avoids sending packets with send
tags to ifnet drivers that don't support send tags.
Since we are now checking for ifp mismatches before invoking
if_output, we can now try to allocate a new tag before invoking
if_output sending the original packet on the new tag if allocation
succeeds.
To avoid code duplication for the fragment and unfragmented cases,
add ip_output_send() and ip6_output_send() as wrappers around
if_output and nd6_output_ifp, respectively. All of the logic for
setting send tags and dealing with send tag-related errors is done
in these wrapper functions.
For pseudo interfaces that wrap other network interfaces (vlan and
lagg), wrapper send tags are now allocated so that ip*_output see
the wrapper ifp as the ifp in the send tag. The if_transmit
routines rewrite the send tags after performing an ifp mismatch
check. If an ifp mismatch is detected, the transmit routines fail
with EAGAIN.
- To provide clearer life cycle management of send tags, especially
in the presence of vlan and lagg wrapper tags, add a reference count
to send tags managed via m_snd_tag_ref() and m_snd_tag_rele().
Provide a helper function (m_snd_tag_init()) for use by drivers
supporting send tags. m_snd_tag_init() takes care of the if_ref
on the ifp meaning that code allocating send tags via if_snd_tag_alloc
no longer has to manage that manually. Similarly, m_snd_tag_rele
drops the refcount on the ifp after invoking if_snd_tag_free when
the last reference to a send tag is dropped.
This also closes use after free races if there are pending packets in
driver tx rings after the socket is closed (e.g. from tcpdrop).
In order for m_free to work reliably, add a new CSUM_SND_TAG flag in
csum_flags to indicate 'snd_tag' is set (rather than 'rcvif').
Drivers now also check this flag instead of checking snd_tag against
NULL. This avoids false positive matches when a forwarded packet
has a non-NULL rcvif that was treated as a send tag.
- cxgbe was relying on snd_tag_free being called when the inp was
detached so that it could kick the firmware to flush any pending
work on the flow. This is because the driver doesn't require ACK
messages from the firmware for every request, but instead does a
kind of manual interrupt coalescing by only setting a flag to
request a completion on a subset of requests. If all of the
in-flight requests don't have the flag when the tag is detached from
the inp, the flow might never return the credits. The current
snd_tag_free command issues a flush command to force the credits to
return. However, the credit return is what also frees the mbufs,
and since those mbufs now hold references on the tag, this meant
that snd_tag_free would never be called.
To fix, explicitly drop the mbuf's reference on the snd tag when the
mbuf is queued in the firmware work queue. This means that once the
inp's reference on the tag goes away and all in-flight mbufs have
been queued to the firmware, tag's refcount will drop to zero and
snd_tag_free will kick in and send the flush request. Note that we
need to avoid doing this in the middle of ethofld_tx(), so the
driver grabs a temporary reference on the tag around that loop to
defer the free to the end of the function in case it sends the last
mbuf to the queue after the inp has dropped its reference on the
tag.
- mlx5 preallocates send tags and was using the ifp pointer even when
the send tag wasn't in use. Explicitly use the ifp from other data
structures instead.
- Sprinkle some assertions in various places to assert that received
packets don't have a send tag, and that other places that overwrite
rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer.
Reviewed by: gallatin, hselasky, rgrimes, ae
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20117
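The reference counting life cycle described in this message can be illustrated with a small user-space model. It is a sketch under stated assumptions rather than the kernel implementation: the toy_* names are hypothetical, the counters are plain integers instead of atomic reference counts, and toy_driver_free merely counts calls where a real driver would release hardware state.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for an interface and a send tag. */
struct toy_ifnet { int refs; };
struct toy_snd_tag {
	struct toy_ifnet *ifp;
	unsigned int refcount;
	void (*free_cb)(struct toy_snd_tag *);
};

/* Stand-in for a driver's tag free routine; it only counts calls. */
static int toy_frees;

static void
toy_driver_free(struct toy_snd_tag *mst)
{
	(void)mst;
	toy_frees++;
}

/* Initialize a tag: take a reference on the ifp, start at one reference. */
static void
toy_snd_tag_init(struct toy_snd_tag *mst, struct toy_ifnet *ifp,
    void (*free_cb)(struct toy_snd_tag *))
{
	ifp->refs++;
	mst->ifp = ifp;
	mst->refcount = 1;
	mst->free_cb = free_cb;
}

/* Take an extra reference, e.g. when stamping the tag on a packet. */
static struct toy_snd_tag *
toy_snd_tag_ref(struct toy_snd_tag *mst)
{
	assert(mst->refcount > 0);
	mst->refcount++;
	return (mst);
}

/*
 * Drop a reference. The last release invokes the free callback and then
 * drops the reference on the ifp, in that order.
 */
static void
toy_snd_tag_rele(struct toy_snd_tag *mst)
{
	struct toy_ifnet *ifp;

	assert(mst->refcount > 0);
	if (--mst->refcount == 0) {
		ifp = mst->ifp;	/* saved because free_cb may release mst */
		mst->free_cb(mst);
		ifp->refs--;
	}
}
```

In this model a tag referenced by both a connection and an in-flight packet survives the connection's release; the free callback runs only when the packet's reference is dropped as well, which is the property that closes the use-after-free window mentioned above.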
2019-05-24 22:30:40 +00:00
|
|
|
#ifdef RATELIMIT
|
|
|
|
if (error == EAGAIN)
|
|
|
|
in_pcboutput_eagain(inp);
|
|
|
|
#endif
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
1999-11-22 02:45:11 +00:00
|
|
|
/*
|
|
|
|
* IP6 output. The packet in mbuf chain m contains a skeletal IP6
|
|
|
|
* header (with pri, len, nxt, hlim, src, dst).
|
|
|
|
* This function may modify ver and hlim only.
|
|
|
|
* The mbuf chain containing the packet will be freed.
|
|
|
|
* The mbuf opt, if present, will not be freed.
|
2012-07-04 07:37:53 +00:00
|
|
|
* If route_in6 ro is present and has ro_rt initialized, route lookup would be
|
|
|
|
* skipped and ro->ro_rt would be used. If ro is present but ro->ro_rt is NULL,
|
|
|
|
* then result of route lookup is stored in ro->ro_rt.
|
2001-06-11 12:39:29 +00:00
|
|
|
*
|
2014-03-05 01:17:47 +00:00
|
|
|
* type of "mtu": rt_mtu is u_long, ifnet.ifr_mtu is int, and
|
2001-06-11 12:39:29 +00:00
|
|
|
* nd_ifinfo.linkmtu is u_int32_t. so we use u_long to hold largest one,
|
2014-03-05 01:17:47 +00:00
|
|
|
* which is rt_mtu.
|
2007-07-05 16:23:49 +00:00
|
|
|
*
|
|
|
|
* ifpp - XXX: just for statistics
|
1999-11-22 02:45:11 +00:00
|
|
|
*/
|
2014-09-09 00:21:21 +00:00
|
|
|
/*
|
|
|
|
* XXX TODO: no flowid is assigned for outbound flows?
|
|
|
|
*/
|
1999-11-22 02:45:11 +00:00
|
|
|
int
|
2007-07-05 16:23:49 +00:00
|
|
|
ip6_output(struct mbuf *m0, struct ip6_pktopts *opt,
|
|
|
|
struct route_in6 *ro, int flags, struct ip6_moptions *im6o,
|
|
|
|
struct ifnet **ifpp, struct inpcb *inp)
|
1999-11-22 02:45:11 +00:00
|
|
|
{
|
2015-02-16 06:30:27 +00:00
|
|
|
struct ip6_hdr *ip6;
|
2000-07-04 16:35:15 +00:00
|
|
|
struct ifnet *ifp, *origifp;
|
1999-11-22 02:45:11 +00:00
|
|
|
struct mbuf *m = m0;
|
2007-07-01 11:41:27 +00:00
|
|
|
struct mbuf *mprev = NULL;
|
2015-02-16 06:30:27 +00:00
|
|
|
int hlen, tlen, len;
|
1999-11-22 02:45:11 +00:00
|
|
|
struct route_in6 ip6route;
|
2005-07-25 12:31:43 +00:00
|
|
|
struct rtentry *rt = NULL;
|
|
|
|
struct sockaddr_in6 *dst, src_sa, dst_sa;
|
2005-04-18 18:35:05 +00:00
|
|
|
struct in6_addr odst;
|
1999-11-22 02:45:11 +00:00
|
|
|
int error = 0;
|
2000-10-19 23:15:54 +00:00
|
|
|
struct in6_ifaddr *ia = NULL;
|
1999-11-22 02:45:11 +00:00
|
|
|
u_long mtu;
|
2003-10-24 18:26:30 +00:00
|
|
|
int alwaysfrag, dontfrag;
|
1999-11-22 02:45:11 +00:00
|
|
|
u_int32_t optlen = 0, plen = 0, unfragpartlen = 0;
|
|
|
|
struct ip6_exthdrs exthdrs;
|
2016-05-19 12:45:20 +00:00
|
|
|
struct in6_addr src0, dst0;
|
2005-07-25 12:31:43 +00:00
|
|
|
u_int32_t zone;
|
1999-11-22 02:45:11 +00:00
|
|
|
struct route_in6 *ro_pmtu = NULL;
|
|
|
|
int hdrsplit = 0;
|
2012-05-25 02:17:16 +00:00
|
|
|
int sw_csum, tso;
|
2014-10-02 00:25:57 +00:00
|
|
|
int needfiblookup;
|
|
|
|
uint32_t fibnum;
|
2012-12-19 17:28:17 +00:00
|
|
|
struct m_tag *fwd_tag = NULL;
|
2015-04-01 12:15:01 +00:00
|
|
|
uint32_t id;
|
2002-10-16 02:25:05 +00:00
|
|
|
|
2014-09-09 00:21:21 +00:00
|
|
|
if (inp != NULL) {
|
2017-05-10 00:14:55 +00:00
|
|
|
INP_LOCK_ASSERT(inp);
|
2012-02-03 13:08:44 +00:00
|
|
|
M_SETFIB(m, inp->inp_inc.inc_fibnum);
|
2014-12-01 11:45:24 +00:00
|
|
|
if ((flags & IP_NODEFAULTFLOWID) == 0) {
|
|
|
|
/* unconditionally set flowid */
|
2014-09-09 00:21:21 +00:00
|
|
|
m->m_pkthdr.flowid = inp->inp_flowid;
|
2014-12-01 11:45:24 +00:00
|
|
|
M_HASHTYPE_SET(m, inp->inp_flowtype);
|
2014-09-09 00:21:21 +00:00
|
|
|
}
|
2019-04-25 15:37:28 +00:00
|
|
|
#ifdef NUMA
|
|
|
|
m->m_pkthdr.numa_domain = inp->inp_numa_domain;
|
|
|
|
#endif
|
2014-09-09 00:21:21 +00:00
|
|
|
}
|
2012-02-03 13:08:44 +00:00
|
|
|
|
2017-02-06 08:49:57 +00:00
|
|
|
#if defined(IPSEC) || defined(IPSEC_SUPPORT)
|
|
|
|
/*
|
|
|
|
* IPSec checking which handles several cases.
|
|
|
|
* FAST IPSEC: We re-injected the packet.
|
|
|
|
* XXX: need scope argument.
|
|
|
|
*/
|
|
|
|
if (IPSEC_ENABLED(ipv6)) {
|
|
|
|
if ((error = IPSEC_OUTPUT(ipv6, m, inp)) != 0) {
|
|
|
|
if (error == EINPROGRESS)
|
|
|
|
error = 0;
|
|
|
|
goto done;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
#endif /* IPSEC */
|
|
|
|
|
1999-11-22 02:45:11 +00:00
|
|
|
bzero(&exthdrs, sizeof(exthdrs));
|
|
|
|
if (opt) {
|
|
|
|
/* Hop-by-Hop options header */
|
|
|
|
MAKE_EXTHDR(opt->ip6po_hbh, &exthdrs.ip6e_hbh);
|
|
|
|
/* Destination options header(1st part) */
|
2003-10-31 16:32:12 +00:00
|
|
|
if (opt->ip6po_rthdr) {
|
|
|
|
/*
|
|
|
|
* Destination options header(1st part)
|
2007-07-01 11:41:27 +00:00
|
|
|
* This only makes sense with a routing header.
|
2003-10-31 16:32:12 +00:00
|
|
|
* See Section 9.2 of RFC 3542.
|
|
|
|
* Disabling this part just for MIP6 convenience is
|
|
|
|
* a bad idea. We need to think carefully about a
|
|
|
|
* way to make the advanced API coexist with MIP6
|
|
|
|
* options, which might automatically be inserted in
|
|
|
|
* the kernel.
|
|
|
|
*/
|
|
|
|
MAKE_EXTHDR(opt->ip6po_dest1, &exthdrs.ip6e_dest1);
|
|
|
|
}
|
1999-11-22 02:45:11 +00:00
|
|
|
/* Routing header */
|
|
|
|
MAKE_EXTHDR(opt->ip6po_rthdr, &exthdrs.ip6e_rthdr);
|
|
|
|
/* Destination options header(2nd part) */
|
|
|
|
MAKE_EXTHDR(opt->ip6po_dest2, &exthdrs.ip6e_dest2);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Calculate the total length of the extension header chain.
|
|
|
|
* Keep the length of the unfragmentable part for fragmentation.
|
|
|
|
*/
|
|
|
|
optlen = 0;
|
2007-07-05 16:29:40 +00:00
|
|
|
if (exthdrs.ip6e_hbh)
|
2007-07-01 11:41:27 +00:00
|
|
|
optlen += exthdrs.ip6e_hbh->m_len;
|
2007-07-05 16:29:40 +00:00
|
|
|
if (exthdrs.ip6e_dest1)
|
2007-07-01 11:41:27 +00:00
|
|
|
optlen += exthdrs.ip6e_dest1->m_len;
|
2007-07-05 16:29:40 +00:00
|
|
|
if (exthdrs.ip6e_rthdr)
|
2007-07-01 11:41:27 +00:00
|
|
|
optlen += exthdrs.ip6e_rthdr->m_len;
|
1999-11-22 02:45:11 +00:00
|
|
|
unfragpartlen = optlen + sizeof(struct ip6_hdr);
|
2007-07-01 11:41:27 +00:00
|
|
|
|
2014-05-28 12:45:27 +00:00
|
|
|
/* NOTE: we don't add AH/ESP length here (done in ip6_ipsec_output) */
|
2007-07-05 16:29:40 +00:00
|
|
|
if (exthdrs.ip6e_dest2)
|
2007-07-01 11:41:27 +00:00
|
|
|
optlen += exthdrs.ip6e_dest2->m_len;
|
1999-11-22 02:45:11 +00:00
|
|
|
|
|
|
|
/*
|
2014-05-28 12:45:27 +00:00
|
|
|
* If there is at least one extension header,
|
1999-11-22 02:45:11 +00:00
|
|
|
* separate IP6 header from the payload.
|
|
|
|
*/
|
2014-05-28 12:45:27 +00:00
|
|
|
if (optlen && !hdrsplit) {
|
1999-11-22 02:45:11 +00:00
|
|
|
if ((error = ip6_splithdr(m, &exthdrs)) != 0) {
|
|
|
|
m = NULL;
|
|
|
|
goto freehdrs;
|
|
|
|
}
|
|
|
|
m = exthdrs.ip6e_ip6;
|
|
|
|
hdrsplit++;
|
|
|
|
}
|
|
|
|
|
|
|
|
ip6 = mtod(m, struct ip6_hdr *);
|
|
|
|
|
|
|
|
/* adjust mbuf packet header length */
|
|
|
|
m->m_pkthdr.len += optlen;
|
|
|
|
plen = m->m_pkthdr.len - sizeof(*ip6);
|
|
|
|
|
|
|
|
/* If this is a jumbo payload, insert a jumbo payload option. */
|
|
|
|
if (plen > IPV6_MAXPACKET) {
|
|
|
|
if (!hdrsplit) {
|
|
|
|
if ((error = ip6_splithdr(m, &exthdrs)) != 0) {
|
|
|
|
m = NULL;
|
|
|
|
goto freehdrs;
|
|
|
|
}
|
|
|
|
m = exthdrs.ip6e_ip6;
|
|
|
|
hdrsplit++;
|
|
|
|
}
|
|
|
|
/* adjust pointer */
|
|
|
|
ip6 = mtod(m, struct ip6_hdr *);
|
|
|
|
if ((error = ip6_insert_jumboopt(&exthdrs, plen)) != 0)
|
|
|
|
goto freehdrs;
|
|
|
|
ip6->ip6_plen = 0;
|
|
|
|
} else
|
|
|
|
ip6->ip6_plen = htons(plen);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Concatenate headers and fill in next header fields.
|
|
|
|
* Here we have, on "m"
|
|
|
|
* IPv6 payload
|
|
|
|
* and we insert headers accordingly. Finally, we should be getting:
|
|
|
|
* IPv6 hbh dest1 rthdr ah* [esp* dest2 payload]
|
|
|
|
*
|
|
|
|
* during the header composing process, "m" points to IPv6 header.
|
|
|
|
* "mprev" points to an extension header prior to esp.
|
|
|
|
*/
|
2007-07-01 11:41:27 +00:00
|
|
|
u_char *nexthdrp = &ip6->ip6_nxt;
|
|
|
|
mprev = m;
|
2007-07-05 16:29:40 +00:00
|
|
|
|
2007-07-01 11:41:27 +00:00
|
|
|
/*
|
|
|
|
* we treat dest2 specially. this makes IPsec processing
|
|
|
|
* much easier. the goal here is to make mprev point to the
|
|
|
|
* mbuf prior to dest2.
|
|
|
|
*
|
|
|
|
* result: IPv6 dest2 payload
|
|
|
|
* m and mprev will point to IPv6 header.
|
|
|
|
*/
|
|
|
|
if (exthdrs.ip6e_dest2) {
|
|
|
|
if (!hdrsplit)
|
|
|
|
panic("assumption failed: hdr not split");
|
|
|
|
exthdrs.ip6e_dest2->m_next = m->m_next;
|
|
|
|
m->m_next = exthdrs.ip6e_dest2;
|
|
|
|
*mtod(exthdrs.ip6e_dest2, u_char *) = ip6->ip6_nxt;
|
|
|
|
ip6->ip6_nxt = IPPROTO_DSTOPTS;
|
|
|
|
}
|
2007-07-05 16:29:40 +00:00
|
|
|
|
2007-07-01 11:41:27 +00:00
|
|
|
/*
|
|
|
|
* result: IPv6 hbh dest1 rthdr dest2 payload
|
|
|
|
* m will point to IPv6 header. mprev will point to the
|
|
|
|
* extension header prior to dest2 (rthdr in the above case).
|
|
|
|
*/
|
|
|
|
MAKE_CHAIN(exthdrs.ip6e_hbh, mprev, nexthdrp, IPPROTO_HOPOPTS);
|
|
|
|
MAKE_CHAIN(exthdrs.ip6e_dest1, mprev, nexthdrp,
|
|
|
|
IPPROTO_DSTOPTS);
|
|
|
|
MAKE_CHAIN(exthdrs.ip6e_rthdr, mprev, nexthdrp,
|
|
|
|
IPPROTO_ROUTING);
|
2007-07-05 16:29:40 +00:00
|
|
|
|
1999-11-22 02:45:11 +00:00
|
|
|
/*
|
2009-05-09 18:25:58 +00:00
|
|
|
* If there is a routing header, discard the packet.
|
1999-11-22 02:45:11 +00:00
|
|
|
*/
|
|
|
|
if (exthdrs.ip6e_rthdr) {
|
2009-05-09 18:25:58 +00:00
|
|
|
error = EINVAL;
|
|
|
|
goto bad;
|
1999-11-22 02:45:11 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Source address validation */
|
|
|
|
if (IN6_IS_ADDR_UNSPECIFIED(&ip6->ip6_src) &&
|
2005-10-21 15:45:13 +00:00
|
|
|
(flags & IPV6_UNSPECSRC) == 0) {
|
1999-11-22 02:45:11 +00:00
|
|
|
error = EOPNOTSUPP;
|
2013-04-09 07:11:22 +00:00
|
|
|
IP6STAT_INC(ip6s_badscope);
|
1999-11-22 02:45:11 +00:00
|
|
|
goto bad;
|
|
|
|
}
|
|
|
|
if (IN6_IS_ADDR_MULTICAST(&ip6->ip6_src)) {
|
|
|
|
error = EOPNOTSUPP;
|
2013-04-09 07:11:22 +00:00
|
|
|
IP6STAT_INC(ip6s_badscope);
|
1999-11-22 02:45:11 +00:00
|
|
|
goto bad;
|
|
|
|
}
|
|
|
|
|
2013-04-09 07:11:22 +00:00
|
|
|
IP6STAT_INC(ip6s_localout);
|
1999-11-22 02:45:11 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Route packet.
|
|
|
|
*/
|
2016-04-15 17:30:33 +00:00
|
|
|
if (ro == NULL) {
|
1999-11-22 02:45:11 +00:00
|
|
|
ro = &ip6route;
|
|
|
|
bzero((caddr_t)ro, sizeof(*ro));
|
2017-03-25 15:06:28 +00:00
|
|
|
}
|
1999-11-22 02:45:11 +00:00
|
|
|
ro_pmtu = ro;
|
|
|
|
if (opt && opt->ip6po_rthdr)
|
|
|
|
ro = &opt->ip6po_route;
|
|
|
|
dst = (struct sockaddr_in6 *)&ro->ro_dst;
|
2014-10-02 00:25:57 +00:00
|
|
|
fibnum = (inp != NULL) ? inp->inp_inc.inc_fibnum : M_GETFIB(m);
|
2005-04-18 18:35:05 +00:00
|
|
|
again:
|
2007-07-05 16:29:40 +00:00
|
|
|
/*
|
2003-10-24 18:26:30 +00:00
|
|
|
* if specified, try to fill in the traffic class field.
|
|
|
|
* do not override if a non-zero value is already set.
|
|
|
|
* we check the diffserv field and the ecn field separately.
|
|
|
|
*/
|
|
|
|
if (opt && opt->ip6po_tclass >= 0) {
|
|
|
|
int mask = 0;
|
|
|
|
|
|
|
|
if ((ip6->ip6_flow & htonl(0xfc << 20)) == 0)
|
|
|
|
mask |= 0xfc;
|
|
|
|
if ((ip6->ip6_flow & htonl(0x03 << 20)) == 0)
|
|
|
|
mask |= 0x03;
|
|
|
|
if (mask != 0)
|
|
|
|
ip6->ip6_flow |= htonl((opt->ip6po_tclass & mask) << 20);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* fill in or override the hop limit field, if necessary. */
|
|
|
|
if (opt && opt->ip6po_hlim != -1)
|
|
|
|
ip6->ip6_hlim = opt->ip6po_hlim & 0xff;
|
|
|
|
else if (IN6_IS_ADDR_MULTICAST(&ip6->ip6_dst)) {
|
|
|
|
if (im6o != NULL)
|
|
|
|
ip6->ip6_hlim = im6o->im6o_multicast_hlim;
|
|
|
|
else
|
Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).
This is the first in a series of commits over the course
of the next few weeks.
Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.
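As a minimal sketch of that first step, the fragment below shows a V_-prefixed name that a macro maps straight back to an ordinary global, so code using the prefixed name behaves exactly as before. The names toy_defmcasthlim and toy_mcast_hlim are hypothetical and the initial value 1 is chosen only for the demonstration; how later commits redefine such macros is outside the scope of this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical global that is a candidate for virtualization. */
static int toy_defmcasthlim = 1;

/* Step 1: the V_ name is only an alias for the global, so nothing changes. */
#define	V_toy_defmcasthlim	toy_defmcasthlim

/*
 * Pick a multicast hop limit: use the caller's override when one is
 * given, otherwise fall back to the (aliased) default.
 */
static int
toy_mcast_hlim(const int *override)
{
	int hlim;

	hlim = (override != NULL) ? *override : V_toy_defmcasthlim;
	assert(hlim >= 0 && hlim <= 255);	/* hop limit is one octet */
	return (hlim);
}
```

Because the macro expands to the global itself, assigning through V_toy_defmcasthlim updates toy_defmcasthlim, which is why a change of this shape does not alter behaviour.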
We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.
Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch
2008-08-17 23:27:27 +00:00
|
|
|
ip6->ip6_hlim = V_ip6_defmcasthlim;
|
2003-10-24 18:26:30 +00:00
|
|
|
}
|
2016-03-24 07:54:56 +00:00
|
|
|
/*
|
|
|
|
* Validate route against routing table additions;
|
|
|
|
* a better/more specific route might have been added.
|
|
|
|
* Make sure address family is set in route.
|
|
|
|
*/
|
|
|
|
if (inp) {
|
|
|
|
ro->ro_dst.sin6_family = AF_INET6;
|
|
|
|
RT_VALIDATE((struct route *)ro, &inp->inp_rt_cookie, fibnum);
|
|
|
|
}
|
|
|
|
if (ro->ro_rt && fwd_tag == NULL && (ro->ro_rt->rt_flags & RTF_UP) &&
|
|
|
|
ro->ro_dst.sin6_family == AF_INET6 &&
|
|
|
|
IN6_ARE_ADDR_EQUAL(&ro->ro_dst.sin6_addr, &ip6->ip6_dst)) {
|
2010-05-09 20:32:00 +00:00
|
|
|
rt = ro->ro_rt;
|
|
|
|
ifp = ro->ro_rt->rt_ifp;
|
2012-12-19 17:08:49 +00:00
|
|
|
} else {
|
2016-08-24 00:52:30 +00:00
|
|
|
if (ro->ro_lle)
|
|
|
|
LLE_FREE(ro->ro_lle); /* zeros ro_lle */
|
|
|
|
ro->ro_lle = NULL;
|
2012-12-19 17:28:17 +00:00
|
|
|
if (fwd_tag == NULL) {
|
|
|
|
bzero(&dst_sa, sizeof(dst_sa));
|
|
|
|
dst_sa.sin6_family = AF_INET6;
|
|
|
|
dst_sa.sin6_len = sizeof(dst_sa);
|
|
|
|
dst_sa.sin6_addr = ip6->ip6_dst;
|
|
|
|
}
|
2012-12-19 17:08:49 +00:00
|
|
|
error = in6_selectroute_fib(&dst_sa, opt, im6o, ro, &ifp,
|
2014-10-02 00:25:57 +00:00
|
|
|
&rt, fibnum);
|
2012-12-19 17:08:49 +00:00
|
|
|
if (error != 0) {
|
|
|
|
if (ifp != NULL)
|
|
|
|
in6_ifstat_inc(ifp, ifs6_out_discard);
|
|
|
|
goto bad;
|
|
|
|
}
|
2005-07-25 12:31:43 +00:00
|
|
|
}
|
|
|
|
if (rt == NULL) {
|
1999-11-22 02:45:11 +00:00
|
|
|
/*
|
2005-07-25 12:31:43 +00:00
|
|
|
* If in6_selectroute() does not return a route entry,
|
|
|
|
* dst may not have been updated.
|
1999-11-22 02:45:11 +00:00
|
|
|
*/
|
2005-07-25 12:31:43 +00:00
|
|
|
*dst = dst_sa; /* XXX */
|
|
|
|
}
|
1999-11-22 02:45:11 +00:00
|
|
|
|
2005-07-25 12:31:43 +00:00
|
|
|
/*
|
|
|
|
* then rt (for unicast) and ifp must be non-NULL valid values.
|
|
|
|
*/
|
|
|
|
if ((flags & IPV6_FORWARDING) == 0) {
|
|
|
|
/* XXX: the FORWARDING flag can be set for mrouting. */
|
|
|
|
in6_ifstat_inc(ifp, ifs6_out_request);
|
|
|
|
}
|
|
|
|
if (rt != NULL) {
|
|
|
|
ia = (struct in6_ifaddr *)(rt->rt_ifa);
|
2014-03-05 01:17:47 +00:00
|
|
|
counter_u64_add(rt->rt_pksent, 1);
|
2005-07-25 12:31:43 +00:00
|
|
|
}
|
1999-11-22 02:45:11 +00:00
|
|
|
|
2019-01-09 14:28:08 +00:00
|
|
|
/* Setup data structures for scope ID checks. */
|
2005-07-25 12:31:43 +00:00
|
|
|
src0 = ip6->ip6_src;
|
|
|
|
bzero(&src_sa, sizeof(src_sa));
|
|
|
|
src_sa.sin6_family = AF_INET6;
|
|
|
|
src_sa.sin6_len = sizeof(src_sa);
|
|
|
|
src_sa.sin6_addr = ip6->ip6_src;
|
|
|
|
|
|
|
|
dst0 = ip6->ip6_dst;
|
|
|
|
/* re-initialize to be sure */
|
|
|
|
bzero(&dst_sa, sizeof(dst_sa));
|
|
|
|
dst_sa.sin6_family = AF_INET6;
|
|
|
|
dst_sa.sin6_len = sizeof(dst_sa);
|
|
|
|
dst_sa.sin6_addr = ip6->ip6_dst;
|
2009-09-05 16:43:16 +00:00
|
|
|
|
2019-01-09 14:28:08 +00:00
|
|
|
/* Check for valid scope ID. */
|
|
|
|
if (in6_setscope(&src0, ifp, &zone) == 0 &&
|
|
|
|
sa6_recoverscope(&src_sa) == 0 && zone == src_sa.sin6_scope_id &&
|
|
|
|
in6_setscope(&dst0, ifp, &zone) == 0 &&
|
|
|
|
sa6_recoverscope(&dst_sa) == 0 && zone == dst_sa.sin6_scope_id) {
|
|
|
|
/*
|
|
|
|
* The outgoing interface is in the zone of the source
|
|
|
|
* and destination addresses.
|
|
|
|
*
|
|
|
|
* Because the loopback interface cannot receive
|
|
|
|
* packets with a different scope ID than its own,
|
|
|
|
* the trick is to pretend the outgoing packet
|
|
|
|
* was received by the real network interface, by
|
|
|
|
* setting "origifp" different from "ifp". This is
|
|
|
|
* only allowed when "ifp" is a loopback network
|
|
|
|
* interface. Refer to code in nd6_output_ifp() for
|
|
|
|
* more details.
|
|
|
|
*/
|
|
|
|
origifp = ifp;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We should use ia_ifp to support the case of sending
|
|
|
|
* packets to an address of our own.
|
|
|
|
*/
|
|
|
|
if (ia != NULL && ia->ia_ifp)
|
|
|
|
ifp = ia->ia_ifp;
|
|
|
|
|
|
|
|
} else if ((ifp->if_flags & IFF_LOOPBACK) == 0 ||
|
|
|
|
sa6_recoverscope(&src_sa) != 0 ||
|
|
|
|
sa6_recoverscope(&dst_sa) != 0 ||
|
|
|
|
dst_sa.sin6_scope_id == 0 ||
|
|
|
|
(src_sa.sin6_scope_id != 0 &&
|
|
|
|
src_sa.sin6_scope_id != dst_sa.sin6_scope_id) ||
|
|
|
|
(origifp = ifnet_byindex(dst_sa.sin6_scope_id)) == NULL) {
|
|
|
|
/*
|
|
|
|
* If the destination network interface is not a
|
|
|
|
* loopback interface, or the destination network
|
|
|
|
* address has no scope ID, or the source address has
|
|
|
|
* a scope ID set which is different from the
|
|
|
|
* destination address one, or there is no network
|
|
|
|
* interface representing this scope ID, the address
|
|
|
|
* pair is considered invalid.
|
|
|
|
*/
|
|
|
|
IP6STAT_INC(ip6s_badscope);
|
|
|
|
in6_ifstat_inc(ifp, ifs6_out_discard);
|
|
|
|
if (error == 0)
|
|
|
|
error = EHOSTUNREACH; /* XXX */
|
|
|
|
goto bad;
|
|
|
|
}
|
1999-11-22 02:45:11 +00:00
|
|
|
|
2019-01-09 14:28:08 +00:00
|
|
|
/* All scope ID checks are successful. */
|
2005-07-25 12:31:43 +00:00
|
|
|
|
|
|
|
if (rt && !IN6_IS_ADDR_MULTICAST(&ip6->ip6_dst)) {
|
|
|
|
if (opt && opt->ip6po_nextroute.ro_rt) {
|
1999-11-22 02:45:11 +00:00
|
|
|
/*
|
2005-07-25 12:31:43 +00:00
|
|
|
* The nexthop is explicitly specified by the
|
|
|
|
* application. We assume the next hop is an IPv6
|
|
|
|
* address.
|
1999-11-22 02:45:11 +00:00
|
|
|
*/
|
2005-07-25 12:31:43 +00:00
|
|
|
dst = (struct sockaddr_in6 *)opt->ip6po_nexthop;
|
1999-11-22 02:45:11 +00:00
|
|
|
}
|
2005-07-25 12:31:43 +00:00
|
|
|
else if ((rt->rt_flags & RTF_GATEWAY))
|
|
|
|
dst = (struct sockaddr_in6 *)rt->rt_gateway;
|
|
|
|
}
|
1999-11-22 02:45:11 +00:00
|
|
|
|
2005-07-25 12:31:43 +00:00
|
|
|
if (!IN6_IS_ADDR_MULTICAST(&ip6->ip6_dst)) {
|
|
|
|
m->m_flags &= ~(M_BCAST | M_MCAST); /* just in case */
|
|
|
|
} else {
|
|
|
|
m->m_flags = (m->m_flags & ~M_BCAST) | M_MCAST;
|
1999-11-22 02:45:11 +00:00
|
|
|
in6_ifstat_inc(ifp, ifs6_out_mcast);
|
|
|
|
/*
|
|
|
|
* Confirm that the outgoing interface supports multicast.
|
|
|
|
*/
|
2005-07-25 12:31:43 +00:00
|
|
|
if (!(ifp->if_flags & IFF_MULTICAST)) {
|
2013-04-09 07:11:22 +00:00
|
|
|
IP6STAT_INC(ip6s_noroute);
|
1999-11-22 02:45:11 +00:00
|
|
|
in6_ifstat_inc(ifp, ifs6_out_discard);
|
|
|
|
error = ENETUNREACH;
|
|
|
|
goto bad;
|
|
|
|
}
|
Bite the bullet, and make the IPv6 SSM and MLDv2 mega-commit:
import from p4 bms_netdev. Summary of changes:
* Connect netinet6/in6_mcast.c to build.
The legacy KAME KPIs are mostly preserved.
* Eliminate now dead code from ip6_output.c.
Don't do mbuf bingo, we are not going to do RFC 2292 style
CMSG tricks for multicast options as they are not required
by any current IPv6 normative reference.
* Refactor transports (UDP, raw_ip6) to do own mcast filtering.
SCTP, TCP unaffected by this change.
* Add ip6_msource, in6_msource structs to in6_var.h.
* Hookup mld_ifinfo state to in6_ifextra, allocate from
domifattach path.
* Eliminate IN6_LOOKUP_MULTI(), it is no longer referenced.
Kernel consumers which need this should use in6m_lookup().
* Refactor IPv6 socket group memberships to use a vector (like IPv4).
* Update ifmcstat(8) for IPv6 SSM.
* Add witness lock order for IN6_MULTI_LOCK.
* Move IN6_MULTI_LOCK out of lower ip6_output()/ip6_input() paths.
* Introduce IP6STAT_ADD/SUB/INC/DEC as per rwatson's IPv4 cleanup.
* Update carp(4) for new IPv6 SSM KPIs.
* Virtualize ip6_mrouter socket.
Changes mostly localized to IPv6 MROUTING.
* Don't do a local group lookup in MROUTING.
* Kill unused KAME prototypes in6_purgemkludge(), in6_restoremkludge().
* Preserve KAME DAD timer jitter behaviour in MLDv1 compatibility mode.
* Bump __FreeBSD_version to 800084.
* Update UPDATING.
NOTE WELL:
* This code hasn't been tested against real MLDv2 queriers
(yet), although the on-wire protocol has been verified in Wireshark.
* There are a few unresolved issues in the socket layer APIs to
do with scope ID propagation.
* There is a LOR present in ip6_output()'s use of
in6_setscope() which needs to be resolved. See comments in mld6.c.
This is believed to be benign and can't be avoided for the moment
without re-introducing an indirect netisr.
This work was mostly derived from the IGMPv3 implementation, and
has been sponsored by a third party.
2009-04-29 19:19:13 +00:00
|
|
|
if ((im6o == NULL && in6_mcast_loop) ||
|
|
|
|
(im6o && im6o->im6o_multicast_loop)) {
|
1999-11-22 02:45:11 +00:00
|
|
|
/*
|
2009-04-29 19:19:13 +00:00
|
|
|
* Loop back multicast datagram if not expressly
|
|
|
|
* forbidden to do so, even if we have not joined
|
|
|
|
* the address; protocols will filter it later,
|
|
|
|
* thus deferring a hash lookup and lock acquisition
|
|
|
|
* at the expense of an m_copym().
|
1999-11-22 02:45:11 +00:00
|
|
|
*/
|
2015-08-08 15:58:35 +00:00
|
|
|
ip6_mloopback(ifp, m);
|
2000-01-28 05:27:14 +00:00
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* If we are acting as a multicast router, perform
|
|
|
|
* multicast forwarding as if the packet had just
|
|
|
|
* arrived on the interface to which we are about
|
|
|
|
* to send. The multicast forwarding function
|
|
|
|
* recursively calls this function, using the
|
|
|
|
* IPV6_FORWARDING flag to prevent infinite recursion.
|
|
|
|
*
|
|
|
|
* Multicasts that are looped back by ip6_mloopback(),
|
|
|
|
* above, will be forwarded by the ip6_input() routine,
|
|
|
|
* if necessary.
|
|
|
|
*/
|
2009-04-29 19:19:13 +00:00
|
|
|
if (V_ip6_mrouter && (flags & IPV6_FORWARDING) == 0) {
|
2005-07-25 12:31:43 +00:00
|
|
|
/*
|
|
|
|
* XXX: ip6_mforward expects that rcvif is NULL
|
|
|
|
* when it is called from the originating path.
|
2013-03-15 12:50:29 +00:00
|
|
|
* However, it may not always be the case.
|
2005-07-25 12:31:43 +00:00
|
|
|
*/
|
|
|
|
m->m_pkthdr.rcvif = NULL;
|
2000-07-04 16:35:15 +00:00
|
|
|
if (ip6_mforward(ip6, ifp, m) != 0) {
|
2000-01-28 05:27:14 +00:00
|
|
|
m_freem(m);
|
|
|
|
goto done;
|
|
|
|
}
|
|
|
|
}
|
1999-11-22 02:45:11 +00:00
|
|
|
}
|
|
|
|
/*
|
|
|
|
* Multicasts with a hoplimit of zero may be looped back,
|
|
|
|
* above, but must not be transmitted on a network.
|
|
|
|
* Also, multicasts addressed to the loopback interface
|
|
|
|
* are not sent -- the above call to ip6_mloopback() will
|
|
|
|
* loop back a copy if this host actually belongs to the
|
|
|
|
* destination group on the loopback interface.
|
|
|
|
*/
|
2003-10-24 18:26:30 +00:00
|
|
|
if (ip6->ip6_hlim == 0 || (ifp->if_flags & IFF_LOOPBACK) ||
|
|
|
|
IN6_IS_ADDR_MC_INTFACELOCAL(&ip6->ip6_dst)) {
|
1999-11-22 02:45:11 +00:00
|
|
|
m_freem(m);
|
|
|
|
goto done;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Fill the outgoing interface to tell the upper layer
|
|
|
|
* to increment per-interface statistics.
|
|
|
|
*/
|
|
|
|
if (ifpp)
|
|
|
|
*ifpp = ifp;
|
|
|
|
|
2003-10-20 15:27:48 +00:00
|
|
|
/* Determine path MTU. */
|
2016-05-19 12:45:20 +00:00
|
|
|
if ((error = ip6_getpmtu(ro_pmtu, ro != ro_pmtu, ifp, &ip6->ip6_dst,
|
2016-08-01 17:02:21 +00:00
|
|
|
&mtu, &alwaysfrag, fibnum, *nexthdrp)) != 0)
|
2003-10-20 15:27:48 +00:00
|
|
|
goto bad;
|
1999-11-22 02:45:11 +00:00
|
|
|
|
2001-06-11 12:39:29 +00:00
|
|
|
/*
|
2004-02-08 18:22:27 +00:00
|
|
|
* The caller of this function may specify to use the minimum MTU
|
|
|
|
* in some cases.
|
|
|
|
* An advanced API option (IPV6_USE_MIN_MTU) can also override MTU
|
|
|
|
* setting. The logic is a bit complicated; by default, unicast
|
|
|
|
* packets will follow path MTU while multicast packets will be sent at
|
|
|
|
* the minimum MTU. If IP6PO_MINMTU_ALL is specified, all packets
|
|
|
|
* including unicast ones will be sent at the minimum MTU. Multicast
|
|
|
|
* packets will always be sent at the minimum MTU unless
|
|
|
|
* IP6PO_MINMTU_DISABLE is explicitly specified.
|
|
|
|
* See RFC 3542 for more details.
|
2001-06-11 12:39:29 +00:00
|
|
|
*/
|
2004-02-08 18:22:27 +00:00
|
|
|
if (mtu > IPV6_MMTU) {
|
|
|
|
if ((flags & IPV6_MINMTU))
|
|
|
|
mtu = IPV6_MMTU;
|
|
|
|
else if (opt && opt->ip6po_minmtu == IP6PO_MINMTU_ALL)
|
|
|
|
mtu = IPV6_MMTU;
|
|
|
|
else if (IN6_IS_ADDR_MULTICAST(&ip6->ip6_dst) &&
|
|
|
|
(opt == NULL ||
|
|
|
|
opt->ip6po_minmtu != IP6PO_MINMTU_DISABLE)) {
|
|
|
|
mtu = IPV6_MMTU;
|
|
|
|
}
|
|
|
|
}
|
2001-06-11 12:39:29 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* clear embedded scope identifiers if necessary.
|
|
|
|
* in6_clearscope will touch the addresses only when necessary.
|
|
|
|
*/
|
|
|
|
in6_clearscope(&ip6->ip6_src);
|
|
|
|
in6_clearscope(&ip6->ip6_dst);
|
1999-11-22 02:45:11 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If the outgoing packet contains a hop-by-hop options header,
|
|
|
|
* it must be examined and processed even by the source node.
|
|
|
|
* (RFC 2460, section 4.)
|
|
|
|
*/
|
|
|
|
if (exthdrs.ip6e_hbh) {
|
2001-06-11 12:39:29 +00:00
|
|
|
struct ip6_hbh *hbh = mtod(exthdrs.ip6e_hbh, struct ip6_hbh *);
|
2005-02-27 18:07:18 +00:00
|
|
|
u_int32_t dummy; /* XXX unused */
|
|
|
|
u_int32_t plen = 0; /* XXX: ip6_process_hopopts() will check the value */
|
1999-11-22 02:45:11 +00:00
|
|
|
|
2001-06-11 12:39:29 +00:00
|
|
|
#ifdef DIAGNOSTIC
|
|
|
|
if ((hbh->ip6h_len + 1) << 3 > exthdrs.ip6e_hbh->m_len)
|
2010-11-27 21:51:39 +00:00
|
|
|
panic("ip6e_hbh is not contiguous");
|
2001-06-11 12:39:29 +00:00
|
|
|
#endif
|
1999-11-22 02:45:11 +00:00
|
|
|
/*
|
|
|
|
* XXX: if we have to send an ICMPv6 error to the sender,
|
|
|
|
* we need the M_LOOP flag since icmp6_error() expects
|
|
|
|
* the IPv6 and hop-by-hop options headers to be
|
2010-11-27 21:51:39 +00:00
|
|
|
* contiguous unless the flag is set.
|
1999-11-22 02:45:11 +00:00
|
|
|
*/
|
|
|
|
m->m_flags |= M_LOOP;
|
|
|
|
m->m_pkthdr.rcvif = ifp;
|
2003-10-08 18:26:08 +00:00
|
|
|
if (ip6_process_hopopts(m, (u_int8_t *)(hbh + 1),
|
|
|
|
((hbh->ip6h_len + 1) << 3) - sizeof(struct ip6_hbh),
|
2005-02-27 18:07:18 +00:00
|
|
|
&dummy, &plen) < 0) {
|
1999-11-22 02:45:11 +00:00
|
|
|
/* m was already freed at this point */
|
|
|
|
error = EINVAL; /* better error? */
|
|
|
|
goto done;
|
|
|
|
}
|
|
|
|
m->m_flags &= ~M_LOOP; /* XXX */
|
|
|
|
m->m_pkthdr.rcvif = NULL;
|
|
|
|
}
|
|
|
|
|
2004-08-27 15:16:24 +00:00
|
|
|
/* Jump over all PFIL processing if hooks are not active. */
|
New pfil(9) KPI together with newborn pfil API and control utility.
The KPI have been reviewed and cleansed of features that were planned
back 20 years ago and never implemented. The pfil(9) internals have
been made opaque to protocols with only returned types and function
declarations exposed. The KPI is made more strict, but at the same time
more extensible, as kernel uses same command structures that userland
ioctl uses.
In a nutshell, the [KA]PI is about declaring filtering points, declaring
filters and linking and unlinking them together.
New [KA]PI makes it possible to reconfigure pfil(9) configuration:
change order of hooks, rehook filter from one filtering point to a
different one, disconnect a hook on output leaving it on input only,
prepend/append a filter to existing list of filters.
Now it is possible for a single packet filter to provide multiple rulesets
that may be linked to different points. Think of per-interface ACLs in
Cisco or Juniper. None of the existing packet filters support that yet;
however, limited usage is already possible, e.g. the default ruleset can
be moved to a single interface, as soon as interfaces provide their
filtering points.
Another future feature is the possibility to create pfil heads that provide
not an mbuf pointer but just a memory pointer with length. That would
allow filtering at very early stages of a packet lifecycle, e.g. when
packet has just been received by a NIC and no mbuf was yet allocated.
Differential Revision: https://reviews.freebsd.org/D18951
2019-01-31 23:01:03 +00:00
|
|
|
if (!PFIL_HOOKED_OUT(V_inet6_pfil_head))
|
2004-08-27 15:16:24 +00:00
|
|
|
goto passout;
|
|
|
|
|
2005-04-18 18:35:05 +00:00
|
|
|
odst = ip6->ip6_dst;
|
2004-08-27 15:16:24 +00:00
|
|
|
/* Run through list of hooks for output packets. */
|
2019-01-31 23:01:03 +00:00
|
|
|
switch (pfil_run_hooks(V_inet6_pfil_head, &m, ifp, PFIL_OUT, inp)) {
|
|
|
|
case PFIL_PASS:
|
|
|
|
ip6 = mtod(m, struct ip6_hdr *);
|
|
|
|
break;
|
|
|
|
case PFIL_DROPPED:
|
|
|
|
error = EPERM;
|
|
|
|
/* FALLTHROUGH */
|
|
|
|
case PFIL_CONSUMED:
|
2003-09-23 17:54:04 +00:00
|
|
|
goto done;
|
2019-01-31 23:01:03 +00:00
|
|
|
}
|
2003-10-08 18:26:08 +00:00
|
|
|
|
2014-10-02 00:25:57 +00:00
|
|
|
needfiblookup = 0;
|
2005-04-18 18:35:05 +00:00
|
|
|
/* See if destination IP address was changed by packet filter. */
|
|
|
|
if (!IN6_ARE_ADDR_EQUAL(&odst, &ip6->ip6_dst)) {
|
|
|
|
m->m_flags |= M_SKIP_FIREWALL;
|
|
|
|
/* If the destination is now ourselves, drop to ip6_input(). */
|
2011-08-20 17:05:11 +00:00
|
|
|
if (in6_localip(&ip6->ip6_dst)) {
|
|
|
|
m->m_flags |= M_FASTFWD_OURS;
|
2005-04-18 18:35:05 +00:00
|
|
|
if (m->m_pkthdr.rcvif == NULL)
|
Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).
This is the first in a series of commits over the course
of the next few weeks.
Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.
We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.
Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch
2008-08-17 23:27:27 +00:00
|
|
|
m->m_pkthdr.rcvif = V_loif;
|
It turns out that too many drivers are not only parsing the L2/3/4
headers for TSO but also for generic checksum offloading. Ideally we
would only have one common function shared amongst all drivers, and
perhaps when updating them for IPv6 we should introduce that.
Eventually we should provide the meta information along with mbufs to
avoid (re-)parsing entirely.
To not break IPv6 (checksums and offload) and to be able to MFC the
changes without risking to hurt 3rd party drivers, duplicate the v4
framework, as other OSes have done as well.
Introduce interface capability flags for TX/RX checksum offload with
IPv6, to allow independent toggling (where possible). Add CSUM_*_IPV6
flags for UDP/TCP over IPv6, and reserve further for SCTP, and IPv6
fragmentation. Define CSUM_DELAY_DATA_IPV6 as we do for legacy IP and
add an alias for CSUM_DATA_VALID_IPV6.
This pretty much brings IPv6 handling in line with IPv4.
TSO is still handled in a different way and not via if_hwassist.
Update ifconfig to allow (un)setting of the new capability flags.
Update loopback to announce the new capabilities and if_hwassist flags.
Individual driver updates will have to follow, as will SCTP.
Reported by: gallatin, dim, ..
Reviewed by: gallatin (glanced at?)
MFC after: 3 days
X-MFC with: r235961,235959,235958
2012-05-28 09:30:13 +00:00
|
|
|
if (m->m_pkthdr.csum_flags & CSUM_DELAY_DATA_IPV6) {
|
2005-04-18 18:35:05 +00:00
|
|
|
m->m_pkthdr.csum_flags |=
|
2012-05-28 09:30:13 +00:00
|
|
|
CSUM_DATA_VALID_IPV6 | CSUM_PSEUDO_HDR;
|
2005-04-18 18:35:05 +00:00
|
|
|
m->m_pkthdr.csum_data = 0xffff;
|
|
|
|
}
|
2010-03-12 08:10:30 +00:00
|
|
|
#ifdef SCTP
|
2012-05-30 20:56:07 +00:00
|
|
|
if (m->m_pkthdr.csum_flags & CSUM_SCTP_IPV6)
|
2010-03-12 08:10:30 +00:00
|
|
|
m->m_pkthdr.csum_flags |= CSUM_SCTP_VALID;
|
|
|
|
#endif
|
2005-04-18 18:35:05 +00:00
|
|
|
error = netisr_queue(NETISR_IPV6, m);
|
|
|
|
goto done;
|
2016-05-17 14:06:55 +00:00
|
|
|
} else {
|
2018-09-03 22:27:27 +00:00
|
|
|
RO_INVALIDATE_CACHE(ro);
|
2014-10-02 00:25:57 +00:00
|
|
|
needfiblookup = 1; /* Redo the routing table lookup. */
|
2016-05-17 14:06:55 +00:00
|
|
|
}
|
2005-04-18 18:35:05 +00:00
|
|
|
}
|
2014-10-02 00:25:57 +00:00
|
|
|
/* See if fib was changed by packet filter. */
|
|
|
|
if (fibnum != M_GETFIB(m)) {
|
|
|
|
m->m_flags |= M_SKIP_FIREWALL;
|
|
|
|
fibnum = M_GETFIB(m);
|
2018-09-03 22:27:27 +00:00
|
|
|
RO_INVALIDATE_CACHE(ro);
|
2014-10-02 00:25:57 +00:00
|
|
|
needfiblookup = 1;
|
|
|
|
}
|
|
|
|
if (needfiblookup)
|
|
|
|
goto again;
|
2005-04-18 18:35:05 +00:00
|
|
|
|
2011-08-20 17:05:11 +00:00
|
|
|
/* See if local, if yes, send it to netisr. */
|
|
|
|
if (m->m_flags & M_FASTFWD_OURS) {
|
|
|
|
if (m->m_pkthdr.rcvif == NULL)
|
|
|
|
m->m_pkthdr.rcvif = V_loif;
|
2012-05-28 09:30:13 +00:00
|
|
|
if (m->m_pkthdr.csum_flags & CSUM_DELAY_DATA_IPV6) {
|
2011-08-20 17:05:11 +00:00
|
|
|
m->m_pkthdr.csum_flags |=
|
2012-05-28 09:30:13 +00:00
|
|
|
CSUM_DATA_VALID_IPV6 | CSUM_PSEUDO_HDR;
|
2011-08-20 17:05:11 +00:00
|
|
|
m->m_pkthdr.csum_data = 0xffff;
|
|
|
|
}
|
|
|
|
#ifdef SCTP
|
2012-05-30 20:56:07 +00:00
|
|
|
if (m->m_pkthdr.csum_flags & CSUM_SCTP_IPV6)
|
2012-05-25 02:17:16 +00:00
|
|
|
m->m_pkthdr.csum_flags |= CSUM_SCTP_VALID;
|
2012-05-30 20:56:07 +00:00
|
|
|
#endif
|
2011-08-20 17:05:11 +00:00
|
|
|
error = netisr_queue(NETISR_IPV6, m);
|
|
|
|
goto done;
|
|
|
|
}
|
|
|
|
/* Or forward to some other address? */
|
2012-11-02 01:20:55 +00:00
|
|
|
if ((m->m_flags & M_IP6_NEXTHOP) &&
|
|
|
|
(fwd_tag = m_tag_find(m, PACKET_TAG_IPFORWARD, NULL)) != NULL) {
|
2011-08-20 17:05:11 +00:00
|
|
|
dst = (struct sockaddr_in6 *)&ro->ro_dst;
|
2012-12-19 17:28:17 +00:00
|
|
|
bcopy((fwd_tag+1), &dst_sa, sizeof(struct sockaddr_in6));
|
2011-08-20 17:05:11 +00:00
|
|
|
m->m_flags |= M_SKIP_FIREWALL;
|
2012-11-02 01:20:55 +00:00
|
|
|
m->m_flags &= ~M_IP6_NEXTHOP;
|
2011-08-20 17:05:11 +00:00
|
|
|
m_tag_delete(m, fwd_tag);
|
|
|
|
goto again;
|
|
|
|
}
|
2005-04-18 18:35:05 +00:00
|
|
|
|
2004-08-27 15:16:24 +00:00
|
|
|
passout:
|
1999-11-22 02:45:11 +00:00
|
|
|
/*
|
|
|
|
* Send the packet to the outgoing interface.
|
|
|
|
* If necessary, do IPv6 fragmentation before sending.
|
2003-10-24 18:26:30 +00:00
|
|
|
*
|
|
|
|
* the logic here is rather complex:
|
|
|
|
* 1: normal case (dontfrag == 0, alwaysfrag == 0)
|
|
|
|
* 1-a: send as is if tlen <= path mtu
|
|
|
|
* 1-b: fragment if tlen > path mtu
|
|
|
|
*
|
|
|
|
* 2: if user asks us not to fragment (dontfrag == 1)
|
|
|
|
* 2-a: send as is if tlen <= interface mtu
|
|
|
|
* 2-b: error if tlen > interface mtu
|
|
|
|
*
|
|
|
|
* 3: if we always need to attach fragment header (alwaysfrag == 1)
|
|
|
|
* always fragment
|
|
|
|
*
|
|
|
|
* 4: if dontfrag == 1 && alwaysfrag == 1
|
|
|
|
* error, as we cannot handle this conflicting request
|
1999-11-22 02:45:11 +00:00
|
|
|
*/
|
2012-05-25 02:17:16 +00:00
|
|
|
sw_csum = m->m_pkthdr.csum_flags;
|
|
|
|
if (!hdrsplit) {
|
|
|
|
tso = ((sw_csum & ifp->if_hwassist & CSUM_TSO) != 0) ? 1 : 0;
|
|
|
|
sw_csum &= ~ifp->if_hwassist;
|
|
|
|
} else
|
|
|
|
tso = 0;
|
|
|
|
/*
|
|
|
|
* If we added extension headers, we will not do TSO and calculate the
|
|
|
|
* checksums ourselves for now.
|
|
|
|
* XXX-BZ Need a framework to know when the NIC can handle it, even
|
|
|
|
* with ext. hdrs.
|
|
|
|
*/
|
2012-05-28 09:30:13 +00:00
|
|
|
if (sw_csum & CSUM_DELAY_DATA_IPV6) {
|
|
|
|
sw_csum &= ~CSUM_DELAY_DATA_IPV6;
|
Add an external mbuf buffer type that holds multiple unmapped pages.
Unmapped mbufs allow sendfile to carry multiple pages of data in a
single mbuf, without mapping those pages. It is a requirement for
Netflix's in-kernel TLS, and provides a 5-10% CPU savings on heavy web
serving workloads when used by sendfile, due to effectively
compressing socket buffers by an order of magnitude, and hence
reducing cache misses.
For this new external mbuf buffer type (EXT_PGS), the ext_buf pointer
now points to a struct mbuf_ext_pgs structure instead of a data
buffer. This structure contains an array of physical addresses (this
reduces cache misses compared to an earlier version that stored an
array of vm_page_t pointers). It also stores additional fields needed
for in-kernel TLS such as the TLS header and trailer data that are
currently unused. To more easily detect these mbufs, the M_NOMAP flag
is set in m_flags in addition to M_EXT.
Various functions like m_copydata() have been updated to safely access
packet contents (using uiomove_fromphys()), to make things like BPF
safe.
NIC drivers advertise support for unmapped mbufs on transmit via a new
IFCAP_NOMAP capability. This capability can be toggled via the new
'nomap' and '-nomap' ifconfig(8) commands. For NIC drivers that only
transmit packet contents via DMA and use bus_dma, adding the
capability to if_capabilities and if_capenable should be all that is
required.
If a NIC does not support unmapped mbufs, they are converted to a
chain of mapped mbufs (using sf_bufs to provide the mapping) in
ip_output or ip6_output. If an unmapped mbuf requires software
checksums, it is also converted to a chain of mapped mbufs before
computing the checksum.
Submitted by: gallatin (earlier version)
Reviewed by: gallatin, hselasky, rrs
Discussed with: ae, kp (firewalls)
Relnotes: yes
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20616
2019-06-29 00:48:33 +00:00
|
|
|
m = mb_unmapped_to_ext(m);
|
|
|
|
if (m == NULL) {
|
|
|
|
error = ENOBUFS;
|
|
|
|
IP6STAT_INC(ip6s_odropped);
|
|
|
|
goto bad;
|
|
|
|
}
|
2012-05-26 23:58:51 +00:00
|
|
|
in6_delayed_cksum(m, plen, sizeof(struct ip6_hdr));
|
2019-06-29 00:48:33 +00:00
|
|
|
} else if ((ifp->if_capenable & IFCAP_NOMAP) == 0) {
|
|
|
|
m = mb_unmapped_to_ext(m);
|
|
|
|
if (m == NULL) {
|
|
|
|
error = ENOBUFS;
|
|
|
|
IP6STAT_INC(ip6s_odropped);
|
|
|
|
goto bad;
|
|
|
|
}
|
2012-05-25 02:17:16 +00:00
|
|
|
}
|
2010-03-12 08:10:30 +00:00
|
|
|
#ifdef SCTP
|
2012-05-30 20:56:07 +00:00
|
|
|
if (sw_csum & CSUM_SCTP_IPV6) {
|
|
|
|
sw_csum &= ~CSUM_SCTP_IPV6;
|
2019-06-29 00:48:33 +00:00
|
|
|
m = mb_unmapped_to_ext(m);
|
|
|
|
if (m == NULL) {
|
|
|
|
error = ENOBUFS;
|
|
|
|
IP6STAT_INC(ip6s_odropped);
|
|
|
|
goto bad;
|
|
|
|
}
|
2012-05-25 02:17:16 +00:00
|
|
|
sctp_delayed_cksum(m, sizeof(struct ip6_hdr));
|
2010-03-12 08:10:30 +00:00
|
|
|
}
|
|
|
|
#endif
|
2012-05-25 02:17:16 +00:00
|
|
|
m->m_pkthdr.csum_flags &= ifp->if_hwassist;
|
1999-11-22 02:45:11 +00:00
|
|
|
tlen = m->m_pkthdr.len;
|
2003-10-24 18:26:30 +00:00
|
|
|
|
2012-05-25 02:17:16 +00:00
|
|
|
if ((opt && (opt->ip6po_flags & IP6PO_DONTFRAG)) || tso)
|
2003-10-24 18:26:30 +00:00
|
|
|
dontfrag = 1;
|
|
|
|
else
|
|
|
|
dontfrag = 0;
|
|
|
|
if (dontfrag && alwaysfrag) { /* case 4 */
|
|
|
|
/* conflicting request - can't transmit */
|
|
|
|
error = EMSGSIZE;
|
|
|
|
goto bad;
|
|
|
|
}
|
2012-05-25 02:17:16 +00:00
|
|
|
if (dontfrag && tlen > IN6_LINKMTU(ifp) && !tso) { /* case 2-b */
|
2003-10-24 18:26:30 +00:00
|
|
|
/*
|
|
|
|
* Even if the DONTFRAG option is specified, we cannot send the
|
|
|
|
* packet when the data length is larger than the MTU of the
|
|
|
|
* outgoing interface.
|
2015-03-04 11:20:01 +00:00
|
|
|
* Notify the error by sending IPV6_PATHMTU ancillary data if
|
|
|
|
* application wanted to know the MTU value. Also return an
|
|
|
|
* error code (this is not described in the API spec).
|
2003-10-24 18:26:30 +00:00
|
|
|
*/
|
2015-03-04 11:20:01 +00:00
|
|
|
if (inp != NULL)
|
|
|
|
ip6_notify_pmtu(inp, &dst_sa, (u_int32_t)mtu);
|
2003-10-24 18:26:30 +00:00
|
|
|
error = EMSGSIZE;
|
|
|
|
goto bad;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* transmit packet without fragmentation
|
|
|
|
*/
|
|
|
|
if (dontfrag || (!alwaysfrag && tlen <= mtu)) { /* case 1-a and 2-a */
|
|
|
|
struct in6_ifaddr *ia6;
|
|
|
|
|
|
|
|
ip6 = mtod(m, struct ip6_hdr *);
|
|
|
|
ia6 = in6_ifawithifp(ifp, &ip6->ip6_src);
|
|
|
|
if (ia6) {
|
|
|
|
/* Record statistics for this interface address. */
|
2013-10-15 11:37:57 +00:00
|
|
|
counter_u64_add(ia6->ia_ifa.ifa_opackets, 1);
|
|
|
|
counter_u64_add(ia6->ia_ifa.ifa_obytes,
|
|
|
|
m->m_pkthdr.len);
|
2009-06-23 20:19:09 +00:00
|
|
|
ifa_free(&ia6->ia_ifa);
|
2003-10-24 18:26:30 +00:00
|
|
|
}
|
Restructure mbuf send tags to provide stronger guarantees.
- Perform ifp mismatch checks (to determine if a send tag is allocated
for a different ifp than the one the packet is being output on), in
ip_output() and ip6_output(). This avoids sending packets with send
tags to ifnet drivers that don't support send tags.
Since we are now checking for ifp mismatches before invoking
if_output, we can now try to allocate a new tag before invoking
if_output, sending the original packet on the new tag if allocation
succeeds.
To avoid code duplication for the fragment and unfragmented cases,
add ip_output_send() and ip6_output_send() as wrappers around
if_output and nd6_output_ifp, respectively. All of the logic for
setting send tags and dealing with send tag-related errors is done
in these wrapper functions.
For pseudo interfaces that wrap other network interfaces (vlan and
lagg), wrapper send tags are now allocated so that ip*_output see
the wrapper ifp as the ifp in the send tag. The if_transmit
routines rewrite the send tags after performing an ifp mismatch
check. If an ifp mismatch is detected, the transmit routines fail
with EAGAIN.
- To provide clearer life cycle management of send tags, especially
in the presence of vlan and lagg wrapper tags, add a reference count
to send tags managed via m_snd_tag_ref() and m_snd_tag_rele().
Provide a helper function (m_snd_tag_init()) for use by drivers
supporting send tags. m_snd_tag_init() takes care of the if_ref
on the ifp, meaning that code allocating send tags via if_snd_tag_alloc
no longer has to manage that manually. Similarly, m_snd_tag_rele
drops the refcount on the ifp after invoking if_snd_tag_free when
the last reference to a send tag is dropped.
This also closes use-after-free races if there are pending packets in
driver tx rings after the socket is closed (e.g. from tcpdrop).
In order for m_free to work reliably, add a new CSUM_SND_TAG flag in
csum_flags to indicate 'snd_tag' is set (rather than 'rcvif').
Drivers now also check this flag instead of checking snd_tag against
NULL. This avoids false positive matches when a forwarded packet
has a non-NULL rcvif that was treated as a send tag.
- cxgbe was relying on snd_tag_free being called when the inp was
detached so that it could kick the firmware to flush any pending
work on the flow. This is because the driver doesn't require ACK
messages from the firmware for every request, but instead does a
kind of manual interrupt coalescing by only setting a flag to
request a completion on a subset of requests. If all of the
in-flight requests don't have the flag when the tag is detached from
the inp, the flow might never return the credits. The current
snd_tag_free command issues a flush command to force the credits to
return. However, the credit return is what also frees the mbufs,
and since those mbufs now hold references on the tag, this meant
that snd_tag_free would never be called.
To fix, explicitly drop the mbuf's reference on the snd tag when the
mbuf is queued in the firmware work queue. This means that once the
inp's reference on the tag goes away and all in-flight mbufs have
been queued to the firmware, the tag's refcount will drop to zero and
snd_tag_free will kick in and send the flush request. Note that we
need to avoid doing this in the middle of ethofld_tx(), so the
driver grabs a temporary reference on the tag around that loop to
defer the free to the end of the function in case it sends the last
mbuf to the queue after the inp has dropped its reference on the
tag.
- mlx5 preallocates send tags and was using the ifp pointer even when
the send tag wasn't in use. Explicitly use the ifp from other data
structures instead.
- Sprinkle some assertions in various places to assert that received
packets don't have a send tag, and that other places that overwrite
rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer.
Reviewed by: gallatin, hselasky, rgrimes, ae
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20117
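
The reference-count life cycle described in this commit message can be sketched in plain C. This is a minimal illustration, not the kernel code: the sketch_* names and both structs are hypothetical, the counters are plain ints rather than the atomic reference counts a kernel would need, and the driver's free hook is reduced to a comment.

```c
#include <stddef.h>

/* Hypothetical stand-ins for struct ifnet and struct m_snd_tag. */
struct sketch_ifnet {
	int refs;			/* references held on the interface */
};

struct sketch_snd_tag {
	struct sketch_ifnet *ifp;	/* interface the tag was allocated for */
	int refcount;			/* holders: the inp plus in-flight mbufs */
};

/* Mirrors the role of m_snd_tag_init(): take the ifp reference once. */
static void
sketch_tag_init(struct sketch_snd_tag *tag, struct sketch_ifnet *ifp)
{
	ifp->refs++;
	tag->ifp = ifp;
	tag->refcount = 1;
}

/* Mirrors the role of m_snd_tag_ref(): one more holder. */
static void
sketch_tag_ref(struct sketch_snd_tag *tag)
{
	tag->refcount++;
}

/*
 * Mirrors the role of m_snd_tag_rele(): drop one holder. When the last
 * holder goes away, the driver's free hook would run, and only then is
 * the interface reference released. Returns 1 if the tag was freed.
 */
static int
sketch_tag_rele(struct sketch_snd_tag *tag)
{
	if (--tag->refcount > 0)
		return (0);
	/* The driver's if_snd_tag_free callback would run here. */
	tag->ifp->refs--;
	tag->ifp = NULL;
	return (1);
}
```

With this ordering, a packet still sitting in a transmit ring keeps the tag (and therefore the interface reference) alive after the socket has dropped its own reference, which is the use-after-free case the message mentions.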
2019-05-24 22:30:40 +00:00
|
|
|
error = ip6_output_send(inp, ifp, origifp, m, dst, ro);
|
1999-11-22 02:45:11 +00:00
|
|
|
goto done;
|
2003-10-24 18:26:30 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* try to fragment the packet. case 1-b and 3
|
|
|
|
*/
|
|
|
|
if (mtu < IPV6_MMTU) {
|
|
|
|
/* path MTU cannot be less than IPV6_MMTU */
|
1999-11-22 02:45:11 +00:00
|
|
|
error = EMSGSIZE;
|
|
|
|
in6_ifstat_inc(ifp, ifs6_out_fragfail);
|
|
|
|
goto bad;
|
2003-10-08 18:26:08 +00:00
|
|
|
} else if (ip6->ip6_plen == 0) {
|
|
|
|
/* jumbo payload cannot be fragmented */
|
1999-11-22 02:45:11 +00:00
|
|
|
error = EMSGSIZE;
|
|
|
|
in6_ifstat_inc(ifp, ifs6_out_fragfail);
|
|
|
|
goto bad;
|
|
|
|
} else {
|
|
|
|
u_char nextproto;
|
2007-07-01 11:41:27 +00:00
|
|
|
|
1999-11-22 02:45:11 +00:00
|
|
|
/*
|
|
|
|
* Too large for the destination or interface;
|
|
|
|
* fragment if possible.
|
|
|
|
* Must be able to put at least 8 bytes per fragment.
|
|
|
|
*/
|
|
|
|
hlen = unfragpartlen;
|
|
|
|
if (mtu > IPV6_MAXPACKET)
|
|
|
|
mtu = IPV6_MAXPACKET;
|
2001-06-11 12:39:29 +00:00
|
|
|
|
1999-11-22 02:45:11 +00:00
|
|
|
len = (mtu - hlen - sizeof(struct ip6_frag)) & ~7;
|
|
|
|
if (len < 8) {
|
|
|
|
error = EMSGSIZE;
|
|
|
|
in6_ifstat_inc(ifp, ifs6_out_fragfail);
|
|
|
|
goto bad;
|
|
|
|
}
|
|
|
|
|
2012-05-25 02:17:16 +00:00
|
|
|
/*
|
|
|
|
* If the interface will not calculate checksums on
|
|
|
|
* fragmented packets, then do it here.
|
|
|
|
* XXX-BZ handle the hw offloading case. Need flags.
|
|
|
|
*/
|
It turns out that too many drivers are not only parsing the L2/3/4
headers for TSO but also for generic checksum offloading. Ideally we
would only have one common function shared amongst all drivers, and
perhaps when updating them for IPv6 we should introduce that.
Eventually we should provide the meta information along with mbufs to
avoid (re-)parsing entirely.
To not break IPv6 (checksums and offload) and to be able to MFC the
changes without the risk of hurting 3rd party drivers, duplicate the v4
framework, as other OSes have done as well.
Introduce interface capability flags for TX/RX checksum offload with
IPv6, to allow independent toggling (where possible). Add CSUM_*_IPV6
flags for UDP/TCP over IPv6, and reserve further for SCTP, and IPv6
fragmentation. Define CSUM_DELAY_DATA_IPV6 as we do for legacy IP and
add an alias for CSUM_DATA_VALID_IPV6.
This pretty much brings IPv6 handling in line with IPv4.
TSO is still handled in a different way and not via if_hwassist.
Update ifconfig to allow (un)setting of the new capability flags.
Update loopback to announce the new capabilities and if_hwassist flags.
Individual driver updates will have to follow, as will SCTP.
Reported by: gallatin, dim, ..
Reviewed by: gallatin (glanced at?)
MFC after: 3 days
X-MFC with: r235961,235959,235958
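
As a rough illustration of how requested checksum work is divided between hardware and software using if_hwassist, the split can be written as two bit masks. The flag values and sketch_* names below are hypothetical (the real CSUM_*_IPV6 constants are defined in the kernel headers), and the helper only models the masking, not the checksum computation itself.

```c
#include <stdint.h>

/* Hypothetical flag bits, chosen only for this example. */
#define SKETCH_CSUM_TCP_IPV6	0x1u
#define SKETCH_CSUM_UDP_IPV6	0x2u
#define SKETCH_CSUM_SCTP_IPV6	0x4u

struct sketch_csum_split {
	uint32_t hw;	/* work the interface advertises it can do */
	uint32_t sw;	/* work that must be finished before transmit */
};

/*
 * Split the checksum flags requested for a packet according to the
 * interface's hardware-assist mask.
 */
static struct sketch_csum_split
sketch_split_csum(uint32_t requested, uint32_t hwassist)
{
	struct sketch_csum_split s;

	s.hw = requested & hwassist;
	s.sw = requested & ~hwassist;
	return (s);
}
```

For a packet requesting TCP and SCTP checksums on an interface that assists only TCP and UDP, hw holds the TCP bit and sw holds the SCTP bit, so the SCTP checksum would have to be computed in software.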
2012-05-28 09:30:13 +00:00
|
|
|
if (m->m_pkthdr.csum_flags & CSUM_DELAY_DATA_IPV6) {
|
Add an external mbuf buffer type that holds multiple unmapped pages.
Unmapped mbufs allow sendfile to carry multiple pages of data in a
single mbuf, without mapping those pages. It is a requirement for
Netflix's in-kernel TLS, and provides a 5-10% CPU savings on heavy web
serving workloads when used by sendfile, due to effectively
compressing socket buffers by an order of magnitude, and hence
reducing cache misses.
For this new external mbuf buffer type (EXT_PGS), the ext_buf pointer
now points to a struct mbuf_ext_pgs structure instead of a data
buffer. This structure contains an array of physical addresses (this
reduces cache misses compared to an earlier version that stored an
array of vm_page_t pointers). It also stores additional fields needed
for in-kernel TLS such as the TLS header and trailer data that are
currently unused. To more easily detect these mbufs, the M_NOMAP flag
is set in m_flags in addition to M_EXT.
Various functions like m_copydata() have been updated to safely access
packet contents (using uiomove_fromphys()), to make things like BPF
safe.
NIC drivers advertise support for unmapped mbufs on transmit via a new
IFCAP_NOMAP capability. This capability can be toggled via the new
'nomap' and '-nomap' ifconfig(8) commands. For NIC drivers that only
transmit packet contents via DMA and use bus_dma, adding the
capability to if_capabilities and if_capenable should be all that is
required.
If a NIC does not support unmapped mbufs, they are converted to a
chain of mapped mbufs (using sf_bufs to provide the mapping) in
ip_output or ip6_output. If an unmapped mbuf requires software
checksums, it is also converted to a chain of mapped mbufs before
computing the checksum.
Submitted by: gallatin (earlier version)
Reviewed by: gallatin, hselasky, rrs
Discussed with: ae, kp (firewalls)
Relnotes: yes
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20616
2019-06-29 00:48:33 +00:00
|
|
|
m = mb_unmapped_to_ext(m);
|
|
|
|
if (m == NULL) {
|
|
|
|
in6_ifstat_inc(ifp, ifs6_out_fragfail);
|
|
|
|
error = ENOBUFS;
|
|
|
|
goto bad;
|
|
|
|
}
|
2012-05-26 23:58:51 +00:00
|
|
|
in6_delayed_cksum(m, plen, hlen);
|
2012-05-28 09:30:13 +00:00
|
|
|
m->m_pkthdr.csum_flags &= ~CSUM_DELAY_DATA_IPV6;
|
2012-05-25 02:17:16 +00:00
|
|
|
}
|
|
|
|
#ifdef SCTP
|
2012-05-30 20:56:07 +00:00
|
|
|
if (m->m_pkthdr.csum_flags & CSUM_SCTP_IPV6) {
|
2019-06-29 00:48:33 +00:00
|
|
|
m = mb_unmapped_to_ext(m);
|
|
|
|
if (m == NULL) {
|
|
|
|
in6_ifstat_inc(ifp, ifs6_out_fragfail);
|
|
|
|
error = ENOBUFS;
|
|
|
|
goto bad;
|
|
|
|
}
|
2012-05-25 02:17:16 +00:00
|
|
|
sctp_delayed_cksum(m, hlen);
|
2012-05-30 20:56:07 +00:00
|
|
|
m->m_pkthdr.csum_flags &= ~CSUM_SCTP_IPV6;
|
2012-05-25 02:17:16 +00:00
|
|
|
}
|
|
|
|
#endif
|
1999-11-22 02:45:11 +00:00
|
|
|
/*
|
|
|
|
* Change the next header field of the last header in the
|
|
|
|
* unfragmentable part.
|
|
|
|
*/
|
|
|
|
if (exthdrs.ip6e_rthdr) {
|
|
|
|
nextproto = *mtod(exthdrs.ip6e_rthdr, u_char *);
|
|
|
|
*mtod(exthdrs.ip6e_rthdr, u_char *) = IPPROTO_FRAGMENT;
|
|
|
|
} else if (exthdrs.ip6e_dest1) {
|
|
|
|
nextproto = *mtod(exthdrs.ip6e_dest1, u_char *);
|
|
|
|
*mtod(exthdrs.ip6e_dest1, u_char *) = IPPROTO_FRAGMENT;
|
|
|
|
} else if (exthdrs.ip6e_hbh) {
|
|
|
|
nextproto = *mtod(exthdrs.ip6e_hbh, u_char *);
|
|
|
|
*mtod(exthdrs.ip6e_hbh, u_char *) = IPPROTO_FRAGMENT;
|
|
|
|
} else {
|
|
|
|
nextproto = ip6->ip6_nxt;
|
|
|
|
ip6->ip6_nxt = IPPROTO_FRAGMENT;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Loop through length of segment after first fragment,
|
2002-04-19 04:46:24 +00:00
|
|
|
* make new header and copy data of each part and link onto
|
|
|
|
* chain.
|
1999-11-22 02:45:11 +00:00
|
|
|
*/
|
|
|
|
m0 = m;
|
2015-04-01 12:15:01 +00:00
|
|
|
id = htonl(ip6_randomid());
|
|
|
|
if ((error = ip6_fragment(ifp, m, hlen, nextproto, len, id)))
|
2015-02-16 06:30:27 +00:00
|
|
|
goto sendorfree;
|
1999-11-22 02:45:11 +00:00
|
|
|
|
|
|
|
in6_ifstat_inc(ifp, ifs6_out_fragok);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Remove leading garbage.
|
|
|
|
*/
|
|
|
|
sendorfree:
|
|
|
|
m = m0->m_nextpkt;
|
|
|
|
m0->m_nextpkt = 0;
|
|
|
|
m_freem(m0);
|
2018-03-24 12:43:34 +00:00
|
|
|
for (; m; m = m0) {
|
1999-11-22 02:45:11 +00:00
|
|
|
m0 = m->m_nextpkt;
|
|
|
|
m->m_nextpkt = 0;
|
|
|
|
if (error == 0) {
|
2007-07-05 16:29:40 +00:00
|
|
|
/* Record statistics for this interface address. */
|
|
|
|
if (ia) {
|
2013-10-15 11:37:57 +00:00
|
|
|
counter_u64_add(ia->ia_ifa.ifa_opackets, 1);
|
|
|
|
counter_u64_add(ia->ia_ifa.ifa_obytes,
|
|
|
|
m->m_pkthdr.len);
|
2007-07-05 16:29:40 +00:00
|
|
|
}
|
2019-05-24 22:30:40 +00:00
|
|
|
error = ip6_output_send(inp, ifp, origifp, m, dst, ro);
|
1999-11-22 02:45:11 +00:00
|
|
|
} else
|
|
|
|
m_freem(m);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (error == 0)
|
2013-04-09 07:11:22 +00:00
|
|
|
IP6STAT_INC(ip6s_fragmented);
|
1999-11-22 02:45:11 +00:00
|
|
|
|
|
|
|
done:
|
2016-10-13 20:15:47 +00:00
|
|
|
if (ro == &ip6route)
|
2012-07-04 07:37:53 +00:00
|
|
|
RO_RTFREE(ro);
|
2003-10-06 14:02:09 +00:00
|
|
|
return (error);
|
1999-11-22 02:45:11 +00:00
|
|
|
|
|
|
|
freehdrs:
|
|
|
|
m_freem(exthdrs.ip6e_hbh); /* m_freem will check if mbuf is 0 */
|
|
|
|
m_freem(exthdrs.ip6e_dest1);
|
|
|
|
m_freem(exthdrs.ip6e_rthdr);
|
|
|
|
m_freem(exthdrs.ip6e_dest2);
|
2003-10-08 18:26:08 +00:00
|
|
|
/* FALLTHROUGH */
|
1999-11-22 02:45:11 +00:00
|
|
|
bad:
|
2007-07-01 11:41:27 +00:00
|
|
|
if (m)
|
|
|
|
m_freem(m);
|
1999-11-22 02:45:11 +00:00
|
|
|
goto done;
|
|
|
|
}
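
The per-fragment length arithmetic in the fragmentation path above can be tried in isolation. The following is a minimal user-space sketch, assuming an 8-byte fragment header; the sketch_* names are hypothetical, and unlike the kernel expression it guards explicitly against an MTU smaller than the header overhead.

```c
#include <stdint.h>

#define SKETCH_FRAG_HDR_LEN	8u	/* size of the IPv6 fragment header */

/*
 * Payload bytes carried by each fragment: whatever fits in the MTU
 * after the unfragmentable part and the fragment header, rounded down
 * to a multiple of 8. Returns 0 when fewer than 8 bytes fit, which is
 * the case ip6_output() rejects with EMSGSIZE.
 */
static uint32_t
sketch_frag_payload_len(uint32_t mtu, uint32_t unfragpartlen)
{
	uint32_t overhead = unfragpartlen + SKETCH_FRAG_HDR_LEN;
	uint32_t len;

	if (mtu <= overhead)
		return (0);
	len = (mtu - overhead) & ~7u;
	return (len < 8 ? 0 : len);
}

/* Number of fragments needed to carry 'payload' bytes, 'len' at a time. */
static uint32_t
sketch_frag_count(uint32_t payload, uint32_t len)
{
	if (len == 0)
		return (0);
	return ((payload + len - 1) / len);
}
```

With the minimum MTU of 1280 and a bare 40-byte IPv6 header, each fragment carries 1232 bytes, so a 3000-byte payload needs three fragments. The rounding to a multiple of 8 exists because the fragment offset field counts in 8-byte units.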
|
|
|
|
|
|
|
|
static int
|
2007-07-05 16:23:49 +00:00
|
|
|
ip6_copyexthdr(struct mbuf **mp, caddr_t hdr, int hlen)
|
1999-11-22 02:45:11 +00:00
|
|
|
{
|
|
|
|
struct mbuf *m;
|
|
|
|
|
|
|
|
if (hlen > MCLBYTES)
|
2003-10-06 14:02:09 +00:00
|
|
|
return (ENOBUFS); /* XXX */
|
1999-11-22 02:45:11 +00:00
|
|
|
|
2013-03-15 13:48:53 +00:00
|
|
|
if (hlen > MLEN)
|
|
|
|
m = m_getcl(M_NOWAIT, MT_DATA, 0);
|
|
|
|
else
|
|
|
|
m = m_get(M_NOWAIT, MT_DATA);
|
|
|
|
if (m == NULL)
|
2003-10-06 14:02:09 +00:00
|
|
|
return (ENOBUFS);
|
1999-11-22 02:45:11 +00:00
|
|
|
m->m_len = hlen;
|
|
|
|
if (hdr)
|
|
|
|
bcopy(hdr, mtod(m, caddr_t), hlen);
|
|
|
|
|
|
|
|
*mp = m;
|
2003-10-06 14:02:09 +00:00
|
|
|
return (0);
|
1999-11-22 02:45:11 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Insert jumbo payload option.
|
|
|
|
*/
|
|
|
|
static int
|
2007-07-05 16:23:49 +00:00
|
|
|
ip6_insert_jumboopt(struct ip6_exthdrs *exthdrs, u_int32_t plen)
|
1999-11-22 02:45:11 +00:00
|
|
|
{
|
|
|
|
struct mbuf *mopt;
|
|
|
|
u_char *optbuf;
|
2001-06-11 12:39:29 +00:00
|
|
|
u_int32_t v;
|
1999-11-22 02:45:11 +00:00
|
|
|
|
|
|
|
#define JUMBOOPTLEN 8 /* length of jumbo payload option and padding */
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If there is no hop-by-hop options header, allocate a new one.
|
|
|
|
* If there is one but it doesn't have enough space to store the
|
|
|
|
* jumbo payload option, allocate a cluster to store the whole options.
|
|
|
|
* Otherwise, use it to store the options.
|
|
|
|
*/
|
2016-04-15 17:30:33 +00:00
|
|
|
if (exthdrs->ip6e_hbh == NULL) {
|
2013-03-15 13:48:53 +00:00
|
|
|
mopt = m_get(M_NOWAIT, MT_DATA);
|
|
|
|
if (mopt == NULL)
|
2003-10-06 14:02:09 +00:00
|
|
|
return (ENOBUFS);
|
1999-11-22 02:45:11 +00:00
|
|
|
mopt->m_len = JUMBOOPTLEN;
|
|
|
|
optbuf = mtod(mopt, u_char *);
|
|
|
|
optbuf[1] = 0; /* = ((JUMBOOPTLEN) >> 3) - 1 */
|
|
|
|
exthdrs->ip6e_hbh = mopt;
|
|
|
|
} else {
|
|
|
|
struct ip6_hbh *hbh;
|
|
|
|
|
|
|
|
mopt = exthdrs->ip6e_hbh;
|
|
|
|
if (M_TRAILINGSPACE(mopt) < JUMBOOPTLEN) {
|
2001-06-11 12:39:29 +00:00
|
|
|
/*
|
|
|
|
* XXX assumption:
|
|
|
|
* - exthdrs->ip6e_hbh is not referenced from places
|
|
|
|
* other than exthdrs.
|
|
|
|
* - exthdrs->ip6e_hbh is not an mbuf chain.
|
|
|
|
*/
|
1999-11-22 02:45:11 +00:00
|
|
|
int oldoptlen = mopt->m_len;
|
2001-06-11 12:39:29 +00:00
|
|
|
struct mbuf *n;
|
1999-11-22 02:45:11 +00:00
|
|
|
|
2001-06-11 12:39:29 +00:00
|
|
|
/*
|
|
|
|
* XXX: give up if the whole (new) hbh header does
|
|
|
|
* not fit even in an mbuf cluster.
|
|
|
|
*/
|
|
|
|
if (oldoptlen + JUMBOOPTLEN > MCLBYTES)
|
2003-10-06 14:02:09 +00:00
|
|
|
return (ENOBUFS);
|
1999-11-22 02:45:11 +00:00
|
|
|
|
2001-06-11 12:39:29 +00:00
|
|
|
/*
|
|
|
|
* As a consequence, we must always prepare a cluster
|
|
|
|
* at this point.
|
|
|
|
*/
|
2013-03-15 13:48:53 +00:00
|
|
|
n = m_getcl(M_NOWAIT, MT_DATA, 0);
|
|
|
|
if (n == NULL)
|
2003-10-06 14:02:09 +00:00
|
|
|
return (ENOBUFS);
|
2001-06-11 12:39:29 +00:00
|
|
|
n->m_len = oldoptlen + JUMBOOPTLEN;
|
|
|
|
bcopy(mtod(mopt, caddr_t), mtod(n, caddr_t),
|
2003-10-08 18:26:08 +00:00
|
|
|
oldoptlen);
|
2001-06-11 12:39:29 +00:00
|
|
|
optbuf = mtod(n, caddr_t) + oldoptlen;
|
|
|
|
m_freem(mopt);
|
|
|
|
mopt = exthdrs->ip6e_hbh = n;
|
1999-11-22 02:45:11 +00:00
|
|
|
} else {
|
|
|
|
optbuf = mtod(mopt, u_char *) + mopt->m_len;
|
|
|
|
mopt->m_len += JUMBOOPTLEN;
|
|
|
|
}
|
|
|
|
optbuf[0] = IP6OPT_PADN;
|
|
|
|
optbuf[1] = 0; /* 2 bytes of PadN in total: Opt Data Len is N - 2 = 0 */
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Adjust the header length according to the pad and
|
|
|
|
* the jumbo payload option.
|
|
|
|
*/
|
|
|
|
hbh = mtod(mopt, struct ip6_hbh *);
|
|
|
|
hbh->ip6h_len += (JUMBOOPTLEN >> 3);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* fill in the option. */
|
|
|
|
optbuf[2] = IP6OPT_JUMBO;
|
|
|
|
optbuf[3] = 4;
|
2001-06-11 12:39:29 +00:00
|
|
|
v = (u_int32_t)htonl(plen + JUMBOOPTLEN);
|
|
|
|
bcopy(&v, &optbuf[4], sizeof(u_int32_t));
|
1999-11-22 02:45:11 +00:00
|
|
|
|
|
|
|
/* finally, adjust the packet header length */
|
|
|
|
exthdrs->ip6e_ip6->m_pkthdr.len += JUMBOOPTLEN;
|
|
|
|
|
2003-10-06 14:02:09 +00:00
|
|
|
return (0);
|
1999-11-22 02:45:11 +00:00
|
|
|
#undef JUMBOOPTLEN
|
|
|
|
}
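
The 8-byte layout that ip6_insert_jumboopt() writes can be shown with a small stand-alone sketch. It fills only bytes 2 through 7, since the first two bytes depend on the branch taken (next-header and length fields of a new hop-by-hop header, or padding when appending to an existing one). The sketch_* names are hypothetical; the option type 0xC2 and data length 4 come from the Jumbo Payload option definition in RFC 2675.

```c
#include <stdint.h>

#define SKETCH_IP6OPT_JUMBO	0xC2	/* Jumbo Payload option type */
#define SKETCH_JUMBOOPTLEN	8	/* option plus two leading bytes */

/*
 * Write the jumbo payload option into bytes 2..7 of an 8-byte block.
 * 'plen' is the payload length before the block itself is counted, so
 * the stored value is plen + 8, in network byte order.
 */
static void
sketch_fill_jumboopt(uint8_t buf[SKETCH_JUMBOOPTLEN], uint32_t plen)
{
	uint32_t total = plen + SKETCH_JUMBOOPTLEN;

	buf[2] = SKETCH_IP6OPT_JUMBO;
	buf[3] = 4;			/* option data length in bytes */
	buf[4] = (uint8_t)(total >> 24);
	buf[5] = (uint8_t)(total >> 16);
	buf[6] = (uint8_t)(total >> 8);
	buf[7] = (uint8_t)total;
}

/* Read the 32-bit jumbo payload length back out of the block. */
static uint32_t
sketch_read_jumbolen(const uint8_t buf[SKETCH_JUMBOOPTLEN])
{
	return (((uint32_t)buf[4] << 24) | ((uint32_t)buf[5] << 16) |
	    ((uint32_t)buf[6] << 8) | (uint32_t)buf[7]);
}
```

Writing the length byte by byte avoids both the byte-order conversion and any alignment assumption about the buffer, which is the same concern the kernel code addresses by using htonl() followed by bcopy().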
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Insert fragment header and copy unfragmentable header portions.
|
|
|
|
*/
|
|
|
|
static int
|
2007-07-05 16:23:49 +00:00
|
|
|
ip6_insertfraghdr(struct mbuf *m0, struct mbuf *m, int hlen,
|
|
|
|
struct ip6_frag **frghdrp)
|
1999-11-22 02:45:11 +00:00
|
|
|
{
|
|
|
|
struct mbuf *n, *mlast;
|
|
|
|
|
|
|
|
if (hlen > sizeof(struct ip6_hdr)) {
|
|
|
|
n = m_copym(m0, sizeof(struct ip6_hdr),
|
2012-12-05 08:04:20 +00:00
|
|
|
hlen - sizeof(struct ip6_hdr), M_NOWAIT);
|
2016-04-15 17:30:33 +00:00
|
|
|
if (n == NULL)
|
2003-10-06 14:02:09 +00:00
|
|
|
return (ENOBUFS);
|
1999-11-22 02:45:11 +00:00
|
|
|
m->m_next = n;
|
|
|
|
} else
|
|
|
|
n = m;
|
|
|
|
|
|
|
|
/* Search for the last mbuf of unfragmentable part. */
|
|
|
|
for (mlast = n; mlast->m_next; mlast = mlast->m_next)
|
|
|
|
;
|
|
|
|
|
2014-10-12 15:49:52 +00:00
|
|
|
if (M_WRITABLE(mlast) &&
|
2000-07-04 16:35:15 +00:00
|
|
|
M_TRAILINGSPACE(mlast) >= sizeof(struct ip6_frag)) {
|
1999-11-22 02:45:11 +00:00
|
|
|
/* use the trailing space of the last mbuf for the fragment hdr */
|
2003-10-08 18:26:08 +00:00
|
|
|
*frghdrp = (struct ip6_frag *)(mtod(mlast, caddr_t) +
|
|
|
|
mlast->m_len);
|
1999-11-22 02:45:11 +00:00
|
|
|
mlast->m_len += sizeof(struct ip6_frag);
|
|
|
|
m->m_pkthdr.len += sizeof(struct ip6_frag);
|
|
|
|
} else {
|
|
|
|
/* allocate a new mbuf for the fragment header */
|
|
|
|
struct mbuf *mfrg;
|
|
|
|
|
2013-03-15 13:48:53 +00:00
|
|
|
mfrg = m_get(M_NOWAIT, MT_DATA);
|
|
|
|
if (mfrg == NULL)
|
2003-10-06 14:02:09 +00:00
|
|
|
return (ENOBUFS);
|
1999-11-22 02:45:11 +00:00
|
|
|
mfrg->m_len = sizeof(struct ip6_frag);
|
|
|
|
*frghdrp = mtod(mfrg, struct ip6_frag *);
|
|
|
|
mlast->m_next = mfrg;
|
|
|
|
}
|
|
|
|
|
2003-10-06 14:02:09 +00:00
|
|
|
return (0);
|
1999-11-22 02:45:11 +00:00
|
|
|
}
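
Once ip6_insertfraghdr() has made room for the fragment header, the caller has to fill it in. The sketch below shows one way to encode the 8-byte header defined in RFC 8200, assuming the byte offset is already a multiple of 8. The sketch_* names are hypothetical and the header is written as raw bytes rather than through struct ip6_frag.

```c
#include <stdint.h>

#define SKETCH_IP6F_OFF_MASK	0xfff8u	/* upper 13 bits: offset */
#define SKETCH_IP6F_MORE_FRAG	0x0001u	/* lowest bit: more fragments */

/*
 * Encode a fragment header into 'hdr'. Because the offset field counts
 * 8-byte units and sits in the upper 13 bits, a byte offset that is a
 * multiple of 8 can be stored directly after masking.
 */
static void
sketch_fill_fraghdr(uint8_t hdr[8], uint8_t nextproto, uint16_t off,
    int more, uint32_t id)
{
	uint16_t offlg;

	offlg = (uint16_t)((off & SKETCH_IP6F_OFF_MASK) |
	    (more ? SKETCH_IP6F_MORE_FRAG : 0));
	hdr[0] = nextproto;		/* protocol of the fragmentable part */
	hdr[1] = 0;			/* reserved */
	hdr[2] = (uint8_t)(offlg >> 8);	/* offset and flags, big-endian */
	hdr[3] = (uint8_t)offlg;
	hdr[4] = (uint8_t)(id >> 24);	/* identification, big-endian */
	hdr[5] = (uint8_t)(id >> 16);
	hdr[6] = (uint8_t)(id >> 8);
	hdr[7] = (uint8_t)id;
}
```

For the second fragment of a UDP datagram split at 1232 bytes, with more fragments to follow, the offset and flags bytes come out as 0x04 0xD1.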
|
|
|
|
|
2016-01-03 09:54:03 +00:00
|
|
|
/*
|
|
|
|
* Calculates IPv6 path mtu for destination @dst.
|
|
|
|
* Resulting MTU is stored in @mtup.
|
|
|
|
*
|
|
|
|
* Returns 0 on success.
|
|
|
|
*/
|
2003-10-20 15:27:48 +00:00
|
|
|
static int
|
2016-05-19 12:45:20 +00:00
|
|
|
ip6_getpmtu_ctl(u_int fibnum, const struct in6_addr *dst, u_long *mtup)
|
2016-01-03 09:54:03 +00:00
|
|
|
{
|
2016-01-04 18:32:24 +00:00
|
|
|
struct nhop6_extended nh6;
|
|
|
|
struct in6_addr kdst;
|
|
|
|
uint32_t scopeid;
|
2016-01-03 09:54:03 +00:00
|
|
|
struct ifnet *ifp;
|
|
|
|
u_long mtu;
|
2016-01-04 18:32:24 +00:00
|
|
|
int error;
|
2016-01-03 09:54:03 +00:00
|
|
|
|
2016-01-04 18:32:24 +00:00
|
|
|
in6_splitscope(dst, &kdst, &scopeid);
|
|
|
|
if (fib6_lookup_nh_ext(fibnum, &kdst, scopeid, NHR_REF, 0, &nh6) != 0)
|
2016-01-03 09:54:03 +00:00
|
|
|
return (EHOSTUNREACH);
|
|
|
|
|
2016-01-04 18:32:24 +00:00
|
|
|
ifp = nh6.nh_ifp;
|
|
|
|
mtu = nh6.nh_mtu;
|
2016-01-03 09:54:03 +00:00
|
|
|
|
2016-08-01 17:02:21 +00:00
|
|
|
error = ip6_calcmtu(ifp, dst, mtu, mtup, NULL, 0);
|
2016-01-04 18:32:24 +00:00
|
|
|
fib6_free_nh_ext(fibnum, &nh6);
|
|
|
|
|
|
|
|
return (error);
|
2016-01-03 09:54:03 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Calculates IPv6 path MTU for @dst based on transmit @ifp,
|
|
|
|
* and cached data in @ro_pmtu.
|
|
|
|
* MTU from (successful) route lookup is saved (along with dst)
|
|
|
|
* inside @ro_pmtu to avoid subsequent route lookups after packet
|
|
|
|
* filter processing.
|
|
|
|
*
|
|
|
|
* Stores mtu and always-frag value into @mtup and @alwaysfragp.
|
|
|
|
* Returns 0 on success.
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
ip6_getpmtu(struct route_in6 *ro_pmtu, int do_lookup,
|
2016-05-19 12:45:20 +00:00
|
|
|
struct ifnet *ifp, const struct in6_addr *dst, u_long *mtup,
|
2016-08-01 17:02:21 +00:00
|
|
|
int *alwaysfragp, u_int fibnum, u_int proto)
|
2003-10-20 15:27:48 +00:00
|
|
|
{
|
2016-01-04 18:32:24 +00:00
|
|
|
struct nhop6_basic nh6;
|
|
|
|
struct in6_addr kdst;
|
|
|
|
uint32_t scopeid;
|
2016-01-03 09:54:03 +00:00
|
|
|
struct sockaddr_in6 *sa6_dst;
|
|
|
|
u_long mtu;
|
2003-10-20 15:27:48 +00:00
|
|
|
|
2016-01-03 09:54:03 +00:00
|
|
|
mtu = 0;
|
|
|
|
if (do_lookup) {
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Here ro_pmtu has final destination address, while
|
|
|
|
* ro might represent immediate destination.
|
|
|
|
* Use ro_pmtu destination since mtu might differ.
|
|
|
|
*/
|
|
|
|
sa6_dst = (struct sockaddr_in6 *)&ro_pmtu->ro_dst;
|
|
|
|
if (!IN6_ARE_ADDR_EQUAL(&sa6_dst->sin6_addr, dst))
|
|
|
|
ro_pmtu->ro_mtu = 0;
|
|
|
|
|
|
|
|
if (ro_pmtu->ro_mtu == 0) {
|
2003-10-20 15:27:48 +00:00
|
|
|
bzero(sa6_dst, sizeof(*sa6_dst));
|
|
|
|
sa6_dst->sin6_family = AF_INET6;
|
|
|
|
sa6_dst->sin6_len = sizeof(struct sockaddr_in6);
|
|
|
|
sa6_dst->sin6_addr = *dst;
|
|
|
|
|
2016-01-04 18:32:24 +00:00
|
|
|
in6_splitscope(dst, &kdst, &scopeid);
|
|
|
|
if (fib6_lookup_nh_basic(fibnum, &kdst, scopeid, 0, 0,
|
|
|
|
&nh6) == 0)
|
|
|
|
ro_pmtu->ro_mtu = nh6.nh_mtu;
|
2003-10-20 15:27:48 +00:00
|
|
|
}
|
2016-01-04 18:32:24 +00:00
|
|
|
|
|
|
|
mtu = ro_pmtu->ro_mtu;
|
2003-10-20 15:27:48 +00:00
|
|
|
}
|
2016-01-03 09:54:03 +00:00
|
|
|
|
|
|
|
if (ro_pmtu->ro_rt)
|
|
|
|
mtu = ro_pmtu->ro_rt->rt_mtu;
|
|
|
|
|
2016-08-01 17:02:21 +00:00
|
|
|
return (ip6_calcmtu(ifp, dst, mtu, mtup, alwaysfragp, proto));
|
2016-01-03 09:54:03 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Calculate MTU based on transmit @ifp, route mtu @rt_mtu and
|
|
|
|
* hostcache data for @dst.
|
|
|
|
* Stores mtu and always-frag value into @mtup and @alwaysfragp.
|
|
|
|
*
|
|
|
|
* Returns 0 on success.
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
ip6_calcmtu(struct ifnet *ifp, const struct in6_addr *dst, u_long rt_mtu,
|
2016-08-01 17:02:21 +00:00
|
|
|
u_long *mtup, int *alwaysfragp, u_int proto)
|
2016-01-03 09:54:03 +00:00
|
|
|
{
|
|
|
|
u_long mtu = 0;
|
|
|
|
int alwaysfrag = 0;
|
|
|
|
int error = 0;
|
|
|
|
|
|
|
|
if (rt_mtu > 0) {
|
2003-10-20 15:27:48 +00:00
|
|
|
u_int32_t ifmtu;
|
2003-11-20 20:07:39 +00:00
|
|
|
struct in_conninfo inc;
|
|
|
|
|
|
|
|
bzero(&inc, sizeof(inc));
|
2008-12-17 12:52:34 +00:00
|
|
|
inc.inc_flags |= INC_ISIPV6;
|
2003-11-20 20:07:39 +00:00
|
|
|
inc.inc6_faddr = *dst;
|
2003-10-20 15:27:48 +00:00
|
|
|
|
|
|
|
ifmtu = IN6_LINKMTU(ifp);
|
2016-08-01 17:02:21 +00:00
|
|
|
|
|
|
|
/* TCP is known to react to pmtu changes so skip hc */
|
|
|
|
if (proto != IPPROTO_TCP)
|
|
|
|
mtu = tcp_hc_getmtu(&inc);
|
|
|
|
|
2003-11-20 20:07:39 +00:00
|
|
|
if (mtu)
|
2016-01-03 09:54:03 +00:00
|
|
|
mtu = min(mtu, rt_mtu);
|
2003-11-20 20:07:39 +00:00
|
|
|
else
|
2016-01-03 09:54:03 +00:00
|
|
|
mtu = rt_mtu;
|
2003-10-20 15:27:48 +00:00
|
|
|
if (mtu == 0)
|
|
|
|
mtu = ifmtu;
|
2003-10-24 18:26:30 +00:00
|
|
|
else if (mtu < IPV6_MMTU) {
|
|
|
|
/*
|
|
|
|
* RFC2460 section 5, last paragraph:
|
|
|
|
* if we record ICMPv6 too big message with
|
|
|
|
* mtu < IPV6_MMTU, transmit packets sized IPV6_MMTU
|
|
|
|
* or smaller, with fragment header attached.
|
|
|
|
* (fragment header is needed regardless of the
|
|
|
|
* packet size, for translators to identify packets)
|
|
|
|
*/
|
|
|
|
alwaysfrag = 1;
|
|
|
|
mtu = IPV6_MMTU;
|
2003-10-20 15:27:48 +00:00
|
|
|
}
|
|
|
|
} else if (ifp) {
|
|
|
|
mtu = IN6_LINKMTU(ifp);
|
|
|
|
} else
|
|
|
|
error = EHOSTUNREACH; /* XXX */
|
|
|
|
|
|
|
|
*mtup = mtu;
|
2003-10-24 18:26:30 +00:00
|
|
|
if (alwaysfragp)
|
|
|
|
*alwaysfragp = alwaysfrag;
|
2003-10-20 15:27:48 +00:00
|
|
|
return (error);
|
|
|
|
}
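/*
 * Illustrative sketch only, not part of the kernel source: a user-space
 * restatement of the selection logic in ip6_calcmtu() above, so the
 * precedence rules can be exercised in isolation. The names
 * sketch_calcmtu() and SKETCH_IPV6_MMTU are hypothetical. The sketch
 * assumes the caller has already resolved the three inputs: the route MTU,
 * the cached path MTU (pass 0 when none is recorded or when the lookup is
 * skipped), and the link MTU of the outgoing interface (pass 0 when there
 * is no interface).
 */

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Minimum IPv6 link MTU; assumed to match the kernel's IPV6_MMTU. */
#define SKETCH_IPV6_MMTU 1280UL

/*
 * Hypothetical mirror of the MTU selection above.
 * rt_mtu: route MTU, 0 if unknown.
 * hc_mtu: cached path MTU, 0 if none.
 * if_mtu: link MTU of the outgoing interface, 0 if no interface.
 * Returns 0 on success, EHOSTUNREACH when nothing is known.
 */
static int
sketch_calcmtu(unsigned long rt_mtu, unsigned long hc_mtu,
    unsigned long if_mtu, unsigned long *mtup, int *alwaysfragp)
{
	unsigned long mtu = 0;
	int alwaysfrag = 0;
	int error = 0;

	assert(mtup != NULL);

	if (rt_mtu > 0) {
		/* Take the smaller of the cached path MTU and the route MTU. */
		if (hc_mtu != 0)
			mtu = hc_mtu < rt_mtu ? hc_mtu : rt_mtu;
		else
			mtu = rt_mtu;
		if (mtu == 0)
			mtu = if_mtu;
		else if (mtu < SKETCH_IPV6_MMTU) {
			/* Below the minimum: clamp and always fragment. */
			alwaysfrag = 1;
			mtu = SKETCH_IPV6_MMTU;
		}
	} else if (if_mtu != 0) {
		/* No route MTU: fall back to the link MTU. */
		mtu = if_mtu;
	} else
		error = EHOSTUNREACH;

	*mtup = mtu;
	if (alwaysfragp != NULL)
		*alwaysfragp = alwaysfrag;
	return (error);
}
```

/*
 * With these inputs, a route MTU of 1500 and a cached path MTU of 1000
 * yields an MTU of 1280 with the always-fragment flag set, while a route
 * MTU of 0 and a link MTU of 9000 yields 9000 with the flag clear.
 */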
|
|
|
|
|
1999-11-22 02:45:11 +00:00
|
|
|
/*
|
|
|
|
* IP6 socket option processing.
|
|
|
|
*/
|
|
|
|
int
|
2007-07-05 16:23:49 +00:00
|
|
|
ip6_ctloutput(struct socket *so, struct sockopt *sopt)
|
1999-11-22 02:45:11 +00:00
|
|
|
{
|
2008-01-24 08:25:59 +00:00
|
|
|
int optdatalen, uproto;
|
2003-10-24 18:26:30 +00:00
|
|
|
void *optdata;
|
2019-08-02 07:41:36 +00:00
|
|
|
struct inpcb *inp = sotoinpcb(so);
|
1999-11-22 02:45:11 +00:00
|
|
|
int error, optval;
|
|
|
|
int level, op, optname;
|
|
|
|
int optlen;
|
2001-09-12 08:38:13 +00:00
|
|
|
struct thread *td;
|
2014-07-12 05:46:33 +00:00
|
|
|
#ifdef RSS
|
|
|
|
uint32_t rss_bucket;
|
|
|
|
int retval;
|
|
|
|
#endif
|
1999-11-22 02:45:11 +00:00
|
|
|
|
2016-10-17 23:25:31 +00:00
|
|
|
/*
|
|
|
|
* Don't use more than a quarter of mbuf clusters. N.B.:
|
|
|
|
* nmbclusters is an int, but nmbclusters * MCLBYTES may overflow
|
|
|
|
* on LP64 architectures, so cast to u_long to avoid undefined
|
|
|
|
* behavior. ILP32 architectures cannot have nmbclusters
|
|
|
|
* large enough to overflow for other reasons.
|
|
|
|
*/
|
|
|
|
#define IPV6_PKTOPTIONS_MBUF_LIMIT ((u_long)nmbclusters * MCLBYTES / 4)
|
|
|
|
|
2008-07-29 09:31:03 +00:00
|
|
|
level = sopt->sopt_level;
|
|
|
|
op = sopt->sopt_dir;
|
|
|
|
optname = sopt->sopt_name;
|
|
|
|
optlen = sopt->sopt_valsize;
|
|
|
|
td = sopt->sopt_td;
|
|
|
|
error = 0;
|
|
|
|
optval = 0;
|
2003-10-24 18:26:30 +00:00
|
|
|
uproto = (int)so->so_proto->pr_protocol;
|
1999-11-22 02:45:11 +00:00
|
|
|
|
2011-11-06 10:47:20 +00:00
|
|
|
if (level != IPPROTO_IPV6) {
|
|
|
|
error = EINVAL;
|
|
|
|
|
|
|
|
if (sopt->sopt_level == SOL_SOCKET &&
|
|
|
|
sopt->sopt_dir == SOPT_SET) {
|
|
|
|
switch (sopt->sopt_name) {
|
|
|
|
case SO_REUSEADDR:
|
2019-08-02 07:41:36 +00:00
|
|
|
INP_WLOCK(inp);
|
2013-07-04 18:38:00 +00:00
|
|
|
if ((so->so_options & SO_REUSEADDR) != 0)
|
2019-08-02 07:41:36 +00:00
|
|
|
inp->inp_flags2 |= INP_REUSEADDR;
|
2013-07-04 18:38:00 +00:00
|
|
|
else
|
2019-08-02 07:41:36 +00:00
|
|
|
inp->inp_flags2 &= ~INP_REUSEADDR;
|
|
|
|
INP_WUNLOCK(inp);
|
2011-11-06 10:47:20 +00:00
|
|
|
error = 0;
|
|
|
|
break;
|
|
|
|
case SO_REUSEPORT:
|
2019-08-02 07:41:36 +00:00
|
|
|
INP_WLOCK(inp);
|
2011-11-06 10:47:20 +00:00
|
|
|
if ((so->so_options & SO_REUSEPORT) != 0)
|
2019-08-02 07:41:36 +00:00
|
|
|
inp->inp_flags2 |= INP_REUSEPORT;
|
2011-11-06 10:47:20 +00:00
|
|
|
else
|
2019-08-02 07:41:36 +00:00
|
|
|
inp->inp_flags2 &= ~INP_REUSEPORT;
|
|
|
|
INP_WUNLOCK(inp);
|
2011-11-06 10:47:20 +00:00
|
|
|
error = 0;
|
|
|
|
break;
|
2018-06-06 15:45:57 +00:00
|
|
|
case SO_REUSEPORT_LB:
|
2019-08-02 07:41:36 +00:00
|
|
|
INP_WLOCK(inp);
|
2018-06-06 15:45:57 +00:00
|
|
|
if ((so->so_options & SO_REUSEPORT_LB) != 0)
|
2019-08-02 07:41:36 +00:00
|
|
|
inp->inp_flags2 |= INP_REUSEPORT_LB;
|
2018-06-06 15:45:57 +00:00
|
|
|
else
|
2019-08-02 07:41:36 +00:00
|
|
|
inp->inp_flags2 &= ~INP_REUSEPORT_LB;
|
|
|
|
INP_WUNLOCK(inp);
|
2018-06-06 15:45:57 +00:00
|
|
|
error = 0;
|
|
|
|
break;
|
2012-02-03 11:00:53 +00:00
|
|
|
case SO_SETFIB:
|
2019-08-02 07:41:36 +00:00
|
|
|
INP_WLOCK(inp);
|
|
|
|
inp->inp_inc.inc_fibnum = so->so_fibnum;
|
|
|
|
INP_WUNLOCK(inp);
|
2012-02-03 11:00:53 +00:00
|
|
|
error = 0;
|
|
|
|
break;
|
Implement kernel support for hardware rate limited sockets.
- Add RATELIMIT kernel configuration keyword which must be set to
enable the new functionality.
- Add support for hardware driven, Receive Side Scaling, RSS aware, rate
limited sendqueues and expose the functionality through the already
established SO_MAX_PACING_RATE setsockopt(). The API support rates in
the range from 1 to 4Gbytes/s which are suitable for regular TCP and
UDP streams. The setsockopt(2) manual page has been updated.
- Add rate limit function callback API to "struct ifnet" which supports
the following operations: if_snd_tag_alloc(), if_snd_tag_modify(),
if_snd_tag_query() and if_snd_tag_free().
- Add support to ifconfig to view, set and clear the IFCAP_TXRTLMT
flag, which tells if a network driver supports rate limiting or not.
- This patch also adds support for rate limiting through VLAN and LAGG
intermediate network devices.
- How rate limiting works:
1) The userspace application calls setsockopt() after accepting or
making a new connection to set the rate which is then stored in the
socket structure in the kernel. Later on when packets are transmitted
a check is made in the transmit path for rate changes. A rate change
implies a non-blocking ifp->if_snd_tag_alloc() call will be made to the
destination network interface, which then sets up a custom sendqueue
with the given rate limitation parameter. A "struct m_snd_tag" pointer is
returned which serves as a "snd_tag" hint in the m_pkthdr for the
subsequently transmitted mbufs.
2) When the network driver sees the "m->m_pkthdr.snd_tag" different
from NULL, it will move the packets into a designated rate limited sendqueue
given by the snd_tag pointer. It is up to the individual drivers how the rate
limited traffic will be rate limited.
3) Route changes are detected by the NIC drivers in the ifp->if_transmit()
routine when the ifnet pointer in the incoming snd_tag mismatches the
one of the network interface. The network adapter frees the mbuf and
returns EAGAIN which causes the ip_output() to release and clear the send
tag. Upon next ip_output() a new "snd_tag" will be tried allocated.
4) When the PCB is detached the custom sendqueue will be released by a
non-blocking ifp->if_snd_tag_free() call to the currently bound network
interface.
Reviewed by: wblock (manpages), adrian, gallatin, scottl (network)
Differential Revision: https://reviews.freebsd.org/D3687
Sponsored by: Mellanox Technologies
MFC after: 3 months
2017-01-18 13:31:17 +00:00
|
|
|
case SO_MAX_PACING_RATE:
|
|
|
|
#ifdef RATELIMIT
|
2019-08-02 07:41:36 +00:00
|
|
|
INP_WLOCK(inp);
|
|
|
|
inp->inp_flags2 |= INP_RATE_LIMIT_CHANGED;
|
|
|
|
INP_WUNLOCK(inp);
|
2017-01-18 13:31:17 +00:00
|
|
|
error = 0;
|
|
|
|
#else
|
|
|
|
error = EOPNOTSUPP;
|
|
|
|
#endif
|
|
|
|
break;
|
2011-11-06 10:47:20 +00:00
|
|
|
default:
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
} else { /* level == IPPROTO_IPV6 */
|
1999-11-22 02:45:11 +00:00
|
|
|
switch (op) {
|
2001-06-11 12:39:29 +00:00
|
|
|
|
1999-11-22 02:45:11 +00:00
|
|
|
case SOPT_SET:
|
|
|
|
switch (optname) {
|
2003-10-24 18:26:30 +00:00
|
|
|
case IPV6_2292PKTOPTIONS:
|
|
|
|
#ifdef IPV6_PKTOPTIONS
|
1999-11-22 02:45:11 +00:00
|
|
|
case IPV6_PKTOPTIONS:
|
2003-10-24 18:26:30 +00:00
|
|
|
#endif
|
2001-06-11 12:39:29 +00:00
|
|
|
{
|
1999-11-22 02:45:11 +00:00
|
|
|
struct mbuf *m;
|
|
|
|
|
2016-10-17 23:25:31 +00:00
|
|
|
if (optlen > IPV6_PKTOPTIONS_MBUF_LIMIT) {
|
|
|
|
printf("ip6_ctloutput: mbuf limit hit\n");
|
|
|
|
error = ENOBUFS;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
1999-11-22 02:45:11 +00:00
|
|
|
error = soopt_getm(sopt, &m); /* XXX */
|
2003-12-23 02:36:43 +00:00
|
|
|
if (error != 0)
|
1999-11-22 02:45:11 +00:00
|
|
|
break;
|
|
|
|
error = soopt_mcopyin(sopt, m); /* XXX */
|
2003-12-23 02:36:43 +00:00
|
|
|
if (error != 0)
|
1999-11-22 02:45:11 +00:00
|
|
|
break;
|
2019-08-02 07:41:36 +00:00
|
|
|
error = ip6_pcbopts(&inp->in6p_outputopts,
|
2001-06-11 12:39:29 +00:00
|
|
|
m, so, sopt);
|
|
|
|
m_freem(m); /* XXX */
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Use of some Hop-by-Hop options or some
|
|
|
|
* Destination options might require special
|
|
|
|
* privilege. That is, normal applications
|
|
|
|
* (without special privilege) might be forbidden
|
|
|
|
* from setting certain options in outgoing packets,
|
|
|
|
* and might never see certain options in received
|
|
|
|
* packets. [RFC 2292 Section 6]
|
|
|
|
* KAME specific note:
|
|
|
|
* KAME prevents non-privileged users from sending or
|
|
|
|
* receiving ANY hbh/dst options in order to avoid
|
|
|
|
* overhead of parsing options in the kernel.
|
|
|
|
*/
|
2003-10-24 18:26:30 +00:00
|
|
|
case IPV6_RECVHOPOPTS:
|
|
|
|
case IPV6_RECVDSTOPTS:
|
|
|
|
case IPV6_RECVRTHDRDSTOPTS:
|
2008-01-24 08:25:59 +00:00
|
|
|
if (td != NULL) {
|
|
|
|
error = priv_check(td,
|
|
|
|
PRIV_NETINET_SETHDROPTS);
|
|
|
|
if (error)
|
|
|
|
break;
|
2003-10-24 18:26:30 +00:00
|
|
|
}
|
|
|
|
/* FALLTHROUGH */
|
1999-11-22 02:45:11 +00:00
|
|
|
case IPV6_UNICAST_HOPS:
|
2003-10-24 18:26:30 +00:00
|
|
|
case IPV6_HOPLIMIT:
|
2001-06-11 12:39:29 +00:00
|
|
|
|
2003-10-24 18:26:30 +00:00
|
|
|
case IPV6_RECVPKTINFO:
|
|
|
|
case IPV6_RECVHOPLIMIT:
|
|
|
|
case IPV6_RECVRTHDR:
|
|
|
|
case IPV6_RECVPATHMTU:
|
|
|
|
case IPV6_RECVTCLASS:
|
2015-09-06 20:57:57 +00:00
|
|
|
case IPV6_RECVFLOWID:
|
|
|
|
#ifdef RSS
|
|
|
|
case IPV6_RECVRSSBUCKETID:
|
|
|
|
#endif
|
2001-06-11 12:39:29 +00:00
|
|
|
case IPV6_V6ONLY:
|
2003-10-24 18:26:30 +00:00
|
|
|
case IPV6_AUTOFLOWLABEL:
|
2017-03-06 04:01:58 +00:00
|
|
|
case IPV6_ORIGDSTADDR:
|
2009-06-01 10:30:00 +00:00
|
|
|
case IPV6_BINDANY:
|
2014-07-12 05:46:33 +00:00
|
|
|
case IPV6_BINDMULTI:
|
|
|
|
#ifdef RSS
|
|
|
|
case IPV6_RSS_LISTEN_BUCKET:
|
|
|
|
#endif
|
2009-06-01 10:30:00 +00:00
|
|
|
if (optname == IPV6_BINDANY && td != NULL) {
|
|
|
|
error = priv_check(td,
|
|
|
|
PRIV_NETINET_BINDANY);
|
|
|
|
if (error)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2001-06-11 12:39:29 +00:00
|
|
|
if (optlen != sizeof(int)) {
|
1999-11-22 02:45:11 +00:00
|
|
|
error = EINVAL;
|
2001-06-11 12:39:29 +00:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
error = sooptcopyin(sopt, &optval,
|
|
|
|
sizeof optval, sizeof optval);
|
|
|
|
if (error)
|
|
|
|
break;
|
|
|
|
switch (optname) {
|
|
|
|
|
|
|
|
case IPV6_UNICAST_HOPS:
|
|
|
|
if (optval < -1 || optval >= 256)
|
|
|
|
error = EINVAL;
|
|
|
|
else {
|
|
|
|
/* -1 = kernel default */
|
2019-08-02 07:41:36 +00:00
|
|
|
inp->in6p_hops = optval;
|
|
|
|
if ((inp->inp_vflag &
|
2001-06-11 12:39:29 +00:00
|
|
|
INP_IPV4) != 0)
|
2019-08-02 07:41:36 +00:00
|
|
|
inp->inp_ip_ttl = optval;
|
2001-06-11 12:39:29 +00:00
|
|
|
}
|
|
|
|
break;
|
1999-11-22 02:45:11 +00:00
|
|
|
#define OPTSET(bit) \
|
2001-06-11 12:39:29 +00:00
|
|
|
do { \
|
2019-08-02 07:41:36 +00:00
|
|
|
INP_WLOCK(inp); \
|
1999-11-22 02:45:11 +00:00
|
|
|
if (optval) \
|
2019-08-02 07:41:36 +00:00
|
|
|
inp->inp_flags |= (bit); \
|
1999-11-22 02:45:11 +00:00
|
|
|
else \
|
2019-08-02 07:41:36 +00:00
|
|
|
inp->inp_flags &= ~(bit); \
|
|
|
|
INP_WUNLOCK(inp); \
|
2003-10-08 18:26:08 +00:00
|
|
|
} while (/*CONSTCOND*/ 0)
|
2003-10-24 18:26:30 +00:00
|
|
|
#define OPTSET2292(bit) \
|
|
|
|
do { \
|
2019-08-02 07:41:36 +00:00
|
|
|
INP_WLOCK(inp); \
|
|
|
|
inp->inp_flags |= IN6P_RFC2292; \
|
2003-10-24 18:26:30 +00:00
|
|
|
if (optval) \
|
2019-08-02 07:41:36 +00:00
|
|
|
inp->inp_flags |= (bit); \
|
2003-10-24 18:26:30 +00:00
|
|
|
else \
|
2019-08-02 07:41:36 +00:00
|
|
|
inp->inp_flags &= ~(bit); \
|
|
|
|
INP_WUNLOCK(inp); \
|
2003-10-24 18:26:30 +00:00
|
|
|
} while (/*CONSTCOND*/ 0)
|
2019-08-02 07:41:36 +00:00
|
|
|
#define OPTBIT(bit) (inp->inp_flags & (bit) ? 1 : 0)
|
1999-11-22 02:45:11 +00:00
|
|
|
|
2018-03-22 20:21:05 +00:00
|
|
|
#define OPTSET2_N(bit, val) do { \
|
2014-07-12 05:46:33 +00:00
|
|
|
if (val) \
|
2019-08-02 07:41:36 +00:00
|
|
|
inp->inp_flags2 |= (bit); \
|
2014-07-12 05:46:33 +00:00
|
|
|
else \
|
2019-08-02 07:41:36 +00:00
|
|
|
inp->inp_flags2 &= ~(bit); \
|
2018-03-22 20:21:05 +00:00
|
|
|
} while (0)
|
|
|
|
#define OPTSET2(bit, val) do { \
|
2019-08-02 07:41:36 +00:00
|
|
|
INP_WLOCK(inp); \
|
2018-03-22 20:21:05 +00:00
|
|
|
OPTSET2_N(bit, val); \
|
2019-08-02 07:41:36 +00:00
|
|
|
INP_WUNLOCK(inp); \
|
2014-07-12 05:46:33 +00:00
|
|
|
} while (0)
|
2019-08-02 07:41:36 +00:00
|
|
|
#define OPTBIT2(bit) (inp->inp_flags2 & (bit) ? 1 : 0)
|
2018-03-22 20:21:05 +00:00
|
|
|
#define OPTSET2292_EXCLUSIVE(bit) \
|
|
|
|
do { \
|
2019-08-02 07:41:36 +00:00
|
|
|
INP_WLOCK(inp); \
|
2018-03-22 20:21:05 +00:00
|
|
|
if (OPTBIT(IN6P_RFC2292)) { \
|
|
|
|
error = EINVAL; \
|
|
|
|
} else { \
|
|
|
|
if (optval) \
|
2019-08-02 07:41:36 +00:00
|
|
|
inp->inp_flags |= (bit); \
|
2018-03-22 20:21:05 +00:00
|
|
|
else \
|
2019-08-02 07:41:36 +00:00
|
|
|
inp->inp_flags &= ~(bit); \
|
2018-03-22 20:21:05 +00:00
|
|
|
} \
|
2019-08-02 07:41:36 +00:00
|
|
|
INP_WUNLOCK(inp); \
|
2018-03-22 20:21:05 +00:00
|
|
|
} while (/*CONSTCOND*/ 0)
|
2014-07-12 05:46:33 +00:00
|
|
|
|
2003-10-24 18:26:30 +00:00
|
|
|
case IPV6_RECVPKTINFO:
|
2018-03-22 20:21:05 +00:00
|
|
|
OPTSET2292_EXCLUSIVE(IN6P_PKTINFO);
|
2003-10-24 18:26:30 +00:00
|
|
|
break;
|
|
|
|
|
|
|
|
case IPV6_HOPLIMIT:
|
|
|
|
{
|
|
|
|
struct ip6_pktopts **optp;
|
|
|
|
|
|
|
|
/* cannot mix with RFC2292 */
|
|
|
|
if (OPTBIT(IN6P_RFC2292)) {
|
|
|
|
error = EINVAL;
|
|
|
|
break;
|
|
|
|
}
|
2019-08-02 07:41:36 +00:00
|
|
|
INP_WLOCK(inp);
|
|
|
|
if (inp->inp_flags & (INP_TIMEWAIT | INP_DROPPED)) {
|
|
|
|
INP_WUNLOCK(inp);
|
2018-07-15 00:47:06 +00:00
|
|
|
return (ECONNRESET);
|
|
|
|
}
|
2019-08-02 07:41:36 +00:00
|
|
|
optp = &inp->in6p_outputopts;
|
2003-10-24 18:26:30 +00:00
|
|
|
error = ip6_pcbopt(IPV6_HOPLIMIT,
|
2008-01-24 08:25:59 +00:00
|
|
|
(u_char *)&optval, sizeof(optval),
|
|
|
|
optp, (td != NULL) ? td->td_ucred :
|
|
|
|
NULL, uproto);
|
2019-08-02 07:41:36 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2003-10-24 18:26:30 +00:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
case IPV6_RECVHOPLIMIT:
|
2018-03-22 20:21:05 +00:00
|
|
|
OPTSET2292_EXCLUSIVE(IN6P_HOPLIMIT);
|
2003-10-24 18:26:30 +00:00
|
|
|
break;
|
|
|
|
|
|
|
|
case IPV6_RECVHOPOPTS:
|
2018-03-22 20:21:05 +00:00
|
|
|
OPTSET2292_EXCLUSIVE(IN6P_HOPOPTS);
|
2003-10-24 18:26:30 +00:00
|
|
|
break;
|
|
|
|
|
|
|
|
case IPV6_RECVDSTOPTS:
|
2018-03-22 20:21:05 +00:00
|
|
|
OPTSET2292_EXCLUSIVE(IN6P_DSTOPTS);
|
2003-10-24 18:26:30 +00:00
|
|
|
break;
|
|
|
|
|
|
|
|
case IPV6_RECVRTHDRDSTOPTS:
|
2018-03-22 20:21:05 +00:00
|
|
|
OPTSET2292_EXCLUSIVE(IN6P_RTHDRDSTOPTS);
|
2003-10-24 18:26:30 +00:00
|
|
|
break;
|
|
|
|
|
|
|
|
case IPV6_RECVRTHDR:
|
2018-03-22 20:21:05 +00:00
|
|
|
OPTSET2292_EXCLUSIVE(IN6P_RTHDR);
|
2001-06-11 12:39:29 +00:00
|
|
|
break;
|
1999-11-22 02:45:11 +00:00
|
|
|
|
2003-10-24 18:26:30 +00:00
|
|
|
case IPV6_RECVPATHMTU:
|
|
|
|
/*
|
|
|
|
* We ignore this option for TCP
|
|
|
|
* sockets.
|
2005-07-20 08:59:45 +00:00
|
|
|
* (RFC3542 leaves this case
|
2003-10-24 18:26:30 +00:00
|
|
|
* unspecified.)
|
|
|
|
*/
|
|
|
|
if (uproto != IPPROTO_TCP)
|
|
|
|
OPTSET(IN6P_MTU);
|
|
|
|
break;
|
|
|
|
|
2015-09-06 20:57:57 +00:00
|
|
|
case IPV6_RECVFLOWID:
|
|
|
|
OPTSET2(INP_RECVFLOWID, optval);
|
|
|
|
break;
|
|
|
|
|
|
|
|
#ifdef RSS
|
|
|
|
case IPV6_RECVRSSBUCKETID:
|
|
|
|
OPTSET2(INP_RECVRSSBUCKETID, optval);
|
|
|
|
break;
|
|
|
|
#endif
|
|
|
|
|
2001-06-11 12:39:29 +00:00
|
|
|
case IPV6_V6ONLY:
|
2001-06-24 20:25:38 +00:00
|
|
|
/*
|
|
|
|
* make setsockopt(IPV6_V6ONLY)
|
|
|
|
* available only prior to bind(2).
|
|
|
|
* see ipng mailing list, Jun 22 2001.
|
|
|
|
*/
|
2019-08-02 07:41:36 +00:00
|
|
|
if (inp->inp_lport ||
|
|
|
|
!IN6_IS_ADDR_UNSPECIFIED(&inp->in6p_laddr)) {
|
2001-06-24 20:25:38 +00:00
|
|
|
error = EINVAL;
|
|
|
|
break;
|
|
|
|
}
|
2001-06-11 12:39:29 +00:00
|
|
|
OPTSET(IN6P_IPV6_V6ONLY);
|
2002-07-24 19:19:53 +00:00
|
|
|
if (optval)
|
2019-08-02 07:41:36 +00:00
|
|
|
inp->inp_vflag &= ~INP_IPV4;
|
2002-07-24 19:19:53 +00:00
|
|
|
else
|
2019-08-02 07:41:36 +00:00
|
|
|
inp->inp_vflag |= INP_IPV4;
|
2001-06-11 12:39:29 +00:00
|
|
|
break;
|
2003-10-24 18:26:30 +00:00
|
|
|
case IPV6_RECVTCLASS:
|
|
|
|
/* cannot mix with RFC2292 XXX */
|
2018-03-22 20:21:05 +00:00
|
|
|
OPTSET2292_EXCLUSIVE(IN6P_TCLASS);
|
2003-10-24 18:26:30 +00:00
|
|
|
break;
|
|
|
|
case IPV6_AUTOFLOWLABEL:
|
|
|
|
OPTSET(IN6P_AUTOFLOWLABEL);
|
|
|
|
break;
|
|
|
|
|
2017-03-06 04:01:58 +00:00
|
|
|
case IPV6_ORIGDSTADDR:
|
|
|
|
OPTSET2(INP_ORIGDSTADDR, optval);
|
|
|
|
break;
|
2009-06-01 10:30:00 +00:00
|
|
|
case IPV6_BINDANY:
|
|
|
|
OPTSET(INP_BINDANY);
|
|
|
|
break;
|
2014-07-12 05:46:33 +00:00
|
|
|
|
|
|
|
case IPV6_BINDMULTI:
|
|
|
|
OPTSET2(INP_BINDMULTI, optval);
|
|
|
|
break;
|
|
|
|
#ifdef RSS
|
|
|
|
case IPV6_RSS_LISTEN_BUCKET:
|
|
|
|
if ((optval >= 0) &&
|
|
|
|
(optval < rss_getnumbuckets())) {
|
2019-08-02 07:41:36 +00:00
|
|
|
INP_WLOCK(inp);
|
|
|
|
inp->inp_rss_listen_bucket = optval;
|
2018-03-22 20:21:05 +00:00
|
|
|
OPTSET2_N(INP_RSS_BUCKET_SET, 1);
|
2019-08-02 07:41:36 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2014-07-12 05:46:33 +00:00
|
|
|
} else {
|
|
|
|
error = EINVAL;
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
#endif
|
2001-06-11 12:39:29 +00:00
|
|
|
}
|
|
|
|
break;
|
1999-11-22 02:45:11 +00:00
|
|
|
|
2003-10-24 18:26:30 +00:00
|
|
|
case IPV6_TCLASS:
|
|
|
|
case IPV6_DONTFRAG:
|
|
|
|
case IPV6_USE_MIN_MTU:
|
|
|
|
case IPV6_PREFER_TEMPADDR:
|
|
|
|
if (optlen != sizeof(optval)) {
|
|
|
|
error = EINVAL;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
error = sooptcopyin(sopt, &optval,
|
|
|
|
sizeof optval, sizeof optval);
|
|
|
|
if (error)
|
|
|
|
break;
|
|
|
|
{
|
|
|
|
struct ip6_pktopts **optp;
|
2019-08-02 07:41:36 +00:00
|
|
|
INP_WLOCK(inp);
|
|
|
|
if (inp->inp_flags & (INP_TIMEWAIT | INP_DROPPED)) {
|
|
|
|
INP_WUNLOCK(inp);
|
2018-07-15 00:47:06 +00:00
|
|
|
return (ECONNRESET);
|
|
|
|
}
|
2019-08-02 07:41:36 +00:00
|
|
|
optp = &inp->in6p_outputopts;
|
2003-10-24 18:26:30 +00:00
|
|
|
error = ip6_pcbopt(optname,
|
2008-01-24 08:25:59 +00:00
|
|
|
(u_char *)&optval, sizeof(optval),
|
|
|
|
optp, (td != NULL) ? td->td_ucred :
|
|
|
|
NULL, uproto);
|
2019-08-02 07:41:36 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2003-10-24 18:26:30 +00:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
case IPV6_2292PKTINFO:
|
|
|
|
case IPV6_2292HOPLIMIT:
|
|
|
|
case IPV6_2292HOPOPTS:
|
|
|
|
case IPV6_2292DSTOPTS:
|
|
|
|
case IPV6_2292RTHDR:
|
2001-06-11 12:39:29 +00:00
|
|
|
/* RFC 2292 */
|
|
|
|
if (optlen != sizeof(int)) {
|
|
|
|
error = EINVAL;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
error = sooptcopyin(sopt, &optval,
|
|
|
|
sizeof optval, sizeof optval);
|
|
|
|
if (error)
|
|
|
|
break;
|
|
|
|
switch (optname) {
|
2003-10-24 18:26:30 +00:00
|
|
|
case IPV6_2292PKTINFO:
|
|
|
|
OPTSET2292(IN6P_PKTINFO);
|
2001-06-11 12:39:29 +00:00
|
|
|
break;
|
2003-10-24 18:26:30 +00:00
|
|
|
case IPV6_2292HOPLIMIT:
|
|
|
|
OPTSET2292(IN6P_HOPLIMIT);
|
2001-06-11 12:39:29 +00:00
|
|
|
break;
|
2003-10-24 18:26:30 +00:00
|
|
|
case IPV6_2292HOPOPTS:
|
2001-06-11 12:39:29 +00:00
|
|
|
/*
|
|
|
|
* Check super-user privilege.
|
|
|
|
* See comments for IPV6_RECVHOPOPTS.
|
|
|
|
*/
|
2008-01-24 08:25:59 +00:00
|
|
|
if (td != NULL) {
|
|
|
|
error = priv_check(td,
|
|
|
|
PRIV_NETINET_SETHDROPTS);
|
|
|
|
if (error)
|
|
|
|
return (error);
|
|
|
|
}
|
2003-10-24 18:26:30 +00:00
|
|
|
OPTSET2292(IN6P_HOPOPTS);
|
2001-06-11 12:39:29 +00:00
|
|
|
break;
|
2003-10-24 18:26:30 +00:00
|
|
|
case IPV6_2292DSTOPTS:
|
2008-01-24 08:25:59 +00:00
|
|
|
if (td != NULL) {
|
|
|
|
error = priv_check(td,
|
|
|
|
PRIV_NETINET_SETHDROPTS);
|
|
|
|
if (error)
|
|
|
|
return (error);
|
|
|
|
}
|
2003-10-24 18:26:30 +00:00
|
|
|
OPTSET2292(IN6P_DSTOPTS|IN6P_RTHDRDSTOPTS); /* XXX */
|
2001-06-11 12:39:29 +00:00
|
|
|
break;
|
2003-10-24 18:26:30 +00:00
|
|
|
case IPV6_2292RTHDR:
|
|
|
|
OPTSET2292(IN6P_RTHDR);
|
2001-06-11 12:39:29 +00:00
|
|
|
break;
|
1999-11-22 02:45:11 +00:00
|
|
|
}
|
|
|
|
break;
|
2003-10-24 18:26:30 +00:00
|
|
|
case IPV6_PKTINFO:
|
|
|
|
case IPV6_HOPOPTS:
|
|
|
|
case IPV6_RTHDR:
|
|
|
|
case IPV6_DSTOPTS:
|
|
|
|
case IPV6_RTHDRDSTOPTS:
|
|
|
|
case IPV6_NEXTHOP:
|
|
|
|
{
|
2005-07-20 08:59:45 +00:00
|
|
|
/* new advanced API (RFC3542) */
|
2003-10-24 18:26:30 +00:00
|
|
|
u_char *optbuf;
|
2005-07-28 18:07:07 +00:00
|
|
|
u_char optbuf_storage[MCLBYTES];
|
2003-10-24 18:26:30 +00:00
|
|
|
int optlen;
|
|
|
|
struct ip6_pktopts **optp;
|
|
|
|
|
2018-03-23 18:34:38 +00:00
|
|
|
/* cannot mix with RFC2292 */
|
|
|
|
if (OPTBIT(IN6P_RFC2292)) {
|
|
|
|
error = EINVAL;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
2005-07-28 18:07:07 +00:00
|
|
|
/*
|
|
|
|
* We only ensure valsize is not too large
|
|
|
|
* here. Further validation will be done
|
|
|
|
* later.
|
|
|
|
*/
|
|
|
|
error = sooptcopyin(sopt, optbuf_storage,
|
|
|
|
sizeof(optbuf_storage), 0);
|
2004-03-26 19:52:18 +00:00
|
|
|
if (error)
|
|
|
|
break;
|
2003-10-24 18:26:30 +00:00
|
|
|
optlen = sopt->sopt_valsize;
|
2005-07-28 18:07:07 +00:00
|
|
|
optbuf = optbuf_storage;
|
2019-08-02 07:41:36 +00:00
|
|
|
INP_WLOCK(inp);
|
|
|
|
if (inp->inp_flags & (INP_TIMEWAIT | INP_DROPPED)) {
|
|
|
|
INP_WUNLOCK(inp);
|
2018-07-15 00:47:06 +00:00
|
|
|
return (ECONNRESET);
|
|
|
|
}
|
2019-08-02 07:41:36 +00:00
|
|
|
optp = &inp->in6p_outputopts;
|
2008-01-24 08:25:59 +00:00
|
|
|
error = ip6_pcbopt(optname, optbuf, optlen,
|
|
|
|
optp, (td != NULL) ? td->td_ucred : NULL,
|
|
|
|
uproto);
|
2019-08-02 07:41:36 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2003-10-24 18:26:30 +00:00
|
|
|
break;
|
|
|
|
}
|
1999-11-22 02:45:11 +00:00
|
|
|
#undef OPTSET
|
|
|
|
|
|
|
|
case IPV6_MULTICAST_IF:
|
|
|
|
case IPV6_MULTICAST_HOPS:
|
|
|
|
case IPV6_MULTICAST_LOOP:
|
|
|
|
case IPV6_JOIN_GROUP:
|
|
|
|
case IPV6_LEAVE_GROUP:
|
Bite the bullet, and make the IPv6 SSM and MLDv2 mega-commit:
import from p4 bms_netdev. Summary of changes:
* Connect netinet6/in6_mcast.c to build.
The legacy KAME KPIs are mostly preserved.
* Eliminate now dead code from ip6_output.c.
Don't do mbuf bingo, we are not going to do RFC 2292 style
CMSG tricks for multicast options as they are not required
by any current IPv6 normative reference.
* Refactor transports (UDP, raw_ip6) to do own mcast filtering.
SCTP, TCP unaffected by this change.
* Add ip6_msource, in6_msource structs to in6_var.h.
* Hookup mld_ifinfo state to in6_ifextra, allocate from
domifattach path.
* Eliminate IN6_LOOKUP_MULTI(), it is no longer referenced.
Kernel consumers which need this should use in6m_lookup().
* Refactor IPv6 socket group memberships to use a vector (like IPv4).
* Update ifmcstat(8) for IPv6 SSM.
* Add witness lock order for IN6_MULTI_LOCK.
* Move IN6_MULTI_LOCK out of lower ip6_output()/ip6_input() paths.
* Introduce IP6STAT_ADD/SUB/INC/DEC as per rwatson's IPv4 cleanup.
* Update carp(4) for new IPv6 SSM KPIs.
* Virtualize ip6_mrouter socket.
Changes mostly localized to IPv6 MROUTING.
* Don't do a local group lookup in MROUTING.
* Kill unused KAME prototypes in6_purgemkludge(), in6_restoremkludge().
* Preserve KAME DAD timer jitter behaviour in MLDv1 compatibility mode.
* Bump __FreeBSD_version to 800084.
* Update UPDATING.
NOTE WELL:
* This code hasn't been tested against real MLDv2 queriers
(yet), although the on-wire protocol has been verified in Wireshark.
* There are a few unresolved issues in the socket layer APIs to
do with scope ID propagation.
* There is a LOR present in ip6_output()'s use of
in6_setscope() which needs to be resolved. See comments in mld6.c.
This is believed to be benign and can't be avoided for the moment
without re-introducing an indirect netisr.
This work was mostly derived from the IGMPv3 implementation, and
has been sponsored by a third party.
2009-04-29 19:19:13 +00:00
|
|
|
case IPV6_MSFILTER:
|
|
|
|
case MCAST_BLOCK_SOURCE:
|
|
|
|
case MCAST_UNBLOCK_SOURCE:
|
|
|
|
case MCAST_JOIN_GROUP:
|
|
|
|
case MCAST_LEAVE_GROUP:
|
|
|
|
case MCAST_JOIN_SOURCE_GROUP:
|
|
|
|
case MCAST_LEAVE_SOURCE_GROUP:
|
2019-08-02 07:41:36 +00:00
|
|
|
error = ip6_setmoptions(inp, sopt);
|
1999-11-22 02:45:11 +00:00
|
|
|
break;
|
|
|
|
|
2000-07-04 16:35:15 +00:00
|
|
|
case IPV6_PORTRANGE:
|
|
|
|
error = sooptcopyin(sopt, &optval,
|
|
|
|
sizeof optval, sizeof optval);
|
|
|
|
if (error)
|
|
|
|
break;
|
1999-11-22 02:45:11 +00:00
|
|
|
|
2019-08-02 07:41:36 +00:00
|
|
|
INP_WLOCK(inp);
|
2000-07-04 16:35:15 +00:00
|
|
|
switch (optval) {
|
|
|
|
case IPV6_PORTRANGE_DEFAULT:
|
2019-08-02 07:41:36 +00:00
|
|
|
inp->inp_flags &= ~(INP_LOWPORT);
|
|
|
|
inp->inp_flags &= ~(INP_HIGHPORT);
|
2000-07-04 16:35:15 +00:00
|
|
|
break;
|
1999-11-22 02:45:11 +00:00
|
|
|
|
2000-07-04 16:35:15 +00:00
|
|
|
case IPV6_PORTRANGE_HIGH:
|
2019-08-02 07:41:36 +00:00
|
|
|
inp->inp_flags &= ~(INP_LOWPORT);
|
|
|
|
inp->inp_flags |= INP_HIGHPORT;
|
2000-07-04 16:35:15 +00:00
|
|
|
break;
|
1999-11-22 02:45:11 +00:00
|
|
|
|
2000-07-04 16:35:15 +00:00
|
|
|
case IPV6_PORTRANGE_LOW:
|
2019-08-02 07:41:36 +00:00
|
|
|
inp->inp_flags &= ~(INP_HIGHPORT);
|
|
|
|
inp->inp_flags |= INP_LOWPORT;
|
2000-07-04 16:35:15 +00:00
|
|
|
break;
|
1999-11-22 02:45:11 +00:00
|
|
|
|
2000-07-04 16:35:15 +00:00
|
|
|
default:
|
|
|
|
error = EINVAL;
|
|
|
|
break;
|
|
|
|
}
|
2019-08-02 07:41:36 +00:00
|
|
|
INP_WUNLOCK(inp);
|
1999-11-22 02:45:11 +00:00
|
|
|
break;
|
|
|
|
|
2017-02-06 08:49:57 +00:00
|
|
|
#if defined(IPSEC) || defined(IPSEC_SUPPORT)
|
1999-11-22 02:45:11 +00:00
|
|
|
case IPV6_IPSEC_POLICY:
|
2017-02-06 08:49:57 +00:00
|
|
|
if (IPSEC_ENABLED(ipv6)) {
|
2019-08-02 07:41:36 +00:00
|
|
|
error = IPSEC_PCBCTL(ipv6, inp, sopt);
|
1999-11-22 02:45:11 +00:00
|
|
|
break;
|
2017-02-06 08:49:57 +00:00
|
|
|
}
|
|
|
|
/* FALLTHROUGH */
|
2007-07-03 12:13:45 +00:00
|
|
|
#endif /* IPSEC || IPSEC_SUPPORT */
|
1999-11-22 02:45:11 +00:00
|
|
|
|
|
|
|
default:
|
|
|
|
error = ENOPROTOOPT;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
|
|
|
|
case SOPT_GET:
|
|
|
|
switch (optname) {
|
|
|
|
|
2003-10-24 18:26:30 +00:00
|
|
|
case IPV6_2292PKTOPTIONS:
|
|
|
|
#ifdef IPV6_PKTOPTIONS
|
1999-11-22 02:45:11 +00:00
|
|
|
case IPV6_PKTOPTIONS:
|
2003-10-24 18:26:30 +00:00
|
|
|
#endif
|
|
|
|
/*
|
|
|
|
* RFC3542 (effectively) deprecated the
|
|
|
|
* semantics of the 2292-style pktoptions.
|
|
|
|
* Since it was not reliable in nature (i.e.,
|
|
|
|
* applications had to expect the lack of some
|
|
|
|
* information after all), it would make sense
|
|
|
|
* to simplify this part by always returning
|
|
|
|
* empty data.
|
|
|
|
*/
|
|
|
|
sopt->sopt_valsize = 0;
|
1999-11-22 02:45:11 +00:00
|
|
|
break;
|
|
|
|
|
2003-10-24 18:26:30 +00:00
|
|
|
case IPV6_RECVHOPOPTS:
|
|
|
|
case IPV6_RECVDSTOPTS:
|
|
|
|
case IPV6_RECVRTHDRDSTOPTS:
|
1999-11-22 02:45:11 +00:00
|
|
|
case IPV6_UNICAST_HOPS:
|
2003-10-24 18:26:30 +00:00
|
|
|
case IPV6_RECVPKTINFO:
|
|
|
|
case IPV6_RECVHOPLIMIT:
|
|
|
|
case IPV6_RECVRTHDR:
|
|
|
|
case IPV6_RECVPATHMTU:
|
2001-06-11 12:39:29 +00:00
|
|
|
|
|
|
|
case IPV6_V6ONLY:
|
2000-01-13 05:07:42 +00:00
|
|
|
case IPV6_PORTRANGE:
|
2003-10-24 18:26:30 +00:00
|
|
|
case IPV6_RECVTCLASS:
|
|
|
|
case IPV6_AUTOFLOWLABEL:
|
2010-09-24 14:38:54 +00:00
|
|
|
case IPV6_BINDANY:
|
2014-07-12 05:46:33 +00:00
|
|
|
case IPV6_FLOWID:
|
|
|
|
case IPV6_FLOWTYPE:
|
2015-09-06 20:57:57 +00:00
|
|
|
case IPV6_RECVFLOWID:
|
2014-07-12 05:46:33 +00:00
|
|
|
#ifdef RSS
|
|
|
|
case IPV6_RSSBUCKETID:
|
2015-09-06 20:57:57 +00:00
|
|
|
case IPV6_RECVRSSBUCKETID:
|
2014-07-12 05:46:33 +00:00
|
|
|
#endif
|
2015-12-30 18:08:05 +00:00
|
|
|
case IPV6_BINDMULTI:
|
1999-11-22 02:45:11 +00:00
|
|
|
switch (optname) {
|
|
|
|
|
2003-10-24 18:26:30 +00:00
|
|
|
case IPV6_RECVHOPOPTS:
|
|
|
|
optval = OPTBIT(IN6P_HOPOPTS);
|
|
|
|
break;
|
|
|
|
|
|
|
|
case IPV6_RECVDSTOPTS:
|
|
|
|
optval = OPTBIT(IN6P_DSTOPTS);
|
|
|
|
break;
|
|
|
|
|
|
|
|
case IPV6_RECVRTHDRDSTOPTS:
|
|
|
|
optval = OPTBIT(IN6P_RTHDRDSTOPTS);
|
|
|
|
break;
|
|
|
|
|
1999-11-22 02:45:11 +00:00
|
|
|
case IPV6_UNICAST_HOPS:
|
2019-08-02 07:41:36 +00:00
|
|
|
optval = inp->in6p_hops;
|
1999-11-22 02:45:11 +00:00
|
|
|
break;
|
|
|
|
|
2003-10-24 18:26:30 +00:00
|
|
|
case IPV6_RECVPKTINFO:
|
|
|
|
optval = OPTBIT(IN6P_PKTINFO);
|
|
|
|
break;
|
|
|
|
|
|
|
|
case IPV6_RECVHOPLIMIT:
|
|
|
|
optval = OPTBIT(IN6P_HOPLIMIT);
|
|
|
|
break;
|
|
|
|
|
|
|
|
case IPV6_RECVRTHDR:
|
|
|
|
optval = OPTBIT(IN6P_RTHDR);
|
|
|
|
break;
|
|
|
|
|
|
|
|
case IPV6_RECVPATHMTU:
|
|
|
|
optval = OPTBIT(IN6P_MTU);
|
1999-11-22 02:45:11 +00:00
|
|
|
break;
|
|
|
|
|
2001-06-11 12:39:29 +00:00
|
|
|
case IPV6_V6ONLY:
|
2002-07-22 15:51:02 +00:00
|
|
|
optval = OPTBIT(IN6P_IPV6_V6ONLY);
|
1999-11-22 02:45:11 +00:00
|
|
|
break;
|
|
|
|
|
|
|
|
case IPV6_PORTRANGE:
|
|
|
|
{
|
|
|
|
int flags;
|
2019-08-02 07:41:36 +00:00
|
|
|
flags = inp->inp_flags;
|
2008-12-17 13:00:18 +00:00
|
|
|
if (flags & INP_HIGHPORT)
|
1999-11-22 02:45:11 +00:00
|
|
|
optval = IPV6_PORTRANGE_HIGH;
|
2008-12-17 13:00:18 +00:00
|
|
|
else if (flags & INP_LOWPORT)
|
1999-11-22 02:45:11 +00:00
|
|
|
optval = IPV6_PORTRANGE_LOW;
|
|
|
|
else
|
|
|
|
optval = 0;
|
|
|
|
break;
|
|
|
|
}
|
2003-10-24 18:26:30 +00:00
|
|
|
case IPV6_RECVTCLASS:
|
|
|
|
optval = OPTBIT(IN6P_TCLASS);
|
|
|
|
break;
|
|
|
|
|
|
|
|
case IPV6_AUTOFLOWLABEL:
|
|
|
|
optval = OPTBIT(IN6P_AUTOFLOWLABEL);
|
|
|
|
break;
|
2009-06-01 10:30:00 +00:00
|
|
|
|
2017-03-06 04:01:58 +00:00
|
|
|
case IPV6_ORIGDSTADDR:
|
|
|
|
optval = OPTBIT2(INP_ORIGDSTADDR);
|
|
|
|
break;
|
|
|
|
|
2009-06-01 10:30:00 +00:00
|
|
|
case IPV6_BINDANY:
|
|
|
|
optval = OPTBIT(INP_BINDANY);
|
|
|
|
break;
|
2014-07-12 05:46:33 +00:00
|
|
|
|
|
|
|
case IPV6_FLOWID:
|
2019-08-02 07:41:36 +00:00
|
|
|
optval = inp->inp_flowid;
|
2014-07-12 05:46:33 +00:00
|
|
|
break;
|
|
|
|
|
|
|
|
case IPV6_FLOWTYPE:
|
2019-08-02 07:41:36 +00:00
|
|
|
optval = inp->inp_flowtype;
|
2014-07-12 05:46:33 +00:00
|
|
|
break;
|
2015-09-06 20:57:57 +00:00
|
|
|
|
|
|
|
case IPV6_RECVFLOWID:
|
|
|
|
optval = OPTBIT2(INP_RECVFLOWID);
|
|
|
|
break;
|
2014-07-12 05:46:33 +00:00
|
|
|
#ifdef RSS
|
|
|
|
case IPV6_RSSBUCKETID:
|
|
|
|
retval =
|
2019-08-02 07:41:36 +00:00
|
|
|
rss_hash2bucket(inp->inp_flowid,
|
|
|
|
inp->inp_flowtype,
|
2014-07-12 05:46:33 +00:00
|
|
|
&rss_bucket);
|
|
|
|
if (retval == 0)
|
|
|
|
optval = rss_bucket;
|
|
|
|
else
|
|
|
|
error = EINVAL;
|
|
|
|
break;
|
2015-09-06 20:57:57 +00:00
|
|
|
|
|
|
|
case IPV6_RECVRSSBUCKETID:
|
|
|
|
optval = OPTBIT2(INP_RECVRSSBUCKETID);
|
|
|
|
break;
|
2014-07-12 05:46:33 +00:00
|
|
|
#endif
|
|
|
|
|
|
|
|
case IPV6_BINDMULTI:
|
|
|
|
optval = OPTBIT2(INP_BINDMULTI);
|
|
|
|
break;
|
|
|
|
|
1999-11-22 02:45:11 +00:00
|
|
|
}
|
2003-10-24 18:26:30 +00:00
|
|
|
if (error)
|
|
|
|
break;
|
1999-11-22 02:45:11 +00:00
|
|
|
error = sooptcopyout(sopt, &optval,
|
|
|
|
sizeof optval);
|
|
|
|
break;
|
|
|
|
|
2003-10-24 18:26:30 +00:00
|
|
|
case IPV6_PATHMTU:
|
|
|
|
{
|
|
|
|
u_long pmtu = 0;
|
|
|
|
struct ip6_mtuinfo mtuinfo;
|
2018-03-22 21:18:34 +00:00
|
|
|
struct in6_addr addr;
|
2003-10-24 18:26:30 +00:00
|
|
|
|
|
|
|
if (!(so->so_state & SS_ISCONNECTED))
|
|
|
|
return (ENOTCONN);
|
|
|
|
/*
|
|
|
|
* XXX: we do not consider the case of source
|
|
|
|
* routing, or optional information to specify
|
|
|
|
* the outgoing interface.
|
2019-08-02 07:41:36 +00:00
|
|
|
* Copy faddr out of inp to avoid holding lock
|
2018-03-22 21:18:34 +00:00
|
|
|
* on inp during route lookup.
|
2003-10-24 18:26:30 +00:00
|
|
|
*/
|
2019-08-02 07:41:36 +00:00
|
|
|
INP_RLOCK(inp);
|
|
|
|
bcopy(&inp->in6p_faddr, &addr, sizeof(addr));
|
|
|
|
INP_RUNLOCK(inp);
|
2016-01-03 09:54:03 +00:00
|
|
|
error = ip6_getpmtu_ctl(so->so_fibnum,
|
2018-03-22 21:18:34 +00:00
|
|
|
&addr, &pmtu);
|
2003-10-24 18:26:30 +00:00
|
|
|
if (error)
|
|
|
|
break;
|
|
|
|
if (pmtu > IPV6_MAXPACKET)
|
|
|
|
pmtu = IPV6_MAXPACKET;
|
|
|
|
|
|
|
|
bzero(&mtuinfo, sizeof(mtuinfo));
|
|
|
|
mtuinfo.ip6m_mtu = (u_int32_t)pmtu;
|
|
|
|
optdata = (void *)&mtuinfo;
|
|
|
|
optdatalen = sizeof(mtuinfo);
|
|
|
|
error = sooptcopyout(sopt, optdata,
|
|
|
|
optdatalen);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
case IPV6_2292PKTINFO:
|
|
|
|
case IPV6_2292HOPLIMIT:
|
|
|
|
case IPV6_2292HOPOPTS:
|
|
|
|
case IPV6_2292RTHDR:
|
|
|
|
case IPV6_2292DSTOPTS:
|
2001-06-11 12:39:29 +00:00
|
|
|
switch (optname) {
|
2003-10-24 18:26:30 +00:00
|
|
|
case IPV6_2292PKTINFO:
|
2001-06-11 12:39:29 +00:00
|
|
|
optval = OPTBIT(IN6P_PKTINFO);
|
|
|
|
break;
|
2003-10-24 18:26:30 +00:00
|
|
|
case IPV6_2292HOPLIMIT:
|
2001-06-11 12:39:29 +00:00
|
|
|
optval = OPTBIT(IN6P_HOPLIMIT);
|
|
|
|
break;
|
2003-10-24 18:26:30 +00:00
|
|
|
case IPV6_2292HOPOPTS:
|
2001-06-11 12:39:29 +00:00
|
|
|
optval = OPTBIT(IN6P_HOPOPTS);
|
|
|
|
break;
|
2003-10-24 18:26:30 +00:00
|
|
|
case IPV6_2292RTHDR:
|
2001-06-11 12:39:29 +00:00
|
|
|
optval = OPTBIT(IN6P_RTHDR);
|
|
|
|
break;
|
2003-10-24 18:26:30 +00:00
|
|
|
case IPV6_2292DSTOPTS:
|
2001-06-11 12:39:29 +00:00
|
|
|
optval = OPTBIT(IN6P_DSTOPTS|IN6P_RTHDRDSTOPTS);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
error = sooptcopyout(sopt, &optval,
|
2003-10-24 18:26:30 +00:00
|
|
|
sizeof optval);
|
|
|
|
break;
|
|
|
|
case IPV6_PKTINFO:
|
|
|
|
case IPV6_HOPOPTS:
|
|
|
|
case IPV6_RTHDR:
|
|
|
|
case IPV6_DSTOPTS:
|
|
|
|
case IPV6_RTHDRDSTOPTS:
|
|
|
|
case IPV6_NEXTHOP:
|
|
|
|
case IPV6_TCLASS:
|
|
|
|
case IPV6_DONTFRAG:
|
|
|
|
case IPV6_USE_MIN_MTU:
|
|
|
|
case IPV6_PREFER_TEMPADDR:
|
2019-08-02 07:41:36 +00:00
|
|
|
error = ip6_getpcbopt(inp, optname, sopt);
|
2001-06-11 12:39:29 +00:00
|
|
|
break;
|
|
|
|
|
1999-11-22 02:45:11 +00:00
|
|
|
case IPV6_MULTICAST_IF:
|
|
|
|
case IPV6_MULTICAST_HOPS:
|
|
|
|
case IPV6_MULTICAST_LOOP:
|
2009-04-29 19:19:13 +00:00
|
|
|
case IPV6_MSFILTER:
|
2019-08-02 07:41:36 +00:00
|
|
|
error = ip6_getmoptions(inp, sopt);
|
1999-11-22 02:45:11 +00:00
|
|
|
break;
|
|
|
|
|
2017-02-06 08:49:57 +00:00
|
|
|
#if defined(IPSEC) || defined(IPSEC_SUPPORT)
|
1999-11-22 02:45:11 +00:00
|
|
|
case IPV6_IPSEC_POLICY:
|
2017-02-06 08:49:57 +00:00
|
|
|
if (IPSEC_ENABLED(ipv6)) {
|
2019-08-02 07:41:36 +00:00
|
|
|
error = IPSEC_PCBCTL(ipv6, inp, sopt);
|
2000-07-04 16:35:15 +00:00
|
|
|
break;
|
|
|
|
}
|
2017-02-06 08:49:57 +00:00
|
|
|
/* FALLTHROUGH */
|
2007-07-03 12:13:45 +00:00
|
|
|
#endif /* IPSEC || IPSEC_SUPPORT */
|
1999-11-22 02:45:11 +00:00
|
|
|
default:
|
|
|
|
error = ENOPROTOOPT;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
2003-10-06 14:02:09 +00:00
|
|
|
return (error);
|
1999-11-22 02:45:11 +00:00
|
|
|
}
|
|
|
|
|
2003-10-26 18:17:01 +00:00
|
|
|
int
|
2007-07-05 16:23:49 +00:00
|
|
|
ip6_raw_ctloutput(struct socket *so, struct sockopt *sopt)
|
2003-10-26 18:17:01 +00:00
|
|
|
{
|
|
|
|
int error = 0, optval, optlen;
|
|
|
|
const int icmp6off = offsetof(struct icmp6_hdr, icmp6_cksum);
|
2019-08-02 07:41:36 +00:00
|
|
|
struct inpcb *inp = sotoinpcb(so);
|
2003-10-26 18:17:01 +00:00
|
|
|
int level, op, optname;
|
|
|
|
|
2008-07-29 09:31:03 +00:00
|
|
|
level = sopt->sopt_level;
|
|
|
|
op = sopt->sopt_dir;
|
|
|
|
optname = sopt->sopt_name;
|
|
|
|
optlen = sopt->sopt_valsize;
|
2003-10-26 18:17:01 +00:00
|
|
|
|
|
|
|
if (level != IPPROTO_IPV6) {
|
|
|
|
return (EINVAL);
|
|
|
|
}
|
|
|
|
|
|
|
|
switch (optname) {
|
|
|
|
case IPV6_CHECKSUM:
|
|
|
|
/*
|
|
|
|
* For ICMPv6 sockets, no modification allowed for checksum
|
|
|
|
* offset, permit "no change" values to help existing apps.
|
|
|
|
*
|
2005-07-20 08:59:45 +00:00
|
|
|
* RFC3542 says: "An attempt to set IPV6_CHECKSUM
|
2003-10-26 18:17:01 +00:00
|
|
|
* for an ICMPv6 socket will fail."
|
2005-07-20 08:59:45 +00:00
|
|
|
* The current behavior does not meet RFC3542.
|
2003-10-26 18:17:01 +00:00
|
|
|
*/
|
|
|
|
switch (op) {
|
|
|
|
case SOPT_SET:
|
|
|
|
if (optlen != sizeof(int)) {
|
|
|
|
error = EINVAL;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
error = sooptcopyin(sopt, &optval, sizeof(optval),
|
|
|
|
sizeof(optval));
|
|
|
|
if (error)
|
|
|
|
break;
|
2019-04-19 17:17:41 +00:00
|
|
|
if (optval < -1 || (optval != -1 && (optval % 2) != 0)) {
|
|
|
|
/*
|
|
|
|
* The API assumes non-negative even offset
|
|
|
|
* values or -1 as a special value.
|
|
|
|
*/
|
2003-10-26 18:17:01 +00:00
|
|
|
error = EINVAL;
|
|
|
|
} else if (so->so_proto->pr_protocol ==
|
|
|
|
IPPROTO_ICMPV6) {
|
|
|
|
if (optval != icmp6off)
|
|
|
|
error = EINVAL;
|
|
|
|
} else
|
2019-08-02 07:41:36 +00:00
|
|
|
inp->in6p_cksum = optval;
|
2003-10-26 18:17:01 +00:00
|
|
|
break;
|
|
|
|
|
|
|
|
case SOPT_GET:
|
|
|
|
if (so->so_proto->pr_protocol == IPPROTO_ICMPV6)
|
|
|
|
optval = icmp6off;
|
|
|
|
else
|
2019-08-02 07:41:36 +00:00
|
|
|
optval = inp->in6p_cksum;
|
2003-10-26 18:17:01 +00:00
|
|
|
|
|
|
|
error = sooptcopyout(sopt, &optval, sizeof(optval));
|
|
|
|
break;
|
|
|
|
|
|
|
|
default:
|
|
|
|
error = EINVAL;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
|
|
|
|
default:
|
|
|
|
error = ENOPROTOOPT;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
return (error);
|
|
|
|
}
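The offset rule stated in the IPV6_CHECKSUM comment above (a non-negative even offset, or -1 as a special value) can be sketched as a standalone userland predicate. This is an illustration only, not the kernel's code: `cksum_offset_ok` is a hypothetical name. One detail worth keeping in mind is that in C `-1 % 2` evaluates to `-1`, so a bare parity test would also reject -1; the sketch therefore tests the special value first.

```c
#include <stdbool.h>

/*
 * Hypothetical helper (not part of ip6_output.c): returns true when
 * "off" is acceptable as an IPV6_CHECKSUM offset under the rule
 * "non-negative even value, or -1 as a special value".
 */
static bool
cksum_offset_ok(int off)
{
	if (off == -1)		/* special value, checked before parity */
		return (true);
	if (off < 0)		/* any other negative value is invalid */
		return (false);
	return ((off % 2) == 0);	/* only even offsets are accepted */
}
```

A caller would run this check before storing the offset, and return EINVAL when it fails.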
|
|
|
|
|
1999-11-22 02:45:11 +00:00
|
|
|
/*
|
2001-06-11 12:39:29 +00:00
|
|
|
* Set up IP6 options in pcb for insertion in output packets or
|
|
|
|
* specifying behavior of outgoing packets.
|
1999-11-22 02:45:11 +00:00
|
|
|
*/
|
|
|
|
static int
|
2007-07-05 16:23:49 +00:00
|
|
|
ip6_pcbopts(struct ip6_pktopts **pktopt, struct mbuf *m,
|
|
|
|
struct socket *so, struct sockopt *sopt)
|
1999-11-22 02:45:11 +00:00
|
|
|
{
|
2001-06-11 12:39:29 +00:00
|
|
|
struct ip6_pktopts *opt = *pktopt;
|
1999-11-22 02:45:11 +00:00
|
|
|
int error = 0;
|
2001-09-12 08:38:13 +00:00
|
|
|
struct thread *td = sopt->sopt_td;
|
1999-11-22 02:45:11 +00:00
|
|
|
|
|
|
|
/* turn off any old options. */
|
|
|
|
if (opt) {
|
2001-06-11 12:39:29 +00:00
|
|
|
#ifdef DIAGNOSTIC
|
|
|
|
if (opt->ip6po_pktinfo || opt->ip6po_nexthop ||
|
|
|
|
opt->ip6po_hbh || opt->ip6po_dest1 || opt->ip6po_dest2 ||
|
|
|
|
opt->ip6po_rhinfo.ip6po_rhi_rthdr)
|
|
|
|
printf("ip6_pcbopts: all specified options are cleared.\n");
|
|
|
|
#endif
|
2003-10-24 18:26:30 +00:00
|
|
|
ip6_clearpktopts(opt, -1);
|
1999-11-22 02:45:11 +00:00
|
|
|
} else
|
2003-02-19 05:47:46 +00:00
|
|
|
opt = malloc(sizeof(*opt), M_IP6OPT, M_WAITOK);
|
2001-06-11 12:39:29 +00:00
|
|
|
*pktopt = NULL;
|
1999-11-22 02:45:11 +00:00
|
|
|
|
|
|
|
if (!m || m->m_len == 0) {
|
|
|
|
/*
|
2002-10-31 19:45:48 +00:00
|
|
|
* Only turning off any previous options, regardless of
|
|
|
|
* whether the opt is just created or given.
|
1999-11-22 02:45:11 +00:00
|
|
|
*/
|
2002-10-31 19:45:48 +00:00
|
|
|
free(opt, M_IP6OPT);
|
2003-10-06 14:02:09 +00:00
|
|
|
return (0);
|
1999-11-22 02:45:11 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/* set options specified by user. */
|
2008-01-24 08:25:59 +00:00
|
|
|
if ((error = ip6_setpktopts(m, opt, NULL, (td != NULL) ?
|
|
|
|
td->td_ucred : NULL, so->so_proto->pr_protocol)) != 0) {
|
2003-10-24 18:26:30 +00:00
|
|
|
ip6_clearpktopts(opt, -1); /* XXX: discard all options */
|
2002-10-31 19:45:48 +00:00
|
|
|
free(opt, M_IP6OPT);
|
2003-10-06 14:02:09 +00:00
|
|
|
return (error);
|
1999-11-22 02:45:11 +00:00
|
|
|
}
|
|
|
|
*pktopt = opt;
|
2003-10-06 14:02:09 +00:00
|
|
|
return (0);
|
1999-11-22 02:45:11 +00:00
|
|
|
}
|
|
|
|
|
2001-06-11 12:39:29 +00:00
|
|
|
/*
|
|
|
|
* Initialize ip6_pktopts. Beware that there are non-zero default values in
|
|
|
|
* the struct.
|
|
|
|
*/
|
|
|
|
void
|
2007-07-05 16:23:49 +00:00
|
|
|
ip6_initpktopts(struct ip6_pktopts *opt)
|
2001-06-11 12:39:29 +00:00
|
|
|
{
|
|
|
|
|
|
|
|
bzero(opt, sizeof(*opt));
|
|
|
|
opt->ip6po_hlim = -1; /* -1 means default hop limit */
|
2003-10-24 18:26:30 +00:00
|
|
|
opt->ip6po_tclass = -1; /* -1 means default traffic class */
|
|
|
|
opt->ip6po_minmtu = IP6PO_MINMTU_MCASTONLY;
|
|
|
|
opt->ip6po_prefer_tempaddr = IP6PO_TEMPADDR_SYSTEM;
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
2007-07-05 16:23:49 +00:00
|
|
|
ip6_pcbopt(int optname, u_char *buf, int len, struct ip6_pktopts **pktopt,
|
2008-01-24 08:25:59 +00:00
|
|
|
struct ucred *cred, int uproto)
|
2003-10-24 18:26:30 +00:00
|
|
|
{
|
|
|
|
struct ip6_pktopts *opt;
|
|
|
|
|
|
|
|
if (*pktopt == NULL) {
|
|
|
|
*pktopt = malloc(sizeof(struct ip6_pktopts), M_IP6OPT,
|
2018-07-15 00:47:06 +00:00
|
|
|
M_NOWAIT);
|
|
|
|
if (*pktopt == NULL)
|
|
|
|
return (ENOBUFS);
|
2005-07-21 15:06:32 +00:00
|
|
|
ip6_initpktopts(*pktopt);
|
2003-10-24 18:26:30 +00:00
|
|
|
}
|
|
|
|
opt = *pktopt;
|
|
|
|
|
2008-01-24 08:25:59 +00:00
|
|
|
return (ip6_setpktopt(optname, buf, len, opt, cred, 1, 0, uproto));
|
2003-10-24 18:26:30 +00:00
|
|
|
}
|
|
|
|
|
2018-03-22 23:34:48 +00:00
|
|
|
#define GET_PKTOPT_VAR(field, lenexpr) do { \
|
|
|
|
if (pktopt && pktopt->field) { \
|
2019-08-02 07:41:36 +00:00
|
|
|
INP_RUNLOCK(inp); \
|
2018-03-22 23:34:48 +00:00
|
|
|
optdata = malloc(sopt->sopt_valsize, M_TEMP, M_WAITOK); \
|
|
|
|
malloc_optdata = true; \
|
2019-08-02 07:41:36 +00:00
|
|
|
INP_RLOCK(inp); \
|
|
|
|
if (inp->inp_flags & (INP_TIMEWAIT | INP_DROPPED)) { \
|
|
|
|
INP_RUNLOCK(inp); \
|
2018-03-22 23:34:48 +00:00
|
|
|
free(optdata, M_TEMP); \
|
|
|
|
return (ECONNRESET); \
|
|
|
|
} \
|
2019-08-02 07:41:36 +00:00
|
|
|
pktopt = inp->in6p_outputopts; \
|
2018-03-22 23:34:48 +00:00
|
|
|
if (pktopt && pktopt->field) { \
|
|
|
|
optdatalen = min(lenexpr, sopt->sopt_valsize); \
|
|
|
|
bcopy(&pktopt->field, optdata, optdatalen); \
|
|
|
|
} else { \
|
|
|
|
free(optdata, M_TEMP); \
|
|
|
|
optdata = NULL; \
|
|
|
|
malloc_optdata = false; \
|
|
|
|
} \
|
|
|
|
} \
|
|
|
|
} while(0)
|
|
|
|
|
|
|
|
#define GET_PKTOPT_EXT_HDR(field) GET_PKTOPT_VAR(field, \
|
|
|
|
(((struct ip6_ext *)pktopt->field)->ip6e_len + 1) << 3)
|
|
|
|
|
|
|
|
#define GET_PKTOPT_SOCKADDR(field) GET_PKTOPT_VAR(field, \
|
|
|
|
pktopt->field->sa_len)
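GET_PKTOPT_EXT_HDR sizes an extension header as `(ip6e_len + 1) << 3`, because the length field of an IPv6 extension header counts 8-octet units beyond the first 8 octets. A minimal userland sketch of that arithmetic follows; `ext_hdr_bytes` is a hypothetical helper name, not something defined in this file.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert the 8-bit "header extension length"
 * field of an IPv6 extension header into a byte count.  The field is
 * expressed in 8-octet units and excludes the first 8 octets.
 */
static size_t
ext_hdr_bytes(uint8_t len_field)
{
	return (((size_t)len_field + 1) << 3);
}
```

A length field of 0 therefore describes the minimum 8-byte header, and the maximum field value of 255 describes 2048 bytes.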
|
|
|
|
|
2003-10-24 18:26:30 +00:00
|
|
|
static int
|
2019-08-02 07:41:36 +00:00
|
|
|
ip6_getpcbopt(struct inpcb *inp, int optname, struct sockopt *sopt)
|
2003-10-24 18:26:30 +00:00
|
|
|
{
|
|
|
|
void *optdata = NULL;
|
2018-03-22 23:34:48 +00:00
|
|
|
bool malloc_optdata = false;
|
2003-10-24 18:26:30 +00:00
|
|
|
int optdatalen = 0;
|
|
|
|
int error = 0;
|
|
|
|
struct in6_pktinfo null_pktinfo;
|
|
|
|
int deftclass = 0, on;
|
|
|
|
int defminmtu = IP6PO_MINMTU_MCASTONLY;
|
|
|
|
int defpreftemp = IP6PO_TEMPADDR_SYSTEM;
|
2018-03-22 23:34:48 +00:00
|
|
|
struct ip6_pktopts *pktopt;
|
|
|
|
|
2019-08-02 07:41:36 +00:00
|
|
|
INP_RLOCK(inp);
|
|
|
|
pktopt = inp->in6p_outputopts;
|
2003-10-24 18:26:30 +00:00
|
|
|
|
|
|
|
switch (optname) {
|
|
|
|
case IPV6_PKTINFO:
|
2015-07-03 19:01:38 +00:00
|
|
|
optdata = (void *)&null_pktinfo;
|
|
|
|
if (pktopt && pktopt->ip6po_pktinfo) {
|
|
|
|
bcopy(pktopt->ip6po_pktinfo, &null_pktinfo,
|
|
|
|
sizeof(null_pktinfo));
|
|
|
|
in6_clearscope(&null_pktinfo.ipi6_addr);
|
|
|
|
} else {
|
2003-10-24 18:26:30 +00:00
|
|
|
/* XXX: we don't have to do this every time... */
|
|
|
|
bzero(&null_pktinfo, sizeof(null_pktinfo));
|
|
|
|
}
|
|
|
|
optdatalen = sizeof(struct in6_pktinfo);
|
|
|
|
break;
|
|
|
|
case IPV6_TCLASS:
|
|
|
|
if (pktopt && pktopt->ip6po_tclass >= 0)
|
2018-03-22 23:34:48 +00:00
|
|
|
deftclass = pktopt->ip6po_tclass;
|
|
|
|
optdata = (void *)&deftclass;
|
2003-10-24 18:26:30 +00:00
|
|
|
optdatalen = sizeof(int);
|
|
|
|
break;
|
|
|
|
case IPV6_HOPOPTS:
|
2018-03-22 23:34:48 +00:00
|
|
|
GET_PKTOPT_EXT_HDR(ip6po_hbh);
|
2003-10-24 18:26:30 +00:00
|
|
|
break;
|
|
|
|
case IPV6_RTHDR:
|
2018-03-22 23:34:48 +00:00
|
|
|
GET_PKTOPT_EXT_HDR(ip6po_rthdr);
|
2003-10-24 18:26:30 +00:00
|
|
|
break;
|
|
|
|
case IPV6_RTHDRDSTOPTS:
|
2018-03-22 23:34:48 +00:00
|
|
|
GET_PKTOPT_EXT_HDR(ip6po_dest1);
|
2003-10-24 18:26:30 +00:00
|
|
|
break;
|
|
|
|
case IPV6_DSTOPTS:
|
2018-03-22 23:34:48 +00:00
|
|
|
GET_PKTOPT_EXT_HDR(ip6po_dest2);
|
2003-10-24 18:26:30 +00:00
|
|
|
break;
|
|
|
|
case IPV6_NEXTHOP:
|
2018-03-22 23:34:48 +00:00
|
|
|
GET_PKTOPT_SOCKADDR(ip6po_nexthop);
|
2003-10-24 18:26:30 +00:00
|
|
|
break;
|
|
|
|
case IPV6_USE_MIN_MTU:
|
|
|
|
if (pktopt)
|
2018-03-22 23:34:48 +00:00
|
|
|
defminmtu = pktopt->ip6po_minmtu;
|
|
|
|
optdata = (void *)&defminmtu;
|
2003-10-24 18:26:30 +00:00
|
|
|
optdatalen = sizeof(int);
|
|
|
|
break;
|
|
|
|
case IPV6_DONTFRAG:
|
|
|
|
if (pktopt && ((pktopt->ip6po_flags) & IP6PO_DONTFRAG))
|
|
|
|
on = 1;
|
|
|
|
else
|
|
|
|
on = 0;
|
|
|
|
optdata = (void *)&on;
|
|
|
|
optdatalen = sizeof(on);
|
|
|
|
break;
|
|
|
|
case IPV6_PREFER_TEMPADDR:
|
|
|
|
if (pktopt)
|
2018-03-22 23:34:48 +00:00
|
|
|
defpreftemp = pktopt->ip6po_prefer_tempaddr;
|
|
|
|
optdata = (void *)&defpreftemp;
|
2003-10-24 18:26:30 +00:00
|
|
|
optdatalen = sizeof(int);
|
|
|
|
break;
|
|
|
|
default: /* should not happen */
|
|
|
|
#ifdef DIAGNOSTIC
|
|
|
|
panic("ip6_getpcbopt: unexpected option\n");
|
|
|
|
#endif
|
2019-08-02 07:41:36 +00:00
|
|
|
INP_RUNLOCK(inp);
|
2003-10-24 18:26:30 +00:00
|
|
|
return (ENOPROTOOPT);
|
|
|
|
}
|
2019-08-02 07:41:36 +00:00
|
|
|
INP_RUNLOCK(inp);
|
2003-10-24 18:26:30 +00:00
|
|
|
|
|
|
|
error = sooptcopyout(sopt, optdata, optdatalen);
|
2018-03-22 23:34:48 +00:00
|
|
|
if (malloc_optdata)
|
|
|
|
free(optdata, M_TEMP);
|
2003-10-24 18:26:30 +00:00
|
|
|
|
|
|
|
return (error);
|
2001-06-11 12:39:29 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
void
|
2007-07-05 16:23:49 +00:00
|
|
|
ip6_clearpktopts(struct ip6_pktopts *pktopt, int optname)
|
2001-06-11 12:39:29 +00:00
|
|
|
{
|
2003-11-24 01:53:36 +00:00
|
|
|
if (pktopt == NULL)
|
|
|
|
return;
|
|
|
|
|
2003-10-24 18:26:30 +00:00
|
|
|
if (optname == -1 || optname == IPV6_PKTINFO) {
|
2005-07-21 16:39:23 +00:00
|
|
|
if (pktopt->ip6po_pktinfo)
|
2001-06-11 12:39:29 +00:00
|
|
|
free(pktopt->ip6po_pktinfo, M_IP6OPT);
|
|
|
|
pktopt->ip6po_pktinfo = NULL;
|
|
|
|
}
|
2003-10-24 18:26:30 +00:00
|
|
|
if (optname == -1 || optname == IPV6_HOPLIMIT)
|
2001-06-11 12:39:29 +00:00
|
|
|
pktopt->ip6po_hlim = -1;
|
2003-10-24 18:26:30 +00:00
|
|
|
if (optname == -1 || optname == IPV6_TCLASS)
|
|
|
|
pktopt->ip6po_tclass = -1;
|
|
|
|
if (optname == -1 || optname == IPV6_NEXTHOP) {
|
|
|
|
if (pktopt->ip6po_nextroute.ro_rt) {
|
|
|
|
RTFREE(pktopt->ip6po_nextroute.ro_rt);
|
|
|
|
pktopt->ip6po_nextroute.ro_rt = NULL;
|
|
|
|
}
|
2005-07-21 16:39:23 +00:00
|
|
|
if (pktopt->ip6po_nexthop)
|
2001-06-11 12:39:29 +00:00
|
|
|
free(pktopt->ip6po_nexthop, M_IP6OPT);
|
|
|
|
pktopt->ip6po_nexthop = NULL;
|
|
|
|
}
|
2003-10-24 18:26:30 +00:00
|
|
|
if (optname == -1 || optname == IPV6_HOPOPTS) {
|
2005-07-21 16:39:23 +00:00
|
|
|
if (pktopt->ip6po_hbh)
|
2001-06-11 12:39:29 +00:00
|
|
|
free(pktopt->ip6po_hbh, M_IP6OPT);
|
|
|
|
pktopt->ip6po_hbh = NULL;
|
|
|
|
}
|
2003-10-24 18:26:30 +00:00
|
|
|
if (optname == -1 || optname == IPV6_RTHDRDSTOPTS) {
|
2005-07-21 16:39:23 +00:00
|
|
|
if (pktopt->ip6po_dest1)
|
2001-06-11 12:39:29 +00:00
|
|
|
free(pktopt->ip6po_dest1, M_IP6OPT);
|
|
|
|
pktopt->ip6po_dest1 = NULL;
|
|
|
|
}
|
2003-10-24 18:26:30 +00:00
|
|
|
if (optname == -1 || optname == IPV6_RTHDR) {
|
2005-07-21 16:39:23 +00:00
|
|
|
if (pktopt->ip6po_rhinfo.ip6po_rhi_rthdr)
|
2001-06-11 12:39:29 +00:00
|
|
|
free(pktopt->ip6po_rhinfo.ip6po_rhi_rthdr, M_IP6OPT);
|
|
|
|
pktopt->ip6po_rhinfo.ip6po_rhi_rthdr = NULL;
|
|
|
|
if (pktopt->ip6po_route.ro_rt) {
|
|
|
|
RTFREE(pktopt->ip6po_route.ro_rt);
|
|
|
|
pktopt->ip6po_route.ro_rt = NULL;
|
|
|
|
}
|
|
|
|
}
|
2003-10-24 18:26:30 +00:00
|
|
|
if (optname == -1 || optname == IPV6_DSTOPTS) {
|
2005-07-21 16:39:23 +00:00
|
|
|
if (pktopt->ip6po_dest2)
|
2001-06-11 12:39:29 +00:00
|
|
|
free(pktopt->ip6po_dest2, M_IP6OPT);
|
|
|
|
pktopt->ip6po_dest2 = NULL;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
#define PKTOPT_EXTHDRCPY(type) \
|
|
|
|
do {\
|
|
|
|
if (src->type) {\
|
2003-10-08 18:26:08 +00:00
|
|
|
int hlen = (((struct ip6_ext *)src->type)->ip6e_len + 1) << 3;\
|
2001-06-11 12:39:29 +00:00
|
|
|
dst->type = malloc(hlen, M_IP6OPT, canwait);\
|
2017-05-30 14:50:28 +00:00
|
|
|
if (dst->type == NULL)\
|
2001-06-11 12:39:29 +00:00
|
|
|
goto bad;\
|
|
|
|
bcopy(src->type, dst->type, hlen);\
|
|
|
|
}\
|
2003-10-08 18:26:08 +00:00
|
|
|
} while (/*CONSTCOND*/ 0)
|
2001-06-11 12:39:29 +00:00
|
|
|
|
2005-07-21 16:39:23 +00:00
|
|
|
static int
|
2007-07-05 16:23:49 +00:00
|
|
|
copypktopts(struct ip6_pktopts *dst, struct ip6_pktopts *src, int canwait)
|
2001-06-11 12:39:29 +00:00
|
|
|
{
|
2005-07-21 16:39:23 +00:00
|
|
|
if (dst == NULL || src == NULL) {
|
2001-06-11 12:39:29 +00:00
|
|
|
printf("copypktopts: invalid argument\n");
|
2005-07-21 16:39:23 +00:00
|
|
|
return (EINVAL);
|
2001-06-11 12:39:29 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
dst->ip6po_hlim = src->ip6po_hlim;
|
2003-10-24 18:26:30 +00:00
|
|
|
dst->ip6po_tclass = src->ip6po_tclass;
|
|
|
|
dst->ip6po_flags = src->ip6po_flags;
|
2011-09-20 00:29:17 +00:00
|
|
|
dst->ip6po_minmtu = src->ip6po_minmtu;
|
|
|
|
dst->ip6po_prefer_tempaddr = src->ip6po_prefer_tempaddr;
|
2001-06-11 12:39:29 +00:00
|
|
|
if (src->ip6po_pktinfo) {
|
|
|
|
dst->ip6po_pktinfo = malloc(sizeof(*dst->ip6po_pktinfo),
|
2003-10-08 18:26:08 +00:00
|
|
|
M_IP6OPT, canwait);
|
2007-07-01 11:41:27 +00:00
|
|
|
if (dst->ip6po_pktinfo == NULL)
|
2001-06-11 12:39:29 +00:00
|
|
|
goto bad;
|
|
|
|
*dst->ip6po_pktinfo = *src->ip6po_pktinfo;
|
|
|
|
}
|
|
|
|
if (src->ip6po_nexthop) {
|
|
|
|
dst->ip6po_nexthop = malloc(src->ip6po_nexthop->sa_len,
|
2003-10-08 18:26:08 +00:00
|
|
|
M_IP6OPT, canwait);
|
2005-05-15 02:28:30 +00:00
|
|
|
if (dst->ip6po_nexthop == NULL)
|
2001-06-11 12:39:29 +00:00
|
|
|
goto bad;
|
|
|
|
bcopy(src->ip6po_nexthop, dst->ip6po_nexthop,
|
2003-10-08 18:26:08 +00:00
|
|
|
src->ip6po_nexthop->sa_len);
|
2001-06-11 12:39:29 +00:00
|
|
|
}
|
|
|
|
PKTOPT_EXTHDRCPY(ip6po_hbh);
|
|
|
|
PKTOPT_EXTHDRCPY(ip6po_dest1);
|
|
|
|
PKTOPT_EXTHDRCPY(ip6po_dest2);
|
|
|
|
PKTOPT_EXTHDRCPY(ip6po_rthdr); /* the cached route is not copied */
|
2005-07-21 16:39:23 +00:00
|
|
|
return (0);
|
2001-06-11 12:39:29 +00:00
|
|
|
|
|
|
|
bad:
|
2007-11-21 16:01:42 +00:00
|
|
|
ip6_clearpktopts(dst, -1);
|
2005-07-21 16:39:23 +00:00
|
|
|
return (ENOBUFS);
|
2001-06-11 12:39:29 +00:00
|
|
|
}
|
|
|
|
#undef PKTOPT_EXTHDRCPY
|
|
|
|
|
2005-07-21 16:39:23 +00:00
|
|
|
struct ip6_pktopts *
|
2007-07-05 16:23:49 +00:00
|
|
|
ip6_copypktopts(struct ip6_pktopts *src, int canwait)
|
2005-07-21 16:39:23 +00:00
|
|
|
{
|
|
|
|
int error;
|
|
|
|
struct ip6_pktopts *dst;
|
|
|
|
|
|
|
|
dst = malloc(sizeof(*dst), M_IP6OPT, canwait);
|
2007-07-01 11:41:27 +00:00
|
|
|
if (dst == NULL)
|
2005-07-21 16:39:23 +00:00
|
|
|
return (NULL);
|
|
|
|
ip6_initpktopts(dst);
|
|
|
|
|
|
|
|
if ((error = copypktopts(dst, src, canwait)) != 0) {
|
|
|
|
free(dst, M_IP6OPT);
|
|
|
|
return (NULL);
|
|
|
|
}
|
|
|
|
|
|
|
|
return (dst);
|
|
|
|
}
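copypktopts() above copies scalar fields by value, duplicates each heap-allocated member, and jumps to a single `bad:` label that clears the partial copy on allocation failure. The same pattern is sketched below in userland C under simplifying assumptions: `struct demo_opts`, `demo_clear` and `demo_copy` are hypothetical names, plain `malloc`/`free` stand in for the kernel allocator, and blob lengths are assumed to be non-zero.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical, much-reduced stand-in for struct ip6_pktopts. */
struct demo_opts {
	int	 hlim;		/* scalar option, copied by value */
	char	*hbh;		/* heap-allocated blob, may be NULL */
	size_t	 hbh_len;
	char	*dest;		/* heap-allocated blob, may be NULL */
	size_t	 dest_len;
};

/* Release every owned blob and leave the struct reusable. */
static void
demo_clear(struct demo_opts *o)
{
	free(o->hbh);
	o->hbh = NULL;
	o->hbh_len = 0;
	free(o->dest);
	o->dest = NULL;
	o->dest_len = 0;
}

/*
 * Deep-copy src into dst.  dst must start out zeroed.  Returns 0 on
 * success and -1 on allocation failure, in which case dst owns nothing.
 */
static int
demo_copy(struct demo_opts *dst, const struct demo_opts *src)
{
	dst->hlim = src->hlim;
	if (src->hbh != NULL) {
		dst->hbh = malloc(src->hbh_len);
		if (dst->hbh == NULL)
			goto bad;
		memcpy(dst->hbh, src->hbh, src->hbh_len);
		dst->hbh_len = src->hbh_len;
	}
	if (src->dest != NULL) {
		dst->dest = malloc(src->dest_len);
		if (dst->dest == NULL)
			goto bad;
		memcpy(dst->dest, src->dest, src->dest_len);
		dst->dest_len = src->dest_len;
	}
	return (0);

bad:
	demo_clear(dst);	/* undo whatever was already duplicated */
	return (-1);
}
```

Routing every failure through one cleanup label keeps each allocation site short and means the caller never sees a half-copied structure.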
|
|
|
|
|
2001-06-11 12:39:29 +00:00
|
|
|
void
|
2007-07-05 16:23:49 +00:00
|
|
|
ip6_freepcbopts(struct ip6_pktopts *pktopt)
|
2001-06-11 12:39:29 +00:00
|
|
|
{
|
|
|
|
if (pktopt == NULL)
|
|
|
|
return;
|
|
|
|
|
2003-10-24 18:26:30 +00:00
|
|
|
ip6_clearpktopts(pktopt, -1);
|
2001-06-11 12:39:29 +00:00
|
|
|
|
|
|
|
free(pktopt, M_IP6OPT);
|
|
|
|
}
|
|
|
|
|
1999-11-22 02:45:11 +00:00
|
|
|
/*
|
|
|
|
* Set IPv6 outgoing packet options based on advanced API.
|
|
|
|
*/
|
|
|
|
int
|
2007-07-05 16:23:49 +00:00
|
|
|
ip6_setpktopts(struct mbuf *control, struct ip6_pktopts *opt,
|
2008-01-24 08:25:59 +00:00
|
|
|
struct ip6_pktopts *stickyopt, struct ucred *cred, int uproto)
|
1999-11-22 02:45:11 +00:00
|
|
|
{
|
2016-04-15 17:30:33 +00:00
|
|
|
struct cmsghdr *cm = NULL;
|
1999-11-22 02:45:11 +00:00
|
|
|
|
2005-07-21 14:57:53 +00:00
|
|
|
if (control == NULL || opt == NULL)
|
2003-10-06 14:02:09 +00:00
|
|
|
return (EINVAL);
|
1999-11-22 02:45:11 +00:00
|
|
|
|
2005-07-21 16:39:23 +00:00
|
|
|
ip6_initpktopts(opt);
|
2003-10-24 18:26:30 +00:00
|
|
|
if (stickyopt) {
|
2005-07-21 16:39:23 +00:00
|
|
|
int error;
|
|
|
|
|
2003-10-24 18:26:30 +00:00
|
|
|
/*
|
|
|
|
* If stickyopt is provided, make a local copy of the options
|
|
|
|
* for this particular packet, then override them by ancillary
|
|
|
|
* objects.
|
2005-07-21 16:39:23 +00:00
|
|
|
* XXX: copypktopts() does not copy the cached route to a next
|
|
|
|
* hop (if any). This is not very good in terms of efficiency,
|
|
|
|
* but we can allow this since this option should be rarely
|
|
|
|
* used.
|
2003-10-24 18:26:30 +00:00
|
|
|
*/
|
2005-07-21 16:39:23 +00:00
|
|
|
if ((error = copypktopts(opt, stickyopt, M_NOWAIT)) != 0)
|
|
|
|
return (error);
|
|
|
|
}
|
1999-11-22 02:45:11 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* XXX: Currently, we assume all the optional information is stored
|
|
|
|
* in a single mbuf.
|
|
|
|
*/
|
|
|
|
if (control->m_next)
|
2003-10-06 14:02:09 +00:00
|
|
|
return (EINVAL);
|
1999-11-22 02:45:11 +00:00
|
|
|
|
2008-10-15 19:24:18 +00:00
|
|
|
for (; control->m_len > 0; control->m_data += CMSG_ALIGN(cm->cmsg_len),
|
2003-10-08 18:26:08 +00:00
|
|
|
control->m_len -= CMSG_ALIGN(cm->cmsg_len)) {
|
2003-10-24 18:26:30 +00:00
|
|
|
int error;
|
|
|
|
|
|
|
|
if (control->m_len < CMSG_LEN(0))
|
|
|
|
return (EINVAL);
|
|
|
|
|
1999-11-22 02:45:11 +00:00
|
|
|
cm = mtod(control, struct cmsghdr *);
|
|
|
|
if (cm->cmsg_len == 0 || cm->cmsg_len > control->m_len)
|
2003-10-06 14:02:09 +00:00
|
|
|
return (EINVAL);
|
1999-11-22 02:45:11 +00:00
|
|
|
if (cm->cmsg_level != IPPROTO_IPV6)
|
|
|
|
continue;
|
|
|
|
|
2005-07-21 15:06:32 +00:00
|
|
|
error = ip6_setpktopt(cm->cmsg_type, CMSG_DATA(cm),
|
2008-01-24 08:25:59 +00:00
|
|
|
cm->cmsg_len - CMSG_LEN(0), opt, cred, 0, 1, uproto);
|
2003-10-24 18:26:30 +00:00
|
|
|
if (error)
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
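ip6_setpktopts() above walks the control mbuf by hand, advancing by CMSG_ALIGN(cm->cmsg_len) and skipping any item whose level is not IPPROTO_IPV6. For comparison, a userland sketch of building and scanning the same kind of ancillary data with the standard CMSG_* macros is shown below. The helper names (`hlim_ctl`, `fill_hoplimit_cmsg`, `find_hoplimit_cmsg`) are hypothetical, and the example assumes a platform whose `<netinet/in.h>` exposes IPV6_HOPLIMIT by default.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Control buffer sized for one int-valued cmsg, aligned for cmsghdr. */
union hlim_ctl {
	struct cmsghdr	hdr;
	unsigned char	buf[CMSG_SPACE(sizeof(int))];
};

/* Fill msg/ctl with a single IPPROTO_IPV6/IPV6_HOPLIMIT item. */
static void
fill_hoplimit_cmsg(struct msghdr *msg, union hlim_ctl *ctl, int hlim)
{
	struct cmsghdr *cm;

	memset(msg, 0, sizeof(*msg));
	memset(ctl, 0, sizeof(*ctl));
	msg->msg_control = ctl->buf;
	msg->msg_controllen = sizeof(ctl->buf);

	cm = CMSG_FIRSTHDR(msg);
	cm->cmsg_level = IPPROTO_IPV6;
	cm->cmsg_type = IPV6_HOPLIMIT;
	cm->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cm), &hlim, sizeof(hlim));
}

/*
 * Scan the control data, ignoring non-IPv6 items as the kernel loop
 * does.  Returns 1 and stores the value if a hop limit item is found,
 * 0 otherwise.
 */
static int
find_hoplimit_cmsg(struct msghdr *msg, int *hlim)
{
	struct cmsghdr *cm;

	for (cm = CMSG_FIRSTHDR(msg); cm != NULL; cm = CMSG_NXTHDR(msg, cm)) {
		if (cm->cmsg_level != IPPROTO_IPV6)
			continue;
		if (cm->cmsg_type == IPV6_HOPLIMIT &&
		    cm->cmsg_len == CMSG_LEN(sizeof(int))) {
			memcpy(hlim, CMSG_DATA(cm), sizeof(*hlim));
			return (1);
		}
	}
	return (0);
}
```

A message prepared with fill_hoplimit_cmsg() could be handed to sendmsg(2) on an IPv6 socket, at which point the kernel side would see one cmsg of payload length sizeof(int).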
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Set a particular packet option, as a sticky option or an ancillary data
|
|
|
|
* item. "len" can be 0 only when it's a sticky option.
|
|
|
|
* There are four possible combinations of "sticky" and "cmsg":
|
|
|
|
* "sticky=0, cmsg=0": impossible
|
2005-07-20 08:59:45 +00:00
|
|
|
* "sticky=0, cmsg=1": RFC2292 or RFC3542 ancillary data
|
|
|
|
* "sticky=1, cmsg=0": RFC3542 socket option
|
2003-10-24 18:26:30 +00:00
|
|
|
* "sticky=1, cmsg=1": RFC2292 socket option
|
|
|
|
*/
|
|
|
|
static int
|
2007-07-05 16:23:49 +00:00
|
|
|
ip6_setpktopt(int optname, u_char *buf, int len, struct ip6_pktopts *opt,
|
2008-01-24 08:25:59 +00:00
|
|
|
struct ucred *cred, int sticky, int cmsg, int uproto)
|
2003-10-24 18:26:30 +00:00
|
|
|
{
|
|
|
|
int minmtupolicy, preftemp;
|
2008-01-24 08:25:59 +00:00
|
|
|
int error;
|
2003-10-24 18:26:30 +00:00
|
|
|
|
|
|
|
if (!sticky && !cmsg) {
|
|
|
|
#ifdef DIAGNOSTIC
|
2005-07-21 15:06:32 +00:00
|
|
|
printf("ip6_setpktopt: impossible case\n");
|
2003-10-24 18:26:30 +00:00
|
|
|
#endif
|
|
|
|
return (EINVAL);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* IPV6_2292xxx is for backward compatibility to RFC2292, and should
|
2005-07-20 08:59:45 +00:00
|
|
|
* not be specified in the context of RFC3542. Conversely,
|
|
|
|
* RFC3542 types should not be specified in the context of RFC2292.
|
2003-10-24 18:26:30 +00:00
|
|
|
*/
|
|
|
|
if (!cmsg) {
|
|
|
|
switch (optname) {
|
|
|
|
case IPV6_2292PKTINFO:
|
|
|
|
case IPV6_2292HOPLIMIT:
|
|
|
|
case IPV6_2292NEXTHOP:
|
|
|
|
case IPV6_2292HOPOPTS:
|
|
|
|
case IPV6_2292DSTOPTS:
|
|
|
|
case IPV6_2292RTHDR:
|
|
|
|
case IPV6_2292PKTOPTIONS:
|
|
|
|
return (ENOPROTOOPT);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (sticky && cmsg) {
|
|
|
|
switch (optname) {
|
|
|
|
case IPV6_PKTINFO:
|
|
|
|
case IPV6_HOPLIMIT:
|
|
|
|
case IPV6_NEXTHOP:
|
|
|
|
case IPV6_HOPOPTS:
|
|
|
|
case IPV6_DSTOPTS:
|
|
|
|
case IPV6_RTHDRDSTOPTS:
|
|
|
|
case IPV6_RTHDR:
|
|
|
|
case IPV6_USE_MIN_MTU:
|
|
|
|
case IPV6_DONTFRAG:
|
|
|
|
case IPV6_TCLASS:
|
2005-07-20 08:59:45 +00:00
|
|
|
case IPV6_PREFER_TEMPADDR: /* XXX: not an RFC3542 option */
|
2003-10-24 18:26:30 +00:00
|
|
|
return (ENOPROTOOPT);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
switch (optname) {
|
|
|
|
case IPV6_2292PKTINFO:
|
|
|
|
case IPV6_PKTINFO:
|
|
|
|
{
|
|
|
|
struct ifnet *ifp = NULL;
|
|
|
|
struct in6_pktinfo *pktinfo;
|
|
|
|
|
|
|
|
if (len != sizeof(struct in6_pktinfo))
|
|
|
|
return (EINVAL);
|
|
|
|
|
|
|
|
pktinfo = (struct in6_pktinfo *)buf;
|
|
|
|
|
2001-06-11 12:39:29 +00:00
|
|
|
/*
|
2003-10-24 18:26:30 +00:00
|
|
|
* An application can clear any sticky IPV6_PKTINFO option by
|
|
|
|
* doing a "regular" setsockopt with ipi6_addr being
|
|
|
|
* in6addr_any and ipi6_ifindex being zero.
|
|
|
|
* [RFC 3542, Section 6]
|
2001-06-11 12:39:29 +00:00
|
|
|
*/
|
2003-10-24 18:26:30 +00:00
|
|
|
if (optname == IPV6_PKTINFO && opt->ip6po_pktinfo &&
|
|
|
|
pktinfo->ipi6_ifindex == 0 &&
|
|
|
|
IN6_IS_ADDR_UNSPECIFIED(&pktinfo->ipi6_addr)) {
|
|
|
|
ip6_clearpktopts(opt, optname);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (uproto == IPPROTO_TCP && optname == IPV6_PKTINFO &&
|
|
|
|
sticky && !IN6_IS_ADDR_UNSPECIFIED(&pktinfo->ipi6_addr)) {
|
|
|
|
return (EINVAL);
|
|
|
|
}
|
2014-09-10 14:32:07 +00:00
|
|
|
if (IN6_IS_ADDR_MULTICAST(&pktinfo->ipi6_addr))
|
|
|
|
return (EINVAL);
|
2003-10-24 18:26:30 +00:00
|
|
|
/* validate the interface index if specified. */
|
2013-10-15 10:12:19 +00:00
|
|
|
if (pktinfo->ipi6_ifindex > V_if_index)
|
2003-10-24 18:26:30 +00:00
|
|
|
return (ENXIO);
|
|
|
|
if (pktinfo->ipi6_ifindex) {
|
|
|
|
ifp = ifnet_byindex(pktinfo->ipi6_ifindex);
|
|
|
|
if (ifp == NULL)
|
2003-10-06 14:02:09 +00:00
|
|
|
return (ENXIO);
|
2003-10-24 18:26:30 +00:00
|
|
|
}
|
2016-07-13 19:41:19 +00:00
|
|
|
if (ifp != NULL && (ifp->if_afdata[AF_INET6] == NULL ||
|
|
|
|
(ND_IFINFO(ifp)->flags & ND6_IFF_IFDISABLED) != 0))
|
2014-09-10 14:32:07 +00:00
|
|
|
return (ENETDOWN);
|
|
|
|
|
|
|
|
if (ifp != NULL &&
|
|
|
|
!IN6_IS_ADDR_UNSPECIFIED(&pktinfo->ipi6_addr)) {
|
|
|
|
struct in6_ifaddr *ia;
|
|
|
|
|
2015-07-03 19:01:38 +00:00
|
|
|
in6_setscope(&pktinfo->ipi6_addr, ifp, NULL);
|
2014-09-10 14:32:07 +00:00
|
|
|
ia = in6ifa_ifpwithaddr(ifp, &pktinfo->ipi6_addr);
|
|
|
|
if (ia == NULL)
|
|
|
|
return (EADDRNOTAVAIL);
|
|
|
|
ifa_free(&ia->ia_ifa);
|
|
|
|
}
|
2003-10-24 18:26:30 +00:00
|
|
|
/*
|
|
|
|
* We store the address anyway, and let in6_selectsrc()
|
|
|
|
* validate the specified address. This is because ipi6_addr
|
|
|
|
* may not have enough information about its scope zone, and
|
|
|
|
* we may need additional information (such as outgoing
|
|
|
|
* interface or the scope zone of a destination address) to
|
|
|
|
* disambiguate the scope.
|
|
|
|
* XXX: the delay of the validation may confuse the
|
|
|
|
* application when it is used as a sticky option.
|
|
|
|
*/
|
2005-07-21 16:39:23 +00:00
|
|
|
if (opt->ip6po_pktinfo == NULL) {
|
|
|
|
opt->ip6po_pktinfo = malloc(sizeof(*pktinfo),
|
|
|
|
M_IP6OPT, M_NOWAIT);
|
|
|
|
if (opt->ip6po_pktinfo == NULL)
|
|
|
|
return (ENOBUFS);
|
|
|
|
}
|
|
|
|
bcopy(pktinfo, opt->ip6po_pktinfo, sizeof(*pktinfo));
|
2003-10-24 18:26:30 +00:00
|
|
|
break;
|
|
|
|
}
|
1999-11-22 02:45:11 +00:00
|
|
|
|
2003-10-24 18:26:30 +00:00
|
|
|
case IPV6_2292HOPLIMIT:
|
|
|
|
case IPV6_HOPLIMIT:
|
|
|
|
{
|
|
|
|
int *hlimp;
|
1999-11-22 02:45:11 +00:00
|
|
|
|
2003-10-24 18:26:30 +00:00
|
|
|
/*
|
|
|
|
* RFC 3542 deprecated the usage of sticky IPV6_HOPLIMIT
|
|
|
|
* to simplify the ordering among hoplimit options.
|
|
|
|
*/
|
|
|
|
if (optname == IPV6_HOPLIMIT && sticky)
|
|
|
|
return (ENOPROTOOPT);
|
1999-11-22 02:45:11 +00:00
|
|
|
|
2003-10-24 18:26:30 +00:00
|
|
|
if (len != sizeof(int))
|
|
|
|
return (EINVAL);
|
|
|
|
hlimp = (int *)buf;
|
|
|
|
if (*hlimp < -1 || *hlimp > 255)
|
|
|
|
return (EINVAL);
|
2001-06-11 12:39:29 +00:00
|
|
|
|
2003-10-24 18:26:30 +00:00
|
|
|
opt->ip6po_hlim = *hlimp;
|
|
|
|
break;
|
|
|
|
}
|
1999-11-22 02:45:11 +00:00
|
|
|
|
2003-10-24 18:26:30 +00:00
|
|
|
case IPV6_TCLASS:
|
|
|
|
{
|
|
|
|
int tclass;
|
1999-11-22 02:45:11 +00:00
|
|
|
|
2003-10-24 18:26:30 +00:00
|
|
|
if (len != sizeof(int))
|
|
|
|
return (EINVAL);
|
|
|
|
tclass = *(int *)buf;
|
|
|
|
if (tclass < -1 || tclass > 255)
|
|
|
|
return (EINVAL);
|
2001-06-11 12:39:29 +00:00
|
|
|
|
2003-10-24 18:26:30 +00:00
|
|
|
opt->ip6po_tclass = tclass;
|
|
|
|
break;
|
|
|
|
}
|
2001-06-11 12:39:29 +00:00
|
|
|
|
2003-10-24 18:26:30 +00:00
|
|
|
case IPV6_2292NEXTHOP:
|
|
|
|
case IPV6_NEXTHOP:
|
2008-01-24 08:25:59 +00:00
|
|
|
if (cred != NULL) {
|
2018-12-11 19:32:16 +00:00
|
|
|
error = priv_check_cred(cred, PRIV_NETINET_SETHDROPTS);
|
2008-01-24 08:25:59 +00:00
|
|
|
if (error)
|
|
|
|
return (error);
|
|
|
|
}
|
2003-10-24 18:26:30 +00:00
|
|
|
|
|
|
|
if (len == 0) { /* just remove the option */
|
|
|
|
ip6_clearpktopts(opt, IPV6_NEXTHOP);
|
1999-11-22 02:45:11 +00:00
|
|
|
break;
|
2001-06-11 12:39:29 +00:00
|
|
|
}
|
1999-11-22 02:45:11 +00:00
|
|
|
|
2003-10-24 18:26:30 +00:00
|
|
|
/* check if cmsg_len is large enough for sa_len */
|
|
|
|
if (len < sizeof(struct sockaddr) || len < *buf)
|
|
|
|
return (EINVAL);
|
|
|
|
|
|
|
|
switch (((struct sockaddr *)buf)->sa_family) {
|
|
|
|
case AF_INET6:
|
2001-06-11 12:39:29 +00:00
|
|
|
{
|
2003-10-24 18:26:30 +00:00
|
|
|
struct sockaddr_in6 *sa6 = (struct sockaddr_in6 *)buf;
|
|
|
|
int error;
|
2001-06-11 12:39:29 +00:00
|
|
|
|
2003-10-24 18:26:30 +00:00
|
|
|
if (sa6->sin6_len != sizeof(struct sockaddr_in6))
|
2003-10-06 14:02:09 +00:00
|
|
|
return (EINVAL);
|
2003-10-24 18:26:30 +00:00
|
|
|
|
|
|
|
if (IN6_IS_ADDR_UNSPECIFIED(&sa6->sin6_addr) ||
|
|
|
|
IN6_IS_ADDR_MULTICAST(&sa6->sin6_addr)) {
|
2003-10-06 14:02:09 +00:00
|
|
|
return (EINVAL);
|
2003-10-24 18:26:30 +00:00
|
|
|
}
|
2008-08-17 23:27:27 +00:00
|
|
|
if ((error = sa6_embedscope(sa6, V_ip6_use_defzone))
|
2003-10-24 18:26:30 +00:00
|
|
|
!= 0) {
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
case AF_LINK: /* should eventually be supported */
|
|
|
|
default:
|
|
|
|
return (EAFNOSUPPORT);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* turn off the previous option, then set the new option. */
|
|
|
|
ip6_clearpktopts(opt, IPV6_NEXTHOP);
|
2005-10-21 16:23:01 +00:00
|
|
|
opt->ip6po_nexthop = malloc(*buf, M_IP6OPT, M_NOWAIT);
|
|
|
|
if (opt->ip6po_nexthop == NULL)
|
|
|
|
return (ENOBUFS);
|
2005-07-21 16:39:23 +00:00
|
|
|
bcopy(buf, opt->ip6po_nexthop, *buf);
|
2003-10-24 18:26:30 +00:00
|
|
|
break;
|
|
|
|
|
|
|
|
case IPV6_2292HOPOPTS:
|
|
|
|
case IPV6_HOPOPTS:
|
|
|
|
{
|
|
|
|
struct ip6_hbh *hbh;
|
|
|
|
int hbhlen;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* XXX: We don't allow a non-privileged user to set ANY HbH
|
|
|
|
* options, since per-option restriction has too much
|
|
|
|
* overhead.
|
|
|
|
*/
|
2008-01-24 08:25:59 +00:00
|
|
|
if (cred != NULL) {
|
2018-12-11 19:32:16 +00:00
|
|
|
error = priv_check_cred(cred, PRIV_NETINET_SETHDROPTS);
|
2008-01-24 08:25:59 +00:00
|
|
|
if (error)
|
|
|
|
return (error);
|
|
|
|
}
|
2003-10-24 18:26:30 +00:00
|
|
|
|
|
|
|
if (len == 0) {
|
|
|
|
ip6_clearpktopts(opt, IPV6_HOPOPTS);
|
|
|
|
break; /* just remove the option */
|
|
|
|
}
|
|
|
|
|
|
|
|
/* message length validation */
|
|
|
|
if (len < sizeof(struct ip6_hbh))
|
|
|
|
return (EINVAL);
|
|
|
|
hbh = (struct ip6_hbh *)buf;
|
|
|
|
hbhlen = (hbh->ip6h_len + 1) << 3;
|
|
|
|
if (len != hbhlen)
|
|
|
|
return (EINVAL);
|
|
|
|
|
|
|
|
/* turn off the previous option, then set the new option. */
|
|
|
|
ip6_clearpktopts(opt, IPV6_HOPOPTS);
|
2005-10-21 16:23:01 +00:00
|
|
|
opt->ip6po_hbh = malloc(hbhlen, M_IP6OPT, M_NOWAIT);
|
|
|
|
if (opt->ip6po_hbh == NULL)
|
|
|
|
return (ENOBUFS);
|
2005-07-21 16:39:23 +00:00
|
|
|
bcopy(hbh, opt->ip6po_hbh, hbhlen);
|
2003-10-24 18:26:30 +00:00
|
|
|
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
case IPV6_2292DSTOPTS:
|
|
|
|
case IPV6_DSTOPTS:
|
|
|
|
case IPV6_RTHDRDSTOPTS:
|
|
|
|
{
|
|
|
|
struct ip6_dest *dest, **newdest = NULL;
|
|
|
|
int destlen;
|
|
|
|
|
2008-01-24 08:25:59 +00:00
|
|
|
if (cred != NULL) { /* XXX: see the comment for IPV6_HOPOPTS */
|
2018-12-11 19:32:16 +00:00
|
|
|
error = priv_check_cred(cred, PRIV_NETINET_SETHDROPTS);
|
2008-01-24 08:25:59 +00:00
|
|
|
if (error)
|
|
|
|
return (error);
|
|
|
|
}
|
2003-10-24 18:26:30 +00:00
|
|
|
|
|
|
|
if (len == 0) {
|
|
|
|
ip6_clearpktopts(opt, optname);
|
|
|
|
break; /* just remove the option */
|
|
|
|
}
|
1999-11-22 02:45:11 +00:00
|
|
|
|
2003-10-24 18:26:30 +00:00
|
|
|
/* message length validation */
|
|
|
|
if (len < sizeof(struct ip6_dest))
|
|
|
|
return (EINVAL);
|
|
|
|
dest = (struct ip6_dest *)buf;
|
|
|
|
destlen = (dest->ip6d_len + 1) << 3;
|
|
|
|
if (len != destlen)
|
|
|
|
return (EINVAL);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Determine the position that the destination options header
|
|
|
|
* should be inserted; before or after the routing header.
|
|
|
|
*/
|
|
|
|
switch (optname) {
|
|
|
|
case IPV6_2292DSTOPTS:
|
|
|
|
/*
|
|
|
|
* The old advanced API is ambiguous on this point.
|
|
|
|
* Our approach is to determine the position
|
|
|
|
* according to the existence of a routing header.
|
|
|
|
* Note, however, that this depends on the order of the
|
|
|
|
* extension headers in the ancillary data; the 1st
|
|
|
|
* part of the destination options header must appear
|
|
|
|
* before the routing header in the ancillary data,
|
|
|
|
* too.
|
2005-07-20 08:59:45 +00:00
|
|
|
* RFC3542 solved the ambiguity by introducing
|
2003-10-24 18:26:30 +00:00
|
|
|
* separate ancillary data or option types.
|
1999-11-22 02:45:11 +00:00
|
|
|
*/
|
2001-06-11 12:39:29 +00:00
|
|
|
if (opt->ip6po_rthdr == NULL)
|
|
|
|
newdest = &opt->ip6po_dest1;
|
|
|
|
else
|
|
|
|
newdest = &opt->ip6po_dest2;
|
2003-10-24 18:26:30 +00:00
|
|
|
break;
|
|
|
|
case IPV6_RTHDRDSTOPTS:
|
|
|
|
newdest = &opt->ip6po_dest1;
|
|
|
|
break;
|
|
|
|
case IPV6_DSTOPTS:
|
|
|
|
newdest = &opt->ip6po_dest2;
|
|
|
|
break;
|
|
|
|
}
|
2001-06-11 12:39:29 +00:00
|
|
|
|
2003-10-24 18:26:30 +00:00
|
|
|
/* turn off the previous option, then set the new option. */
|
|
|
|
ip6_clearpktopts(opt, optname);
|
2005-10-21 16:23:01 +00:00
|
|
|
*newdest = malloc(destlen, M_IP6OPT, M_NOWAIT);
|
2006-01-14 00:09:41 +00:00
|
|
|
if (*newdest == NULL)
|
2005-10-21 16:23:01 +00:00
|
|
|
return (ENOBUFS);
|
2005-07-21 16:39:23 +00:00
|
|
|
bcopy(dest, *newdest, destlen);
|
2001-06-11 12:39:29 +00:00
|
|
|
|
2003-10-24 18:26:30 +00:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
case IPV6_2292RTHDR:
|
|
|
|
case IPV6_RTHDR:
|
|
|
|
{
|
|
|
|
struct ip6_rthdr *rth;
|
|
|
|
int rthlen;
|
|
|
|
|
|
|
|
if (len == 0) {
|
|
|
|
ip6_clearpktopts(opt, IPV6_RTHDR);
|
|
|
|
break; /* just remove the option */
|
2001-06-11 12:39:29 +00:00
|
|
|
}
|
1999-11-22 02:45:11 +00:00
|
|
|
|
2003-10-24 18:26:30 +00:00
|
|
|
/* message length validation */
|
|
|
|
if (len < sizeof(struct ip6_rthdr))
|
|
|
|
return (EINVAL);
|
|
|
|
rth = (struct ip6_rthdr *)buf;
|
|
|
|
rthlen = (rth->ip6r_len + 1) << 3;
|
|
|
|
if (len != rthlen)
|
|
|
|
return (EINVAL);
|
2001-06-11 12:39:29 +00:00
|
|
|
|
2003-10-24 18:26:30 +00:00
|
|
|
switch (rth->ip6r_type) {
|
|
|
|
case IPV6_RTHDR_TYPE_0:
|
|
|
|
if (rth->ip6r_len == 0) /* must contain at least one addr */
|
2003-10-06 14:02:09 +00:00
|
|
|
return (EINVAL);
|
2003-10-24 18:26:30 +00:00
|
|
|
if (rth->ip6r_len % 2) /* length must be even */
|
2003-10-06 14:02:09 +00:00
|
|
|
return (EINVAL);
|
2003-10-24 18:26:30 +00:00
|
|
|
if (rth->ip6r_len / 2 != rth->ip6r_segleft)
|
|
|
|
return (EINVAL);
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
return (EINVAL); /* not supported */
|
|
|
|
}
|
2001-06-11 12:39:29 +00:00
|
|
|
|
2003-10-24 18:26:30 +00:00
|
|
|
/* turn off the previous option */
|
|
|
|
ip6_clearpktopts(opt, IPV6_RTHDR);
|
2005-10-21 16:23:01 +00:00
|
|
|
opt->ip6po_rthdr = malloc(rthlen, M_IP6OPT, M_NOWAIT);
|
|
|
|
if (opt->ip6po_rthdr == NULL)
|
|
|
|
return (ENOBUFS);
|
2005-07-21 16:39:23 +00:00
|
|
|
bcopy(rth, opt->ip6po_rthdr, rthlen);
|
2001-06-11 12:39:29 +00:00
|
|
|
|
2003-10-24 18:26:30 +00:00
|
|
|
break;
|
|
|
|
}
|
2001-06-11 12:39:29 +00:00
|
|
|
|
2003-10-24 18:26:30 +00:00
|
|
|
case IPV6_USE_MIN_MTU:
|
|
|
|
if (len != sizeof(int))
|
|
|
|
return (EINVAL);
|
|
|
|
minmtupolicy = *(int *)buf;
|
|
|
|
if (minmtupolicy != IP6PO_MINMTU_MCASTONLY &&
|
|
|
|
minmtupolicy != IP6PO_MINMTU_DISABLE &&
|
|
|
|
minmtupolicy != IP6PO_MINMTU_ALL) {
|
|
|
|
return (EINVAL);
|
2001-06-11 12:39:29 +00:00
|
|
|
}
|
2003-10-24 18:26:30 +00:00
|
|
|
opt->ip6po_minmtu = minmtupolicy;
|
|
|
|
break;
|
1999-11-22 02:45:11 +00:00
|
|
|
|
2003-10-24 18:26:30 +00:00
|
|
|
case IPV6_DONTFRAG:
|
|
|
|
if (len != sizeof(int))
|
|
|
|
return (EINVAL);
|
|
|
|
|
|
|
|
if (uproto == IPPROTO_TCP || *(int *)buf == 0) {
|
|
|
|
/*
|
|
|
|
* We ignore this option for TCP sockets.
|
2005-07-20 08:59:45 +00:00
|
|
|
* (RFC3542 leaves this case unspecified.)
|
2003-10-24 18:26:30 +00:00
|
|
|
*/
|
|
|
|
opt->ip6po_flags &= ~IP6PO_DONTFRAG;
|
|
|
|
} else
|
|
|
|
opt->ip6po_flags |= IP6PO_DONTFRAG;
|
|
|
|
break;
|
|
|
|
|
|
|
|
case IPV6_PREFER_TEMPADDR:
|
|
|
|
if (len != sizeof(int))
|
|
|
|
return (EINVAL);
|
|
|
|
preftemp = *(int *)buf;
|
|
|
|
if (preftemp != IP6PO_TEMPADDR_SYSTEM &&
|
|
|
|
preftemp != IP6PO_TEMPADDR_NOTPREFER &&
|
|
|
|
preftemp != IP6PO_TEMPADDR_PREFER) {
|
|
|
|
return (EINVAL);
|
1999-11-22 02:45:11 +00:00
|
|
|
}
|
2003-10-24 18:26:30 +00:00
|
|
|
opt->ip6po_prefer_tempaddr = preftemp;
|
|
|
|
break;
|
|
|
|
|
|
|
|
default:
|
|
|
|
return (ENOPROTOOPT);
|
|
|
|
} /* end of switch */
|
1999-11-22 02:45:11 +00:00
|
|
|
|
2003-10-06 14:02:09 +00:00
|
|
|
return (0);
|
1999-11-22 02:45:11 +00:00
|
|
|
}
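/*
 * Illustration only, not part of the kernel source: a minimal userland
 * sketch of the length and type-0 checks that the IPV6_RTHDR case above
 * applies before it stores a routing header.  The names rthdr_sketch,
 * RTHDR_TYPE_0_SKETCH and rthdr0_check are hypothetical, and the sketch
 * assumes the usual wire layout (next header, length in 8-byte units
 * minus one, routing type, segments left).  It returns 0 or -1 rather
 * than an errno value.
 */

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the first four bytes of a routing header. */
struct rthdr_sketch {
	uint8_t nxt;		/* next header */
	uint8_t len;		/* length in 8-byte units, first 8 bytes excluded */
	uint8_t type;		/* routing type */
	uint8_t segleft;	/* segments left */
};

#define RTHDR_TYPE_0_SKETCH	0

/* Return 0 if buf passes the checks sketched from the code above, else -1. */
static int
rthdr0_check(const uint8_t *buf, size_t buflen)
{
	struct rthdr_sketch rth;
	size_t rthlen;

	assert(buf != NULL);
	if (buflen < sizeof(rth))
		return (-1);
	rth.nxt = buf[0];
	rth.len = buf[1];
	rth.type = buf[2];
	rth.segleft = buf[3];

	/* The declared length must match the supplied length exactly. */
	rthlen = ((size_t)rth.len + 1) << 3;
	if (buflen != rthlen)
		return (-1);
	if (rth.type != RTHDR_TYPE_0_SKETCH)
		return (-1);		/* other types are not supported */
	if (rth.len == 0)
		return (-1);		/* must contain at least one address */
	if (rth.len % 2)
		return (-1);		/* each address takes two 8-byte units */
	if (rth.len / 2 != rth.segleft)
		return (-1);		/* every listed address is still to visit */
	return (0);
}
```

/*
 * Under these assumptions, a 24-byte buffer with len = 2 and segleft = 1
 * is accepted, while the same buffer with segleft = 2 is rejected.
 */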
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Routine called from ip6_output() to loop back a copy of an IP6 multicast
|
|
|
|
* packet to the input queue of a specified interface. Note that this
|
|
|
|
* calls the output routine of the loopback "driver", but with an interface
|
|
|
|
* pointer that might NOT be &loif -- easier than replicating that code here.
|
|
|
|
*/
|
|
|
|
void
|
2016-03-01 00:17:14 +00:00
|
|
|
ip6_mloopback(struct ifnet *ifp, struct mbuf *m)
|
1999-11-22 02:45:11 +00:00
|
|
|
{
|
2000-07-04 16:35:15 +00:00
|
|
|
struct mbuf *copym;
|
|
|
|
struct ip6_hdr *ip6;
|
1999-11-22 02:45:11 +00:00
|
|
|
|
2016-09-15 07:41:48 +00:00
|
|
|
copym = m_copym(m, 0, M_COPYALL, M_NOWAIT);
|
2000-07-04 16:35:15 +00:00
|
|
|
if (copym == NULL)
|
|
|
|
return;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Make sure to deep-copy the IPv6 header portion in case the data
|
|
|
|
* is in an mbuf cluster, so that we can safely override the IPv6
|
|
|
|
* header portion later.
|
|
|
|
*/
|
2014-10-12 15:49:52 +00:00
|
|
|
if (!M_WRITABLE(copym) ||
|
2000-07-04 16:35:15 +00:00
|
|
|
copym->m_len < sizeof(struct ip6_hdr)) {
|
|
|
|
copym = m_pullup(copym, sizeof(struct ip6_hdr));
|
|
|
|
if (copym == NULL)
|
|
|
|
return;
|
1999-11-22 02:45:11 +00:00
|
|
|
}
|
2001-06-11 12:39:29 +00:00
|
|
|
ip6 = mtod(copym, struct ip6_hdr *);
|
|
|
|
/*
|
|
|
|
* Clear embedded scope identifiers if necessary.
|
|
|
|
* in6_clearscope() touches the addresses only when necessary.
|
|
|
|
*/
|
|
|
|
in6_clearscope(&ip6->ip6_src);
|
|
|
|
in6_clearscope(&ip6->ip6_dst);
|
2015-05-07 14:17:43 +00:00
|
|
|
if (copym->m_pkthdr.csum_flags & CSUM_DELAY_DATA_IPV6) {
|
|
|
|
copym->m_pkthdr.csum_flags |= CSUM_DATA_VALID_IPV6 |
|
|
|
|
CSUM_PSEUDO_HDR;
|
|
|
|
copym->m_pkthdr.csum_data = 0xffff;
|
|
|
|
}
|
2015-08-08 15:58:35 +00:00
|
|
|
if_simloop(ifp, copym, AF_INET6, 0);
|
1999-11-22 02:45:11 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Chop the IPv6 header off the payload.
|
|
|
|
*/
|
|
|
|
static int
|
2007-07-05 16:23:49 +00:00
|
|
|
ip6_splithdr(struct mbuf *m, struct ip6_exthdrs *exthdrs)
|
1999-11-22 02:45:11 +00:00
|
|
|
{
|
|
|
|
struct mbuf *mh;
|
|
|
|
struct ip6_hdr *ip6;
|
|
|
|
|
|
|
|
ip6 = mtod(m, struct ip6_hdr *);
|
|
|
|
if (m->m_len > sizeof(*ip6)) {
|
2013-03-15 12:50:29 +00:00
|
|
|
mh = m_gethdr(M_NOWAIT, MT_DATA);
|
|
|
|
if (mh == NULL) {
|
1999-11-22 02:45:11 +00:00
|
|
|
m_freem(m);
|
|
|
|
return (ENOBUFS);
|
|
|
|
}
|
2013-03-15 13:48:53 +00:00
|
|
|
m_move_pkthdr(mh, m);
|
2015-01-05 09:58:32 +00:00
|
|
|
M_ALIGN(mh, sizeof(*ip6));
|
1999-11-22 02:45:11 +00:00
|
|
|
m->m_len -= sizeof(*ip6);
|
|
|
|
m->m_data += sizeof(*ip6);
|
|
|
|
mh->m_next = m;
|
|
|
|
m = mh;
|
|
|
|
m->m_len = sizeof(*ip6);
|
|
|
|
bcopy((caddr_t)ip6, mtod(m, caddr_t), sizeof(*ip6));
|
|
|
|
}
|
|
|
|
exthdrs->ip6e_ip6 = m;
|
|
|
|
return (0);
|
|
|
|
}
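/*
 * Illustration only, not part of the kernel source: a rough userland
 * sketch of the idea behind ip6_splithdr(), namely moving a fixed-size
 * header into its own buffer at the front of a chain.  struct chunk and
 * split_header are hypothetical stand-ins for mbufs and are far simpler
 * than the real allocator.  Unlike the kernel routine, this sketch leaves
 * the chain untouched on allocation failure instead of freeing it.
 */

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, much simplified stand-in for one mbuf in a chain. */
struct chunk {
	struct chunk *next;
	unsigned char *data;		/* start of valid bytes */
	size_t len;			/* number of valid bytes */
	unsigned char store[64];	/* storage for a split-off header */
};

/*
 * If the first chunk holds more than hdrlen bytes, prepend a new chunk
 * holding a copy of the header and advance the old chunk past it.
 * Returns the head of the chain, or NULL if allocation fails.
 */
static struct chunk *
split_header(struct chunk *c, size_t hdrlen)
{
	struct chunk *h;

	assert(c != NULL && hdrlen <= sizeof(h->store));
	if (c->len <= hdrlen)
		return (c);		/* header already stands alone */
	h = calloc(1, sizeof(*h));
	if (h == NULL)
		return (NULL);
	memcpy(h->store, c->data, hdrlen);
	h->data = h->store;
	h->len = hdrlen;
	c->data += hdrlen;		/* old chunk now starts at the payload */
	c->len -= hdrlen;
	h->next = c;
	return (h);
}
```

/*
 * With a 48-byte chunk and hdrlen = 40, the result is a 40-byte head
 * chunk followed by the original chunk trimmed to its last 8 bytes.
 */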
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Compute the total length of the IPv6 extension headers.
|
|
|
|
*/
|
|
|
|
int
|
2019-08-02 07:41:36 +00:00
|
|
|
ip6_optlen(struct inpcb *inp)
|
1999-11-22 02:45:11 +00:00
|
|
|
{
|
|
|
|
int len;
|
|
|
|
|
2019-08-02 07:41:36 +00:00
|
|
|
if (!inp->in6p_outputopts)
|
1999-11-22 02:45:11 +00:00
|
|
|
return (0);
|
|
|
|
|
|
|
|
len = 0;
|
|
|
|
#define elen(x) \
|
|
|
|
(((struct ip6_ext *)(x)) ? (((struct ip6_ext *)(x))->ip6e_len + 1) << 3 : 0)
|
|
|
|
|
2019-08-02 07:41:36 +00:00
|
|
|
len += elen(inp->in6p_outputopts->ip6po_hbh);
|
|
|
|
if (inp->in6p_outputopts->ip6po_rthdr)
|
2001-06-11 12:39:29 +00:00
|
|
|
/* dest1 is valid with rthdr only */
|
2019-08-02 07:41:36 +00:00
|
|
|
len += elen(inp->in6p_outputopts->ip6po_dest1);
|
|
|
|
len += elen(inp->in6p_outputopts->ip6po_rthdr);
|
|
|
|
len += elen(inp->in6p_outputopts->ip6po_dest2);
|
1999-11-22 02:45:11 +00:00
|
|
|
return (len);
|
|
|
|
#undef elen
|
|
|
|
}
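/*
 * Illustration only, not part of the kernel source: a userland sketch of
 * the arithmetic in ip6_optlen().  Each extension header stores its
 * length in its second byte, in 8-byte units not counting the first 8
 * bytes, so the byte length is (len + 1) << 3.  The names ext_len,
 * pktopts_sketch and optlen_sketch are hypothetical.
 */

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte length of one extension header, or 0 when the header is absent. */
static size_t
ext_len(const uint8_t *hdr)
{
	return (hdr != NULL ? ((size_t)hdr[1] + 1) << 3 : 0);
}

/* Hypothetical stand-in for the four header pointers in ip6_pktopts. */
struct pktopts_sketch {
	const uint8_t *hbh;
	const uint8_t *dest1;
	const uint8_t *rthdr;
	const uint8_t *dest2;
};

/* Sum the header lengths; dest1 counts only when a routing header exists. */
static size_t
optlen_sketch(const struct pktopts_sketch *o)
{
	size_t len = 0;

	if (o == NULL)
		return (0);
	len += ext_len(o->hbh);
	if (o->rthdr != NULL)
		len += ext_len(o->dest1);
	len += ext_len(o->rthdr);
	len += ext_len(o->dest2);
	return (len);
}
```

/*
 * For example, an 8-byte hop-by-hop header, a 16-byte dest1, and an
 * 8-byte dest2 total 16 bytes without a routing header, because dest1 is
 * skipped; adding a 24-byte routing header brings the total to 56.
 */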
|