2005-01-07 01:45:51 +00:00
|
|
|
/*-
|
2017-11-20 19:43:44 +00:00
|
|
|
* SPDX-License-Identifier: BSD-3-Clause
|
|
|
|
*
|
1995-09-22 20:05:58 +00:00
|
|
|
* Copyright (c) 1982, 1986, 1988, 1990, 1993, 1995
|
1994-05-24 10:09:53 +00:00
|
|
|
* The Regents of the University of California. All rights reserved.
|
|
|
|
*
|
|
|
|
* Redistribution and use in source and binary forms, with or without
|
|
|
|
* modification, are permitted provided that the following conditions
|
|
|
|
* are met:
|
|
|
|
* 1. Redistributions of source code must retain the above copyright
|
|
|
|
* notice, this list of conditions and the following disclaimer.
|
|
|
|
* 2. Redistributions in binary form must reproduce the above copyright
|
|
|
|
* notice, this list of conditions and the following disclaimer in the
|
|
|
|
* documentation and/or other materials provided with the distribution.
|
2017-02-28 23:42:47 +00:00
|
|
|
* 3. Neither the name of the University nor the names of its contributors
|
1994-05-24 10:09:53 +00:00
|
|
|
* may be used to endorse or promote products derived from this software
|
|
|
|
* without specific prior written permission.
|
|
|
|
*
|
|
|
|
* THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
|
|
|
|
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
|
|
|
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
|
|
|
* ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
|
|
|
|
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
|
|
|
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
|
|
|
|
* OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
|
|
|
|
* HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
|
|
|
|
* LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
|
|
|
|
* OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
|
|
|
|
* SUCH DAMAGE.
|
|
|
|
*
|
1995-09-22 20:05:58 +00:00
|
|
|
* @(#)tcp_output.c 8.4 (Berkeley) 5/24/95
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
|
|
|
|
2007-10-07 20:44:24 +00:00
|
|
|
#include <sys/cdefs.h>
|
|
|
|
__FBSDID("$FreeBSD$");
|
|
|
|
|
Initial import of RFC 2385 (TCP-MD5) digest support.
This is the first of two commits; bringing in the kernel support first.
This can be enabled by compiling a kernel with options TCP_SIGNATURE
and FAST_IPSEC.
For the uninitiated, this is a TCP option which provides for a means of
authenticating TCP sessions which came into being before IPSEC. It is
still relevant today, however, as it is used by many commercial router
vendors, particularly with BGP, and as such has become a requirement for
interconnect at many major Internet points of presence.
Several parts of the TCP and IP headers, including the segment payload,
are digested with MD5, including a shared secret. The PF_KEY interface
is used to manage the secrets using security associations in the SADB.
There is a limitation here in that there is no way to map a TCP flow
per-port back to an SPI without polluting tcpcb or using the SPD; the
code to do the latter is unstable at this time. Therefore this code only
supports per-host keying granularity.
Whilst FAST_IPSEC is mutually exclusive with KAME IPSEC (and thus IPv6),
TCP_SIGNATURE applies only to IPv4. For the vast majority of prospective
users of this feature, this will not pose any problem.
This implementation is output-only; that is, the option is honoured when
responding to a host initiating a TCP session, but no effort is made
[yet] to authenticate inbound traffic. This is, however, sufficient to
interwork with Cisco equipment.
Tested with a Cisco 2501 running IOS 12.0(27), and Quagga 0.96.4 with
local patches. Patches for tcpdump to validate TCP-MD5 sessions are also
available from me upon request.
Sponsored by: sentex.net
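As a rough illustration of "several parts of the TCP and IP headers ... are digested with MD5", the sketch below lays out the digest input in the order RFC 2385 gives: IPv4 pseudo-header, fixed TCP header with the checksum zeroed, segment data, then the shared secret. It is not the kernel code from this commit. The names demo_md5sig_input, DEMO_TCP_HDR_LEN and DEMO_IPPROTO_TCP are hypothetical, and the MD5 step itself is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define	DEMO_TCP_HDR_LEN	20	/* fixed TCP header, no options */
#define	DEMO_IPPROTO_TCP	6

/*
 * Hypothetical helper: copy the bytes RFC 2385 feeds to MD5 into 'out'.
 * 'seg_len' is the TCP length value carried in the pseudo-header.
 * Returns the number of bytes written, or 0 if 'out' is too small.
 */
static size_t
demo_md5sig_input(const uint8_t src[4], const uint8_t dst[4],
    uint16_t seg_len, const uint8_t tcp_hdr[DEMO_TCP_HDR_LEN],
    const uint8_t *payload, size_t payload_len,
    const uint8_t *key, size_t key_len,
    uint8_t *out, size_t out_cap)
{
	size_t need = 12 + DEMO_TCP_HDR_LEN + payload_len + key_len;
	size_t off = 0;

	assert(out != NULL);
	if (need > out_cap)
		return (0);

	/* 1. IPv4 pseudo-header: addresses, zero pad, protocol, length. */
	memcpy(out + off, src, 4);
	off += 4;
	memcpy(out + off, dst, 4);
	off += 4;
	out[off++] = 0;
	out[off++] = DEMO_IPPROTO_TCP;
	out[off++] = (uint8_t)(seg_len >> 8);
	out[off++] = (uint8_t)(seg_len & 0xff);

	/* 2. TCP header without options; checksum (bytes 16-17) zeroed. */
	memcpy(out + off, tcp_hdr, DEMO_TCP_HDR_LEN);
	out[off + 16] = 0;
	out[off + 17] = 0;
	off += DEMO_TCP_HDR_LEN;

	/* 3. Segment data, if any. */
	if (payload_len > 0) {
		memcpy(out + off, payload, payload_len);
		off += payload_len;
	}

	/* 4. The shared secret goes last. */
	if (key_len > 0) {
		memcpy(out + off, key, key_len);
		off += key_len;
	}
	return (off);
}
```

The resulting buffer would then be passed to an MD5 implementation, and the 16-byte result carried in the TCP option.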
2004-02-11 04:26:04 +00:00
|
|
|
#include "opt_inet.h"
|
2000-01-09 19:17:30 +00:00
|
|
|
#include "opt_inet6.h"
|
2000-01-15 14:56:38 +00:00
|
|
|
#include "opt_ipsec.h"
|
Add kernel-side support for in-kernel TLS.
KTLS adds support for in-kernel framing and encryption of Transport
Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports
offload of TLS for transmitted data. Key negotiation must still be
performed in userland. Once completed, transmit session keys for a
connection are provided to the kernel via a new TCP_TXTLS_ENABLE
socket option. All subsequent data transmitted on the socket is
placed into TLS frames and encrypted using the supplied keys.
Any data written to a KTLS-enabled socket via write(2), aio_write(2),
or sendfile(2) is assumed to be application data and is encoded in TLS
frames with an application data type. Individual records can be sent
with a custom type (e.g. handshake messages) via sendmsg(2) with a new
control message (TLS_SET_RECORD_TYPE) specifying the record type.
At present, rekeying is not supported though the in-kernel framework
should support rekeying.
KTLS makes use of the recently added unmapped mbufs to store TLS
frames in the socket buffer. Each TLS frame is described by a single
ext_pgs mbuf. The ext_pgs structure contains the header of the TLS
record (and trailer for encrypted records) as well as references to
the associated TLS session.
KTLS supports two primary methods of encrypting TLS frames: software
TLS and ifnet TLS.
Software TLS marks mbufs holding socket data as not ready via
M_NOTREADY similar to sendfile(2) when TLS framing information is
added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then
called to schedule TLS frames for encryption. In the case of
sendfile(2), sendfile_iodone() calls ktls_enqueue() instead of
pru_ready(), leaving the mbufs marked M_NOTREADY until encryption is
completed. For other writes (vn_sendfile when pages are available,
write(2), etc.), the PRUS_NOTREADY flag is set when invoking pru_send()
along with invoking ktls_enqueue().
A pool of worker threads (the "KTLS" kernel process) encrypts TLS
frames queued via ktls_enqueue(). Each TLS frame is temporarily
mapped using the direct map and passed to a software encryption
backend to perform the actual encryption.
(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if
someone wished to make this work on architectures without a direct
map.)
KTLS supports pluggable software encryption backends. Internally,
Netflix uses proprietary pure-software backends. This commit includes
a simple backend in a new ktls_ocf.ko module that uses the kernel's
OpenCrypto framework to provide AES-GCM encryption of TLS frames. As
a result, software TLS is now a bit of a misnomer as it can make use
of hardware crypto accelerators.
Once software encryption has finished, the TLS frame mbufs are marked
ready via pru_ready(). At this point, the encrypted data appears as
regular payload to the TCP stack stored in unmapped mbufs.
ifnet TLS permits a NIC to offload the TLS encryption and TCP
segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS)
is allocated on the interface a socket is routed over and associated
with a TLS session. TLS records for a TLS session using ifnet TLS are
not marked M_NOTREADY but are passed down the stack unencrypted. The
ip_output_send() and ip6_output_send() helper functions that apply
send tags to outbound IP packets verify that the send tag of the TLS
record matches the outbound interface. If so, the packet is tagged
with the TLS send tag and sent to the interface. The NIC device
driver must recognize packets with the TLS send tag and schedule them
for TLS encryption and TCP segmentation. If the outbound
interface does not match the interface in the TLS send tag, the packet
is dropped. In addition, a task is scheduled to refresh the TLS send
tag for the TLS session. If a new TLS send tag cannot be allocated,
the connection is dropped. If a new TLS send tag is allocated,
however, subsequent packets will be tagged with the correct TLS send
tag. (This latter case has been tested by configuring both ports of a
Chelsio T6 in a lagg and failing over from one port to another. As
the connections migrated to the new port, new TLS send tags were
allocated for the new port and connections resumed without being
dropped.)
ifnet TLS can be enabled and disabled on supported network interfaces
via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported
across both vlan devices and lagg interfaces using failover, lacp with
flowid enabled, or lacp with flowid enabled.
Applications may request the current KTLS mode of a connection via a
new TCP_TXTLS_MODE socket option. They can also use this socket
option to toggle between software and ifnet TLS modes.
In addition, a testing tool is available in tools/tools/switch_tls.
This is modeled on tcpdrop and uses similar syntax. However, instead
of dropping connections, -s is used to force KTLS connections to
switch to software TLS and -i is used to switch to ifnet TLS.
Various sysctls and counters are available under the kern.ipc.tls
sysctl node. The kern.ipc.tls.enable node must be set to true to
enable KTLS (it is off by default). The use of unmapped mbufs must
also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.
KTLS is enabled via the KERN_TLS kernel option.
This patch is the culmination of years of work by several folks
including Scott Long and Randall Stewart for the original design and
implementation; Drew Gallatin for several optimizations including the
use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records
awaiting software encryption, and pluggable software crypto backends;
and John Baldwin for modifications to support hardware TLS offload.
Reviewed by: gallatin, hselasky, rrs
Obtained from: Netflix
Sponsored by: Netflix, Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D21277
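To make "placed into TLS frames" more concrete, here is a minimal userland sketch of the 5-byte TLS record header that precedes each frame's payload: content type, protocol version, and a big-endian length. It is not ktls_frame(). The demo_* names are hypothetical, and the 16384-byte limit is the plaintext maximum from the TLS specification, not necessarily the frame size a given KTLS session uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	DEMO_TLS_HDR_LEN	5
#define	DEMO_TLS_TYPE_HANDSHAKE	22	/* TLS content type: handshake */
#define	DEMO_TLS_TYPE_APP	23	/* TLS content type: application_data */
#define	DEMO_TLS_MAX_PLAINTEXT	16384	/* 2^14, per the TLS record layer */

/* Hypothetical helper: fill in a TLS record header. */
static void
demo_tls_write_header(uint8_t hdr[DEMO_TLS_HDR_LEN], uint8_t type,
    uint8_t vmajor, uint8_t vminor, uint16_t len)
{
	assert(hdr != NULL);
	hdr[0] = type;
	hdr[1] = vmajor;		/* 3 for TLS 1.0 through 1.2 */
	hdr[2] = vminor;		/* 1, 2 or 3 for TLS 1.0, 1.1, 1.2 */
	hdr[3] = (uint8_t)(len >> 8);	/* length is big-endian */
	hdr[4] = (uint8_t)(len & 0xff);
}

/* Number of records needed to carry 'nbytes' of application data. */
static size_t
demo_tls_record_count(size_t nbytes)
{
	return ((nbytes + DEMO_TLS_MAX_PLAINTEXT - 1) /
	    DEMO_TLS_MAX_PLAINTEXT);
}
```

Data written with write(2) would get the application data type, while a record sent with TLS_SET_RECORD_TYPE would carry whatever type the control message names, for example the handshake type.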
2019-08-27 00:01:56 +00:00
|
|
|
#include "opt_kern_tls.h"
|
1997-09-16 18:36:06 +00:00
|
|
|
#include "opt_tcpdebug.h"
|
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
#include <sys/param.h>
|
|
|
|
#include <sys/systm.h>
|
2019-12-02 20:58:04 +00:00
|
|
|
#include <sys/arb.h>
|
2001-05-01 08:13:21 +00:00
|
|
|
#include <sys/domain.h>
|
In the TCP stack, the hhook(9) framework provides hooks for kernel modules
to add actions that run when a TCP frame is sent or received on a TCP
session in the ESTABLISHED state. In the base tree, this functionality is
only used for the h_ertt module, which is used by the cc_cdg, cc_chd, cc_hd,
and cc_vegas congestion control modules.
Presently, we incur overhead to check for hooks each time a TCP frame is
sent or received on an ESTABLISHED TCP session.
This change adds a new compile-time option (TCP_HHOOK) to determine whether
to include the hhook(9) framework for TCP. To retain backwards
compatibility, I added the TCP_HHOOK option to every configuration file that
already defined "options INET". (Therefore, this patch introduces no
functional change. In order to see a functional difference, you need to
compile a custom kernel without the TCP_HHOOK option.) This change will
allow users to easily exclude this functionality from their kernel, should
they wish to do so.
Note that any users who use a custom kernel configuration and use one of the
congestion control modules listed above will need to add the TCP_HHOOK
option to their kernel configuration.
Reviewed by: rrs, lstewart, hiren (previous version), sjg (makefiles only)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D8185
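The effect of a compile-time option like this can be sketched in a few lines: with the option defined, the hook call is a real function, and without it the call expands to nothing, so the send path carries no per-segment check. This is a toy model, not the kernel's hhook(9) code. DEMO_TCP_HHOOK, demo_hhook_run and demo_tcp_send_segment are hypothetical names.

```c
#include <assert.h>

/*
 * Hypothetical stand-in for "options TCP_HHOOK". Uncomment to compile
 * the hook call in.
 */
/* #define DEMO_TCP_HHOOK */

static int demo_hook_runs;	/* counts hook invocations */

#ifdef DEMO_TCP_HHOOK
static void
demo_hhook_run(void)
{
	demo_hook_runs++;
}
#else
/* Option absent: the call disappears at compile time. */
#define	demo_hhook_run()	((void)0)
#endif

/* Toy send path: runs the hooks (if compiled in), then "sends". */
static int
demo_tcp_send_segment(int len)
{
	assert(len >= 0);
	demo_hhook_run();
	return (len);
}
```

As written, with DEMO_TCP_HHOOK left undefined, demo_hook_runs stays at zero no matter how many segments are sent.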
2016-10-12 02:16:42 +00:00
|
|
|
#ifdef TCP_HHOOK
|
2010-12-28 12:13:30 +00:00
|
|
|
#include <sys/hhook.h>
|
2016-10-12 02:16:42 +00:00
|
|
|
#endif
|
1999-05-27 12:24:21 +00:00
|
|
|
#include <sys/kernel.h>
|
2019-08-27 00:01:56 +00:00
|
|
|
#ifdef KERN_TLS
|
|
|
|
#include <sys/ktls.h>
|
|
|
|
#endif
|
2001-05-01 08:13:21 +00:00
|
|
|
#include <sys/lock.h>
|
1994-05-24 10:09:53 +00:00
|
|
|
#include <sys/mbuf.h>
|
2001-05-01 08:13:21 +00:00
|
|
|
#include <sys/mutex.h>
|
1994-05-24 10:09:53 +00:00
|
|
|
#include <sys/protosw.h>
|
2019-12-02 20:58:04 +00:00
|
|
|
#include <sys/qmath.h>
|
2013-08-25 21:54:41 +00:00
|
|
|
#include <sys/sdt.h>
|
1994-05-24 10:09:53 +00:00
|
|
|
#include <sys/socket.h>
|
|
|
|
#include <sys/socketvar.h>
|
2001-05-01 08:13:21 +00:00
|
|
|
#include <sys/sysctl.h>
|
2019-12-02 20:58:04 +00:00
|
|
|
#include <sys/stats.h>
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2008-12-02 21:37:28 +00:00
|
|
|
#include <net/if.h>
|
1994-05-24 10:09:53 +00:00
|
|
|
#include <net/route.h>
|
2020-04-25 09:06:11 +00:00
|
|
|
#include <net/route/nhop.h>
|
2009-08-01 19:26:27 +00:00
|
|
|
#include <net/vnet.h>
|
1994-05-24 10:09:53 +00:00
|
|
|
|
|
|
|
#include <netinet/in.h>
|
2013-08-25 21:54:41 +00:00
|
|
|
#include <netinet/in_kdtrace.h>
|
1994-05-24 10:09:53 +00:00
|
|
|
#include <netinet/in_systm.h>
|
|
|
|
#include <netinet/ip.h>
|
|
|
|
#include <netinet/in_pcb.h>
|
|
|
|
#include <netinet/ip_var.h>
|
2005-11-18 20:12:40 +00:00
|
|
|
#include <netinet/ip_options.h>
|
2000-01-09 19:17:30 +00:00
|
|
|
#ifdef INET6
|
2000-07-04 16:35:15 +00:00
|
|
|
#include <netinet6/in6_pcb.h>
|
|
|
|
#include <netinet/ip6.h>
|
2000-01-09 19:17:30 +00:00
|
|
|
#include <netinet6/ip6_var.h>
|
|
|
|
#endif
|
2016-01-21 22:34:51 +00:00
|
|
|
#include <netinet/tcp.h>
|
2018-06-23 06:53:53 +00:00
|
|
|
#define TCPOUTFLAGS
|
1994-05-24 10:09:53 +00:00
|
|
|
#include <netinet/tcp_fsm.h>
|
2018-03-22 09:40:08 +00:00
|
|
|
#include <netinet/tcp_log_buf.h>
|
1994-05-24 10:09:53 +00:00
|
|
|
#include <netinet/tcp_seq.h>
|
|
|
|
#include <netinet/tcp_timer.h>
|
|
|
|
#include <netinet/tcp_var.h>
|
|
|
|
#include <netinet/tcpip.h>
|
2016-01-27 17:59:39 +00:00
|
|
|
#include <netinet/cc/cc.h>
|
2018-02-26 02:53:22 +00:00
|
|
|
#include <netinet/tcp_fastopen.h>
|
There are times when it would be really nice to have a record of the last few
packets and/or state transitions from each TCP socket. That would help with
narrowing down certain problems we see in the field that are hard to reproduce
without understanding the history of how we got into a certain state. This
change provides just that.
It saves copies of the last N packets in a list in the tcpcb. When the tcpcb is
destroyed, the list is freed. I thought this was likely to be more
performance-friendly than saving copies of the tcpcb. Plus, with the packets,
you should be able to reverse-engineer what happened to the tcpcb.
To enable the feature, you will need to compile a kernel with the TCPPCAP
option. Even then, the feature defaults to being deactivated. You can activate
it by setting a positive value for the number of captured packets. You can do
that on either a global basis or on a per-socket basis (via a setsockopt call).
There is no way to get the packets out of the kernel other than using kmem or
getting a coredump. I thought that would help some of the legal/privacy concerns
regarding such a feature. However, a future effort could add a way to export
them in PCAP format.
I tested this at low scale, and found that there were no mbuf leaks and the peak
mbuf usage appeared to be unchanged with and without the feature.
The main performance concern I can envision is the number of mbufs that would be
used on systems with a large number of sockets. If you save five packets per
direction per socket and have 3,000 sockets, that will consume at least 30,000
mbufs just to keep these packets. I tried to reduce the concerns associated with
this by limiting the number of clusters (not mbufs) that could be used for this
feature. Again, in my testing, that appears to work correctly.
Differential Revision: D3100
Submitted by: Jonathan Looney <jlooney at juniper dot net>
Reviewed by: gnn, hiren
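The "last N packets" bookkeeping and the mbuf estimate above can be sketched as follows. This is an assumption-laden toy: it uses a fixed-size ring of integer packet IDs in place of the mbuf list the commit keeps in the tcpcb, and every demo_* name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define	DEMO_PCAP_MAX	5	/* packets kept per direction */

struct demo_pkt_ring {
	int	ids[DEMO_PCAP_MAX];	/* stand-in for saved packet copies */
	size_t	head;			/* index of the oldest entry */
	size_t	count;			/* entries currently stored */
};

/* Save a packet, overwriting the oldest one once the ring is full. */
static void
demo_ring_save(struct demo_pkt_ring *r, int id)
{
	assert(r != NULL);
	if (r->count < DEMO_PCAP_MAX) {
		r->ids[(r->head + r->count) % DEMO_PCAP_MAX] = id;
		r->count++;
	} else {
		r->ids[r->head] = id;
		r->head = (r->head + 1) % DEMO_PCAP_MAX;
	}
}

/* Oldest packet still held; only meaningful when count > 0. */
static int
demo_ring_oldest(const struct demo_pkt_ring *r)
{
	return (r->ids[r->head]);
}

/* Lower bound on mbufs: one per saved packet, two directions. */
static unsigned long
demo_min_mbufs(unsigned long sockets, unsigned long per_direction)
{
	return (sockets * per_direction * 2);
}
```

With five packets per direction and 3,000 sockets, demo_min_mbufs() gives the 30,000 figure quoted in the commit message.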
2015-10-14 00:35:37 +00:00
|
|
|
#ifdef TCPPCAP
|
|
|
|
#include <netinet/tcp_pcap.h>
|
|
|
|
#endif
|
1994-09-15 10:36:56 +00:00
|
|
|
#ifdef TCPDEBUG
|
1994-05-24 10:09:53 +00:00
|
|
|
#include <netinet/tcp_debug.h>
|
1994-09-15 10:36:56 +00:00
|
|
|
#endif
|
2012-06-19 07:34:13 +00:00
|
|
|
#ifdef TCP_OFFLOAD
|
|
|
|
#include <netinet/tcp_offload.h>
|
|
|
|
#endif
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2017-02-06 08:49:57 +00:00
|
|
|
#include <netipsec/ipsec_support.h>
|
2002-10-16 02:25:05 +00:00
|
|
|
|
2000-03-27 19:14:27 +00:00
|
|
|
#include <machine/in_cksum.h>
|
|
|
|
|
2006-10-22 11:52:19 +00:00
|
|
|
#include <security/mac/mac_framework.h>
|
|
|
|
|
2010-04-29 11:52:42 +00:00
|
|
|
VNET_DEFINE(int, path_mtu_discovery) = 1;
|
2014-11-07 09:39:05 +00:00
|
|
|
SYSCTL_INT(_net_inet_tcp, OID_AUTO, path_mtu_discovery, CTLFLAG_VNET | CTLFLAG_RW,
|
Build on Jeff Roberson's linker-set based dynamic per-CPU allocator
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator. Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...). This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.
Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack. Virtualized global variables are
tagged at compile-time, placing them in a special linker set, which is
loaded into a contiguous region of kernel memory. Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of the kernel linker.
Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy. Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address. When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.
This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.
Bump __FreeBSD_version and update UPDATING.
Portions submitted by: bz
Reviewed by: bz, zec
Discussed with: gnn, jamie, jeff, jhb, julian, sam
Suggested by: peter
Approved by: re (kensmith)
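The "reference copy plus per-vnet copy" scheme can be modelled in userland roughly as below. A struct stands in for the linker set, each toy vnet gets a heap copy initialized from the reference values, and an accessor macro turns (vnet, symbol) into an address inside that copy. This is only a sketch of the idea, not the kernel's VNET() machinery. All demo_* names are hypothetical, and every variable is assumed to be an int for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the linker set: reference copies, statically initialized. */
struct demo_vnet_set {
	int	path_mtu_discovery;
	int	tcp_sendspace;
};
static const struct demo_vnet_set demo_ref = { 1, 1024 * 32 };

/* One toy network stack instance with its own copy of the set. */
struct demo_vnet {
	char	*data;
};

/* Accessor: per-vnet base address plus the symbol's offset in the set. */
#define	DEMO_VNET(vnet, sym)						\
	(*(int *)((vnet)->data + offsetof(struct demo_vnet_set, sym)))

/* Create the per-vnet memory and initialize it from the reference copy. */
static int
demo_vnet_alloc(struct demo_vnet *v)
{
	assert(v != NULL);
	v->data = malloc(sizeof(demo_ref));
	if (v->data == NULL)
		return (-1);
	memcpy(v->data, &demo_ref, sizeof(demo_ref));
	return (0);
}

static void
demo_vnet_free(struct demo_vnet *v)
{
	free(v->data);
	v->data = NULL;
}
```

Writing through DEMO_VNET() in one instance leaves the other instances, and the reference copy, untouched, which is the property that restores static initialization for virtualized globals.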
2009-07-14 22:48:30 +00:00
|
|
|
&VNET_NAME(path_mtu_discovery), 1,
|
|
|
|
"Enable Path MTU Discovery");
|
|
|
|
|
2010-04-29 11:52:42 +00:00
|
|
|
VNET_DEFINE(int, tcp_do_tso) = 1;
|
2014-11-07 09:39:05 +00:00
|
|
|
SYSCTL_INT(_net_inet_tcp, OID_AUTO, tso, CTLFLAG_VNET | CTLFLAG_RW,
|
2009-07-14 22:48:30 +00:00
|
|
|
&VNET_NAME(tcp_do_tso), 0,
|
|
|
|
"Enable TCP Segmentation Offload");
|
|
|
|
|
2011-10-16 20:18:39 +00:00
|
|
|
VNET_DEFINE(int, tcp_sendspace) = 1024*32;
|
|
|
|
#define V_tcp_sendspace VNET(tcp_sendspace)
|
2014-11-07 09:39:05 +00:00
|
|
|
SYSCTL_INT(_net_inet_tcp, TCPCTL_SENDSPACE, sendspace, CTLFLAG_VNET | CTLFLAG_RW,
|
2011-10-16 20:18:39 +00:00
|
|
|
&VNET_NAME(tcp_sendspace), 0, "Initial send socket buffer size");
|
|
|
|
|
2010-04-29 11:52:42 +00:00
|
|
|
VNET_DEFINE(int, tcp_do_autosndbuf) = 1;
|
2014-11-07 09:39:05 +00:00
|
|
|
SYSCTL_INT(_net_inet_tcp, OID_AUTO, sendbuf_auto, CTLFLAG_VNET | CTLFLAG_RW,
|
2009-07-14 22:48:30 +00:00
|
|
|
&VNET_NAME(tcp_do_autosndbuf), 0,
|
|
|
|
"Enable automatic send buffer sizing");
|
|
|
|
|
2010-04-29 11:52:42 +00:00
|
|
|
VNET_DEFINE(int, tcp_autosndbuf_inc) = 8*1024;
|
2014-11-07 09:39:05 +00:00
|
|
|
SYSCTL_INT(_net_inet_tcp, OID_AUTO, sendbuf_inc, CTLFLAG_VNET | CTLFLAG_RW,
|
2009-07-14 22:48:30 +00:00
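The commit message above describes a reference copy, one private copy per network stack instance, and an accessor that turns "instance plus symbol" into an address. A minimal userland sketch of that idea follows. Every sketch_* and SKETCH_* name is hypothetical, a plain struct stands in for the linker set, and malloc() stands in for the kernel allocator, so this is an illustration of the addressing scheme only, not the kernel's actual macros.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the linker set that holds the reference copies. */
struct sketch_vnet_set {
	int tcp_autosndbuf_inc;
	int tcp_autosndbuf_max;
};

/* Statically initialized reference copy (defaults as defined in this file). */
static const struct sketch_vnet_set sketch_reference = {
	.tcp_autosndbuf_inc = 8 * 1024,
	.tcp_autosndbuf_max = 2 * 1024 * 1024,
};

/* One virtual network stack instance with its private data region. */
struct sketch_vnet {
	unsigned char *data;
};

/* Create an instance: allocate its region and copy the reference values. */
int
sketch_vnet_init(struct sketch_vnet *v)
{
	assert(v != NULL);
	v->data = malloc(sizeof(sketch_reference));
	if (v->data == NULL)
		return (-1);
	memcpy(v->data, &sketch_reference, sizeof(sketch_reference));
	return (0);
}

/* Destroy an instance and release its region. */
void
sketch_vnet_fini(struct sketch_vnet *v)
{
	free(v->data);
	v->data = NULL;
}

/* Accessor: per-instance base address plus the symbol's offset in the set. */
#define SKETCH_VNET_INT(v, field) \
	(*(int *)(void *)((v)->data + offsetof(struct sketch_vnet_set, field)))
```

Writing through SKETCH_VNET_INT() on one instance leaves the other instances and the reference copy untouched, which is the property the per-vnet allocator is meant to provide.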
|
|
|
&VNET_NAME(tcp_autosndbuf_inc), 0,
|
Step 1.5 of importing the network stack virtualization infrastructure
from the vimage project, as per plan established at devsummit 08/08:
http://wiki.freebsd.org/Image/Notes200808DevSummit
Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator
macros, and CURVNET_SET() context setting macros, all currently
resolving to NOPs.
Prepare for virtualization of selected SYSCTL objects by introducing a
family of SYSCTL_V_*() macros, currently resolving to their global
counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT().
Move selected #defines from sys/sys/vimage.h to newly introduced header
files specific to virtualized subsystems (sys/net/vnet.h,
sys/netinet/vinet.h etc.).
All the changes are verified to have zero functional impact at this
point in time by doing MD5 comparison between pre- and post-change
object files(*).
(*) netipsec/keysock.c did not validate depending on compile time options.
Implemented by: julian, bz, brooks, zec
Reviewed by: julian, bz, brooks, kris, rwatson, ...
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation
2008-10-02 15:37:58 +00:00
|
|
|
"Incrementor step size of automatic send buffer");
|
2007-02-01 18:32:13 +00:00
|
|
|
|
2011-08-25 09:20:13 +00:00
|
|
|
VNET_DEFINE(int, tcp_autosndbuf_max) = 2*1024*1024;
|
2014-11-07 09:39:05 +00:00
|
|
|
SYSCTL_INT(_net_inet_tcp, OID_AUTO, sendbuf_max, CTLFLAG_VNET | CTLFLAG_RW,
|
2009-07-14 22:48:30 +00:00
|
|
|
&VNET_NAME(tcp_autosndbuf_max), 0,
|
2008-10-02 15:37:58 +00:00
|
|
|
"Max size of automatic send buffer");
|
2007-02-01 18:32:13 +00:00
|
|
|
|
2017-07-03 19:39:58 +00:00
|
|
|
VNET_DEFINE(int, tcp_sendbuf_auto_lowat) = 0;
|
|
|
|
#define V_tcp_sendbuf_auto_lowat VNET(tcp_sendbuf_auto_lowat)
|
|
|
|
SYSCTL_INT(_net_inet_tcp, OID_AUTO, sendbuf_auto_lowat, CTLFLAG_VNET | CTLFLAG_RW,
|
|
|
|
&VNET_NAME(tcp_sendbuf_auto_lowat), 0,
|
|
|
|
"Modify threshold for auto send buffer growth to account for SO_SNDLOWAT");
|
|
|
|
|
tcp: Don't prematurely drop receiving-only connections
If the connection was persistent and receiving-only, several (12)
sporadic device insufficient-buffer errors would cause the connection
to be dropped prematurely:
Upon ENOBUFS in tcp_output() for an ACK, the retransmission timer is
started. Nothing stops this retransmission timer on a receiving-only
connection, so the timer is guaranteed to expire and t_rxtshift is
guaranteed to be incremented. t_rxtshift is never reset to 0, since
no RTT measurement is done on a receiving-only connection. If such a
connection lived long enough (e.g. >350sec, given the RTO starts from
200ms) and suffered 12 sporadic device insufficient-buffer errors,
i.e. t_rxtshift >= 12, it would be dropped prematurely by the
retransmission timer.
We now assert that for data segments, SYNs or FINs, either the rexmit
or the persist timer was armed upon ENOBUFS, and we no longer set the
rexmit timer in other cases, i.e. ENOBUFS upon ACKs.
Discussed with: lstewart, hiren, jtl, Mike Karels
MFC after: 3 weeks
Sponsored by: Microsoft OSTC
Differential Revision: https://reviews.freebsd.org/D5872
2016-05-30 03:31:37 +00:00
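The ">350sec" figure in the commit message above can be reproduced with a short calculation, sketched here under stated assumptions. The backoff multipliers, the shift limit of 12, and the 64 second cap on a single timeout are assumptions modeled on FreeBSD's TCP timer code; none of them is given in the message itself, and the sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed limits, modeled on the TCP timer code; not stated in the message. */
#define SKETCH_MAXRXTSHIFT	12
#define SKETCH_REXMTMAX_MS	64000	/* assumed cap on one timeout */

/* Assumed exponential backoff multipliers, indexed by t_rxtshift. */
static const int sketch_backoff[SKETCH_MAXRXTSHIFT + 1] = {
	1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 512, 512, 512
};

/*
 * Sum the timeouts for shifts 0..12, i.e. the minimum time that must
 * pass before the shift counter can exceed the limit.
 */
long
sketch_ms_until_drop(long rto_ms)
{
	long total = 0;

	assert(rto_ms > 0);
	for (size_t shift = 0; shift <= SKETCH_MAXRXTSHIFT; shift++) {
		long wait = rto_ms * sketch_backoff[shift];

		if (wait > SKETCH_REXMTMAX_MS)
			wait = SKETCH_REXMTMAX_MS;
		total += wait;
	}
	return (total);
}
```

With a 200 ms starting RTO, shifts 0 through 8 contribute 102.2 s and the four capped shifts contribute 64 s each, so sketch_ms_until_drop(200) returns 358200 ms. In the failure scenario the expirations are spread out over the connection's lifetime, so this is a lower bound on how long such a connection must live, not a prediction of when it is dropped.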
|
|
|
/*
|
|
|
|
* Make sure that either retransmit or persist timer is set for SYN, FIN and
|
|
|
|
* non-ACK.
|
|
|
|
*/
|
|
|
|
#define TCP_XMIT_TIMER_ASSERT(tp, len, th_flags) \
|
2019-04-03 19:35:07 +00:00
|
|
|
KASSERT(((len) == 0 && ((th_flags) & (TH_SYN | TH_FIN)) == 0) ||\
|
2016-05-30 03:31:37 +00:00
|
|
|
tcp_timer_active((tp), TT_REXMT) || \
|
|
|
|
tcp_timer_active((tp), TT_PERSIST), \
|
|
|
|
("neither rexmt nor persist timer is set"))
|
|
|
|
|
2010-11-12 06:41:55 +00:00
|
|
|
static inline void cc_after_idle(struct tcpcb *tp);
|
|
|
|
|
In the TCP stack, the hhook(9) framework provides hooks for kernel modules
to add actions that run when a TCP frame is sent or received on a TCP
session in the ESTABLISHED state. In the base tree, this functionality is
only used for the h_ertt module, which is used by the cc_cdg, cc_chd, cc_hd,
and cc_vegas congestion control modules.
Presently, we incur overhead to check for hooks each time a TCP frame is
sent or received on an ESTABLISHED TCP session.
This change adds a new compile-time option (TCP_HHOOK) to determine whether
to include the hhook(9) framework for TCP. To retain backwards
compatibility, I added the TCP_HHOOK option to every configuration file that
already defined "options INET". (Therefore, this patch introduces no
functional change. In order to see a functional difference, you need to
compile a custom kernel without the TCP_HHOOK option.) This change will
allow users to easily exclude this functionality from their kernel, should
they wish to do so.
Note that any users who use a custom kernel configuration and use one of the
congestion control modules listed above will need to add the TCP_HHOOK
option to their kernel configuration.
Reviewed by: rrs, lstewart, hiren (previous version), sjg (makefiles only)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D8185
2016-10-12 02:16:42 +00:00
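The trade-off described in the commit message above, a per-segment check for registered hooks versus compiling the check out entirely, can be sketched in userland as follows. All sketch_* and SKETCH_* names are hypothetical, and a single function pointer stands in for the hhook(9) hook list; this only illustrates the shape of the compile-time switch.

```c
#include <assert.h>
#include <stddef.h>

/* Remove this define to model a kernel built without the option. */
#define SKETCH_TCP_HHOOK

/* Hypothetical hook head: a count plus one hook, for brevity. */
struct sketch_hhook_head {
	int	nhooks;
	void	(*fn)(void *ctx);
};

/* Example hook: counts how often it was invoked. */
void
sketch_count_hook(void *ctx)
{
	assert(ctx != NULL);
	(*(int *)ctx)++;
}

/*
 * Called once per segment. With the option defined, the cost when no
 * hook is registered is one comparison; without it, the body is empty.
 */
void
sketch_est_out(struct sketch_hhook_head *head, void *ctx)
{
#ifdef SKETCH_TCP_HHOOK
	if (head->nhooks > 0)
		head->fn(ctx);
#else
	(void)head;
	(void)ctx;
#endif
}
```

The wrapper hhook_run_tcp_est_out() in this file has the same structure: it tests hhh_nhooks before filling in the hook data, and the whole function sits inside #ifdef TCP_HHOOK.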
|
|
|
#ifdef TCP_HHOOK
|
2010-12-28 12:13:30 +00:00
|
|
|
/*
|
2012-11-07 07:00:59 +00:00
|
|
|
* Wrapper for the TCP established output helper hook.
|
2010-12-28 12:13:30 +00:00
|
|
|
*/
|
2018-06-07 18:18:13 +00:00
|
|
|
void
|
2010-12-28 12:13:30 +00:00
|
|
|
hhook_run_tcp_est_out(struct tcpcb *tp, struct tcphdr *th,
|
2016-10-06 16:28:34 +00:00
|
|
|
struct tcpopt *to, uint32_t len, int tso)
|
2010-12-28 12:13:30 +00:00
|
|
|
{
|
|
|
|
struct tcp_hhook_data hhook_data;
|
|
|
|
|
|
|
|
if (V_tcp_hhh[HHOOK_TCP_EST_OUT]->hhh_nhooks > 0) {
|
|
|
|
hhook_data.tp = tp;
|
|
|
|
hhook_data.th = th;
|
|
|
|
hhook_data.to = to;
|
|
|
|
hhook_data.len = len;
|
|
|
|
hhook_data.tso = tso;
|
|
|
|
|
|
|
|
hhook_run_hooks(V_tcp_hhh[HHOOK_TCP_EST_OUT], &hhook_data,
|
|
|
|
tp->osd);
|
|
|
|
}
|
|
|
|
}
|
2016-10-12 02:16:42 +00:00
|
|
|
#endif
|
2010-12-28 12:13:30 +00:00
|
|
|
|
2010-11-12 06:41:55 +00:00
|
|
|
/*
|
|
|
|
* CC wrapper hook functions
|
|
|
|
*/
|
|
|
|
static inline void
|
|
|
|
cc_after_idle(struct tcpcb *tp)
|
|
|
|
{
|
|
|
|
INP_WLOCK_ASSERT(tp->t_inpcb);
|
|
|
|
|
|
|
|
if (CC_ALGO(tp)->after_idle != NULL)
|
|
|
|
CC_ALGO(tp)->after_idle(tp->ccv);
|
|
|
|
}
|
2007-02-01 18:32:13 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
|
|
|
* Tcp output routine: figure out what should be sent and send it.
|
|
|
|
*/
|
|
|
|
int
|
2002-06-23 21:25:36 +00:00
|
|
|
tcp_output(struct tcpcb *tp)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
2002-06-23 21:25:36 +00:00
|
|
|
struct socket *so = tp->t_inpcb->inp_socket;
|
2016-10-06 16:28:34 +00:00
|
|
|
int32_t len;
|
|
|
|
uint32_t recwin, sendwin;
|
2011-04-30 11:21:29 +00:00
|
|
|
int off, flags, error = 0; /* Keep compiler happy */
|
2018-06-21 21:03:58 +00:00
|
|
|
u_int if_hw_tsomaxsegcount = 0;
|
2019-09-06 18:25:42 +00:00
|
|
|
u_int if_hw_tsomaxsegsize = 0;
|
2002-06-23 21:25:36 +00:00
|
|
|
struct mbuf *m;
|
2000-01-09 19:17:30 +00:00
|
|
|
struct ip *ip = NULL;
|
2017-12-25 04:48:39 +00:00
|
|
|
#ifdef TCPDEBUG
|
2002-06-23 21:25:36 +00:00
|
|
|
struct ipovly *ipov = NULL;
|
2017-12-25 04:48:39 +00:00
|
|
|
#endif
|
2002-06-23 21:25:36 +00:00
|
|
|
struct tcphdr *th;
|
1995-02-09 23:13:27 +00:00
|
|
|
u_char opt[TCP_MAXOLEN];
|
1998-05-24 18:41:04 +00:00
|
|
|
unsigned ipoptlen, optlen, hdrlen;
|
2017-02-06 08:49:57 +00:00
|
|
|
#if defined(IPSEC) || defined(IPSEC_SUPPORT)
|
2007-11-21 22:30:14 +00:00
|
|
|
unsigned ipsec_optlen = 0;
|
|
|
|
#endif
|
2018-05-08 02:22:34 +00:00
|
|
|
int idle, sendalot, curticks;
|
2007-03-15 15:59:28 +00:00
|
|
|
int sack_rxmit, sack_bytes_rxmt;
|
2004-06-23 21:04:37 +00:00
|
|
|
struct sackhole *p;
|
2012-07-16 07:08:34 +00:00
|
|
|
int tso, mtu;
|
2007-03-15 15:59:28 +00:00
|
|
|
struct tcpopt to;
|
2018-02-26 02:53:22 +00:00
|
|
|
unsigned int wanted_cookie = 0;
|
|
|
|
unsigned int dont_sendalot = 0;
|
2001-12-02 08:49:29 +00:00
|
|
|
#if 0
|
2000-05-06 03:31:09 +00:00
|
|
|
int maxburst = TCP_MAXBURST;
|
2001-12-02 08:49:29 +00:00
|
|
|
#endif
|
2000-01-09 19:17:30 +00:00
|
|
|
#ifdef INET6
|
2002-06-23 21:25:36 +00:00
|
|
|
struct ip6_hdr *ip6 = NULL;
|
2000-01-09 19:17:30 +00:00
|
|
|
int isipv6;
|
|
|
|
|
|
|
|
isipv6 = (tp->t_inpcb->inp_vflag & INP_IPV6) != 0;
|
|
|
|
#endif
|
Add kernel-side support for in-kernel TLS.
KTLS adds support for in-kernel framing and encryption of Transport
Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports
offload of TLS for transmitted data. Key negotiation must still be
performed in userland. Once completed, transmit session keys for a
connection are provided to the kernel via a new TCP_TXTLS_ENABLE
socket option. All subsequent data transmitted on the socket is
placed into TLS frames and encrypted using the supplied keys.
Any data written to a KTLS-enabled socket via write(2), aio_write(2),
or sendfile(2) is assumed to be application data and is encoded in TLS
frames with an application data type. Individual records can be sent
with a custom type (e.g. handshake messages) via sendmsg(2) with a new
control message (TLS_SET_RECORD_TYPE) specifying the record type.
At present, rekeying is not supported though the in-kernel framework
should support rekeying.
KTLS makes use of the recently added unmapped mbufs to store TLS
frames in the socket buffer. Each TLS frame is described by a single
ext_pgs mbuf. The ext_pgs structure contains the header of the TLS
record (and trailer for encrypted records) as well as references to
the associated TLS session.
KTLS supports two primary methods of encrypting TLS frames: software
TLS and ifnet TLS.
Software TLS marks mbufs holding socket data as not ready via
M_NOTREADY similar to sendfile(2) when TLS framing information is
added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then
called to schedule TLS frames for encryption. In the sendfile(2) case,
sendfile_iodone() calls ktls_enqueue() instead of pru_ready(), leaving
the mbufs marked M_NOTREADY until encryption is completed. For other
writes (vn_sendfile when pages are available, write(2), etc.), the
PRUS_NOTREADY flag is set when invoking pru_send(), and ktls_enqueue()
is invoked as well.
A pool of worker threads (the "KTLS" kernel process) encrypts TLS
frames queued via ktls_enqueue(). Each TLS frame is temporarily
mapped using the direct map and passed to a software encryption
backend to perform the actual encryption.
(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if
someone wished to make this work on architectures without a direct
map.)
KTLS supports pluggable software encryption backends. Internally,
Netflix uses proprietary pure-software backends. This commit includes
a simple backend in a new ktls_ocf.ko module that uses the kernel's
OpenCrypto framework to provide AES-GCM encryption of TLS frames. As
a result, software TLS is now a bit of a misnomer as it can make use
of hardware crypto accelerators.
Once software encryption has finished, the TLS frame mbufs are marked
ready via pru_ready(). At this point, the encrypted data appears as
regular payload to the TCP stack stored in unmapped mbufs.
ifnet TLS permits a NIC to offload the TLS encryption and TCP
segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS)
is allocated on the interface a socket is routed over and associated
with a TLS session. TLS records for a TLS session using ifnet TLS are
not marked M_NOTREADY but are passed down the stack unencrypted. The
ip_output_send() and ip6_output_send() helper functions that apply
send tags to outbound IP packets verify that the send tag of the TLS
record matches the outbound interface. If so, the packet is tagged
with the TLS send tag and sent to the interface. The NIC device
driver must recognize packets with the TLS send tag and schedule them
for TLS encryption and TCP segmentation. If the outbound
interface does not match the interface in the TLS send tag, the packet
is dropped. In addition, a task is scheduled to refresh the TLS send
tag for the TLS session. If a new TLS send tag cannot be allocated,
the connection is dropped. If a new TLS send tag is allocated,
however, subsequent packets will be tagged with the correct TLS send
tag. (This latter case has been tested by configuring both ports of a
Chelsio T6 in a lagg and failing over from one port to another. As
the connections migrated to the new port, new TLS send tags were
allocated for the new port and connections resumed without being
dropped.)
ifnet TLS can be enabled and disabled on supported network interfaces
via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported
across both vlan devices and lagg interfaces using failover, lacp with
flowid enabled, or lacp with flowid enabled.
Applications may request the current KTLS mode of a connection via a
new TCP_TXTLS_MODE socket option. They can also use this socket
option to toggle between software and ifnet TLS modes.
In addition, a testing tool is available in tools/tools/switch_tls.
This is modeled on tcpdrop and uses similar syntax. However, instead
of dropping connections, -s is used to force KTLS connections to
switch to software TLS and -i is used to switch to ifnet TLS.
Various sysctls and counters are available under the kern.ipc.tls
sysctl node. The kern.ipc.tls.enable node must be set to true to
enable KTLS (it is off by default). The use of unmapped mbufs must
also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.
KTLS is enabled via the KERN_TLS kernel option.
This patch is the culmination of years of work by several folks
including Scott Long and Randall Stewart for the original design and
implementation; Drew Gallatin for several optimizations including the
use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records
awaiting software encryption, and pluggable software crypto backends;
and John Baldwin for modifications to support hardware TLS offload.
Reviewed by: gallatin, hselasky, rrs
Obtained from: Netflix
Sponsored by: Netflix, Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D21277
2019-08-27 00:01:56 +00:00
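The commit message above says that written data is placed into TLS frames with an application data type. As a rough illustration of what framing involves, the sketch below builds the 5-byte record header defined by the TLS 1.2 record layer and counts how many records a write of a given size needs. It models only the unencrypted header layout; the sketch_* and SKETCH_* names are hypothetical, this is not the kernel's ktls_frame(), and real encrypted records also carry a nonce and authentication tag that are ignored here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_TLS_HDR_LEN	5
#define SKETCH_TLS_TYPE_APP	23	/* application_data content type */
#define SKETCH_TLS_MAX_PLAIN	16384	/* 2^14 bytes of plaintext per record */

/*
 * Fill in a TLS 1.2 record header: type, version 3.3, big-endian length.
 * Returns 0 on success, -1 if the fragment exceeds the plaintext limit.
 */
int
sketch_tls_header(uint8_t hdr[SKETCH_TLS_HDR_LEN], uint8_t type,
    size_t frag_len)
{
	assert(hdr != NULL);
	if (frag_len > SKETCH_TLS_MAX_PLAIN)
		return (-1);
	hdr[0] = type;
	hdr[1] = 0x03;			/* major version */
	hdr[2] = 0x03;			/* minor version (TLS 1.2) */
	hdr[3] = (uint8_t)(frag_len >> 8);
	hdr[4] = (uint8_t)(frag_len & 0xff);
	return (0);
}

/* Number of records needed to carry nbytes of application data. */
size_t
sketch_tls_record_count(size_t nbytes)
{
	return ((nbytes + SKETCH_TLS_MAX_PLAIN - 1) / SKETCH_TLS_MAX_PLAIN);
}
```

A record sent with a custom type, as the message describes for sendmsg(2) with TLS_SET_RECORD_TYPE, would differ from this sketch only in the first header byte.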
|
|
|
#ifdef KERN_TLS
|
|
|
|
const bool hw_tls = (so->so_snd.sb_flags & SB_TLS_IFNET) != 0;
|
|
|
|
#else
|
|
|
|
const bool hw_tls = false;
|
|
|
|
#endif
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2020-01-22 05:53:16 +00:00
|
|
|
NET_EPOCH_ASSERT();
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK_ASSERT(tp->t_inpcb);
|
2002-08-12 03:22:46 +00:00
|
|
|
|
2012-06-19 07:34:13 +00:00
|
|
|
#ifdef TCP_OFFLOAD
|
|
|
|
if (tp->t_flags & TF_TOE)
|
|
|
|
return (tcp_offload_output(tp));
|
|
|
|
#endif
|
|
|
|
|
2015-12-24 19:09:48 +00:00
|
|
|
/*
|
Fix some TCP fast open issues.
The following issues are fixed:
* Whenever a TCP server with TCP fast open enabled calls accept(),
recv(), send(), and close() before the TCP-ACK segment has been received,
the TCP connection is just dropped and the reception of the TCP-ACK
segment triggers the sending of a TCP-RST segment.
* Whenever a TCP server with TCP fast open enabled calls accept(), recv(),
send(), send(), and close() before the TCP-ACK segment has been received,
the first byte provided in the second send call is not transferred.
* Whenever a TCP client with TCP fast open enabled calls sendto() followed
by close(), the TCP connection is just dropped.
Reviewed by: jtl@, kbowling@, rrs@
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D16485
2018-07-30 20:35:50 +00:00
|
|
|
* For TFO connections in SYN_SENT or SYN_RECEIVED,
|
|
|
|
* only allow the initial SYN or SYN|ACK and those sent
|
|
|
|
* by the retransmit timer.
|
2015-12-24 19:09:48 +00:00
|
|
|
*/
|
2016-10-12 19:06:50 +00:00
|
|
|
if (IS_FASTOPEN(tp->t_flags) &&
|
2018-07-30 20:35:50 +00:00
|
|
|
((tp->t_state == TCPS_SYN_SENT) ||
|
|
|
|
(tp->t_state == TCPS_SYN_RECEIVED)) &&
|
|
|
|
SEQ_GT(tp->snd_max, tp->snd_una) && /* initial SYN or SYN|ACK sent */
|
|
|
|
(tp->snd_nxt != tp->snd_una)) /* not a retransmit */
|
2015-12-24 19:09:48 +00:00
|
|
|
return (0);
|
2018-02-26 03:03:41 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
|
|
|
* Determine length of data that should be transmitted,
|
|
|
|
* and flags that will be used.
|
|
|
|
* If there is some data or critical controls (SYN, RST)
|
|
|
|
* to send, then transmit; otherwise, investigate further.
|
|
|
|
*/
|
2001-10-05 21:33:38 +00:00
|
|
|
idle = (tp->t_flags & TF_LASTIDLE) || (tp->snd_max == tp->snd_una);
|
2020-06-24 13:42:42 +00:00
|
|
|
if (idle && (((ticks - tp->t_rcvtime) >= tp->t_rxtcur) ||
|
|
|
|
(tp->t_sndtime && ((ticks - tp->t_sndtime) >= tp->t_rxtcur))))
|
2010-12-02 01:36:00 +00:00
|
|
|
cc_after_idle(tp);
|
2001-10-05 21:33:38 +00:00
|
|
|
tp->t_flags &= ~TF_LASTIDLE;
|
|
|
|
if (idle) {
|
|
|
|
if (tp->t_flags & TF_MORETOCOME) {
|
|
|
|
tp->t_flags |= TF_LASTIDLE;
|
|
|
|
idle = 0;
|
|
|
|
}
|
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
again:
|
2004-06-23 21:04:37 +00:00
|
|
|
/*
|
|
|
|
* If we've recently taken a timeout, snd_max will be greater than
|
|
|
|
* snd_nxt. There may be SACK information that allows us to avoid
|
|
|
|
* resending already delivered data. Adjust snd_nxt accordingly.
|
|
|
|
*/
|
2007-05-06 15:56:31 +00:00
|
|
|
if ((tp->t_flags & TF_SACK_PERMIT) &&
|
|
|
|
SEQ_LT(tp->snd_nxt, tp->snd_max))
|
2004-06-23 21:04:37 +00:00
|
|
|
tcp_sack_adjust(tp);
|
1994-05-24 10:09:53 +00:00
|
|
|
sendalot = 0;
|
2010-08-14 21:41:33 +00:00
|
|
|
tso = 0;
|
2012-07-16 07:08:34 +00:00
|
|
|
mtu = 0;
|
1994-05-24 10:09:53 +00:00
|
|
|
off = tp->snd_nxt - tp->snd_una;
|
2004-01-22 23:22:14 +00:00
|
|
|
sendwin = min(tp->snd_wnd, tp->snd_cwnd);
|
1994-05-24 10:09:53 +00:00
|
|
|
|
|
|
|
flags = tcp_outflags[tp->t_state];
|
2004-06-23 21:04:37 +00:00
|
|
|
/*
|
|
|
|
* Send any SACK-generated retransmissions. If we're explicitly trying
|
|
|
|
* to send out new data (when sendalot is 1), bypass this function.
|
|
|
|
* If we retransmit in fast recovery mode, decrement snd_cwnd, since
|
|
|
|
* we're replacing a (future) new transmission with a retransmission
|
|
|
|
* now, and we previously incremented snd_cwnd in tcp_input().
|
|
|
|
*/
|
2004-08-16 18:32:07 +00:00
|
|
|
/*
|
2004-06-23 21:04:37 +00:00
|
|
|
* Still in sack recovery, reset rxmit flag to zero.
|
|
|
|
*/
|
|
|
|
sack_rxmit = 0;
|
2004-10-05 18:36:24 +00:00
|
|
|
sack_bytes_rxmt = 0;
|
2004-06-23 21:04:37 +00:00
|
|
|
len = 0;
|
|
|
|
p = NULL;
|
2010-11-12 06:41:55 +00:00
|
|
|
if ((tp->t_flags & TF_SACK_PERMIT) && IN_FASTRECOVERY(tp->t_flags) &&
|
2004-10-05 18:36:24 +00:00
|
|
|
(p = tcp_sack_output(tp, &sack_bytes_rxmt))) {
|
2016-10-06 16:28:34 +00:00
|
|
|
uint32_t cwin;
|
2020-02-12 13:31:36 +00:00
|
|
|
|
2016-10-06 16:28:34 +00:00
|
|
|
cwin =
|
|
|
|
imax(min(tp->snd_wnd, tp->snd_cwnd) - sack_bytes_rxmt, 0);
|
2004-07-19 22:06:01 +00:00
|
|
|
/* Do not retransmit SACK segments beyond snd_recover */
|
|
|
|
if (SEQ_GT(p->end, tp->snd_recover)) {
|
|
|
|
/*
|
2004-08-16 18:32:07 +00:00
|
|
|
* (At least) part of sack hole extends beyond
|
|
|
|
* snd_recover. Check to see if we can rexmit data
|
2004-07-19 22:06:01 +00:00
|
|
|
* for this hole.
|
|
|
|
*/
|
|
|
|
if (SEQ_GEQ(p->rxmit, tp->snd_recover)) {
|
2004-08-16 18:32:07 +00:00
|
|
|
/*
|
2004-07-19 22:06:01 +00:00
|
|
|
* Can't rexmit any more data for this hole.
|
2004-08-16 18:32:07 +00:00
|
|
|
* That data will be rexmitted in the next
|
|
|
|
* sack recovery episode, when snd_recover
|
2004-07-19 22:06:01 +00:00
|
|
|
* moves past p->rxmit.
|
|
|
|
*/
|
|
|
|
p = NULL;
|
|
|
|
goto after_sack_rexmit;
|
|
|
|
} else
|
|
|
|
/* Can rexmit part of the current hole */
|
2016-10-06 16:28:34 +00:00
|
|
|
len = ((int32_t)ulmin(cwin,
|
2004-10-05 18:36:24 +00:00
|
|
|
tp->snd_recover - p->rxmit));
|
2004-07-19 22:06:01 +00:00
|
|
|
} else
|
2016-10-06 16:28:34 +00:00
|
|
|
len = ((int32_t)ulmin(cwin, p->end - p->rxmit));
|
2004-06-23 21:04:37 +00:00
|
|
|
off = p->rxmit - tp->snd_una;
|
2004-08-16 18:32:07 +00:00
|
|
|
KASSERT(off >= 0,("%s: sack block to the left of una : %d",
|
2004-07-19 22:06:01 +00:00
|
|
|
__func__, off));
|
2004-06-23 21:04:37 +00:00
|
|
|
if (len > 0) {
|
2004-10-30 12:02:50 +00:00
|
|
|
sack_rxmit = 1;
|
|
|
|
sendalot = 1;
|
2009-04-11 22:07:19 +00:00
|
|
|
TCPSTAT_INC(tcps_sack_rexmits);
|
|
|
|
TCPSTAT_ADD(tcps_sack_rexmit_bytes,
|
2020-10-09 12:44:56 +00:00
|
|
|
min(len, tcp_maxseg(tp)));
|
2004-06-23 21:04:37 +00:00
|
|
|
}
|
|
|
|
}
|
2004-07-19 22:06:01 +00:00
|
|
|
after_sack_rexmit:
|
1995-02-09 23:13:27 +00:00
|
|
|
/*
|
|
|
|
* Get standard flags, and add SYN or FIN if requested by 'hidden'
|
|
|
|
* state flags.
|
|
|
|
*/
|
|
|
|
if (tp->t_flags & TF_NEEDFIN)
|
|
|
|
flags |= TH_FIN;
|
|
|
|
if (tp->t_flags & TF_NEEDSYN)
|
|
|
|
flags |= TH_SYN;
|
|
|
|
|
2004-10-09 16:48:51 +00:00
|
|
|
SOCKBUF_LOCK(&so->so_snd);
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
|
|
|
* If in persist timeout with window of 0, send 1 byte.
|
|
|
|
* Otherwise, if window is small but nonzero
|
|
|
|
* and timer expired, we will send what we can
|
|
|
|
* and go to transmit state.
|
|
|
|
*/
|
2005-05-21 00:38:29 +00:00
|
|
|
if (tp->t_flags & TF_FORCEDATA) {
|
2004-01-22 23:22:14 +00:00
|
|
|
if (sendwin == 0) {
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
|
|
|
* If we still have some data to send, then
|
|
|
|
* clear the FIN bit. Usually this would
|
|
|
|
* happen below when it realizes that we
|
|
|
|
* aren't sending all the data. However,
|
1999-04-07 22:22:06 +00:00
|
|
|
* if we have exactly 1 byte of unsent data,
|
1994-05-24 10:09:53 +00:00
|
|
|
* then it won't clear the FIN bit below,
|
|
|
|
* and if we are in persist state, we wind
|
|
|
|
* up sending the packet without recording
|
|
|
|
* that we sent the FIN bit.
|
|
|
|
*
|
|
|
|
* We can't just blindly clear the FIN bit,
|
|
|
|
* because if we don't have any more data
|
|
|
|
* to send then the probe will be the FIN
|
|
|
|
* itself.
|
|
|
|
*/
|
2014-11-12 09:57:15 +00:00
|
|
|
if (off < sbused(&so->so_snd))
|
1994-05-24 10:09:53 +00:00
|
|
|
flags &= ~TH_FIN;
|
2004-01-22 23:22:14 +00:00
|
|
|
sendwin = 1;
|
1994-05-24 10:09:53 +00:00
|
|
|
} else {
|
2007-04-11 09:45:16 +00:00
|
|
|
tcp_timer_activate(tp, TT_PERSIST, 0);
|
1994-05-24 10:09:53 +00:00
|
|
|
tp->t_rxtshift = 0;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2002-10-10 19:21:50 +00:00
|
|
|
/*
|
2004-08-16 18:32:07 +00:00
|
|
|
* If snd_nxt == snd_max and we have transmitted a FIN, the
|
2002-10-10 19:21:50 +00:00
|
|
|
* offset will be > 0 even if so_snd.sb_cc is 0, resulting in
|
2004-01-22 23:22:14 +00:00
|
|
|
* a negative length. This can also occur when TCP opens up
|
2002-10-10 19:21:50 +00:00
|
|
|
* its congestion window while receiving additional duplicate
|
|
|
|
* acks after fast-retransmit because TCP will reset snd_nxt
|
|
|
|
* to snd_max after the fast-retransmit.
|
|
|
|
*
|
|
|
|
* In the normal retransmit-FIN-only case, however, snd_nxt will
|
|
|
|
* be set to snd_una, the offset will be 0, and the length may
|
|
|
|
* wind up 0.
|
2004-08-16 18:32:07 +00:00
|
|
|
*
|
2004-06-23 21:04:37 +00:00
|
|
|
* If sack_rxmit is true we are retransmitting from the scoreboard
|
2004-08-16 18:32:07 +00:00
|
|
|
* in which case len is already set.
|
2002-10-10 19:21:50 +00:00
|
|
|
*/
|
2004-10-05 18:36:24 +00:00
|
|
|
if (sack_rxmit == 0) {
|
|
|
|
if (sack_bytes_rxmt == 0)
|
2017-07-05 16:10:30 +00:00
|
|
|
len = ((int32_t)min(sbavail(&so->so_snd), sendwin) -
|
2014-11-12 09:57:15 +00:00
|
|
|
off);
|
2004-10-05 18:36:24 +00:00
|
|
|
else {
|
2016-10-06 16:28:34 +00:00
|
|
|
int32_t cwin;
|
2004-10-05 18:36:24 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We are inside of a SACK recovery episode and are
|
|
|
|
* sending new data, having retransmitted all the
|
|
|
|
* data possible in the scoreboard.
|
|
|
|
*/
|
2016-10-06 16:28:34 +00:00
|
|
|
len = ((int32_t)min(sbavail(&so->so_snd), tp->snd_wnd) -
|
2014-11-12 09:57:15 +00:00
|
|
|
off);
|
2005-01-12 21:40:51 +00:00
|
|
|
/*
|
|
|
|
* Don't remove this (len > 0) check!
|
2020-02-12 13:31:36 +00:00
|
|
|
* We explicitly check for len > 0 here (although it
|
|
|
|
* isn't really necessary), to work around a gcc
|
2005-01-12 21:40:51 +00:00
|
|
|
* optimization issue - to force gcc to compute
|
|
|
|
* len above. Without this check, the computation
|
|
|
|
* of len is bungled by the optimizer.
|
|
|
|
*/
|
|
|
|
if (len > 0) {
|
2020-02-12 13:31:36 +00:00
|
|
|
cwin = tp->snd_cwnd -
|
2020-02-13 15:14:46 +00:00
|
|
|
(tp->snd_nxt - tp->snd_recover) -
|
2005-01-12 21:40:51 +00:00
|
|
|
sack_bytes_rxmt;
|
|
|
|
if (cwin < 0)
|
|
|
|
cwin = 0;
|
2016-10-06 16:28:34 +00:00
|
|
|
len = imin(len, cwin);
|
2005-01-12 21:40:51 +00:00
|
|
|
}
|
2004-10-05 18:36:24 +00:00
|
|
|
}
|
|
|
|
}
|
1995-02-09 23:13:27 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Lop off SYN bit if it has already been sent. However, if this
|
|
|
|
* is SYN-SENT state and if segment contains data and if we don't
|
|
|
|
* know that foreign host supports TAO, suppress sending segment.
|
|
|
|
*/
|
|
|
|
if ((flags & TH_SYN) && SEQ_GT(tp->snd_nxt, tp->snd_una)) {
|
2006-02-23 21:14:34 +00:00
|
|
|
if (tp->t_state != TCPS_SYN_RECEIVED)
|
|
|
|
flags &= ~TH_SYN;
|
2015-12-24 19:09:48 +00:00
|
|
|
/*
|
|
|
|
* When sending additional segments following a TFO SYN|ACK,
|
|
|
|
* do not include the SYN bit.
|
|
|
|
*/
|
2016-10-12 19:06:50 +00:00
|
|
|
if (IS_FASTOPEN(tp->t_flags) &&
|
2015-12-24 19:09:48 +00:00
|
|
|
(tp->t_state == TCPS_SYN_RECEIVED))
|
|
|
|
flags &= ~TH_SYN;
|
1995-02-09 23:13:27 +00:00
|
|
|
off--, len++;
|
|
|
|
}
|
|
|
|
|
1996-01-17 09:35:23 +00:00
|
|
|
/*
|
2004-11-02 22:22:22 +00:00
|
|
|
* Be careful not to send data and/or FIN on SYN segments.
|
1996-01-17 09:35:23 +00:00
|
|
|
* This measure is needed to prevent interoperability problems
|
|
|
|
* with not fully conformant TCP implementations.
|
|
|
|
*/
|
2004-11-02 22:22:22 +00:00
|
|
|
if ((flags & TH_SYN) && (tp->t_flags & TF_NOOPT)) {
|
1996-01-17 09:35:23 +00:00
|
|
|
len = 0;
|
|
|
|
flags &= ~TH_FIN;
|
|
|
|
}
|
|
|
|
|
2015-12-24 19:09:48 +00:00
|
|
|
/*
|
2018-02-26 02:53:22 +00:00
|
|
|
* On TFO sockets, ensure no data is sent in the following cases:
|
|
|
|
*
|
|
|
|
* - When retransmitting SYN|ACK on a passively-created socket
|
|
|
|
*
|
|
|
|
* - When retransmitting SYN on an actively created socket
|
|
|
|
*
|
|
|
|
* - When sending a zero-length cookie (cookie request) on an
|
|
|
|
* actively created socket
|
|
|
|
*
|
|
|
|
* - When the socket is in the CLOSED state (RST is being sent)
|
2015-12-24 19:09:48 +00:00
|
|
|
*/
|
2016-10-12 19:06:50 +00:00
|
|
|
if (IS_FASTOPEN(tp->t_flags) &&
|
2018-02-26 02:53:22 +00:00
|
|
|
(((flags & TH_SYN) && (tp->t_rxtshift > 0)) ||
|
|
|
|
((tp->t_state == TCPS_SYN_SENT) &&
|
|
|
|
(tp->t_tfo_client_cookie_len == 0)) ||
|
2015-12-24 19:09:48 +00:00
|
|
|
(flags & TH_RST)))
|
|
|
|
len = 0;
|
2015-07-21 23:42:15 +00:00
|
|
|
if (len <= 0) {
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
|
|
|
* If FIN has been sent but not acked,
|
|
|
|
* but we haven't been called to retransmit,
|
2002-10-10 19:21:50 +00:00
|
|
|
* len will be < 0. Otherwise, window shrank
|
1994-05-24 10:09:53 +00:00
|
|
|
* after we sent into it. If window shrank to 0,
|
1996-04-15 03:46:33 +00:00
|
|
|
* cancel pending retransmit, pull snd_nxt back
|
|
|
|
* to (closed) window, and set the persist timer
|
|
|
|
* if it isn't already going. If the window didn't
|
|
|
|
* close completely, just wait for an ACK.
|
2015-07-21 23:42:15 +00:00
|
|
|
*
|
|
|
|
* We also do a general check here to ensure that
|
|
|
|
* we will set the persist timer when we have data
|
|
|
|
* to send, but a 0-byte window. This makes sure
|
|
|
|
* the persist timer is set even if the packet
|
|
|
|
* hits one of the "goto send" lines below.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
|
|
|
len = 0;
|
2015-07-21 23:42:15 +00:00
|
|
|
if ((sendwin == 0) && (TCPS_HAVEESTABLISHED(tp->t_state)) &&
|
|
|
|
(off < (int) sbavail(&so->so_snd))) {
|
2007-04-11 09:45:16 +00:00
|
|
|
tcp_timer_activate(tp, TT_REXMT, 0);
|
1996-04-15 03:46:33 +00:00
|
|
|
tp->t_rxtshift = 0;
|
1994-05-24 10:09:53 +00:00
|
|
|
tp->snd_nxt = tp->snd_una;
|
2007-04-11 09:45:16 +00:00
|
|
|
if (!tcp_timer_active(tp, TT_PERSIST))
|
1996-04-15 03:46:33 +00:00
|
|
|
tcp_setpersist(tp);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
}
|
2002-10-10 19:21:50 +00:00
|
|
|
|
2007-02-01 18:32:13 +00:00
|
|
|
/* len will be >= 0 after this point. */
|
2008-09-07 11:38:30 +00:00
|
|
|
KASSERT(len >= 0, ("[%s:%d]: len < 0", __func__, __LINE__));
|
2007-02-01 18:32:13 +00:00
|
|
|
|
2017-12-07 22:36:58 +00:00
|
|
|
tcp_sndbuf_autoscale(tp, so, sendwin);
|
2007-02-01 18:32:13 +00:00
|
|
|
|
|
|
|
/*
|
2010-09-17 22:05:27 +00:00
|
|
|
* Decide if we can use TCP Segmentation Offloading (if supported by
|
|
|
|
* hardware).
|
2006-09-07 12:53:01 +00:00
|
|
|
*
|
|
|
|
* TSO may only be used if we are in a pure bulk sending state. The
|
|
|
|
* presence of TCP-MD5, SACK retransmits, SACK advertisements and
|
|
|
|
* IP options prevent using TSO. With TSO the TCP header is the same
|
|
|
|
* (except for the sequence number) for all generated packets. This
|
|
|
|
* makes it impossible to transmit any options which vary per generated
|
|
|
|
* segment or packet.
|
2016-12-11 23:14:47 +00:00
|
|
|
*
|
|
|
|
* IPv4 handling has a clear separation of ip options and ip header
|
|
|
|
* flags while IPv6 combines both in in6p_outputopts. ip6_optlen() does
|
|
|
|
* the right thing below to provide length of just ip options and thus
|
|
|
|
* checking for ipoptlen is enough to decide if ip options are present.
|
2002-10-10 19:21:50 +00:00
|
|
|
*/
|
2017-02-06 08:49:57 +00:00
|
|
|
#if defined(IPSEC) || defined(IPSEC_SUPPORT)
|
2007-11-21 22:30:14 +00:00
|
|
|
/*
|
|
|
|
* Pre-calculate here as we save another lookup into the darknesses
|
|
|
|
* of IPsec that way and can actually decide if TSO is ok.
|
|
|
|
*/
|
2017-02-06 08:49:57 +00:00
|
|
|
#ifdef INET6
|
|
|
|
if (isipv6 && IPSEC_ENABLED(ipv6))
|
|
|
|
ipsec_optlen = IPSEC_HDRSIZE(ipv6, tp->t_inpcb);
|
|
|
|
#ifdef INET
|
|
|
|
else
|
2007-11-21 22:30:14 +00:00
|
|
|
#endif
|
2017-02-06 08:49:57 +00:00
|
|
|
#endif /* INET6 */
|
|
|
|
#ifdef INET
|
|
|
|
if (IPSEC_ENABLED(ipv4))
|
|
|
|
ipsec_optlen = IPSEC_HDRSIZE(ipv4, tp->t_inpcb);
|
|
|
|
#endif /* INET */
|
|
|
|
#endif /* IPSEC */
|
2016-12-11 23:14:47 +00:00
|
|
|
#ifdef INET6
|
|
|
|
if (isipv6)
|
|
|
|
ipoptlen = ip6_optlen(tp->t_inpcb);
|
|
|
|
else
|
|
|
|
#endif
|
|
|
|
if (tp->t_inpcb->inp_options)
|
|
|
|
ipoptlen = tp->t_inpcb->inp_options->m_len -
|
|
|
|
offsetof(struct ipoption, ipopt_list);
|
|
|
|
else
|
|
|
|
ipoptlen = 0;
|
2017-02-06 08:49:57 +00:00
|
|
|
#if defined(IPSEC) || defined(IPSEC_SUPPORT)
|
2016-12-11 23:14:47 +00:00
|
|
|
ipoptlen += ipsec_optlen;
|
|
|
|
#endif
|
|
|
|
|
2010-09-17 22:05:27 +00:00
|
|
|
if ((tp->t_flags & TF_TSO) && V_tcp_do_tso && len > tp->t_maxseg &&
|
|
|
|
((tp->t_flags & TF_SIGNATURE) == 0) &&
|
|
|
|
tp->rcv_numsacks == 0 && sack_rxmit == 0 &&
|
2018-02-26 02:53:22 +00:00
|
|
|
ipoptlen == 0 && !(flags & TH_SYN))
|
2010-09-17 22:05:27 +00:00
|
|
|
tso = 1;
|
2010-08-14 21:41:33 +00:00
|
|
|
|
2004-07-28 02:15:14 +00:00
|
|
|
if (sack_rxmit) {
|
2014-11-12 09:57:15 +00:00
|
|
|
if (SEQ_LT(p->rxmit + len, tp->snd_una + sbused(&so->so_snd)))
|
2004-07-28 02:15:14 +00:00
|
|
|
flags &= ~TH_FIN;
|
2004-08-16 18:32:07 +00:00
|
|
|
} else {
|
2014-11-12 09:57:15 +00:00
|
|
|
if (SEQ_LT(tp->snd_nxt + len, tp->snd_una +
|
|
|
|
sbused(&so->so_snd)))
|
2004-07-28 02:15:14 +00:00
|
|
|
flags &= ~TH_FIN;
|
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2016-10-06 16:28:34 +00:00
|
|
|
recwin = lmin(lmax(sbspace(&so->so_rcv), 0),
|
|
|
|
(long)TCP_MAXWIN << tp->rcv_scale);
|
1994-05-24 10:09:53 +00:00
|
|
|
|
|
|
|
/*
|
2001-12-02 08:49:29 +00:00
|
|
|
* Sender silly window avoidance. We transmit under the following
|
|
|
|
* conditions when len is non-zero:
|
|
|
|
*
|
2006-09-07 12:53:01 +00:00
|
|
|
* - We have a full segment (or more with TSO)
|
2001-12-13 04:02:31 +00:00
|
|
|
* - This is the last buffer in a write()/send() and we are
|
|
|
|
* either idle or running NODELAY
|
|
|
|
* - we've timed out (e.g. persist timer)
|
|
|
|
* - we have more than 1/2 the maximum send window's worth of
|
|
|
|
* data (receiver may have limited the window size)
|
|
|
|
* - we need to retransmit
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
|
|
|
if (len) {
|
2006-09-07 12:53:01 +00:00
|
|
|
if (len >= tp->t_maxseg)
|
1994-05-24 10:09:53 +00:00
|
|
|
goto send;
|
2020-09-25 10:38:19 +00:00
|
|
|
/*
|
|
|
|
* As the TCP header options are now
|
|
|
|
* considered when setting up the initial
|
|
|
|
* window, we would not send the last segment
|
|
|
|
* if we skip considering the option length here.
|
|
|
|
* Note: this may not work when tcp headers change
|
|
|
|
* very dynamically in the future.
|
|
|
|
*/
|
|
|
|
if ((((tp->t_flags & TF_SIGNATURE) ?
|
|
|
|
PADTCPOLEN(TCPOLEN_SIGNATURE) : 0) +
|
|
|
|
((tp->t_flags & TF_RCVD_TSTMP) ?
|
|
|
|
PADTCPOLEN(TCPOLEN_TIMESTAMP) : 0) +
|
|
|
|
len) >= tp->t_maxseg)
|
|
|
|
goto send;
|
2001-12-02 08:49:29 +00:00
|
|
|
/*
|
|
|
|
* NOTE! on localhost connections an 'ack' from the remote
|
|
|
|
* end may occur synchronously with the output and cause
|
|
|
|
* us to flush a buffer queued with moretocome. XXX
|
|
|
|
*
|
|
|
|
* note: the len + off check is almost certainly unnecessary.
|
|
|
|
*/
|
2001-12-13 04:02:31 +00:00
|
|
|
if (!(tp->t_flags & TF_MORETOCOME) && /* normal case */
|
2001-12-02 08:49:29 +00:00
|
|
|
(idle || (tp->t_flags & TF_NODELAY)) &&
|
2016-10-06 16:28:34 +00:00
|
|
|
(uint32_t)len + (uint32_t)off >= sbavail(&so->so_snd) &&
|
2001-12-02 08:49:29 +00:00
|
|
|
(tp->t_flags & TF_NOPUSH) == 0) {
|
1994-05-24 10:09:53 +00:00
|
|
|
goto send;
|
2001-12-02 08:49:29 +00:00
|
|
|
}
|
2005-05-21 00:38:29 +00:00
|
|
|
if (tp->t_flags & TF_FORCEDATA) /* typ. timeout case */
|
1994-05-24 10:09:53 +00:00
|
|
|
goto send;
|
1995-02-09 23:13:27 +00:00
|
|
|
if (len >= tp->max_sndwnd / 2 && tp->max_sndwnd > 0)
|
1994-05-24 10:09:53 +00:00
|
|
|
goto send;
|
2001-12-02 08:49:29 +00:00
|
|
|
if (SEQ_LT(tp->snd_nxt, tp->snd_max)) /* retransmit case */
|
1994-05-24 10:09:53 +00:00
|
|
|
goto send;
|
2004-06-23 21:04:37 +00:00
|
|
|
if (sack_rxmit)
|
|
|
|
goto send;
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2012-10-28 17:40:35 +00:00
|
|
|
* Sending of standalone window updates.
|
|
|
|
*
|
2012-10-29 13:16:33 +00:00
|
|
|
* Window updates are important when we close our window due to a
|
|
|
|
* full socket buffer and are opening it again after the application
|
2012-10-28 17:40:35 +00:00
|
|
|
* reads data from it. Once the window has opened again and the
|
|
|
|
* remote end starts to send again the ACK clock takes over and
|
|
|
|
* provides the most current window information.
|
|
|
|
*
|
2012-10-29 13:16:33 +00:00
|
|
|
* We must avoid the silly window syndrome whereby every read
|
2012-10-28 17:40:35 +00:00
|
|
|
* from the receive buffer, no matter how small, causes a window
|
|
|
|
* update to be sent. We also should avoid sending a flurry of
|
|
|
|
* window updates when the socket buffer had queued a lot of data
|
|
|
|
* and the application is doing small reads.
|
|
|
|
*
|
|
|
|
* Prevent a flurry of pointless window updates by only sending
|
|
|
|
* an update when we can increase the advertised window by more
|
|
|
|
* than 1/4th of the socket buffer capacity. When the buffer is
|
|
|
|
* getting full or is very small be more aggressive and send an
|
|
|
|
* update whenever we can increase by two mss sized segments.
|
|
|
|
* In all other situations the ACK's to new incoming data will
|
|
|
|
* carry further window increases.
|
2012-10-28 17:30:28 +00:00
|
|
|
*
|
|
|
|
* Don't send an independent window update if a delayed
|
|
|
|
* ACK is pending (it will get piggy-backed on it) or the
|
|
|
|
* remote side already has done a half-close and won't send
|
2012-10-28 17:40:35 +00:00
|
|
|
* more data. Skip this if the connection is in T/TCP
|
|
|
|
* half-open state.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
2007-06-09 19:39:14 +00:00
|
|
|
if (recwin > 0 && !(tp->t_flags & TF_NEEDSYN) &&
|
2012-10-28 17:30:28 +00:00
|
|
|
!(tp->t_flags & TF_DELACK) &&
|
2007-06-09 19:39:14 +00:00
|
|
|
!TCPS_HAVERCVDFIN(tp->t_state)) {
|
1995-05-30 08:16:23 +00:00
|
|
|
/*
|
2012-10-28 17:40:35 +00:00
|
|
|
* "adv" is the amount we could increase the window,
|
1994-05-24 10:09:53 +00:00
|
|
|
* taking into account that we are limited by
|
|
|
|
* TCP_MAXWIN << tp->rcv_scale.
|
|
|
|
*/
|
2016-10-06 16:28:34 +00:00
|
|
|
int32_t adv;
|
Handle a rare edge case with nearly full TCP receive buffers. If a TCP
buffer fills up causing the remote sender to enter into persist mode, but
there is still room available in the receive buffer when a window probe
arrives (either due to window scaling, or due to the local application
very slowly draining data from the receive buffer), then the single byte
of data in the window probe is accepted. However, this can cause rcv_nxt
to be greater than rcv_adv. This condition will only last until the next
ACK packet is pushed out via tcp_output(), and since the previous ACK
advertised a zero window, the ACK should be pushed out while the TCP
pcb is write-locked.
During the window while rcv_nxt is greater than rcv_adv, a few places
would compute the remaining receive window via rcv_adv - rcv_nxt.
However, this value was then (uint32_t)-1. On a 64 bit machine this
could expand to a positive 2^32 - 1 when cast to a long. In particular,
when calculating the receive window in tcp_output(), the result would be
that the receive window was computed as 2^32 - 1 resulting in advertising
a far larger window to the remote peer than actually existed.
Fix various places that compute the remaining receive window to either
assert that it is not negative (i.e. rcv_nxt <= rcv_adv), or treat the
window as full if rcv_nxt is greater than rcv_adv.
Reviewed by: bz
MFC after: 1 month
2011-05-02 21:05:52 +00:00
|
|
|
int oldwin;
|
|
|
|
|
2016-10-06 16:28:34 +00:00
|
|
|
adv = recwin;
|
2011-05-02 21:05:52 +00:00
|
|
|
if (SEQ_GT(tp->rcv_adv, tp->rcv_nxt)) {
|
|
|
|
oldwin = (tp->rcv_adv - tp->rcv_nxt);
|
2020-06-12 19:56:19 +00:00
|
|
|
if (adv > oldwin)
|
|
|
|
adv -= oldwin;
|
|
|
|
else
|
|
|
|
adv = 0;
|
2011-05-02 21:05:52 +00:00
|
|
|
} else
|
|
|
|
oldwin = 0;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2020-02-12 13:31:36 +00:00
|
|
|
/*
|
2016-10-06 16:09:45 +00:00
|
|
|
* If the new window size ends up being the same as or less
|
|
|
|
* than the old size when it is scaled, then don't force
|
|
|
|
* a window update.
|
2011-04-18 17:43:16 +00:00
|
|
|
*/
|
2016-10-06 16:09:45 +00:00
|
|
|
if (oldwin >> tp->rcv_scale >= (adv + oldwin) >> tp->rcv_scale)
|
2011-04-18 17:43:16 +00:00
|
|
|
goto dontupdate;
|
2012-10-28 17:40:35 +00:00
|
|
|
|
2016-10-06 16:28:34 +00:00
|
|
|
if (adv >= (int32_t)(2 * tp->t_maxseg) &&
|
|
|
|
(adv >= (int32_t)(so->so_rcv.sb_hiwat / 4) ||
|
|
|
|
recwin <= (so->so_rcv.sb_hiwat / 8) ||
|
2019-01-15 17:40:19 +00:00
|
|
|
so->so_rcv.sb_hiwat <= 8 * tp->t_maxseg ||
|
|
|
|
adv >= TCP_MAXWIN << tp->rcv_scale))
|
1994-05-24 10:09:53 +00:00
|
|
|
goto send;
|
2017-02-23 18:14:36 +00:00
|
|
|
if (2 * adv >= (int32_t)so->so_rcv.sb_hiwat)
|
|
|
|
goto send;
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
2011-04-18 17:43:16 +00:00
|
|
|
dontupdate:
|
1994-05-24 10:09:53 +00:00
|
|
|
|
|
|
|
/*
|
2002-10-10 19:21:50 +00:00
|
|
|
* Send if we owe the peer an ACK, RST, SYN, or urgent data. ACKNOW
|
|
|
|
* is also a catch-all for the retransmit timer timeout case.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
|
|
|
if (tp->t_flags & TF_ACKNOW)
|
|
|
|
goto send;
|
1995-02-09 23:13:27 +00:00
|
|
|
if ((flags & TH_RST) ||
|
|
|
|
((flags & TH_SYN) && (tp->t_flags & TF_NEEDSYN) == 0))
|
1994-05-24 10:09:53 +00:00
|
|
|
goto send;
|
|
|
|
if (SEQ_GT(tp->snd_up, tp->snd_una))
|
|
|
|
goto send;
|
|
|
|
/*
|
|
|
|
* If our state indicates that FIN should be sent
|
2002-10-10 19:21:50 +00:00
|
|
|
* and we have not yet done so, then we need to send.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
|
|
|
if (flags & TH_FIN &&
|
|
|
|
((tp->t_flags & TF_SENTFIN) == 0 || tp->snd_nxt == tp->snd_una))
|
|
|
|
goto send;
|
2004-06-23 21:04:37 +00:00
|
|
|
/*
|
|
|
|
* In SACK, it is possible for tcp_output to fail to send a segment
|
|
|
|
* after the retransmission timer has been turned off. Make sure
|
|
|
|
* that the retransmission timer is set.
|
|
|
|
*/
|
2007-05-06 15:56:31 +00:00
|
|
|
if ((tp->t_flags & TF_SACK_PERMIT) &&
|
|
|
|
SEQ_GT(tp->snd_max, tp->snd_una) &&
|
2007-04-11 09:45:16 +00:00
|
|
|
!tcp_timer_active(tp, TT_REXMT) &&
|
|
|
|
!tcp_timer_active(tp, TT_PERSIST)) {
|
|
|
|
tcp_timer_activate(tp, TT_REXMT, tp->t_rxtcur);
|
2004-10-09 16:48:51 +00:00
|
|
|
goto just_return;
|
2020-02-12 13:31:36 +00:00
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
|
|
|
* TCP window updates are not reliable, rather a polling protocol
|
|
|
|
* using ``persist'' packets is used to ensure receipt of window
|
|
|
|
* updates. The three ``states'' for the output side are:
|
|
|
|
* idle not doing retransmits or persists
|
|
|
|
* persisting to move a small or zero window
|
|
|
|
* (re)transmitting and thereby not persisting
|
|
|
|
*
|
2007-04-11 09:45:16 +00:00
|
|
|
* tcp_timer_active(tp, TT_PERSIST)
|
1999-08-30 21:17:07 +00:00
|
|
|
* is true when we are in persist state.
|
2005-05-21 00:38:29 +00:00
|
|
|
* (tp->t_flags & TF_FORCEDATA)
|
1994-05-24 10:09:53 +00:00
|
|
|
* is set when we are called to send a persist packet.
|
2007-04-11 09:45:16 +00:00
|
|
|
* tcp_timer_active(tp, TT_REXMT)
|
1994-05-24 10:09:53 +00:00
|
|
|
* is set when we are retransmitting
|
|
|
|
* The output side is idle when both timers are zero.
|
|
|
|
*
|
|
|
|
* If send window is too small, there is data to transmit, and no
|
|
|
|
* retransmit or persist is pending, then go to persist state.
|
|
|
|
* If nothing happens soon, send when timer expires:
|
|
|
|
* if window is nonzero, transmit what we can,
|
|
|
|
* otherwise force out a byte.
|
|
|
|
*/
|
2014-11-12 09:57:15 +00:00
|
|
|
if (sbavail(&so->so_snd) && !tcp_timer_active(tp, TT_REXMT) &&
|
2007-04-11 09:45:16 +00:00
|
|
|
!tcp_timer_active(tp, TT_PERSIST)) {
|
1994-05-24 10:09:53 +00:00
|
|
|
tp->t_rxtshift = 0;
|
|
|
|
tcp_setpersist(tp);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* No reason to send a segment, just return.
|
|
|
|
*/
|
2004-10-09 16:48:51 +00:00
|
|
|
just_return:
|
|
|
|
SOCKBUF_UNLOCK(&so->so_snd);
|
1994-05-24 10:09:53 +00:00
|
|
|
return (0);
|
|
|
|
|
|
|
|
send:
|
2004-10-09 16:48:51 +00:00
|
|
|
SOCKBUF_LOCK_ASSERT(&so->so_snd);
|
2014-10-07 21:50:28 +00:00
|
|
|
if (len > 0) {
|
|
|
|
if (len >= tp->t_maxseg)
|
|
|
|
tp->t_flags2 |= TF2_PLPMTU_MAXSEGSNT;
|
|
|
|
else
|
|
|
|
tp->t_flags2 &= ~TF2_PLPMTU_MAXSEGSNT;
|
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
|
|
|
* Before ESTABLISHED, force sending of initial options
|
|
|
|
* unless TCP is set not to do any options.
|
|
|
|
* NOTE: we assume that the IP/TCP header plus TCP options
|
|
|
|
* always fit in a single mbuf, leaving room for a maximum
|
|
|
|
* link header, i.e.
|
2001-06-11 12:39:29 +00:00
|
|
|
* max_linkhdr + sizeof (struct tcpiphdr) + optlen <= MCLBYTES
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
|
|
|
optlen = 0;
|
2000-01-09 19:17:30 +00:00
|
|
|
#ifdef INET6
|
|
|
|
if (isipv6)
|
|
|
|
hdrlen = sizeof (struct ip6_hdr) + sizeof (struct tcphdr);
|
|
|
|
else
|
|
|
|
#endif
|
2011-04-30 11:21:29 +00:00
|
|
|
hdrlen = sizeof (struct tcpiphdr);
|
1995-05-30 08:16:23 +00:00
|
|
|
|
2004-08-16 18:32:07 +00:00
|
|
|
/*
|
2007-03-15 15:59:28 +00:00
|
|
|
* Compute options for segment.
|
|
|
|
* We only have to care about SYN and established connection
|
|
|
|
* segments. Options for SYN-ACK segments are handled in TCP
|
|
|
|
* syncache.
|
2004-08-16 18:32:07 +00:00
|
|
|
*/
|
2016-01-14 10:22:45 +00:00
|
|
|
to.to_flags = 0;
|
2007-03-15 15:59:28 +00:00
|
|
|
if ((tp->t_flags & TF_NOOPT) == 0) {
|
|
|
|
/* Maximum segment size. */
|
|
|
|
if (flags & TH_SYN) {
|
|
|
|
tp->snd_nxt = tp->iss;
|
|
|
|
to.to_mss = tcp_mssopt(&tp->t_inpcb->inp_inc);
|
|
|
|
to.to_flags |= TOF_MSS;
|
2018-02-26 03:03:41 +00:00
|
|
|
|
2015-12-24 19:09:48 +00:00
|
|
|
/*
|
2018-02-26 02:53:22 +00:00
|
|
|
* On SYN or SYN|ACK transmits on TFO connections,
|
|
|
|
* only include the TFO option if it is not a
|
|
|
|
* retransmit, as the presence of the TFO option may
|
|
|
|
* have caused the original SYN or SYN|ACK to have
|
|
|
|
* been dropped by a middlebox.
|
2015-12-24 19:09:48 +00:00
|
|
|
*/
|
2016-10-12 19:06:50 +00:00
|
|
|
if (IS_FASTOPEN(tp->t_flags) &&
|
2015-12-24 19:09:48 +00:00
|
|
|
(tp->t_rxtshift == 0)) {
|
2018-02-26 02:53:22 +00:00
|
|
|
if (tp->t_state == TCPS_SYN_RECEIVED) {
|
|
|
|
to.to_tfo_len = TCP_FASTOPEN_COOKIE_LEN;
|
|
|
|
to.to_tfo_cookie =
|
|
|
|
(u_int8_t *)&tp->t_tfo_cookie.server;
|
|
|
|
to.to_flags |= TOF_FASTOPEN;
|
|
|
|
wanted_cookie = 1;
|
|
|
|
} else if (tp->t_state == TCPS_SYN_SENT) {
|
|
|
|
to.to_tfo_len =
|
|
|
|
tp->t_tfo_client_cookie_len;
|
|
|
|
to.to_tfo_cookie =
|
|
|
|
tp->t_tfo_cookie.client;
|
|
|
|
to.to_flags |= TOF_FASTOPEN;
|
|
|
|
wanted_cookie = 1;
|
|
|
|
/*
|
|
|
|
* If we wind up having more data to
|
|
|
|
* send with the SYN than can fit in
|
|
|
|
* one segment, don't send any more
|
|
|
|
* until the SYN|ACK comes back from
|
|
|
|
* the other end.
|
|
|
|
*/
|
|
|
|
dont_sendalot = 1;
|
|
|
|
}
|
2015-12-24 19:09:48 +00:00
|
|
|
}
|
2005-04-21 20:26:07 +00:00
|
|
|
}
|
2007-03-15 15:59:28 +00:00
|
|
|
/* Window scaling. */
|
|
|
|
if ((flags & TH_SYN) && (tp->t_flags & TF_REQ_SCALE)) {
|
|
|
|
to.to_wscale = tp->request_r_scale;
|
|
|
|
to.to_flags |= TOF_SCALE;
|
|
|
|
}
|
|
|
|
/* Timestamps. */
|
|
|
|
if ((tp->t_flags & TF_RCVD_TSTMP) ||
|
|
|
|
((flags & TH_SYN) && (tp->t_flags & TF_REQ_TSTMP))) {
|
2018-05-08 02:22:34 +00:00
|
|
|
curticks = tcp_ts_getticks();
|
|
|
|
to.to_tsval = curticks + tp->ts_offset;
|
2007-03-15 15:59:28 +00:00
|
|
|
to.to_tsecr = tp->ts_recent;
|
|
|
|
to.to_flags |= TOF_TS;
|
2018-05-08 02:22:34 +00:00
|
|
|
if (tp->t_rxtshift == 1)
|
|
|
|
tp->t_badrxtwin = curticks;
|
2007-03-15 15:59:28 +00:00
|
|
|
}
|
2017-04-10 08:19:35 +00:00
|
|
|
|
|
|
|
/* Set receive buffer autosizing timestamp. */
|
|
|
|
if (tp->rfbuf_ts == 0 &&
|
|
|
|
(so->so_rcv.sb_flags & SB_AUTOSIZE))
|
|
|
|
tp->rfbuf_ts = tcp_ts_getticks();
|
|
|
|
|
2007-03-15 15:59:28 +00:00
|
|
|
/* Selective ACK's. */
|
2007-05-06 15:56:31 +00:00
|
|
|
if (tp->t_flags & TF_SACK_PERMIT) {
|
2007-03-15 15:59:28 +00:00
|
|
|
if (flags & TH_SYN)
|
|
|
|
to.to_flags |= TOF_SACKPERM;
|
|
|
|
else if (TCPS_HAVEESTABLISHED(tp->t_state) &&
|
|
|
|
tp->rcv_numsacks > 0) {
|
|
|
|
to.to_flags |= TOF_SACK;
|
|
|
|
to.to_nsacks = tp->rcv_numsacks;
|
|
|
|
to.to_sacks = (u_char *)tp->sackblks;
|
2005-04-21 20:26:07 +00:00
|
|
|
}
|
|
|
|
}
|
2017-02-06 08:49:57 +00:00
|
|
|
#if defined(IPSEC_SUPPORT) || defined(TCP_SIGNATURE)
|
2007-03-15 15:59:28 +00:00
|
|
|
/* TCP-MD5 (RFC2385). */
|
2017-02-06 08:49:57 +00:00
|
|
|
/*
|
|
|
|
* Check that TCP_MD5SIG is enabled in tcpcb to
|
|
|
|
* account for the size needed to set this TCP option.
|
|
|
|
*/
|
2007-03-15 15:59:28 +00:00
|
|
|
if (tp->t_flags & TF_SIGNATURE)
|
|
|
|
to.to_flags |= TOF_SIGNATURE;
|
|
|
|
#endif /* TCP_SIGNATURE */
|
2005-04-21 20:26:07 +00:00
|
|
|
|
2007-03-15 15:59:28 +00:00
|
|
|
/* Processing the options. */
|
2007-11-28 13:33:27 +00:00
|
|
|
hdrlen += optlen = tcp_addoptions(&to, opt);
|
2018-02-26 02:53:22 +00:00
|
|
|
/*
|
|
|
|
* If we wanted a TFO option to be added, but it was unable
|
|
|
|
* to fit, ensure no data is sent.
|
|
|
|
*/
|
|
|
|
if (IS_FASTOPEN(tp->t_flags) && wanted_cookie &&
|
|
|
|
!(to.to_flags & TOF_FASTOPEN))
|
|
|
|
len = 0;
|
2005-04-21 20:26:07 +00:00
|
|
|
}
|
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
|
|
|
* Adjust data length if insertion of options will
|
2016-01-07 00:14:42 +00:00
|
|
|
* bump the packet length beyond the t_maxseg length.
|
1995-02-09 23:13:27 +00:00
|
|
|
* Clear the FIN bit because we cut off the tail of
|
|
|
|
* the segment.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
2016-01-07 00:14:42 +00:00
|
|
|
if (len + optlen + ipoptlen > tp->t_maxseg) {
|
1995-01-23 17:58:27 +00:00
|
|
|
flags &= ~TH_FIN;
|
2010-09-17 22:05:27 +00:00
|
|
|
|
2006-09-07 12:53:01 +00:00
|
|
|
if (tso) {
|
2014-09-22 08:27:27 +00:00
|
|
|
u_int if_hw_tsomax;
|
|
|
|
u_int moff;
|
|
|
|
int max_len;
|
|
|
|
|
|
|
|
/* extract TSO information */
|
|
|
|
if_hw_tsomax = tp->t_tsomax;
|
|
|
|
if_hw_tsomaxsegcount = tp->t_tsomaxsegcount;
|
|
|
|
if_hw_tsomaxsegsize = tp->t_tsomaxsegsize;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Limit a TSO burst to prevent it from
|
|
|
|
* overflowing or exceeding the maximum length
|
|
|
|
* allowed by the network interface:
|
|
|
|
*/
|
2010-09-17 22:05:27 +00:00
|
|
|
KASSERT(ipoptlen == 0,
|
|
|
|
("%s: TSO can't do IP options", __func__));
|
|
|
|
|
|
|
|
/*
|
2014-09-22 08:27:27 +00:00
|
|
|
* Check if we should limit by maximum payload
|
|
|
|
* length:
|
2010-09-17 22:05:27 +00:00
|
|
|
*/
|
2014-09-22 08:27:27 +00:00
|
|
|
if (if_hw_tsomax != 0) {
|
|
|
|
/* compute maximum TSO length */
|
2015-09-14 08:36:22 +00:00
|
|
|
max_len = (if_hw_tsomax - hdrlen -
|
|
|
|
max_linkhdr);
|
2014-09-22 08:27:27 +00:00
|
|
|
if (max_len <= 0) {
|
|
|
|
len = 0;
|
2014-11-11 12:05:59 +00:00
|
|
|
} else if (len > max_len) {
|
2014-09-22 08:27:27 +00:00
|
|
|
sendalot = 1;
|
2014-11-11 12:05:59 +00:00
|
|
|
len = max_len;
|
2014-09-22 08:27:27 +00:00
|
|
|
}
|
|
|
|
}
|
2018-10-22 21:17:36 +00:00
|
|
|
|
2010-09-17 22:05:27 +00:00
|
|
|
/*
|
|
|
|
* Prevent the last segment from being
|
2014-09-22 08:27:27 +00:00
|
|
|
* fractional unless the send sockbuf can be
|
|
|
|
* emptied:
|
|
|
|
*/
|
2016-01-07 00:14:42 +00:00
|
|
|
max_len = (tp->t_maxseg - optlen);
|
2016-10-06 16:28:34 +00:00
|
|
|
if (((uint32_t)off + (uint32_t)len) <
|
|
|
|
sbavail(&so->so_snd)) {
|
2014-11-11 12:05:59 +00:00
|
|
|
moff = len % max_len;
|
2014-09-22 08:27:27 +00:00
|
|
|
if (moff != 0) {
|
|
|
|
len -= moff;
|
|
|
|
sendalot = 1;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* In case there are too many small fragments
|
|
|
|
* don't use TSO:
|
2010-09-17 22:05:27 +00:00
|
|
|
*/
|
2014-11-11 12:05:59 +00:00
|
|
|
if (len <= max_len) {
|
|
|
|
len = max_len;
|
2006-09-07 12:53:01 +00:00
|
|
|
sendalot = 1;
|
2014-09-22 08:27:27 +00:00
|
|
|
tso = 0;
|
2010-09-17 22:05:27 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Send the FIN in a separate segment
|
|
|
|
* after the bulk sending is done.
|
|
|
|
* We don't trust the TSO implementations
|
|
|
|
* to clear the FIN flag on all but the
|
|
|
|
* last segment.
|
|
|
|
*/
|
|
|
|
if (tp->t_flags & TF_NEEDFIN)
|
2006-09-07 12:53:01 +00:00
|
|
|
sendalot = 1;
|
|
|
|
} else {
|
2019-09-29 10:45:13 +00:00
|
|
|
if (optlen + ipoptlen >= tp->t_maxseg) {
|
|
|
|
/*
|
|
|
|
* Since we don't have enough space to put
|
|
|
|
* the IP header chain and the TCP header in
|
|
|
|
* one packet as required by RFC 7112, don't
|
|
|
|
* send it. Also ensure that at least one
|
|
|
|
* byte of the payload can be put into the
|
|
|
|
* TCP segment.
|
|
|
|
*/
|
|
|
|
SOCKBUF_UNLOCK(&so->so_snd);
|
|
|
|
error = EMSGSIZE;
|
|
|
|
sack_rxmit = 0;
|
|
|
|
goto out;
|
|
|
|
}
|
2016-01-07 00:14:42 +00:00
|
|
|
len = tp->t_maxseg - optlen - ipoptlen;
|
2006-09-07 12:53:01 +00:00
|
|
|
sendalot = 1;
|
2018-02-26 02:53:22 +00:00
|
|
|
if (dont_sendalot)
|
|
|
|
sendalot = 0;
|
2006-09-07 12:53:01 +00:00
|
|
|
}
|
2010-09-17 22:05:27 +00:00
|
|
|
} else
|
|
|
|
tso = 0;
|
|
|
|
|
|
|
|
KASSERT(len + hdrlen + ipoptlen <= IP_MAXPACKET,
|
|
|
|
("%s: len > IP_MAXPACKET", __func__));
|
1994-05-24 10:09:53 +00:00
|
|
|
|
1995-02-09 23:13:27 +00:00
|
|
|
/*#ifdef DIAGNOSTIC*/
|
2000-02-09 00:34:40 +00:00
|
|
|
#ifdef INET6
|
2004-08-16 18:32:07 +00:00
|
|
|
if (max_linkhdr + hdrlen > MCLBYTES)
|
2000-02-09 00:34:40 +00:00
|
|
|
#else
|
2004-08-16 18:32:07 +00:00
|
|
|
if (max_linkhdr + hdrlen > MHLEN)
|
2000-02-09 00:34:40 +00:00
|
|
|
#endif
|
2002-06-23 21:25:36 +00:00
|
|
|
panic("tcphdr too big");
|
1995-02-09 23:13:27 +00:00
|
|
|
/*#endif*/
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2008-09-07 11:38:30 +00:00
|
|
|
/*
|
|
|
|
* This KASSERT is here to catch edge cases at a well defined place.
|
|
|
|
* Before, those had triggered (random) panic conditions further down.
|
|
|
|
*/
|
|
|
|
KASSERT(len >= 0, ("[%s:%d]: len < 0", __func__, __LINE__));
|
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
|
|
|
* Grab a header mbuf, attaching a copy of data to
|
|
|
|
* be transmitted, and initialize the header from
|
|
|
|
* the template for sends on this connection.
|
|
|
|
*/
|
|
|
|
if (len) {
|
2007-03-19 18:35:13 +00:00
|
|
|
struct mbuf *mb;
|
2018-06-21 21:03:58 +00:00
|
|
|
struct sockbuf *msb;
|
2007-03-19 18:35:13 +00:00
|
|
|
u_int moff;
|
|
|
|
|
2019-12-02 20:58:04 +00:00
|
|
|
if ((tp->t_flags & TF_FORCEDATA) && len == 1) {
|
2009-04-11 22:07:19 +00:00
|
|
|
TCPSTAT_INC(tcps_sndprobe);
|
2019-12-02 20:58:04 +00:00
|
|
|
#ifdef STATS
|
|
|
|
if (SEQ_LT(tp->snd_nxt, tp->snd_max))
|
|
|
|
stats_voi_update_abs_u32(tp->t_stats,
|
|
|
|
VOI_TCP_RETXPB, len);
|
|
|
|
else
|
|
|
|
stats_voi_update_abs_u64(tp->t_stats,
|
|
|
|
VOI_TCP_TXPB, len);
|
|
|
|
#endif /* STATS */
|
|
|
|
} else if (SEQ_LT(tp->snd_nxt, tp->snd_max) || sack_rxmit) {
|
2010-11-17 18:55:12 +00:00
|
|
|
tp->t_sndrexmitpack++;
|
2009-04-11 22:07:19 +00:00
|
|
|
TCPSTAT_INC(tcps_sndrexmitpack);
|
|
|
|
TCPSTAT_ADD(tcps_sndrexmitbyte, len);
|
2019-12-02 20:58:04 +00:00
|
|
|
#ifdef STATS
|
|
|
|
stats_voi_update_abs_u32(tp->t_stats, VOI_TCP_RETXPB,
|
|
|
|
len);
|
|
|
|
#endif /* STATS */
|
1994-05-24 10:09:53 +00:00
|
|
|
} else {
|
2009-04-11 22:07:19 +00:00
|
|
|
TCPSTAT_INC(tcps_sndpack);
|
|
|
|
TCPSTAT_ADD(tcps_sndbyte, len);
|
2019-12-02 20:58:04 +00:00
|
|
|
#ifdef STATS
|
|
|
|
stats_voi_update_abs_u64(tp->t_stats, VOI_TCP_TXPB,
|
|
|
|
len);
|
|
|
|
#endif /* STATS */
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
2013-03-15 12:53:53 +00:00
|
|
|
#ifdef INET6
|
|
|
|
if (MHLEN < hdrlen + max_linkhdr)
|
|
|
|
m = m_getcl(M_NOWAIT, MT_DATA, M_PKTHDR);
|
|
|
|
else
|
|
|
|
#endif
|
|
|
|
m = m_gethdr(M_NOWAIT, MT_DATA);
|
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
if (m == NULL) {
|
2004-10-09 16:48:51 +00:00
|
|
|
SOCKBUF_UNLOCK(&so->so_snd);
|
1994-05-24 10:09:53 +00:00
|
|
|
error = ENOBUFS;
|
Fix tcp_output() so that tcpcb is updated in the same manner when an
mbuf allocation fails, as in a case when ip_output() returns error.
To achieve that, move the large block of code that updates tcpcb below
the out: label.
This fixes a panic that requires the following sequence to happen:
1) The SYN was sent to the network, tp->snd_nxt = iss + 1, tp->snd_una = iss
2) The retransmit timeout happened for the SYN we had sent,
tcp_timer_rexmt() sets tp->snd_nxt = tp->snd_una, and calls tcp_output().
In tcp_output m_get() fails.
3) Later on the SYN|ACK for the SYN sent in step 1) came,
tcp_input sets tp->snd_una += 1, which leads to
tp->snd_una > tp->snd_nxt inconsistency, that later panics in
socket buffer code.
For reference, this bug was fixed in the DragonFly BSD repo:
http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/1ff9b7d322dc5a26f7173aa8c38ecb79da80e419
Reviewed by: andre
Tested by: pho
Sponsored by: Nginx, Inc.
PR: kern/177456
Submitted by: HouYeFei&XiBoLiu <lglion718 163.com>
2013-04-11 18:23:56 +00:00
|
|
|
sack_rxmit = 0;
|
1994-05-24 10:09:53 +00:00
|
|
|
goto out;
|
|
|
|
}
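The three-step sequence in the commit message above can be replayed in isolation. The following is a minimal sketch, not the kernel code: struct snd_state and the helper functions are hypothetical stand-ins for the relevant tcpcb fields, and the 'fixed' argument only models the effect of updating the tcpcb on the error path as well.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Modular sequence comparison, in the style of the BSD SEQ_GT() macro. */
#define SEQ_GT(a, b)	((int32_t)((a) - (b)) > 0)

/* Hypothetical, stripped-down stand-in for the send state in the tcpcb. */
struct snd_state {
	uint32_t snd_una;	/* oldest unacknowledged sequence number */
	uint32_t snd_nxt;	/* next sequence number to send */
};

/* Step 1: the SYN goes out and consumes one sequence number. */
void
send_syn(struct snd_state *s, uint32_t iss)
{
	assert(s != NULL);
	s->snd_una = iss;
	s->snd_nxt = iss + 1;
}

/*
 * Step 2: the retransmit timer rewinds snd_nxt and output is attempted.
 * 'alloc_ok' models whether the mbuf allocation succeeded; 'fixed' models
 * whether snd_nxt is also advanced when the allocation fails.
 */
void
rexmt_syn(struct snd_state *s, int alloc_ok, int fixed)
{
	assert(s != NULL);
	s->snd_nxt = s->snd_una;
	if (alloc_ok || fixed)
		s->snd_nxt += 1;
}

/* Step 3: the SYN|ACK acknowledges the SYN. */
void
recv_synack(struct snd_state *s)
{
	assert(s != NULL);
	s->snd_una += 1;
}

/* Nonzero when snd_una has overtaken snd_nxt. */
int
inconsistent(const struct snd_state *s)
{
	assert(s != NULL);
	return (SEQ_GT(s->snd_una, s->snd_nxt));
}
```

Calling send_syn(), then rexmt_syn() with alloc_ok and fixed both zero, then recv_synack() leaves snd_una one ahead of snd_nxt. With fixed set to one the two fields end up equal.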
|
2013-03-15 12:53:53 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
m->m_data += max_linkhdr;
|
|
|
|
m->m_len = hdrlen;
|
2007-03-19 18:35:13 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Start the m_copy functions from the closest mbuf
|
|
|
|
* to the offset in the socket buffer chain.
|
|
|
|
*/
|
2018-06-21 21:03:58 +00:00
|
|
|
mb = sbsndptr_noadv(&so->so_snd, off, &moff);
|
Add kernel-side support for in-kernel TLS.
KTLS adds support for in-kernel framing and encryption of Transport
Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports
offload of TLS for transmitted data. Key negotiation must still be
performed in userland. Once completed, transmit session keys for a
connection are provided to the kernel via a new TCP_TXTLS_ENABLE
socket option. All subsequent data transmitted on the socket is
placed into TLS frames and encrypted using the supplied keys.
Any data written to a KTLS-enabled socket via write(2), aio_write(2),
or sendfile(2) is assumed to be application data and is encoded in TLS
frames with an application data type. Individual records can be sent
with a custom type (e.g. handshake messages) via sendmsg(2) with a new
control message (TLS_SET_RECORD_TYPE) specifying the record type.
At present, rekeying is not supported though the in-kernel framework
should support rekeying.
KTLS makes use of the recently added unmapped mbufs to store TLS
frames in the socket buffer. Each TLS frame is described by a single
ext_pgs mbuf. The ext_pgs structure contains the header of the TLS
record (and trailer for encrypted records) as well as references to
the associated TLS session.
KTLS supports two primary methods of encrypting TLS frames: software
TLS and ifnet TLS.
Software TLS marks mbufs holding socket data as not ready via
M_NOTREADY similar to sendfile(2) when TLS framing information is
added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then
called to schedule TLS frames for encryption. In the sendfile(2) case,
sendfile_iodone() calls ktls_enqueue() instead of pru_ready(), leaving
the mbufs marked M_NOTREADY until encryption is completed. For other
writes (vn_sendfile when pages are available, write(2), etc.), the
PRUS_NOTREADY is set when invoking pru_send() along with invoking
ktls_enqueue().
A pool of worker threads (the "KTLS" kernel process) encrypts TLS
frames queued via ktls_enqueue(). Each TLS frame is temporarily
mapped using the direct map and passed to a software encryption
backend to perform the actual encryption.
(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if
someone wished to make this work on architectures without a direct
map.)
KTLS supports pluggable software encryption backends. Internally,
Netflix uses proprietary pure-software backends. This commit includes
a simple backend in a new ktls_ocf.ko module that uses the kernel's
OpenCrypto framework to provide AES-GCM encryption of TLS frames. As
a result, software TLS is now a bit of a misnomer as it can make use
of hardware crypto accelerators.
Once software encryption has finished, the TLS frame mbufs are marked
ready via pru_ready(). At this point, the encrypted data appears as
regular payload to the TCP stack stored in unmapped mbufs.
ifnet TLS permits a NIC to offload the TLS encryption and TCP
segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS)
is allocated on the interface a socket is routed over and associated
with a TLS session. TLS records for a TLS session using ifnet TLS are
not marked M_NOTREADY but are passed down the stack unencrypted. The
ip_output_send() and ip6_output_send() helper functions that apply
send tags to outbound IP packets verify that the send tag of the TLS
record matches the outbound interface. If so, the packet is tagged
with the TLS send tag and sent to the interface. The NIC device
driver must recognize packets with the TLS send tag and schedule them
for TLS encryption and TCP segmentation. If the outbound
interface does not match the interface in the TLS send tag, the packet
is dropped. In addition, a task is scheduled to refresh the TLS send
tag for the TLS session. If a new TLS send tag cannot be allocated,
the connection is dropped. If a new TLS send tag is allocated,
however, subsequent packets will be tagged with the correct TLS send
tag. (This latter case has been tested by configuring both ports of a
Chelsio T6 in a lagg and failing over from one port to another. As
the connections migrated to the new port, new TLS send tags were
allocated for the new port and connections resumed without being
dropped.)
ifnet TLS can be enabled and disabled on supported network interfaces
via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported
across both vlan devices and lagg interfaces using failover, lacp with
flowid enabled, or lacp with flowid enabled.
Applications may request the current KTLS mode of a connection via a
new TCP_TXTLS_MODE socket option. They can also use this socket
option to toggle between software and ifnet TLS modes.
In addition, a testing tool is available in tools/tools/switch_tls.
This is modeled on tcpdrop and uses similar syntax. However, instead
of dropping connections, -s is used to force KTLS connections to
switch to software TLS and -i is used to switch to ifnet TLS.
Various sysctls and counters are available under the kern.ipc.tls
sysctl node. The kern.ipc.tls.enable node must be set to true to
enable KTLS (it is off by default). The use of unmapped mbufs must
also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.
KTLS is enabled via the KERN_TLS kernel option.
This patch is the culmination of years of work by several folks
including Scott Long and Randall Stewart for the original design and
implementation; Drew Gallatin for several optimizations including the
use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records
awaiting software encryption, and pluggable software crypto backends;
and John Baldwin for modifications to support hardware TLS offload.
Reviewed by: gallatin, hselasky, rrs
Obtained from: Netflix
Sponsored by: Netflix, Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D21277
2019-08-27 00:01:56 +00:00
|
|
|
if (len <= MHLEN - hdrlen - max_linkhdr && !hw_tls) {
|
2016-10-06 16:28:34 +00:00
|
|
|
m_copydata(mb, moff, len,
|
1994-05-24 10:09:53 +00:00
|
|
|
mtod(m, caddr_t) + hdrlen);
|
2018-06-21 21:03:58 +00:00
|
|
|
if (SEQ_LT(tp->snd_nxt, tp->snd_max))
|
|
|
|
sbsndptr_adv(&so->so_snd, mb, len);
|
1994-05-24 10:09:53 +00:00
|
|
|
m->m_len += len;
|
|
|
|
} else {
|
2018-06-21 21:03:58 +00:00
|
|
|
if (SEQ_LT(tp->snd_nxt, tp->snd_max))
|
|
|
|
msb = NULL;
|
|
|
|
else
|
|
|
|
msb = &so->so_snd;
|
|
|
|
m->m_next = tcp_m_copym(mb, moff,
|
|
|
|
&len, if_hw_tsomaxsegcount,
|
2019-08-27 00:01:56 +00:00
|
|
|
if_hw_tsomaxsegsize, msb, hw_tls);
|
2018-06-21 21:03:58 +00:00
|
|
|
if (len <= (tp->t_maxseg - optlen)) {
|
2020-02-12 13:31:36 +00:00
|
|
|
/*
|
2018-06-21 21:03:58 +00:00
|
|
|
* Must have run out of mbufs for the copy;
|
|
|
|
* shorten it so it no longer needs TSO. Let's
|
|
|
|
* not set sendalot since we are low on
|
|
|
|
* mbufs.
|
|
|
|
*/
|
|
|
|
tso = 0;
|
|
|
|
}
|
2007-03-19 18:35:13 +00:00
|
|
|
if (m->m_next == NULL) {
|
2004-10-09 16:48:51 +00:00
|
|
|
SOCKBUF_UNLOCK(&so->so_snd);
|
1995-09-22 20:05:58 +00:00
|
|
|
(void) m_free(m);
|
1995-09-13 17:36:31 +00:00
|
|
|
error = ENOBUFS;
|
2013-04-11 18:23:56 +00:00
|
|
|
sack_rxmit = 0;
|
1995-09-13 17:36:31 +00:00
|
|
|
goto out;
|
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
}
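The branch above chooses between copying the payload into the header mbuf and chaining a copy of the socket buffer data behind it. As a rough sketch of the first condition only: payload_fits_inline() and its mhlen parameter are hypothetical names, with mhlen standing in for MHLEN, whose real value depends on the platform. Unlike the original expression, this version uses signed 64-bit arithmetic so that oversized headers simply leave no room.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true when 'len' payload bytes fit behind the link-layer
 * reservation and the protocol headers inside one header mbuf whose data
 * area holds 'mhlen' bytes. Payload handled by hardware TLS is never
 * copied inline.
 */
bool
payload_fits_inline(int32_t len, uint32_t hdrlen, uint32_t max_linkhdr,
    uint32_t mhlen, bool hw_tls)
{
	int64_t room;

	assert(mhlen > 0);
	/* Bytes left in the header mbuf after both header reservations. */
	room = (int64_t)mhlen - (int64_t)hdrlen - (int64_t)max_linkhdr;
	return (!hw_tls && len >= 0 && (int64_t)len <= room);
}
```

When this returns false, the source takes the other branch and attaches a copied mbuf chain through m->m_next instead.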
|
2011-07-05 18:49:55 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
|
|
|
* If we're sending everything we've got, set PUSH.
|
|
|
|
* (This will keep happy those implementations which only
|
|
|
|
* give data to the user when a buffer fills or
|
|
|
|
* a PUSH comes in.)
|
|
|
|
*/
|
2016-10-06 16:28:34 +00:00
|
|
|
if (((uint32_t)off + (uint32_t)len == sbused(&so->so_snd)) &&
|
|
|
|
!(flags & TH_SYN))
|
1994-05-24 10:09:53 +00:00
|
|
|
flags |= TH_PUSH;
|
2004-10-09 16:48:51 +00:00
|
|
|
SOCKBUF_UNLOCK(&so->so_snd);
|
1994-05-24 10:09:53 +00:00
|
|
|
} else {
|
2004-10-09 16:48:51 +00:00
|
|
|
SOCKBUF_UNLOCK(&so->so_snd);
|
1994-05-24 10:09:53 +00:00
|
|
|
if (tp->t_flags & TF_ACKNOW)
|
2009-04-11 22:07:19 +00:00
|
|
|
TCPSTAT_INC(tcps_sndacks);
|
1994-05-24 10:09:53 +00:00
|
|
|
else if (flags & (TH_SYN|TH_FIN|TH_RST))
|
2009-04-11 22:07:19 +00:00
|
|
|
TCPSTAT_INC(tcps_sndctrl);
|
1994-05-24 10:09:53 +00:00
|
|
|
else if (SEQ_GT(tp->snd_up, tp->snd_una))
|
2009-04-11 22:07:19 +00:00
|
|
|
TCPSTAT_INC(tcps_sndurg);
|
1994-05-24 10:09:53 +00:00
|
|
|
else
|
2009-04-11 22:07:19 +00:00
|
|
|
TCPSTAT_INC(tcps_sndwinup);
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2013-03-16 08:58:28 +00:00
|
|
|
m = m_gethdr(M_NOWAIT, MT_DATA);
|
1994-05-24 10:09:53 +00:00
|
|
|
if (m == NULL) {
|
|
|
|
error = ENOBUFS;
|
2013-04-11 18:23:56 +00:00
|
|
|
sack_rxmit = 0;
|
1994-05-24 10:09:53 +00:00
|
|
|
goto out;
|
|
|
|
}
|
2000-01-09 19:17:30 +00:00
|
|
|
#ifdef INET6
|
|
|
|
if (isipv6 && (MHLEN < hdrlen + max_linkhdr) &&
|
|
|
|
MHLEN >= hdrlen) {
|
To ease changes to underlying mbuf structure and the mbuf allocator, reduce
the knowledge of mbuf layout, and in particular constants such as M_EXT,
MLEN, MHLEN, and so on, in mbuf consumers by unifying various alignment
utility functions (M_ALIGN(), MH_ALIGN(), MEXT_ALIGN()) in a single
M_ALIGN() macro, implemented by a now-inlined m_align() function:
- Move m_align() from uipc_mbuf.c to mbuf.h; mark as __inline.
- Reimplement M_ALIGN(), MH_ALIGN(), and MEXT_ALIGN() using m_align().
- Update consumers around the tree to simply use M_ALIGN().
This change eliminates a number of cases where mbuf consumers must be aware
of whether or not mbufs returned by the allocator use external storage, but
also assumptions about the size of the returned mbuf. This will make it
easier to introduce changes in how we use external storage, as well as
features such as variable-size mbufs.
Differential Revision: https://reviews.freebsd.org/D1436
Reviewed by: glebius, trasz, gnn, bz
Sponsored by: EMC / Isilon Storage Division
2015-01-05 09:58:32 +00:00
|
|
|
M_ALIGN(m, hdrlen);
|
2000-01-09 19:17:30 +00:00
|
|
|
} else
|
|
|
|
#endif
|
1994-05-24 10:09:53 +00:00
|
|
|
m->m_data += max_linkhdr;
|
|
|
|
m->m_len = hdrlen;
|
|
|
|
}
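The M_ALIGN(m, hdrlen) call above places the headers at the end of the mbuf's data area so that room remains in front of them. The idea can be sketched on a toy buffer; struct toy_buf, TOY_SIZE and toy_align() are hypothetical, and the rounding to a multiple of sizeof(long) is an assumption about how such an alignment helper typically behaves rather than a copy of m_align().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy buffer; a real mbuf has more fields and other sizes. */
#define TOY_SIZE	64

struct toy_buf {
	char	 storage[TOY_SIZE];
	char	*data;		/* start of valid data */
};

/*
 * Position 'data' so that an object of 'len' bytes sits as close to the
 * end of the storage as possible while its start offset stays a multiple
 * of sizeof(long). Returns the chosen offset.
 */
size_t
toy_align(struct toy_buf *b, size_t len)
{
	size_t adjust;

	assert(b != NULL && len <= TOY_SIZE);
	/* Round the free space in front of the object down to long size. */
	adjust = (TOY_SIZE - len) & ~(sizeof(long) - 1);
	b->data = b->storage + adjust;
	return (adjust);
}
```

For a 40-byte header in the 64-byte toy buffer the offset is 24, which leaves 24 bytes of leading space for anything prepended later.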
|
2004-10-09 16:48:51 +00:00
|
|
|
SOCKBUF_UNLOCK_ASSERT(&so->so_snd);
|
1994-05-24 10:09:53 +00:00
|
|
|
m->m_pkthdr.rcvif = (struct ifnet *)0;
|
2002-07-31 19:06:49 +00:00
|
|
|
#ifdef MAC
|
2007-10-24 19:04:04 +00:00
|
|
|
mac_inpcb_create_mbuf(tp->t_inpcb, m);
|
2002-07-31 19:06:49 +00:00
|
|
|
#endif
|
2000-01-09 19:17:30 +00:00
|
|
|
#ifdef INET6
|
|
|
|
if (isipv6) {
|
|
|
|
ip6 = mtod(m, struct ip6_hdr *);
|
|
|
|
th = (struct tcphdr *)(ip6 + 1);
|
2003-02-19 22:18:06 +00:00
|
|
|
tcpip_fillheaders(tp->t_inpcb, ip6, th);
|
2000-01-09 19:17:30 +00:00
|
|
|
} else
|
|
|
|
#endif /* INET6 */
|
2004-08-16 18:32:07 +00:00
|
|
|
{
|
|
|
|
ip = mtod(m, struct ip *);
|
2017-12-25 04:48:39 +00:00
|
|
|
#ifdef TCPDEBUG
|
2004-08-16 18:32:07 +00:00
|
|
|
ipov = (struct ipovly *)ip;
|
2017-12-25 04:48:39 +00:00
|
|
|
#endif
|
2004-08-16 18:32:07 +00:00
|
|
|
th = (struct tcphdr *)(ip + 1);
|
|
|
|
tcpip_fillheaders(tp->t_inpcb, ip, th);
|
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Fill in fields, remembering maximum advertised
|
|
|
|
* window for use in delaying messages about window sizes.
|
|
|
|
* If resending a FIN, be sure not to use a new sequence number.
|
|
|
|
*/
|
1995-05-30 08:16:23 +00:00
|
|
|
if (flags & TH_FIN && tp->t_flags & TF_SENTFIN &&
|
1994-05-24 10:09:53 +00:00
|
|
|
tp->snd_nxt == tp->snd_max)
|
|
|
|
tp->snd_nxt--;
|
2008-07-31 15:10:09 +00:00
|
|
|
/*
|
|
|
|
* If we are starting a connection, send ECN setup
|
|
|
|
* SYN packet. If we are on a retransmit, we may
|
|
|
|
* resend those bits a number of times as per
|
|
|
|
* RFC 3168.
|
|
|
|
*/
|
2016-05-19 22:20:35 +00:00
|
|
|
if (tp->t_state == TCPS_SYN_SENT && V_tcp_do_ecn == 1) {
|
2008-07-31 15:10:09 +00:00
|
|
|
if (tp->t_rxtshift >= 1) {
|
Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).
This is the first in a series of commits over the course
of the next few weeks.
Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.
We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.
Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch
2008-08-17 23:27:27 +00:00
|
|
|
if (tp->t_rxtshift <= V_tcp_ecn_maxretries)
|
2008-07-31 15:10:09 +00:00
|
|
|
flags |= TH_ECE|TH_CWR;
|
|
|
|
} else
|
|
|
|
flags |= TH_ECE|TH_CWR;
|
|
|
|
}
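The SYN_SENT block above decides whether an outgoing SYN still carries the ECN setup bits, and it reads its limit through a V_ prefixed name as introduced by the vimage commit. A small sketch of both ideas follows. The global tcp_ecn_maxretries, its value of 1 and the function syn_ecn_flags() are illustrative assumptions; the flag values are the standard TCP header bits.

```c
#include <assert.h>
#include <stdint.h>

#define TH_ECE	0x40	/* ECN-Echo bit in the TCP flags byte */
#define TH_CWR	0x80	/* Congestion Window Reduced bit */

/* Stand-in global; the value is illustrative only. */
int tcp_ecn_maxretries = 1;

/* The V_ name maps straight back to the global, as in a no-op mapping. */
#define V_tcp_ecn_maxretries	tcp_ecn_maxretries

/*
 * ECN bits for a SYN sent with retransmit shift 'rxtshift': the first SYN
 * always carries them, retransmitted SYNs only up to the retry limit.
 */
uint8_t
syn_ecn_flags(int rxtshift)
{
	uint8_t flags = 0;

	assert(rxtshift >= 0);
	if (rxtshift >= 1) {
		if (rxtshift <= V_tcp_ecn_maxretries)
			flags |= TH_ECE | TH_CWR;
	} else
		flags |= TH_ECE | TH_CWR;
	return (flags);
}
```

With the limit set to 1, the initial SYN and the first retransmission carry ECE and CWR, and later retransmissions are sent without them.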
|
2020-05-21 21:15:25 +00:00
|
|
|
/* Handle parallel SYN for ECN */
|
|
|
|
if ((tp->t_state == TCPS_SYN_RECEIVED) &&
|
|
|
|
(tp->t_flags2 & TF2_ECN_SND_ECE)) {
|
|
|
|
flags |= TH_ECE;
|
|
|
|
tp->t_flags2 &= ~TF2_ECN_SND_ECE;
|
|
|
|
}
|
2020-02-12 13:31:36 +00:00
|
|
|
|
2008-07-31 15:10:09 +00:00
|
|
|
if (tp->t_state == TCPS_ESTABLISHED &&
|
2019-12-01 21:01:33 +00:00
|
|
|
(tp->t_flags2 & TF2_ECN_PERMIT)) {
|
2008-07-31 15:10:09 +00:00
|
|
|
/*
|
|
|
|
* If the peer has ECN, mark data packets with
|
|
|
|
* ECN capable transmission (ECT).
|
|
|
|
* Ignore pure ack packets, retransmissions and window probes.
|
|
|
|
*/
|
|
|
|
if (len > 0 && SEQ_GEQ(tp->snd_nxt, tp->snd_max) &&
|
2020-01-25 13:34:29 +00:00
|
|
|
(sack_rxmit == 0) &&
|
2020-05-21 21:33:15 +00:00
|
|
|
!((tp->t_flags & TF_FORCEDATA) && len == 1 &&
|
|
|
|
SEQ_LT(tp->snd_una, tp->snd_max))) {
|
2008-07-31 15:10:09 +00:00
|
|
|
#ifdef INET6
|
|
|
|
if (isipv6)
|
|
|
|
ip6->ip6_flow |= htonl(IPTOS_ECN_ECT0 << 20);
|
|
|
|
else
|
|
|
|
#endif
|
|
|
|
ip->ip_tos |= IPTOS_ECN_ECT0;
|
2009-04-11 22:07:19 +00:00
|
|
|
TCPSTAT_INC(tcps_ecn_ect0);
|
2020-05-21 21:33:15 +00:00
|
|
|
/*
|
|
|
|
* Reply with proper ECN notifications.
|
|
|
|
* Only set CWR on new data segments.
|
|
|
|
*/
|
|
|
|
if (tp->t_flags2 & TF2_ECN_SND_CWR) {
|
|
|
|
flags |= TH_CWR;
|
|
|
|
tp->t_flags2 &= ~TF2_ECN_SND_CWR;
|
|
|
|
}
|
2020-02-12 13:31:36 +00:00
|
|
|
}
|
2019-12-01 21:01:33 +00:00
|
|
|
if (tp->t_flags2 & TF2_ECN_SND_ECE)
|
2008-07-31 15:10:09 +00:00
|
|
|
flags |= TH_ECE;
|
|
|
|
}
|
2020-02-12 13:31:36 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
|
|
|
* If we are doing retransmissions, then snd_nxt will
|
|
|
|
* not reflect the first unsent octet. For ACK only
|
|
|
|
* packets, we do not want the sequence number of the
|
|
|
|
* retransmitted packet, we want the sequence number
|
|
|
|
* of the next unsent octet. So, if there is no data
|
|
|
|
* (and no SYN or FIN), use snd_max instead of snd_nxt
|
|
|
|
* when filling in ti_seq. But if we are in persist
|
|
|
|
* state, snd_max might reflect one byte beyond the
|
|
|
|
* right edge of the window, so use snd_nxt in that
|
|
|
|
* case, since we know we aren't doing a retransmission.
|
|
|
|
* (retransmit and persist are mutually exclusive...)
|
|
|
|
*/
|
2004-10-05 18:36:24 +00:00
|
|
|
if (sack_rxmit == 0) {
|
2007-04-11 09:45:16 +00:00
|
|
|
if (len || (flags & (TH_SYN|TH_FIN)) ||
|
|
|
|
tcp_timer_active(tp, TT_PERSIST))
|
2004-10-05 18:36:24 +00:00
|
|
|
th->th_seq = htonl(tp->snd_nxt);
|
|
|
|
else
|
|
|
|
th->th_seq = htonl(tp->snd_max);
|
|
|
|
} else {
|
2004-06-23 21:04:37 +00:00
|
|
|
th->th_seq = htonl(p->rxmit);
|
|
|
|
p->rxmit += len;
|
2005-05-11 21:37:42 +00:00
|
|
|
tp->sackhint.sack_bytes_rexmit += len;
|
2004-06-23 21:04:37 +00:00
|
|
|
}
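The choice between snd_nxt and snd_max described in the comment above can be written as a small pure function. This is a sketch covering only the non-SACK branch; choose_seq() and its persist_active parameter (standing in for the persist timer check) are hypothetical, and the result is left in host byte order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TH_FIN	0x01
#define TH_SYN	0x02

/*
 * Sequence number for a segment that is not a SACK retransmission:
 * segments that occupy sequence space (data, SYN or FIN) and persist
 * probes use snd_nxt; a pure ACK uses snd_max.
 */
uint32_t
choose_seq(int32_t len, uint8_t flags, bool persist_active,
    uint32_t snd_nxt, uint32_t snd_max)
{
	assert(len >= 0);
	if (len != 0 || (flags & (TH_SYN | TH_FIN)) != 0 || persist_active)
		return (snd_nxt);
	return (snd_max);
}
```

During a retransmission with snd_nxt at 100 and snd_max at 200, a pure ACK is therefore numbered 200 while a 10-byte data segment is numbered 100.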
|
2000-01-09 19:17:30 +00:00
|
|
|
th->th_ack = htonl(tp->rcv_nxt);
|
1994-05-24 10:09:53 +00:00
|
|
|
if (optlen) {
|
2000-01-09 19:17:30 +00:00
|
|
|
bcopy(opt, th + 1, optlen);
|
|
|
|
th->th_off = (sizeof (struct tcphdr) + optlen) >> 2;
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
2000-01-09 19:17:30 +00:00
|
|
|
th->th_flags = flags;
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
|
|
|
* Calculate receive window. Don't shrink window,
|
|
|
|
* but avoid silly window syndrome.
|
2018-11-22 19:49:52 +00:00
|
|
|
* If a RST segment is sent, advertise a window of zero.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
2018-11-22 19:49:52 +00:00
|
|
|
if (flags & TH_RST) {
|
2004-01-22 23:22:14 +00:00
|
|
|
recwin = 0;
|
2018-11-22 19:49:52 +00:00
|
|
|
} else {
|
|
|
|
if (recwin < (so->so_rcv.sb_hiwat / 4) &&
|
|
|
|
recwin < tp->t_maxseg)
|
|
|
|
recwin = 0;
|
|
|
|
if (SEQ_GT(tp->rcv_adv, tp->rcv_nxt) &&
|
|
|
|
recwin < (tp->rcv_adv - tp->rcv_nxt))
|
|
|
|
recwin = (tp->rcv_adv - tp->rcv_nxt);
|
|
|
|
}
|
2007-06-09 21:19:12 +00:00
|
|
|
/*
|
|
|
|
* According to RFC1323 the window field in a SYN (i.e., a <SYN>
|
|
|
|
* or <SYN,ACK>) segment itself is never scaled. The <SYN,ACK>
|
|
|
|
* case is handled in syncache.
|
|
|
|
*/
|
|
|
|
if (flags & TH_SYN)
|
|
|
|
th->th_win = htons((u_short)
|
|
|
|
(min(sbspace(&so->so_rcv), TCP_MAXWIN)));
|
2020-04-29 22:01:33 +00:00
|
|
|
else {
|
|
|
|
/* Avoid shrinking window with window scaling. */
|
|
|
|
recwin = roundup2(recwin, 1 << tp->rcv_scale);
|
2007-06-09 21:19:12 +00:00
|
|
|
th->th_win = htons((u_short)(recwin >> tp->rcv_scale));
|
2020-04-29 22:01:33 +00:00
|
|
|
}
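The rounding step above matters because the 16-bit window field drops the low rcv_scale bits. The sketch below shows the effect; TOY_ROUNDUP2, scaled_window_field() and advertised_bytes() are hypothetical names, and the caller is assumed to have limited recwin so that the shifted value fits in 16 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Round x up to a multiple of y, where y must be a power of two. */
#define TOY_ROUNDUP2(x, y)	(((x) + ((y) - 1)) & ~((y) - 1))

/* Window field to put on the wire for a receive window of 'recwin' bytes. */
uint16_t
scaled_window_field(uint32_t recwin, unsigned rcv_scale)
{
	uint32_t unit;

	assert(rcv_scale <= 14);	/* largest shift TCP window scaling allows */
	unit = (uint32_t)1 << rcv_scale;
	recwin = TOY_ROUNDUP2(recwin, unit);
	return ((uint16_t)(recwin >> rcv_scale));
}

/* Bytes the peer believes it may send for a given window field. */
uint32_t
advertised_bytes(uint16_t field, unsigned rcv_scale)
{
	assert(rcv_scale <= 14);
	return ((uint32_t)field << rcv_scale);
}
```

With recwin at 1001 and a scale of 3, a plain shift would advertise 125 units, or 1000 bytes, which is one byte less than intended. Rounding up first advertises 126 units, or 1008 bytes.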
|
2001-12-02 08:49:29 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Adjust the RXWIN0SENT flag - indicate that we have advertised
|
|
|
|
* a 0 window. This may cause the remote transmitter to stall. This
|
|
|
|
* flag tells soreceive() to disable delayed acknowledgements when
|
|
|
|
* draining the buffer. This can occur if the receiver is attempting
|
2008-07-15 10:32:35 +00:00
|
|
|
* to read more data than can be buffered prior to transmitting on
|
2001-12-02 08:49:29 +00:00
|
|
|
* the connection.
|
|
|
|
*/
|
2010-11-17 18:55:12 +00:00
|
|
|
if (th->th_win == 0) {
|
|
|
|
tp->t_sndzerowin++;
|
2001-12-02 08:49:29 +00:00
|
|
|
tp->t_flags |= TF_RXWIN0SENT;
|
2010-11-17 18:55:12 +00:00
|
|
|
} else
|
2001-12-02 08:49:29 +00:00
|
|
|
tp->t_flags &= ~TF_RXWIN0SENT;
|
1994-05-24 10:09:53 +00:00
|
|
|
if (SEQ_GT(tp->snd_up, tp->snd_nxt)) {
|
2000-01-09 19:17:30 +00:00
|
|
|
th->th_urp = htons((u_short)(tp->snd_up - tp->snd_nxt));
|
|
|
|
th->th_flags |= TH_URG;
|
1994-05-24 10:09:53 +00:00
|
|
|
} else
|
|
|
|
/*
|
|
|
|
* If no urgent pointer to send, then we pull
|
|
|
|
* the urgent pointer to the left edge of the send window
|
|
|
|
* so that it doesn't drift into the send window on sequence
|
|
|
|
* number wraparound.
|
|
|
|
*/
|
|
|
|
tp->snd_up = tp->snd_una; /* drag it along */
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Put TCP length in extended header, and then
|
|
|
|
* checksum extended header and data.
|
|
|
|
*/
|
2000-01-09 19:17:30 +00:00
|
|
|
m->m_pkthdr.len = hdrlen + len; /* in6_cksum() needs this */
|
2012-05-25 02:23:26 +00:00
|
|
|
m->m_pkthdr.csum_data = offsetof(struct tcphdr, th_sum);
|
2017-02-06 08:49:57 +00:00
|
|
|
|
|
|
|
#if defined(IPSEC_SUPPORT) || defined(TCP_SIGNATURE)
|
|
|
|
if (to.to_flags & TOF_SIGNATURE) {
|
|
|
|
/*
|
|
|
|
* Calculate MD5 signature and put it into the place
|
|
|
|
* determined before.
|
|
|
|
* NOTE: since TCP options buffer doesn't point into
|
|
|
|
* mbuf's data, calculate offset and use it.
|
|
|
|
*/
|
2017-12-14 12:54:20 +00:00
|
|
|
if (!TCPMD5_ENABLED() || (error = TCPMD5_OUTPUT(m, th,
|
|
|
|
(u_char *)(th + 1) + (to.to_signature - opt))) != 0) {
|
2017-02-06 08:49:57 +00:00
|
|
|
/*
|
|
|
|
* Do not send segment if the calculation of MD5
|
|
|
|
* digest has failed.
|
|
|
|
*/
|
2017-12-14 12:54:20 +00:00
|
|
|
m_freem(m);
|
2017-02-06 08:49:57 +00:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
#endif
|
2000-01-09 19:17:30 +00:00
|
|
|
#ifdef INET6
|
2012-05-25 02:23:26 +00:00
|
|
|
if (isipv6) {
|
2000-01-09 19:17:30 +00:00
|
|
|
/*
|
2017-01-30 04:51:18 +00:00
|
|
|
* There is no need to fill in ip6_plen right now.
|
|
|
|
* It will be filled later by ip6_output.
|
2000-01-09 19:17:30 +00:00
|
|
|
*/
|
It turns out that too many drivers are not only parsing the L2/3/4
headers for TSO but also for generic checksum offloading. Ideally we
would only have one common function shared amongst all drivers, and
perhaps when updating them for IPv6 we should introduce that.
Eventually we should provide the meta information along with mbufs to
avoid (re-)parsing entirely.
To not break IPv6 (checksums and offload) and to be able to MFC the
changes without risking to hurt 3rd party drivers, duplicate the v4
framework, as other OSes have done as well.
Introduce interface capability flags for TX/RX checksum offload with
IPv6, to allow independent toggling (where possible). Add CSUM_*_IPV6
flags for UDP/TCP over IPv6, and reserve further for SCTP, and IPv6
fragmentation. Define CSUM_DELAY_DATA_IPV6 as we do for legacy IP and
add an alias for CSUM_DATA_VALID_IPV6.
This pretty much brings IPv6 handling in line with IPv4.
TSO is still handled in a different way and not via if_hwassist.
Update ifconfig to allow (un)setting of the new capability flags.
Update loopback to announce the new capabilities and if_hwassist flags.
Individual driver updates will have to follow, as will SCTP.
Reported by: gallatin, dim, ..
Reviewed by: gallatin (glanced at?)
MFC after: 3 days
X-MFC with: r235961,235959,235958
2012-05-28 09:30:13 +00:00
|
|
|
m->m_pkthdr.csum_flags = CSUM_TCP_IPV6;
|
2012-05-25 02:23:26 +00:00
|
|
|
th->th_sum = in6_cksum_pseudo(ip6, sizeof(struct tcphdr) +
|
|
|
|
optlen + len, IPPROTO_TCP, 0);
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
#if defined(INET6) && defined(INET)
|
2000-01-09 19:17:30 +00:00
|
|
|
else
|
2012-05-25 02:23:26 +00:00
|
|
|
#endif
|
|
|
|
#ifdef INET
|
2004-08-16 18:32:07 +00:00
|
|
|
{
|
2012-05-28 09:30:13 +00:00
|
|
|
m->m_pkthdr.csum_flags = CSUM_TCP;
|
2004-08-16 18:32:07 +00:00
|
|
|
th->th_sum = in_pseudo(ip->ip_src.s_addr, ip->ip_dst.s_addr,
|
|
|
|
htons(sizeof(struct tcphdr) + IPPROTO_TCP + len + optlen));
|
|
|
|
|
|
|
|
/* IP version must be set here for ipv4/ipv6 checking later */
|
|
|
|
KASSERT(ip->ip_v == IPVERSION,
|
|
|
|
("%s: IP version incorrect: %d", __func__, ip->ip_v));
|
|
|
|
}
|
2012-05-25 02:23:26 +00:00
|
|
|
#endif
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2006-09-07 12:53:01 +00:00
|
|
|
/*
|
|
|
|
* Enable TSO and specify the size of the segments.
|
|
|
|
* The TCP pseudo header checksum is always provided.
|
|
|
|
*/
|
|
|
|
if (tso) {
|
2016-01-07 00:14:42 +00:00
|
|
|
KASSERT(len > tp->t_maxseg - optlen,
|
2010-08-14 21:41:33 +00:00
|
|
|
("%s: len <= tso_segsz", __func__));
|
2010-04-19 15:15:36 +00:00
|
|
|
m->m_pkthdr.csum_flags |= CSUM_TSO;
|
2016-01-07 00:14:42 +00:00
|
|
|
m->m_pkthdr.tso_segsz = tp->t_maxseg - optlen;
|
2006-09-07 12:53:01 +00:00
|
|
|
}
|
|
|
|
|
2019-03-23 09:56:41 +00:00
|
|
|
KASSERT(len + hdrlen == m_length(m, NULL),
|
|
|
|
("%s: mbuf chain shorter than expected: %d + %u != %u",
|
|
|
|
__func__, len, hdrlen, m_length(m, NULL)));
|
2010-09-17 22:05:27 +00:00
|
|
|
|
In the TCP stack, the hhook(9) framework provides hooks for kernel modules
to add actions that run when a TCP frame is sent or received on a TCP
session in the ESTABLISHED state. In the base tree, this functionality is
only used for the h_ertt module, which is used by the cc_cdg, cc_chd, cc_hd,
and cc_vegas congestion control modules.
Presently, we incur overhead to check for hooks each time a TCP frame is
sent or received on an ESTABLISHED TCP session.
This change adds a new compile-time option (TCP_HHOOK) to determine whether
to include the hhook(9) framework for TCP. To retain backwards
compatibility, I added the TCP_HHOOK option to every configuration file that
already defined "options INET". (Therefore, this patch introduces no
functional change. In order to see a functional difference, you need to
compile a custom kernel without the TCP_HHOOK option.) This change will
allow users to easily exclude this functionality from their kernel, should
they wish to do so.
Note that any users who use a custom kernel configuration and use one of the
congestion control modules listed above will need to add the TCP_HHOOK
option to their kernel configuration.
Reviewed by: rrs, lstewart, hiren (previous version), sjg (makefiles only)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D8185
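The effect of a compile-time option like this can be sketched outside the kernel. In the hypothetical example below, `DEMO_HHOOK` stands in for TCP_HHOOK, and `demo_send` and `demo_run_hooks` are made-up names. When the macro is not defined, the hook call is not compiled at all, so the send path carries no per-packet check.

```c
#include <assert.h>

/* Counts hook invocations; stays 0 when DEMO_HHOOK is not defined. */
static int demo_hook_calls;

#ifdef DEMO_HHOOK
/* Hypothetical stand-in for running the registered helper hooks. */
static void
demo_run_hooks(int len)
{
	(void)len;
	demo_hook_calls++;
}
#endif

/* Hypothetical send path: the hook call exists only if compiled in. */
static int
demo_send(int len)
{
	assert(len >= 0);
#ifdef DEMO_HHOOK
	demo_run_hooks(len);
#endif
	return (len);
}
```

Building with `-DDEMO_HHOOK` plays the role of adding the option to a kernel configuration. Without it, `demo_hook_calls` remains 0 no matter how often `demo_send()` is called.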
2016-10-12 02:16:42 +00:00
|
|
|
#ifdef TCP_HHOOK
|
2010-12-28 12:13:30 +00:00
|
|
|
/* Run HHOOK_TCP_ESTABLISHED_OUT helper hooks. */
|
|
|
|
hhook_run_tcp_est_out(tp, th, &to, len, tso);
|
2016-10-12 02:16:42 +00:00
|
|
|
#endif
|
2010-12-28 12:13:30 +00:00
|
|
|
|
1994-09-15 10:36:56 +00:00
|
|
|
#ifdef TCPDEBUG
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
|
|
|
* Trace.
|
|
|
|
*/
|
2003-08-13 08:50:42 +00:00
|
|
|
if (so->so_options & SO_DEBUG) {
|
2004-06-18 09:53:58 +00:00
|
|
|
u_short save = 0;
|
2004-06-18 03:31:07 +00:00
|
|
|
#ifdef INET6
|
|
|
|
if (!isipv6)
|
|
|
|
#endif
|
|
|
|
{
|
|
|
|
save = ipov->ih_len;
|
|
|
|
ipov->ih_len = htons(m->m_pkthdr.len /* - hdrlen + (th->th_off << 2) */);
|
|
|
|
}
|
2000-01-09 19:17:30 +00:00
|
|
|
tcp_trace(TA_OUTPUT, tp->t_state, tp, mtod(m, void *), th, 0);
|
2004-06-18 03:31:07 +00:00
|
|
|
#ifdef INET6
|
|
|
|
if (!isipv6)
|
|
|
|
#endif
|
2003-08-13 08:50:42 +00:00
|
|
|
ipov->ih_len = save;
|
|
|
|
}
|
2011-04-30 11:21:29 +00:00
|
|
|
#endif /* TCPDEBUG */
|
2017-01-04 02:19:13 +00:00
|
|
|
TCP_PROBE3(debug__output, tp, th, m);
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2018-04-10 15:54:29 +00:00
|
|
|
/* We're getting ready to send; log now. */
|
|
|
|
TCP_LOG_EVENT(tp, th, &so->so_rcv, &so->so_snd, TCP_LOG_OUT, ERRNO_UNK,
|
|
|
|
len, NULL, false);
|
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
|
|
|
* Fill in IP length and desired time to live and
|
|
|
|
* send to IP level. There should be a better way
|
|
|
|
* to handle ttl and tos; we could keep them in
|
|
|
|
* the template, but need a way to checksum without them.
|
|
|
|
*/
|
2000-01-09 19:17:30 +00:00
|
|
|
/*
|
2014-07-03 23:12:43 +00:00
|
|
|
* m->m_pkthdr.len should have been set before checksum calculation,
|
2000-01-09 19:17:30 +00:00
|
|
|
* because in6_cksum() needs it.
|
|
|
|
*/
|
|
|
|
#ifdef INET6
|
|
|
|
if (isipv6) {
|
2000-07-04 16:35:15 +00:00
|
|
|
/*
|
2000-01-09 19:17:30 +00:00
|
|
|
* we separately set hoplimit for every segment, since the
|
|
|
|
* user might want to change the value via setsockopt.
|
|
|
|
* Also, desired default hop limit might be changed via
|
2000-07-04 16:35:15 +00:00
|
|
|
* Neighbor Discovery.
|
|
|
|
*/
|
2003-11-20 20:07:39 +00:00
|
|
|
ip6->ip6_hlim = in6_selecthlim(tp->t_inpcb, NULL);
|
2000-01-09 19:17:30 +00:00
|
|
|
|
2013-08-25 21:54:41 +00:00
|
|
|
/*
|
|
|
|
* Set the packet size here for the benefit of DTrace probes.
|
|
|
|
* ip6_output() will set it properly; it's supposed to include
|
|
|
|
* the option header lengths as well.
|
|
|
|
*/
|
|
|
|
ip6->ip6_plen = htons(m->m_pkthdr.len - sizeof(*ip6));
|
|
|
|
|
2016-01-07 00:14:42 +00:00
|
|
|
if (V_path_mtu_discovery && tp->t_maxseg > V_tcp_minmss)
|
2014-10-13 21:05:29 +00:00
|
|
|
tp->t_flags2 |= TF2_PLPMTU_PMTUD;
|
|
|
|
else
|
|
|
|
tp->t_flags2 &= ~TF2_PLPMTU_PMTUD;
|
|
|
|
|
2013-08-25 21:54:41 +00:00
|
|
|
if (tp->t_state == TCPS_SYN_SENT)
|
2013-11-26 08:46:27 +00:00
|
|
|
TCP_PROBE5(connect__request, NULL, tp, ip6, tp, th);
|
2013-08-25 21:54:41 +00:00
|
|
|
|
|
|
|
TCP_PROBE5(send, NULL, tp, ip6, tp, th);
|
|
|
|
|
There are times when it would be really nice to have a record of the last few
packets and/or state transitions from each TCP socket. That would help with
narrowing down certain problems we see in the field that are hard to reproduce
without understanding the history of how we got into a certain state. This
change provides just that.
It saves copies of the last N packets in a list in the tcpcb. When the tcpcb is
destroyed, the list is freed. I thought this was likely to be more
performance-friendly than saving copies of the tcpcb. Plus, with the packets,
you should be able to reverse-engineer what happened to the tcpcb.
To enable the feature, you will need to compile a kernel with the TCPPCAP
option. Even then, the feature defaults to being deactivated. You can activate
it by setting a positive value for the number of captured packets. You can do
that on either a global basis or on a per-socket basis (via a setsockopt call).
There is no way to get the packets out of the kernel other than using kmem or
getting a coredump. I thought that would help some of the legal/privacy concerns
regarding such a feature. However, it should be possible to add a future effort
to export them in PCAP format.
I tested this at low scale, and found that there were no mbuf leaks and the peak
mbuf usage appeared to be unchanged with and without the feature.
The main performance concern I can envision is the number of mbufs that would be
used on systems with a large number of sockets. If you save five packets per
direction per socket and have 3,000 sockets, that will consume at least 30,000
mbufs just to keep these packets. I tried to reduce the concerns associated with
this by limiting the number of clusters (not mbufs) that could be used for this
feature. Again, in my testing, that appears to work correctly.
Differential Revision: D3100
Submitted by: Jonathan Looney <jlooney at juniper dot net>
Reviewed by: gnn, hiren
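The mbuf estimate in the message above can be reproduced with a small calculation. This is only a sketch: `pcap_min_mbufs` is a hypothetical helper, and it assumes each saved packet occupies at least one mbuf, which is why the result is a lower bound.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: lower bound on mbufs held by the capture lists,
 * assuming at least one mbuf per saved packet and two directions
 * (inbound and outbound) per socket.
 */
static uint64_t
pcap_min_mbufs(uint64_t sockets, uint64_t pkts_per_dir)
{
	const uint64_t directions = 2;

	/* Guard against overflow in the multiplication below. */
	assert(pkts_per_dir == 0 ||
	    sockets <= UINT64_MAX / (directions * pkts_per_dir));
	return (sockets * directions * pkts_per_dir);
}
```

With 3,000 sockets and five packets per direction, `pcap_min_mbufs(3000, 5)` gives 30,000, matching the figure quoted in the message.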
2015-10-14 00:35:37 +00:00
|
|
|
#ifdef TCPPCAP
|
|
|
|
/* Save packet, if requested. */
|
|
|
|
tcp_pcap_add(th, m, &(tp->t_outpkts));
|
|
|
|
#endif
|
|
|
|
|
2000-01-09 19:17:30 +00:00
|
|
|
/* TODO: IPv6 IP6TOS_ECT bit on */
|
2017-03-27 23:48:36 +00:00
|
|
|
error = ip6_output(m, tp->t_inpcb->in6p_outputopts,
|
|
|
|
&tp->t_inpcb->inp_route6,
|
2012-07-16 07:08:34 +00:00
|
|
|
((so->so_options & SO_DONTROUTE) ? IP_ROUTETOIF : 0),
|
|
|
|
NULL, NULL, tp->t_inpcb);
|
|
|
|
|
2020-04-25 09:06:11 +00:00
|
|
|
if (error == EMSGSIZE && tp->t_inpcb->inp_route6.ro_nh != NULL)
|
|
|
|
mtu = tp->t_inpcb->inp_route6.ro_nh->nh_mtu;
|
2011-04-30 11:21:29 +00:00
|
|
|
}
|
2000-01-09 19:17:30 +00:00
|
|
|
#endif /* INET6 */
|
2011-04-30 11:21:29 +00:00
|
|
|
#if defined(INET) && defined(INET6)
|
|
|
|
else
|
|
|
|
#endif
|
|
|
|
#ifdef INET
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
2012-10-22 21:09:03 +00:00
|
|
|
ip->ip_len = htons(m->m_pkthdr.len);
|
2000-07-04 16:35:15 +00:00
|
|
|
#ifdef INET6
|
2008-11-27 13:19:42 +00:00
|
|
|
if (tp->t_inpcb->inp_vflag & INP_IPV6PROTO)
|
2004-08-16 18:32:07 +00:00
|
|
|
ip->ip_ttl = in6_selecthlim(tp->t_inpcb, NULL);
|
2000-07-04 16:35:15 +00:00
|
|
|
#endif /* INET6 */
|
1995-09-20 21:00:59 +00:00
|
|
|
/*
|
2003-11-20 20:07:39 +00:00
|
|
|
* If we do path MTU discovery, then we set DF on every packet.
|
|
|
|
* This might not be the best thing to do according to RFC3390
|
|
|
|
* Section 2. However the tcp hostcache mitigates the problem
|
|
|
|
* so it affects only the first tcp connection with a host.
|
2010-08-15 13:25:18 +00:00
|
|
|
*
|
|
|
|
* NB: Don't set DF on small MTU/MSS to have a safe fallback.
|
1995-09-20 21:00:59 +00:00
|
|
|
*/
|
2016-01-07 00:14:42 +00:00
|
|
|
if (V_path_mtu_discovery && tp->t_maxseg > V_tcp_minmss) {
|
2012-10-22 21:09:03 +00:00
|
|
|
ip->ip_off |= htons(IP_DF);
|
2014-10-07 21:50:28 +00:00
|
|
|
tp->t_flags2 |= TF2_PLPMTU_PMTUD;
|
|
|
|
} else {
|
|
|
|
tp->t_flags2 &= ~TF2_PLPMTU_PMTUD;
|
|
|
|
}
|
2003-11-20 20:07:39 +00:00
|
|
|
|
2013-08-25 21:54:41 +00:00
|
|
|
if (tp->t_state == TCPS_SYN_SENT)
|
2013-11-26 08:46:27 +00:00
|
|
|
TCP_PROBE5(connect__request, NULL, tp, ip, tp, th);
|
2013-08-25 21:54:41 +00:00
|
|
|
|
|
|
|
TCP_PROBE5(send, NULL, tp, ip, tp, th);
|
|
|
|
|
2015-10-14 00:35:37 +00:00
|
|
|
#ifdef TCPPCAP
|
|
|
|
/* Save packet, if requested. */
|
|
|
|
tcp_pcap_add(th, m, &(tp->t_outpkts));
|
|
|
|
#endif
|
|
|
|
|
2016-03-24 07:54:56 +00:00
|
|
|
error = ip_output(m, tp->t_inpcb->inp_options, &tp->t_inpcb->inp_route,
|
2004-09-05 02:34:12 +00:00
|
|
|
((so->so_options & SO_DONTROUTE) ? IP_ROUTETOIF : 0), 0,
|
|
|
|
tp->t_inpcb);
|
2012-07-16 07:08:34 +00:00
|
|
|
|
2020-04-25 09:06:11 +00:00
|
|
|
if (error == EMSGSIZE && tp->t_inpcb->inp_route.ro_nh != NULL)
|
|
|
|
mtu = tp->t_inpcb->inp_route.ro_nh->nh_mtu;
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
2011-04-30 11:21:29 +00:00
|
|
|
#endif /* INET */
|
Fix tcp_output() so that the tcpcb is updated in the same manner when an
mbuf allocation fails as when ip_output() returns an error.
To achieve that, move the large block of code that updates the tcpcb below
the out: label.
This fixes a panic that requires the following sequence of events:
1) The SYN is sent to the network; tp->snd_nxt = iss + 1, tp->snd_una = iss.
2) The retransmit timeout fires for the SYN we sent;
tcp_timer_rexmt() sets tp->snd_nxt = tp->snd_una and calls tcp_output().
In tcp_output(), m_get() fails.
3) Later, the SYN|ACK for the SYN sent in step 1) arrives;
tcp_input() sets tp->snd_una += 1, which leads to the inconsistency
tp->snd_una > tp->snd_nxt, which later panics in the
socket buffer code.
For reference, this bug was also fixed in the DragonFly BSD repository:
http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/1ff9b7d322dc5a26f7173aa8c38ecb79da80e419
Reviewed by: andre
Tested by: pho
Sponsored by: Nginx, Inc.
PR: kern/177456
Submitted by: HouYeFei&XiBoLiu <lglion718 163.com>
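The three-step sequence in the message above can be replayed in a few lines. This is a simplified model, not the kernel code: `struct demo_tcpcb` and `demo_syn_race` are hypothetical, and the `update_on_failure` flag stands in for the behavior after the fix, where the sequence-space update runs even when the mbuf allocation fails.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t tcp_seq;

/* Wraparound-safe sequence comparison, in the style of the TCP headers. */
#define SEQ_GT(a, b)	((int32_t)((a) - (b)) > 0)

/* Hypothetical, stripped-down control block. */
struct demo_tcpcb {
	tcp_seq snd_una;	/* oldest unacknowledged sequence number */
	tcp_seq snd_nxt;	/* next sequence number to send */
};

/*
 * Replay the sequence from the commit message. Returns true if the
 * control block ends up inconsistent (snd_una > snd_nxt).
 */
static bool
demo_syn_race(tcp_seq iss, bool update_on_failure)
{
	struct demo_tcpcb tp;

	/* 1) The SYN is sent. */
	tp.snd_una = iss;
	tp.snd_nxt = iss + 1;
	assert(!SEQ_GT(tp.snd_una, tp.snd_nxt));

	/* 2) The retransmit timer rewinds snd_nxt; the output attempt fails. */
	tp.snd_nxt = tp.snd_una;
	if (update_on_failure)
		tp.snd_nxt++;	/* the SYN occupies one sequence number */

	/* 3) The SYN|ACK for the original SYN arrives. */
	tp.snd_una += 1;

	return (SEQ_GT(tp.snd_una, tp.snd_nxt));
}
```

In this model `demo_syn_race(iss, false)` reports the inconsistency, while `demo_syn_race(iss, true)` does not, including when `iss` sits at the 32-bit wraparound point.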
2013-04-11 18:23:56 +00:00
|
|
|
|
|
|
|
out:
|
|
|
|
/*
|
|
|
|
* In transmit state, time the transmission and arrange for
|
|
|
|
* the retransmit. In persist state, just set snd_max.
|
|
|
|
*/
|
2020-02-12 13:31:36 +00:00
|
|
|
if ((tp->t_flags & TF_FORCEDATA) == 0 ||
|
2013-04-11 18:23:56 +00:00
|
|
|
!tcp_timer_active(tp, TT_PERSIST)) {
|
|
|
|
tcp_seq startseq = tp->snd_nxt;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Advance snd_nxt over sequence space of this segment.
|
|
|
|
*/
|
|
|
|
if (flags & (TH_SYN|TH_FIN)) {
|
|
|
|
if (flags & TH_SYN)
|
|
|
|
tp->snd_nxt++;
|
|
|
|
if (flags & TH_FIN) {
|
|
|
|
tp->snd_nxt++;
|
|
|
|
tp->t_flags |= TF_SENTFIN;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (sack_rxmit)
|
|
|
|
goto timer;
|
|
|
|
tp->snd_nxt += len;
|
|
|
|
if (SEQ_GT(tp->snd_nxt, tp->snd_max)) {
|
|
|
|
tp->snd_max = tp->snd_nxt;
|
|
|
|
/*
|
|
|
|
* Time this transmission if not a retransmission and
|
|
|
|
* not currently timing anything.
|
|
|
|
*/
|
2020-06-24 13:42:42 +00:00
|
|
|
tp->t_sndtime = ticks;
|
2013-04-11 18:23:56 +00:00
|
|
|
if (tp->t_rtttime == 0) {
|
|
|
|
tp->t_rtttime = ticks;
|
|
|
|
tp->t_rtseq = startseq;
|
|
|
|
TCPSTAT_INC(tcps_segstimed);
|
|
|
|
}
|
2019-12-02 20:58:04 +00:00
|
|
|
#ifdef STATS
|
|
|
|
if (!(tp->t_flags & TF_GPUTINPROG) && len) {
|
|
|
|
tp->t_flags |= TF_GPUTINPROG;
|
|
|
|
tp->gput_seq = startseq;
|
|
|
|
tp->gput_ack = startseq +
|
|
|
|
ulmin(sbavail(&so->so_snd) - off, sendwin);
|
|
|
|
tp->gput_ts = tcp_ts_getticks();
|
|
|
|
}
|
|
|
|
#endif /* STATS */
|
2013-04-11 18:23:56 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Set retransmit timer if not currently set,
|
|
|
|
* and not doing a pure ack or a keep-alive probe.
|
|
|
|
* Initial value for retransmit timer is smoothed
|
|
|
|
* round-trip time + 2 * round-trip time variance.
|
|
|
|
* Initialize shift counter which is used for backoff
|
|
|
|
* of retransmit time.
|
|
|
|
*/
|
|
|
|
timer:
|
|
|
|
if (!tcp_timer_active(tp, TT_REXMT) &&
|
|
|
|
((sack_rxmit && tp->snd_nxt != tp->snd_max) ||
|
|
|
|
(tp->snd_nxt != tp->snd_una))) {
|
|
|
|
if (tcp_timer_active(tp, TT_PERSIST)) {
|
|
|
|
tcp_timer_activate(tp, TT_PERSIST, 0);
|
|
|
|
tp->t_rxtshift = 0;
|
|
|
|
}
|
|
|
|
tcp_timer_activate(tp, TT_REXMT, tp->t_rxtcur);
|
2015-06-29 21:23:54 +00:00
|
|
|
} else if (len == 0 && sbavail(&so->so_snd) &&
|
|
|
|
!tcp_timer_active(tp, TT_REXMT) &&
|
|
|
|
!tcp_timer_active(tp, TT_PERSIST)) {
|
|
|
|
/*
|
|
|
|
* Avoid a situation where we do not set persist timer
|
|
|
|
* after a zero window condition. For example:
|
|
|
|
* 1) A -> B: packet with enough data to fill the window
|
|
|
|
* 2) B -> A: ACK for #1 + new data (0 window
|
|
|
|
* advertisement)
|
|
|
|
* 3) A -> B: ACK for #2, 0 len packet
|
|
|
|
*
|
|
|
|
* In this case, A will not activate the persist timer,
|
|
|
|
* because it chose to send a packet. Unless tcp_output
|
|
|
|
* is called for some other reason (delayed ack timer,
|
|
|
|
* another input packet from B, socket syscall), A will
|
|
|
|
* not send zero window probes.
|
|
|
|
*
|
|
|
|
* So, if you send a 0-length packet, but there is data
|
|
|
|
* in the socket buffer, and neither the rexmt nor
|
|
|
|
* persist timer is already set, then activate the
|
|
|
|
* persist timer.
|
|
|
|
*/
|
|
|
|
tp->t_rxtshift = 0;
|
|
|
|
tcp_setpersist(tp);
|
2013-04-11 18:23:56 +00:00
|
|
|
}
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* Persist case, update snd_max but since we are in
|
|
|
|
* persist mode (no window) we do not update snd_nxt.
|
|
|
|
*/
|
|
|
|
int xlen = len;
|
|
|
|
if (flags & TH_SYN)
|
|
|
|
++xlen;
|
|
|
|
if (flags & TH_FIN) {
|
|
|
|
++xlen;
|
|
|
|
tp->t_flags |= TF_SENTFIN;
|
|
|
|
}
|
|
|
|
if (SEQ_GT(tp->snd_nxt + xlen, tp->snd_max))
|
2016-10-06 16:00:48 +00:00
|
|
|
tp->snd_max = tp->snd_nxt + xlen;
|
2013-04-11 18:23:56 +00:00
|
|
|
}
|
2019-07-14 16:05:47 +00:00
|
|
|
if ((error == 0) &&
|
|
|
|
(TCPS_HAVEESTABLISHED(tp->t_state) &&
|
|
|
|
(tp->t_flags & TF_SACK_PERMIT) &&
|
|
|
|
tp->rcv_numsacks > 0)) {
|
|
|
|
/* Clean up any DSACK's sent */
|
|
|
|
tcp_clean_dsack_blocks(tp);
|
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
if (error) {
|
2018-03-22 09:40:08 +00:00
|
|
|
/* Record the error. */
|
|
|
|
TCP_LOG_EVENT(tp, NULL, &so->so_rcv, &so->so_snd, TCP_LOG_OUT,
|
|
|
|
error, 0, NULL, false);
|
2000-08-03 23:23:36 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We know that the packet was lost, so back out the
|
|
|
|
* sequence number advance, if any.
|
2006-09-28 18:02:46 +00:00
|
|
|
*
|
|
|
|
* If the error is EPERM the packet got blocked by the
|
|
|
|
* local firewall. Normally we should terminate the
|
|
|
|
* connection but the blocking may have been spurious
|
|
|
|
* due to a firewall reconfiguration cycle. So we treat
|
|
|
|
* it like a packet loss and let the retransmit timer and
|
|
|
|
* timeouts do their work over time.
|
|
|
|
* XXX: It is a POLA question whether calling tcp_drop right
|
|
|
|
* away would be the really correct behavior instead.
|
2000-08-03 23:23:36 +00:00
|
|
|
*/
|
2007-02-28 12:41:49 +00:00
|
|
|
if (((tp->t_flags & TF_FORCEDATA) == 0 ||
|
2007-04-11 09:45:16 +00:00
|
|
|
!tcp_timer_active(tp, TT_PERSIST)) &&
|
2007-02-28 12:41:49 +00:00
|
|
|
((flags & TH_SYN) == 0) &&
|
|
|
|
(error != EPERM)) {
|
|
|
|
if (sack_rxmit) {
|
|
|
|
p->rxmit -= len;
|
|
|
|
tp->sackhint.sack_bytes_rexmit -= len;
|
|
|
|
KASSERT(tp->sackhint.sack_bytes_rexmit >= 0,
|
|
|
|
("sackhint bytes rtx >= 0"));
|
|
|
|
} else
|
|
|
|
tp->snd_nxt -= len;
|
2006-09-28 18:02:46 +00:00
|
|
|
}
|
2004-10-09 16:48:51 +00:00
|
|
|
SOCKBUF_UNLOCK_ASSERT(&so->so_snd); /* Check gotos. */
|
2007-02-28 12:41:49 +00:00
|
|
|
switch (error) {
|
2017-02-06 08:49:57 +00:00
|
|
|
case EACCES:
|
2007-02-28 12:41:49 +00:00
|
|
|
case EPERM:
|
|
|
|
tp->t_softerror = error;
|
|
|
|
return (error);
|
|
|
|
case ENOBUFS:
|
tcp: Don't prematurely drop receiving-only connections
If a connection is long-lived and receive-only, several (12) sporadic
ENOBUFS errors from the device could cause the connection to be
dropped prematurely:
Upon ENOBUFS in tcp_output() for an ACK, the retransmission timer is
started. Nothing stops this retransmission timer on a receive-only
connection, so the timer is guaranteed to expire and t_rxtshift is
guaranteed to be incremented. t_rxtshift is never reset to 0, since no
RTT measurement is done for a receive-only connection. If such a
connection lives long enough (e.g. >350sec, given the RTO starts from
200ms) and suffers 12 sporadic ENOBUFS errors, i.e. t_rxtshift >= 12,
it is dropped prematurely by the retransmission timer.
We now assert that for data segments, SYNs or FINs either the rexmit or
the persist timer was armed upon ENOBUFS, and we do not set the rexmit
timer in other cases, i.e. ENOBUFS upon ACKs.
Discussed with: lstewart, hiren, jtl, Mike Karels
MFC after: 3 weeks
Sponsored by: Microsoft OSTC
Differential Revision: https://reviews.freebsd.org/D5872
2016-05-30 03:31:37 +00:00
|
|
|
TCP_XMIT_TIMER_ASSERT(tp, len, flags);
|
2005-04-21 12:37:12 +00:00
|
|
|
tp->snd_cwnd = tp->t_maxseg;
|
1994-05-24 10:09:53 +00:00
|
|
|
return (0);
|
2007-02-28 12:41:49 +00:00
|
|
|
case EMSGSIZE:
|
1995-10-16 18:21:26 +00:00
|
|
|
/*
|
2006-09-07 12:53:01 +00:00
|
|
|
* For some reason the interface we used initially
|
|
|
|
* to send segments changed to another or lowered
|
|
|
|
* its MTU.
|
|
|
|
* If TSO was active we either got an interface
|
|
|
|
* without TSO capabilities or TSO was turned off.
|
2012-07-16 07:08:34 +00:00
|
|
|
* If we obtained mtu from ip_output() then update
|
|
|
|
* it and try again.
|
1995-10-16 18:21:26 +00:00
|
|
|
*/
|
2006-09-07 12:53:01 +00:00
|
|
|
if (tso)
|
|
|
|
tp->t_flags &= ~TF_TSO;
|
2012-07-16 07:08:34 +00:00
|
|
|
if (mtu != 0) {
|
|
|
|
tcp_mss_update(tp, -1, mtu, NULL, NULL);
|
|
|
|
goto again;
|
|
|
|
}
|
|
|
|
return (error);
|
2007-02-28 12:47:49 +00:00
|
|
|
case EHOSTDOWN:
|
2007-02-28 12:41:49 +00:00
|
|
|
case EHOSTUNREACH:
|
|
|
|
case ENETDOWN:
|
2007-02-28 12:47:49 +00:00
|
|
|
case ENETUNREACH:
|
2007-02-28 12:41:49 +00:00
|
|
|
if (TCPS_HAVERCVDSYN(tp->t_state)) {
|
|
|
|
tp->t_softerror = error;
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
/* FALLTHROUGH */
|
|
|
|
default:
|
|
|
|
return (error);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
}
|
2009-04-11 22:07:19 +00:00
|
|
|
TCPSTAT_INC(tcps_sndtotal);
|
1994-05-24 10:09:53 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Data sent (as far as we can tell).
|
|
|
|
* If this advertises a larger window than any other segment,
|
|
|
|
* then remember the size of the advertised window.
|
|
|
|
* Any pending ACK has now been sent.
|
|
|
|
*/
|
2016-10-06 16:28:34 +00:00
|
|
|
if (SEQ_GT(tp->rcv_nxt + recwin, tp->rcv_adv))
|
2004-01-22 23:22:14 +00:00
|
|
|
tp->rcv_adv = tp->rcv_nxt + recwin;
|
1994-05-24 10:09:53 +00:00
|
|
|
tp->last_ack_sent = tp->rcv_nxt;
|
2003-02-19 21:18:23 +00:00
|
|
|
tp->t_flags &= ~(TF_ACKNOW | TF_DELACK);
|
2007-04-11 09:45:16 +00:00
|
|
|
if (tcp_timer_active(tp, TT_DELACK))
|
|
|
|
tcp_timer_activate(tp, TT_DELACK, 0);
|
2001-11-30 21:33:39 +00:00
|
|
|
#if 0
|
|
|
|
/*
|
|
|
|
* This completely breaks TCP if newreno is turned on. What happens
|
|
|
|
* is that if delayed-acks are turned on on the receiver, this code
|
|
|
|
* on the transmitter effectively destroys the TCP window, forcing
|
|
|
|
* it to four packets (1.5Kx4 = 6K window).
|
|
|
|
*/
|
2010-11-12 06:41:55 +00:00
|
|
|
if (sendalot && --maxburst)
|
1994-05-24 10:09:53 +00:00
|
|
|
goto again;
|
2001-11-30 21:33:39 +00:00
|
|
|
#endif
|
|
|
|
if (sendalot)
|
|
|
|
goto again;
|
1994-05-24 10:09:53 +00:00
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
2007-03-21 19:37:55 +00:00
|
|
|
tcp_setpersist(struct tcpcb *tp)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
1999-08-30 21:17:07 +00:00
|
|
|
int t = ((tp->t_srtt >> 2) + tp->t_rttvar) >> 1;
|
|
|
|
int tt;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2011-04-29 15:40:12 +00:00
|
|
|
tp->t_flags &= ~TF_PREVVALID;
|
2007-04-11 09:45:16 +00:00
|
|
|
if (tcp_timer_active(tp, TT_REXMT))
|
1999-08-30 21:17:07 +00:00
|
|
|
panic("tcp_setpersist: retransmit pending");
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
2016-05-03 18:05:43 +00:00
|
|
|
* Start/restart persistence timer.
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
1999-08-30 21:17:07 +00:00
|
|
|
TCPT_RANGESET(tt, t * tcp_backoff[tp->t_rxtshift],
|
2016-01-26 16:33:38 +00:00
|
|
|
tcp_persmin, tcp_persmax);
|
2007-04-11 09:45:16 +00:00
|
|
|
tcp_timer_activate(tp, TT_PERSIST, tt);
|
1994-05-24 10:09:53 +00:00
|
|
|
if (tp->t_rxtshift < TCP_MAXRXTSHIFT)
|
|
|
|
tp->t_rxtshift++;
|
|
|
|
}
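The interval computed by tcp_setpersist() can be tried out in isolation with the sketch below. It is a simplified model: `demo_persist_ticks` and `demo_backoff` are hypothetical, the backoff table values are an assumption rather than a copy of tcp_backoff[], the clamping is a plain min/max rather than the TCPT_RANGESET macro, and the bounds are passed in as parameters instead of using tcp_persmin and tcp_persmax.

```c
#include <assert.h>

#define DEMO_MAXRXTSHIFT	12

/* Assumed exponential backoff multipliers, capped at 512. */
static const int demo_backoff[DEMO_MAXRXTSHIFT + 1] =
    { 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 512, 512, 512 };

/*
 * Hypothetical model of the persist interval: derive a base value
 * from the scaled srtt and rttvar exactly as the function above does,
 * multiply by the backoff for the current shift, then clamp the
 * result to [persmin, persmax].
 */
static int
demo_persist_ticks(int srtt, int rttvar, int rxtshift, int persmin,
    int persmax)
{
	int t = ((srtt >> 2) + rttvar) >> 1;
	int tt;

	assert(rxtshift >= 0 && rxtshift <= DEMO_MAXRXTSHIFT);
	tt = t * demo_backoff[rxtshift];
	if (tt < persmin)
		tt = persmin;
	if (tt > persmax)
		tt = persmax;
	return (tt);
}
```

With srtt = 800, rttvar = 100 and bounds of 5000 and 60000, the base value is 150. A shift of 0 is clamped up to 5000, a shift of 6 gives 9600, and a shift of 12 is clamped down to 60000.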

/*
 * Insert TCP options according to the supplied parameters to the place
 * optp in a consistent way. Can handle unaligned destinations.
 *
 * The order of the option processing is crucial for optimal packing and
 * alignment for the scarce option space.
 *
 * The optimal order for a SYN/SYN-ACK segment is:
 *   MSS (4) + NOP (1) + Window scale (3) + SACK permitted (2) +
 *   Timestamp (10) + Signature (18) = 38 bytes out of a maximum of 40.
 *
 * The SACK options should be last. SACK blocks consume 8*n+2 bytes.
 * So a full size SACK blocks option is 34 bytes (with 4 SACK blocks).
 * At minimum we need 10 bytes (to generate 1 SACK block). If both
 * TCP Timestamps (12 bytes) and TCP Signatures (18 bytes) are present,
 * we only have 10 bytes for SACK options (40 - (12 + 18)).
 */
int
tcp_addoptions(struct tcpopt *to, u_char *optp)
{
	u_int32_t mask, optlen = 0;

	for (mask = 1; mask < TOF_MAXOPT; mask <<= 1) {
		if ((to->to_flags & mask) != mask)
			continue;
		if (optlen == TCP_MAXOLEN)
			break;
		switch (to->to_flags & mask) {
		case TOF_MSS:
			while (optlen % 4) {
				optlen += TCPOLEN_NOP;
				*optp++ = TCPOPT_NOP;
			}
			if (TCP_MAXOLEN - optlen < TCPOLEN_MAXSEG)
				continue;
			optlen += TCPOLEN_MAXSEG;
			*optp++ = TCPOPT_MAXSEG;
			*optp++ = TCPOLEN_MAXSEG;
			to->to_mss = htons(to->to_mss);
			bcopy((u_char *)&to->to_mss, optp, sizeof(to->to_mss));
			optp += sizeof(to->to_mss);
			break;
		case TOF_SCALE:
			while (!optlen || optlen % 2 != 1) {
				optlen += TCPOLEN_NOP;
				*optp++ = TCPOPT_NOP;
			}
			if (TCP_MAXOLEN - optlen < TCPOLEN_WINDOW)
				continue;
			optlen += TCPOLEN_WINDOW;
			*optp++ = TCPOPT_WINDOW;
			*optp++ = TCPOLEN_WINDOW;
			*optp++ = to->to_wscale;
			break;
		case TOF_SACKPERM:
			while (optlen % 2) {
				optlen += TCPOLEN_NOP;
				*optp++ = TCPOPT_NOP;
			}
			if (TCP_MAXOLEN - optlen < TCPOLEN_SACK_PERMITTED)
				continue;
			optlen += TCPOLEN_SACK_PERMITTED;
			*optp++ = TCPOPT_SACK_PERMITTED;
			*optp++ = TCPOLEN_SACK_PERMITTED;
			break;
		case TOF_TS:
			while (!optlen || optlen % 4 != 2) {
				optlen += TCPOLEN_NOP;
				*optp++ = TCPOPT_NOP;
			}
			if (TCP_MAXOLEN - optlen < TCPOLEN_TIMESTAMP)
				continue;
			optlen += TCPOLEN_TIMESTAMP;
			*optp++ = TCPOPT_TIMESTAMP;
			*optp++ = TCPOLEN_TIMESTAMP;
			to->to_tsval = htonl(to->to_tsval);
			to->to_tsecr = htonl(to->to_tsecr);
			bcopy((u_char *)&to->to_tsval, optp, sizeof(to->to_tsval));
			optp += sizeof(to->to_tsval);
			bcopy((u_char *)&to->to_tsecr, optp, sizeof(to->to_tsecr));
			optp += sizeof(to->to_tsecr);
			break;
		case TOF_SIGNATURE:
			{
			int siglen = TCPOLEN_SIGNATURE - 2;

			while (!optlen || optlen % 4 != 2) {
				optlen += TCPOLEN_NOP;
				*optp++ = TCPOPT_NOP;
			}
			if (TCP_MAXOLEN - optlen < TCPOLEN_SIGNATURE) {
				to->to_flags &= ~TOF_SIGNATURE;
				continue;
			}
			optlen += TCPOLEN_SIGNATURE;
			*optp++ = TCPOPT_SIGNATURE;
			*optp++ = TCPOLEN_SIGNATURE;
			to->to_signature = optp;
			while (siglen--)
				*optp++ = 0;
			break;
			}
		case TOF_SACK:
			{
			int sackblks = 0;
			struct sackblk *sack = (struct sackblk *)to->to_sacks;
			tcp_seq sack_seq;

			while (!optlen || optlen % 4 != 2) {
				optlen += TCPOLEN_NOP;
				*optp++ = TCPOPT_NOP;
			}
			if (TCP_MAXOLEN - optlen < TCPOLEN_SACKHDR + TCPOLEN_SACK)
				continue;
			optlen += TCPOLEN_SACKHDR;
			*optp++ = TCPOPT_SACK;
			sackblks = min(to->to_nsacks,
			    (TCP_MAXOLEN - optlen) / TCPOLEN_SACK);
			*optp++ = TCPOLEN_SACKHDR + sackblks * TCPOLEN_SACK;
			while (sackblks--) {
				sack_seq = htonl(sack->start);
				bcopy((u_char *)&sack_seq, optp, sizeof(sack_seq));
				optp += sizeof(sack_seq);
				sack_seq = htonl(sack->end);
				bcopy((u_char *)&sack_seq, optp, sizeof(sack_seq));
				optp += sizeof(sack_seq);
				optlen += TCPOLEN_SACK;
				sack++;
			}
			TCPSTAT_INC(tcps_sack_send_blocks);
			break;
			}
		case TOF_FASTOPEN:
			{
			int total_len;

			/* XXX is there any point to aligning this option? */
			total_len = TCPOLEN_FAST_OPEN_EMPTY + to->to_tfo_len;
			if (TCP_MAXOLEN - optlen < total_len) {
				to->to_flags &= ~TOF_FASTOPEN;
				continue;
			}
			*optp++ = TCPOPT_FAST_OPEN;
			*optp++ = total_len;
			if (to->to_tfo_len > 0) {
				bcopy(to->to_tfo_cookie, optp, to->to_tfo_len);
				optp += to->to_tfo_len;
			}
			optlen += total_len;
			break;
			}
		default:
			panic("%s: unknown TCP option type", __func__);
			break;
		}
	}

	/* Terminate and pad TCP options to a 4 byte boundary. */
	if (optlen % 4) {
		optlen += TCPOLEN_EOL;
		*optp++ = TCPOPT_EOL;
	}
	/*
	 * According to RFC 793 (STD0007):
	 *   "The content of the header beyond the End-of-Option option
	 *    must be header padding (i.e., zero)."
	 *   and later: "The padding is composed of zeros."
	 */
	while (optlen % 4) {
		optlen += TCPOLEN_PAD;
		*optp++ = TCPOPT_PAD;
	}

	KASSERT(optlen <= TCP_MAXOLEN, ("%s: TCP options too long", __func__));
	return (optlen);
}
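
/*
 * Illustrative sketch, not kernel code: a userland model of the length
 * accounting in tcp_addoptions() above. It tracks only optlen (no bytes
 * are written) and covers the six options whose sizes are given in the
 * comment before tcp_addoptions(). All sk_* names are hypothetical. The
 * caller is assumed to list options in the same order the mask loop
 * visits them.
 */

```c
#include <assert.h>

#define	SK_MAXOLEN	40	/* total TCP option space in bytes */

enum sk_opt { SK_MSS, SK_SCALE, SK_SACKPERM, SK_TS, SK_SIGNATURE, SK_SACK };

/* Nonzero while one more NOP is needed before option o at offset optlen. */
static int
sk_needs_nop(enum sk_opt o, int optlen)
{
	switch (o) {
	case SK_MSS:
		return (optlen % 4 != 0);
	case SK_SCALE:
		return (optlen == 0 || optlen % 2 != 1);
	case SK_SACKPERM:
		return (optlen % 2 != 0);
	default:	/* timestamp, signature, SACK */
		return (optlen == 0 || optlen % 4 != 2);
	}
}

/* Return the padded option length for nopts options and nsacks blocks. */
static int
sk_options_len(const enum sk_opt *opts, int nopts, int nsacks)
{
	/* Room each option needs; SACK needs its header plus one block. */
	static const int need[] = { 4, 3, 2, 10, 18, 2 + 8 };
	int optlen = 0, blks, i;

	for (i = 0; i < nopts; i++) {
		if (optlen == SK_MAXOLEN)
			break;
		while (sk_needs_nop(opts[i], optlen))
			optlen++;		/* one NOP byte */
		if (SK_MAXOLEN - optlen < need[opts[i]])
			continue;		/* does not fit; skip it */
		if (opts[i] == SK_SACK) {
			optlen += 2;		/* kind and length bytes */
			blks = (SK_MAXOLEN - optlen) / 8;
			if (blks > nsacks)
				blks = nsacks;
			optlen += 8 * blks;
		} else
			optlen += need[opts[i]];
	}
	/* EOL and zero padding round up to a 4-byte boundary. */
	while (optlen % 4)
		optlen++;
	assert(optlen <= SK_MAXOLEN);
	return (optlen);
}
```

/*
 * By this accounting a lone timestamp option occupies 12 bytes (two NOPs
 * plus 10), timestamps plus SACK leave room for three SACK blocks
 * (14 + 2 + 24 = 40), and the SYN/SYN-ACK order MSS, window scale,
 * SACK permitted, timestamp, signature comes to 40 bytes: the 38 bytes
 * listed in the comment plus two NOPs that align the signature option.
 */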

/*
 * This is a copy of m_copym(), taking the TSO segment size/limit
 * constraints into account, and advancing the sndptr as it goes.
 */
struct mbuf *
tcp_m_copym(struct mbuf *m, int32_t off0, int32_t *plen,
Add kernel-side support for in-kernel TLS.
KTLS adds support for in-kernel framing and encryption of Transport
Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports
offload of TLS for transmitted data. Key negotiation must still be
performed in userland. Once completed, transmit session keys for a
connection are provided to the kernel via a new TCP_TXTLS_ENABLE
socket option. All subsequent data transmitted on the socket is
placed into TLS frames and encrypted using the supplied keys.
Any data written to a KTLS-enabled socket via write(2), aio_write(2),
or sendfile(2) is assumed to be application data and is encoded in TLS
frames with an application data type. Individual records can be sent
with a custom type (e.g. handshake messages) via sendmsg(2) with a new
control message (TLS_SET_RECORD_TYPE) specifying the record type.
At present, rekeying is not supported though the in-kernel framework
should support rekeying.
KTLS makes use of the recently added unmapped mbufs to store TLS
frames in the socket buffer. Each TLS frame is described by a single
ext_pgs mbuf. The ext_pgs structure contains the header of the TLS
record (and trailer for encrypted records) as well as references to
the associated TLS session.
KTLS supports two primary methods of encrypting TLS frames: software
TLS and ifnet TLS.
Software TLS marks mbufs holding socket data as not ready via
M_NOTREADY similar to sendfile(2) when TLS framing information is
added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then
called to schedule TLS frames for encryption. In the case of sendfile(2),
sendfile_iodone() calls ktls_enqueue() instead of pru_ready(), leaving
the mbufs marked M_NOTREADY until encryption is completed. For other
writes (vn_sendfile when pages are available, write(2), etc.), the
PRUS_NOTREADY is set when invoking pru_send() along with invoking
ktls_enqueue().
A pool of worker threads (the "KTLS" kernel process) encrypts TLS
frames queued via ktls_enqueue(). Each TLS frame is temporarily
mapped using the direct map and passed to a software encryption
backend to perform the actual encryption.
(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if
someone wished to make this work on architectures without a direct
map.)
KTLS supports pluggable software encryption backends. Internally,
Netflix uses proprietary pure-software backends. This commit includes
a simple backend in a new ktls_ocf.ko module that uses the kernel's
OpenCrypto framework to provide AES-GCM encryption of TLS frames. As
a result, software TLS is now a bit of a misnomer as it can make use
of hardware crypto accelerators.
Once software encryption has finished, the TLS frame mbufs are marked
ready via pru_ready(). At this point, the encrypted data appears as
regular payload to the TCP stack stored in unmapped mbufs.
ifnet TLS permits a NIC to offload the TLS encryption and TCP
segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS)
is allocated on the interface a socket is routed over and associated
with a TLS session. TLS records for a TLS session using ifnet TLS are
not marked M_NOTREADY but are passed down the stack unencrypted. The
ip_output_send() and ip6_output_send() helper functions that apply
send tags to outbound IP packets verify that the send tag of the TLS
record matches the outbound interface. If so, the packet is tagged
with the TLS send tag and sent to the interface. The NIC device
driver must recognize packets with the TLS send tag and schedule them
for TLS encryption and TCP segmentation. If the outbound
interface does not match the interface in the TLS send tag, the packet
is dropped. In addition, a task is scheduled to refresh the TLS send
tag for the TLS session. If a new TLS send tag cannot be allocated,
the connection is dropped. If a new TLS send tag is allocated,
however, subsequent packets will be tagged with the correct TLS send
tag. (This latter case has been tested by configuring both ports of a
Chelsio T6 in a lagg and failing over from one port to another. As
the connections migrated to the new port, new TLS send tags were
allocated for the new port and connections resumed without being
dropped.)
ifnet TLS can be enabled and disabled on supported network interfaces
via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported
across both vlan devices and lagg interfaces using failover, lacp with
flowid enabled, or lacp with flowid enabled.
Applications may request the current KTLS mode of a connection via a
new TCP_TXTLS_MODE socket option. They can also use this socket
option to toggle between software and ifnet TLS modes.
In addition, a testing tool is available in tools/tools/switch_tls.
This is modeled on tcpdrop and uses similar syntax. However, instead
of dropping connections, -s is used to force KTLS connections to
switch to software TLS and -i is used to switch to ifnet TLS.
Various sysctls and counters are available under the kern.ipc.tls
sysctl node. The kern.ipc.tls.enable node must be set to true to
enable KTLS (it is off by default). The use of unmapped mbufs must
also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.
KTLS is enabled via the KERN_TLS kernel option.
This patch is the culmination of years of work by several folks
including Scott Long and Randall Stewart for the original design and
implementation; Drew Gallatin for several optimizations including the
use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records
awaiting software encryption, and pluggable software crypto backends;
and John Baldwin for modifications to support hardware TLS offload.
Reviewed by: gallatin, hselasky, rrs
Obtained from: Netflix
Sponsored by: Netflix, Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D21277
2019-08-27 00:01:56 +00:00
    int32_t seglimit, int32_t segsize, struct sockbuf *sb, bool hw_tls)
{
#ifdef KERN_TLS
	struct ktls_session *tls, *ntls;
	struct mbuf *start;
#endif
	struct mbuf *n, **np;
	struct mbuf *top;
	int32_t off = off0;
	int32_t len = *plen;
	int32_t fragsize;
	int32_t len_cp = 0;
	int32_t *pkthdrlen;
	uint32_t mlen, frags;
	bool copyhdr;

	KASSERT(off >= 0, ("tcp_m_copym, negative off %d", off));
	KASSERT(len >= 0, ("tcp_m_copym, negative len %d", len));
	if (off == 0 && m->m_flags & M_PKTHDR)
		copyhdr = true;
	else
		copyhdr = false;
	while (off > 0) {
		KASSERT(m != NULL, ("tcp_m_copym, offset > size of mbuf chain"));
		if (off < m->m_len)
			break;
		off -= m->m_len;
		if ((sb) && (m == sb->sb_sndptr)) {
			sb->sb_sndptroff += m->m_len;
			sb->sb_sndptr = m->m_next;
		}
		m = m->m_next;
	}
	np = &top;
	top = NULL;
	pkthdrlen = NULL;
#ifdef KERN_TLS
	if (hw_tls && (m->m_flags & M_EXTPG))
		tls = m->m_epg_tls;
	else
		tls = NULL;
	start = m;
#endif
	while (len > 0) {
		if (m == NULL) {
			KASSERT(len == M_COPYALL,
			    ("tcp_m_copym, length > size of mbuf chain"));
			*plen = len_cp;
			if (pkthdrlen != NULL)
				*pkthdrlen = len_cp;
			break;
		}
Add kernel-side support for in-kernel TLS.
KTLS adds support for in-kernel framing and encryption of Transport
Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports
offload of TLS for transmitted data. Key negotation must still be
performed in userland. Once completed, transmit session keys for a
connection are provided to the kernel via a new TCP_TXTLS_ENABLE
socket option. All subsequent data transmitted on the socket is
placed into TLS frames and encrypted using the supplied keys.
Any data written to a KTLS-enabled socket via write(2), aio_write(2),
or sendfile(2) is assumed to be application data and is encoded in TLS
frames with an application data type. Individual records can be sent
with a custom type (e.g. handshake messages) via sendmsg(2) with a new
control message (TLS_SET_RECORD_TYPE) specifying the record type.

At present, rekeying is not supported though the in-kernel framework
should support rekeying.

KTLS makes use of the recently added unmapped mbufs to store TLS
frames in the socket buffer. Each TLS frame is described by a single
ext_pgs mbuf. The ext_pgs structure contains the header of the TLS
record (and trailer for encrypted records) as well as references to
the associated TLS session.

KTLS supports two primary methods of encrypting TLS frames: software
TLS and ifnet TLS.

Software TLS marks mbufs holding socket data as not ready via
M_NOTREADY similar to sendfile(2) when TLS framing information is
added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then
called to schedule TLS frames for encryption. In the case of
sendfile(2), sendfile_iodone() calls ktls_enqueue() instead of
pru_ready(), leaving the mbufs marked M_NOTREADY until encryption is
completed. For other writes (vn_sendfile when pages are available,
write(2), etc.), the PRUS_NOTREADY flag is set when invoking
pru_send() along with invoking ktls_enqueue().

A pool of worker threads (the "KTLS" kernel process) encrypts TLS
frames queued via ktls_enqueue(). Each TLS frame is temporarily
mapped using the direct map and passed to a software encryption
backend to perform the actual encryption.

(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if
someone wished to make this work on architectures without a direct
map.)

KTLS supports pluggable software encryption backends. Internally,
Netflix uses proprietary pure-software backends. This commit includes
a simple backend in a new ktls_ocf.ko module that uses the kernel's
OpenCrypto framework to provide AES-GCM encryption of TLS frames. As
a result, software TLS is now a bit of a misnomer as it can make use
of hardware crypto accelerators.

Once software encryption has finished, the TLS frame mbufs are marked
ready via pru_ready(). At this point, the encrypted data appears as
regular payload to the TCP stack stored in unmapped mbufs.

ifnet TLS permits a NIC to offload the TLS encryption and TCP
segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS)
is allocated on the interface a socket is routed over and associated
with a TLS session. TLS records for a TLS session using ifnet TLS are
not marked M_NOTREADY but are passed down the stack unencrypted. The
ip_output_send() and ip6_output_send() helper functions that apply
send tags to outbound IP packets verify that the send tag of the TLS
record matches the outbound interface. If so, the packet is tagged
with the TLS send tag and sent to the interface. The NIC device
driver must recognize packets with the TLS send tag and schedule them
for TLS encryption and TCP segmentation. If the outbound
interface does not match the interface in the TLS send tag, the packet
is dropped. In addition, a task is scheduled to refresh the TLS send
tag for the TLS session. If a new TLS send tag cannot be allocated,
the connection is dropped. If a new TLS send tag is allocated,
however, subsequent packets will be tagged with the correct TLS send
tag. (This latter case has been tested by configuring both ports of a
Chelsio T6 in a lagg and failing over from one port to another. As
the connections migrated to the new port, new TLS send tags were
allocated for the new port and connections resumed without being
dropped.)

ifnet TLS can be enabled and disabled on supported network interfaces
via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported
across both vlan devices and lagg interfaces using failover or lacp
with flowid enabled.

Applications may request the current KTLS mode of a connection via a
new TCP_TXTLS_MODE socket option. They can also use this socket
option to toggle between software and ifnet TLS modes.

In addition, a testing tool is available in tools/tools/switch_tls.
This is modeled on tcpdrop and uses similar syntax. However, instead
of dropping connections, -s is used to force KTLS connections to
switch to software TLS and -i is used to switch to ifnet TLS.

Various sysctls and counters are available under the kern.ipc.tls
sysctl node. The kern.ipc.tls.enable node must be set to true to
enable KTLS (it is off by default). The use of unmapped mbufs must
also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.

KTLS is enabled via the KERN_TLS kernel option.

This patch is the culmination of years of work by several folks
including Scott Long and Randall Stewart for the original design and
implementation; Drew Gallatin for several optimizations including the
use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records
awaiting software encryption, and pluggable software crypto backends;
and John Baldwin for modifications to support hardware TLS offload.

Reviewed by: gallatin, hselasky, rrs
Obtained from: Netflix
Sponsored by: Netflix, Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D21277
2019-08-27 00:01:56 +00:00
#ifdef KERN_TLS
		if (hw_tls) {
2020-05-03 00:21:11 +00:00
			if (m->m_flags & M_EXTPG)
2020-05-03 00:12:56 +00:00
				ntls = m->m_epg_tls;
2019-08-27 00:01:56 +00:00
			else
				ntls = NULL;

			/*
			 * Avoid mixing TLS records with handshake
			 * data or TLS records from different
			 * sessions.
			 */
			if (tls != ntls) {
				MPASS(m != start);
				*plen = len_cp;
				if (pkthdrlen != NULL)
					*pkthdrlen = len_cp;
				break;
			}
		}
#endif
2018-06-07 18:18:13 +00:00
		mlen = min(len, m->m_len - off);
		if (seglimit) {
			/*
2020-05-03 00:21:11 +00:00
			 * For M_EXTPG mbufs, add 3 segments
2018-06-07 18:18:13 +00:00
			 * + 1 in case we are crossing page boundaries
			 * + 2 in case the TLS hdr/trailer are used
			 * It is cheaper to just add the segments
			 * than it is to take the cache miss to look
			 * at the mbuf ext_pgs state in detail.
			 */
2020-05-03 00:21:11 +00:00
			if (m->m_flags & M_EXTPG) {
2018-06-07 18:18:13 +00:00
				fragsize = min(segsize, PAGE_SIZE);
				frags = 3;
			} else {
				fragsize = segsize;
				frags = 0;
			}

			/* Break if we really can't fit anymore. */
			if ((frags + 1) >= seglimit) {
				*plen = len_cp;
				if (pkthdrlen != NULL)
					*pkthdrlen = len_cp;
				break;
			}

			/*
			 * Reduce size if you can't copy the whole
			 * mbuf. If we can't copy the whole mbuf, also
			 * adjust len so the loop will end after this
			 * mbuf.
			 */
			if ((frags + howmany(mlen, fragsize)) >= seglimit) {
				mlen = (seglimit - frags - 1) * fragsize;
				len = mlen;
				*plen = len_cp + len;
				if (pkthdrlen != NULL)
					*pkthdrlen = *plen;
			}
			frags += howmany(mlen, fragsize);
			if (frags == 0)
				frags++;
			seglimit -= frags;
			KASSERT(seglimit > 0,
			    ("%s: seglimit went too low", __func__));
		}
		if (copyhdr)
			n = m_gethdr(M_NOWAIT, m->m_type);
		else
			n = m_get(M_NOWAIT, m->m_type);
		*np = n;
		if (n == NULL)
			goto nospace;
		if (copyhdr) {
			if (!m_dup_pkthdr(n, m, M_NOWAIT))
				goto nospace;
			if (len == M_COPYALL)
				n->m_pkthdr.len -= off0;
			else
				n->m_pkthdr.len = len;
			pkthdrlen = &n->m_pkthdr.len;
			copyhdr = false;
		}
		n->m_len = mlen;
		len_cp += n->m_len;
2020-05-03 00:37:16 +00:00
		if (m->m_flags & (M_EXT|M_EXTPG)) {
2018-06-07 18:18:13 +00:00
			n->m_data = m->m_data + off;
			mb_dupcl(n, m);
		} else
			bcopy(mtod(m, caddr_t)+off, mtod(n, caddr_t),
			    (u_int)n->m_len);

		if (sb && (sb->sb_sndptr == m) &&
		    ((n->m_len + off) >= m->m_len) && m->m_next) {
			sb->sb_sndptroff += m->m_len;
			sb->sb_sndptr = m->m_next;
		}
		off = 0;
		if (len != M_COPYALL) {
			len -= n->m_len;
		}
		m = m->m_next;
		np = &n->m_next;
	}
	return (top);
nospace:
	m_freem(top);
	return (NULL);
}
|
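
The segment-limit accounting inside the copy loop above can be hard to
follow in place. The sketch below isolates it as a standalone function
under simplifying assumptions: plain unsigned integers instead of mbufs,
a caller that guarantees a nonzero limit (as the `if (seglimit)` guard
does), and a fixed 4 KiB page size. The names charge_segments,
SKETCH_PAGE_SIZE and SKETCH_HOWMANY are hypothetical and do not exist in
the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE	4096u	/* assumption: 4 KiB pages */
#define SKETCH_HOWMANY(x, y)	(((x) + ((y) - 1)) / (y))

/*
 * Charge one buffer of mlen bytes against the remaining segment budget.
 * Returns the number of bytes that may be copied from this buffer and
 * sets *stop when the caller should end its loop after this buffer.
 * The caller must pass *seglimit >= 1.
 */
static uint32_t
charge_segments(uint32_t *seglimit, uint32_t segsize, uint32_t mlen,
    bool extpg, bool *stop)
{
	uint32_t fragsize, frags;

	assert(*seglimit >= 1 && segsize > 0);
	*stop = false;
	if (extpg) {
		/* Reserve 3 extra segments: page crossing + TLS hdr/trailer. */
		fragsize = segsize < SKETCH_PAGE_SIZE ? segsize : SKETCH_PAGE_SIZE;
		frags = 3;
	} else {
		fragsize = segsize;
		frags = 0;
	}

	/* Nothing more fits: copy nothing and stop. */
	if (frags + 1 >= *seglimit) {
		*stop = true;
		return (0);
	}

	/* Only part of the buffer fits: trim it and stop afterwards. */
	if (frags + SKETCH_HOWMANY(mlen, fragsize) >= *seglimit) {
		mlen = (*seglimit - frags - 1) * fragsize;
		*stop = true;
	}
	frags += SKETCH_HOWMANY(mlen, fragsize);
	if (frags == 0)
		frags++;
	*seglimit -= frags;	/* always leaves at least 1 */
	return (mlen);
}
```

With a budget of 4 segments and a 1448-byte segment size, a 10000-byte
ordinary buffer is trimmed to 3 * 1448 = 4344 bytes, while an M_EXTPG-style
buffer is refused outright because its 3 reserved segments already exhaust
the budget.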
2017-12-07 22:36:58 +00:00
void
tcp_sndbuf_autoscale(struct tcpcb *tp, struct socket *so, uint32_t sendwin)
{

	/*
	 * Automatic sizing of send socket buffer. Often the send buffer
	 * size is not optimally adjusted to the actual network conditions
	 * at hand (delay bandwidth product). Setting the buffer size too
	 * small limits throughput on links with high bandwidth and high
	 * delay (e.g. trans-continental/oceanic links). Setting the
	 * buffer size too big consumes too much real kernel memory,
	 * especially with many connections on busy servers.
	 *
	 * The criteria to step up the send buffer one notch are:
	 *  1. receive window of remote host is larger than send buffer
	 *     (with a fudge factor of 5/4th);
	 *  2. send buffer is filled to 7/8th with data (so we actually
	 *     have data to make use of it);
	 *  3. send buffer fill has not hit maximal automatic size;
	 *  4. our send window (slow start and congestion controlled) is
	 *     larger than sent but unacknowledged data in send buffer.
	 *
	 * The remote host receive window scaling factor may limit the
	 * growing of the send buffer before it reaches its allowed
	 * maximum.
	 *
	 * It scales directly with slow start or congestion window
	 * and does at most one step per received ACK. This fast
	 * scaling has the drawback of growing the send buffer beyond
	 * what is strictly necessary to make full use of a given
	 * delay*bandwidth product. However testing has shown this not
	 * to be much of a problem. At worst we are trading wasting
	 * of available bandwidth (the non-use of it) for wasting some
	 * socket buffer memory.
	 *
	 * TODO: Shrink send buffer during idle periods together
	 * with congestion window. Requires another timer. Has to
	 * wait for upcoming tcp timer rewrite.
	 *
	 * XXXGL: should there be used sbused() or sbavail()?
	 */
	if (V_tcp_do_autosndbuf && so->so_snd.sb_flags & SB_AUTOSIZE) {
		int lowat;

		lowat = V_tcp_sendbuf_auto_lowat ? so->so_snd.sb_lowat : 0;
		if ((tp->snd_wnd / 4 * 5) >= so->so_snd.sb_hiwat - lowat &&
		    sbused(&so->so_snd) >=
		    (so->so_snd.sb_hiwat / 8 * 7) - lowat &&
		    sbused(&so->so_snd) < V_tcp_autosndbuf_max &&
		    sendwin >= (sbused(&so->so_snd) -
		    (tp->snd_nxt - tp->snd_una))) {
			if (!sbreserve_locked(&so->so_snd,
			    min(so->so_snd.sb_hiwat + V_tcp_autosndbuf_inc,
			    V_tcp_autosndbuf_max), so, curthread))
				so->so_snd.sb_flags &= ~SB_AUTOSIZE;
		}
	}
}
|
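
The four step-up criteria listed in the comment above can be checked in
isolation. The following is a minimal sketch that restates the condition
with plain integers instead of struct tcpcb and struct socket; the struct
sndbuf_state type, its field names, and the should_grow() and next_hiwat()
helpers are hypothetical. It assumes lowat does not exceed 7/8 of hiwat
and that the unacknowledged byte count does not exceed the bytes in the
buffer, so the unsigned subtractions cannot wrap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flattened view of the state the kernel function reads. */
struct sndbuf_state {
	uint32_t hiwat;		/* current buffer limit (sb_hiwat) */
	uint32_t used;		/* bytes in the buffer (sbused()) */
	uint32_t lowat;		/* low-water mark, or 0 when not applied */
	uint32_t snd_wnd;	/* window advertised by the remote host */
	uint32_t inflight;	/* sent but unacknowledged bytes */
	uint32_t sendwin;	/* our current send window */
};

/* True when all four criteria for growing the send buffer hold. */
static bool
should_grow(const struct sndbuf_state *s, uint32_t autosndbuf_max)
{
	assert(s->inflight <= s->used);
	return (s->snd_wnd / 4 * 5 >= s->hiwat - s->lowat &&	/* 1 */
	    s->used >= s->hiwat / 8 * 7 - s->lowat &&		/* 2 */
	    s->used < autosndbuf_max &&				/* 3 */
	    s->sendwin >= s->used - s->inflight);		/* 4 */
}

/* One growth step: add the increment, clamped to the maximum. */
static uint32_t
next_hiwat(uint32_t hiwat, uint32_t inc, uint32_t max)
{
	uint32_t want = hiwat + inc;

	return (want < max ? want : max);
}
```

For example, a 32768-byte buffer holding 30000 bytes passes criterion 2
because 30000 >= 32768 / 8 * 7 = 28672, whereas the same buffer holding
20000 bytes does not grow. The increment and maximum passed to
next_hiwat() stand in for V_tcp_autosndbuf_inc and V_tcp_autosndbuf_max;
the values used in any example are illustrative only.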