2005-01-07 01:45:51 +00:00
|
|
|
/*-
|
2017-11-20 19:43:44 +00:00
|
|
|
* SPDX-License-Identifier: BSD-3-Clause
|
|
|
|
*
|
1994-05-24 10:09:53 +00:00
|
|
|
* Copyright (c) 1982, 1986, 1988, 1993
|
Update TCP for infrastructural changes to the socket/pcb refcount model,
pru_abort(), pru_detach(), and in_pcbdetach():
- Universally support and enforce the invariant that so_pcb is
never NULL, converting dozens of unnecessary NULL checks into
assertions, and eliminating dozens of unnecessary error handling
cases in protocol code.
- In some cases, eliminate unnecessary pcbinfo locking, as it is no
longer required to ensure so_pcb != NULL. For example, the receive
code no longer requires the pcbinfo lock, and the send code only
requires it if building a new connection on an otherwise unconnected
socket triggered via sendto() with an address. This should
significantly reduce tcbinfo lock contention in the receive and send
cases.
- In order to support the invariant that so_pcb != NULL, it is now
necessary for the TCP code to not discard the tcpcb any time a
connection is dropped, but instead leave the tcpcb until the socket
is shut down. This case is handled by setting INP_DROPPED, to
substitute for using a NULL so_pcb to indicate that the connection
has been dropped. This requires the inpcb lock, but not the pcbinfo
lock.
- Unlike all other protocols in the tree, TCP may need to retain access
to the socket after the file descriptor has been closed. Set
SS_PROTOREF in tcp_detach() in order to prevent the socket from being
freed, and add a flag, INP_SOCKREF, so that the TCP code knows whether
or not it needs to free the socket when the connection finally does
close. The typical case where this occurs is if close() is called on
a TCP socket before all sent data in the send socket buffer has been
transmitted or acknowledged. If INP_SOCKREF is found when the
connection is dropped, we release the inpcb, tcpcb, and socket instead
of flagging INP_DROPPED.
- Abort and detach protocol switch methods no longer return failures,
nor attempt to free sockets, as the socket layer does this.
- Annotate the existence of a long-standing race in the TCP timer code,
in which timers are stopped but not drained when the socket is freed,
as waiting for drain may lead to deadlocks, or have to occur in a
context where waiting is not permitted. This race has been handled
by testing to see if the tcpcb pointer in the inpcb is NULL (and vice
versa), which is not normally permitted, but may be true if an inpcb
and tcpcb have been freed. Add a counter to test how often this race
has actually occurred, and a large comment for each instance where
we compare potentially freed memory with NULL. This will have to be
fixed in the near future, but requires us to further address how to
handle the timer shutdown issue.
- Several TCP calls no longer potentially free the passed inpcb/tcpcb,
so no longer need to return a pointer to indicate whether the argument
passed in is still valid.
- Un-macroize debugging and locking setup for various protocol switch
methods for TCP, as it led to more obscurity, and as locking becomes
more customized to the methods, offers less benefit.
- Assert copyright on tcp_usrreq.c due to significant modifications that
have been made as part of this work.
These changes significantly modify the memory management and connection
logic of our TCP implementation, and are (as such) High Risk Changes,
and likely to contain serious bugs. Please report problems to the
current@ mailing list ASAP, ideally with simple test cases, and
optionally, packet traces.
MFC after: 3 months
2006-04-01 16:36:36 +00:00
|
|
|
* The Regents of the University of California.
|
2007-02-17 21:02:38 +00:00
|
|
|
* Copyright (c) 2006-2007 Robert N. M. Watson
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need to acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. The new lookup flags,
which supplement the existing INPLOOKUP_WILDCARD flag, are:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
|
|
|
* Copyright (c) 2010-2011 Juniper Networks, Inc.
|
2006-04-01 16:36:36 +00:00
|
|
|
* All rights reserved.
|
1994-05-24 10:09:53 +00:00
|
|
|
*
|
2011-05-30 09:43:55 +00:00
|
|
|
* Portions of this software were developed by Robert N. M. Watson under
|
|
|
|
* contract to Juniper Networks, Inc.
|
|
|
|
*
|
1994-05-24 10:09:53 +00:00
|
|
|
* Redistribution and use in source and binary forms, with or without
|
|
|
|
* modification, are permitted provided that the following conditions
|
|
|
|
* are met:
|
|
|
|
* 1. Redistributions of source code must retain the above copyright
|
|
|
|
* notice, this list of conditions and the following disclaimer.
|
|
|
|
* 2. Redistributions in binary form must reproduce the above copyright
|
|
|
|
* notice, this list of conditions and the following disclaimer in the
|
|
|
|
* documentation and/or other materials provided with the distribution.
|
2017-02-28 23:42:47 +00:00
|
|
|
* 3. Neither the name of the University nor the names of its contributors
|
1994-05-24 10:09:53 +00:00
|
|
|
* may be used to endorse or promote products derived from this software
|
|
|
|
* without specific prior written permission.
|
|
|
|
*
|
|
|
|
* THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
|
|
|
|
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
|
|
|
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
|
|
|
* ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
|
|
|
|
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
|
|
|
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
|
|
|
|
* OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
|
|
|
|
* HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
|
|
|
|
* LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
|
|
|
|
* OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
|
|
|
|
* SUCH DAMAGE.
|
|
|
|
*
|
1995-02-16 01:42:45 +00:00
|
|
|
* From: @(#)tcp_usrreq.c 8.2 (Berkeley) 1/3/94
|
1994-05-24 10:09:53 +00:00
|
|
|
*/
|
|
|
|
|
2007-10-07 20:44:24 +00:00
|
|
|
#include <sys/cdefs.h>
|
|
|
|
__FBSDID("$FreeBSD$");
|
|
|
|
|
2007-02-17 21:02:38 +00:00
|
|
|
#include "opt_ddb.h"
|
Initial import of RFC 2385 (TCP-MD5) digest support.
This is the first of two commits; bringing in the kernel support first.
This can be enabled by compiling a kernel with options TCP_SIGNATURE
and FAST_IPSEC.
For the uninitiated, this is a TCP option which provides for a means of
authenticating TCP sessions which came into being before IPSEC. It is
still relevant today, however, as it is used by many commercial router
vendors, particularly with BGP, and as such has become a requirement for
interconnect at many major Internet points of presence.
Several parts of the TCP and IP headers, including the segment payload,
are digested with MD5, including a shared secret. The PF_KEY interface
is used to manage the secrets using security associations in the SADB.
There is a limitation here in that there is no way to map a TCP flow
per-port back to an SPI without polluting tcpcb or using the SPD; the
code to do the latter is unstable at this time. Therefore this code only
supports per-host keying granularity.
Whilst FAST_IPSEC is mutually exclusive with KAME IPSEC (and thus IPv6),
TCP_SIGNATURE applies only to IPv4. For the vast majority of prospective
users of this feature, this will not pose any problem.
This implementation is output-only; that is, the option is honoured when
responding to a host initiating a TCP session, but no effort is made
[yet] to authenticate inbound traffic. This is, however, sufficient to
interwork with Cisco equipment.
Tested with a Cisco 2501 running IOS 12.0(27), and Quagga 0.96.4 with
local patches. Patches for tcpdump to validate TCP-MD5 sessions are also
available from me upon request.
Sponsored by: sentex.net
2004-02-11 04:26:04 +00:00
|
|
|
#include "opt_inet.h"
|
2000-01-09 19:17:30 +00:00
|
|
|
#include "opt_inet6.h"
|
2017-02-06 08:49:57 +00:00
|
|
|
#include "opt_ipsec.h"
|
Add kernel-side support for in-kernel TLS.
KTLS adds support for in-kernel framing and encryption of Transport
Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports
offload of TLS for transmitted data. Key negotation must still be
performed in userland. Once completed, transmit session keys for a
connection are provided to the kernel via a new TCP_TXTLS_ENABLE
socket option. All subsequent data transmitted on the socket is
placed into TLS frames and encrypted using the supplied keys.
Any data written to a KTLS-enabled socket via write(2), aio_write(2),
or sendfile(2) is assumed to be application data and is encoded in TLS
frames with an application data type. Individual records can be sent
with a custom type (e.g. handshake messages) via sendmsg(2) with a new
control message (TLS_SET_RECORD_TYPE) specifying the record type.
At present, rekeying is not supported though the in-kernel framework
should support rekeying.
KTLS makes use of the recently added unmapped mbufs to store TLS
frames in the socket buffer. Each TLS frame is described by a single
ext_pgs mbuf. The ext_pgs structure contains the header of the TLS
record (and trailer for encrypted records) as well as references to
the associated TLS session.
KTLS supports two primary methods of encrypting TLS frames: software
TLS and ifnet TLS.
Software TLS marks mbufs holding socket data as not ready via
M_NOTREADY similar to sendfile(2) when TLS framing information is
added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then
called to schedule TLS frames for encryption. In the case of sendfile(2),
sendfile_iodone() calls ktls_enqueue() instead of pru_ready(), leaving
the mbufs marked M_NOTREADY until encryption is completed. For other
writes (vn_sendfile when pages are available, write(2), etc.), the
PRUS_NOTREADY is set when invoking pru_send() along with invoking
ktls_enqueue().
A pool of worker threads (the "KTLS" kernel process) encrypts TLS
frames queued via ktls_enqueue(). Each TLS frame is temporarily
mapped using the direct map and passed to a software encryption
backend to perform the actual encryption.
(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if
someone wished to make this work on architectures without a direct
map.)
KTLS supports pluggable software encryption backends. Internally,
Netflix uses proprietary pure-software backends. This commit includes
a simple backend in a new ktls_ocf.ko module that uses the kernel's
OpenCrypto framework to provide AES-GCM encryption of TLS frames. As
a result, software TLS is now a bit of a misnomer as it can make use
of hardware crypto accelerators.
Once software encryption has finished, the TLS frame mbufs are marked
ready via pru_ready(). At this point, the encrypted data appears as
regular payload to the TCP stack stored in unmapped mbufs.
ifnet TLS permits a NIC to offload the TLS encryption and TCP
segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS)
is allocated on the interface a socket is routed over and associated
with a TLS session. TLS records for a TLS session using ifnet TLS are
not marked M_NOTREADY but are passed down the stack unencrypted. The
ip_output_send() and ip6_output_send() helper functions that apply
send tags to outbound IP packets verify that the send tag of the TLS
record matches the outbound interface. If so, the packet is tagged
with the TLS send tag and sent to the interface. The NIC device
driver must recognize packets with the TLS send tag and schedule them
for TLS encryption and TCP segmentation. If the outbound
interface does not match the interface in the TLS send tag, the packet
is dropped. In addition, a task is scheduled to refresh the TLS send
tag for the TLS session. If a new TLS send tag cannot be allocated,
the connection is dropped. If a new TLS send tag is allocated,
however, subsequent packets will be tagged with the correct TLS send
tag. (This latter case has been tested by configuring both ports of a
Chelsio T6 in a lagg and failing over from one port to another. As
the connections migrated to the new port, new TLS send tags were
allocated for the new port and connections resumed without being
dropped.)
ifnet TLS can be enabled and disabled on supported network interfaces
via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported
across both vlan devices and lagg interfaces using failover or lacp
with flowid enabled.
Applications may request the current KTLS mode of a connection via a
new TCP_TXTLS_MODE socket option. They can also use this socket
option to toggle between software and ifnet TLS modes.
In addition, a testing tool is available in tools/tools/switch_tls.
This is modeled on tcpdrop and uses similar syntax. However, instead
of dropping connections, -s is used to force KTLS connections to
switch to software TLS and -i is used to switch to ifnet TLS.
Various sysctls and counters are available under the kern.ipc.tls
sysctl node. The kern.ipc.tls.enable node must be set to true to
enable KTLS (it is off by default). The use of unmapped mbufs must
also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.
KTLS is enabled via the KERN_TLS kernel option.
This patch is the culmination of years of work by several folks
including Scott Long and Randall Stewart for the original design and
implementation; Drew Gallatin for several optimizations including the
use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records
awaiting software encryption, and pluggable software crypto backends;
and John Baldwin for modifications to support hardware TLS offload.
Reviewed by: gallatin, hselasky, rrs
Obtained from: Netflix
Sponsored by: Netflix, Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D21277
2019-08-27 00:01:56 +00:00
|
|
|
#include "opt_kern_tls.h"
|
1997-09-16 18:36:06 +00:00
|
|
|
#include "opt_tcpdebug.h"
|
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
#include <sys/param.h>
|
|
|
|
#include <sys/systm.h>
|
2019-12-02 20:58:04 +00:00
|
|
|
#include <sys/arb.h>
|
2012-02-05 16:53:02 +00:00
|
|
|
#include <sys/limits.h>
|
2002-06-10 20:05:46 +00:00
|
|
|
#include <sys/malloc.h>
|
2015-12-16 00:56:45 +00:00
|
|
|
#include <sys/refcount.h>
|
1995-02-17 00:29:42 +00:00
|
|
|
#include <sys/kernel.h>
|
2019-08-27 00:01:56 +00:00
|
|
|
#include <sys/ktls.h>
|
2019-12-02 20:58:04 +00:00
|
|
|
#include <sys/qmath.h>
|
1995-11-09 20:23:09 +00:00
|
|
|
#include <sys/sysctl.h>
|
1994-05-24 10:09:53 +00:00
|
|
|
#include <sys/mbuf.h>
|
2000-01-09 19:17:30 +00:00
|
|
|
#ifdef INET6
|
|
|
|
#include <sys/domain.h>
|
|
|
|
#endif /* INET6 */
|
1994-05-24 10:09:53 +00:00
|
|
|
#include <sys/socket.h>
|
|
|
|
#include <sys/socketvar.h>
|
|
|
|
#include <sys/protosw.h>
|
2001-02-21 06:39:57 +00:00
|
|
|
#include <sys/proc.h>
|
|
|
|
#include <sys/jail.h>
|
2016-10-18 07:16:49 +00:00
|
|
|
#include <sys/syslog.h>
|
2019-12-02 20:58:04 +00:00
|
|
|
#include <sys/stats.h>
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2007-02-17 21:02:38 +00:00
|
|
|
#ifdef DDB
|
|
|
|
#include <ddb/ddb.h>
|
|
|
|
#endif
|
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
#include <net/if.h>
|
2013-10-26 17:58:36 +00:00
|
|
|
#include <net/if_var.h>
|
1994-05-24 10:09:53 +00:00
|
|
|
#include <net/route.h>
|
2009-08-01 19:26:27 +00:00
|
|
|
#include <net/vnet.h>
|
1994-05-24 10:09:53 +00:00
|
|
|
|
|
|
|
#include <netinet/in.h>
|
2015-09-13 15:50:55 +00:00
|
|
|
#include <netinet/in_kdtrace.h>
|
1994-05-24 10:09:53 +00:00
|
|
|
#include <netinet/in_pcb.h>
|
2011-04-30 11:21:29 +00:00
|
|
|
#include <netinet/in_systm.h>
|
1995-03-16 18:17:34 +00:00
|
|
|
#include <netinet/in_var.h>
|
1994-05-24 10:09:53 +00:00
|
|
|
#include <netinet/ip_var.h>
|
2000-01-09 19:17:30 +00:00
|
|
|
#ifdef INET6
|
2011-04-30 11:21:29 +00:00
|
|
|
#include <netinet/ip6.h>
|
|
|
|
#include <netinet6/in6_pcb.h>
|
2000-01-09 19:17:30 +00:00
|
|
|
#include <netinet6/ip6_var.h>
|
2005-07-25 12:31:43 +00:00
|
|
|
#include <netinet6/scope6_var.h>
|
2000-01-09 19:17:30 +00:00
|
|
|
#endif
|
2016-01-21 22:34:51 +00:00
|
|
|
#include <netinet/tcp.h>
|
1994-05-24 10:09:53 +00:00
|
|
|
#include <netinet/tcp_fsm.h>
|
|
|
|
#include <netinet/tcp_seq.h>
|
|
|
|
#include <netinet/tcp_timer.h>
|
|
|
|
#include <netinet/tcp_var.h>
|
2018-03-22 09:40:08 +00:00
|
|
|
#include <netinet/tcp_log_buf.h>
|
1994-05-24 10:09:53 +00:00
|
|
|
#include <netinet/tcpip.h>
|
2016-01-27 17:59:39 +00:00
|
|
|
#include <netinet/cc/cc.h>
|
2018-02-26 02:53:22 +00:00
|
|
|
#include <netinet/tcp_fastopen.h>
|
2018-04-19 15:03:48 +00:00
|
|
|
#include <netinet/tcp_hpts.h>
|
There are times when it would be really nice to have a record of the last few
packets and/or state transitions from each TCP socket. That would help with
narrowing down certain problems we see in the field that are hard to reproduce
without understanding the history of how we got into a certain state. This
change provides just that.
It saves copies of the last N packets in a list in the tcpcb. When the tcpcb is
destroyed, the list is freed. I thought this was likely to be more
performance-friendly than saving copies of the tcpcb. Plus, with the packets,
you should be able to reverse-engineer what happened to the tcpcb.
To enable the feature, you will need to compile a kernel with the TCPPCAP
option. Even then, the feature defaults to being deactivated. You can activate
it by setting a positive value for the number of captured packets. You can do
that on either a global basis or on a per-socket basis (via a setsockopt call).
There is no way to get the packets out of the kernel other than using kmem or
getting a coredump. I thought that would help some of the legal/privacy concerns
regarding such a feature. However, it should be possible to add a future effort
to export them in PCAP format.
I tested this at low scale, and found that there were no mbuf leaks and the peak
mbuf usage appeared to be unchanged with and without the feature.
The main performance concern I can envision is the number of mbufs that would be
used on systems with a large number of sockets. If you save five packets per
direction per socket and have 3,000 sockets, that will consume at least 30,000
mbufs just to keep these packets. I tried to reduce the concerns associated with
this by limiting the number of clusters (not mbufs) that could be used for this
feature. Again, in my testing, that appears to work correctly.
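The mbuf estimate above is plain arithmetic, and a minimal sketch makes the factors explicit. The function name `pcap_min_mbufs` is hypothetical and exists only for this illustration. It assumes, as the message's "at least" implies, that each saved packet occupies one mbuf or more.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the kernel): lower bound on the number
 * of mbufs held by the capture feature.  Assumes each saved packet needs
 * at least one mbuf and that packets are saved separately for each of the
 * two directions of a connection.
 */
uint64_t
pcap_min_mbufs(uint64_t pkts_per_direction, uint64_t nsockets)
{
	const uint64_t directions = 2;	/* inbound and outbound */

	return (pkts_per_direction * directions * nsockets);
}
```

With five packets per direction and 3,000 sockets this yields 5 * 2 * 3,000 = 30,000, matching the figure quoted in the message.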
Differential Revision: D3100
Submitted by: Jonathan Looney <jlooney at juniper dot net>
Reviewed by: gnn, hiren
2015-10-14 00:35:37 +00:00
|
|
|
#ifdef TCPPCAP
|
|
|
|
#include <netinet/tcp_pcap.h>
|
|
|
|
#endif
|
1994-09-15 10:36:56 +00:00
|
|
|
#ifdef TCPDEBUG
|
1994-05-24 10:09:53 +00:00
|
|
|
#include <netinet/tcp_debug.h>
|
1994-09-15 10:36:56 +00:00
|
|
|
#endif
|
2012-06-19 07:34:13 +00:00
|
|
|
#ifdef TCP_OFFLOAD
|
2007-12-18 22:59:07 +00:00
|
|
|
#include <netinet/tcp_offload.h>
|
2012-06-19 07:34:13 +00:00
|
|
|
#endif
|
2017-02-06 08:49:57 +00:00
|
|
|
#include <netipsec/ipsec_support.h>
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2019-12-02 20:58:04 +00:00
|
|
|
#include <vm/vm.h>
|
|
|
|
#include <vm/vm_param.h>
|
|
|
|
#include <vm/pmap.h>
|
|
|
|
#include <vm/vm_extern.h>
|
|
|
|
#include <vm/vm_map.h>
|
|
|
|
#include <vm/vm_page.h>
|
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
/*
|
|
|
|
* TCP protocol interface to socket abstraction.
|
|
|
|
*/
|
2011-04-30 11:21:29 +00:00
|
|
|
#ifdef INET
|
2002-03-24 10:19:10 +00:00
|
|
|
static int tcp_connect(struct tcpcb *, struct sockaddr *,
|
|
|
|
struct thread *td);
|
2011-04-30 11:21:29 +00:00
|
|
|
#endif /* INET */
|
2000-01-09 19:17:30 +00:00
|
|
|
#ifdef INET6
|
2002-03-19 21:25:46 +00:00
|
|
|
static int tcp6_connect(struct tcpcb *, struct sockaddr *,
|
2002-03-24 10:19:10 +00:00
|
|
|
struct thread *td);
|
2000-01-09 19:17:30 +00:00
|
|
|
#endif /* INET6 */
|
Update TCP for infrastructural changes to the socket/pcb refcount model,
pru_abort(), pru_detach(), and in_pcbdetach():
- Universally support and enforce the invariant that so_pcb is
never NULL, converting dozens of unnecessary NULL checks into
assertions, and eliminating dozens of unnecessary error handling
cases in protocol code.
- In some cases, eliminate unnecessary pcbinfo locking, as it is no
longer required to ensure so_pcb != NULL. For example, the receive
code no longer requires the pcbinfo lock, and the send code only
requires it if building a new connection on an otherwise unconnected
socket triggered via sendto() with an address. This should
significantly reduce tcbinfo lock contention in the receive and send
cases.
- In order to support the invariant that so_pcb != NULL, it is now
necessary for the TCP code to not discard the tcpcb any time a
connection is dropped, but instead leave the tcpcb until the socket
is shut down. This case is handled by setting INP_DROPPED, to
substitute for using a NULL so_pcb to indicate that the connection
has been dropped. This requires the inpcb lock, but not the pcbinfo
lock.
- Unlike all other protocols in the tree, TCP may need to retain access
to the socket after the file descriptor has been closed. Set
SS_PROTOREF in tcp_detach() in order to prevent the socket from being
freed, and add a flag, INP_SOCKREF, so that the TCP code knows whether
or not it needs to free the socket when the connection finally does
close. The typical case where this occurs is if close() is called on
a TCP socket before all sent data in the send socket buffer has been
transmitted or acknowledged. If INP_SOCKREF is found when the
connection is dropped, we release the inpcb, tcpcb, and socket instead
of flagging INP_DROPPED.
- Abort and detach protocol switch methods no longer return failures,
nor attempt to free sockets, as the socket layer does this.
- Annotate the existence of a long-standing race in the TCP timer code,
in which timers are stopped but not drained when the socket is freed,
as waiting for drain may lead to deadlocks, or have to occur in a
context where waiting is not permitted. This race has been handled
by testing to see if the tcpcb pointer in the inpcb is NULL (and vice
versa), which is not normally permitted, but may be true if an inpcb
and tcpcb have been freed. Add a counter to test how often this race
has actually occurred, and a large comment for each instance where
we compare potentially freed memory with NULL. This will have to be
fixed in the near future, but requires us to further address how to
handle the timer shutdown issue.
- Several TCP calls no longer potentially free the passed inpcb/tcpcb,
so no longer need to return a pointer to indicate whether the argument
passed in is still valid.
- Un-macroize debugging and locking setup for various protocol switch
methods for TCP, as it led to more obscurity, and as locking becomes
more customized to the methods, offers less benefit.
- Assert copyright on tcp_usrreq.c due to significant modifications that
have been made as part of this work.
These changes significantly modify the memory management and connection
logic of our TCP implementation, and are (as such) High Risk Changes,
and likely to contain serious bugs. Please report problems to the
current@ mailing list ASAP, ideally with simple test cases, and
optionally, packet traces.
MFC after: 3 months
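The drop-time decision described in the message (flag the pcb as dropped, or release everything when TCP holds the last socket reference) can be sketched as a small userland model. This is only an illustration under stated assumptions: `struct pcb_model`, the `M_*` flag values and `model_drop()` are hypothetical names, not the kernel's inpcb, INP_DROPPED or INP_SOCKREF definitions.

```c
#include <assert.h>

/* Hypothetical flag bits for this sketch; not the kernel's values. */
#define M_DROPPED	0x1	/* stands in for INP_DROPPED */
#define M_SOCKREF	0x2	/* stands in for INP_SOCKREF */

/* Hypothetical, heavily simplified stand-in for the inpcb. */
struct pcb_model {
	unsigned flags;
};

enum drop_action {
	DROP_MARK_DROPPED,	/* keep the pcb, only record the drop */
	DROP_RELEASE_ALL	/* release inpcb, tcpcb, and socket */
};

/*
 * Model of the choice made when a connection is dropped: if TCP holds
 * the socket reference (descriptor already closed), everything is
 * released; otherwise the pcb stays and is merely flagged as dropped.
 */
enum drop_action
model_drop(struct pcb_model *p)
{
	if (p->flags & M_SOCKREF)
		return (DROP_RELEASE_ALL);
	p->flags |= M_DROPPED;
	return (DROP_MARK_DROPPED);
}
```

In the model, as in the message, the dropped flag replaces the older convention of signalling a dropped connection with a NULL so_pcb.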
2006-04-01 16:36:36 +00:00
|
|
|
static void tcp_disconnect(struct tcpcb *);
|
|
|
|
static void tcp_usrclosed(struct tcpcb *);
|
2004-11-26 18:58:46 +00:00
|
|
|
static void tcp_fill_info(struct tcpcb *, struct tcp_info *);
|
1996-07-11 16:32:50 +00:00
|
|
|
|
2020-05-04 20:19:57 +00:00
|
|
|
static int tcp_pru_options_support(struct tcpcb *tp, int flags);
|
|
|
|
|
1996-07-11 16:32:50 +00:00
|
|
|
#ifdef TCPDEBUG
|
2001-03-12 02:57:42 +00:00
|
|
|
#define TCPDEBUG0 int ostate = 0
|
1996-07-11 16:32:50 +00:00
|
|
|
#define TCPDEBUG1() ostate = tp ? tp->t_state : 0
|
2002-05-31 11:52:35 +00:00
|
|
|
#define TCPDEBUG2(req) if (tp && (so->so_options & SO_DEBUG)) \
|
|
|
|
tcp_trace(TA_USER, ostate, tp, 0, 0, req)
|
1996-07-11 16:32:50 +00:00
|
|
|
#else
|
|
|
|
#define TCPDEBUG0
|
|
|
|
#define TCPDEBUG1()
|
|
|
|
#define TCPDEBUG2(req)
|
|
|
|
#endif
|
|
|
|
|
2020-05-18 22:53:12 +00:00
|
|
|
/*
|
|
|
|
* tcp_require_unique_port requires a globally-unique source port for each
|
|
|
|
* outgoing connection. The default is to require the 4-tuple to be unique.
|
|
|
|
*/
|
|
|
|
VNET_DEFINE(int, tcp_require_unique_port) = 0;
|
|
|
|
SYSCTL_INT(_net_inet_tcp, OID_AUTO, require_unique_port,
|
|
|
|
CTLFLAG_VNET | CTLFLAG_RW, &VNET_NAME(tcp_require_unique_port), 0,
|
|
|
|
"Require globally-unique ephemeral port for outgoing connections");
|
|
|
|
#define V_tcp_require_unique_port VNET(tcp_require_unique_port)
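The difference between the two policies named in the comment above can be sketched in userland. This is a simplified model under stated assumptions, not the kernel's port-selection code: `struct conn4` and `port_usable()` are hypothetical names, and addresses are reduced to plain integers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified connection record; not the kernel's inpcb. */
struct conn4 {
	uint32_t laddr, faddr;	/* local and foreign IPv4 addresses */
	uint16_t lport, fport;	/* local and foreign ports */
};

/*
 * Return true if the candidate's local port may be used given the
 * existing connections.  With require_unique_port set, any existing
 * user of the same local port is a conflict; otherwise only an
 * identical 4-tuple is.
 */
bool
port_usable(const struct conn4 *tab, size_t n, const struct conn4 *cand,
    bool require_unique_port)
{
	for (size_t i = 0; i < n; i++) {
		if (tab[i].lport != cand->lport)
			continue;
		if (require_unique_port)
			return (false);
		if (tab[i].laddr == cand->laddr &&
		    tab[i].faddr == cand->faddr &&
		    tab[i].fport == cand->fport)
			return (false);
	}
	return (true);
}
```

Under the default policy the same local port can be reused toward a different peer; with the stricter policy it cannot.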
|
|
|
|
|
1996-07-11 16:32:50 +00:00
|
|
|
/*
|
|
|
|
* TCP attaches to socket via pru_attach(), reserving space,
|
|
|
|
* and an internet control block.
|
|
|
|
*/
|
|
|
|
static int
|
2001-09-12 08:38:13 +00:00
|
|
|
tcp_usr_attach(struct socket *so, int proto, struct thread *td)
|
1996-07-11 16:32:50 +00:00
|
|
|
{
|
2002-06-10 20:05:46 +00:00
|
|
|
struct inpcb *inp;
|
2006-04-01 16:36:36 +00:00
|
|
|
struct tcpcb *tp = NULL;
|
|
|
|
int error;
|
1996-07-11 16:32:50 +00:00
|
|
|
TCPDEBUG0;
|
|
|
|
|
2006-04-01 16:36:36 +00:00
|
|
|
inp = sotoinpcb(so);
|
|
|
|
KASSERT(inp == NULL, ("tcp_usr_attach: inp != NULL"));
|
1996-07-11 16:32:50 +00:00
|
|
|
TCPDEBUG1();
|
|
|
|
|
2020-01-22 05:54:58 +00:00
|
|
|
if (so->so_snd.sb_hiwat == 0 || so->so_rcv.sb_hiwat == 0) {
|
|
|
|
error = soreserve(so, V_tcp_sendspace, V_tcp_recvspace);
|
|
|
|
if (error)
|
|
|
|
goto out;
|
|
|
|
}
|
1996-07-11 16:32:50 +00:00
|
|
|
|
2020-01-22 05:54:58 +00:00
|
|
|
so->so_rcv.sb_flags |= SB_AUTOSIZE;
|
|
|
|
so->so_snd.sb_flags |= SB_AUTOSIZE;
|
|
|
|
error = in_pcballoc(so, &V_tcbinfo);
|
2020-01-22 06:01:26 +00:00
|
|
|
if (error)
|
2020-01-22 05:54:58 +00:00
|
|
|
goto out;
|
|
|
|
inp = sotoinpcb(so);
|
|
|
|
#ifdef INET6
|
|
|
|
if (inp->inp_vflag & INP_IPV6PROTO) {
|
|
|
|
inp->inp_vflag |= INP_IPV6;
|
|
|
|
if ((inp->inp_flags & IN6P_IPV6_V6ONLY) == 0)
|
|
|
|
inp->inp_vflag |= INP_IPV4;
|
|
|
|
inp->in6p_hops = -1; /* use kernel default */
|
|
|
|
}
|
|
|
|
else
|
|
|
|
#endif
|
|
|
|
inp->inp_vflag |= INP_IPV4;
|
|
|
|
tp = tcp_newtcpcb(inp);
|
|
|
|
if (tp == NULL) {
|
2020-01-22 06:01:26 +00:00
|
|
|
error = ENOBUFS;
|
2020-01-22 05:54:58 +00:00
|
|
|
in_pcbdetach(inp);
|
|
|
|
in_pcbfree(inp);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
tp->t_state = TCPS_CLOSED;
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
TCPSTATES_INC(TCPS_CLOSED);
|
1996-07-11 16:32:50 +00:00
|
|
|
out:
|
|
|
|
TCPDEBUG2(PRU_ATTACH);
|
2015-09-13 15:50:55 +00:00
|
|
|
TCP_PROBE2(debug__user, tp, PRU_ATTACH);
|
2020-01-22 05:54:58 +00:00
|
|
|
return (error);
|
1996-07-11 16:32:50 +00:00
|
|
|
}
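The address-family flag selection in tcp_usr_attach() above can be restated as a small pure function. This is a sketch only: the `SK_*` constants and `attach_vflag()` are hypothetical stand-ins for INP_IPV4, INP_IPV6 and INP_IPV6PROTO, their values are chosen for the illustration, and the IN6P_IPV6_V6ONLY test is reduced to a boolean parameter.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bits for this sketch; not the kernel's definitions. */
#define SK_IPV4		0x1
#define SK_IPV6		0x2
#define SK_IPV6PROTO	0x4

/*
 * Mirror of the branch structure in the attach path: an IPv6 protocol
 * socket always gets the IPv6 flag and additionally the IPv4 flag unless
 * it is v6-only; any other socket gets only the IPv4 flag.
 */
unsigned
attach_vflag(unsigned vflag, bool v6only)
{
	if (vflag & SK_IPV6PROTO) {
		vflag |= SK_IPV6;
		if (!v6only)
			vflag |= SK_IPV4;
	} else
		vflag |= SK_IPV4;
	return (vflag);
}
```

The three reachable outcomes are IPv4 only, IPv6 only, and dual-stack, depending on the protocol flag and the v6-only setting.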
|
|
|
|
|
|
|
|
/*
|
2020-01-22 06:06:27 +00:00
|
|
|
* tcp_usr_detach is called when the socket layer loses its final reference
|
2006-07-21 17:11:15 +00:00
|
|
|
* to the socket, be it a file descriptor reference, a reference from TCP,
|
|
|
|
* etc. At this point, there is only one case in which we will keep around
|
|
|
|
* inpcb state: time wait.
|
1996-07-11 16:32:50 +00:00
|
|
|
*/
|
Change protocol switch method pru_detach() so that it returns void
rather than an error. Detaches do not "fail", they either occur or
the protocol flags SS_PROTOREF to take ownership of the socket.
soclose() no longer looks at so_pcb to see if it's NULL, relying
entirely on the protocol to decide whether it's time to free the
socket or not using SS_PROTOREF. so_pcb is now entirely owned and
managed by the protocol code. Likewise, no longer test so_pcb in
other socket functions, such as soreceive(), which have no business
digging into protocol internals.
Protocol detach routines no longer try to free the socket on detach;
this is performed in the socket code if the protocol permits it.
In rts_detach(), no longer test for rp != NULL in detach, and
likewise in other protocols that don't permit a NULL so_pcb, reduce
the incidence of testing for it during detach.
netinet and netinet6 are not fully updated to this change, which
will be in an upcoming commit. In their current state they may leak
memory or panic.
MFC after: 3 months
2006-04-01 15:42:02 +00:00
|
|
|
static void
|
2020-01-22 06:06:27 +00:00
|
|
|
tcp_usr_detach(struct socket *so)
|
1996-07-11 16:32:50 +00:00
|
|
|
{
|
2020-01-22 06:06:27 +00:00
|
|
|
struct inpcb *inp;
|
1996-07-11 16:32:50 +00:00
|
|
|
struct tcpcb *tp;
|
|
|
|
|
2020-01-22 06:06:27 +00:00
|
|
|
inp = sotoinpcb(so);
|
|
|
|
KASSERT(inp != NULL, ("%s: inp == NULL", __func__));
|
|
|
|
INP_WLOCK(inp);
|
|
|
|
KASSERT(so->so_pcb == inp && inp->inp_socket == so,
|
|
|
|
("%s: socket %p inp %p mismatch", __func__, so, inp));
|
2006-04-02 16:42:51 +00:00
|
|
|
|
2006-07-21 17:11:15 +00:00
|
|
|
tp = intotcpcb(inp);
|
|
|
|
|
2009-03-15 09:58:31 +00:00
|
|
|
if (inp->inp_flags & INP_TIMEWAIT) {
|
2006-07-21 17:11:15 +00:00
|
|
|
/*
|
|
|
|
* There are two cases to handle: one in which the time wait
|
|
|
|
* state is being discarded (INP_DROPPED), and one in which
|
|
|
|
* this connection will remain in timewait. In the former,
|
|
|
|
* it is time to discard all state (except tcptw, which has
|
|
|
|
* already been discarded by the timewait close code, which
|
|
|
|
* should be further up the call stack somewhere). In the
|
|
|
|
* latter case, we detach from the socket, but leave the pcb
|
|
|
|
* present until timewait ends.
|
|
|
|
*
|
|
|
|
* XXXRW: Would it be cleaner to free the tcptw here?
|
2014-10-30 08:53:56 +00:00
|
|
|
*
|
|
|
|
* Astute question indeed, from twtcp perspective there are
|
2018-03-21 20:59:30 +00:00
|
|
|
* four cases to consider:
|
2014-10-30 08:53:56 +00:00
|
|
|
*
|
2020-01-22 06:06:27 +00:00
|
|
|
* #1 tcp_usr_detach is called at tcptw creation time by
|
2014-10-30 08:53:56 +00:00
|
|
|
* tcp_twstart, then do not discard the newly created tcptw
|
|
|
|
* and leave inpcb present until timewait ends
|
2020-01-22 06:06:27 +00:00
|
|
|
* #2 tcp_usr_detach is called at tcptw creation time by
|
2018-03-21 20:59:30 +00:00
|
|
|
* tcp_twstart, but connection is local and tw will be
|
|
|
|
* discarded immediately
|
2020-01-22 06:06:27 +00:00
|
|
|
* #3 tcp_usr_detach is called at timewait end (or reuse) by
|
2014-10-30 08:53:56 +00:00
|
|
|
* tcp_twclose, then the tcptw has already been discarded
|
2015-08-03 12:13:54 +00:00
|
|
|
* (or reused) and inpcb is freed here
|
2020-01-22 06:06:27 +00:00
|
|
|
* #4 tcp_usr_detach() is called after timewait ends (or reuse)
|
2014-10-30 08:53:56 +00:00
|
|
|
* (e.g. by soclose), then tcptw has already been discarded
|
2015-08-03 12:13:54 +00:00
|
|
|
* (or reused) and inpcb is freed here
|
2014-10-30 08:53:56 +00:00
|
|
|
*
|
|
|
|
* In all four cases the tcptw should not be freed here.
|
2006-07-21 17:11:15 +00:00
|
|
|
*/
|
2009-03-15 09:58:31 +00:00
|
|
|
if (inp->inp_flags & INP_DROPPED) {
|
2008-11-26 20:52:26 +00:00
|
|
|
in_pcbdetach(inp);
|
2016-10-18 07:16:49 +00:00
|
|
|
if (__predict_true(tp == NULL)) {
|
|
|
|
in_pcbfree(inp);
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* This case should not happen as in TIMEWAIT
|
|
|
|
* state the inp should not be destroyed before
|
|
|
|
* its tcptw. If INVARIANTS is defined, panic.
|
|
|
|
*/
|
|
|
|
#ifdef INVARIANTS
|
|
|
|
panic("%s: Panic before an inp double-free: "
|
|
|
|
"INP_TIMEWAIT && INP_DROPPED && tp != NULL"
|
|
|
|
, __func__);
|
|
|
|
#else
|
|
|
|
log(LOG_ERR, "%s: Avoid an inp double-free: "
|
|
|
|
"INP_TIMEWAIT && INP_DROPPED && tp != NULL"
|
|
|
|
, __func__);
|
|
|
|
#endif
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
}
|
2006-04-01 16:36:36 +00:00
|
|
|
} else {
|
2008-11-26 20:52:26 +00:00
|
|
|
in_pcbdetach(inp);
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2006-04-01 16:36:36 +00:00
|
|
|
}
|
|
|
|
} else {
|
2006-04-03 12:43:56 +00:00
|
|
|
/*
|
2006-07-21 17:11:15 +00:00
|
|
|
* If the connection is not in timewait, we consider two
|
|
|
|
* conditions: one in which no further processing is
|
|
|
|
* necessary (dropped || embryonic), and one in which TCP is
|
|
|
|
* not yet done, but no longer requires the socket, so the
|
|
|
|
* pcb will persist for the time being.
|
|
|
|
*
|
|
|
|
* XXXRW: Does the second case still occur?
|
2006-04-03 12:43:56 +00:00
|
|
|
*/
|
2009-03-15 09:58:31 +00:00
|
|
|
if (inp->inp_flags & INP_DROPPED ||
|
2006-04-01 16:36:36 +00:00
|
|
|
tp->t_state < TCPS_SYN_SENT) {
|
|
|
|
tcp_discardcb(tp);
|
2008-11-26 20:52:26 +00:00
|
|
|
in_pcbdetach(inp);
|
2008-11-27 12:04:35 +00:00
|
|
|
in_pcbfree(inp);
|
2012-01-06 18:29:40 +00:00
|
|
|
} else {
|
2008-11-27 12:04:35 +00:00
|
|
|
in_pcbdetach(inp);
|
2012-01-06 18:29:40 +00:00
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
}
|
2006-04-01 16:36:36 +00:00
|
|
|
}
|
2006-04-24 08:20:02 +00:00
|
|
|
}
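/*
 * Illustrative sketch, not part of the kernel source. It models only the
 * non-timewait decision made in the branch above: a dropped or embryonic
 * connection is torn down at once, otherwise the inpcb is merely detached
 * and persists. The SK_ names and the flag bit are hypothetical stand-ins
 * chosen for this example; the state value follows the usual ordering in
 * which CLOSED and LISTEN precede SYN_SENT.
 */

```c
#include <assert.h>

/* Hypothetical stand-ins; the kernel's real definitions live in its headers. */
#define SK_INP_DROPPED   0x1  /* illustrative bit, not the kernel's value */
#define SK_TCPS_SYN_SENT 2    /* states below this are CLOSED and LISTEN */

enum sk_detach_action {
	SK_FREE_NOW,     /* tcp_discardcb(), in_pcbdetach(), in_pcbfree() */
	SK_DETACH_ONLY   /* in_pcbdetach() only; the pcb persists for now */
};

/* Mirror the condition tested in the non-timewait branch. */
static enum sk_detach_action
sk_detach_action(int inp_flags, int t_state)
{
	if ((inp_flags & SK_INP_DROPPED) != 0 || t_state < SK_TCPS_SYN_SENT)
		return (SK_FREE_NOW);
	return (SK_DETACH_ONLY);
}
```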
|
|
|
|
|
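/*
 * Illustrative userland sketch, not part of the kernel source. It restates
 * the address checks that tcp_usr_bind() performs before taking any locks:
 * the AF_UNSPEC compatibility case, the exact length check, and the
 * multicast rejection. The helper name sk_check_bind_addr() is hypothetical,
 * and the length is passed as a separate argument (an assumption made for
 * portability, since not every system has an sa_len member).
 */

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Return 0 if the address is acceptable for an IPv4 TCP bind, else an errno. */
static int
sk_check_bind_addr(struct sockaddr *nam, size_t len)
{
	struct sockaddr_in *sinp = (struct sockaddr_in *)nam;

	if (nam->sa_family != AF_INET) {
		/*
		 * Old programs may pass AF_UNSPEC with a wildcard address.
		 * The length test comes first so sin_addr is only read when
		 * the buffer is long enough to contain it.
		 */
		if (nam->sa_family != AF_UNSPEC ||
		    len < offsetof(struct sockaddr_in, sin_zero) ||
		    sinp->sin_addr.s_addr != INADDR_ANY)
			return (EAFNOSUPPORT);
		nam->sa_family = AF_INET;
	}
	if (len != sizeof(*sinp))
		return (EINVAL);

	/* Binding a TCP socket to a multicast address is not allowed. */
	if (IN_MULTICAST(ntohl(sinp->sin_addr.s_addr)))
		return (EAFNOSUPPORT);
	return (0);
}
```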
2011-04-30 11:21:29 +00:00
|
|
|
#ifdef INET
|
1996-07-11 16:32:50 +00:00
|
|
|
/*
|
|
|
|
* Give the socket an address.
|
|
|
|
*/
|
|
|
|
static int
|
2001-09-12 08:38:13 +00:00
|
|
|
tcp_usr_bind(struct socket *so, struct sockaddr *nam, struct thread *td)
|
1996-07-11 16:32:50 +00:00
|
|
|
{
|
|
|
|
int error = 0;
|
2002-06-10 20:05:46 +00:00
|
|
|
struct inpcb *inp;
|
2006-04-01 16:36:36 +00:00
|
|
|
struct tcpcb *tp = NULL;
|
1996-07-11 16:32:50 +00:00
|
|
|
struct sockaddr_in *sinp;
|
|
|
|
|
2004-04-04 20:14:55 +00:00
|
|
|
sinp = (struct sockaddr_in *)nam;
|
2021-05-31 18:53:34 -04:00
|
|
|
if (nam->sa_family != AF_INET) {
|
|
|
|
/*
|
|
|
|
* Preserve compatibility with old programs.
|
|
|
|
*/
|
|
|
|
if (nam->sa_family != AF_UNSPEC ||
|
2021-08-05 13:42:30 +02:00
|
|
|
nam->sa_len < offsetof(struct sockaddr_in, sin_zero) ||
|
2021-05-31 18:53:34 -04:00
|
|
|
sinp->sin_addr.s_addr != INADDR_ANY)
|
|
|
|
return (EAFNOSUPPORT);
|
|
|
|
nam->sa_family = AF_INET;
|
|
|
|
}
|
2021-05-03 12:51:04 -04:00
|
|
|
if (nam->sa_len != sizeof(*sinp))
|
2004-04-04 20:14:55 +00:00
|
|
|
return (EINVAL);
|
2021-05-03 12:51:04 -04:00
|
|
|
|
1996-07-11 16:32:50 +00:00
|
|
|
/*
|
|
|
|
* Must check for multicast addresses and disallow binding
|
|
|
|
* to them.
|
|
|
|
*/
|
2021-05-03 12:51:04 -04:00
|
|
|
if (IN_MULTICAST(ntohl(sinp->sin_addr.s_addr)))
|
2004-04-04 20:14:55 +00:00
|
|
|
return (EAFNOSUPPORT);
|
|
|
|
|
2006-04-01 16:36:36 +00:00
|
|
|
TCPDEBUG0;
|
|
|
|
inp = sotoinpcb(so);
|
|
|
|
KASSERT(inp != NULL, ("tcp_usr_bind: inp == NULL"));
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK(inp);
|
2009-03-15 09:58:31 +00:00
|
|
|
if (inp->inp_flags & (INP_TIMEWAIT | INP_DROPPED)) {
|
2006-04-01 16:36:36 +00:00
|
|
|
error = EINVAL;
|
1996-07-11 16:32:50 +00:00
|
|
|
goto out;
|
2006-04-01 16:36:36 +00:00
|
|
|
}
|
|
|
|
tp = intotcpcb(inp);
|
|
|
|
TCPDEBUG1();
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags,
supplementing the existing INPLOOKUP_WILDCARD flag, are:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
|
|
|
INP_HASH_WLOCK(&V_tcbinfo);
|
2006-04-01 16:36:36 +00:00
|
|
|
error = in_pcbbind(inp, nam, td->td_ucred);
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
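The lookup sequence described above (hash lock held only for the table walk, a reference taken before the hash lock is dropped, then the per-connection lock acquired according to the caller's flag) can be sketched as a single-threaded model. Every name here (demo_lookup(), struct demo_pcb, the DEMO_* flags) is hypothetical, and the "locks" are plain state fields rather than real kernel or pthread locks; this illustrates the ordering only, not the actual in_pcblookup() implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the kernel's definitions. */
#define DEMO_RLOCKPCB 0x1	/* return the pcb read-locked */
#define DEMO_WLOCKPCB 0x2	/* return the pcb write-locked */

enum demo_lock_state { DEMO_UNLOCKED, DEMO_RLOCKED, DEMO_WLOCKED };

struct demo_pcb {
	enum demo_lock_state lock;	/* models the per-inpcb lock */
	int refcount;			/* keeps the pcb alive between locks */
	unsigned short lport;		/* lookup key in this sketch */
};

struct demo_table {
	int hash_locked;		/* models ipi_hash_lock being held */
	struct demo_pcb *slots[4];
};

struct demo_pcb *
demo_lookup(struct demo_table *t, unsigned short lport, int flags)
{
	struct demo_pcb *pcb = NULL;
	int want = flags & (DEMO_RLOCKPCB | DEMO_WLOCKPCB);
	size_t i;

	/* Exactly one of the two locking flags must be passed. */
	if (want != DEMO_RLOCKPCB && want != DEMO_WLOCKPCB)
		return (NULL);

	/* 1. Hold the hash lock only for the table walk. */
	t->hash_locked = 1;
	for (i = 0; i < sizeof(t->slots) / sizeof(t->slots[0]); i++) {
		if (t->slots[i] != NULL && t->slots[i]->lport == lport) {
			pcb = t->slots[i];
			pcb->refcount++;	/* 2. Reference before unlocking. */
			break;
		}
	}
	t->hash_locked = 0;
	if (pcb == NULL)
		return (NULL);

	/*
	 * 3. The pcb lock precedes the hash lock in the lock order, so it
	 * is taken only after the hash lock has been dropped.
	 */
	assert(!t->hash_locked);
	pcb->lock = (want == DEMO_WLOCKPCB) ? DEMO_WLOCKED : DEMO_RLOCKED;

	/* 4. The lock now pins the pcb; the temporary reference can go. */
	pcb->refcount--;
	return (pcb);
}
```

A caller that intends to modify the connection would pass DEMO_WLOCKPCB and release the lock when done; passing neither flag, or both, is rejected in this sketch.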
2011-05-30 09:43:55 +00:00
|
|
|
INP_HASH_WUNLOCK(&V_tcbinfo);
|
Update TCP for infrastructural changes to the socket/pcb refcount model,
pru_abort(), pru_detach(), and in_pcbdetach():
- Universally support and enforce the invariant that so_pcb is
never NULL, converting dozens of unnecessary NULL checks into
assertions, and eliminating dozens of unnecessary error handling
cases in protocol code.
- In some cases, eliminate unnecessary pcbinfo locking, as it is no
longer required to ensure so_pcb != NULL. For example, the receive
code no longer requires the pcbinfo lock, and the send code only
requires it if building a new connection on an otherwise unconnected
socket triggered via sendto() with an address. This should
significantly reduce tcbinfo lock contention in the receive and send
cases.
- In order to support the invariant that so_pcb != NULL, it is now
necessary for the TCP code to not discard the tcpcb any time a
connection is dropped, but instead leave the tcpcb until the socket
is shut down. This case is handled by setting INP_DROPPED, to
substitute for using a NULL so_pcb to indicate that the connection
has been dropped. This requires the inpcb lock, but not the pcbinfo
lock.
- Unlike all other protocols in the tree, TCP may need to retain access
to the socket after the file descriptor has been closed. Set
SS_PROTOREF in tcp_detach() in order to prevent the socket from being
freed, and add a flag, INP_SOCKREF, so that the TCP code knows whether
or not it needs to free the socket when the connection finally does
close. The typical case where this occurs is if close() is called on
a TCP socket before all sent data in the send socket buffer has been
transmitted or acknowledged. If INP_SOCKREF is found when the
connection is dropped, we release the inpcb, tcpcb, and socket instead
of flagging INP_DROPPED.
- Abort and detach protocol switch methods no longer return failures,
nor attempt to free sockets, as the socket layer does this.
- Annotate the existence of a long-standing race in the TCP timer code,
in which timers are stopped but not drained when the socket is freed,
as waiting for drain may lead to deadlocks, or have to occur in a
context where waiting is not permitted. This race has been handled
by testing to see if the tcpcb pointer in the inpcb is NULL (and vice
versa), which is not normally permitted, but may be true if an inpcb
and tcpcb have been freed. Add a counter to test how often this race
has actually occurred, and a large comment for each instance where
we compare potentially freed memory with NULL. This will have to be
fixed in the near future, but requires us to further address how to
handle the timer shutdown issue.
- Several TCP calls no longer potentially free the passed inpcb/tcpcb,
so no longer need to return a pointer to indicate whether the argument
passed in is still valid.
- Un-macroize debugging and locking setup for various protocol switch
methods for TCP, as it led to more obscurity, and as locking becomes
more customized to the methods, offers less benefit.
- Assert copyright on tcp_usrreq.c due to significant modifications that
have been made as part of this work.
These changes significantly modify the memory management and connection
logic of our TCP implementation, and are (as such) High Risk Changes,
and likely to contain serious bugs. Please report problems to the
current@ mailing list ASAP, ideally with simple test cases, and
optionally, packet traces.
MFC after: 3 months
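The so_pcb != NULL invariant and the INP_DROPPED substitute described above can be sketched roughly as follows. This is a user-space model under stated assumptions: struct demo_socket, struct demo_pcb, DEMO_DROPPED, and both functions are hypothetical, the "lock" is a plain field, and EINVAL is used only because the bind path shown in this file returns it for a dropped connection.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define DEMO_DROPPED 0x1	/* hypothetical value; models INP_DROPPED */

struct demo_pcb {
	int flags;
	int locked;		/* models the per-inpcb lock */
};

struct demo_socket {
	struct demo_pcb *pcb;	/* never NULL while the socket exists */
};

/* Dropping a connection marks the pcb instead of detaching it. */
void
demo_drop(struct demo_socket *so)
{
	assert(so->pcb != NULL);
	so->pcb->locked = 1;	/* only the pcb lock is needed */
	so->pcb->flags |= DEMO_DROPPED;
	so->pcb->locked = 0;
}

/* A user request asserts the invariant, then checks the flag. */
int
demo_usr_request(struct demo_socket *so)
{
	int error = 0;

	/* The invariant makes this an assertion, not an error path. */
	assert(so->pcb != NULL);
	so->pcb->locked = 1;
	if (so->pcb->flags & DEMO_DROPPED) {
		error = EINVAL;
		goto out;
	}
	/* Request-specific work would go here. */
out:
	so->pcb->locked = 0;
	return (error);
}
```

The point of the design is that a request never has to handle a missing pcb: the pointer stays valid until the socket itself goes away, and the dropped state is read under the per-connection lock alone.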
2006-04-01 16:36:36 +00:00
|
|
|
out:
|
|
|
|
TCPDEBUG2(PRU_BIND);
|
2015-09-13 15:50:55 +00:00
|
|
|
TCP_PROBE2(debug__user, tp, PRU_BIND);
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2006-04-01 16:36:36 +00:00
|
|
|
|
|
|
|
return (error);
|
1996-07-11 16:32:50 +00:00
|
|
|
}
|
2011-04-30 11:21:29 +00:00
|
|
|
#endif /* INET */
|
1996-07-11 16:32:50 +00:00
|
|
|
|
2000-01-09 19:17:30 +00:00
|
|
|
#ifdef INET6
|
|
|
|
static int
|
2001-09-12 08:38:13 +00:00
|
|
|
tcp6_usr_bind(struct socket *so, struct sockaddr *nam, struct thread *td)
|
2000-01-09 19:17:30 +00:00
|
|
|
{
|
|
|
|
int error = 0;
|
2002-06-10 20:05:46 +00:00
|
|
|
struct inpcb *inp;
|
2006-04-01 16:36:36 +00:00
|
|
|
struct tcpcb *tp = NULL;
|
2019-08-02 07:41:36 +00:00
|
|
|
struct sockaddr_in6 *sin6;
|
2019-10-24 20:05:10 +00:00
|
|
|
u_char vflagsav;
|
2000-01-09 19:17:30 +00:00
|
|
|
|
2019-08-02 07:41:36 +00:00
|
|
|
sin6 = (struct sockaddr_in6 *)nam;
|
2021-05-03 12:51:04 -04:00
|
|
|
if (nam->sa_family != AF_INET6)
|
|
|
|
return (EAFNOSUPPORT);
|
|
|
|
if (nam->sa_len != sizeof(*sin6))
|
2004-04-04 20:14:55 +00:00
|
|
|
return (EINVAL);
|
2021-05-03 12:51:04 -04:00
|
|
|
|
2000-01-09 19:17:30 +00:00
|
|
|
/*
|
|
|
|
* Must check for multicast addresses and disallow binding
|
|
|
|
* to them.
|
|
|
|
*/
|
2021-05-03 12:51:04 -04:00
|
|
|
if (IN6_IS_ADDR_MULTICAST(&sin6->sin6_addr))
|
2004-04-04 20:14:55 +00:00
|
|
|
return (EAFNOSUPPORT);
|
|
|
|
|
2006-04-01 16:36:36 +00:00
|
|
|
TCPDEBUG0;
|
|
|
|
inp = sotoinpcb(so);
|
|
|
|
KASSERT(inp != NULL, ("tcp6_usr_bind: inp == NULL"));
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK(inp);
|
2019-10-24 20:05:10 +00:00
|
|
|
vflagsav = inp->inp_vflag;
|
2009-03-15 09:58:31 +00:00
|
|
|
if (inp->inp_flags & (INP_TIMEWAIT | INP_DROPPED)) {
|
2006-04-01 16:36:36 +00:00
|
|
|
error = EINVAL;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
tp = intotcpcb(inp);
|
|
|
|
TCPDEBUG1();
|
2011-05-30 09:43:55 +00:00
|
|
|
INP_HASH_WLOCK(&V_tcbinfo);
|
2000-01-09 19:17:30 +00:00
|
|
|
inp->inp_vflag &= ~INP_IPV4;
|
|
|
|
inp->inp_vflag |= INP_IPV6;
|
2011-04-30 11:21:29 +00:00
|
|
|
#ifdef INET
|
2002-07-25 18:10:04 +00:00
|
|
|
if ((inp->inp_flags & IN6P_IPV6_V6ONLY) == 0) {
|
2019-08-02 07:41:36 +00:00
|
|
|
if (IN6_IS_ADDR_UNSPECIFIED(&sin6->sin6_addr))
|
2000-01-09 19:17:30 +00:00
|
|
|
inp->inp_vflag |= INP_IPV4;
|
2019-08-02 07:41:36 +00:00
|
|
|
else if (IN6_IS_ADDR_V4MAPPED(&sin6->sin6_addr)) {
|
2000-01-09 19:17:30 +00:00
|
|
|
struct sockaddr_in sin;
|
|
|
|
|
2019-08-02 07:41:36 +00:00
|
|
|
in6_sin6_2_sin(&sin, sin6);
|
2018-07-30 21:27:26 +00:00
|
|
|
if (IN_MULTICAST(ntohl(sin.sin_addr.s_addr))) {
|
|
|
|
error = EAFNOSUPPORT;
|
|
|
|
INP_HASH_WUNLOCK(&V_tcbinfo);
|
|
|
|
goto out;
|
|
|
|
}
|
2000-01-09 19:17:30 +00:00
|
|
|
inp->inp_vflag |= INP_IPV4;
|
|
|
|
inp->inp_vflag &= ~INP_IPV6;
|
2004-03-27 21:05:46 +00:00
|
|
|
error = in_pcbbind(inp, (struct sockaddr *)&sin,
|
|
|
|
td->td_ucred);
|
2011-05-30 09:43:55 +00:00
|
|
|
INP_HASH_WUNLOCK(&V_tcbinfo);
|
2000-01-09 19:17:30 +00:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
}
|
2011-04-30 11:21:29 +00:00
|
|
|
#endif
|
2004-03-27 21:05:46 +00:00
|
|
|
error = in6_pcbbind(inp, nam, td->td_ucred);
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
|
|
|
INP_HASH_WUNLOCK(&V_tcbinfo);
|
Update TCP for infrastructural changes to the socket/pcb refcount model,
pru_abort(), pru_detach(), and in_pcbdetach():
- Universally support and enforce the invariant that so_pcb is
never NULL, converting dozens of unnecessary NULL checks into
assertions, and eliminating dozens of unnecessary error handling
cases in protocol code.
- In some cases, eliminate unnecessary pcbinfo locking, as it is no
longer required to ensure so_pcb != NULL. For example, the receive
code no longer requires the pcbinfo lock, and the send code only
requires it if building a new connection on an otherwise unconnected
socket triggered via sendto() with an address. This should
significantly reduce tcbinfo lock contention in the receive and send
cases.
- In order to support the invariant that so_pcb != NULL, it is now
necessary for the TCP code to not discard the tcpcb any time a
connection is dropped, but instead leave the tcpcb until the socket
is shut down. This case is handled by setting INP_DROPPED, to
substitute for using a NULL so_pcb to indicate that the connection
has been dropped. This requires the inpcb lock, but not the pcbinfo
lock.
- Unlike all other protocols in the tree, TCP may need to retain access
to the socket after the file descriptor has been closed. Set
SS_PROTOREF in tcp_detach() in order to prevent the socket from being
freed, and add a flag, INP_SOCKREF, so that the TCP code knows whether
or not it needs to free the socket when the connection finally does
close. The typical case where this occurs is if close() is called on
a TCP socket before all sent data in the send socket buffer has been
transmitted or acknowledged. If INP_SOCKREF is found when the
connection is dropped, we release the inpcb, tcpcb, and socket instead
of flagging INP_DROPPED.
- Abort and detach protocol switch methods no longer return failures,
nor attempt to free sockets, as the socket layer does this.
- Annotate the existence of a long-standing race in the TCP timer code,
in which timers are stopped but not drained when the socket is freed,
as waiting for drain may lead to deadlocks, or have to occur in a
context where waiting is not permitted. This race has been handled
by testing to see if the tcpcb pointer in the inpcb is NULL (and vice
versa), which is not normally permitted, but may be true if an inpcb
and tcpcb have been freed. Add a counter to test how often this race
has actually occurred, and a large comment for each instance where
we compare potentially freed memory with NULL. This will have to be
fixed in the near future, but requires us to further address how to
handle the timer shutdown issue.
- Several TCP calls no longer potentially free the passed inpcb/tcpcb,
so no longer need to return a pointer to indicate whether the argument
passed in is still valid.
- Un-macroize debugging and locking setup for various protocol switch
methods for TCP, as it led to more obscurity, and as locking becomes
more customized to the methods, offers less benefit.
- Assert copyright on tcp_usrreq.c due to significant modifications that
have been made as part of this work.
These changes significantly modify the memory management and connection
logic of our TCP implementation, and are (as such) High Risk Changes,
and likely to contain serious bugs. Please report problems to the
current@ mailing list ASAP, ideally with simple test cases, and
optionally, packet traces.
MFC after: 3 months
2006-04-01 16:36:36 +00:00
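The INP_DROPPED convention described in the message above can be illustrated with a small userland sketch. This is not the kernel code: `struct conn_sketch`, `PCB_DROPPED`, `conn_drop()` and `conn_send()` are hypothetical names, a pthread mutex stands in for the inpcb lock, and the EINVAL return simply mirrors the check visible in tcp_usr_listen(). The point is only that a dropped connection keeps its control block and is marked with a flag, so callers test the flag under the per-connection lock instead of testing a pointer for NULL.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

#define	PCB_DROPPED	0x1	/* hypothetical analogue of INP_DROPPED */

/* Hypothetical stand-in for an inpcb/tcpcb pair. */
struct conn_sketch {
	pthread_mutex_t	lock;		/* plays the role of the inpcb lock */
	int		flags;		/* PCB_DROPPED once the connection is gone */
	int		bytes_sent;	/* toy per-connection state */
};

static void
conn_init(struct conn_sketch *c)
{
	pthread_mutex_init(&c->lock, NULL);
	c->flags = 0;
	c->bytes_sent = 0;
}

/* Drop the connection: keep the structure, only set the flag. */
static void
conn_drop(struct conn_sketch *c)
{
	pthread_mutex_lock(&c->lock);
	c->flags |= PCB_DROPPED;
	pthread_mutex_unlock(&c->lock);
}

/* A user request: check the flag under the per-connection lock only. */
static int
conn_send(struct conn_sketch *c, int len)
{
	int error = 0;

	pthread_mutex_lock(&c->lock);
	if (c->flags & PCB_DROPPED)
		error = EINVAL;		/* connection was dropped earlier */
	else
		c->bytes_sent += len;
	pthread_mutex_unlock(&c->lock);
	return (error);
}
```

In this sketch no global lock is needed to decide whether the connection is still usable, which is the property the message attributes to INP_DROPPED.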
|
|
|
out:
|
2019-10-24 20:05:10 +00:00
|
|
|
if (error != 0)
|
|
|
|
inp->inp_vflag = vflagsav;
|
2006-04-01 16:36:36 +00:00
|
|
|
TCPDEBUG2(PRU_BIND);
|
2015-09-13 15:50:55 +00:00
|
|
|
TCP_PROBE2(debug__user, tp, PRU_BIND);
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2006-04-01 16:36:36 +00:00
|
|
|
return (error);
|
2000-01-09 19:17:30 +00:00
|
|
|
}
|
|
|
|
#endif /* INET6 */
|
|
|
|
|
2011-04-30 11:21:29 +00:00
|
|
|
#ifdef INET
|
1996-07-11 16:32:50 +00:00
|
|
|
/*
|
|
|
|
* Prepare to accept connections.
|
|
|
|
*/
|
|
|
|
static int
|
2005-10-30 19:44:40 +00:00
|
|
|
tcp_usr_listen(struct socket *so, int backlog, struct thread *td)
|
1996-07-11 16:32:50 +00:00
|
|
|
{
|
|
|
|
int error = 0;
|
2002-06-10 20:05:46 +00:00
|
|
|
struct inpcb *inp;
|
2006-04-01 16:36:36 +00:00
|
|
|
struct tcpcb *tp = NULL;
|
1996-07-11 16:32:50 +00:00
|
|
|
|
2006-04-01 16:36:36 +00:00
|
|
|
TCPDEBUG0;
|
|
|
|
inp = sotoinpcb(so);
|
|
|
|
KASSERT(inp != NULL, ("tcp_usr_listen: inp == NULL"));
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK(inp);
|
2009-03-15 09:58:31 +00:00
|
|
|
if (inp->inp_flags & (INP_TIMEWAIT | INP_DROPPED)) {
|
2006-04-01 16:36:36 +00:00
|
|
|
error = EINVAL;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
tp = intotcpcb(inp);
|
|
|
|
TCPDEBUG1();
|
In the current world order, solisten() implements the state transition of
a socket from a regular socket to a listening socket able to accept new
connections. As part of this state transition, solisten() calls into the
protocol to update protocol-layer state. There were several bugs in this
implementation that could result in a race wherein a TCP SYN received
in the interval between the protocol state transition and the shortly
following socket layer transition would result in a panic in the TCP code,
as the socket would be in the TCPS_LISTEN state, but the socket would not
have the SO_ACCEPTCONN flag set.
This change does the following:
- Pushes the socket state transition from the socket layer solisten() to
socket "library" routines called from the protocol. This permits
the socket routines to be called while holding the protocol mutexes,
preventing a race exposing the incomplete socket state transition to TCP
after the TCP state transition has completed. The check for a socket
layer state transition is performed by solisten_proto_check(), and the
actual transition is performed by solisten_proto().
- Holds the socket lock for the duration of the socket state test and set,
and over the protocol layer state transition, which is now possible as
the socket lock is acquired by the protocol layer, rather than vice
versa. This prevents additional state related races in the socket
layer.
This permits the dual transition of socket layer and protocol layer state
to occur while holding locks for both layers, making the two changes
atomic with respect to one another. Similar changes are likely required
elsewhere in the socket/protocol code.
Reported by: Peter Holm <peter@holm.cc>
Review and fixes from: emax, Antoine Brodin <antoine.brodin@laposte.net>
Philosophical head nod: gnn
2005-02-21 21:58:17 +00:00
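The dual transition described in the message above can be sketched in userland C. This is a minimal illustration under stated assumptions, not the kernel implementation: `struct sock_sketch`, `listen_check()`, `listen_commit()` and `proto_listen()` are hypothetical stand-ins for the socket, solisten_proto_check(), solisten_proto() and the protocol's listen method, and pthread mutexes stand in for the protocol and socket locks.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a socket plus its protocol control block. */
struct sock_sketch {
	pthread_mutex_t	proto_lock;	/* plays the role of the inpcb lock */
	pthread_mutex_t	sock_lock;	/* plays the role of SOCK_LOCK() */
	bool		connected;	/* socket-layer state */
	bool		accepting;	/* socket-layer "listening" flag */
	bool		proto_listen;	/* protocol-layer "LISTEN" state */
};

static void
sock_sketch_init(struct sock_sketch *s)
{
	pthread_mutex_init(&s->proto_lock, NULL);
	pthread_mutex_init(&s->sock_lock, NULL);
	s->connected = s->accepting = s->proto_listen = false;
}

/* Socket-layer test; the caller holds sock_lock. */
static int
listen_check(const struct sock_sketch *s)
{
	return (s->connected ? EINVAL : 0);
}

/* Socket-layer set; the caller holds sock_lock. */
static void
listen_commit(struct sock_sketch *s)
{
	s->accepting = true;
}

/*
 * Protocol-layer listen method: take the protocol lock, then the socket
 * lock, and change both layers before releasing either lock, so no other
 * thread can observe one layer listening and the other not.
 */
static int
proto_listen(struct sock_sketch *s)
{
	int error;

	pthread_mutex_lock(&s->proto_lock);
	pthread_mutex_lock(&s->sock_lock);
	error = listen_check(s);
	if (error == 0) {
		s->proto_listen = true;
		listen_commit(s);
	}
	pthread_mutex_unlock(&s->sock_lock);
	pthread_mutex_unlock(&s->proto_lock);
	return (error);
}
```

Because the protocol routine acquires the socket lock itself, the test and the set happen inside one critical section; a failed check leaves both layers untouched.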
|
|
|
SOCK_LOCK(so);
|
|
|
|
error = solisten_proto_check(so);
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need to acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. The new lookup flags,
supplementing the existing INPLOOKUP_WILDCARD flag, are:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
|
|
|
INP_HASH_WLOCK(&V_tcbinfo);
|
2005-02-21 21:58:17 +00:00
|
|
|
if (error == 0 && inp->inp_lport == 0)
|
2004-03-27 21:05:46 +00:00
|
|
|
error = in_pcbbind(inp, (struct sockaddr *)0, td->td_ucred);
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need to acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags,
supplementing the existing INPLOOKUP_WILDCARD flag, are:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
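The lookup ordering described in this commit message (take the hash lock, pin the inpcb with a reference, drop the hash lock, then take the per-connection lock selected by the caller's flag) can be illustrated with a small user-space model. This is only a sketch under stated assumptions: every model_* name and MODEL_* constant is hypothetical, locks are modelled as plain fields rather than real rwlocks, and the kernel's re-check for an inpcb freed between the two steps is omitted.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_RLOCKPCB	0x1	/* stands in for INPLOOKUP_RLOCKPCB */
#define MODEL_WLOCKPCB	0x2	/* stands in for INPLOOKUP_WLOCKPCB */

/* Hypothetical stand-in for an inpcb; not the kernel structure. */
struct model_inpcb {
	unsigned short lport;	/* local port used as the lookup key */
	int refcount;		/* references pinning the pcb */
	int readers;		/* read locks currently held */
	bool writer;		/* write lock currently held */
};

/* Hypothetical stand-in for an inpcbinfo with a single hash lock. */
struct model_pcbinfo {
	bool hash_locked;	/* models ipi_hash_lock */
	struct model_inpcb *table;
	size_t count;
};

static struct model_inpcb *
model_pcblookup(struct model_pcbinfo *pi, unsigned short lport, int flags)
{
	struct model_inpcb *inp = NULL;
	bool rlock = (flags & MODEL_RLOCKPCB) != 0;
	bool wlock = (flags & MODEL_WLOCKPCB) != 0;
	size_t i;

	/* Callers must pass exactly one of the two lock flags. */
	if (rlock == wlock)
		return (NULL);

	/* Step 1: hold the hash lock only for the table walk. */
	pi->hash_locked = true;
	for (i = 0; i < pi->count; i++) {
		if (pi->table[i].lport == lport) {
			inp = &pi->table[i];
			inp->refcount++;	/* pin before unlocking */
			break;
		}
	}
	pi->hash_locked = false;
	if (inp == NULL)
		return (NULL);

	/* Step 2: take the requested per-connection lock. */
	if (wlock)
		inp->writer = true;
	else
		inp->readers++;

	/* Step 3: the held pcb lock replaces the temporary reference. */
	inp->refcount--;
	return (inp);
}
```

A caller in this model would invoke model_pcblookup(&pi, 443, MODEL_WLOCKPCB) and receive the matching entry already write-locked, with the hash lock released; passing neither flag or both flags returns NULL, mirroring the "exactly one of these flags" rule above.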
|
|
|
INP_HASH_WUNLOCK(&V_tcbinfo);
|
In the current world order, solisten() implements the state transition of
a socket from a regular socket to a listening socket able to accept new
connections. As part of this state transition, solisten() calls into the
protocol to update protocol-layer state. There were several bugs in this
implementation that could result in a race wherein a TCP SYN received
in the interval between the protocol state transition and the shortly
following socket layer transition would result in a panic in the TCP code,
as the socket would be in the TCPS_LISTEN state, but the socket would not
have the SO_ACCEPTCONN flag set.
This change does the following:
- Pushes the socket state transition from the socket layer solisten() to
socket "library" routines called from the protocol. This permits
the socket routines to be called while holding the protocol mutexes,
preventing a race exposing the incomplete socket state transition to TCP
after the TCP state transition has completed. The check for a socket
layer state transition is performed by solisten_proto_check(), and the
actual transition is performed by solisten_proto().
- Holds the socket lock for the duration of the socket state test and set,
and over the protocol layer state transition, which is now possible as
the socket lock is acquired by the protocol layer, rather than vice
versa. This prevents additional state related races in the socket
layer.
This permits the dual transition of socket layer and protocol layer state
to occur while holding locks for both layers, making the two changes
atomic with respect to one another. Similar changes are likely required
elsewhere in the socket/protocol code.
Reported by: Peter Holm <peter@holm.cc>
Review and fixes from: emax, Antoine Brodin <antoine.brodin@laposte.net>
Philosophical head nod: gnn
2005-02-21 21:58:17 +00:00
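The dual transition described in this commit message can be modelled in user space as a check, a protocol state change, and a socket state change inside one critical section. The following is a minimal sketch, not the kernel code: all model_* names are hypothetical, the lock is a plain boolean rather than a mutex, and the EINVAL check on a connected socket is an assumption about what the check routine rejects.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins; not the kernel's struct socket or tcpcb. */
enum model_tcp_state { MODEL_CLOSED, MODEL_LISTEN, MODEL_ESTABLISHED };

struct model_socket {
	bool lock_held;		/* models SOCK_LOCK()/SOCK_UNLOCK() */
	bool acceptconn;	/* models SO_ACCEPTCONN */
	bool connected;		/* models a connected or connecting socket */
	int qlimit;		/* models the listen backlog */
	enum model_tcp_state tcp_state;
};

/* Models solisten_proto_check(): may this socket start listening? */
static int
model_listen_check(const struct model_socket *so)
{
	return (so->connected ? EINVAL : 0);
}

/* Models solisten_proto(): the socket-layer half of the transition. */
static void
model_listen_set(struct model_socket *so, int backlog)
{
	so->acceptconn = true;
	so->qlimit = backlog;
}

/*
 * Models the protocol's listen method. Both layers change state inside a
 * single socket-lock critical section, so an observer that takes the lock
 * never sees MODEL_LISTEN without acceptconn set.
 */
static int
model_usr_listen(struct model_socket *so, int backlog)
{
	int error;

	so->lock_held = true;
	error = model_listen_check(so);
	if (error == 0) {
		so->tcp_state = MODEL_LISTEN;	/* protocol layer */
		model_listen_set(so, backlog);	/* socket layer */
	}
	so->lock_held = false;
	return (error);
}
```

In this model a fresh socket passed to model_usr_listen(&so, 5) ends up with both tcp_state == MODEL_LISTEN and acceptconn set, while an already connected socket is rejected with neither field changed.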
|
|
|
if (error == 0) {
|
2013-08-25 21:54:41 +00:00
|
|
|
tcp_state_change(tp, TCPS_LISTEN);
|
2005-10-30 19:44:40 +00:00
|
|
|
solisten_proto(so, backlog);
|
2012-06-19 07:34:13 +00:00
|
|
|
#ifdef TCP_OFFLOAD
|
2013-01-25 20:23:33 +00:00
|
|
|
if ((so->so_options & SO_NO_OFFLOAD) == 0)
|
|
|
|
tcp_offload_listen_start(tp);
|
2012-06-19 07:34:13 +00:00
|
|
|
#endif
|
2005-02-21 21:58:17 +00:00
|
|
|
}
|
|
|
|
SOCK_UNLOCK(so);
|
Update TCP for infrastructural changes to the socket/pcb refcount model,
pru_abort(), pru_detach(), and in_pcbdetach():
- Universally support and enforce the invariant that so_pcb is
never NULL, converting dozens of unnecessary NULL checks into
assertions, and eliminating dozens of unnecessary error handling
cases in protocol code.
- In some cases, eliminate unnecessary pcbinfo locking, as it is no
longer required to ensure so_pcb != NULL. For example, the receive
code no longer requires the pcbinfo lock, and the send code only
requires it if building a new connection on an otherwise unconnected
socket triggered via sendto() with an address. This should
significantly reduce tcbinfo lock contention in the receive and send
cases.
- In order to support the invariant that so_pcb != NULL, it is now
necessary for the TCP code to not discard the tcpcb any time a
connection is dropped, but instead leave the tcpcb until the socket
is shut down. This case is handled by setting INP_DROPPED, to
substitute for using a NULL so_pcb to indicate that the connection
has been dropped. This requires the inpcb lock, but not the pcbinfo
lock.
- Unlike all other protocols in the tree, TCP may need to retain access
to the socket after the file descriptor has been closed. Set
SS_PROTOREF in tcp_detach() in order to prevent the socket from being
freed, and add a flag, INP_SOCKREF, so that the TCP code knows whether
or not it needs to free the socket when the connection finally does
close. The typical case where this occurs is if close() is called on
a TCP socket before all sent data in the send socket buffer has been
transmitted or acknowledged. If INP_SOCKREF is found when the
connection is dropped, we release the inpcb, tcpcb, and socket instead
of flagging INP_DROPPED.
- Abort and detach protocol switch methods no longer return failures,
nor attempt to free sockets, as the socket layer does this.
- Annotate the existence of a long-standing race in the TCP timer code,
in which timers are stopped but not drained when the socket is freed,
as waiting for drain may lead to deadlocks, or have to occur in a
context where waiting is not permitted. This race has been handled
by testing to see if the tcpcb pointer in the inpcb is NULL (and vice
versa), which is not normally permitted, but may be true if an inpcb
and tcpcb have been freed. Add a counter to test how often this race
has actually occurred, and a large comment for each instance where
we compare potentially freed memory with NULL. This will have to be
fixed in the near future, but requires us to further address how to
handle the timer shutdown issue.
- Several TCP calls no longer potentially free the passed inpcb/tcpcb,
so no longer need to return a pointer to indicate whether the argument
passed in is still valid.
- Un-macroize debugging and locking setup for various protocol switch
methods for TCP, as it led to more obscurity, and as locking becomes
more customized to the methods, offers less benefit.
- Assert copyright on tcp_usrreq.c due to significant modifications that
have been made as part of this work.
These changes significantly modify the memory management and connection
logic of our TCP implementation, and are (as such) High Risk Changes,
and likely to contain serious bugs. Please report problems to the
current@ mailing list ASAP, ideally with simple test cases, and
optionally, packet traces.
MFC after: 3 months
2006-04-01 16:36:36 +00:00
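The INP_DROPPED convention described in this commit message (the pcb pointer stays valid for the life of the socket, and a flag tested under the pcb lock replaces the old NULL check) can be sketched as a small user-space model. Everything named sketch_* or SKETCH_* is hypothetical, the lock is a plain boolean, and returning EINVAL for a dropped connection is an assumption chosen to match the listen path shown in this file.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_DROPPED	0x1	/* stands in for INP_DROPPED */

/* Hypothetical stand-in for an inpcb; not the kernel structure. */
struct sketch_pcb {
	bool locked;	/* models INP_WLOCK()/INP_WUNLOCK() */
	int flags;
	int backlog;	/* some per-connection state a request updates */
};

struct sketch_socket {
	struct sketch_pcb *pcb;	/* invariant: never NULL while in use */
};

/* Models dropping a connection: flag the pcb instead of detaching it. */
static void
sketch_drop(struct sketch_socket *so)
{
	struct sketch_pcb *pcb = so->pcb;

	pcb->locked = true;
	pcb->flags |= SKETCH_DROPPED;
	pcb->locked = false;
}

/* Models a user request that must fail once the connection is dropped. */
static int
sketch_usr_request(struct sketch_socket *so, int backlog)
{
	struct sketch_pcb *pcb = so->pcb;
	int error = 0;

	assert(pcb != NULL);	/* an assertion, not an error path */
	pcb->locked = true;
	if (pcb->flags & SKETCH_DROPPED) {
		error = EINVAL;
		goto out;
	}
	pcb->backlog = backlog;
out:
	pcb->locked = false;
	return (error);
}
```

In this model a request succeeds before sketch_drop() is called and returns EINVAL afterwards, while so.pcb keeps pointing at the same structure throughout, which is the point of the invariant.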
|
|
|
|
2016-10-12 19:06:50 +00:00
|
|
|
if (IS_FASTOPEN(tp->t_flags))
|
2015-12-24 19:09:48 +00:00
|
|
|
tp->t_tfo_pending = tcp_fastopen_alloc_counter();
|
2018-02-26 03:03:41 +00:00
|
|
|
|
2006-04-01 16:36:36 +00:00
|
|
|
out:
|
|
|
|
TCPDEBUG2(PRU_LISTEN);
|
2015-09-13 15:50:55 +00:00
|
|
|
TCP_PROBE2(debug__user, tp, PRU_LISTEN);
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2006-04-01 16:36:36 +00:00
|
|
|
return (error);
|
1996-07-11 16:32:50 +00:00
|
|
|
}
|
2011-04-30 11:21:29 +00:00
|
|
|
#endif /* INET */
|
1996-07-11 16:32:50 +00:00
|
|
|
|
2000-01-09 19:17:30 +00:00
|
|
|
#ifdef INET6
|
|
|
|
static int
|
2005-10-30 19:44:40 +00:00
|
|
|
tcp6_usr_listen(struct socket *so, int backlog, struct thread *td)
|
2000-01-09 19:17:30 +00:00
|
|
|
{
|
|
|
|
int error = 0;
|
2002-06-10 20:05:46 +00:00
|
|
|
struct inpcb *inp;
|
2006-04-01 16:36:36 +00:00
|
|
|
struct tcpcb *tp = NULL;
|
2019-10-24 20:05:10 +00:00
|
|
|
u_char vflagsav;
|
2000-01-09 19:17:30 +00:00
|
|
|
|
2006-04-01 16:36:36 +00:00
|
|
|
TCPDEBUG0;
|
|
|
|
inp = sotoinpcb(so);
|
|
|
|
KASSERT(inp != NULL, ("tcp6_usr_listen: inp == NULL"));
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK(inp);
|
2009-03-15 09:58:31 +00:00
|
|
|
if (inp->inp_flags & (INP_TIMEWAIT | INP_DROPPED)) {
|
2006-04-01 16:36:36 +00:00
|
|
|
error = EINVAL;
|
|
|
|
goto out;
|
|
|
|
}
|
2019-10-24 20:05:10 +00:00
|
|
|
vflagsav = inp->inp_vflag;
|
Update TCP for infrastructural changes to the socket/pcb refcount model,
pru_abort(), pru_detach(), and in_pcbdetach():
- Universally support and enforce the invariant that so_pcb is
never NULL, converting dozens of unnecessary NULL checks into
assertions, and eliminating dozens of unnecessary error handling
cases in protocol code.
- In some cases, eliminate unnecessary pcbinfo locking, as it is no
longer required to ensure so_pcb != NULL. For example, the receive
code no longer requires the pcbinfo lock, and the send code only
requires it if building a new connection on an otherwise unconnected
socket triggered via sendto() with an address. This should
significnatly reduce tcbinfo lock contention in the receive and send
cases.
- In order to support the invariant that so_pcb != NULL, it is now
necessary for the TCP code to not discard the tcpcb any time a
connection is dropped, but instead leave the tcpcb until the socket
is shutdown. This case is handled by setting INP_DROPPED, to
substitute for using a NULL so_pcb to indicate that the connection
has been dropped. This requires the inpcb lock, but not the pcbinfo
lock.
- Unlike all other protocols in the tree, TCP may need to retain access
to the socket after the file descriptor has been closed. Set
SS_PROTOREF in tcp_detach() in order to prevent the socket from being
freed, and add a flag, INP_SOCKREF, so that the TCP code knows whether
or not it needs to free the socket when the connection finally does
close. The typical case where this occurs is if close() is called on
a TCP socket before all sent data in the send socket buffer has been
transmitted or acknowledged. If INP_SOCKREF is found when the
connection is dropped, we release the inpcb, tcpcb, and socket instead
of flagging INP_DROPPED.
- Abort and detach protocol switch methods no longer return failures,
nor attempt to free sockets, as the socket layer does this.
- Annotate the existence of a long-standing race in the TCP timer code,
in which timers are stopped but not drained when the socket is freed,
as waiting for drain may lead to deadlocks, or have to occur in a
context where waiting is not permitted. This race has been handled
by testing to see if the tcpcb pointer in the inpcb is NULL (and vice
versa), which is not normally permitted, but may be true of a inpcb
and tcpcb have been freed. Add a counter to test how often this race
has actually occurred, and a large comment for each instance where
we compare potentially freed memory with NULL. This will have to be
fixed in the near future, but requires is to further address how to
handle the timer shutdown shutdown issue.
- Several TCP calls no longer potentially free the passed inpcb/tcpcb,
so no longer need to return a pointer to indicate whether the argument
passed in is still valid.
- Un-macroize debugging and locking setup for various protocol switch
methods for TCP, as it lead to more obscurity, and as locking becomes
more customized to the methods, offers less benefit.
- Assert copyright on tcp_usrreq.c due to significant modifications that
have been made as part of this work.
These changes significantly modify the memory management and connection
logic of our TCP implementation, and are (as such) High Risk Changes,
and likely to contain serious bugs. Please report problems to the
current@ mailing list ASAP, ideally with simple test cases, and
optionally, packet traces.
MFC after: 3 months
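The INP_DROPPED / INP_SOCKREF handling described above can be sketched as a small userspace model. This is only an illustration under stated assumptions: the struct, the MODEL_* flag values, and the function names are hypothetical stand-ins, not the kernel's definitions, and all locking is omitted.

```c
#include <stdbool.h>

/* Hypothetical flag values; the kernel's INP_* constants differ. */
#define	MODEL_INP_DROPPED	0x1
#define	MODEL_INP_SOCKREF	0x2

enum model_action { MODEL_KEEP, MODEL_FREE_ALL };

/* Hypothetical stand-in for the socket/inpcb/tcpcb triple. */
struct model_conn {
	int	inp_flags;
	bool	so_protoref;	/* plays the role of SS_PROTOREF */
};

/*
 * Model of detach: if the connection still has work to do after close(),
 * the protocol keeps the socket alive and remembers that it owns it.
 */
enum model_action
model_detach(struct model_conn *c, bool still_draining)
{
	if (!still_draining)
		return (MODEL_FREE_ALL);
	c->so_protoref = true;
	c->inp_flags |= MODEL_INP_SOCKREF;
	return (MODEL_KEEP);
}

/*
 * Model of dropping a connection: free everything if the protocol holds
 * the last reference, otherwise only flag the pcb as dropped so that
 * so_pcb stays non-NULL until the socket is shut down.
 */
enum model_action
model_drop(struct model_conn *c)
{
	if (c->inp_flags & MODEL_INP_SOCKREF) {
		c->inp_flags &= ~MODEL_INP_SOCKREF;
		c->so_protoref = false;
		return (MODEL_FREE_ALL);
	}
	c->inp_flags |= MODEL_INP_DROPPED;
	return (MODEL_KEEP);
}
```

In this model, a drop on a socket that is still open only sets the dropped flag, while a drop after a deferred detach releases everything.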
2006-04-01 16:36:36 +00:00
|
|
|
tp = intotcpcb(inp);
|
|
|
|
TCPDEBUG1();
|
In the current world order, solisten() implements the state transition of
a socket from a regular socket to a listening socket able to accept new
connections. As part of this state transition, solisten() calls into the
protocol to update protocol-layer state. There were several bugs in this
implementation that could result in a race wherein a TCP SYN received
in the interval between the protocol state transition and the shortly
following socket layer transition would result in a panic in the TCP code,
as the socket would be in the TCPS_LISTEN state, but the socket would not
have the SO_ACCEPTCONN flag set.
This change does the following:
- Pushes the socket state transition from the socket layer solisten() to
socket "library" routines called from the protocol. This permits
the socket routines to be called while holding the protocol mutexes,
preventing a race exposing the incomplete socket state transition to TCP
after the TCP state transition has completed. The check for a socket
layer state transition is performed by solisten_proto_check(), and the
actual transition is performed by solisten_proto().
- Holds the socket lock for the duration of the socket state test and set,
and over the protocol layer state transition, which is now possible as
the socket lock is acquired by the protocol layer, rather than vice
versa. This prevents additional state-related races in the socket
layer.
This permits the dual transition of socket layer and protocol layer state
to occur while holding locks for both layers, making the two changes
atomic with respect to one another. Similar changes are likely required
elsewhere in the socket/protocol code.
Reported by: Peter Holm <peter@holm.cc>
Review and fixes from: emax, Antoine Brodin <antoine.brodin@laposte.net>
Philosophical head nod: gnn
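The dual transition described above can be modelled in userspace as follows. This is a minimal sketch, assuming POSIX mutexes in place of the kernel locks; every model_* name is hypothetical, and the "connected" test only approximates what solisten_proto_check() verifies.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

enum model_tcp_state { MODEL_CLOSED, MODEL_LISTEN };

/* Hypothetical stand-in for the protocol control block. */
struct model_pcb {
	pthread_mutex_t		lock;	/* plays the role of the inpcb lock */
	enum model_tcp_state	state;
};

/* Hypothetical stand-in for struct socket. */
struct model_socket {
	pthread_mutex_t	lock;		/* plays the role of SOCK_LOCK() */
	bool		connected;
	bool		acceptconn;	/* plays the role of SO_ACCEPTCONN */
	int		qlimit;
};

void
model_init(struct model_socket *so, struct model_pcb *pcb)
{
	pthread_mutex_init(&so->lock, NULL);
	pthread_mutex_init(&pcb->lock, NULL);
	so->connected = false;
	so->acceptconn = false;
	so->qlimit = 0;
	pcb->state = MODEL_CLOSED;
}

/* Socket-layer check; the caller holds the socket lock. */
static int
model_listen_check(const struct model_socket *so)
{
	return (so->connected ? EINVAL : 0);
}

/* Socket-layer transition; the caller holds the socket lock. */
static void
model_listen_commit(struct model_socket *so, int backlog)
{
	so->acceptconn = true;
	so->qlimit = backlog;
}

/*
 * Protocol-layer listen: take the protocol lock, then the socket lock,
 * and change both layers before releasing either lock.
 */
int
model_listen(struct model_socket *so, struct model_pcb *pcb, int backlog)
{
	int error;

	pthread_mutex_lock(&pcb->lock);
	pthread_mutex_lock(&so->lock);
	error = model_listen_check(so);
	if (error == 0) {
		pcb->state = MODEL_LISTEN;
		model_listen_commit(so, backlog);
	}
	pthread_mutex_unlock(&so->lock);
	pthread_mutex_unlock(&pcb->lock);
	return (error);
}

/* What an input path relies on: LISTEN implies acceptconn. */
bool
model_consistent(struct model_socket *so, struct model_pcb *pcb)
{
	bool ok;

	pthread_mutex_lock(&pcb->lock);
	pthread_mutex_lock(&so->lock);
	ok = (pcb->state != MODEL_LISTEN) || so->acceptconn;
	pthread_mutex_unlock(&so->lock);
	pthread_mutex_unlock(&pcb->lock);
	return (ok);
}
```

Because any observer that takes both locks in the same order sees either the old state of both layers or the new state of both, the window in which the protocol is in the listen state without the accept flag set is closed in this model.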
2005-02-21 21:58:17 +00:00
|
|
|
SOCK_LOCK(so);
|
|
|
|
error = solisten_proto_check(so);
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need to acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. The new lookup flags,
supplementing the existing INPLOOKUP_WILDCARD flag, are:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
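The rule that callers pass exactly one of the two lock flags can be sketched as a validation helper. This is an illustration only: the MODEL_* names and bit values are assumptions chosen for the example, not the kernel's INPLOOKUP_* definitions, and the helper is not part of in_pcblookup().

```c
#include <stdbool.h>

/* Hypothetical flag bits for illustration. */
#define	MODEL_LOOKUP_WILDCARD	0x1
#define	MODEL_LOOKUP_RLOCKPCB	0x2
#define	MODEL_LOOKUP_WLOCKPCB	0x4
#define	MODEL_LOOKUP_LOCKMASK	\
	(MODEL_LOOKUP_RLOCKPCB | MODEL_LOOKUP_WLOCKPCB)

/*
 * Return true if flags contains only known bits and exactly one of the
 * read-lock / write-lock requests.
 */
bool
model_lookup_flags_valid(int flags)
{
	int lockflags = flags & MODEL_LOOKUP_LOCKMASK;

	/* Reject bits outside the known set. */
	if ((flags & ~(MODEL_LOOKUP_WILDCARD | MODEL_LOOKUP_LOCKMASK)) != 0)
		return (false);
	/* Exactly one bit set: nonzero and a power of two. */
	return (lockflags != 0 && (lockflags & (lockflags - 1)) == 0);
}
```

A kernel implementation would more likely express this as an assertion at the top of the lookup routine; the boolean helper here just makes the rule easy to exercise.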
2011-05-30 09:43:55 +00:00
|
|
|
INP_HASH_WLOCK(&V_tcbinfo);
|
2005-02-21 21:58:17 +00:00
|
|
|
if (error == 0 && inp->inp_lport == 0) {
|
2000-01-09 19:17:30 +00:00
|
|
|
inp->inp_vflag &= ~INP_IPV4;
|
2002-07-25 18:10:04 +00:00
|
|
|
if ((inp->inp_flags & IN6P_IPV6_V6ONLY) == 0)
|
2000-01-09 19:17:30 +00:00
|
|
|
inp->inp_vflag |= INP_IPV4;
|
2004-03-27 21:05:46 +00:00
|
|
|
error = in6_pcbbind(inp, (struct sockaddr *)0, td->td_ucred);
|
2000-01-09 19:17:30 +00:00
|
|
|
}
|
2011-05-30 09:43:55 +00:00
|
|
|
INP_HASH_WUNLOCK(&V_tcbinfo);
|
2005-02-21 21:58:17 +00:00
|
|
|
if (error == 0) {
|
2013-08-25 21:54:41 +00:00
|
|
|
tcp_state_change(tp, TCPS_LISTEN);
|
2005-10-30 19:44:40 +00:00
|
|
|
solisten_proto(so, backlog);
|
2012-06-19 07:34:13 +00:00
|
|
|
#ifdef TCP_OFFLOAD
|
2013-01-25 20:23:33 +00:00
|
|
|
if ((so->so_options & SO_NO_OFFLOAD) == 0)
|
|
|
|
tcp_offload_listen_start(tp);
|
2012-06-19 07:34:13 +00:00
|
|
|
#endif
|
2005-02-21 21:58:17 +00:00
|
|
|
}
|
|
|
|
SOCK_UNLOCK(so);
|
2006-04-01 16:36:36 +00:00
|
|
|
|
2016-10-12 19:06:50 +00:00
|
|
|
if (IS_FASTOPEN(tp->t_flags))
|
2015-12-24 19:09:48 +00:00
|
|
|
tp->t_tfo_pending = tcp_fastopen_alloc_counter();
|
2018-02-26 03:03:41 +00:00
|
|
|
|
2019-10-24 20:05:10 +00:00
|
|
|
if (error != 0)
|
|
|
|
inp->inp_vflag = vflagsav;
|
|
|
|
|
2006-04-01 16:36:36 +00:00
|
|
|
out:
|
|
|
|
TCPDEBUG2(PRU_LISTEN);
|
2015-09-13 15:50:55 +00:00
|
|
|
TCP_PROBE2(debug__user, tp, PRU_LISTEN);
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2006-04-01 16:36:36 +00:00
|
|
|
return (error);
|
2000-01-09 19:17:30 +00:00
|
|
|
}
|
|
|
|
#endif /* INET6 */
|
|
|
|
|
2011-04-30 11:21:29 +00:00
|
|
|
#ifdef INET
|
1996-07-11 16:32:50 +00:00
|
|
|
/*
|
|
|
|
* Initiate connection to peer.
|
|
|
|
* Create a template for use in transmissions on this connection.
|
|
|
|
* Enter SYN_SENT state, and mark socket as connecting.
|
|
|
|
* Start keep-alive timer, and seed output sequence space.
|
|
|
|
* Send initial segment on connection.
|
|
|
|
*/
|
|
|
|
static int
|
2001-09-12 08:38:13 +00:00
|
|
|
tcp_usr_connect(struct socket *so, struct sockaddr *nam, struct thread *td)
|
1996-07-11 16:32:50 +00:00
|
|
|
{
|
2020-01-22 05:53:16 +00:00
|
|
|
struct epoch_tracker et;
|
1996-07-11 16:32:50 +00:00
|
|
|
int error = 0;
|
2002-06-10 20:05:46 +00:00
|
|
|
struct inpcb *inp;
|
2006-04-01 16:36:36 +00:00
|
|
|
struct tcpcb *tp = NULL;
|
1996-07-11 16:32:50 +00:00
|
|
|
struct sockaddr_in *sinp;
|
|
|
|
|
1997-08-16 19:16:27 +00:00
|
|
|
sinp = (struct sockaddr_in *)nam;
|
2021-05-03 12:51:04 -04:00
|
|
|
if (nam->sa_family != AF_INET)
|
|
|
|
return (EAFNOSUPPORT);
|
2004-01-10 08:53:00 +00:00
|
|
|
if (nam->sa_len != sizeof (*sinp))
|
|
|
|
return (EINVAL);
|
2021-05-03 12:51:04 -04:00
|
|
|
|
2004-04-04 20:14:55 +00:00
|
|
|
/*
|
|
|
|
* Must disallow TCP ``connections'' to multicast addresses.
|
|
|
|
*/
|
2021-05-03 12:51:04 -04:00
|
|
|
if (IN_MULTICAST(ntohl(sinp->sin_addr.s_addr)))
|
2004-04-04 20:14:55 +00:00
|
|
|
return (EAFNOSUPPORT);
|
2021-05-03 12:51:04 -04:00
|
|
|
if (ntohl(sinp->sin_addr.s_addr) == INADDR_BROADCAST)
|
2020-07-16 16:46:24 +00:00
|
|
|
return (EACCES);
|
2009-02-05 14:06:09 +00:00
|
|
|
if ((error = prison_remote_ip4(td->td_ucred, &sinp->sin_addr)) != 0)
|
|
|
|
return (error);
|
This Implements the mumbled about "Jail" feature.
This is a seriously beefed up chroot kind of thing. The process
is jailed along the same lines as a chroot does it, but with
additional tough restrictions imposed on what the superuser can do.
For all I know, it is safe to hand over the root bit inside a
prison to the customer living in that prison, this is what
it was developed for in fact: "real virtual servers".
Each prison has an ip number associated with it, which all IP
communications will be coerced to use and each prison has its own
hostname.
Needless to say, you need more RAM this way, but the advantage is
that each customer can run their own particular version of apache
and not stomp on the toes of their neighbors.
It generally does what one would expect, but setting up a jail
still takes a little knowledge.
A few notes:
I have no scripts for setting up a jail, don't ask me for them.
The IP number should be an alias on one of the interfaces.
mount a /proc in each jail, it will make ps more useable.
/proc/<pid>/status tells the hostname of the prison for
jailed processes.
Quotas are only sensible if you have a mountpoint per prison.
There are no provisions for stopping resource-hogging.
Some "#ifdef INET" and similar may be missing (send patches!)
If somebody wants to take it from here and develop it into
more of a "virtual machine" they should be most welcome!
Tools, comments, patches & documentation most welcome.
Have fun...
Sponsored by: http://www.rndassociates.com/
Run for almost a year by: http://www.servetheweb.com/
1999-04-28 11:38:52 +00:00
|
|
|
|
2006-04-01 16:36:36 +00:00
|
|
|
TCPDEBUG0;
|
|
|
|
inp = sotoinpcb(so);
|
|
|
|
KASSERT(inp != NULL, ("tcp_usr_connect: inp == NULL"));
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK(inp);
|
2015-03-09 20:29:16 +00:00
|
|
|
if (inp->inp_flags & INP_TIMEWAIT) {
|
|
|
|
error = EADDRINUSE;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
if (inp->inp_flags & INP_DROPPED) {
|
|
|
|
error = ECONNREFUSED;
|
2006-04-01 16:36:36 +00:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
tp = intotcpcb(inp);
|
|
|
|
TCPDEBUG1();
|
2020-01-22 06:10:41 +00:00
|
|
|
NET_EPOCH_ENTER(et);
|
2001-09-12 08:38:13 +00:00
|
|
|
if ((error = tcp_connect(tp, nam, td)) != 0)
|
2020-01-22 06:10:41 +00:00
|
|
|
goto out_in_epoch;
|
2012-06-19 07:34:13 +00:00
|
|
|
#ifdef TCP_OFFLOAD
|
|
|
|
if (registered_toedevs > 0 &&
|
2013-01-25 20:23:33 +00:00
|
|
|
(so->so_options & SO_NO_OFFLOAD) == 0 &&
|
2012-06-19 07:34:13 +00:00
|
|
|
(error = tcp_offload_connect(so, nam)) == 0)
|
2020-01-22 06:10:41 +00:00
|
|
|
goto out_in_epoch;
|
2012-06-19 07:34:13 +00:00
|
|
|
#endif
|
|
|
|
tcp_timer_activate(tp, TT_KEEP, TP_KEEPINIT(tp));
|
2015-12-16 00:56:45 +00:00
|
|
|
error = tp->t_fb->tfb_tcp_output(tp);
|
2020-01-22 06:10:41 +00:00
|
|
|
out_in_epoch:
|
2020-01-22 05:53:16 +00:00
|
|
|
NET_EPOCH_EXIT(et);
|
2006-04-01 16:36:36 +00:00
|
|
|
out:
|
|
|
|
TCPDEBUG2(PRU_CONNECT);
|
2016-03-03 17:46:38 +00:00
|
|
|
TCP_PROBE2(debug__user, tp, PRU_CONNECT);
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2006-04-01 16:36:36 +00:00
|
|
|
return (error);
|
1996-07-11 16:32:50 +00:00
|
|
|
}
|
2011-04-30 11:21:29 +00:00
|
|
|
#endif /* INET */
|
1996-07-11 16:32:50 +00:00
|
|
|
|
2000-01-09 19:17:30 +00:00
|
|
|
#ifdef INET6
|
|
|
|
static int
|
2001-09-12 08:38:13 +00:00
|
|
|
tcp6_usr_connect(struct socket *so, struct sockaddr *nam, struct thread *td)
|
2000-01-09 19:17:30 +00:00
|
|
|
{
|
2020-01-22 05:53:16 +00:00
|
|
|
struct epoch_tracker et;
|
2000-01-09 19:17:30 +00:00
|
|
|
int error = 0;
|
2002-06-10 20:05:46 +00:00
|
|
|
struct inpcb *inp;
|
2006-04-01 16:36:36 +00:00
|
|
|
struct tcpcb *tp = NULL;
|
2019-08-02 07:41:36 +00:00
|
|
|
struct sockaddr_in6 *sin6;
|
2019-10-24 20:05:10 +00:00
|
|
|
u_int8_t incflagsav;
|
|
|
|
u_char vflagsav;
|
2006-04-01 16:36:36 +00:00
|
|
|
|
|
|
|
TCPDEBUG0;
|
2000-01-09 19:17:30 +00:00
|
|
|
|
2019-08-02 07:41:36 +00:00
|
|
|
sin6 = (struct sockaddr_in6 *)nam;
|
2021-05-03 12:51:04 -04:00
|
|
|
if (nam->sa_family != AF_INET6)
|
|
|
|
return (EAFNOSUPPORT);
|
2019-08-02 07:41:36 +00:00
|
|
|
if (nam->sa_len != sizeof (*sin6))
|
2004-01-10 08:53:00 +00:00
|
|
|
return (EINVAL);
|
2021-05-03 12:51:04 -04:00
|
|
|
|
2004-04-04 20:14:55 +00:00
|
|
|
/*
|
|
|
|
* Must disallow TCP ``connections'' to multicast addresses.
|
|
|
|
*/
|
2021-05-03 12:51:04 -04:00
|
|
|
if (IN6_IS_ADDR_MULTICAST(&sin6->sin6_addr))
|
2004-04-04 20:14:55 +00:00
|
|
|
return (EAFNOSUPPORT);
|
2000-01-09 19:17:30 +00:00
|
|
|
|
2006-04-01 16:36:36 +00:00
|
|
|
inp = sotoinpcb(so);
|
|
|
|
KASSERT(inp != NULL, ("tcp6_usr_connect: inp == NULL"));
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK(inp);
|
2019-10-24 20:05:10 +00:00
|
|
|
vflagsav = inp->inp_vflag;
|
|
|
|
incflagsav = inp->inp_inc.inc_flags;
|
2015-03-09 20:29:16 +00:00
|
|
|
if (inp->inp_flags & INP_TIMEWAIT) {
|
|
|
|
error = EADDRINUSE;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
if (inp->inp_flags & INP_DROPPED) {
|
|
|
|
error = ECONNREFUSED;
|
Update TCP for infrastructural changes to the socket/pcb refcount model,
pru_abort(), pru_detach(), and in_pcbdetach():
- Universally support and enforce the invariant that so_pcb is
never NULL, converting dozens of unnecessary NULL checks into
assertions, and eliminating dozens of unnecessary error handling
cases in protocol code.
- In some cases, eliminate unnecessary pcbinfo locking, as it is no
longer required to ensure so_pcb != NULL. For example, the receive
code no longer requires the pcbinfo lock, and the send code only
requires it if building a new connection on an otherwise unconnected
socket triggered via sendto() with an address. This should
significnatly reduce tcbinfo lock contention in the receive and send
cases.
- In order to support the invariant that so_pcb != NULL, it is now
necessary for the TCP code to not discard the tcpcb any time a
connection is dropped, but instead leave the tcpcb until the socket
is shutdown. This case is handled by setting INP_DROPPED, to
substitute for using a NULL so_pcb to indicate that the connection
has been dropped. This requires the inpcb lock, but not the pcbinfo
lock.
- Unlike all other protocols in the tree, TCP may need to retain access
to the socket after the file descriptor has been closed. Set
SS_PROTOREF in tcp_detach() in order to prevent the socket from being
freed, and add a flag, INP_SOCKREF, so that the TCP code knows whether
or not it needs to free the socket when the connection finally does
close. The typical case where this occurs is if close() is called on
a TCP socket before all sent data in the send socket buffer has been
transmitted or acknowledged. If INP_SOCKREF is found when the
connection is dropped, we release the inpcb, tcpcb, and socket instead
of flagging INP_DROPPED.
- Abort and detach protocol switch methods no longer return failures,
nor attempt to free sockets, as the socket layer does this.
- Annotate the existence of a long-standing race in the TCP timer code,
in which timers are stopped but not drained when the socket is freed,
as waiting for drain may lead to deadlocks, or have to occur in a
context where waiting is not permitted. This race has been handled
by testing to see if the tcpcb pointer in the inpcb is NULL (and vice
versa), which is not normally permitted, but may be true if an inpcb
and tcpcb have been freed. Add a counter to test how often this race
has actually occurred, and a large comment for each instance where
we compare potentially freed memory with NULL. This will have to be
fixed in the near future, but requires us to further address how to
handle the timer shutdown issue.
- Several TCP calls no longer potentially free the passed inpcb/tcpcb,
so no longer need to return a pointer to indicate whether the argument
passed in is still valid.
- Un-macroize debugging and locking setup for various protocol switch
methods for TCP, as it led to more obscurity, and as locking becomes
more customized to the methods, offers less benefit.
- Assert copyright on tcp_usrreq.c due to significant modifications that
have been made as part of this work.
These changes significantly modify the memory management and connection
logic of our TCP implementation, and are (as such) High Risk Changes,
and likely to contain serious bugs. Please report problems to the
current@ mailing list ASAP, ideally with simple test cases, and
optionally, packet traces.
MFC after: 3 months
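
The INP_DROPPED / INP_SOCKREF rule described in this commit message can be modelled in a few lines of userland C. This is only a sketch under stated assumptions: the `model_` names, the flag values, and the `freed` field are hypothetical stand-ins and are not the kernel's definitions. It shows the decision made when a connection is dropped: if TCP holds the socket reference, everything is released; otherwise the pcb is only marked as dropped and left for the socket layer.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag values only; the real kernel definitions differ. */
#define MODEL_INP_DROPPED 0x1	/* connection dropped, pcb kept for the socket */
#define MODEL_INP_SOCKREF 0x2	/* TCP holds the remaining socket reference */

/* Hypothetical stand-in for an inpcb; not the kernel structure. */
struct model_inpcb {
	int	flags;
	bool	freed;	/* stands in for releasing inpcb, tcpcb and socket */
};

/*
 * Model of dropping a connection.  Returns true if the pcb state was
 * released, false if it was only marked dropped and left in place.
 */
static bool
model_drop(struct model_inpcb *inp)
{
	assert(!inp->freed);
	if (inp->flags & MODEL_INP_SOCKREF) {
		/* close() already happened: TCP must free everything. */
		inp->flags &= ~MODEL_INP_SOCKREF;
		inp->freed = true;
		return (true);
	}
	/* The socket still owns the pcb: only record that it was dropped. */
	inp->flags |= MODEL_INP_DROPPED;
	return (false);
}
```

In this model a later detach of the socket would be responsible for freeing a pcb that carries MODEL_INP_DROPPED, which is why so_pcb never has to become NULL.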
2006-04-01 16:36:36 +00:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
tp = intotcpcb(inp);
|
|
|
|
TCPDEBUG1();
|
2011-04-30 11:21:29 +00:00
|
|
|
#ifdef INET
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags,
supplementing the existing INPLOOKUP_WILDCARD flag, are:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
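
The "exactly one of these flags" rule for the lookup routine can be illustrated with a small flag check. This is a sketch, not the kernel's code: the `EX_` constants and both helper functions are hypothetical, and the numeric values are chosen for illustration only.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative values; assumed for this sketch, not taken from the kernel. */
#define EX_INPLOOKUP_WILDCARD	0x1
#define EX_INPLOOKUP_RLOCKPCB	0x2
#define EX_INPLOOKUP_WLOCKPCB	0x4
#define EX_INPLOOKUP_LOCKMASK	(EX_INPLOOKUP_RLOCKPCB | EX_INPLOOKUP_WLOCKPCB)

/* True if the flags request exactly one lock mode and no unknown bits. */
static bool
ex_lookup_flags_valid(int flags)
{
	int lock = flags & EX_INPLOOKUP_LOCKMASK;

	if ((flags & ~(EX_INPLOOKUP_WILDCARD | EX_INPLOOKUP_LOCKMASK)) != 0)
		return (false);
	/* Exactly one bit set: non-zero and a power of two. */
	return (lock != 0 && (lock & (lock - 1)) == 0);
}

/* Returns the requested lock mode; callers must pass valid flags. */
static int
ex_lookup_lock_mode(int flags)
{
	assert(ex_lookup_flags_valid(flags));
	return (flags & EX_INPLOOKUP_LOCKMASK);
}
```

A caller would combine the optional wildcard flag with one lock flag, for example `EX_INPLOOKUP_WILDCARD | EX_INPLOOKUP_RLOCKPCB`; passing both lock flags, or neither, is rejected.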
2011-05-30 09:43:55 +00:00
|
|
|
/*
|
|
|
|
* XXXRW: Some confusion: V4/V6 flags relate to binding, and
|
|
|
|
* therefore probably require the hash lock, which isn't held here.
|
|
|
|
* Is this a significant problem?
|
|
|
|
*/
|
2019-08-02 07:41:36 +00:00
|
|
|
if (IN6_IS_ADDR_V4MAPPED(&sin6->sin6_addr)) {
|
2000-01-09 19:17:30 +00:00
|
|
|
struct sockaddr_in sin;
|
|
|
|
|
2002-07-29 09:01:39 +00:00
|
|
|
if ((inp->inp_flags & IN6P_IPV6_V6ONLY) != 0) {
|
|
|
|
error = EINVAL;
|
|
|
|
goto out;
|
|
|
|
}
|
2017-05-22 15:29:10 +00:00
|
|
|
if ((inp->inp_vflag & INP_IPV4) == 0) {
|
|
|
|
error = EAFNOSUPPORT;
|
|
|
|
goto out;
|
|
|
|
}
|
2001-06-11 12:39:29 +00:00
|
|
|
|
2019-08-02 07:41:36 +00:00
|
|
|
in6_sin6_2_sin(&sin, sin6);
|
2018-07-30 21:27:26 +00:00
|
|
|
if (IN_MULTICAST(ntohl(sin.sin_addr.s_addr))) {
|
|
|
|
error = EAFNOSUPPORT;
|
|
|
|
goto out;
|
|
|
|
}
|
2020-07-16 16:46:24 +00:00
|
|
|
if (ntohl(sin.sin_addr.s_addr) == INADDR_BROADCAST) {
|
|
|
|
error = EACCES;
|
2020-06-03 14:16:40 +00:00
|
|
|
goto out;
|
|
|
|
}
|
2009-02-05 14:06:09 +00:00
|
|
|
if ((error = prison_remote_ip4(td->td_ucred,
|
|
|
|
&sin.sin_addr)) != 0)
|
MFp4:
Bring in updated jail support from bz_jail branch.
This enhances the current jail implementation to permit multiple
addresses per jail. In addition to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking, etc.
SCTP support was updated and supports IPv6 in jails as well.
Cpuset support permits jails to be bound to specific processor
sets after creation.
Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.
DDB 'show jails' command was added to aid debugging.
Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.
Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.
Bump __FreeBSD_version for the aforementioned and in-kernel changes.
Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
on cluster machines as well as all the testers and people
who provided feedback the last months on freebsd-jail and
other channels.
- My employer, CK Software GmbH, for the support so I could work on this.
Reviewed by: (see above)
MFC after: 3 months (this is just so that I get the mail)
X-MFC Before: 7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
|
|
|
goto out;
|
2019-10-24 20:05:10 +00:00
|
|
|
inp->inp_vflag |= INP_IPV4;
|
|
|
|
inp->inp_vflag &= ~INP_IPV6;
|
2020-01-22 06:10:41 +00:00
|
|
|
NET_EPOCH_ENTER(et);
|
2001-09-12 08:38:13 +00:00
|
|
|
if ((error = tcp_connect(tp, (struct sockaddr *)&sin, td)) != 0)
|
2020-01-22 06:10:41 +00:00
|
|
|
goto out_in_epoch;
|
2012-06-19 07:34:13 +00:00
|
|
|
#ifdef TCP_OFFLOAD
|
|
|
|
if (registered_toedevs > 0 &&
|
2013-01-26 01:41:42 +00:00
|
|
|
(so->so_options & SO_NO_OFFLOAD) == 0 &&
|
2012-06-19 07:34:13 +00:00
|
|
|
(error = tcp_offload_connect(so, nam)) == 0)
|
2020-01-22 06:10:41 +00:00
|
|
|
goto out_in_epoch;
|
2012-06-19 07:34:13 +00:00
|
|
|
#endif
|
2015-12-16 00:56:45 +00:00
|
|
|
error = tp->t_fb->tfb_tcp_output(tp);
|
2020-01-22 06:10:41 +00:00
|
|
|
goto out_in_epoch;
|
2017-05-22 15:29:10 +00:00
|
|
|
} else {
|
|
|
|
if ((inp->inp_vflag & INP_IPV6) == 0) {
|
|
|
|
error = EAFNOSUPPORT;
|
|
|
|
goto out;
|
|
|
|
}
|
2000-01-09 19:17:30 +00:00
|
|
|
}
|
2011-04-30 11:21:29 +00:00
|
|
|
#endif
|
2019-10-24 20:05:10 +00:00
|
|
|
if ((error = prison_remote_ip6(td->td_ucred, &sin6->sin6_addr)) != 0)
|
|
|
|
goto out;
|
2000-01-09 19:17:30 +00:00
|
|
|
inp->inp_vflag &= ~INP_IPV4;
|
|
|
|
inp->inp_vflag |= INP_IPV6;
|
2008-12-17 12:52:34 +00:00
|
|
|
inp->inp_inc.inc_flags |= INC_ISIPV6;
|
2001-09-12 08:38:13 +00:00
|
|
|
if ((error = tcp6_connect(tp, nam, td)) != 0)
|
2000-01-09 19:17:30 +00:00
|
|
|
goto out;
|
2012-06-19 07:34:13 +00:00
|
|
|
#ifdef TCP_OFFLOAD
|
|
|
|
if (registered_toedevs > 0 &&
|
2013-01-26 01:41:42 +00:00
|
|
|
(so->so_options & SO_NO_OFFLOAD) == 0 &&
|
2012-06-19 07:34:13 +00:00
|
|
|
(error = tcp_offload_connect(so, nam)) == 0)
|
|
|
|
goto out;
|
|
|
|
#endif
|
|
|
|
tcp_timer_activate(tp, TT_KEEP, TP_KEEPINIT(tp));
|
2020-01-22 05:53:16 +00:00
|
|
|
NET_EPOCH_ENTER(et);
|
2015-12-16 00:56:45 +00:00
|
|
|
error = tp->t_fb->tfb_tcp_output(tp);
|
2020-01-22 15:06:59 +00:00
|
|
|
#ifdef INET
|
2020-01-22 06:10:41 +00:00
|
|
|
out_in_epoch:
|
2020-01-22 15:06:59 +00:00
|
|
|
#endif
|
2020-01-22 05:53:16 +00:00
|
|
|
NET_EPOCH_EXIT(et);
|
2006-04-01 16:36:36 +00:00
|
|
|
out:
|
2019-10-24 20:05:10 +00:00
|
|
|
/*
|
|
|
|
* If the implicit bind in the connect call fails, restore
|
|
|
|
* the flags we modified.
|
|
|
|
*/
|
|
|
|
if (error != 0 && inp->inp_lport == 0) {
|
|
|
|
inp->inp_vflag = vflagsav;
|
|
|
|
inp->inp_inc.inc_flags = incflagsav;
|
|
|
|
}
|
|
|
|
|
2006-04-01 16:36:36 +00:00
|
|
|
TCPDEBUG2(PRU_CONNECT);
|
2015-09-13 15:50:55 +00:00
|
|
|
TCP_PROBE2(debug__user, tp, PRU_CONNECT);
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2006-04-01 16:36:36 +00:00
|
|
|
return (error);
|
2000-01-09 19:17:30 +00:00
|
|
|
}
|
|
|
|
#endif /* INET6 */
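
The IPv4-mapped branch of the connect path above can be approximated in userland C. The sketch below is an illustration under assumptions, not the kernel implementation: `ex_mapped_to_sin()` and `ex_check_v4_dest()` are hypothetical helpers, and the first only loosely mirrors what `in6_sin6_2_sin()` is used for here. It shows the two steps the kernel code performs on a mapped destination: derive a `sockaddr_in` from the `sockaddr_in6`, then reject multicast and broadcast targets with the same error codes seen above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/*
 * Hypothetical helper: if sin6 holds an IPv4-mapped address (::ffff:a.b.c.d),
 * fill in sin with the embedded IPv4 address and the same port.
 */
static bool
ex_mapped_to_sin(const struct sockaddr_in6 *sin6, struct sockaddr_in *sin)
{
	assert(sin6 != NULL && sin != NULL);
	if (!IN6_IS_ADDR_V4MAPPED(&sin6->sin6_addr))
		return (false);
	memset(sin, 0, sizeof(*sin));
	sin->sin_family = AF_INET;
	sin->sin_port = sin6->sin6_port;
	/* The IPv4 address occupies the last four bytes, in network order. */
	memcpy(&sin->sin_addr, &sin6->sin6_addr.s6_addr[12],
	    sizeof(sin->sin_addr));
	return (true);
}

/* Hypothetical helper: destination checks modelled on the code above. */
static int
ex_check_v4_dest(const struct sockaddr_in *sin)
{
	in_addr_t host = ntohl(sin->sin_addr.s_addr);

	if (IN_MULTICAST(host))
		return (EAFNOSUPPORT);
	if (host == INADDR_BROADCAST)
		return (EACCES);
	return (0);
}
```

The kernel additionally checks the socket's IPV6_V6ONLY setting and jail restrictions, and saves the pcb flags so they can be restored if the implicit bind fails; those steps are omitted from this sketch.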
|
|
|
|
|
1996-07-11 16:32:50 +00:00
|
|
|
/*
|
|
|
|
* Initiate disconnect from peer.
|
|
|
|
* If connection never passed embryonic stage, just drop;
|
|
|
|
* else if don't need to let data drain, then can just drop anyways,
|
|
|
|
* else have to begin TCP shutdown process: mark socket disconnecting,
|
|
|
|
* drain unread data, state switch to reflect user close, and
|
|
|
|
* send segment (e.g. FIN) to peer. Socket will be really disconnected
|
|
|
|
* when peer sends FIN and acks ours.
|
|
|
|
*
|
|
|
|
* SHOULD IMPLEMENT LATER PRU_CONNECT VIA REALLOC TCPCB.
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
tcp_usr_disconnect(struct socket *so)
|
|
|
|
{
|
2002-06-10 20:05:46 +00:00
|
|
|
struct inpcb *inp;
|
2006-04-01 16:36:36 +00:00
|
|
|
struct tcpcb *tp = NULL;
|
2018-07-04 02:47:16 +00:00
|
|
|
struct epoch_tracker et;
|
2006-04-01 16:36:36 +00:00
|
|
|
int error = 0;
|
1996-07-11 16:32:50 +00:00
|
|
|
|
Update TCP for infrastructural changes to the socket/pcb refcount model,
pru_abort(), pru_detach(), and in_pcbdetach():
- Universally support and enforce the invariant that so_pcb is
never NULL, converting dozens of unnecessary NULL checks into
assertions, and eliminating dozens of unnecessary error handling
cases in protocol code.
- In some cases, eliminate unnecessary pcbinfo locking, as it is no
longer required to ensure so_pcb != NULL. For example, the receive
code no longer requires the pcbinfo lock, and the send code only
requires it if building a new connection on an otherwise unconnected
socket triggered via sendto() with an address. This should
significnatly reduce tcbinfo lock contention in the receive and send
cases.
- In order to support the invariant that so_pcb != NULL, it is now
necessary for the TCP code to not discard the tcpcb any time a
connection is dropped, but instead leave the tcpcb until the socket
is shutdown. This case is handled by setting INP_DROPPED, to
substitute for using a NULL so_pcb to indicate that the connection
has been dropped. This requires the inpcb lock, but not the pcbinfo
lock.
- Unlike all other protocols in the tree, TCP may need to retain access
to the socket after the file descriptor has been closed. Set
SS_PROTOREF in tcp_detach() in order to prevent the socket from being
freed, and add a flag, INP_SOCKREF, so that the TCP code knows whether
or not it needs to free the socket when the connection finally does
close. The typical case where this occurs is if close() is called on
a TCP socket before all sent data in the send socket buffer has been
transmitted or acknowledged. If INP_SOCKREF is found when the
connection is dropped, we release the inpcb, tcpcb, and socket instead
of flagging INP_DROPPED.
- Abort and detach protocol switch methods no longer return failures,
nor attempt to free sockets, as the socket layer does this.
- Annotate the existence of a long-standing race in the TCP timer code,
in which timers are stopped but not drained when the socket is freed,
as waiting for drain may lead to deadlocks, or have to occur in a
context where waiting is not permitted. This race has been handled
by testing to see if the tcpcb pointer in the inpcb is NULL (and vice
versa), which is not normally permitted, but may be true of a inpcb
and tcpcb have been freed. Add a counter to test how often this race
has actually occurred, and a large comment for each instance where
we compare potentially freed memory with NULL. This will have to be
fixed in the near future, but requires is to further address how to
handle the timer shutdown shutdown issue.
- Several TCP calls no longer potentially free the passed inpcb/tcpcb,
so no longer need to return a pointer to indicate whether the argument
passed in is still valid.
- Un-macroize debugging and locking setup for various protocol switch
methods for TCP, as it lead to more obscurity, and as locking becomes
more customized to the methods, offers less benefit.
- Assert copyright on tcp_usrreq.c due to significant modifications that
have been made as part of this work.
These changes significantly modify the memory management and connection
logic of our TCP implementation, and are (as such) High Risk Changes,
and likely to contain serious bugs. Please report problems to the
current@ mailing list ASAP, ideally with simple test cases, and
optionally, packet traces.
MFC after: 3 months
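The drop-versus-free decision described in the INP_DROPPED and INP_SOCKREF bullets can be modelled in user space. This is only a sketch under stated assumptions: `model_conn`, `MODEL_INP_DROPPED`, `MODEL_INP_SOCKREF`, and the three functions are hypothetical stand-ins, the flag values are invented, and the inpcb locking the real code requires is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical flag values for this model; the kernel's INP_* bits differ. */
#define MODEL_INP_DROPPED	0x01	/* connection gone, structures kept */
#define MODEL_INP_SOCKREF	0x02	/* protocol owns the last socket reference */

struct model_conn {
	int	flags;
	void	*tcpcb;		/* stand-in for the protocol control block */
};

/* Allocate a connection with its control block; returns NULL on failure. */
struct model_conn *
model_conn_new(void)
{
	struct model_conn *c = calloc(1, sizeof(*c));

	if (c == NULL)
		return (NULL);
	c->tcpcb = malloc(16);
	if (c->tcpcb == NULL) {
		free(c);
		return (NULL);
	}
	return (c);
}

/* close() arrived while data is still unsent: the protocol keeps the socket. */
void
model_conn_close_pending(struct model_conn *c)
{
	assert(c != NULL);
	c->flags |= MODEL_INP_SOCKREF;
}

/*
 * The connection is dropped.  Returns true if everything was released
 * (the caller must not touch c again), false if c was only marked dropped
 * and stays valid until the socket layer tears it down.
 */
bool
model_conn_drop(struct model_conn *c)
{
	assert(c != NULL);
	if (c->flags & MODEL_INP_SOCKREF) {
		free(c->tcpcb);
		free(c);
		return (true);
	}
	c->flags |= MODEL_INP_DROPPED;
	return (false);
}
```

In this model a caller that still holds the descriptor sees the dropped flag rather than a NULL pointer, which is the property the so_pcb invariant relies on.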
2006-04-01 16:36:36 +00:00
|
|
|
TCPDEBUG0;
|
2019-11-07 00:10:14 +00:00
|
|
|
NET_EPOCH_ENTER(et);
|
2006-04-01 16:36:36 +00:00
|
|
|
inp = sotoinpcb(so);
|
|
|
|
KASSERT(inp != NULL, ("tcp_usr_disconnect: inp == NULL"));
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK(inp);
|
2014-10-12 23:01:25 +00:00
|
|
|
if (inp->inp_flags & INP_TIMEWAIT)
|
|
|
|
goto out;
|
|
|
|
if (inp->inp_flags & INP_DROPPED) {
|
2006-11-22 17:16:54 +00:00
|
|
|
error = ECONNRESET;
|
2006-04-01 16:36:36 +00:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
tp = intotcpcb(inp);
|
|
|
|
TCPDEBUG1();
|
|
|
|
tcp_disconnect(tp);
|
|
|
|
out:
|
|
|
|
TCPDEBUG2(PRU_DISCONNECT);
|
2015-09-13 15:50:55 +00:00
|
|
|
TCP_PROBE2(debug__user, tp, PRU_DISCONNECT);
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2019-11-07 00:10:14 +00:00
|
|
|
NET_EPOCH_EXIT(et);
|
2006-04-01 16:36:36 +00:00
|
|
|
return (error);
|
1996-07-11 16:32:50 +00:00
|
|
|
}
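The control flow of the disconnect request above (take the lock, test the state flags, fall through to a single `out:` label that unlocks) can be sketched in isolation. The names below (`model_pcb`, `MODEL_TIMEWAIT`, `MODEL_DROPPED`, `model_usr_disconnect`) are hypothetical, and the boolean "lock" is a mock used only to check that every path releases it exactly once; it is not how INP_WLOCK is implemented.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define MODEL_TIMEWAIT	0x01	/* hypothetical flag bits */
#define MODEL_DROPPED	0x02

struct model_pcb {
	int	flags;
	bool	locked;		/* mock lock standing in for INP_WLOCK */
	int	disconnects;	/* counts calls that reached the real work */
};

static void
model_lock(struct model_pcb *p)
{
	assert(!p->locked);
	p->locked = true;
}

static void
model_unlock(struct model_pcb *p)
{
	assert(p->locked);
	p->locked = false;
}

int
model_usr_disconnect(struct model_pcb *p)
{
	int error = 0;

	model_lock(p);
	if (p->flags & MODEL_TIMEWAIT)
		goto out;		/* nothing to do, not an error */
	if (p->flags & MODEL_DROPPED) {
		error = ECONNRESET;
		goto out;
	}
	p->disconnects++;		/* stand-in for tcp_disconnect(tp) */
out:
	model_unlock(p);		/* every path releases the lock once */
	return (error);
}
```

Routing every early exit through one label keeps the unlock in a single place, so adding another state check cannot leak the lock.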
|
|
|
|
|
2011-04-30 11:21:29 +00:00
|
|
|
#ifdef INET
|
1996-07-11 16:32:50 +00:00
|
|
|
/*
|
2010-03-06 21:38:31 +00:00
|
|
|
* Accept a connection. Essentially all the work is done at higher levels;
|
|
|
|
* just return the address of the peer, storing through addr.
|
1996-07-11 16:32:50 +00:00
|
|
|
*/
|
|
|
|
static int
|
1997-08-16 19:16:27 +00:00
|
|
|
tcp_usr_accept(struct socket *so, struct sockaddr **nam)
|
1996-07-11 16:32:50 +00:00
|
|
|
{
|
|
|
|
int error = 0;
|
2002-06-10 20:05:46 +00:00
|
|
|
struct inpcb *inp = NULL;
|
2001-03-12 02:57:42 +00:00
|
|
|
struct tcpcb *tp = NULL;
|
2002-08-21 11:57:12 +00:00
|
|
|
struct in_addr addr;
|
|
|
|
in_port_t port = 0;
|
2001-03-12 02:57:42 +00:00
|
|
|
TCPDEBUG0;
|
1996-07-11 16:32:50 +00:00
|
|
|
|
2006-04-03 09:52:55 +00:00
|
|
|
if (so->so_state & SS_ISDISCONNECTED)
|
|
|
|
return (ECONNABORTED);
|
2002-06-10 20:05:46 +00:00
|
|
|
|
|
|
|
inp = sotoinpcb(so);
|
2006-04-01 16:36:36 +00:00
|
|
|
KASSERT(inp != NULL, ("tcp_usr_accept: inp == NULL"));
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK(inp);
|
2009-03-15 09:58:31 +00:00
|
|
|
if (inp->inp_flags & (INP_TIMEWAIT | INP_DROPPED)) {
|
2006-04-03 09:52:55 +00:00
|
|
|
error = ECONNABORTED;
|
2006-04-01 16:36:36 +00:00
|
|
|
goto out;
|
|
|
|
}
|
2001-03-12 02:57:42 +00:00
|
|
|
tp = intotcpcb(inp);
|
|
|
|
TCPDEBUG1();
|
2002-06-10 20:05:46 +00:00
|
|
|
|
2004-08-16 18:32:07 +00:00
|
|
|
/*
|
2007-05-11 10:20:51 +00:00
|
|
|
* We inline in_getpeeraddr and COMMON_END here, so that we can
|
2002-08-21 11:57:12 +00:00
|
|
|
* copy the data of interest and defer the malloc until after we
|
|
|
|
* release the lock.
|
2002-06-10 20:05:46 +00:00
|
|
|
*/
|
2002-08-21 11:57:12 +00:00
|
|
|
port = inp->inp_fport;
|
|
|
|
addr = inp->inp_faddr;
|
2002-06-10 20:05:46 +00:00
|
|
|
|
2006-04-01 16:36:36 +00:00
|
|
|
out:
|
|
|
|
TCPDEBUG2(PRU_ACCEPT);
|
2015-09-13 15:50:55 +00:00
|
|
|
TCP_PROBE2(debug__user, tp, PRU_ACCEPT);
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2002-08-21 11:57:12 +00:00
|
|
|
if (error == 0)
|
|
|
|
*nam = in_sockaddr(port, &addr);
|
|
|
|
return (error);
|
1996-07-11 16:32:50 +00:00
|
|
|
}
|
2011-04-30 11:21:29 +00:00
|
|
|
#endif /* INET */
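The comment in tcp_usr_accept() describes copying the peer address and port while the lock is held and deferring the allocation until after it is released. A minimal user-space sketch of that ordering follows. The structure `model_peer_pcb`, the helper `model_sockaddr`, and the boolean mock lock are assumptions made for illustration; the kernel's in_sockaddr() and INP_WLOCK are not reproduced here.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stdlib.h>
#include <sys/socket.h>

struct model_peer_pcb {
	struct in_addr	faddr;	/* foreign address, network byte order */
	in_port_t	fport;	/* foreign port, network byte order */
	bool		locked;	/* mock lock, not a real mutex */
};

/* Hypothetical helper: build a heap sockaddr_in from already-copied values. */
static struct sockaddr_in *
model_sockaddr(in_port_t port, const struct in_addr *addr)
{
	struct sockaddr_in *sin = calloc(1, sizeof(*sin));

	if (sin == NULL)
		return (NULL);
	sin->sin_family = AF_INET;
	sin->sin_port = port;
	sin->sin_addr = *addr;
	return (sin);
}

int
model_usr_accept(struct model_peer_pcb *p, struct sockaddr_in **nam)
{
	struct in_addr addr;
	in_port_t port;

	assert(p != NULL && nam != NULL);

	/* Under the lock: only cheap copies into locals. */
	p->locked = true;
	port = p->fport;
	addr = p->faddr;
	p->locked = false;

	/* After the lock is dropped: the allocation, which may be slow. */
	*nam = model_sockaddr(port, &addr);
	return (*nam == NULL ? ENOMEM : 0);
}
```

Because the locals hold a snapshot, the allocation never runs with the lock held, at the cost of returning values that may be stale by the time the caller reads them.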
|
1996-07-11 16:32:50 +00:00
|
|
|
|
2000-01-09 19:17:30 +00:00
|
|
|
#ifdef INET6
|
|
|
|
static int
|
|
|
|
tcp6_usr_accept(struct socket *so, struct sockaddr **nam)
|
|
|
|
{
|
2002-06-10 20:05:46 +00:00
|
|
|
struct inpcb *inp = NULL;
|
2000-01-09 19:17:30 +00:00
|
|
|
int error = 0;
|
2001-03-12 02:57:42 +00:00
|
|
|
struct tcpcb *tp = NULL;
|
2002-08-21 11:57:12 +00:00
|
|
|
struct in_addr addr;
|
|
|
|
struct in6_addr addr6;
|
2018-07-04 02:47:16 +00:00
|
|
|
struct epoch_tracker et;
|
2002-08-21 11:57:12 +00:00
|
|
|
in_port_t port = 0;
|
|
|
|
int v4 = 0;
|
2001-03-12 02:57:42 +00:00
|
|
|
TCPDEBUG0;
|
2000-01-09 19:17:30 +00:00
|
|
|
|
2006-06-26 09:38:08 +00:00
|
|
|
if (so->so_state & SS_ISDISCONNECTED)
|
|
|
|
return (ECONNABORTED);
|
2002-06-10 20:05:46 +00:00
|
|
|
|
|
|
|
inp = sotoinpcb(so);
|
2006-04-01 16:36:36 +00:00
|
|
|
KASSERT(inp != NULL, ("tcp6_usr_accept: inp == NULL"));
|
2019-11-07 00:10:14 +00:00
|
|
|
NET_EPOCH_ENTER(et);
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK(inp);
|
2009-03-15 09:58:31 +00:00
|
|
|
if (inp->inp_flags & (INP_TIMEWAIT | INP_DROPPED)) {
|
2006-11-22 17:16:54 +00:00
|
|
|
error = ECONNABORTED;
|
		goto out;
	}
	tp = intotcpcb(inp);
	TCPDEBUG1();

	/*
	 * We inline in6_mapped_peeraddr and COMMON_END here, so that we can
	 * copy the data of interest and defer the malloc until after we
	 * release the lock.
	 */
	if (inp->inp_vflag & INP_IPV4) {
		v4 = 1;
		port = inp->inp_fport;
		addr = inp->inp_faddr;
	} else {
		port = inp->inp_fport;
		addr6 = inp->in6p_faddr;
	}

out:
	TCPDEBUG2(PRU_ACCEPT);
	TCP_PROBE2(debug__user, tp, PRU_ACCEPT);
	INP_WUNLOCK(inp);
	NET_EPOCH_EXIT(et);
	if (error == 0) {
		if (v4)
			*nam = in6_v4mapsin6_sockaddr(port, &addr);
		else
			*nam = in6_sockaddr(port, &addr6);
	}
	return error;
}
#endif /* INET6 */

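The comment in the accept path above describes a general pattern: snapshot the fields of interest while the lock is held, and perform the allocation only after the lock is released. A minimal userland sketch of that pattern follows. It is not the kernel code: `struct peer_state`, `struct peer_name`, and `peer_name_get()` are hypothetical names introduced for illustration, and a pthread mutex stands in for the inpcb lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for lock-protected per-connection state. */
struct peer_state {
	pthread_mutex_t lock;
	unsigned short port;
	char addr[16];
};

/* Hypothetical result object, allocated only after the lock is dropped. */
struct peer_name {
	unsigned short port;
	char addr[16];
};

/*
 * Copy the fields of interest into locals while holding the lock, then
 * allocate and fill the result after unlocking, so that malloc() never
 * runs with the lock held.  Returns NULL if the allocation fails.
 */
static struct peer_name *
peer_name_get(struct peer_state *ps)
{
	struct peer_name *pn;
	unsigned short port;
	char addr[16];

	pthread_mutex_lock(&ps->lock);
	port = ps->port;
	memcpy(addr, ps->addr, sizeof(addr));
	pthread_mutex_unlock(&ps->lock);

	pn = malloc(sizeof(*pn));
	if (pn == NULL)
		return (NULL);
	pn->port = port;
	memcpy(pn->addr, addr, sizeof(pn->addr));
	return (pn);
}
```

The caller owns the returned object and releases it with `free()`. The cost of the pattern is one extra copy of the snapshotted fields; the benefit is that the lock is held only for the copy itself.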
/*
 * Mark the connection as being incapable of further output.
 */
static int
tcp_usr_shutdown(struct socket *so)
{
	int error = 0;
	struct inpcb *inp;
	struct tcpcb *tp = NULL;
	struct epoch_tracker et;

	TCPDEBUG0;
	NET_EPOCH_ENTER(et);
	inp = sotoinpcb(so);
	KASSERT(inp != NULL, ("inp == NULL"));
	INP_WLOCK(inp);
	if (inp->inp_flags & (INP_TIMEWAIT | INP_DROPPED)) {
		error = ECONNRESET;
		goto out;
	}
	tp = intotcpcb(inp);
	TCPDEBUG1();
	socantsendmore(so);
	tcp_usrclosed(tp);
	if (!(inp->inp_flags & INP_DROPPED))
		error = tp->t_fb->tfb_tcp_output(tp);

out:
	TCPDEBUG2(PRU_SHUTDOWN);
	TCP_PROBE2(debug__user, tp, PRU_SHUTDOWN);
	INP_WUNLOCK(inp);
	NET_EPOCH_EXIT(et);
2006-04-01 16:36:36 +00:00
|
|
|
|
|
|
|
return (error);
|
1996-07-11 16:32:50 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
 * After a receive, possibly send window update to peer.
 */
static int
tcp_usr_rcvd(struct socket *so, int flags)
{
	struct epoch_tracker et;
	struct inpcb *inp;
	struct tcpcb *tp = NULL;
	int error = 0;

	TCPDEBUG0;
	inp = sotoinpcb(so);
	KASSERT(inp != NULL, ("tcp_usr_rcvd: inp == NULL"));
	INP_WLOCK(inp);
	if (inp->inp_flags & (INP_TIMEWAIT | INP_DROPPED)) {
		error = ECONNRESET;
		goto out;
	}
	tp = intotcpcb(inp);
	TCPDEBUG1();
	/*
	 * For passively-created TFO connections, don't attempt a window
	 * update while still in SYN_RECEIVED as this may trigger an early
	 * SYN|ACK. It is preferable to have the SYN|ACK be sent along with
	 * application response data, or failing that, when the DELACK timer
	 * expires.
	 */
	if (IS_FASTOPEN(tp->t_flags) &&
	    (tp->t_state == TCPS_SYN_RECEIVED))
		goto out;
	NET_EPOCH_ENTER(et);
#ifdef TCP_OFFLOAD
	if (tp->t_flags & TF_TOE)
		tcp_offload_rcvd(tp);
	else
#endif
	tp->t_fb->tfb_tcp_output(tp);
	NET_EPOCH_EXIT(et);
out:
	TCPDEBUG2(PRU_RCVD);
	TCP_PROBE2(debug__user, tp, PRU_RCVD);
	INP_WUNLOCK(inp);
	return (error);
}

/*
 * Do a send by putting data in output queue and updating urgent
 * marker if URG set. Possibly send more data. Unlike the other
 * pru_*() routines, the mbuf chains are our responsibility. We
 * must either enqueue them or free them. The other pru_* routines
 * generally are caller-frees.
 */
static int
tcp_usr_send(struct socket *so, int flags, struct mbuf *m,
    struct sockaddr *nam, struct mbuf *control, struct thread *td)
{
	struct epoch_tracker et;
	int error = 0;
	struct inpcb *inp;
	struct tcpcb *tp = NULL;
#ifdef INET
#ifdef INET6
	struct sockaddr_in sin;
#endif
	struct sockaddr_in *sinp;
#endif
#ifdef INET6
	int isipv6;
#endif
	u_int8_t incflagsav;
	u_char vflagsav;
	bool restoreflags;
	TCPDEBUG0;

	/*
	 * We require the pcbinfo "read lock" if we will close the socket
	 * as part of this call.
	 */
	NET_EPOCH_ENTER(et);
	inp = sotoinpcb(so);
	KASSERT(inp != NULL, ("tcp_usr_send: inp == NULL"));
	INP_WLOCK(inp);
	vflagsav = inp->inp_vflag;
	incflagsav = inp->inp_inc.inc_flags;
	restoreflags = false;
	if (inp->inp_flags & (INP_TIMEWAIT | INP_DROPPED)) {
		if (control)
			m_freem(control);
		error = ECONNRESET;
		goto out;
	}
	if (control != NULL) {
		/* TCP doesn't do control messages (rights, creds, etc) */
		if (control->m_len) {
			m_freem(control);
			error = EINVAL;
			goto out;
		}
		m_freem(control);	/* empty control, just free it */
		control = NULL;
	}
	tp = intotcpcb(inp);
	if ((flags & PRUS_OOB) != 0 &&
	    (error = tcp_pru_options_support(tp, PRUS_OOB)) != 0)
		goto out;

	TCPDEBUG1();
	if (nam != NULL && tp->t_state < TCPS_SYN_SENT) {
		switch (nam->sa_family) {
#ifdef INET
		case AF_INET:
			sinp = (struct sockaddr_in *)nam;
			if (sinp->sin_len != sizeof(struct sockaddr_in)) {
				error = EINVAL;
				goto out;
			}
			if ((inp->inp_vflag & INP_IPV6) != 0) {
				error = EAFNOSUPPORT;
				goto out;
			}
			if (IN_MULTICAST(ntohl(sinp->sin_addr.s_addr))) {
				error = EAFNOSUPPORT;
				goto out;
			}
			if (ntohl(sinp->sin_addr.s_addr) == INADDR_BROADCAST) {
				error = EACCES;
				goto out;
			}
			if ((error = prison_remote_ip4(td->td_ucred,
			    &sinp->sin_addr)))
				goto out;
#ifdef INET6
			isipv6 = 0;
#endif
			break;
#endif /* INET */
#ifdef INET6
		case AF_INET6:
		{
			struct sockaddr_in6 *sin6;

			sin6 = (struct sockaddr_in6 *)nam;
			if (sin6->sin6_len != sizeof(*sin6)) {
				error = EINVAL;
				goto out;
			}
			if ((inp->inp_vflag & INP_IPV6PROTO) == 0) {
				error = EAFNOSUPPORT;
				goto out;
			}
			if (IN6_IS_ADDR_MULTICAST(&sin6->sin6_addr)) {
				error = EAFNOSUPPORT;
				goto out;
			}
			if (IN6_IS_ADDR_V4MAPPED(&sin6->sin6_addr)) {
#ifdef INET
				if ((inp->inp_flags & IN6P_IPV6_V6ONLY) != 0) {
					error = EINVAL;
					goto out;
				}
				if ((inp->inp_vflag & INP_IPV4) == 0) {
					error = EAFNOSUPPORT;
					goto out;
				}
				restoreflags = true;
				inp->inp_vflag &= ~INP_IPV6;
				sinp = &sin;
				in6_sin6_2_sin(sinp, sin6);
				if (IN_MULTICAST(
				    ntohl(sinp->sin_addr.s_addr))) {
					error = EAFNOSUPPORT;
					goto out;
				}
				if ((error = prison_remote_ip4(td->td_ucred,
				    &sinp->sin_addr)))
					goto out;
				isipv6 = 0;
#else /* !INET */
				error = EAFNOSUPPORT;
				goto out;
#endif /* INET */
			} else {
				if ((inp->inp_vflag & INP_IPV6) == 0) {
					error = EAFNOSUPPORT;
					goto out;
				}
				restoreflags = true;
				inp->inp_vflag &= ~INP_IPV4;
				inp->inp_inc.inc_flags |= INC_ISIPV6;
				if ((error = prison_remote_ip6(td->td_ucred,
				    &sin6->sin6_addr)))
					goto out;
				isipv6 = 1;
			}
			break;
		}
#endif /* INET6 */
		default:
			error = EAFNOSUPPORT;
			goto out;
		}
	}
if (!(flags & PRUS_OOB)) {
|
2014-11-30 13:24:21 +00:00
|
|
|
sbappendstream(&so->so_snd, m, flags);
|
2021-05-21 17:44:40 -04:00
|
|
|
m = NULL;
|
1996-07-11 16:32:50 +00:00
|
|
|
if (nam && tp->t_state < TCPS_SYN_SENT) {
|
|
|
|
/*
|
|
|
|
* Do implied connect if not yet connected,
|
|
|
|
* initialize window to default value, and
|
2016-01-07 00:14:42 +00:00
|
|
|
* initialize maxseg using peer's cached MSS.
|
1996-07-11 16:32:50 +00:00
|
|
|
*/
|
2000-01-09 19:17:30 +00:00
|
|
|
#ifdef INET6
|
|
|
|
if (isipv6)
|
2001-09-12 08:38:13 +00:00
|
|
|
error = tcp6_connect(tp, nam, td);
|
2000-01-09 19:17:30 +00:00
|
|
|
#endif /* INET6 */
|
2011-04-30 11:21:29 +00:00
|
|
|
#if defined(INET6) && defined(INET)
|
|
|
|
else
|
|
|
|
#endif
|
|
|
|
#ifdef INET
|
2018-07-30 21:27:26 +00:00
|
|
|
error = tcp_connect(tp,
|
|
|
|
(struct sockaddr *)sinp, td);
|
2011-04-30 11:21:29 +00:00
|
|
|
#endif
|
2019-10-24 20:05:10 +00:00
|
|
|
/*
|
|
|
|
* The bind operation in tcp_connect succeeded. We
|
|
|
|
* no longer want to restore the flags if later
|
|
|
|
* operations fail.
|
|
|
|
*/
|
|
|
|
if (error == 0 || inp->inp_lport != 0)
|
|
|
|
restoreflags = false;
|
|
|
|
|
2021-05-21 17:44:40 -04:00
|
|
|
if (error) {
|
|
|
|
/* m is freed if PRUS_NOTREADY is unset. */
|
|
|
|
sbflush(&so->so_snd);
|
1996-07-11 16:32:50 +00:00
|
|
|
goto out;
|
2021-05-21 17:44:40 -04:00
|
|
|
}
|
2018-02-26 02:53:22 +00:00
|
|
|
if (IS_FASTOPEN(tp->t_flags))
|
|
|
|
tcp_fastopen_connect(tp);
|
2018-02-26 03:03:41 +00:00
|
|
|
else {
|
2018-02-26 02:53:22 +00:00
|
|
|
tp->snd_wnd = TTCP_CLIENT_SND_WND;
|
|
|
|
tcp_mss(tp, -1);
|
|
|
|
}
|
1996-07-11 16:32:50 +00:00
|
|
|
}
|
|
|
|
if (flags & PRUS_EOF) {
|
|
|
|
/*
|
|
|
|
* Close the send side of the connection after
|
|
|
|
* the data is sent.
|
|
|
|
*/
|
|
|
|
socantsendmore(so);
|
2006-04-01 16:36:36 +00:00
|
|
|
tcp_usrclosed(tp);
|
|
|
|
}
|
2020-06-08 11:48:07 +00:00
|
|
|
if (TCPS_HAVEESTABLISHED(tp->t_state) &&
|
|
|
|
((tp->t_flags2 & TF2_FBYTES_COMPLETE) == 0) &&
|
|
|
|
(tp->t_fbyte_out == 0) &&
|
|
|
|
(so->so_snd.sb_ccc > 0)) {
|
|
|
|
tp->t_fbyte_out = ticks;
|
|
|
|
if (tp->t_fbyte_out == 0)
|
|
|
|
tp->t_fbyte_out = 1;
|
|
|
|
if (tp->t_fbyte_out && tp->t_fbyte_in)
|
|
|
|
tp->t_flags2 |= TF2_FBYTES_COMPLETE;
|
|
|
|
}
|
2014-11-30 13:43:52 +00:00
|
|
|
if (!(inp->inp_flags & INP_DROPPED) &&
|
|
|
|
!(flags & PRUS_NOTREADY)) {
|
1999-01-20 17:32:01 +00:00
|
|
|
if (flags & PRUS_MORETOCOME)
|
|
|
|
tp->t_flags |= TF_MORETOCOME;
|
2015-12-16 00:56:45 +00:00
|
|
|
error = tp->t_fb->tfb_tcp_output(tp);
|
1999-01-20 17:32:01 +00:00
|
|
|
if (flags & PRUS_MORETOCOME)
|
|
|
|
tp->t_flags &= ~TF_MORETOCOME;
|
|
|
|
}
|
1996-07-11 16:32:50 +00:00
|
|
|
} else {
|
2006-04-01 16:36:36 +00:00
|
|
|
/*
|
|
|
|
* XXXRW: PRUS_EOF not implemented with PRUS_OOB?
|
|
|
|
*/
|
2005-03-14 22:15:14 +00:00
|
|
|
SOCKBUF_LOCK(&so->so_snd);
|
1996-07-11 16:32:50 +00:00
|
|
|
if (sbspace(&so->so_snd) < -512) {
|
2005-03-14 22:15:14 +00:00
|
|
|
SOCKBUF_UNLOCK(&so->so_snd);
|
1996-07-11 16:32:50 +00:00
|
|
|
error = ENOBUFS;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
/*
|
|
|
|
* According to RFC961 (Assigned Protocols),
|
|
|
|
* the urgent pointer points to the last octet
|
|
|
|
* of urgent data. We continue, however,
|
|
|
|
* to consider it to indicate the first octet
|
|
|
|
* of data past the urgent section.
|
|
|
|
* Otherwise, snd_up should be one lower.
|
|
|
|
*/
|
2014-11-30 13:24:21 +00:00
|
|
|
sbappendstream_locked(&so->so_snd, m, flags);
|
2005-03-14 22:15:14 +00:00
|
|
|
SOCKBUF_UNLOCK(&so->so_snd);
|
2021-05-21 17:44:40 -04:00
|
|
|
m = NULL;
|
1997-02-21 16:30:31 +00:00
|
|
|
if (nam && tp->t_state < TCPS_SYN_SENT) {
|
|
|
|
/*
|
|
|
|
* Do implied connect if not yet connected,
|
|
|
|
* initialize window to default value, and
|
2016-01-07 00:14:42 +00:00
|
|
|
* initialize maxseg using peer's cached MSS.
|
1997-02-21 16:30:31 +00:00
|
|
|
*/
|
2018-02-26 03:03:41 +00:00
|
|
|
|
2018-02-26 02:53:22 +00:00
|
|
|
/*
|
|
|
|
* Not going to contemplate SYN|URG
|
|
|
|
*/
|
|
|
|
if (IS_FASTOPEN(tp->t_flags))
|
|
|
|
tp->t_flags &= ~TF_FASTOPEN;
|
2000-01-09 19:17:30 +00:00
|
|
|
#ifdef INET6
|
|
|
|
if (isipv6)
|
2001-09-12 08:38:13 +00:00
|
|
|
error = tcp6_connect(tp, nam, td);
|
2000-01-09 19:17:30 +00:00
|
|
|
#endif /* INET6 */
|
2011-04-30 11:21:29 +00:00
|
|
|
#if defined(INET6) && defined(INET)
|
|
|
|
else
|
|
|
|
#endif
|
|
|
|
#ifdef INET
|
2018-07-30 21:27:26 +00:00
|
|
|
error = tcp_connect(tp,
|
|
|
|
(struct sockaddr *)sinp, td);
|
2011-04-30 11:21:29 +00:00
|
|
|
#endif
|
2019-10-24 20:05:10 +00:00
|
|
|
/*
|
|
|
|
* The bind operation in tcp_connect succeeded. We
|
|
|
|
* no longer want to restore the flags if later
|
|
|
|
* operations fail.
|
|
|
|
*/
|
|
|
|
if (error == 0 || inp->inp_lport != 0)
|
|
|
|
restoreflags = false;
|
|
|
|
|
2021-05-21 17:44:40 -04:00
|
|
|
if (error != 0) {
|
|
|
|
/* m is freed if PRUS_NOTREADY is unset. */
|
|
|
|
sbflush(&so->so_snd);
|
1997-02-21 16:30:31 +00:00
|
|
|
goto out;
|
2021-05-21 17:44:40 -04:00
|
|
|
}
|
1997-02-21 16:30:31 +00:00
|
|
|
tp->snd_wnd = TTCP_CLIENT_SND_WND;
|
|
|
|
tcp_mss(tp, -1);
|
|
|
|
}
|
2014-11-30 12:11:01 +00:00
|
|
|
tp->snd_up = tp->snd_una + sbavail(&so->so_snd);
|
2021-05-21 17:44:40 -04:00
|
|
|
if ((flags & PRUS_NOTREADY) == 0) {
|
2014-11-30 13:43:52 +00:00
|
|
|
tp->t_flags |= TF_FORCEDATA;
|
2015-12-16 00:56:45 +00:00
|
|
|
error = tp->t_fb->tfb_tcp_output(tp);
|
2014-11-30 13:43:52 +00:00
|
|
|
tp->t_flags &= ~TF_FORCEDATA;
|
|
|
|
}
|
1996-07-11 16:32:50 +00:00
|
|
|
}
|
2018-03-22 09:40:08 +00:00
|
|
|
TCP_LOG_EVENT(tp, NULL,
|
|
|
|
&inp->inp_socket->so_rcv,
|
|
|
|
&inp->inp_socket->so_snd,
|
|
|
|
TCP_LOG_USERSEND, error,
|
|
|
|
0, NULL, false);
|
2021-05-21 17:44:40 -04:00
|
|
|
|
2005-05-01 11:11:38 +00:00
|
|
|
out:
|
2021-05-21 17:44:40 -04:00
|
|
|
/*
|
|
|
|
* In case of PRUS_NOTREADY, the caller or tcp_usr_ready() is
|
|
|
|
* responsible for freeing memory.
|
|
|
|
*/
|
|
|
|
if (m != NULL && (flags & PRUS_NOTREADY) == 0)
|
|
|
|
m_freem(m);
|
|
|
|
|
2019-10-24 20:05:10 +00:00
|
|
|
/*
|
|
|
|
* If the request was unsuccessful and we changed flags,
|
|
|
|
* restore the original flags.
|
|
|
|
*/
|
|
|
|
if (error != 0 && restoreflags) {
|
|
|
|
inp->inp_vflag = vflagsav;
|
|
|
|
inp->inp_inc.inc_flags = incflagsav;
|
|
|
|
}
|
2005-05-01 11:11:38 +00:00
|
|
|
TCPDEBUG2((flags & PRUS_OOB) ? PRU_SENDOOB :
|
|
|
|
((flags & PRUS_EOF) ? PRU_SEND_EOF : PRU_SEND));
|
2015-09-13 15:50:55 +00:00
|
|
|
TCP_PROBE2(debug__user, tp, (flags & PRUS_OOB) ? PRU_SENDOOB :
|
|
|
|
((flags & PRUS_EOF) ? PRU_SEND_EOF : PRU_SEND));
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2020-01-22 05:53:16 +00:00
|
|
|
NET_EPOCH_EXIT(et);
|
2005-05-01 13:06:05 +00:00
|
|
|
return (error);
|
1996-07-11 16:32:50 +00:00
|
|
|
}
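The destination checks that tcp_usr_send() applies before an implied connect can be sketched in userland C. This is a minimal illustration under stated assumptions, not the kernel code: the helper name check_implied_connect_dst, the explicit sa_len argument (standing in for sin_len / sin6_len, which are not portable), and the v6only parameter (standing in for IN6P_IPV6_V6ONLY) are all hypothetical. The inp_vflag tests and the prison_remote_ip4() / prison_remote_ip6() jail checks are deliberately left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/*
 * Hypothetical helper: returns 0 if the address would pass the checks
 * sketched here, or the errno value the send path would report.
 */
static int
check_implied_connect_dst(const struct sockaddr *sa, socklen_t sa_len,
    int v6only)
{
	assert(sa != NULL);

	switch (sa->sa_family) {
	case AF_INET: {
		const struct sockaddr_in *sin =
		    (const struct sockaddr_in *)sa;

		if (sa_len != sizeof(*sin))
			return (EINVAL);
		/* TCP cannot connect to a multicast destination. */
		if (IN_MULTICAST(ntohl(sin->sin_addr.s_addr)))
			return (EAFNOSUPPORT);
		/* The limited broadcast address is refused as well. */
		if (ntohl(sin->sin_addr.s_addr) == INADDR_BROADCAST)
			return (EACCES);
		return (0);
	}
	case AF_INET6: {
		const struct sockaddr_in6 *sin6 =
		    (const struct sockaddr_in6 *)sa;
		uint32_t v4;

		if (sa_len != sizeof(*sin6))
			return (EINVAL);
		if (IN6_IS_ADDR_MULTICAST(&sin6->sin6_addr))
			return (EAFNOSUPPORT);
		if (IN6_IS_ADDR_V4MAPPED(&sin6->sin6_addr)) {
			/* A v6-only socket may not use mapped addresses. */
			if (v6only)
				return (EINVAL);
			/* The embedded IPv4 address is the last 4 bytes. */
			memcpy(&v4, &sin6->sin6_addr.s6_addr[12], sizeof(v4));
			if (IN_MULTICAST(ntohl(v4)))
				return (EAFNOSUPPORT);
		}
		return (0);
	}
	default:
		return (EAFNOSUPPORT);
	}
}
```

A caller would fill in a struct sockaddr_in or struct sockaddr_in6, pass its size as sa_len, and treat any nonzero return as the error to hand back to the user. As in the function above, the mapped-address branch checks for multicast but not for broadcast.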
|
|
|
|
|
2014-11-30 13:43:52 +00:00
|
|
|
static int
|
|
|
|
tcp_usr_ready(struct socket *so, struct mbuf *m, int count)
|
|
|
|
{
|
2020-01-22 05:53:16 +00:00
|
|
|
struct epoch_tracker et;
|
2014-11-30 13:43:52 +00:00
|
|
|
struct inpcb *inp;
|
|
|
|
struct tcpcb *tp;
|
|
|
|
int error;
|
|
|
|
|
|
|
|
inp = sotoinpcb(so);
|
|
|
|
INP_WLOCK(inp);
|
|
|
|
if (inp->inp_flags & (INP_TIMEWAIT | INP_DROPPED)) {
|
|
|
|
INP_WUNLOCK(inp);
|
Add an external mbuf buffer type that holds multiple unmapped pages.
Unmapped mbufs allow sendfile to carry multiple pages of data in a
single mbuf, without mapping those pages. It is a requirement for
Netflix's in-kernel TLS, and provides a 5-10% CPU savings on heavy web
serving workloads when used by sendfile, due to effectively
compressing socket buffers by an order of magnitude, and hence
reducing cache misses.
For this new external mbuf buffer type (EXT_PGS), the ext_buf pointer
now points to a struct mbuf_ext_pgs structure instead of a data
buffer. This structure contains an array of physical addresses (this
reduces cache misses compared to an earlier version that stored an
array of vm_page_t pointers). It also stores additional fields needed
for in-kernel TLS such as the TLS header and trailer data that are
currently unused. To more easily detect these mbufs, the M_NOMAP flag
is set in m_flags in addition to M_EXT.
Various functions like m_copydata() have been updated to safely access
packet contents (using uiomove_fromphys()), to make things like BPF
safe.
NIC drivers advertise support for unmapped mbufs on transmit via a new
IFCAP_NOMAP capability. This capability can be toggled via the new
'nomap' and '-nomap' ifconfig(8) commands. For NIC drivers that only
transmit packet contents via DMA and use bus_dma, adding the
capability to if_capabilities and if_capenable should be all that is
required.
If a NIC does not support unmapped mbufs, they are converted to a
chain of mapped mbufs (using sf_bufs to provide the mapping) in
ip_output or ip6_output. If an unmapped mbuf requires software
checksums, it is also converted to a chain of mapped mbufs before
computing the checksum.
Submitted by: gallatin (earlier version)
Reviewed by: gallatin, hselasky, rrs
Discussed with: ae, kp (firewalls)
Relnotes: yes
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20616
2019-06-29 00:48:33 +00:00
|
|
|
mb_free_notready(m, count);
|
2014-11-30 13:43:52 +00:00
|
|
|
return (ECONNRESET);
|
|
|
|
}
|
|
|
|
tp = intotcpcb(inp);
|
|
|
|
|
|
|
|
SOCKBUF_LOCK(&so->so_snd);
|
|
|
|
error = sbready(&so->so_snd, m, count);
|
|
|
|
SOCKBUF_UNLOCK(&so->so_snd);
|
2020-01-22 05:53:16 +00:00
|
|
|
if (error == 0) {
|
|
|
|
NET_EPOCH_ENTER(et);
|
2015-12-16 00:56:45 +00:00
|
|
|
error = tp->t_fb->tfb_tcp_output(tp);
|
2020-01-22 05:53:16 +00:00
|
|
|
NET_EPOCH_EXIT(et);
|
|
|
|
}
|
2014-11-30 13:43:52 +00:00
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
1996-07-11 16:32:50 +00:00
|
|
|
/*
|
2006-07-21 17:11:15 +00:00
|
|
|
* Abort the TCP. Drop the connection abruptly.
|
1996-07-11 16:32:50 +00:00
|
|
|
*/
|
2006-04-01 15:15:05 +00:00
|
|
|
static void
|
1996-07-11 16:32:50 +00:00
|
|
|
tcp_usr_abort(struct socket *so)
|
|
|
|
{
|
2002-06-10 20:05:46 +00:00
|
|
|
struct inpcb *inp;
|
2006-07-21 17:11:15 +00:00
|
|
|
struct tcpcb *tp = NULL;
|
2018-07-04 02:47:16 +00:00
|
|
|
struct epoch_tracker et;
|
2006-04-24 08:20:02 +00:00
|
|
|
TCPDEBUG0;
|
2006-04-01 16:36:36 +00:00
|
|
|
|
2006-04-24 08:20:02 +00:00
|
|
|
inp = sotoinpcb(so);
|
|
|
|
KASSERT(inp != NULL, ("tcp_usr_abort: inp == NULL"));
|
1996-07-11 16:32:50 +00:00
|
|
|
|
2019-11-07 00:10:14 +00:00
|
|
|
NET_EPOCH_ENTER(et);
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK(inp);
|
2006-04-24 08:20:02 +00:00
|
|
|
KASSERT(inp->inp_socket != NULL,
|
|
|
|
("tcp_usr_abort: inp_socket == NULL"));
|
|
|
|
|
2006-04-01 16:36:36 +00:00
|
|
|
/*
|
2006-07-21 17:11:15 +00:00
|
|
|
* If we still have full TCP state, and we're not dropped, drop.
|
2006-04-01 16:36:36 +00:00
|
|
|
*/
|
2009-03-15 09:58:31 +00:00
|
|
|
if (!(inp->inp_flags & INP_TIMEWAIT) &&
|
|
|
|
!(inp->inp_flags & INP_DROPPED)) {
|
2006-04-24 08:20:02 +00:00
|
|
|
tp = intotcpcb(inp);
|
2006-07-21 17:11:15 +00:00
|
|
|
TCPDEBUG1();
|
2018-04-06 17:20:37 +00:00
|
|
|
tp = tcp_drop(tp, ECONNABORTED);
|
|
|
|
if (tp == NULL)
|
|
|
|
goto dropped;
|
2006-07-21 17:11:15 +00:00
|
|
|
TCPDEBUG2(PRU_ABORT);
|
2015-09-13 15:50:55 +00:00
|
|
|
TCP_PROBE2(debug__user, tp, PRU_ABORT);
|
2006-04-24 08:20:02 +00:00
|
|
|
}
|
2009-03-15 09:58:31 +00:00
|
|
|
if (!(inp->inp_flags & INP_DROPPED)) {
|
2006-07-21 17:11:15 +00:00
|
|
|
SOCK_LOCK(so);
|
|
|
|
so->so_state |= SS_PROTOREF;
|
|
|
|
SOCK_UNLOCK(so);
|
2009-03-15 09:58:31 +00:00
|
|
|
inp->inp_flags |= INP_SOCKREF;
|
2006-07-21 17:11:15 +00:00
|
|
|
}
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2018-04-06 17:20:37 +00:00
|
|
|
dropped:
|
2019-11-07 00:10:14 +00:00
|
|
|
NET_EPOCH_EXIT(et);
|
2006-07-21 17:11:15 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* TCP socket is closed. Start friendly disconnect.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
tcp_usr_close(struct socket *so)
|
|
|
|
{
|
|
|
|
struct inpcb *inp;
|
|
|
|
struct tcpcb *tp = NULL;
|
2018-07-04 02:47:16 +00:00
|
|
|
struct epoch_tracker et;
|
2006-07-21 17:11:15 +00:00
|
|
|
TCPDEBUG0;
|
|
|
|
|
|
|
|
inp = sotoinpcb(so);
|
|
|
|
KASSERT(inp != NULL, ("tcp_usr_close: inp == NULL"));
|
|
|
|
|
2019-11-07 00:10:14 +00:00
|
|
|
NET_EPOCH_ENTER(et);
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK(inp);
|
2006-07-21 17:11:15 +00:00
|
|
|
KASSERT(inp->inp_socket != NULL,
|
|
|
|
("tcp_usr_close: inp_socket == NULL"));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If we still have full TCP state, and we're not dropped, initiate
|
|
|
|
* a disconnect.
|
|
|
|
*/
|
2009-03-15 09:58:31 +00:00
|
|
|
if (!(inp->inp_flags & INP_TIMEWAIT) &&
|
|
|
|
!(inp->inp_flags & INP_DROPPED)) {
|
2006-07-21 17:11:15 +00:00
|
|
|
tp = intotcpcb(inp);
|
|
|
|
TCPDEBUG1();
|
|
|
|
tcp_disconnect(tp);
|
|
|
|
TCPDEBUG2(PRU_CLOSE);
|
2015-09-13 15:50:55 +00:00
|
|
|
TCP_PROBE2(debug__user, tp, PRU_CLOSE);
|
2006-07-21 17:11:15 +00:00
|
|
|
}
|
2009-03-15 09:58:31 +00:00
|
|
|
if (!(inp->inp_flags & INP_DROPPED)) {
|
2006-07-21 17:11:15 +00:00
|
|
|
SOCK_LOCK(so);
|
|
|
|
so->so_state |= SS_PROTOREF;
|
|
|
|
SOCK_UNLOCK(so);
|
2009-03-15 09:58:31 +00:00
|
|
|
inp->inp_flags |= INP_SOCKREF;
|
2006-07-21 17:11:15 +00:00
|
|
|
}
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2019-11-07 00:10:14 +00:00
|
|
|
NET_EPOCH_EXIT(et);
|
1996-07-11 16:32:50 +00:00
|
|
|
}
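The flag handling in tcp_usr_close() above can be modeled in user space as a small computation on the inpcb flags. This is a minimal sketch under stated assumptions, not the kernel code: the MODEL_* constants and model_close() are hypothetical names, the bit values are illustrative, and the disconnect_drops parameter stands in for the case where the disconnect step itself ends up dropping the connection.

```c
#include <assert.h>

/* Illustrative bit values only; the real INP_* flags live in kernel headers. */
#define MODEL_INP_TIMEWAIT	0x1
#define MODEL_INP_DROPPED	0x2
#define MODEL_INP_SOCKREF	0x4

/*
 * Model of the flag handling in tcp_usr_close().  Returns the inpcb flags
 * after close.  disconnect_drops is a hypothetical knob: nonzero means the
 * disconnect step dropped the connection immediately.
 */
int
model_close(int inp_flags, int disconnect_drops)
{
	/* Only a live connection (not TIME-WAIT, not dropped) is disconnected. */
	if (!(inp_flags & MODEL_INP_TIMEWAIT) &&
	    !(inp_flags & MODEL_INP_DROPPED)) {
		if (disconnect_drops)
			inp_flags |= MODEL_INP_DROPPED;
	}
	/* If the connection was not dropped, TCP keeps a socket reference. */
	if (!(inp_flags & MODEL_INP_DROPPED))
		inp_flags |= MODEL_INP_SOCKREF;
	return (inp_flags);
}
```

The second test of the dropped flag matters in this model because the disconnect step may already have dropped the connection, in which case no protocol reference to the socket is taken.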
|
|
|
|
|
2020-05-10 17:43:42 +00:00
|
|
|
static int
|
2020-05-04 20:19:57 +00:00
|
|
|
tcp_pru_options_support(struct tcpcb *tp, int flags)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* If the specific TCP stack has a pru_options
|
|
|
|
* specified then it does not always support
|
|
|
|
* all the PRU_XX options and we must ask it.
|
|
|
|
* If the function is not specified then all
|
|
|
|
* of the PRU_XX options are supported.
|
|
|
|
*/
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
if (tp->t_fb->tfb_pru_options) {
|
|
|
|
ret = (*tp->t_fb->tfb_pru_options)(tp, flags);
|
|
|
|
}
|
|
|
|
return (ret);
|
|
|
|
}
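The optional-hook pattern described in the comment of tcp_pru_options_support() can be sketched outside the kernel as follows. Everything here is an assumption made for illustration: the model_* types, the MODEL_PRUS_OOB value, and the no_oob_stack_options() hook are hypothetical stand-ins, not the real tcp_function_block interface.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
struct model_tcpcb;

struct model_function_block {
	/* Optional hook: NULL means every PRU_XX option is supported. */
	int (*pru_options)(struct model_tcpcb *tp, int flags);
};

struct model_tcpcb {
	struct model_function_block *t_fb;
};

#define MODEL_PRUS_OOB	0x1	/* illustrative flag value only */

/* Example hook for a stack that refuses out-of-band data. */
int
no_oob_stack_options(struct model_tcpcb *tp, int flags)
{
	(void)tp;
	return ((flags & MODEL_PRUS_OOB) ? EOPNOTSUPP : 0);
}

/* Ask the stack only if it registered a hook; otherwise report support. */
int
model_options_support(struct model_tcpcb *tp, int flags)
{
	if (tp->t_fb->pru_options != NULL)
		return (tp->t_fb->pru_options(tp, flags));
	return (0);
}

/* Helper: a stack without a hook. */
int
model_default_stack_check(int flags)
{
	struct model_function_block fb = { .pru_options = NULL };
	struct model_tcpcb tp = { .t_fb = &fb };

	return (model_options_support(&tp, flags));
}

/* Helper: a stack whose hook rejects out-of-band requests. */
int
model_no_oob_stack_check(int flags)
{
	struct model_function_block fb = { .pru_options = no_oob_stack_options };
	struct model_tcpcb tp = { .t_fb = &fb };

	return (model_options_support(&tp, flags));
}
```

Treating a missing hook as "everything supported" means a stack that never sets the pointer needs no extra code, while a restricted stack opts in by supplying one function.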
|
|
|
|
|
1996-07-11 16:32:50 +00:00
|
|
|
/*
|
|
|
|
* Receive out-of-band data.
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
tcp_usr_rcvoob(struct socket *so, struct mbuf *m, int flags)
|
|
|
|
{
|
|
|
|
int error = 0;
|
2002-06-10 20:05:46 +00:00
|
|
|
struct inpcb *inp;
|
Update TCP for infrastructural changes to the socket/pcb refcount model,
pru_abort(), pru_detach(), and in_pcbdetach():
- Universally support and enforce the invariant that so_pcb is
never NULL, converting dozens of unnecessary NULL checks into
assertions, and eliminating dozens of unnecessary error handling
cases in protocol code.
- In some cases, eliminate unnecessary pcbinfo locking, as it is no
longer required to ensure so_pcb != NULL. For example, the receive
code no longer requires the pcbinfo lock, and the send code only
requires it if building a new connection on an otherwise unconnected
socket triggered via sendto() with an address. This should
significantly reduce tcbinfo lock contention in the receive and send
cases.
- In order to support the invariant that so_pcb != NULL, it is now
necessary for the TCP code to not discard the tcpcb any time a
connection is dropped, but instead leave the tcpcb until the socket
is shutdown. This case is handled by setting INP_DROPPED, to
substitute for using a NULL so_pcb to indicate that the connection
has been dropped. This requires the inpcb lock, but not the pcbinfo
lock.
- Unlike all other protocols in the tree, TCP may need to retain access
to the socket after the file descriptor has been closed. Set
SS_PROTOREF in tcp_detach() in order to prevent the socket from being
freed, and add a flag, INP_SOCKREF, so that the TCP code knows whether
or not it needs to free the socket when the connection finally does
close. The typical case where this occurs is if close() is called on
a TCP socket before all sent data in the send socket buffer has been
transmitted or acknowledged. If INP_SOCKREF is found when the
connection is dropped, we release the inpcb, tcpcb, and socket instead
of flagging INP_DROPPED.
- Abort and detach protocol switch methods no longer return failures,
nor attempt to free sockets, as the socket layer does this.
- Annotate the existence of a long-standing race in the TCP timer code,
in which timers are stopped but not drained when the socket is freed,
as waiting for drain may lead to deadlocks, or have to occur in a
context where waiting is not permitted. This race has been handled
by testing to see if the tcpcb pointer in the inpcb is NULL (and vice
versa), which is not normally permitted, but may be true if an inpcb
and tcpcb have been freed. Add a counter to test how often this race
has actually occurred, and a large comment for each instance where
we compare potentially freed memory with NULL. This will have to be
fixed in the near future, but requires us to further address how to
handle the timer shutdown issue.
- Several TCP calls no longer potentially free the passed inpcb/tcpcb,
so no longer need to return a pointer to indicate whether the argument
passed in is still valid.
- Un-macroize debugging and locking setup for various protocol switch
methods for TCP, as it led to more obscurity, and as locking becomes
more customized to the methods, offers less benefit.
- Assert copyright on tcp_usrreq.c due to significant modifications that
have been made as part of this work.
These changes significantly modify the memory management and connection
logic of our TCP implementation, and are (as such) High Risk Changes,
and likely to contain serious bugs. Please report problems to the
current@ mailing list ASAP, ideally with simple test cases, and
optionally, packet traces.
MFC after: 3 months
2006-04-01 16:36:36 +00:00
|
|
|
struct tcpcb *tp = NULL;
|
1996-07-11 16:32:50 +00:00
|
|
|
|
2006-04-01 16:36:36 +00:00
|
|
|
TCPDEBUG0;
|
|
|
|
inp = sotoinpcb(so);
|
|
|
|
KASSERT(inp != NULL, ("tcp_usr_rcvoob: inp == NULL"));
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK(inp);
|
2009-03-15 09:58:31 +00:00
|
|
|
if (inp->inp_flags & (INP_TIMEWAIT | INP_DROPPED)) {
|
2006-11-22 17:16:54 +00:00
|
|
|
error = ECONNRESET;
|
2006-04-01 16:36:36 +00:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
tp = intotcpcb(inp);
|
2020-05-04 20:19:57 +00:00
|
|
|
error = tcp_pru_options_support(tp, PRUS_OOB);
|
|
|
|
if (error) {
|
|
|
|
goto out;
|
|
|
|
}
|
2006-04-01 16:36:36 +00:00
|
|
|
TCPDEBUG1();
|
1996-07-11 16:32:50 +00:00
|
|
|
if ((so->so_oobmark == 0 &&
|
2004-06-14 18:16:22 +00:00
|
|
|
(so->so_rcv.sb_state & SBS_RCVATMARK) == 0) ||
|
2002-05-31 11:52:35 +00:00
|
|
|
so->so_options & SO_OOBINLINE ||
|
|
|
|
tp->t_oobflags & TCPOOB_HADDATA) {
|
1996-07-11 16:32:50 +00:00
|
|
|
error = EINVAL;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
if ((tp->t_oobflags & TCPOOB_HAVEDATA) == 0) {
|
|
|
|
error = EWOULDBLOCK;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
m->m_len = 1;
|
|
|
|
*mtod(m, caddr_t) = tp->t_iobc;
|
|
|
|
if ((flags & MSG_PEEK) == 0)
|
|
|
|
tp->t_oobflags ^= (TCPOOB_HAVEDATA | TCPOOB_HADDATA);
|
2006-04-01 16:36:36 +00:00
|
|
|
|
|
|
|
out:
|
|
|
|
TCPDEBUG2(PRU_RCVOOB);
|
2015-09-13 15:50:55 +00:00
|
|
|
TCP_PROBE2(debug__user, tp, PRU_RCVOOB);
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2006-04-01 16:36:36 +00:00
|
|
|
return (error);
|
1996-07-11 16:32:50 +00:00
|
|
|
}
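The out-of-band flag handling in tcp_usr_rcvoob() above can be sketched as two small functions. This is a simplified model under stated assumptions: the MODEL_OOB_* constants and function names are hypothetical, the bit values are illustrative, and the so_oobmark and SO_OOBINLINE parts of the EINVAL test are deliberately left out.

```c
#include <assert.h>
#include <errno.h>

/* Illustrative values; the real TCPOOB_* flags are defined by the kernel. */
#define MODEL_OOB_HAVEDATA	0x01	/* an urgent byte is waiting */
#define MODEL_OOB_HADDATA	0x02	/* the urgent byte was already read */

/* Decide whether an out-of-band read can proceed (reduced model). */
int
model_oob_check(int oobflags)
{
	if (oobflags & MODEL_OOB_HADDATA)
		return (EINVAL);	/* byte already consumed */
	if ((oobflags & MODEL_OOB_HAVEDATA) == 0)
		return (EWOULDBLOCK);	/* nothing has arrived yet */
	return (0);
}

/* Flags after a successful read; a peek leaves them untouched. */
int
model_oob_after_read(int oobflags, int peek)
{
	if (!peek)
		oobflags ^= (MODEL_OOB_HAVEDATA | MODEL_OOB_HADDATA);
	return (oobflags);
}
```

Because a successful read is only reached with HAVEDATA set and HADDATA clear, XOR-ing both bits clears the first and sets the second in one step.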
|
|
|
|
|
2011-04-30 11:21:29 +00:00
|
|
|
#ifdef INET
|
1996-07-11 16:32:50 +00:00
|
|
|
struct pr_usrreqs tcp_usrreqs = {
|
2004-11-08 14:44:54 +00:00
|
|
|
.pru_abort = tcp_usr_abort,
|
|
|
|
.pru_accept = tcp_usr_accept,
|
|
|
|
.pru_attach = tcp_usr_attach,
|
|
|
|
.pru_bind = tcp_usr_bind,
|
|
|
|
.pru_connect = tcp_usr_connect,
|
|
|
|
.pru_control = in_control,
|
|
|
|
.pru_detach = tcp_usr_detach,
|
|
|
|
.pru_disconnect = tcp_usr_disconnect,
|
|
|
|
.pru_listen = tcp_usr_listen,
|
2007-05-11 10:20:51 +00:00
|
|
|
.pru_peeraddr = in_getpeeraddr,
|
2004-11-08 14:44:54 +00:00
|
|
|
.pru_rcvd = tcp_usr_rcvd,
|
|
|
|
.pru_rcvoob = tcp_usr_rcvoob,
|
|
|
|
.pru_send = tcp_usr_send,
|
2014-11-30 13:43:52 +00:00
|
|
|
.pru_ready = tcp_usr_ready,
|
2004-11-08 14:44:54 +00:00
|
|
|
.pru_shutdown = tcp_usr_shutdown,
|
2007-05-11 10:20:51 +00:00
|
|
|
.pru_sockaddr = in_getsockaddr,
|
2006-07-21 17:11:15 +00:00
|
|
|
.pru_sosetlabel = in_pcbsosetlabel,
|
|
|
|
.pru_close = tcp_usr_close,
|
1996-07-11 16:32:50 +00:00
|
|
|
};
|
2011-04-30 11:21:29 +00:00
|
|
|
#endif /* INET */
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2000-01-09 19:17:30 +00:00
|
|
|
#ifdef INET6
|
|
|
|
struct pr_usrreqs tcp6_usrreqs = {
|
2004-11-08 14:44:54 +00:00
|
|
|
.pru_abort = tcp_usr_abort,
|
|
|
|
.pru_accept = tcp6_usr_accept,
|
|
|
|
.pru_attach = tcp_usr_attach,
|
|
|
|
.pru_bind = tcp6_usr_bind,
|
|
|
|
.pru_connect = tcp6_usr_connect,
|
|
|
|
.pru_control = in6_control,
|
|
|
|
.pru_detach = tcp_usr_detach,
|
|
|
|
.pru_disconnect = tcp_usr_disconnect,
|
|
|
|
.pru_listen = tcp6_usr_listen,
|
|
|
|
.pru_peeraddr = in6_mapped_peeraddr,
|
|
|
|
.pru_rcvd = tcp_usr_rcvd,
|
|
|
|
.pru_rcvoob = tcp_usr_rcvoob,
|
|
|
|
.pru_send = tcp_usr_send,
|
2014-11-30 13:43:52 +00:00
|
|
|
.pru_ready = tcp_usr_ready,
|
2004-11-08 14:44:54 +00:00
|
|
|
.pru_shutdown = tcp_usr_shutdown,
|
|
|
|
.pru_sockaddr = in6_mapped_sockaddr,
|
2011-01-07 21:40:34 +00:00
|
|
|
.pru_sosetlabel = in_pcbsosetlabel,
|
2006-07-21 17:11:15 +00:00
|
|
|
.pru_close = tcp_usr_close,
|
2000-01-09 19:17:30 +00:00
|
|
|
};
|
|
|
|
#endif /* INET6 */
|
|
|
|
|
2011-04-30 11:21:29 +00:00
|
|
|
#ifdef INET
|
1995-02-09 23:13:27 +00:00
|
|
|
/*
|
|
|
|
* Common subroutine to open a TCP connection to remote host specified
|
|
|
|
* by the struct sockaddr_in that nam points to. Call in_pcbbind to assign a local
|
2002-10-21 13:55:50 +00:00
|
|
|
* port number if needed. Call in_pcbconnect_setup to do the routing and
|
|
|
|
* to choose a local host address (interface). If there is an existing
|
|
|
|
* incarnation of the same connection in TIME-WAIT state and if the remote
|
|
|
|
* host was sending CC options and if the connection duration was < MSL, then
|
1995-02-09 23:13:27 +00:00
|
|
|
* truncate the previous TIME-WAIT state and proceed.
|
|
|
|
* Initialize connection parameters and enter SYN-SENT state.
|
|
|
|
*/
|
1995-11-14 20:34:56 +00:00
|
|
|
static int
|
2007-03-21 19:37:55 +00:00
|
|
|
tcp_connect(struct tcpcb *tp, struct sockaddr *nam, struct thread *td)
|
1995-02-09 23:13:27 +00:00
|
|
|
{
|
|
|
|
struct inpcb *inp = tp->t_inpcb, *oinp;
|
|
|
|
struct socket *so = inp->inp_socket;
|
2002-10-21 13:55:50 +00:00
|
|
|
struct in_addr laddr;
|
|
|
|
u_short lport;
|
Improved connection establishment performance by doing local port lookups via
a hashed port list. In the new scheme, in_pcblookup() goes away and is
replaced by a new routine, in_pcblookup_local() for doing the local port
check. Note that this implementation is space inefficient in that the PCB
struct is now too large to fit into 128 bytes. I might deal with this in the
future by using the new zone allocator, but I wanted these changes to be
extensively tested in their current form first.
Also:
1) Fixed off-by-one errors in the port lookup loops in in_pcbbind().
2) Got rid of some unneeded rehashing. Added a new routine, in_pcbinshash(),
to do the initial hash insertion.
3) Renamed in_pcblookuphash() to in_pcblookup_hash() for easier readability.
4) Added a new routine, in_pcbremlists() to remove the PCB from the various
hash lists.
5) Added/deleted comments where appropriate.
6) Removed unnecessary splnet() locking. In general, the PCB functions should
be called at splnet()...there are unfortunately a few exceptions, however.
7) Reorganized a few structs for better cache line behavior.
8) Killed my TCP_ACK_HACK kludge. It may come back in a different form in
the future, however.
These changes have been tested on wcarchive for more than a month. In tests
done here, connection establishment overhead is reduced by more than 50
times, thus getting rid of one of the major networking scalability problems.
Still to do: make tcp_fastimo/tcp_slowtimo scale well for systems with a
large number of connections. tcp_fastimo is easy; tcp_slowtimo is difficult.
WARNING: Anything that knows about inpcb and tcpcb structs will have to be
recompiled; at the very least, this includes netstat(1).
1998-01-27 09:15:13 +00:00
|
|
|
int error;
|
1995-02-09 23:13:27 +00:00
|
|
|
|
2020-01-22 06:10:41 +00:00
|
|
|
NET_EPOCH_ASSERT();
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK_ASSERT(inp);
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need to acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
|
|
|
INP_HASH_WLOCK(&V_tcbinfo);
|
Update TCP for infrastructural changes to the socket/pcb refcount model,
pru_abort(), pru_detach(), and in_pcbdetach():
- Universally support and enforce the invariant that so_pcb is
never NULL, converting dozens of unnecessary NULL checks into
assertions, and eliminating dozens of unnecessary error handling
cases in protocol code.
- In some cases, eliminate unnecessary pcbinfo locking, as it is no
longer required to ensure so_pcb != NULL. For example, the receive
code no longer requires the pcbinfo lock, and the send code only
requires it if building a new connection on an otherwise unconnected
socket triggered via sendto() with an address. This should
significantly reduce tcbinfo lock contention in the receive and send
cases.
- In order to support the invariant that so_pcb != NULL, it is now
necessary for the TCP code to not discard the tcpcb any time a
connection is dropped, but instead leave the tcpcb until the socket
is shut down. This case is handled by setting INP_DROPPED, to
substitute for using a NULL so_pcb to indicate that the connection
has been dropped. This requires the inpcb lock, but not the pcbinfo
lock.
- Unlike all other protocols in the tree, TCP may need to retain access
to the socket after the file descriptor has been closed. Set
SS_PROTOREF in tcp_detach() in order to prevent the socket from being
freed, and add a flag, INP_SOCKREF, so that the TCP code knows whether
or not it needs to free the socket when the connection finally does
close. The typical case where this occurs is if close() is called on
a TCP socket before all sent data in the send socket buffer has been
transmitted or acknowledged. If INP_SOCKREF is found when the
connection is dropped, we release the inpcb, tcpcb, and socket instead
of flagging INP_DROPPED.
- Abort and detach protocol switch methods no longer return failures,
nor attempt to free sockets, as the socket layer does this.
- Annotate the existence of a long-standing race in the TCP timer code,
in which timers are stopped but not drained when the socket is freed,
as waiting for drain may lead to deadlocks, or have to occur in a
context where waiting is not permitted. This race has been handled
by testing to see if the tcpcb pointer in the inpcb is NULL (and vice
versa), which is not normally permitted, but may be true if an inpcb
and tcpcb have been freed. Add a counter to test how often this race
has actually occurred, and a large comment for each instance where
we compare potentially freed memory with NULL. This will have to be
fixed in the near future, but requires us to further address how to
handle the timer shutdown issue.
- Several TCP calls no longer potentially free the passed inpcb/tcpcb,
so no longer need to return a pointer to indicate whether the argument
passed in is still valid.
- Un-macroize debugging and locking setup for various protocol switch
methods for TCP, as it led to more obscurity, and as locking becomes
more customized to the methods, offers less benefit.
- Assert copyright on tcp_usrreq.c due to significant modifications that
have been made as part of this work.
These changes significantly modify the memory management and connection
logic of our TCP implementation, and are (as such) High Risk Changes,
and likely to contain serious bugs. Please report problems to the
current@ mailing list ASAP, ideally with simple test cases, and
optionally, packet traces.
MFC after: 3 months
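The INP_DROPPED convention described above can be illustrated with a minimal user-space sketch. It assumes simplified stand-in types: every name prefixed with ex_ or EX_ is hypothetical, the flag value is arbitrary, and the ECONNRESET return is only an illustrative choice of error. It shows the idea only, not the kernel code: the socket keeps a non-NULL pcb pointer after a drop, and later operations test a flag instead of comparing so_pcb with NULL.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define EX_INP_DROPPED 0x1	/* hypothetical flag bit, not the kernel value */

struct ex_inpcb {
	int inp_flags;		/* would be protected by the inpcb lock */
};

struct ex_socket {
	struct ex_inpcb *so_pcb;	/* invariant: never NULL while attached */
};

/* Mark the connection as dropped without freeing or detaching the pcb. */
static void
ex_drop(struct ex_socket *so)
{
	assert(so->so_pcb != NULL);
	so->so_pcb->inp_flags |= EX_INP_DROPPED;
}

/* A later request checks the flag rather than testing so_pcb for NULL. */
static int
ex_send(struct ex_socket *so)
{
	struct ex_inpcb *inp = so->so_pcb;

	assert(inp != NULL);		/* the invariant replaces an error path */
	if (inp->inp_flags & EX_INP_DROPPED)
		return (ECONNRESET);
	return (0);
}
```

In this sketch the NULL check becomes an assertion and the "connection is gone" condition becomes an ordinary flag test, mirroring the conversion the message describes.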
2006-04-01 16:36:36 +00:00
|
|
|
|
2020-05-18 22:53:12 +00:00
|
|
|
if (V_tcp_require_unique_port && inp->inp_lport == 0) {
|
2017-02-10 05:58:16 +00:00
|
|
|
error = in_pcbbind(inp, (struct sockaddr *)0, td->td_ucred);
|
|
|
|
if (error)
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need to acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. The new lookup flags,
supplementing the existing INPLOOKUP_WILDCARD flag, are:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
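The rule that callers pass exactly one of the two locking flags can be sketched as a small validity check. This is an assumption-laden illustration: the EX_-prefixed names and bit values are hypothetical stand-ins rather than the kernel's definitions, and the assert() stands in for the kind of assertion a kernel lookup routine might make.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the lookup flags named in the message. */
#define EX_LOOKUP_WILDCARD	0x1
#define EX_LOOKUP_RLOCKPCB	0x2	/* return the pcb read-locked */
#define EX_LOOKUP_WLOCKPCB	0x4	/* return the pcb write-locked */
#define EX_LOOKUP_LOCKMASK	(EX_LOOKUP_RLOCKPCB | EX_LOOKUP_WLOCKPCB)
#define EX_LOOKUP_ALL		(EX_LOOKUP_WILDCARD | EX_LOOKUP_LOCKMASK)

/* True when no unknown bits are set and exactly one lock flag is given. */
static bool
ex_lookup_flags_valid(int flags)
{
	int lock = flags & EX_LOOKUP_LOCKMASK;

	if ((flags & ~EX_LOOKUP_ALL) != 0)
		return (false);
	return (lock == EX_LOOKUP_RLOCKPCB || lock == EX_LOOKUP_WLOCKPCB);
}

/* Extract the requested lock mode, asserting the caller obeyed the rule. */
static int
ex_lock_mode(int flags)
{
	assert(ex_lookup_flags_valid(flags));
	return (flags & EX_LOOKUP_LOCKMASK);
}
```

A caller in this sketch would pass, for example, EX_LOOKUP_WILDCARD | EX_LOOKUP_RLOCKPCB; passing neither or both lock flags fails the check.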
2011-05-30 09:43:55 +00:00
|
|
|
goto out;
|
1995-02-09 23:13:27 +00:00
|
|
|
}
|
|
|
|
|
1995-05-30 08:16:23 +00:00
|
|
|
/*
|
1995-02-09 23:13:27 +00:00
|
|
|
* Cannot simply call in_pcbconnect, because there might be an
|
|
|
|
* earlier incarnation of this same connection still in
|
|
|
|
* TIME_WAIT state, creating an ADDRINUSE error.
|
|
|
|
*/
|
2002-10-21 13:55:50 +00:00
|
|
|
laddr = inp->inp_laddr;
|
|
|
|
lport = inp->inp_lport;
|
|
|
|
error = in_pcbconnect_setup(inp, nam, &laddr.s_addr, &lport,
|
2004-03-27 21:05:46 +00:00
|
|
|
&inp->inp_faddr.s_addr, &inp->inp_fport, &oinp, td->td_ucred);
|
2002-10-21 13:55:50 +00:00
|
|
|
if (error && oinp == NULL)
|
2011-05-30 09:43:55 +00:00
|
|
|
goto out;
|
|
|
|
if (oinp) {
|
|
|
|
error = EADDRINUSE;
|
|
|
|
goto out;
|
|
|
|
}
|
2020-05-18 22:53:12 +00:00
|
|
|
/* Handle initial bind if it hadn't been done in advance. */
|
|
|
|
if (inp->inp_lport == 0) {
|
|
|
|
inp->inp_lport = lport;
|
|
|
|
if (in_pcbinshash(inp) != 0) {
|
|
|
|
inp->inp_lport = 0;
|
|
|
|
error = EAGAIN;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
}
|
2002-10-21 13:55:50 +00:00
|
|
|
inp->inp_laddr = laddr;
|
1995-04-09 01:29:31 +00:00
|
|
|
in_pcbrehash(inp);
|
2011-05-30 09:43:55 +00:00
|
|
|
INP_HASH_WUNLOCK(&V_tcbinfo);
|
1995-02-09 23:13:27 +00:00
|
|
|
|
2007-02-01 17:39:18 +00:00
|
|
|
/*
|
|
|
|
* Compute window scaling to request:
|
|
|
|
* Scale to fit into sweet spot. See tcp_syncache.c.
|
|
|
|
* XXX: This should move to tcp_output().
|
|
|
|
*/
|
1995-02-09 23:13:27 +00:00
|
|
|
while (tp->request_r_scale < TCP_MAX_WINSHIFT &&
|
2007-10-19 08:53:14 +00:00
|
|
|
(TCP_MAXWIN << tp->request_r_scale) < sb_max)
|
1995-02-09 23:13:27 +00:00
|
|
|
tp->request_r_scale++;
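The window-scale loop above can be tried in isolation with a small user-space sketch. The helper name request_r_scale_for() is hypothetical, the two constants are restated here on the assumption that they match netinet/tcp.h, and sb_max is passed as a parameter instead of being read from the kernel global. The loop picks the smallest shift for which the scaled maximum window covers sb_max, capped at the maximum shift.

```c
#include <assert.h>
#include <stdint.h>

/* Restated here for the sketch; assumed to match netinet/tcp.h. */
#define TCP_MAXWIN		65535	/* largest unscaled window */
#define TCP_MAX_WINSHIFT	14	/* largest window scale shift */

/*
 * Hypothetical helper: return the smallest shift such that
 * (TCP_MAXWIN << shift) >= sb_max, capped at TCP_MAX_WINSHIFT.
 */
static int
request_r_scale_for(uint64_t sb_max)
{
	int scale = 0;

	while (scale < TCP_MAX_WINSHIFT &&
	    ((uint64_t)TCP_MAXWIN << scale) < sb_max)
		scale++;
	assert(scale >= 0 && scale <= TCP_MAX_WINSHIFT);
	return (scale);
}
```

With a 2 MiB buffer limit (2097152 bytes) the sketch yields a shift of 6, since 65535 << 5 is 2097120 and still falls just short of the limit.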
|
|
|
|
|
|
|
|
soisconnecting(so);
|
2009-04-11 22:07:19 +00:00
|
|
|
TCPSTAT_INC(tcps_connattempt);
|
2013-08-25 21:54:41 +00:00
|
|
|
tcp_state_change(tp, TCPS_SYN_SENT);
|
2018-08-19 14:56:10 +00:00
|
|
|
tp->iss = tcp_new_isn(&inp->inp_inc);
|
|
|
|
if (tp->t_flags & TF_REQ_TSTMP)
|
|
|
|
tp->ts_offset = tcp_new_ts_offset(&inp->inp_inc);
|
1995-02-09 23:13:27 +00:00
|
|
|
tcp_sendseqinit(tp);
|
1995-11-03 22:08:13 +00:00
|
|
|
|
1995-02-09 23:13:27 +00:00
|
|
|
return 0;
|
2011-05-30 09:43:55 +00:00
|
|
|
|
|
|
|
out:
|
|
|
|
INP_HASH_WUNLOCK(&V_tcbinfo);
|
|
|
|
return (error);
|
1995-02-09 23:13:27 +00:00
|
|
|
}
|
2011-04-30 11:21:29 +00:00
|
|
|
#endif /* INET */
|
1995-02-09 23:13:27 +00:00
|
|
|
|
2000-01-09 19:17:30 +00:00
|
|
|
#ifdef INET6
|
|
|
|
static int
|
2007-03-21 19:37:55 +00:00
|
|
|
tcp6_connect(struct tcpcb *tp, struct sockaddr *nam, struct thread *td)
|
2000-01-09 19:17:30 +00:00
|
|
|
{
|
2014-09-10 13:17:35 +00:00
|
|
|
struct inpcb *inp = tp->t_inpcb;
|
2000-01-09 19:17:30 +00:00
|
|
|
int error;
|
|
|
|
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK_ASSERT(inp);
|
2011-05-30 09:43:55 +00:00
|
|
|
INP_HASH_WLOCK(&V_tcbinfo);
|
2006-04-01 16:36:36 +00:00
|
|
|
|
2020-05-18 22:53:12 +00:00
|
|
|
if (V_tcp_require_unique_port && inp->inp_lport == 0) {
|
2017-02-10 05:58:16 +00:00
|
|
|
error = in6_pcbbind(inp, (struct sockaddr *)0, td->td_ucred);
|
|
|
|
if (error)
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
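The lookup scheme described in the commit message above (take the hash lock, acquire a reference, drop the hash lock, then take the per-connection lock and recheck) can be sketched in userland C roughly as follows. This is an illustration under stated assumptions, not the kernel implementation: `struct conn`, `hash_lock`, `table` and `conn_lookup_locked()` are hypothetical stand-ins for the inpcb, ipi_hash_lock, the pcbinfo hash tables and in_pcblookup(), and POSIX mutexes replace the kernel's rwlocks.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define NCONN 2

/* Hypothetical stand-in for an inpcb. */
struct conn {
	pthread_mutex_t lock;	/* per-connection lock, ordered before hash_lock */
	atomic_int refcount;	/* keeps the object alive between the two locks */
	int port;		/* lookup key; stable while hash_lock is held */
	int dropped;		/* set once torn down; protected by lock */
};

/* Hypothetical stand-ins for ipi_hash_lock and the table it guards. */
static pthread_mutex_t hash_lock = PTHREAD_MUTEX_INITIALIZER;
static struct conn table[NCONN] = {
	{ PTHREAD_MUTEX_INITIALIZER, 0, 80, 0 },
	{ PTHREAD_MUTEX_INITIALIZER, 0, 443, 0 },
};

/* Look up a connection by port and return it locked, or NULL. */
struct conn *
conn_lookup_locked(int port)
{
	struct conn *c = NULL;
	int i;

	/* Step 1: search under the hash lock and take a reference. */
	pthread_mutex_lock(&hash_lock);
	for (i = 0; i < NCONN; i++) {
		if (table[i].port == port) {
			c = &table[i];
			atomic_fetch_add(&c->refcount, 1);
			break;
		}
	}
	pthread_mutex_unlock(&hash_lock);
	if (c == NULL)
		return (NULL);

	/*
	 * Step 2: the hash lock is no longer held, so taking the
	 * connection lock here cannot invert the lock order.
	 */
	pthread_mutex_lock(&c->lock);
	atomic_fetch_sub(&c->refcount, 1);
	assert(atomic_load(&c->refcount) >= 0);

	/* Step 3: the connection may have been dropped in between. */
	if (c->dropped) {
		pthread_mutex_unlock(&c->lock);
		return (NULL);
	}
	return (c);
}
```

The sketch never frees a connection, so the reference only marks the window in which the object must stay alive; a real implementation would free the object when the last reference goes away.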
|
|
|
goto out;
|
2000-01-09 19:17:30 +00:00
|
|
|
}
|
2014-09-10 13:17:35 +00:00
|
|
|
error = in6_pcbconnect(inp, nam, td->td_ucred);
|
|
|
|
if (error != 0)
|
2011-05-30 09:43:55 +00:00
|
|
|
goto out;
|
|
|
|
INP_HASH_WUNLOCK(&V_tcbinfo);
|
2000-01-09 19:17:30 +00:00
|
|
|
|
|
|
|
/* Compute window scaling to request. */
|
|
|
|
while (tp->request_r_scale < TCP_MAX_WINSHIFT &&
|
2009-04-07 14:42:40 +00:00
|
|
|
(TCP_MAXWIN << tp->request_r_scale) < sb_max)
|
2000-01-09 19:17:30 +00:00
|
|
|
tp->request_r_scale++;
|
|
|
|
|
2014-09-10 13:17:35 +00:00
|
|
|
soisconnecting(inp->inp_socket);
|
2009-04-11 22:07:19 +00:00
|
|
|
TCPSTAT_INC(tcps_connattempt);
|
2013-08-25 21:54:41 +00:00
|
|
|
tcp_state_change(tp, TCPS_SYN_SENT);
|
2018-08-19 14:56:10 +00:00
|
|
|
tp->iss = tcp_new_isn(&inp->inp_inc);
|
|
|
|
if (tp->t_flags & TF_REQ_TSTMP)
|
|
|
|
tp->ts_offset = tcp_new_ts_offset(&inp->inp_inc);
|
2000-01-09 19:17:30 +00:00
|
|
|
tcp_sendseqinit(tp);
|
|
|
|
|
|
|
|
return 0;
|
2011-05-30 09:43:55 +00:00
|
|
|
|
|
|
|
out:
|
|
|
|
INP_HASH_WUNLOCK(&V_tcbinfo);
|
|
|
|
return error;
|
2000-01-09 19:17:30 +00:00
|
|
|
}
|
|
|
|
#endif /* INET6 */
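The "Compute window scaling to request" loop above picks the smallest shift for which the scaled 16-bit window covers the maximum socket buffer size, capped at the protocol limit. A minimal standalone sketch of that arithmetic follows; `request_scale()` is a hypothetical helper name, and the two constants are restated here on the assumption that they match the kernel headers (65535 and 14).

```c
#include <assert.h>
#include <stdint.h>

#define TCP_MAXWIN		65535	/* largest unscaled window (16-bit field) */
#define TCP_MAX_WINSHIFT	14	/* largest shift permitted by RFC 7323 */

/* Return the window scale shift to request for a given buffer limit. */
int
request_scale(uint64_t sb_max)
{
	int scale = 0;

	/* Grow the shift until the scaled window reaches sb_max or the cap. */
	while (scale < TCP_MAX_WINSHIFT &&
	    ((uint64_t)TCP_MAXWIN << scale) < sb_max)
		scale++;
	assert(scale <= TCP_MAX_WINSHIFT);
	return (scale);
}
```

For example, with a 2 MiB limit, `request_scale(2097152)` yields 6, since 65535 << 5 is 2097120 (still below the limit) and 65535 << 6 exceeds it.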
|
|
|
|
|
2004-11-26 18:58:46 +00:00
|
|
|
/*
|
|
|
|
* Export TCP internal state information via a struct tcp_info, based on the
|
|
|
|
* Linux 2.6 API. Not ABI compatible as our constants are mapped differently
|
|
|
|
* (TCP state machine, etc). We export all information using FreeBSD-native
|
|
|
|
* constants -- for example, the numeric values for tcpi_state will differ
|
|
|
|
* from Linux.
|
|
|
|
*/
|
|
|
|
static void
|
2007-03-21 19:37:55 +00:00
|
|
|
tcp_fill_info(struct tcpcb *tp, struct tcp_info *ti)
|
2004-11-26 18:58:46 +00:00
|
|
|
{
|
|
|
|
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK_ASSERT(tp->t_inpcb);
|
2004-11-26 18:58:46 +00:00
|
|
|
bzero(ti, sizeof(*ti));
|
|
|
|
|
|
|
|
ti->tcpi_state = tp->t_state;
|
|
|
|
if ((tp->t_flags & TF_REQ_TSTMP) && (tp->t_flags & TF_RCVD_TSTMP))
|
|
|
|
ti->tcpi_options |= TCPI_OPT_TIMESTAMPS;
|
2007-05-06 15:56:31 +00:00
|
|
|
if (tp->t_flags & TF_SACK_PERMIT)
|
2004-11-26 18:58:46 +00:00
|
|
|
ti->tcpi_options |= TCPI_OPT_SACK;
|
|
|
|
if ((tp->t_flags & TF_REQ_SCALE) && (tp->t_flags & TF_RCVD_SCALE)) {
|
|
|
|
ti->tcpi_options |= TCPI_OPT_WSCALE;
|
|
|
|
ti->tcpi_snd_wscale = tp->snd_scale;
|
|
|
|
ti->tcpi_rcv_wscale = tp->rcv_scale;
|
|
|
|
}
|
2019-12-01 21:01:33 +00:00
|
|
|
if (tp->t_flags2 & TF2_ECN_PERMIT)
|
2016-09-14 14:48:00 +00:00
|
|
|
ti->tcpi_options |= TCPI_OPT_ECN;
|
2007-02-02 18:34:18 +00:00
|
|
|
|
2009-12-22 15:47:40 +00:00
|
|
|
ti->tcpi_rto = tp->t_rxtcur * tick;
|
2016-10-06 16:28:34 +00:00
|
|
|
ti->tcpi_last_data_recv = ((uint32_t)ticks - tp->t_rcvtime) * tick;
|
2007-02-02 18:34:18 +00:00
|
|
|
ti->tcpi_rtt = ((u_int64_t)tp->t_srtt * tick) >> TCP_RTT_SHIFT;
|
|
|
|
ti->tcpi_rttvar = ((u_int64_t)tp->t_rttvar * tick) >> TCP_RTTVAR_SHIFT;
|
|
|
|
|
2004-11-26 18:58:46 +00:00
|
|
|
ti->tcpi_snd_ssthresh = tp->snd_ssthresh;
|
|
|
|
ti->tcpi_snd_cwnd = tp->snd_cwnd;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* FreeBSD-specific extension fields for tcp_info.
|
|
|
|
*/
|
2004-11-27 20:20:11 +00:00
|
|
|
ti->tcpi_rcv_space = tp->rcv_wnd;
|
2008-05-05 20:13:31 +00:00
|
|
|
ti->tcpi_rcv_nxt = tp->rcv_nxt;
|
2004-11-26 18:58:46 +00:00
|
|
|
ti->tcpi_snd_wnd = tp->snd_wnd;
|
2010-09-16 21:06:45 +00:00
|
|
|
ti->tcpi_snd_bwnd = 0; /* Unused, kept for compat. */
|
2008-05-05 23:13:27 +00:00
|
|
|
ti->tcpi_snd_nxt = tp->snd_nxt;
|
2009-12-22 15:47:40 +00:00
|
|
|
ti->tcpi_snd_mss = tp->t_maxseg;
|
|
|
|
ti->tcpi_rcv_mss = tp->t_maxseg;
|
2010-11-17 18:55:12 +00:00
|
|
|
ti->tcpi_snd_rexmitpack = tp->t_sndrexmitpack;
|
|
|
|
ti->tcpi_rcv_ooopack = tp->t_rcvoopack;
|
|
|
|
ti->tcpi_snd_zerowin = tp->t_sndzerowin;
|
2018-04-03 01:08:54 +00:00
|
|
|
#ifdef TCP_OFFLOAD
|
|
|
|
if (tp->t_flags & TF_TOE) {
|
|
|
|
ti->tcpi_options |= TCPI_OPT_TOE;
|
|
|
|
tcp_offload_tcp_info(tp, ti);
|
|
|
|
}
|
|
|
|
#endif
|
2004-11-26 18:58:46 +00:00
|
|
|
}
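The tcpi_rtt and tcpi_rttvar assignments in tcp_fill_info() convert fixed-point tick counts to microseconds by multiplying in 64 bits first and shifting afterwards. The sketch below works through that arithmetic in isolation. `scaled_ticks_to_usec()` is a hypothetical helper, `tick_us` stands in for the kernel's `tick` variable (microseconds per clock tick), and the shift values 5 and 4 are assumptions about the kernel headers rather than something stated in this file.

```c
#include <assert.h>
#include <stdint.h>

#define TCP_RTT_SHIFT		5	/* assumed: srtt carries 5 fractional bits */
#define TCP_RTTVAR_SHIFT	4	/* assumed: rttvar carries 4 fractional bits */

/* Convert a scaled tick count to microseconds. */
uint32_t
scaled_ticks_to_usec(int scaled, int tick_us, int shift)
{
	assert(scaled >= 0 && tick_us > 0);
	/*
	 * Widen before multiplying so the product cannot overflow a
	 * 32-bit int, and shift last so fractional ticks still count.
	 */
	return ((uint32_t)(((uint64_t)scaled * tick_us) >> shift));
}
```

With a 1000 Hz clock (`tick_us` of 1000), a smoothed RTT of 50 ticks is stored as 50 << 5 = 1600 and converts to 50000 microseconds.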
|
|
|
|
|
1998-08-23 03:07:17 +00:00
|
|
|
/*
|
2008-01-18 12:19:50 +00:00
|
|
|
* tcp_ctloutput() must drop the inpcb lock before performing copyin on
|
|
|
|
* socket option arguments. When it re-acquires the lock after the copy, it
|
|
|
|
* has to revalidate that the connection is still valid for the socket
|
|
|
|
* option.
|
1998-08-23 03:07:17 +00:00
|
|
|
*/
|
2016-04-26 23:02:18 +00:00
|
|
|
#define INP_WLOCK_RECHECK_CLEANUP(inp, cleanup) do { \
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK(inp); \
|
2009-03-15 09:58:31 +00:00
|
|
|
if (inp->inp_flags & (INP_TIMEWAIT | INP_DROPPED)) { \
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp); \
|
2016-04-26 23:02:18 +00:00
|
|
|
cleanup; \
|
2008-01-18 12:19:50 +00:00
|
|
|
return (ECONNRESET); \
|
|
|
|
} \
|
|
|
|
tp = intotcpcb(inp); \
|
|
|
|
} while(0)
|
2016-04-26 23:02:18 +00:00
|
|
|
#define INP_WLOCK_RECHECK(inp) INP_WLOCK_RECHECK_CLEANUP((inp), /* noop */)
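The drop, copy, re-lock and revalidate sequence that INP_WLOCK_RECHECK supports can be sketched in userland C as below. Everything here is a hypothetical stand-in: `struct conn` for the inpcb, `CONN_DROPPED` for the INP_TIMEWAIT and INP_DROPPED flags, the `copy_fn` callbacks for copyin, and a POSIX mutex for the inpcb lock. `copy_then_drop()` simulates another thread tearing the connection down while the lock is released.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <string.h>

#define CONN_DROPPED 0x1

struct conn {
	pthread_mutex_t lock;
	int flags;	/* CONN_DROPPED once torn down; protected by lock */
	int option;	/* value set by the handler; protected by lock */
};

/* The argument copy step, which runs with the lock released. */
typedef int (*copy_fn)(struct conn *, int *, const int *);

/* Plain copy: stands in for a copy that succeeds uneventfully. */
int
copy_ok(struct conn *c, int *dst, const int *src)
{
	(void)c;
	memcpy(dst, src, sizeof(*dst));
	return (0);
}

/* Copy during which "another thread" drops the connection. */
int
copy_then_drop(struct conn *c, int *dst, const int *src)
{
	memcpy(dst, src, sizeof(*dst));
	pthread_mutex_lock(&c->lock);
	c->flags |= CONN_DROPPED;
	pthread_mutex_unlock(&c->lock);
	return (0);
}

int
conn_set_option(struct conn *c, const int *uarg, copy_fn copy)
{
	int error, val;

	assert(c != NULL && uarg != NULL);
	pthread_mutex_lock(&c->lock);
	if (c->flags & CONN_DROPPED) {
		pthread_mutex_unlock(&c->lock);
		return (ECONNRESET);
	}
	pthread_mutex_unlock(&c->lock);	/* never hold the lock across the copy */

	error = copy(c, &val, uarg);
	if (error != 0)
		return (error);

	pthread_mutex_lock(&c->lock);	/* re-acquire ... */
	if (c->flags & CONN_DROPPED) {	/* ... and revalidate */
		pthread_mutex_unlock(&c->lock);
		return (ECONNRESET);
	}
	c->option = val;
	pthread_mutex_unlock(&c->lock);
	return (0);
}
```

Without the second check, the handler would write to a connection that was dropped while the lock was released.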
|
2008-01-18 12:19:50 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
int
|
2007-03-21 19:37:55 +00:00
|
|
|
tcp_ctloutput(struct socket *so, struct sockopt *sopt)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
2015-12-16 00:56:45 +00:00
|
|
|
int error;
|
1998-08-23 03:07:17 +00:00
|
|
|
struct inpcb *inp;
|
|
|
|
struct tcpcb *tp;
|
2015-12-16 00:56:45 +00:00
|
|
|
struct tcp_function_block *blk;
|
|
|
|
struct tcp_function_set fsn;
|
1994-05-24 10:09:53 +00:00
|
|
|
|
1998-08-23 03:07:17 +00:00
|
|
|
error = 0;
|
1994-05-24 10:09:53 +00:00
|
|
|
inp = sotoinpcb(so);
|
Update TCP for infrastructural changes to the socket/pcb refcount model,
pru_abort(), pru_detach(), and in_pcbdetach():
- Universally support and enforce the invariant that so_pcb is
never NULL, converting dozens of unnecessary NULL checks into
assertions, and eliminating dozens of unnecessary error handling
cases in protocol code.
- In some cases, eliminate unnecessary pcbinfo locking, as it is no
longer required to ensure so_pcb != NULL. For example, the receive
code no longer requires the pcbinfo lock, and the send code only
requires it if building a new connection on an otherwise unconnected
socket triggered via sendto() with an address. This should
significantly reduce tcbinfo lock contention in the receive and send
cases.
- In order to support the invariant that so_pcb != NULL, it is now
necessary for the TCP code to not discard the tcpcb any time a
connection is dropped, but instead leave the tcpcb until the socket
is shutdown. This case is handled by setting INP_DROPPED, to
substitute for using a NULL so_pcb to indicate that the connection
has been dropped. This requires the inpcb lock, but not the pcbinfo
lock.
- Unlike all other protocols in the tree, TCP may need to retain access
to the socket after the file descriptor has been closed. Set
SS_PROTOREF in tcp_detach() in order to prevent the socket from being
freed, and add a flag, INP_SOCKREF, so that the TCP code knows whether
or not it needs to free the socket when the connection finally does
close. The typical case where this occurs is if close() is called on
a TCP socket before all sent data in the send socket buffer has been
transmitted or acknowledged. If INP_SOCKREF is found when the
connection is dropped, we release the inpcb, tcpcb, and socket instead
of flagging INP_DROPPED.
- Abort and detach protocol switch methods no longer return failures,
nor attempt to free sockets, as the socket layer does this.
- Annotate the existence of a long-standing race in the TCP timer code,
in which timers are stopped but not drained when the socket is freed,
as waiting for drain may lead to deadlocks, or have to occur in a
context where waiting is not permitted. This race has been handled
by testing to see if the tcpcb pointer in the inpcb is NULL (and vice
versa), which is not normally permitted, but may be true if an inpcb
and tcpcb have been freed. Add a counter to test how often this race
has actually occurred, and a large comment for each instance where
we compare potentially freed memory with NULL. This will have to be
fixed in the near future, but requires us to further address how to
handle the timer shutdown issue.
- Several TCP calls no longer potentially free the passed inpcb/tcpcb,
so no longer need to return a pointer to indicate whether the argument
passed in is still valid.
- Un-macroize debugging and locking setup for various protocol switch
methods for TCP, as it led to more obscurity, and as locking becomes
more customized to the methods, offers less benefit.
- Assert copyright on tcp_usrreq.c due to significant modifications that
have been made as part of this work.
These changes significantly modify the memory management and connection
logic of our TCP implementation, and are (as such) High Risk Changes,
and likely to contain serious bugs. Please report problems to the
current@ mailing list ASAP, ideally with simple test cases, and
optionally, packet traces.
MFC after: 3 months
2006-04-01 16:36:36 +00:00
|
|
|
KASSERT(inp != NULL, ("tcp_ctloutput: inp == NULL"));
|
1998-08-23 03:07:17 +00:00
|
|
|
if (sopt->sopt_level != IPPROTO_TCP) {
|
2000-01-09 19:17:30 +00:00
|
|
|
#ifdef INET6
|
2008-11-27 13:19:42 +00:00
|
|
|
if (inp->inp_vflag & INP_IPV6PROTO) {
|
2000-01-09 19:17:30 +00:00
|
|
|
error = ip6_ctloutput(so, sopt);
|
2018-08-21 14:12:30 +00:00
|
|
|
/*
|
|
|
|
* In case of the IPV6_USE_MIN_MTU socket option,
|
|
|
|
* set the INC_IPV6MINMTU flag to announce a corresponding
|
|
|
|
* MSS during the initial handshake.
|
|
|
|
* If the TCP connection is not in the front states,
|
|
|
|
* just reduce the MSS being used.
|
|
|
|
* This avoids the sending of TCP segments which will
|
|
|
|
* be fragmented at the IPv6 layer.
|
|
|
|
*/
|
|
|
|
if ((error == 0) &&
|
|
|
|
(sopt->sopt_dir == SOPT_SET) &&
|
|
|
|
(sopt->sopt_level == IPPROTO_IPV6) &&
|
|
|
|
(sopt->sopt_name == IPV6_USE_MIN_MTU)) {
|
|
|
|
INP_WLOCK(inp);
|
|
|
|
if ((inp->inp_flags &
|
|
|
|
(INP_TIMEWAIT | INP_DROPPED))) {
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
return (ECONNRESET);
|
|
|
|
}
|
|
|
|
inp->inp_inc.inc_flags |= INC_IPV6MINMTU;
|
|
|
|
tp = intotcpcb(inp);
|
|
|
|
if ((tp->t_state >= TCPS_SYN_SENT) &&
|
|
|
|
(inp->inp_inc.inc_flags & INC_ISIPV6)) {
|
|
|
|
struct ip6_pktopts *opt;
|
|
|
|
|
|
|
|
opt = inp->in6p_outputopts;
|
|
|
|
if ((opt != NULL) &&
|
|
|
|
(opt->ip6po_minmtu ==
|
|
|
|
IP6PO_MINMTU_ALL)) {
|
|
|
|
if (tp->t_maxseg > TCP6_MSS) {
|
|
|
|
tp->t_maxseg = TCP6_MSS;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
}
|
2011-04-30 11:21:29 +00:00
|
|
|
}
|
2000-01-09 19:17:30 +00:00
|
|
|
#endif /* INET6 */
|
2011-04-30 11:21:29 +00:00
|
|
|
#if defined(INET6) && defined(INET)
|
|
|
|
else
|
|
|
|
#endif
|
|
|
|
#ifdef INET
|
|
|
|
{
|
2008-01-18 12:19:50 +00:00
|
|
|
error = ip_ctloutput(so, sopt);
|
|
|
|
}
|
|
|
|
#endif
|
1994-05-24 10:09:53 +00:00
|
|
|
return (error);
|
|
|
|
}
|
2019-04-18 23:21:26 +00:00
|
|
|
INP_WLOCK(inp);
|
2009-03-15 09:58:31 +00:00
|
|
|
if (inp->inp_flags & (INP_TIMEWAIT | INP_DROPPED)) {
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2008-01-18 12:19:50 +00:00
|
|
|
return (ECONNRESET);
|
2006-04-01 16:36:36 +00:00
|
|
|
}
|
2015-12-16 00:56:45 +00:00
|
|
|
tp = intotcpcb(inp);
|
|
|
|
/*
|
|
|
|
* Protect the TCP option TCP_FUNCTION_BLK so
|
|
|
|
* that a sub-function can *never* overwrite this.
|
|
|
|
*/
|
2020-02-12 13:31:36 +00:00
|
|
|
if ((sopt->sopt_dir == SOPT_SET) &&
|
2015-12-16 00:56:45 +00:00
|
|
|
(sopt->sopt_name == TCP_FUNCTION_BLK)) {
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
error = sooptcopyin(sopt, &fsn, sizeof fsn,
|
|
|
|
sizeof fsn);
|
|
|
|
if (error)
|
|
|
|
return (error);
|
|
|
|
INP_WLOCK_RECHECK(inp);
|
|
|
|
blk = find_and_ref_tcp_functions(&fsn);
|
|
|
|
if (blk == NULL) {
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
return (ENOENT);
|
|
|
|
}
|
2016-08-16 15:11:46 +00:00
|
|
|
if (tp->t_fb == blk) {
|
|
|
|
/* You already have this */
|
|
|
|
refcount_release(&blk->tfb_refcnt);
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
if (tp->t_state != TCPS_CLOSED) {
|
2020-02-12 13:31:36 +00:00
|
|
|
/*
|
2016-08-16 15:11:46 +00:00
|
|
|
* The user has advanced the state
|
|
|
|
* past the initial point, we may not
|
2020-02-12 13:31:36 +00:00
|
|
|
* be able to switch.
|
2016-08-16 15:11:46 +00:00
|
|
|
*/
|
|
|
|
if (blk->tfb_tcp_handoff_ok != NULL) {
|
2020-02-12 13:31:36 +00:00
|
|
|
/*
|
2016-08-16 15:11:46 +00:00
|
|
|
* Does the stack provide a
|
|
|
|
* query mechanism? If so, it may
|
|
|
|
* still be possible.
|
|
|
|
*/
|
|
|
|
error = (*blk->tfb_tcp_handoff_ok)(tp);
|
2018-08-24 10:50:19 +00:00
|
|
|
} else
|
|
|
|
error = EINVAL;
|
2016-08-16 15:11:46 +00:00
|
|
|
if (error) {
|
2015-12-16 00:56:45 +00:00
|
|
|
refcount_release(&blk->tfb_refcnt);
|
|
|
|
INP_WUNLOCK(inp);
|
2016-08-16 15:11:46 +00:00
|
|
|
return (error);
|
2015-12-16 00:56:45 +00:00
|
|
|
}
|
2016-08-16 15:11:46 +00:00
|
|
|
}
|
|
|
|
if (blk->tfb_flags & TCP_FUNC_BEING_REMOVED) {
|
|
|
|
refcount_release(&blk->tfb_refcnt);
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
return (ENOENT);
|
|
|
|
}
|
2020-02-12 13:31:36 +00:00
|
|
|
/*
|
2016-08-16 15:11:46 +00:00
|
|
|
* Release the old refcnt, the
|
|
|
|
* lookup acquired a ref on the
|
|
|
|
* new one already.
|
|
|
|
*/
|
|
|
|
if (tp->t_fb->tfb_tcp_fb_fini) {
|
2021-05-25 13:45:37 -04:00
|
|
|
struct epoch_tracker et;
|
2020-02-12 13:31:36 +00:00
|
|
|
/*
|
2016-08-16 15:11:46 +00:00
|
|
|
* Tell the stack to clean up with 0, i.e.
|
|
|
|
* the tcb is not going away.
|
2015-12-16 00:56:45 +00:00
|
|
|
*/
|
2021-05-25 13:45:37 -04:00
|
|
|
NET_EPOCH_ENTER(et);
|
2016-08-16 15:11:46 +00:00
|
|
|
(*tp->t_fb->tfb_tcp_fb_fini)(tp, 0);
|
2021-05-25 13:45:37 -04:00
|
|
|
NET_EPOCH_EXIT(et);
|
2016-08-16 15:11:46 +00:00
|
|
|
}
|
2020-02-12 13:31:36 +00:00
|
|
|
#ifdef TCPHPTS
|
2018-04-19 13:37:59 +00:00
|
|
|
/* Ensure that we are not on any hpts */
|
|
|
|
tcp_hpts_remove(tp->t_inpcb, HPTS_REMOVE_ALL);
|
|
|
|
#endif
|
|
|
|
if (blk->tfb_tcp_fb_init) {
|
|
|
|
error = (*blk->tfb_tcp_fb_init)(tp);
|
|
|
|
if (error) {
|
|
|
|
refcount_release(&blk->tfb_refcnt);
|
|
|
|
if (tp->t_fb->tfb_tcp_fb_init) {
|
|
|
|
if ((*tp->t_fb->tfb_tcp_fb_init)(tp) != 0) {
|
|
|
|
/* Fall back failed, drop the connection */
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
soabort(so);
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
goto err_out;
|
|
|
|
}
|
|
|
|
}
|
2016-08-16 15:11:46 +00:00
|
|
|
refcount_release(&tp->t_fb->tfb_refcnt);
|
|
|
|
tp->t_fb = blk;
|
2015-12-16 00:56:45 +00:00
|
|
|
#ifdef TCP_OFFLOAD
|
|
|
|
if (tp->t_flags & TF_TOE) {
|
|
|
|
tcp_offload_ctloutput(tp, sopt->sopt_dir,
|
|
|
|
sopt->sopt_name);
|
|
|
|
}
|
|
|
|
#endif
|
2018-04-19 13:37:59 +00:00
|
|
|
err_out:
|
2015-12-16 00:56:45 +00:00
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
return (error);
|
2020-02-12 13:31:36 +00:00
|
|
|
} else if ((sopt->sopt_dir == SOPT_GET) &&
|
2015-12-16 00:56:45 +00:00
|
|
|
(sopt->sopt_name == TCP_FUNCTION_BLK)) {
|
2018-04-04 21:12:35 +00:00
|
|
|
strncpy(fsn.function_set_name, tp->t_fb->tfb_tcp_block_name,
|
|
|
|
TCP_FUNCTION_NAME_LEN_MAX);
|
|
|
|
fsn.function_set_name[TCP_FUNCTION_NAME_LEN_MAX - 1] = '\0';
|
2015-12-16 00:56:45 +00:00
|
|
|
fsn.pcbcnt = tp->t_fb->tfb_refcnt;
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
error = sooptcopyout(sopt, &fsn, sizeof fsn);
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
/* Pass in the INP locked; the callee must unlock it. */
|
|
|
|
return (tp->t_fb->tfb_tcp_ctloutput(so, sopt, inp, tp));
|
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2018-03-22 09:40:08 +00:00
|
|
|
/*
|
|
|
|
* If this assert becomes untrue, we need to change the size of the buf
|
|
|
|
* variable in tcp_default_ctloutput().
|
|
|
|
*/
|
|
|
|
#ifdef CTASSERT
|
|
|
|
CTASSERT(TCP_CA_NAME_MAX <= TCP_LOG_ID_LEN);
|
|
|
|
CTASSERT(TCP_LOG_REASON_LEN <= TCP_LOG_ID_LEN);
|
|
|
|
#endif
|
|
|
|
|
2020-04-27 22:31:42 +00:00
|
|
|
#ifdef KERN_TLS
|
|
|
|
static int
|
|
|
|
copyin_tls_enable(struct sockopt *sopt, struct tls_enable *tls)
|
|
|
|
{
|
|
|
|
struct tls_enable_v0 tls_v0;
|
|
|
|
int error;
|
|
|
|
|
|
|
|
if (sopt->sopt_valsize == sizeof(tls_v0)) {
|
|
|
|
error = sooptcopyin(sopt, &tls_v0, sizeof(tls_v0),
|
|
|
|
sizeof(tls_v0));
|
|
|
|
if (error)
|
|
|
|
return (error);
|
|
|
|
memset(tls, 0, sizeof(*tls));
|
|
|
|
tls->cipher_key = tls_v0.cipher_key;
|
|
|
|
tls->iv = tls_v0.iv;
|
|
|
|
tls->auth_key = tls_v0.auth_key;
|
|
|
|
tls->cipher_algorithm = tls_v0.cipher_algorithm;
|
|
|
|
tls->cipher_key_len = tls_v0.cipher_key_len;
|
|
|
|
tls->iv_len = tls_v0.iv_len;
|
|
|
|
tls->auth_algorithm = tls_v0.auth_algorithm;
|
|
|
|
tls->auth_key_len = tls_v0.auth_key_len;
|
|
|
|
tls->flags = tls_v0.flags;
|
|
|
|
tls->tls_vmajor = tls_v0.tls_vmajor;
|
|
|
|
tls->tls_vminor = tls_v0.tls_vminor;
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
return (sooptcopyin(sopt, tls, sizeof(*tls), sizeof(*tls)));
|
|
|
|
}
|
|
|
|
#endif
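copyin_tls_enable() above accepts either the older or the current argument layout and tells them apart by size, zero-filling fields the older layout lacks. A reduced userland sketch of that technique follows. The structures, their fields and `copyin_opt()` are hypothetical, and `memcpy` stands in for sooptcopyin(); unlike the kernel code, the sketch rejects any size that matches neither layout.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical older layout of an option argument. */
struct opt_v0 {
	int algorithm;
	int key_len;
};

/* Hypothetical current layout, with one field added. */
struct opt {
	int algorithm;
	int key_len;
	int rec_seq;
};

/* Accept either layout, selected by the size the caller passed. */
int
copyin_opt(const void *uarg, size_t len, struct opt *out)
{
	struct opt_v0 v0;

	assert(uarg != NULL && out != NULL);
	if (len == sizeof(v0)) {
		/* Old callers: copy the old layout, then convert it. */
		memcpy(&v0, uarg, sizeof(v0));
		memset(out, 0, sizeof(*out));	/* new fields default to zero */
		out->algorithm = v0.algorithm;
		out->key_len = v0.key_len;
		return (0);
	}
	if (len != sizeof(*out))
		return (EINVAL);
	memcpy(out, uarg, sizeof(*out));
	return (0);
}
```

Selecting by size only works while the two layouts have different sizes, which holds for these example structures.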
|
|
|
|
|
2015-12-16 00:56:45 +00:00
|
|
|
int
|
|
|
|
tcp_default_ctloutput(struct socket *so, struct sockopt *sopt, struct inpcb *inp, struct tcpcb *tp)
|
|
|
|
{
|
|
|
|
int error, opt, optval;
|
|
|
|
u_int ui;
|
|
|
|
struct tcp_info ti;
|
Add kernel-side support for in-kernel TLS.
KTLS adds support for in-kernel framing and encryption of Transport
Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports
offload of TLS for transmitted data. Key negotiation must still be
performed in userland. Once completed, transmit session keys for a
connection are provided to the kernel via a new TCP_TXTLS_ENABLE
socket option. All subsequent data transmitted on the socket is
placed into TLS frames and encrypted using the supplied keys.
Any data written to a KTLS-enabled socket via write(2), aio_write(2),
or sendfile(2) is assumed to be application data and is encoded in TLS
frames with an application data type. Individual records can be sent
with a custom type (e.g. handshake messages) via sendmsg(2) with a new
control message (TLS_SET_RECORD_TYPE) specifying the record type.
At present, rekeying is not supported though the in-kernel framework
should support rekeying.
KTLS makes use of the recently added unmapped mbufs to store TLS
frames in the socket buffer. Each TLS frame is described by a single
ext_pgs mbuf. The ext_pgs structure contains the header of the TLS
record (and trailer for encrypted records) as well as references to
the associated TLS session.
KTLS supports two primary methods of encrypting TLS frames: software
TLS and ifnet TLS.
Software TLS marks mbufs holding socket data as not ready via
M_NOTREADY similar to sendfile(2) when TLS framing information is
added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then
called to schedule TLS frames for encryption. In the sendfile(2) case,
sendfile_iodone() calls ktls_enqueue() instead of pru_ready(), leaving
the mbufs marked M_NOTREADY until encryption is completed. For other
writes (vn_sendfile when pages are available, write(2), etc.), the
PRUS_NOTREADY is set when invoking pru_send() along with invoking
ktls_enqueue().
A pool of worker threads (the "KTLS" kernel process) encrypts TLS
frames queued via ktls_enqueue(). Each TLS frame is temporarily
mapped using the direct map and passed to a software encryption
backend to perform the actual encryption.
(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if
someone wished to make this work on architectures without a direct
map.)
KTLS supports pluggable software encryption backends. Internally,
Netflix uses proprietary pure-software backends. This commit includes
a simple backend in a new ktls_ocf.ko module that uses the kernel's
OpenCrypto framework to provide AES-GCM encryption of TLS frames. As
a result, software TLS is now a bit of a misnomer as it can make use
of hardware crypto accelerators.
Once software encryption has finished, the TLS frame mbufs are marked
ready via pru_ready(). At this point, the encrypted data appears as
regular payload to the TCP stack stored in unmapped mbufs.
ifnet TLS permits a NIC to offload the TLS encryption and TCP
segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS)
is allocated on the interface a socket is routed over and associated
with a TLS session. TLS records for a TLS session using ifnet TLS are
not marked M_NOTREADY but are passed down the stack unencrypted. The
ip_output_send() and ip6_output_send() helper functions that apply
send tags to outbound IP packets verify that the send tag of the TLS
record matches the outbound interface. If so, the packet is tagged
with the TLS send tag and sent to the interface. The NIC device
driver must recognize packets with the TLS send tag and schedule them
for TLS encryption and TCP segmentation. If the outbound
interface does not match the interface in the TLS send tag, the packet
is dropped. In addition, a task is scheduled to refresh the TLS send
tag for the TLS session. If a new TLS send tag cannot be allocated,
the connection is dropped. If a new TLS send tag is allocated,
however, subsequent packets will be tagged with the correct TLS send
tag. (This latter case has been tested by configuring both ports of a
Chelsio T6 in a lagg and failing over from one port to another. As
the connections migrated to the new port, new TLS send tags were
allocated for the new port and connections resumed without being
dropped.)
ifnet TLS can be enabled and disabled on supported network interfaces
via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported
across both vlan devices and lagg interfaces using failover or lacp
with flowid enabled.
Applications may request the current KTLS mode of a connection via a
new TCP_TXTLS_MODE socket option. They can also use this socket
option to toggle between software and ifnet TLS modes.
In addition, a testing tool is available in tools/tools/switch_tls.
This is modeled on tcpdrop and uses similar syntax. However, instead
of dropping connections, -s is used to force KTLS connections to
switch to software TLS and -i is used to switch to ifnet TLS.
Various sysctls and counters are available under the kern.ipc.tls
sysctl node. The kern.ipc.tls.enable node must be set to true to
enable KTLS (it is off by default). The use of unmapped mbufs must
also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.
KTLS is enabled via the KERN_TLS kernel option.
This patch is the culmination of years of work by several folks
including Scott Long and Randall Stewart for the original design and
implementation; Drew Gallatin for several optimizations including the
use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records
awaiting software encryption, and pluggable software crypto backends;
and John Baldwin for modifications to support hardware TLS offload.
Reviewed by: gallatin, hselasky, rrs
Obtained from: Netflix
Sponsored by: Netflix, Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D21277
2019-08-27 00:01:56 +00:00
|
|
|
#ifdef KERN_TLS
|
|
|
|
struct tls_enable tls;
|
|
|
|
#endif
|
2015-12-16 00:56:45 +00:00
|
|
|
struct cc_algo *algo;
|
2018-03-22 09:40:08 +00:00
|
|
|
char *pbuf, buf[TCP_LOG_ID_LEN];
|
2019-12-02 20:58:04 +00:00
|
|
|
#ifdef STATS
|
|
|
|
struct statsblob *sbp;
|
|
|
|
#endif
|
2016-01-27 07:34:00 +00:00
|
|
|
size_t len;
|
2016-01-22 02:07:48 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* For TCP_CCALGOOPT forward the control to CC module, for both
|
|
|
|
* SOPT_SET and SOPT_GET.
|
|
|
|
*/
|
|
|
|
switch (sopt->sopt_name) {
|
|
|
|
case TCP_CCALGOOPT:
|
|
|
|
INP_WUNLOCK(inp);
|
2018-11-30 10:50:07 +00:00
|
|
|
if (sopt->sopt_valsize > CC_ALGOOPT_LIMIT)
|
|
|
|
return (EINVAL);
|
2016-01-27 07:34:00 +00:00
|
|
|
pbuf = malloc(sopt->sopt_valsize, M_TEMP, M_WAITOK | M_ZERO);
|
|
|
|
error = sooptcopyin(sopt, pbuf, sopt->sopt_valsize,
|
2016-01-22 02:07:48 +00:00
|
|
|
sopt->sopt_valsize);
|
|
|
|
if (error) {
|
2016-01-27 07:34:00 +00:00
|
|
|
free(pbuf, M_TEMP);
|
2016-01-22 02:07:48 +00:00
|
|
|
return (error);
|
|
|
|
}
|
2016-04-26 23:02:18 +00:00
|
|
|
INP_WLOCK_RECHECK_CLEANUP(inp, free(pbuf, M_TEMP));
|
2016-01-22 02:07:48 +00:00
|
|
|
if (CC_ALGO(tp)->ctl_output != NULL)
|
2016-01-27 07:34:00 +00:00
|
|
|
error = CC_ALGO(tp)->ctl_output(tp->ccv, sopt, pbuf);
|
2016-01-22 02:07:48 +00:00
|
|
|
else
|
|
|
|
error = ENOENT;
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
if (error == 0 && sopt->sopt_dir == SOPT_GET)
|
2016-01-27 07:34:00 +00:00
|
|
|
error = sooptcopyout(sopt, pbuf, sopt->sopt_valsize);
|
|
|
|
free(pbuf, M_TEMP);
|
2016-01-22 02:07:48 +00:00
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
1998-08-23 03:07:17 +00:00
|
|
|
switch (sopt->sopt_dir) {
|
|
|
|
case SOPT_SET:
|
|
|
|
switch (sopt->sopt_name) {
|
2017-02-06 08:49:57 +00:00
|
|
|
#if defined(IPSEC_SUPPORT) || defined(TCP_SIGNATURE)
|
2004-02-16 22:21:16 +00:00
|
|
|
case TCP_MD5SIG:
|
2017-02-06 08:49:57 +00:00
|
|
|
if (!TCPMD5_ENABLED()) {
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
return (ENOPROTOOPT);
|
|
|
|
}
|
|
|
|
error = TCPMD5_PCBCTL(inp, sopt);
|
Initial import of RFC 2385 (TCP-MD5) digest support.
This is the first of two commits; bringing in the kernel support first.
This can be enabled by compiling a kernel with options TCP_SIGNATURE
and FAST_IPSEC.
For the uninitiated, this is a TCP option which provides for a means of
authenticating TCP sessions which came into being before IPSEC. It is
still relevant today, however, as it is used by many commercial router
vendors, particularly with BGP, and as such has become a requirement for
interconnect at many major Internet points of presence.
Several parts of the TCP and IP headers, including the segment payload,
are digested with MD5, including a shared secret. The PF_KEY interface
is used to manage the secrets using security associations in the SADB.
There is a limitation here in that there is no way to map a TCP flow
per-port back to an SPI without polluting tcpcb or using the SPD; the
code to do the latter is unstable at this time. Therefore this code only
supports per-host keying granularity.
Whilst FAST_IPSEC is mutually exclusive with KAME IPSEC (and thus IPv6),
TCP_SIGNATURE applies only to IPv4. For the vast majority of prospective
users of this feature, this will not pose any problem.
This implementation is output-only; that is, the option is honoured when
responding to a host initiating a TCP session, but no effort is made
[yet] to authenticate inbound traffic. This is, however, sufficient to
interwork with Cisco equipment.
Tested with a Cisco 2501 running IOS 12.0(27), and Quagga 0.96.4 with
local patches. Patches for tcpdump to validate TCP-MD5 sessions are also
available from me upon request.
Sponsored by: sentex.net
2004-02-11 04:26:04 +00:00
|
|
|
if (error)
|
2008-01-18 12:19:50 +00:00
|
|
|
return (error);
|
2012-06-19 07:34:13 +00:00
|
|
|
goto unlock_and_done;
|
2017-02-06 08:49:57 +00:00
|
|
|
#endif /* IPSEC_SUPPORT || TCP_SIGNATURE */
|
2012-06-19 07:34:13 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
case TCP_NODELAY:
|
1998-08-23 03:07:17 +00:00
|
|
|
case TCP_NOOPT:
|
2021-05-10 18:47:47 +02:00
|
|
|
case TCP_LRD:
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp);
|
1998-08-23 03:07:17 +00:00
|
|
|
error = sooptcopyin(sopt, &optval, sizeof optval,
|
2008-01-18 12:19:50 +00:00
|
|
|
sizeof optval);
|
1998-08-23 03:07:17 +00:00
|
|
|
if (error)
|
2008-01-18 12:19:50 +00:00
|
|
|
return (error);
|
1998-08-23 03:07:17 +00:00
|
|
|
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK_RECHECK(inp);
|
1998-08-23 03:07:17 +00:00
|
|
|
switch (sopt->sopt_name) {
|
|
|
|
case TCP_NODELAY:
|
|
|
|
opt = TF_NODELAY;
|
|
|
|
break;
|
|
|
|
case TCP_NOOPT:
|
|
|
|
opt = TF_NOOPT;
|
|
|
|
break;
|
2021-05-10 18:47:47 +02:00
|
|
|
case TCP_LRD:
|
|
|
|
opt = TF_LRD;
|
|
|
|
break;
|
1998-08-23 03:07:17 +00:00
|
|
|
default:
|
|
|
|
opt = 0; /* dead code to fool gcc */
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (optval)
|
|
|
|
tp->t_flags |= opt;
|
1994-05-24 10:09:53 +00:00
|
|
|
else
|
1998-08-23 03:07:17 +00:00
|
|
|
tp->t_flags &= ~opt;
|
2012-06-19 07:34:13 +00:00
|
|
|
unlock_and_done:
|
|
|
|
#ifdef TCP_OFFLOAD
|
|
|
|
if (tp->t_flags & TF_TOE) {
|
|
|
|
tcp_offload_ctloutput(tp, sopt->sopt_dir,
|
|
|
|
sopt->sopt_name);
|
|
|
|
}
|
|
|
|
#endif
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp);
|
1994-05-24 10:09:53 +00:00
|
|
|
break;
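The TCP_NODELAY / TCP_NOOPT / TCP_LRD case above maps the option name to a single flag bit and then sets or clears that bit depending on whether the copied-in value is nonzero. A minimal user-space sketch of that set-or-clear step follows. The `EX_TF_*` values and the `ex_apply_flag()` helper are hypothetical names chosen for illustration; the real TF_* definitions live in the kernel headers and are not reproduced here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag bits for illustration only (not the kernel's TF_* values). */
#define EX_TF_NODELAY	0x0004u
#define EX_TF_NOOPT	0x0008u

/*
 * Return 'flags' with 'opt' set when optval is nonzero and cleared
 * otherwise.  'opt' is expected to be a single bit, or 0 for the
 * "unknown option" default, in which case the flags are unchanged.
 */
static uint32_t
ex_apply_flag(uint32_t flags, uint32_t opt, int optval)
{
	/* At most one bit may be set in 'opt'. */
	assert((opt & (opt - 1)) == 0);

	if (optval)
		flags |= opt;
	else
		flags &= ~opt;
	return (flags);
}
```

Passing `opt == 0` mirrors the `default:` branch in the listing: both the OR and the AND-NOT leave the flags untouched.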
|
|
|
|
|
2001-02-02 18:48:25 +00:00
|
|
|
case TCP_NOPUSH:
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2001-02-02 18:48:25 +00:00
|
|
|
error = sooptcopyin(sopt, &optval, sizeof optval,
|
2008-01-18 12:19:50 +00:00
|
|
|
sizeof optval);
|
2001-02-02 18:48:25 +00:00
|
|
|
if (error)
|
2008-01-18 12:19:50 +00:00
|
|
|
return (error);
|
2001-02-02 18:48:25 +00:00
|
|
|
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK_RECHECK(inp);
|
2001-02-02 18:48:25 +00:00
|
|
|
if (optval)
|
|
|
|
tp->t_flags |= TF_NOPUSH;
|
2011-02-04 14:13:15 +00:00
|
|
|
else if (tp->t_flags & TF_NOPUSH) {
|
2001-02-02 18:48:25 +00:00
|
|
|
tp->t_flags &= ~TF_NOPUSH;
|
2020-01-22 05:53:16 +00:00
|
|
|
if (TCPS_HAVEESTABLISHED(tp->t_state)) {
|
|
|
|
struct epoch_tracker et;
|
|
|
|
|
|
|
|
NET_EPOCH_ENTER(et);
|
2015-12-16 00:56:45 +00:00
|
|
|
error = tp->t_fb->tfb_tcp_output(tp);
|
2020-01-22 05:53:16 +00:00
|
|
|
NET_EPOCH_EXIT(et);
|
|
|
|
}
|
2001-02-02 18:48:25 +00:00
|
|
|
}
|
2012-06-19 07:34:13 +00:00
|
|
|
goto unlock_and_done;
|
2001-02-02 18:48:25 +00:00
|
|
|
|
2021-04-18 16:08:08 +02:00
|
|
|
case TCP_REMOTE_UDP_ENCAPS_PORT:
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
error = sooptcopyin(sopt, &optval, sizeof optval,
|
|
|
|
sizeof optval);
|
|
|
|
if (error)
|
|
|
|
return (error);
|
|
|
|
if ((optval < TCP_TUNNELING_PORT_MIN) ||
|
|
|
|
(optval > TCP_TUNNELING_PORT_MAX)) {
|
|
|
|
/* It's got to be in range */
|
|
|
|
return (EINVAL);
|
|
|
|
}
|
|
|
|
if ((V_tcp_udp_tunneling_port == 0) && (optval != 0)) {
|
|
|
|
/* You have to have enabled a UDP tunneling port first */
|
|
|
|
return (EINVAL);
|
|
|
|
}
|
|
|
|
INP_WLOCK_RECHECK(inp);
|
|
|
|
if (tp->t_state != TCPS_CLOSED) {
|
|
|
|
/* You can't change after you are connected */
|
|
|
|
error = EINVAL;
|
|
|
|
} else {
|
|
|
|
/* OK, we are all good; set the port */
|
|
|
|
tp->t_port = htons(optval);
|
|
|
|
}
|
|
|
|
goto unlock_and_done;
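The TCP_REMOTE_UDP_ENCAPS_PORT case above applies three checks in order: the value must be in range, a nonzero port requires that a system-wide tunneling port is already configured, and the port may only change while the connection is still closed. The sketch below restates that ordering as a plain user-space function. `ex_set_encaps_port()`, the `EX_PORT_*` bounds, and the boolean `closed` parameter are assumptions made for illustration; in particular the bounds 0 and 65535 are assumed, not taken from the listing.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <arpa/inet.h>	/* htons() */

/* Assumed bounds; the listing uses TCP_TUNNELING_PORT_MIN/MAX. */
#define EX_PORT_MIN	0
#define EX_PORT_MAX	65535

/*
 * Validate 'optval' and, if acceptable, store it in network byte order
 * in *t_port.  'global_port' stands in for the system-wide tunneling
 * port and 'closed' for the "state == CLOSED" test.
 */
static int
ex_set_encaps_port(int optval, uint16_t global_port, int closed,
    uint16_t *t_port)
{
	assert(t_port != NULL);

	if (optval < EX_PORT_MIN || optval > EX_PORT_MAX)
		return (EINVAL);	/* out of range */
	if (global_port == 0 && optval != 0)
		return (EINVAL);	/* tunneling not enabled system-wide */
	if (!closed)
		return (EINVAL);	/* cannot change once connected */
	*t_port = htons((uint16_t)optval);
	return (0);
}
```

As in the listing, the first two checks need no connection state, so they can run before the lock is re-taken; only the state test and the store touch the connection.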
|
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
case TCP_MAXSEG:
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp);
|
1998-08-23 03:07:17 +00:00
|
|
|
error = sooptcopyin(sopt, &optval, sizeof optval,
|
2008-01-18 12:19:50 +00:00
|
|
|
sizeof optval);
|
1998-08-23 03:07:17 +00:00
|
|
|
if (error)
|
2008-01-18 12:19:50 +00:00
|
|
|
return (error);
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK_RECHECK(inp);
|
2004-01-08 17:40:07 +00:00
|
|
|
if (optval > 0 && optval <= tp->t_maxseg &&
|
Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).
This is the first in a series of commits over the course
of the next few weeks.
Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.
We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.
Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch
2008-08-17 23:27:27 +00:00
|
|
|
optval + 40 >= V_tcp_minmss)
|
1998-08-23 03:07:17 +00:00
|
|
|
tp->t_maxseg = optval;
|
1995-02-09 23:13:27 +00:00
|
|
|
else
|
|
|
|
error = EINVAL;
|
2012-06-19 07:34:13 +00:00
|
|
|
goto unlock_and_done;
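The TCP_MAXSEG case above accepts a new segment size only if it is positive, does not exceed the current value, and, once 40 bytes are added, is at least the configured minimum MSS. A small sketch of that predicate, assuming a hypothetical helper `ex_set_maxseg()` and passing the minimum MSS as a parameter instead of reading V_tcp_minmss:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Lower *t_maxseg to 'optval' if the request passes the same three
 * tests as the listing; otherwise leave it untouched and return EINVAL.
 * The "+ 40" is copied from the listing as-is.
 */
static int
ex_set_maxseg(int optval, int *t_maxseg, int minmss)
{
	assert(t_maxseg != NULL);

	if (optval > 0 && optval <= *t_maxseg && optval + 40 >= minmss) {
		*t_maxseg = optval;
		return (0);
	}
	return (EINVAL);
}
```

Because the upper bound is the current value, the option can only shrink the segment size: once it has been lowered, a later request for the old, larger value is rejected.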
|
1995-02-09 23:13:27 +00:00
|
|
|
|
2004-11-26 18:58:46 +00:00
|
|
|
case TCP_INFO:
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2004-11-26 18:58:46 +00:00
|
|
|
error = EINVAL;
|
|
|
|
break;
|
|
|
|
|
2019-12-02 20:58:04 +00:00
|
|
|
case TCP_STATS:
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
#ifdef STATS
|
|
|
|
error = sooptcopyin(sopt, &optval, sizeof optval,
|
|
|
|
sizeof optval);
|
|
|
|
if (error)
|
|
|
|
return (error);
|
|
|
|
|
|
|
|
if (optval > 0)
|
|
|
|
sbp = stats_blob_alloc(
|
|
|
|
V_tcp_perconn_stats_dflt_tpl, 0);
|
|
|
|
else
|
|
|
|
sbp = NULL;
|
|
|
|
|
|
|
|
INP_WLOCK_RECHECK(inp);
|
|
|
|
if ((tp->t_stats != NULL && sbp == NULL) ||
|
|
|
|
(tp->t_stats == NULL && sbp != NULL)) {
|
|
|
|
struct statsblob *t = tp->t_stats;
|
|
|
|
tp->t_stats = sbp;
|
|
|
|
sbp = t;
|
|
|
|
}
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
|
|
|
|
stats_blob_destroy(sbp);
|
|
|
|
#else
|
|
|
|
return (EOPNOTSUPP);
|
|
|
|
#endif /* !STATS */
|
|
|
|
break;
|
|
|
|
|
2010-11-12 06:41:55 +00:00
|
|
|
case TCP_CONGESTION:
|
|
|
|
INP_WUNLOCK(inp);
|
2016-01-27 07:34:00 +00:00
|
|
|
error = sooptcopyin(sopt, buf, TCP_CA_NAME_MAX - 1, 1);
|
|
|
|
if (error)
|
2010-11-12 06:41:55 +00:00
|
|
|
break;
|
2016-01-27 07:34:00 +00:00
|
|
|
buf[sopt->sopt_valsize] = '\0';
|
|
|
|
INP_WLOCK_RECHECK(inp);
|
2016-01-21 22:53:12 +00:00
|
|
|
CC_LIST_RLOCK();
|
|
|
|
STAILQ_FOREACH(algo, &cc_list, entries)
|
|
|
|
if (strncmp(buf, algo->name,
|
|
|
|
TCP_CA_NAME_MAX) == 0)
|
|
|
|
break;
|
|
|
|
CC_LIST_RUNLOCK();
|
|
|
|
if (algo == NULL) {
|
2016-01-27 07:34:00 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2016-01-21 22:53:12 +00:00
|
|
|
error = EINVAL;
|
|
|
|
break;
|
|
|
|
}
|
2010-11-12 06:41:55 +00:00
|
|
|
/*
|
2016-01-21 22:53:12 +00:00
|
|
|
* We hold a write lock over the tcb so it's safe to
|
|
|
|
* do these things without ordering concerns.
|
2010-11-12 06:41:55 +00:00
|
|
|
*/
|
2016-01-21 22:53:12 +00:00
|
|
|
if (CC_ALGO(tp)->cb_destroy != NULL)
|
|
|
|
CC_ALGO(tp)->cb_destroy(tp->ccv);
|
2018-07-22 05:37:58 +00:00
|
|
|
CC_DATA(tp) = NULL;
|
2016-01-21 22:53:12 +00:00
|
|
|
CC_ALGO(tp) = algo;
|
|
|
|
/*
|
|
|
|
* If something goes pear shaped initialising the new
|
|
|
|
* algo, fall back to newreno (which does not
|
|
|
|
* require initialisation).
|
|
|
|
*/
|
|
|
|
if (algo->cb_init != NULL &&
|
|
|
|
algo->cb_init(tp->ccv) != 0) {
|
|
|
|
CC_ALGO(tp) = &newreno_cc_algo;
|
|
|
|
/*
|
|
|
|
* The only reason init should fail is
|
|
|
|
* because of malloc.
|
|
|
|
*/
|
|
|
|
error = ENOMEM;
|
2010-11-12 06:41:55 +00:00
|
|
|
}
|
2016-01-21 22:53:12 +00:00
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
break;
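The TCP_CONGESTION case above walks the list of registered congestion control algorithms and stops at the first entry whose name matches the copied-in string; a NULL result after the loop means "not found". The lookup can be sketched in user space as below. `struct ex_cc_algo`, `ex_cc_lookup()`, and the value 16 for `EX_CA_NAME_MAX` are illustrative assumptions; the kernel uses STAILQ macros, its own `struct cc_algo`, and a read lock around the walk.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define EX_CA_NAME_MAX	16	/* assumed stand-in for TCP_CA_NAME_MAX */

/* Simplified list node: just a name and a link to the next entry. */
struct ex_cc_algo {
	char name[EX_CA_NAME_MAX];
	const struct ex_cc_algo *next;
};

/*
 * Return the first algorithm whose name equals 'name' (compared over at
 * most EX_CA_NAME_MAX bytes), or NULL if the list has no such entry.
 */
static const struct ex_cc_algo *
ex_cc_lookup(const struct ex_cc_algo *head, const char *name)
{
	const struct ex_cc_algo *algo;

	assert(name != NULL);
	for (algo = head; algo != NULL; algo = algo->next)
		if (strncmp(name, algo->name, EX_CA_NAME_MAX) == 0)
			break;
	return (algo);
}
```

In the listing the result is then used to swap the connection's algorithm, falling back to newreno if the new algorithm's init hook fails; that part is not reproduced here.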
|
2010-11-12 06:41:55 +00:00
|
|
|
|
Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain
In order to efficiently serve web traffic on a NUMA
machine, one must avoid as many NUMA domain crossings as
possible. With SO_REUSEPORT_LB, a number of workers can share a
listen socket. However, even if a worker sets affinity to a core
or set of cores on a NUMA domain, it will receive connections
associated with all NUMA domains in the system. This will lead to
cross-domain traffic when the server writes to the socket or
calls sendfile(), and memory is allocated on the server's local
NUMA node, but transmitted on the NUMA node associated with the
TCP connection. Similarly, when the server reads from the socket,
it will likely be reading memory allocated on the NUMA domain
associated with the TCP connection.
This change provides a new socket option, TCP_REUSPORT_LB_NUMA. A
server can now tell the kernel to filter traffic so that only
incoming connections associated with the desired NUMA domain are
given to the server. (Of course, in the case where there are no
servers sharing the listen socket on some domain, then as a
fallback, traffic will be hashed as normal to all servers sharing
the listen socket regardless of domain). This allows a server to
deal only with traffic that is local to its NUMA domain, and
avoids cross-domain traffic in most cases.
This patch, and a corresponding small patch to nginx to use
TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted
https media content from dual-socket Xeons with only 13% (as
measured by pcm.x) cross domain traffic on the memory controller.
Reviewed by: jhb, bz (earlier version), bcr (man page)
Tested by: gonzo
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21636
2020-12-19 22:04:46 +00:00
|
|
|
case TCP_REUSPORT_LB_NUMA:
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
error = sooptcopyin(sopt, &optval, sizeof(optval),
|
|
|
|
sizeof(optval));
|
|
|
|
INP_WLOCK_RECHECK(inp);
|
|
|
|
if (!error)
|
|
|
|
error = in_pcblbgroup_numa(inp, optval);
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
break;
|
|
|
|
|
2019-08-27 00:01:56 +00:00
|
|
|
#ifdef KERN_TLS
|
|
|
|
case TCP_TXTLS_ENABLE:
|
|
|
|
INP_WUNLOCK(inp);
|
2020-04-27 22:31:42 +00:00
|
|
|
error = copyin_tls_enable(sopt, &tls);
|
2019-08-27 00:01:56 +00:00
|
|
|
if (error)
|
|
|
|
break;
|
|
|
|
error = ktls_enable_tx(so, &tls);
|
|
|
|
break;
|
|
|
|
case TCP_TXTLS_MODE:
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
error = sooptcopyin(sopt, &ui, sizeof(ui), sizeof(ui));
|
|
|
|
if (error)
|
|
|
|
return (error);
|
|
|
|
|
|
|
|
INP_WLOCK_RECHECK(inp);
|
|
|
|
error = ktls_set_tx_mode(so, ui);
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
break;
|
2020-04-27 23:17:19 +00:00
|
|
|
case TCP_RXTLS_ENABLE:
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
error = sooptcopyin(sopt, &tls, sizeof(tls),
|
|
|
|
sizeof(tls));
|
|
|
|
if (error)
|
|
|
|
break;
|
|
|
|
error = ktls_enable_rx(so, &tls);
|
|
|
|
break;
|
Add kernel-side support for in-kernel TLS.
KTLS adds support for in-kernel framing and encryption of Transport
Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports
offload of TLS for transmitted data. Key negotation must still be
performed in userland. Once completed, transmit session keys for a
connection are provided to the kernel via a new TCP_TXTLS_ENABLE
socket option. All subsequent data transmitted on the socket is
placed into TLS frames and encrypted using the supplied keys.
Any data written to a KTLS-enabled socket via write(2), aio_write(2),
or sendfile(2) is assumed to be application data and is encoded in TLS
frames with an application data type. Individual records can be sent
with a custom type (e.g. handshake messages) via sendmsg(2) with a new
control message (TLS_SET_RECORD_TYPE) specifying the record type.
At present, rekeying is not supported though the in-kernel framework
should support rekeying.
KTLS makes use of the recently added unmapped mbufs to store TLS
frames in the socket buffer. Each TLS frame is described by a single
ext_pgs mbuf. The ext_pgs structure contains the header of the TLS
record (and trailer for encrypted records) as well as references to
the associated TLS session.
KTLS supports two primary methods of encrypting TLS frames: software
TLS and ifnet TLS.
Software TLS marks mbufs holding socket data as not ready via
M_NOTREADY similar to sendfile(2) when TLS framing information is
added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then
called to schedule TLS frames for encryption. In the case of
sendfile_iodone() calls ktls_enqueue() instead of pru_ready() leaving
the mbufs marked M_NOTREADY until encryption is completed. For other
writes (vn_sendfile when pages are available, write(2), etc.), the
PRUS_NOTREADY is set when invoking pru_send() along with invoking
ktls_enqueue().
A pool of worker threads (the "KTLS" kernel process) encrypts TLS
frames queued via ktls_enqueue(). Each TLS frame is temporarily
mapped using the direct map and passed to a software encryption
backend to perform the actual encryption.
(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if
someone wished to make this work on architectures without a direct
map.)
KTLS supports pluggable software encryption backends. Internally,
Netflix uses proprietary pure-software backends. This commit includes
a simple backend in a new ktls_ocf.ko module that uses the kernel's
OpenCrypto framework to provide AES-GCM encryption of TLS frames. As
a result, software TLS is now a bit of a misnomer as it can make use
of hardware crypto accelerators.
Once software encryption has finished, the TLS frame mbufs are marked
ready via pru_ready(). At this point, the encrypted data appears as
regular payload to the TCP stack stored in unmapped mbufs.
ifnet TLS permits a NIC to offload the TLS encryption and TCP
segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS)
is allocated on the interface a socket is routed over and associated
with a TLS session. TLS records for a TLS session using ifnet TLS are
not marked M_NOTREADY but are passed down the stack unencrypted. The
ip_output_send() and ip6_output_send() helper functions that apply
send tags to outbound IP packets verify that the send tag of the TLS
record matches the outbound interface. If so, the packet is tagged
with the TLS send tag and sent to the interface. The NIC device
driver must recognize packets with the TLS send tag and schedule them
for TLS encryption and TCP segmentation. If the outbound
interface does not match the interface in the TLS send tag, the packet
is dropped. In addition, a task is scheduled to refresh the TLS send
tag for the TLS session. If a new TLS send tag cannot be allocated,
the connection is dropped. If a new TLS send tag is allocated,
however, subsequent packets will be tagged with the correct TLS send
tag. (This latter case has been tested by configuring both ports of a
Chelsio T6 in a lagg and failing over from one port to another. As
the connections migrated to the new port, new TLS send tags were
allocated for the new port and connections resumed without being
dropped.)
ifnet TLS can be enabled and disabled on supported network interfaces
via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported
across both vlan devices and lagg interfaces using failover or lacp with
flowid enabled.
Applications may request the current KTLS mode of a connection via a
new TCP_TXTLS_MODE socket option. They can also use this socket
option to toggle between software and ifnet TLS modes.
In addition, a testing tool is available in tools/tools/switch_tls.
This is modeled on tcpdrop and uses similar syntax. However, instead
of dropping connections, -s is used to force KTLS connections to
switch to software TLS and -i is used to switch to ifnet TLS.
Various sysctls and counters are available under the kern.ipc.tls
sysctl node. The kern.ipc.tls.enable node must be set to true to
enable KTLS (it is off by default). The use of unmapped mbufs must
also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.
KTLS is enabled via the KERN_TLS kernel option.
This patch is the culmination of years of work by several folks
including Scott Long and Randall Stewart for the original design and
implementation; Drew Gallatin for several optimizations including the
use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records
awaiting software encryption, and pluggable software crypto backends;
and John Baldwin for modifications to support hardware TLS offload.
Reviewed by: gallatin, hselasky, rrs
Obtained from: Netflix
Sponsored by: Netflix, Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D21277
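The framing step described in the message above (application data placed into TLS records, with a custom record type possible for handshake messages) can be illustrated with a minimal userland sketch. It is not the kernel's ktls_frame(); the helper name tls_write_header() and the SK_ constants are hypothetical, and only the 5-byte record header layout (content type, two version bytes, big-endian length) is taken from the TLS specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Record content types defined by the TLS specification. */
#define SK_TLS_RT_HANDSHAKE		22
#define SK_TLS_RT_APPLICATION_DATA	23

#define SK_TLS_HEADER_LEN		5

/*
 * Hypothetical helper: write a TLS record header into out.
 * payload_len is the number of bytes that follow the header; this
 * sketch only checks that it fits the 16-bit length field.
 * Returns 0 on success and -1 if the length cannot be encoded.
 */
static int
tls_write_header(uint8_t out[SK_TLS_HEADER_LEN], uint8_t type,
    uint8_t vmajor, uint8_t vminor, size_t payload_len)
{
	assert(out != NULL);
	if (payload_len > 0xffff)
		return (-1);
	out[0] = type;				/* content type */
	out[1] = vmajor;			/* 3 for TLS 1.0-1.2 */
	out[2] = vminor;			/* 3 for TLS 1.2 */
	out[3] = (uint8_t)(payload_len >> 8);	/* length, high byte */
	out[4] = (uint8_t)(payload_len & 0xff);	/* length, low byte */
	return (0);
}
```

Passing SK_TLS_RT_APPLICATION_DATA corresponds to the default case for write(2) and sendfile(2) data; a caller sending a handshake message would pass SK_TLS_RT_HANDSHAKE instead.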
2019-08-27 00:01:56 +00:00
|
|
|
#endif
|
|
|
|
|
2012-02-05 16:53:02 +00:00
|
|
|
case TCP_KEEPIDLE:
|
|
|
|
case TCP_KEEPINTVL:
|
|
|
|
case TCP_KEEPINIT:
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
error = sooptcopyin(sopt, &ui, sizeof(ui), sizeof(ui));
|
|
|
|
if (error)
|
|
|
|
return (error);
|
|
|
|
|
|
|
|
if (ui > (UINT_MAX / hz)) {
|
|
|
|
error = EINVAL;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
ui *= hz;
|
|
|
|
|
|
|
|
INP_WLOCK_RECHECK(inp);
|
|
|
|
switch (sopt->sopt_name) {
|
|
|
|
case TCP_KEEPIDLE:
|
|
|
|
tp->t_keepidle = ui;
|
|
|
|
/*
|
|
|
|
* XXX: better check current remaining
|
|
|
|
* timeout and "merge" it with new value.
|
|
|
|
*/
|
|
|
|
if ((tp->t_state > TCPS_LISTEN) &&
|
|
|
|
(tp->t_state <= TCPS_CLOSING))
|
|
|
|
tcp_timer_activate(tp, TT_KEEP,
|
|
|
|
TP_KEEPIDLE(tp));
|
|
|
|
break;
|
|
|
|
case TCP_KEEPINTVL:
|
|
|
|
tp->t_keepintvl = ui;
|
|
|
|
if ((tp->t_state == TCPS_FIN_WAIT_2) &&
|
|
|
|
(TP_MAXIDLE(tp) > 0))
|
|
|
|
tcp_timer_activate(tp, TT_2MSL,
|
|
|
|
TP_MAXIDLE(tp));
|
|
|
|
break;
|
|
|
|
case TCP_KEEPINIT:
|
|
|
|
tp->t_keepinit = ui;
|
|
|
|
if (tp->t_state == TCPS_SYN_RECEIVED ||
|
|
|
|
tp->t_state == TCPS_SYN_SENT)
|
|
|
|
tcp_timer_activate(tp, TT_KEEP,
|
|
|
|
TP_KEEPINIT(tp));
|
|
|
|
break;
|
|
|
|
}
|
2012-06-19 07:34:13 +00:00
|
|
|
goto unlock_and_done;
|
2012-02-05 16:53:02 +00:00
|
|
|
|
2012-09-27 07:13:21 +00:00
|
|
|
case TCP_KEEPCNT:
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
error = sooptcopyin(sopt, &ui, sizeof(ui), sizeof(ui));
|
|
|
|
if (error)
|
|
|
|
return (error);
|
|
|
|
|
|
|
|
INP_WLOCK_RECHECK(inp);
|
|
|
|
tp->t_keepcnt = ui;
|
|
|
|
if ((tp->t_state == TCPS_FIN_WAIT_2) &&
|
|
|
|
(TP_MAXIDLE(tp) > 0))
|
|
|
|
tcp_timer_activate(tp, TT_2MSL,
|
|
|
|
TP_MAXIDLE(tp));
|
|
|
|
goto unlock_and_done;
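The keepalive options above accept a value in seconds and store it in ticks, rejecting values whose product with hz would overflow an unsigned int. The following is a standalone sketch of that check, assuming hz is passed in as a parameter; keep_secs_to_ticks() is a hypothetical name, not a kernel function.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>

/*
 * Convert a timeout in seconds to ticks. Mirrors the
 * "ui > UINT_MAX / hz" test: the bound is checked before
 * multiplying, so the multiplication itself can never wrap.
 * On failure *ticks is left untouched.
 */
static int
keep_secs_to_ticks(unsigned int secs, unsigned int hz, unsigned int *ticks)
{
	assert(hz > 0 && ticks != NULL);
	if (secs > UINT_MAX / hz)
		return (EINVAL);
	*ticks = secs * hz;
	return (0);
}
```

Checking the bound first is the important design choice here: testing the product after the multiplication would be too late, because unsigned overflow silently wraps.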
|
|
|
|
|
There are times when it would be really nice to have a record of the last few
packets and/or state transitions from each TCP socket. That would help with
narrowing down certain problems we see in the field that are hard to reproduce
without understanding the history of how we got into a certain state. This
change provides just that.
It saves copies of the last N packets in a list in the tcpcb. When the tcpcb is
destroyed, the list is freed. I thought this was likely to be more
performance-friendly than saving copies of the tcpcb. Plus, with the packets,
you should be able to reverse-engineer what happened to the tcpcb.
To enable the feature, you will need to compile a kernel with the TCPPCAP
option. Even then, the feature defaults to being deactivated. You can activate
it by setting a positive value for the number of captured packets. You can do
that on either a global basis or on a per-socket basis (via a setsockopt call).
There is no way to get the packets out of the kernel other than using kmem or
getting a coredump. I thought that would help some of the legal/privacy concerns
regarding such a feature. However, it should be possible to add a future effort
to export them in PCAP format.
I tested this at low scale, and found that there were no mbuf leaks and the peak
mbuf usage appeared to be unchanged with and without the feature.
The main performance concern I can envision is the number of mbufs that would be
used on systems with a large number of sockets. If you save five packets per
direction per socket and have 3,000 sockets, that will consume at least 30,000
mbufs just to keep these packets. I tried to reduce the concerns associated with
this by limiting the number of clusters (not mbufs) that could be used for this
feature. Again, in my testing, that appears to work correctly.
Differential Revision: D3100
Submitted by: Jonathan Looney <jlooney at juniper dot net>
Reviewed by: gnn, hiren
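The "last N packets in a list" idea from the message above can be sketched in userland as a bounded queue that discards the oldest copy when the limit is reached. This is only an illustration under stated assumptions: the sk_ names are hypothetical, plain heap buffers stand in for mbufs, and the real code's cluster limit is not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* One saved packet copy (stand-in for an mbuf chain). */
struct sk_pkt {
	struct sk_pkt *next;
	size_t len;
	unsigned char *data;
};

struct sk_history {
	struct sk_pkt *head;	/* oldest saved packet */
	struct sk_pkt *tail;	/* newest saved packet */
	int count;
	int max;		/* 0 means capture is deactivated */
};

static void
sk_history_init(struct sk_history *h, int max)
{
	assert(h != NULL && max >= 0);
	h->head = h->tail = NULL;
	h->count = 0;
	h->max = max;
}

static void
sk_history_drop_oldest(struct sk_history *h)
{
	struct sk_pkt *p = h->head;

	if (p == NULL)
		return;
	h->head = p->next;
	if (h->head == NULL)
		h->tail = NULL;
	h->count--;
	free(p->data);
	free(p);
}

/* Save a copy of one packet; returns -1 if disabled or out of memory. */
static int
sk_history_save(struct sk_history *h, const void *data, size_t len)
{
	struct sk_pkt *p;

	if (h->max <= 0)
		return (-1);
	p = malloc(sizeof(*p));
	if (p == NULL)
		return (-1);
	p->data = malloc(len > 0 ? len : 1);
	if (p->data == NULL) {
		free(p);
		return (-1);
	}
	memcpy(p->data, data, len);
	p->len = len;
	p->next = NULL;
	while (h->count >= h->max)	/* make room for the new copy */
		sk_history_drop_oldest(h);
	if (h->tail != NULL)
		h->tail->next = p;
	else
		h->head = p;
	h->tail = p;
	h->count++;
	return (0);
}

/* Release every saved copy, as happens when the tcpcb is destroyed. */
static void
sk_history_free(struct sk_history *h)
{
	while (h->head != NULL)
		sk_history_drop_oldest(h);
}
```

With one such list per direction, five packets per list, and 3,000 sockets, the structure holds 5 x 2 x 3,000 = 30,000 saved packets, which is the figure the message uses for its mbuf estimate.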
2015-10-14 00:35:37 +00:00
|
|
|
#ifdef TCPPCAP
|
|
|
|
case TCP_PCAP_OUT:
|
|
|
|
case TCP_PCAP_IN:
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
error = sooptcopyin(sopt, &optval, sizeof optval,
|
|
|
|
sizeof optval);
|
|
|
|
if (error)
|
|
|
|
return (error);
|
|
|
|
|
|
|
|
INP_WLOCK_RECHECK(inp);
|
|
|
|
if (optval >= 0)
|
|
|
|
tcp_pcap_set_sock_max((sopt->sopt_name == TCP_PCAP_OUT) ?
|
|
|
|
&(tp->t_outpkts) : &(tp->t_inpkts),
|
|
|
|
optval);
|
|
|
|
else
|
|
|
|
error = EINVAL;
|
|
|
|
goto unlock_and_done;
|
|
|
|
#endif
|
|
|
|
|
2018-02-26 02:53:22 +00:00
|
|
|
case TCP_FASTOPEN: {
|
|
|
|
struct tcp_fastopen tfo_optval;
|
|
|
|
|
2015-12-24 19:09:48 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2018-02-26 02:53:22 +00:00
|
|
|
if (!V_tcp_fastopen_client_enable &&
|
|
|
|
!V_tcp_fastopen_server_enable)
|
2015-12-24 19:09:48 +00:00
|
|
|
return (EPERM);
|
|
|
|
|
2018-02-26 02:53:22 +00:00
|
|
|
error = sooptcopyin(sopt, &tfo_optval,
|
|
|
|
sizeof(tfo_optval), sizeof(int));
|
2015-12-24 19:09:48 +00:00
|
|
|
if (error)
|
|
|
|
return (error);
|
|
|
|
|
|
|
|
INP_WLOCK_RECHECK(inp);
|
2020-06-03 13:51:53 +00:00
|
|
|
if ((tp->t_state != TCPS_CLOSED) &&
|
|
|
|
(tp->t_state != TCPS_LISTEN)) {
|
|
|
|
error = EINVAL;
|
|
|
|
goto unlock_and_done;
|
|
|
|
}
|
2018-02-26 02:53:22 +00:00
|
|
|
if (tfo_optval.enable) {
|
|
|
|
if (tp->t_state == TCPS_LISTEN) {
|
|
|
|
if (!V_tcp_fastopen_server_enable) {
|
|
|
|
error = EPERM;
|
|
|
|
goto unlock_and_done;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (tp->t_tfo_pending == NULL)
|
|
|
|
tp->t_tfo_pending =
|
|
|
|
tcp_fastopen_alloc_counter();
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* If a pre-shared key was provided,
|
|
|
|
* stash it in the client cookie
|
|
|
|
* field of the tcpcb for use during
|
|
|
|
* connect.
|
|
|
|
*/
|
|
|
|
if (sopt->sopt_valsize ==
|
|
|
|
sizeof(tfo_optval)) {
|
|
|
|
memcpy(tp->t_tfo_cookie.client,
|
|
|
|
tfo_optval.psk,
|
|
|
|
TCP_FASTOPEN_PSK_LEN);
|
|
|
|
tp->t_tfo_client_cookie_len =
|
|
|
|
TCP_FASTOPEN_PSK_LEN;
|
|
|
|
}
|
|
|
|
}
|
2020-06-03 13:51:53 +00:00
|
|
|
tp->t_flags |= TF_FASTOPEN;
|
2015-12-24 19:09:48 +00:00
|
|
|
} else
|
|
|
|
tp->t_flags &= ~TF_FASTOPEN;
|
|
|
|
goto unlock_and_done;
|
2018-02-26 02:53:22 +00:00
|
|
|
}
|
2015-12-24 19:09:48 +00:00
|
|
|
|
2018-03-24 12:48:10 +00:00
|
|
|
#ifdef TCP_BLACKBOX
|
2018-03-22 09:40:08 +00:00
|
|
|
case TCP_LOG:
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
error = sooptcopyin(sopt, &optval, sizeof optval,
|
|
|
|
sizeof optval);
|
|
|
|
if (error)
|
|
|
|
return (error);
|
|
|
|
|
|
|
|
INP_WLOCK_RECHECK(inp);
|
|
|
|
error = tcp_log_state_change(tp, optval);
|
|
|
|
goto unlock_and_done;
|
|
|
|
|
|
|
|
case TCP_LOGBUF:
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
error = EINVAL;
|
|
|
|
break;
|
|
|
|
|
|
|
|
case TCP_LOGID:
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
error = sooptcopyin(sopt, buf, TCP_LOG_ID_LEN - 1, 0);
|
|
|
|
if (error)
|
|
|
|
break;
|
|
|
|
buf[sopt->sopt_valsize] = '\0';
|
|
|
|
INP_WLOCK_RECHECK(inp);
|
|
|
|
error = tcp_log_set_id(tp, buf);
|
|
|
|
/* tcp_log_set_id() unlocks the INP. */
|
|
|
|
break;
|
|
|
|
|
|
|
|
case TCP_LOGDUMP:
|
|
|
|
case TCP_LOGDUMPID:
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
error =
|
|
|
|
sooptcopyin(sopt, buf, TCP_LOG_REASON_LEN - 1, 0);
|
|
|
|
if (error)
|
|
|
|
break;
|
|
|
|
buf[sopt->sopt_valsize] = '\0';
|
|
|
|
INP_WLOCK_RECHECK(inp);
|
|
|
|
if (sopt->sopt_name == TCP_LOGDUMP) {
|
|
|
|
error = tcp_log_dump_tp_logbuf(tp, buf,
|
|
|
|
M_WAITOK, true);
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
} else {
|
|
|
|
tcp_log_dump_tp_bucket_logbufs(tp, buf);
|
|
|
|
/*
|
|
|
|
* tcp_log_dump_tp_bucket_logbufs() drops the
|
|
|
|
* INP lock.
|
|
|
|
*/
|
|
|
|
}
|
|
|
|
break;
|
2018-03-24 12:48:10 +00:00
|
|
|
#endif
|
2018-03-22 09:40:08 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
default:
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp);
|
1994-05-24 10:09:53 +00:00
|
|
|
error = ENOPROTOOPT;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
|
1998-08-23 03:07:17 +00:00
|
|
|
case SOPT_GET:
|
2008-01-18 12:19:50 +00:00
|
|
|
tp = intotcpcb(inp);
|
1998-08-23 03:07:17 +00:00
|
|
|
switch (sopt->sopt_name) {
|
2017-02-06 08:49:57 +00:00
|
|
|
#if defined(IPSEC_SUPPORT) || defined(TCP_SIGNATURE)
|
2004-02-16 22:21:16 +00:00
|
|
|
case TCP_MD5SIG:
|
2017-02-06 08:49:57 +00:00
|
|
|
if (!TCPMD5_ENABLED()) {
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
return (ENOPROTOOPT);
|
|
|
|
}
|
|
|
|
error = TCPMD5_PCBCTL(inp, sopt);
|
Initial import of RFC 2385 (TCP-MD5) digest support.
This is the first of two commits; bringing in the kernel support first.
This can be enabled by compiling a kernel with options TCP_SIGNATURE
and FAST_IPSEC.
For the uninitiated, this is a TCP option which provides for a means of
authenticating TCP sessions which came into being before IPSEC. It is
still relevant today, however, as it is used by many commercial router
vendors, particularly with BGP, and as such has become a requirement for
interconnect at many major Internet points of presence.
Several parts of the TCP and IP headers, including the segment payload,
are digested with MD5, including a shared secret. The PF_KEY interface
is used to manage the secrets using security associations in the SADB.
There is a limitation here in that as there is no way to map a TCP flow
per-port back to an SPI without polluting tcpcb or using the SPD; the
code to do the latter is unstable at this time. Therefore this code only
supports per-host keying granularity.
Whilst FAST_IPSEC is mutually exclusive with KAME IPSEC (and thus IPv6),
TCP_SIGNATURE applies only to IPv4. For the vast majority of prospective
users of this feature, this will not pose any problem.
This implementation is output-only; that is, the option is honoured when
responding to a host initiating a TCP session, but no effort is made
[yet] to authenticate inbound traffic. This is, however, sufficient to
interwork with Cisco equipment.
Tested with a Cisco 2501 running IOS 12.0(27), and Quagga 0.96.4 with
local patches. Patches for tcpdump to validate TCP-MD5 sessions are also
available from me upon request.
Sponsored by: sentex.net
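For readers unfamiliar with RFC 2385, the digest described in the message above travels in a TCP option of kind 19 and total length 18 (kind byte, length byte, 16-byte MD5 digest). The sketch below only shows that on-the-wire layout; it does not compute the digest, and sk_write_md5sig_option() and the SK_ constants are hypothetical names, not the kernel's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SK_TCPOPT_MD5SIG	19	/* option kind from RFC 2385 */
#define SK_TCPOLEN_MD5SIG	18	/* kind + length + 16-byte digest */
#define SK_MD5_DIGEST_LEN	16

/*
 * Write the TCP-MD5 signature option into opt.
 * Returns the number of bytes written, or 0 if fewer than
 * SK_TCPOLEN_MD5SIG bytes of option space are available.
 */
static size_t
sk_write_md5sig_option(uint8_t *opt, size_t space,
    const uint8_t digest[SK_MD5_DIGEST_LEN])
{
	assert(opt != NULL && digest != NULL);
	if (space < SK_TCPOLEN_MD5SIG)
		return (0);
	opt[0] = SK_TCPOPT_MD5SIG;
	opt[1] = SK_TCPOLEN_MD5SIG;
	memcpy(&opt[2], digest, SK_MD5_DIGEST_LEN);
	return (SK_TCPOLEN_MD5SIG);
}
```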
2004-02-11 04:26:04 +00:00
|
|
|
break;
|
2004-02-13 18:21:45 +00:00
|
|
|
#endif
|
2008-01-18 12:19:50 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
case TCP_NODELAY:
|
1998-08-23 03:07:17 +00:00
|
|
|
optval = tp->t_flags & TF_NODELAY;
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2004-11-26 18:58:46 +00:00
|
|
|
error = sooptcopyout(sopt, &optval, sizeof optval);
|
1994-05-24 10:09:53 +00:00
|
|
|
break;
|
|
|
|
case TCP_MAXSEG:
|
1998-08-23 03:07:17 +00:00
|
|
|
optval = tp->t_maxseg;
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2004-11-26 18:58:46 +00:00
|
|
|
error = sooptcopyout(sopt, &optval, sizeof optval);
|
1994-05-24 10:09:53 +00:00
|
|
|
break;
|
2021-04-18 16:08:08 +02:00
|
|
|
case TCP_REMOTE_UDP_ENCAPS_PORT:
|
|
|
|
optval = ntohs(tp->t_port);
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
error = sooptcopyout(sopt, &optval, sizeof optval);
|
|
|
|
break;
|
1995-02-09 23:13:27 +00:00
|
|
|
case TCP_NOOPT:
|
1998-08-23 03:07:17 +00:00
|
|
|
optval = tp->t_flags & TF_NOOPT;
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2004-11-26 18:58:46 +00:00
|
|
|
error = sooptcopyout(sopt, &optval, sizeof optval);
|
1995-02-09 23:13:27 +00:00
|
|
|
break;
|
|
|
|
case TCP_NOPUSH:
|
1998-08-23 03:07:17 +00:00
|
|
|
optval = tp->t_flags & TF_NOPUSH;
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2004-11-26 18:58:46 +00:00
|
|
|
error = sooptcopyout(sopt, &optval, sizeof optval);
|
|
|
|
break;
|
|
|
|
case TCP_INFO:
|
|
|
|
tcp_fill_info(tp, &ti);
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2004-11-26 18:58:46 +00:00
|
|
|
error = sooptcopyout(sopt, &ti, sizeof ti);
|
1995-02-09 23:13:27 +00:00
|
|
|
break;
|
2019-12-02 20:58:04 +00:00
|
|
|
case TCP_STATS:
|
|
|
|
{
|
|
|
|
#ifdef STATS
|
|
|
|
int nheld;
|
|
|
|
TYPEOF_MEMBER(struct statsblob, flags) sbflags = 0;
|
|
|
|
|
|
|
|
error = 0;
|
|
|
|
socklen_t outsbsz = sopt->sopt_valsize;
|
|
|
|
if (tp->t_stats == NULL)
|
|
|
|
error = ENOENT;
|
|
|
|
else if (outsbsz >= tp->t_stats->cursz)
|
|
|
|
outsbsz = tp->t_stats->cursz;
|
|
|
|
else if (outsbsz >= sizeof(struct statsblob))
|
|
|
|
outsbsz = sizeof(struct statsblob);
|
|
|
|
else
|
|
|
|
error = EINVAL;
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
if (error)
|
|
|
|
break;
|
|
|
|
|
|
|
|
sbp = sopt->sopt_val;
|
|
|
|
nheld = atop(round_page(((vm_offset_t)sbp) +
|
|
|
|
(vm_size_t)outsbsz) - trunc_page((vm_offset_t)sbp));
|
|
|
|
vm_page_t ma[nheld];
|
|
|
|
if (vm_fault_quick_hold_pages(
|
|
|
|
&curproc->p_vmspace->vm_map, (vm_offset_t)sbp,
|
|
|
|
outsbsz, VM_PROT_READ | VM_PROT_WRITE, ma,
|
|
|
|
nheld) < 0) {
|
|
|
|
error = EFAULT;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
if ((error = copyin_nofault(&(sbp->flags), &sbflags,
|
|
|
|
SIZEOF_MEMBER(struct statsblob, flags))))
|
|
|
|
goto unhold;
|
|
|
|
|
|
|
|
INP_WLOCK_RECHECK(inp);
|
|
|
|
error = stats_blob_snapshot(&sbp, outsbsz, tp->t_stats,
|
|
|
|
sbflags | SB_CLONE_USRDSTNOFAULT);
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
sopt->sopt_valsize = outsbsz;
|
|
|
|
unhold:
|
|
|
|
vm_page_unhold_pages(ma, nheld);
|
|
|
|
#else
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
error = EOPNOTSUPP;
|
|
|
|
#endif /* !STATS */
|
|
|
|
break;
|
|
|
|
}
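The nheld computation in the TCP_STATS case above counts how many pages the user buffer touches: the end address is rounded up to a page boundary, the start address is rounded down, and the difference is converted to pages. Here is a standalone sketch of that arithmetic, assuming a 4096-byte page; the sk_ helpers are hypothetical stand-ins for round_page(), trunc_page(), and atop().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_PAGE_SIZE	((uintptr_t)4096)	/* assumed page size */

/* Round an address down to the start of its page. */
static uintptr_t
sk_trunc_page(uintptr_t addr)
{
	return (addr & ~(SK_PAGE_SIZE - 1));
}

/* Round an address up to the next page boundary. */
static uintptr_t
sk_round_page(uintptr_t addr)
{
	return ((addr + SK_PAGE_SIZE - 1) & ~(SK_PAGE_SIZE - 1));
}

/* Number of pages touched by the buffer [start, start + len). */
static size_t
sk_pages_spanned(uintptr_t start, size_t len)
{
	assert(len <= UINTPTR_MAX - start);	/* no address wrap */
	return ((size_t)((sk_round_page(start + len) -
	    sk_trunc_page(start)) / SK_PAGE_SIZE));
}
```

A buffer that is exactly one page long but starts one byte into a page spans two pages, which is why the count is derived from both endpoints and not from the length alone.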
|
2010-11-12 06:41:55 +00:00
|
|
|
case TCP_CONGESTION:
|
2016-01-27 07:34:00 +00:00
|
|
|
len = strlcpy(buf, CC_ALGO(tp)->name, TCP_CA_NAME_MAX);
|
2010-11-12 06:41:55 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2016-01-27 07:34:00 +00:00
|
|
|
error = sooptcopyout(sopt, buf, len + 1);
|
2010-11-12 06:41:55 +00:00
|
|
|
break;
|
2013-11-08 13:04:14 +00:00
|
|
|
case TCP_KEEPIDLE:
|
|
|
|
case TCP_KEEPINTVL:
|
|
|
|
case TCP_KEEPINIT:
|
|
|
|
case TCP_KEEPCNT:
|
|
|
|
switch (sopt->sopt_name) {
|
|
|
|
case TCP_KEEPIDLE:
|
2016-09-14 14:48:00 +00:00
|
|
|
ui = TP_KEEPIDLE(tp) / hz;
|
2013-11-08 13:04:14 +00:00
|
|
|
break;
|
|
|
|
case TCP_KEEPINTVL:
|
2016-09-14 14:48:00 +00:00
|
|
|
ui = TP_KEEPINTVL(tp) / hz;
|
2013-11-08 13:04:14 +00:00
|
|
|
break;
|
|
|
|
case TCP_KEEPINIT:
|
2016-09-14 14:48:00 +00:00
|
|
|
ui = TP_KEEPINIT(tp) / hz;
|
2013-11-08 13:04:14 +00:00
|
|
|
break;
|
|
|
|
case TCP_KEEPCNT:
|
2016-09-14 14:48:00 +00:00
|
|
|
ui = TP_KEEPCNT(tp);
|
2013-11-08 13:04:14 +00:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
error = sooptcopyout(sopt, &ui, sizeof(ui));
|
|
|
|
break;
|
2015-10-14 00:35:37 +00:00
|
|
|
#ifdef TCPPCAP
|
|
|
|
case TCP_PCAP_OUT:
|
|
|
|
case TCP_PCAP_IN:
|
|
|
|
optval = tcp_pcap_get_sock_max((sopt->sopt_name == TCP_PCAP_OUT) ?
|
|
|
|
&(tp->t_outpkts) : &(tp->t_inpkts));
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
error = sooptcopyout(sopt, &optval, sizeof optval);
|
|
|
|
break;
|
|
|
|
#endif
|
2015-12-24 19:09:48 +00:00
|
|
|
case TCP_FASTOPEN:
|
|
|
|
optval = tp->t_flags & TF_FASTOPEN;
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
error = sooptcopyout(sopt, &optval, sizeof optval);
|
|
|
|
break;
|
2018-03-24 12:48:10 +00:00
|
|
|
#ifdef TCP_BLACKBOX
|
2018-03-22 09:40:08 +00:00
|
|
|
case TCP_LOG:
|
|
|
|
optval = tp->t_logstate;
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
error = sooptcopyout(sopt, &optval, sizeof(optval));
|
|
|
|
break;
|
|
|
|
case TCP_LOGBUF:
|
|
|
|
/* tcp_log_getlogbuf() does INP_WUNLOCK(inp) */
|
|
|
|
error = tcp_log_getlogbuf(sopt, tp);
|
|
|
|
break;
|
|
|
|
case TCP_LOGID:
|
|
|
|
len = tcp_log_get_id(tp, buf);
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
error = sooptcopyout(sopt, buf, len + 1);
|
|
|
|
break;
|
|
|
|
case TCP_LOGDUMP:
|
|
|
|
case TCP_LOGDUMPID:
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
error = EINVAL;
|
|
|
|
break;
|
Add kernel-side support for in-kernel TLS.
2019-08-27 00:01:56 +00:00
|
|
|
#endif
|
|
|
|
#ifdef KERN_TLS
|
|
|
|
case TCP_TXTLS_MODE:
|
|
|
|
optval = ktls_get_tx_mode(so);
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
error = sooptcopyout(sopt, &optval, sizeof(optval));
|
|
|
|
break;
|
2020-04-27 23:17:19 +00:00
|
|
|
case TCP_RXTLS_MODE:
|
|
|
|
optval = ktls_get_rx_mode(so);
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
error = sooptcopyout(sopt, &optval, sizeof(optval));
|
|
|
|
break;
|
2018-03-24 12:48:10 +00:00
|
|
|
#endif
|
2021-05-10 18:47:47 +02:00
|
|
|
case TCP_LRD:
|
|
|
|
optval = tp->t_flags & TF_LRD;
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
error = sooptcopyout(sopt, &optval, sizeof optval);
|
|
|
|
break;
|
1994-05-24 10:09:53 +00:00
|
|
|
default:
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp);
|
1994-05-24 10:09:53 +00:00
|
|
|
error = ENOPROTOOPT;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
return (error);
|
|
|
|
}
|
2008-04-17 21:38:18 +00:00
|
|
|
#undef INP_WLOCK_RECHECK
|
2016-04-26 23:02:18 +00:00
|
|
|
#undef INP_WLOCK_RECHECK_CLEANUP
|
1994-05-24 10:09:53 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Initiate (or continue) disconnect.
|
|
|
|
* If embryonic state, just send reset (once).
|
|
|
|
* If in ``let data drain'' option and linger null, just drop.
|
|
|
|
* Otherwise (hard), mark socket disconnecting and drop
|
|
|
|
* current input data; switch states based on user close, and
|
|
|
|
* send segment to peer (with FIN).
|
|
|
|
*/
|
Update TCP for infrastructural changes to the socket/pcb refcount model,
pru_abort(), pru_detach(), and in_pcbdetach():
- Universally support and enforce the invariant that so_pcb is
never NULL, converting dozens of unnecessary NULL checks into
assertions, and eliminating dozens of unnecessary error handling
cases in protocol code.
- In some cases, eliminate unnecessary pcbinfo locking, as it is no
longer required to ensure so_pcb != NULL. For example, the receive
code no longer requires the pcbinfo lock, and the send code only
requires it if building a new connection on an otherwise unconnected
socket triggered via sendto() with an address. This should
significantly reduce tcbinfo lock contention in the receive and send
cases.
- In order to support the invariant that so_pcb != NULL, it is now
necessary for the TCP code to not discard the tcpcb any time a
connection is dropped, but instead leave the tcpcb until the socket
is shutdown. This case is handled by setting INP_DROPPED, to
substitute for using a NULL so_pcb to indicate that the connection
has been dropped. This requires the inpcb lock, but not the pcbinfo
lock.
- Unlike all other protocols in the tree, TCP may need to retain access
to the socket after the file descriptor has been closed. Set
SS_PROTOREF in tcp_detach() in order to prevent the socket from being
freed, and add a flag, INP_SOCKREF, so that the TCP code knows whether
or not it needs to free the socket when the connection finally does
close. The typical case where this occurs is if close() is called on
a TCP socket before all sent data in the send socket buffer has been
transmitted or acknowledged. If INP_SOCKREF is found when the
connection is dropped, we release the inpcb, tcpcb, and socket instead
of flagging INP_DROPPED.
- Abort and detach protocol switch methods no longer return failures,
nor attempt to free sockets, as the socket layer does this.
- Annotate the existence of a long-standing race in the TCP timer code,
in which timers are stopped but not drained when the socket is freed,
as waiting for drain may lead to deadlocks, or have to occur in a
context where waiting is not permitted. This race has been handled
by testing to see if the tcpcb pointer in the inpcb is NULL (and vice
versa), which is not normally permitted, but may be true if an inpcb
and tcpcb have been freed. Add a counter to test how often this race
has actually occurred, and a large comment for each instance where
we compare potentially freed memory with NULL. This will have to be
fixed in the near future, but requires us to further address how to
handle the timer shutdown issue.
- Several TCP calls no longer potentially free the passed inpcb/tcpcb,
so no longer need to return a pointer to indicate whether the argument
passed in is still valid.
- Un-macroize debugging and locking setup for various protocol switch
methods for TCP, as it led to more obscurity, and as locking becomes
more customized to the methods, offers less benefit.
- Assert copyright on tcp_usrreq.c due to significant modifications that
have been made as part of this work.
These changes significantly modify the memory management and connection
logic of our TCP implementation, and are (as such) High Risk Changes,
and likely to contain serious bugs. Please report problems to the
current@ mailing list ASAP, ideally with simple test cases, and
optionally, packet traces.
MFC after: 3 months
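The two outcomes the message describes for a dropped connection (flag INP_DROPPED and keep the control blocks, or release everything when TCP holds the socket reference) can be modeled with a toy sketch. The flag values, struct sk_conn, and sk_conn_drop() are hypothetical and do not reflect the kernel's definitions; the freed field merely stands in for releasing the inpcb, tcpcb, and socket.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SK_INP_DROPPED	0x1	/* connection dropped, pcb kept */
#define SK_INP_SOCKREF	0x2	/* TCP holds the socket reference */

struct sk_conn {
	int flags;
	bool freed;	/* models inpcb, tcpcb, and socket release */
};

/*
 * Drop a connection. Returns true if everything was released
 * (the descriptor was already closed), false if the pcb stays
 * around, flagged as dropped, until the socket is shut down.
 */
static bool
sk_conn_drop(struct sk_conn *c)
{
	assert(c != NULL && !c->freed);
	if (c->flags & SK_INP_SOCKREF) {
		c->flags &= ~SK_INP_SOCKREF;
		c->freed = true;
		return (true);
	}
	c->flags |= SK_INP_DROPPED;
	return (false);
}
```

The point of the flag is that callers test it where they previously tested for a NULL so_pcb, which is what lets so_pcb stay non-NULL for the socket's lifetime.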
2006-04-01 16:36:36 +00:00
|
|
|
static void
|
2007-03-21 19:37:55 +00:00
|
|
|
tcp_disconnect(struct tcpcb *tp)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
2005-06-01 12:08:15 +00:00
|
|
|
struct inpcb *inp = tp->t_inpcb;
|
|
|
|
struct socket *so = inp->inp_socket;
|
|
|
|
|
2019-11-07 00:10:14 +00:00
|
|
|
NET_EPOCH_ASSERT();
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK_ASSERT(inp);
|
1994-05-24 10:09:53 +00:00
|
|
|
|
2006-04-01 16:36:36 +00:00
|
|
|
/*
|
|
|
|
* Neither tcp_close() nor tcp_drop() should return NULL, as the
|
|
|
|
* socket is still open.
|
|
|
|
*/
|
Fix some TCP fast open issues.
The following issues are fixed:
* Whenever a TCP server with TCP fast open enabled calls accept(),
recv(), send(), and close() before the TCP-ACK segment has been received,
the TCP connection is just dropped and the reception of the TCP-ACK
segment triggers the sending of a TCP-RST segment.
* Whenever a TCP server with TCP fast open enabled calls accept(), recv(),
send(), send(), and close() before the TCP-ACK segment has been received,
the first byte provided in the second send call is not transferred.
* Whenever a TCP client with TCP fast open enabled calls sendto() followed
by close(), the TCP connection is just dropped.
Reviewed by: jtl@, kbowling@, rrs@
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D16485
2018-07-30 20:35:50 +00:00
|
|
|
if (tp->t_state < TCPS_ESTABLISHED &&
|
|
|
|
!(tp->t_state > TCPS_LISTEN && IS_FASTOPEN(tp->t_flags))) {
|
1994-05-24 10:09:53 +00:00
|
|
|
tp = tcp_close(tp);
|
2006-04-01 16:36:36 +00:00
|
|
|
KASSERT(tp != NULL,
|
|
|
|
("tcp_disconnect: tcp_close() returned NULL"));
|
|
|
|
} else if ((so->so_options & SO_LINGER) && so->so_linger == 0) {
|
2002-05-31 11:52:35 +00:00
|
|
|
tp = tcp_drop(tp, 0);
|
2006-04-01 16:36:36 +00:00
|
|
|
KASSERT(tp != NULL,
|
|
|
|
("tcp_disconnect: tcp_drop() returned NULL"));
|
|
|
|
} else {
|
2002-05-31 11:52:35 +00:00
|
|
|
soisdisconnecting(so);
|
|
|
|
sbflush(&so->so_rcv);
|
2006-04-01 16:36:36 +00:00
|
|
|
tcp_usrclosed(tp);
|
2009-03-15 09:58:31 +00:00
|
|
|
if (!(inp->inp_flags & INP_DROPPED))
|
2015-12-16 00:56:45 +00:00
|
|
|
tp->t_fb->tfb_tcp_output(tp);
|
1994-05-24 10:09:53 +00:00
|
|
|
}
|
|
|
|
}
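The three branches of tcp_disconnect() above can be read as a small decision function. The following is only a sketch under stated assumptions: `dc_choose`, `enum dc_action`, and the `DC_*` constants are hypothetical names invented for illustration, and the numeric values merely mirror the relative ordering of the `TCPS_*` states rather than being taken from the kernel headers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for TCPS_LISTEN and TCPS_ESTABLISHED (same ordering). */
enum { DC_LISTEN = 1, DC_ESTABLISHED = 4 };

/* Which of the three paths the disconnect logic would take. */
enum dc_action { DC_CLOSE, DC_DROP, DC_GRACEFUL };

static enum dc_action
dc_choose(int state, bool fastopen, bool linger_set, int linger_secs)
{
	assert(state >= 0);
	/* Not yet established, and not a fast open connection past LISTEN. */
	if (state < DC_ESTABLISHED && !(state > DC_LISTEN && fastopen))
		return DC_CLOSE;
	/* SO_LINGER with a zero timeout asks for an abortive close. */
	if (linger_set && linger_secs == 0)
		return DC_DROP;
	/* Otherwise walk through the normal shutdown states. */
	return DC_GRACEFUL;
}
```

In this model a fast open connection in a SYN state takes the graceful path instead of being closed outright, which appears to be the case the 2018 fast open fix targets.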
|
|
|
|
|
|
|
|
/*
|
|
|
|
* User issued close, and wish to trail through shutdown states:
|
|
|
|
* if never received SYN, just forget it. If got a SYN from peer,
|
|
|
|
* but haven't sent FIN, then go to FIN_WAIT_1 state to send peer a FIN.
|
|
|
|
* If already got a FIN from peer, then almost done; go to LAST_ACK
|
|
|
|
* state. In all other cases, have already sent FIN to peer (e.g.
|
|
|
|
* after PRU_SHUTDOWN), and just have to play tedious game waiting
|
|
|
|
* for peer to send FIN or not respond to keep-alives, etc.
|
|
|
|
* We can let the user exit from the close as soon as the FIN is acked.
|
|
|
|
*/
|
2006-04-01 16:36:36 +00:00
|
|
|
static void
|
2007-03-21 19:37:55 +00:00
|
|
|
tcp_usrclosed(struct tcpcb *tp)
|
1994-05-24 10:09:53 +00:00
|
|
|
{
|
|
|
|
|
2019-11-07 00:10:14 +00:00
|
|
|
NET_EPOCH_ASSERT();
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK_ASSERT(tp->t_inpcb);
|
2005-06-01 12:08:15 +00:00
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
switch (tp->t_state) {
|
|
|
|
case TCPS_LISTEN:
|
2012-06-19 07:34:13 +00:00
|
|
|
#ifdef TCP_OFFLOAD
|
|
|
|
tcp_offload_listen_stop(tp);
|
|
|
|
#endif
|
2015-09-15 20:04:30 +00:00
|
|
|
tcp_state_change(tp, TCPS_CLOSED);
|
2007-12-18 22:59:07 +00:00
|
|
|
/* FALLTHROUGH */
|
|
|
|
case TCPS_CLOSED:
|
1994-05-24 10:09:53 +00:00
|
|
|
tp = tcp_close(tp);
|
2006-04-01 16:36:36 +00:00
|
|
|
/*
|
|
|
|
* tcp_close() should never return NULL here as the socket is
|
|
|
|
* still open.
|
|
|
|
*/
|
|
|
|
KASSERT(tp != NULL,
|
|
|
|
("tcp_usrclosed: tcp_close() returned NULL"));
|
1994-05-24 10:09:53 +00:00
|
|
|
break;
|
|
|
|
|
1995-02-09 23:13:27 +00:00
|
|
|
case TCPS_SYN_SENT:
|
|
|
|
case TCPS_SYN_RECEIVED:
|
|
|
|
tp->t_flags |= TF_NEEDFIN;
|
|
|
|
break;
|
|
|
|
|
1994-05-24 10:09:53 +00:00
|
|
|
case TCPS_ESTABLISHED:
|
2013-08-25 21:54:41 +00:00
|
|
|
tcp_state_change(tp, TCPS_FIN_WAIT_1);
|
1994-05-24 10:09:53 +00:00
|
|
|
break;
|
|
|
|
|
|
|
|
case TCPS_CLOSE_WAIT:
|
2013-08-25 21:54:41 +00:00
|
|
|
tcp_state_change(tp, TCPS_LAST_ACK);
|
1994-05-24 10:09:53 +00:00
|
|
|
break;
|
|
|
|
}
|
2007-05-31 12:06:02 +00:00
|
|
|
if (tp->t_state >= TCPS_FIN_WAIT_2) {
|
1994-05-24 10:09:53 +00:00
|
|
|
soisdisconnected(tp->t_inpcb->inp_socket);
|
2007-05-31 12:06:02 +00:00
|
|
|
/* Prevent the connection hanging in FIN_WAIT_2 forever. */
|
2007-02-26 22:25:21 +00:00
|
|
|
if (tp->t_state == TCPS_FIN_WAIT_2) {
|
|
|
|
int timeout;
|
|
|
|
|
2020-02-12 13:31:36 +00:00
|
|
|
timeout = (tcp_fast_finwait2_recycle) ?
|
2012-02-05 16:53:02 +00:00
|
|
|
tcp_finwait2_timeout : TP_MAXIDLE(tp);
|
2007-04-11 09:45:16 +00:00
|
|
|
tcp_timer_activate(tp, TT_2MSL, timeout);
|
2007-02-26 22:25:21 +00:00
|
|
|
}
|
1995-10-29 21:30:25 +00:00
|
|
|
}
|
1994-05-24 10:09:53 +00:00
|
|
|
}
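The state handling in tcp_usrclosed() above can be modeled outside the kernel as a pure function. This is a minimal sketch, not the kernel code: the `sk_*` names and `struct sk_result` are hypothetical, the enum only copies the ordering of the `TCPS_*` states, and the side effects (setting TF_NEEDFIN, calling tcp_close(), calling soisdisconnected()) are reduced to booleans.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the TCPS_* states, kept in the same order. */
enum sk_state {
	SK_CLOSED, SK_LISTEN, SK_SYN_SENT, SK_SYN_RECEIVED, SK_ESTABLISHED,
	SK_CLOSE_WAIT, SK_FIN_WAIT_1, SK_CLOSING, SK_LAST_ACK, SK_FIN_WAIT_2,
	SK_TIME_WAIT
};

struct sk_result {
	enum sk_state next;	/* state after the user close */
	bool need_fin;		/* models setting TF_NEEDFIN */
	bool close_pcb;		/* models the tcp_close() call */
	bool disconnected;	/* models soisdisconnected() */
};

static struct sk_result
sk_usrclosed(enum sk_state cur)
{
	struct sk_result r = { cur, false, false, false };

	assert(cur <= SK_TIME_WAIT);
	switch (cur) {
	case SK_LISTEN:
		r.next = SK_CLOSED;
		/* FALLTHROUGH */
	case SK_CLOSED:
		r.close_pcb = true;
		break;
	case SK_SYN_SENT:
	case SK_SYN_RECEIVED:
		r.need_fin = true;	/* FIN is sent once the handshake completes */
		break;
	case SK_ESTABLISHED:
		r.next = SK_FIN_WAIT_1;
		break;
	case SK_CLOSE_WAIT:
		r.next = SK_LAST_ACK;
		break;
	default:
		break;			/* FIN already sent; nothing to change */
	}
	/* From FIN_WAIT_2 onward the user can be released from close(). */
	if (r.next >= SK_FIN_WAIT_2)
		r.disconnected = true;
	return r;
}
```

The FIN_WAIT_2 timer setup is left out of the sketch; only the transitions and the disconnect notification are modeled.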
|
2007-02-17 21:02:38 +00:00
|
|
|
|
|
|
|
#ifdef DDB
|
|
|
|
static void
|
|
|
|
db_print_indent(int indent)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
|
|
|
for (i = 0; i < indent; i++)
|
|
|
|
db_printf(" ");
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
db_print_tstate(int t_state)
|
|
|
|
{
|
|
|
|
|
|
|
|
switch (t_state) {
|
|
|
|
case TCPS_CLOSED:
|
|
|
|
db_printf("TCPS_CLOSED");
|
|
|
|
return;
|
|
|
|
|
|
|
|
case TCPS_LISTEN:
|
|
|
|
db_printf("TCPS_LISTEN");
|
|
|
|
return;
|
|
|
|
|
|
|
|
case TCPS_SYN_SENT:
|
|
|
|
db_printf("TCPS_SYN_SENT");
|
|
|
|
return;
|
|
|
|
|
|
|
|
case TCPS_SYN_RECEIVED:
|
|
|
|
db_printf("TCPS_SYN_RECEIVED");
|
|
|
|
return;
|
|
|
|
|
|
|
|
case TCPS_ESTABLISHED:
|
|
|
|
db_printf("TCPS_ESTABLISHED");
|
|
|
|
return;
|
|
|
|
|
|
|
|
case TCPS_CLOSE_WAIT:
|
|
|
|
db_printf("TCPS_CLOSE_WAIT");
|
|
|
|
return;
|
|
|
|
|
|
|
|
case TCPS_FIN_WAIT_1:
|
|
|
|
db_printf("TCPS_FIN_WAIT_1");
|
|
|
|
return;
|
|
|
|
|
|
|
|
case TCPS_CLOSING:
|
|
|
|
db_printf("TCPS_CLOSING");
|
|
|
|
return;
|
|
|
|
|
|
|
|
case TCPS_LAST_ACK:
|
|
|
|
db_printf("TCPS_LAST_ACK");
|
|
|
|
return;
|
|
|
|
|
|
|
|
case TCPS_FIN_WAIT_2:
|
|
|
|
db_printf("TCPS_FIN_WAIT_2");
|
|
|
|
return;
|
|
|
|
|
|
|
|
case TCPS_TIME_WAIT:
|
|
|
|
db_printf("TCPS_TIME_WAIT");
|
|
|
|
return;
|
|
|
|
|
|
|
|
default:
|
|
|
|
db_printf("unknown");
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
db_print_tflags(u_int t_flags)
|
|
|
|
{
|
|
|
|
int comma;
|
|
|
|
|
|
|
|
comma = 0;
|
|
|
|
if (t_flags & TF_ACKNOW) {
|
|
|
|
db_printf("%sTF_ACKNOW", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
if (t_flags & TF_DELACK) {
|
|
|
|
db_printf("%sTF_DELACK", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
if (t_flags & TF_NODELAY) {
|
|
|
|
db_printf("%sTF_NODELAY", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
if (t_flags & TF_NOOPT) {
|
|
|
|
db_printf("%sTF_NOOPT", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
if (t_flags & TF_SENTFIN) {
|
|
|
|
db_printf("%sTF_SENTFIN", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
if (t_flags & TF_REQ_SCALE) {
|
|
|
|
db_printf("%sTF_REQ_SCALE", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
if (t_flags & TF_RCVD_SCALE) {
|
|
|
|
db_printf("%sTF_RCVD_SCALE", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
if (t_flags & TF_REQ_TSTMP) {
|
|
|
|
db_printf("%sTF_REQ_TSTMP", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
if (t_flags & TF_RCVD_TSTMP) {
|
|
|
|
db_printf("%sTF_RCVD_TSTMP", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
if (t_flags & TF_SACK_PERMIT) {
|
|
|
|
db_printf("%sTF_SACK_PERMIT", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
if (t_flags & TF_NEEDSYN) {
|
|
|
|
db_printf("%sTF_NEEDSYN", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
if (t_flags & TF_NEEDFIN) {
|
|
|
|
db_printf("%sTF_NEEDFIN", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
if (t_flags & TF_NOPUSH) {
|
|
|
|
db_printf("%sTF_NOPUSH", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
if (t_flags & TF_MORETOCOME) {
|
|
|
|
db_printf("%sTF_MORETOCOME", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
if (t_flags & TF_LQ_OVERFLOW) {
|
|
|
|
db_printf("%sTF_LQ_OVERFLOW", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
if (t_flags & TF_LASTIDLE) {
|
|
|
|
db_printf("%sTF_LASTIDLE", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
if (t_flags & TF_RXWIN0SENT) {
|
|
|
|
db_printf("%sTF_RXWIN0SENT", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
if (t_flags & TF_FASTRECOVERY) {
|
|
|
|
db_printf("%sTF_FASTRECOVERY", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
2010-11-12 06:41:55 +00:00
|
|
|
if (t_flags & TF_CONGRECOVERY) {
|
|
|
|
db_printf("%sTF_CONGRECOVERY", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
2007-02-17 21:02:38 +00:00
|
|
|
if (t_flags & TF_WASFRECOVERY) {
|
|
|
|
db_printf("%sTF_WASFRECOVERY", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
if (t_flags & TF_SIGNATURE) {
|
|
|
|
db_printf("%sTF_SIGNATURE", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
if (t_flags & TF_FORCEDATA) {
|
|
|
|
db_printf("%sTF_FORCEDATA", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
if (t_flags & TF_TSO) {
|
|
|
|
db_printf("%sTF_TSO", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
2015-12-24 19:09:48 +00:00
|
|
|
if (t_flags & TF_FASTOPEN) {
|
|
|
|
db_printf("%sTF_FASTOPEN", comma ? ", " : "");
|
|
|
|
comma = 1;
|
2008-07-31 15:10:09 +00:00
|
|
|
}
|
2007-02-17 21:02:38 +00:00
|
|
|
}
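The repeated if-blocks in db_print_tflags() all follow one pattern: test a bit, print an optional ", " separator, print the flag name. A table-driven variant of that pattern might look like the sketch below. It is an illustration only: `format_flags`, `struct flag_name`, and the `DEMO_*` names and bit values are hypothetical and are not the kernel's `TF_*` definitions, and it writes into a caller-supplied buffer with snprintf() instead of calling db_printf().

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct flag_name {
	unsigned int bit;
	const char *name;
};

/* Hypothetical bit values for illustration; not the kernel's TF_* values. */
static const struct flag_name demo_flags[] = {
	{ 0x1, "DEMO_ACKNOW" },
	{ 0x2, "DEMO_DELACK" },
	{ 0x4, "DEMO_NODELAY" },
};

/* Write the names of the set flags into buf, separated by ", ". */
static size_t
format_flags(unsigned int flags, char *buf, size_t len)
{
	size_t used = 0;
	int comma = 0;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';
	for (size_t i = 0; i < sizeof(demo_flags) / sizeof(demo_flags[0]); i++) {
		if ((flags & demo_flags[i].bit) == 0)
			continue;
		int n = snprintf(buf + used, len - used, "%s%s",
		    comma ? ", " : "", demo_flags[i].name);
		if (n < 0 || (size_t)n >= len - used) {
			buf[used] = '\0';	/* discard the partial entry */
			break;
		}
		used += (size_t)n;
		comma = 1;
	}
	return used;
}
```

With a table, the printed name and the tested bit sit on one line, so a mismatch between the two is easier to spot than in the open-coded form.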
|
|
|
|
|
2019-12-01 21:01:33 +00:00
|
|
|
static void
|
|
|
|
db_print_tflags2(u_int t_flags2)
|
|
|
|
{
|
|
|
|
int comma;
|
|
|
|
|
|
|
|
comma = 0;
|
|
|
|
if (t_flags2 & TF2_ECN_PERMIT) {
|
|
|
|
db_printf("%sTF2_ECN_PERMIT", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2007-02-17 21:02:38 +00:00
|
|
|
static void
|
|
|
|
db_print_toobflags(char t_oobflags)
|
|
|
|
{
|
|
|
|
int comma;
|
|
|
|
|
|
|
|
comma = 0;
|
|
|
|
if (t_oobflags & TCPOOB_HAVEDATA) {
|
|
|
|
db_printf("%sTCPOOB_HAVEDATA", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
if (t_oobflags & TCPOOB_HADDATA) {
|
|
|
|
db_printf("%sTCPOOB_HADDATA", comma ? ", " : "");
|
|
|
|
comma = 1;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
db_print_tcpcb(struct tcpcb *tp, const char *name, int indent)
|
|
|
|
{
|
|
|
|
|
|
|
|
db_print_indent(indent);
|
|
|
|
db_printf("%s at %p\n", name, tp);
|
|
|
|
|
|
|
|
indent += 2;
|
|
|
|
|
|
|
|
db_print_indent(indent);
|
|
|
|
db_printf("t_segq first: %p t_segqlen: %d t_dupacks: %d\n",
|
2018-08-20 12:43:18 +00:00
|
|
|
TAILQ_FIRST(&tp->t_segq), tp->t_segqlen, tp->t_dupacks);
|
2007-02-17 21:02:38 +00:00
|
|
|
|
|
|
|
db_print_indent(indent);
|
2007-09-07 09:19:22 +00:00
|
|
|
db_printf("tt_rexmt: %p tt_persist: %p tt_keep: %p\n",
|
2007-09-24 05:26:24 +00:00
|
|
|
&tp->t_timers->tt_rexmt, &tp->t_timers->tt_persist, &tp->t_timers->tt_keep);
|
2007-02-17 21:02:38 +00:00
|
|
|
|
|
|
|
db_print_indent(indent);
|
2007-09-24 05:26:24 +00:00
|
|
|
db_printf("tt_2msl: %p tt_delack: %p t_inpcb: %p\n", &tp->t_timers->tt_2msl,
|
|
|
|
&tp->t_timers->tt_delack, tp->t_inpcb);
|
2007-02-17 21:02:38 +00:00
|
|
|
|
|
|
|
db_print_indent(indent);
|
|
|
|
db_printf("t_state: %d (", tp->t_state);
|
|
|
|
db_print_tstate(tp->t_state);
|
|
|
|
db_printf(")\n");
|
|
|
|
|
|
|
|
db_print_indent(indent);
|
|
|
|
db_printf("t_flags: 0x%x (", tp->t_flags);
|
|
|
|
db_print_tflags(tp->t_flags);
|
|
|
|
db_printf(")\n");
|
|
|
|
|
2019-12-01 21:01:33 +00:00
|
|
|
db_print_indent(indent);
|
|
|
|
db_printf("t_flags2: 0x%x (", tp->t_flags2);
|
|
|
|
db_print_tflags2(tp->t_flags2);
|
|
|
|
db_printf(")\n");
|
|
|
|
|
2007-02-17 21:02:38 +00:00
|
|
|
db_print_indent(indent);
|
|
|
|
db_printf("snd_una: 0x%08x snd_max: 0x%08x snd_nxt: 0x%08x\n",
|
|
|
|
tp->snd_una, tp->snd_max, tp->snd_nxt);
|
|
|
|
|
|
|
|
db_print_indent(indent);
	db_printf("snd_up: 0x%08x snd_wl1: 0x%08x snd_wl2: 0x%08x\n",
	    tp->snd_up, tp->snd_wl1, tp->snd_wl2);

	db_print_indent(indent);
	db_printf("iss: 0x%08x irs: 0x%08x rcv_nxt: 0x%08x\n",
	    tp->iss, tp->irs, tp->rcv_nxt);

	db_print_indent(indent);
	db_printf("rcv_adv: 0x%08x rcv_wnd: %u rcv_up: 0x%08x\n",
	    tp->rcv_adv, tp->rcv_wnd, tp->rcv_up);

	db_print_indent(indent);
	db_printf("snd_wnd: %u snd_cwnd: %u\n",
	    tp->snd_wnd, tp->snd_cwnd);

	db_print_indent(indent);
	db_printf("snd_ssthresh: %u snd_recover: "
	    "0x%08x\n", tp->snd_ssthresh, tp->snd_recover);

	db_print_indent(indent);
	db_printf("t_rcvtime: %u t_starttime: %u\n",
	    tp->t_rcvtime, tp->t_starttime);

	db_print_indent(indent);
	db_printf("t_rtttime: %u t_rtseq: 0x%08x\n",
	    tp->t_rtttime, tp->t_rtseq);

	db_print_indent(indent);
	db_printf("t_rxtcur: %d t_maxseg: %u t_srtt: %d\n",
	    tp->t_rxtcur, tp->t_maxseg, tp->t_srtt);

	db_print_indent(indent);
	db_printf("t_rttvar: %d t_rxtshift: %d t_rttmin: %u "
	    "t_rttbest: %u\n", tp->t_rttvar, tp->t_rxtshift, tp->t_rttmin,
	    tp->t_rttbest);

	db_print_indent(indent);
	db_printf("t_rttupdated: %lu max_sndwnd: %u t_softerror: %d\n",
	    tp->t_rttupdated, tp->max_sndwnd, tp->t_softerror);

	db_print_indent(indent);
	db_printf("t_oobflags: 0x%x (", tp->t_oobflags);
	db_print_toobflags(tp->t_oobflags);
	db_printf(") t_iobc: 0x%02x\n", tp->t_iobc);

	db_print_indent(indent);
	db_printf("snd_scale: %u rcv_scale: %u request_r_scale: %u\n",
	    tp->snd_scale, tp->rcv_scale, tp->request_r_scale);

	db_print_indent(indent);
	db_printf("ts_recent: %u ts_recent_age: %u\n",
	    tp->ts_recent, tp->ts_recent_age);

	db_print_indent(indent);
	db_printf("ts_offset: %u last_ack_sent: 0x%08x snd_cwnd_prev: "
	    "%u\n", tp->ts_offset, tp->last_ack_sent, tp->snd_cwnd_prev);

	db_print_indent(indent);
	db_printf("snd_ssthresh_prev: %u snd_recover_prev: 0x%08x "
	    "t_badrxtwin: %u\n", tp->snd_ssthresh_prev,
	    tp->snd_recover_prev, tp->t_badrxtwin);

	db_print_indent(indent);
	db_printf("snd_numholes: %d snd_holes first: %p\n",
	    tp->snd_numholes, TAILQ_FIRST(&tp->snd_holes));

	db_print_indent(indent);
	db_printf("snd_fack: 0x%08x rcv_numsacks: %d\n",
	    tp->snd_fack, tp->rcv_numsacks);

	/* Skip sackblks, sackhint. */

	db_print_indent(indent);
	db_printf("t_rttlow: %d rfbuf_ts: %u rfbuf_cnt: %d\n",
	    tp->t_rttlow, tp->rfbuf_ts, tp->rfbuf_cnt);
}

DB_SHOW_COMMAND(tcpcb, db_show_tcpcb)
{
	struct tcpcb *tp;

	if (!have_addr) {
		db_printf("usage: show tcpcb <addr>\n");
		return;
	}
	tp = (struct tcpcb *)addr;

	db_print_tcpcb(tp, "tcpcb", 0);
}
#endif