2001-11-22 04:50:44 +00:00
|
|
|
/*-
|
2005-01-30 19:28:27 +00:00
|
|
|
* Copyright (c) 2001 McAfee, Inc.
|
2013-07-11 15:29:25 +00:00
|
|
|
* Copyright (c) 2006,2013 Andre Oppermann, Internet Business Solutions AG
|
2001-11-22 04:50:44 +00:00
|
|
|
* All rights reserved.
|
|
|
|
*
|
|
|
|
* This software was developed for the FreeBSD Project by Jonathan Lemon
|
2005-01-30 19:28:27 +00:00
|
|
|
* and McAfee Research, the Security Research Division of McAfee, Inc. under
|
|
|
|
* DARPA/SPAWAR contract N66001-01-C-8035 ("CBOSS"), as part of the
|
2013-07-11 15:29:25 +00:00
|
|
|
* DARPA CHATS research program. [2001 McAfee, Inc.]
|
2001-11-22 04:50:44 +00:00
|
|
|
*
|
|
|
|
* Redistribution and use in source and binary forms, with or without
|
|
|
|
* modification, are permitted provided that the following conditions
|
|
|
|
* are met:
|
|
|
|
* 1. Redistributions of source code must retain the above copyright
|
|
|
|
* notice, this list of conditions and the following disclaimer.
|
|
|
|
* 2. Redistributions in binary form must reproduce the above copyright
|
|
|
|
* notice, this list of conditions and the following disclaimer in the
|
|
|
|
* documentation and/or other materials provided with the distribution.
|
|
|
|
*
|
|
|
|
* THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
|
|
|
|
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
|
|
|
|
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
|
|
|
|
* ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
|
|
|
|
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
|
|
|
|
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
|
|
|
|
* OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
|
|
|
|
* HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
|
|
|
|
* LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
|
|
|
|
* OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
|
|
|
|
* SUCH DAMAGE.
|
|
|
|
*/
|
|
|
|
|
2007-10-07 20:44:24 +00:00
|
|
|
#include <sys/cdefs.h>
|
|
|
|
__FBSDID("$FreeBSD$");
|
|
|
|
|
Initial import of RFC 2385 (TCP-MD5) digest support.
This is the first of two commits; bringing in the kernel support first.
This can be enabled by compiling a kernel with options TCP_SIGNATURE
and FAST_IPSEC.
For the uninitiated, this is a TCP option which provides for a means of
authenticating TCP sessions which came into being before IPSEC. It is
still relevant today, however, as it is used by many commercial router
vendors, particularly with BGP, and as such has become a requirement for
interconnect at many major Internet points of presence.
Several parts of the TCP and IP headers, the segment payload, and a
shared secret are digested with MD5. The PF_KEY interface
is used to manage the secrets using security associations in the SADB.
There is a limitation here: there is no way to map a TCP flow per-port
back to an SPI without polluting tcpcb or using the SPD, and the code
to do the latter is unstable at this time. Therefore this code only
supports per-host keying granularity.
Whilst FAST_IPSEC is mutually exclusive with KAME IPSEC (and thus IPv6),
TCP_SIGNATURE applies only to IPv4. For the vast majority of prospective
users of this feature, this will not pose any problem.
This implementation is output-only; that is, the option is honoured when
responding to a host initiating a TCP session, but no effort is made
[yet] to authenticate inbound traffic. This is, however, sufficient to
interwork with Cisco equipment.
Tested with a Cisco 2501 running IOS 12.0(27), and Quagga 0.96.4 with
local patches. Patches for tcpdump to validate TCP-MD5 sessions are also
available from me upon request.
Sponsored by: sentex.net
2004-02-11 04:26:04 +00:00
|
|
|
#include "opt_inet.h"
|
2001-11-22 04:50:44 +00:00
|
|
|
#include "opt_inet6.h"
|
|
|
|
#include "opt_ipsec.h"
|
Implement a CPU-affine TCP and UDP connection lookup data structure,
struct inpcbgroup. pcbgroups, or "connection groups", supplement the
existing inpcbinfo connection hash table, which when pcbgroups are
enabled, might now be thought of more usefully as a per-protocol
4-tuple reservation table.
Connections are assigned to connection groups based on a hash of their
4-tuple; wildcard sockets require special handling, and are members
of all connection groups. During a connection lookup, a
per-connection group lock is employed rather than the global pcbinfo
lock. By aligning connection groups with input path processing,
connection groups take on an effective CPU affinity, especially when
aligned with RSS work placement (see a forthcoming commit for
details). This eliminates cache line migration associated with
global, protocol-layer data structures in steady state TCP and UDP
processing (with the exception of protocol-layer statistics; further
commit to follow).
Elements of this approach were inspired by Willman, Rixner, and Cox's
2006 USENIX paper, "An Evaluation of Network Stack Parallelization
Strategies in Modern Operating Systems". However, there are also
significant differences: we maintain the inpcb lock, rather than using
the connection group lock for per-connection state.
Likewise, the focus of this implementation is alignment with NIC
packet distribution strategies such as RSS, rather than pure software
strategies. Despite that focus, software distribution is supported
through the parallel netisr implementation, and works well in
configurations where the number of hardware threads is greater than
the number of NIC input queues, such as in the RMI XLR threaded MIPS
architecture.
Another important difference is the continued maintenance of existing
hash tables as "reservation tables" -- these are useful both to
distinguish the resource allocation aspect of protocol name management
and the more common-case lookup aspect. In configurations where
connection tables are aligned with hardware hashes, it is desirable to
use the traditional lookup tables for loopback or encapsulated traffic
rather than take the expense of hardware hashes that are hard to
implement efficiently in software (such as RSS Toeplitz).
Connection group support is enabled by compiling "options PCBGROUP"
into your kernel configuration; for the time being, this is an
experimental feature, and hence is not enabled by default.
Subject to the limited MFCability of change dependencies in inpcb,
and its change to the inpcbinfo init function signature, this change
in principle could be merged to FreeBSD 8.x.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-06-06 12:55:02 +00:00
|
|
|
#include "opt_pcbgroup.h"
|
2001-11-22 04:50:44 +00:00
|
|
|
|
|
|
|
#include <sys/param.h>
|
|
|
|
#include <sys/systm.h>
|
Use Jenkins hash for TCP syncache.
o Unlike xor, in Jenkins hash every bit of input affects virtually
every bit of output, thus salting the hash actually works. With
xor, salting only provides a false sense of security, since if
hash(x) collides with hash(y), then of course, hash(x) ^ salt
would also collide with hash(y) ^ salt. [1]
o Jenkins provides much better distribution than xor, very close to
ideal.
A TCP connection setup/teardown benchmark has shown a 10% increase
with the default hash size, and with bigger hashes that still provide
the possibility of collisions. With an enormous hash size, when the
dataset is an order of magnitude smaller than the hash size, the
benchmark has shown a 4% decrease in performance, which is expected
and acceptable.
Noticed by: Jeffrey Knockel <jeffk cs.unm.edu> [1]
Benchmarks by: jch
Reviewed by: jch, pkelsey, delphij
Security: strengthens protection against hash collision DoS
Sponsored by: Nginx, Inc.
2015-09-05 10:15:19 +00:00
|
|
|
#include <sys/hash.h>
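The xor argument in the commit message above can be checked directly. The following is a minimal sketch, not the kernel's code: `xor_hash` and `xor_collides` are hypothetical names, and the three-word input is an assumption chosen for brevity. It shows that two inputs which collide under an xor-fold hash keep colliding for every salt, because the salt cancels out of the comparison.

```c
#include <stdint.h>

/*
 * Hypothetical xor-fold hash: xor the three input words together,
 * then xor in the salt. Not the kernel's hash function.
 */
uint32_t
xor_hash(uint32_t a, uint32_t b, uint32_t c, uint32_t salt)
{

	return (a ^ b ^ c ^ salt);
}

/*
 * Returns 1 if inputs x and y hash to the same value under the
 * given salt, 0 otherwise.
 */
int
xor_collides(const uint32_t x[3], const uint32_t y[3], uint32_t salt)
{

	return (xor_hash(x[0], x[1], x[2], salt) ==
	    xor_hash(y[0], y[1], y[2], salt));
}
```

For example, `{1, 2, 3}` and `{4, 5, 1}` both xor-fold to 0, so `xor_collides` reports a collision whatever salt is supplied. An attacker who finds one colliding pair offline therefore does not need to know the salt.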
|
2015-12-16 00:56:45 +00:00
|
|
|
#include <sys/refcount.h>
|
2001-11-22 04:50:44 +00:00
|
|
|
#include <sys/kernel.h>
|
|
|
|
#include <sys/sysctl.h>
|
Fix bugs in the TCP syncache timeout code, including:
When system ticks are positive, for entries in the cache
bucket, syncache_timer() ran on every tick (doing nothing
useful) instead of the supposed 3, 6, 12, and 24 seconds
later (when it's time to retransmit SYN,ACK).
When ticks are negative, syncache_timer() was scheduled
too far in the future (up to ~25 days on systems with
HZ=1000), no SYN,ACK retransmits were attempted at all,
and syncache entries added in that period that correspond
to non-established connections stayed there forever.
Only HEAD and RELENG_7 are affected.
Reviewed by: silby, kmacy (earlier version)
Submitted by: Maxim Dounin, ru
2007-12-19 16:56:28 +00:00
|
|
|
#include <sys/limits.h>
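The commit message above describes timer deadlines misbehaving depending on the sign of the tick counter. One common way to make such comparisons independent of sign and wraparound is to compare the signed difference of the two values rather than the values themselves. The sketch below illustrates that general technique only; `ticks_leq` is a hypothetical helper, and this is not claimed to be the change the commit actually made.

```c
#include <stdint.h>

/*
 * Returns nonzero if tick value a is at or before tick value b,
 * treating the counter as a 32-bit value that wraps around.
 * The subtraction is done in unsigned arithmetic (well defined on
 * overflow) and the result is then reinterpreted as signed.
 */
int
ticks_leq(int32_t a, int32_t b)
{

	return ((int32_t)((uint32_t)a - (uint32_t)b) <= 0);
}
```

With this comparison, `INT32_MAX` counts as earlier than `INT32_MIN`, which is the tick that follows it after the counter wraps. A plain `a <= b` would give the opposite answer at that boundary. The comparison is only meaningful when the two values are less than half the counter range apart.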
|
2006-06-17 17:32:38 +00:00
|
|
|
#include <sys/lock.h>
|
|
|
|
#include <sys/mutex.h>
|
2001-11-22 04:50:44 +00:00
|
|
|
#include <sys/malloc.h>
|
|
|
|
#include <sys/mbuf.h>
|
|
|
|
#include <sys/proc.h> /* for proc0 declaration */
|
|
|
|
#include <sys/random.h>
|
|
|
|
#include <sys/socket.h>
|
|
|
|
#include <sys/socketvar.h>
|
2007-05-18 21:13:01 +00:00
|
|
|
#include <sys/syslog.h>
|
2008-08-23 14:22:12 +00:00
|
|
|
#include <sys/ucred.h>
|
2001-11-22 04:50:44 +00:00
|
|
|
|
2013-07-11 15:29:25 +00:00
|
|
|
#include <sys/md5.h>
|
|
|
|
#include <crypto/siphash/siphash.h>
|
|
|
|
|
2006-09-13 13:21:17 +00:00
|
|
|
#include <vm/uma.h>
|
|
|
|
|
2001-11-22 04:50:44 +00:00
|
|
|
#include <net/if.h>
|
2013-10-26 17:58:36 +00:00
|
|
|
#include <net/if_var.h>
|
2001-11-22 04:50:44 +00:00
|
|
|
#include <net/route.h>
|
Build on Jeff Roberson's linker-set based dynamic per-CPU allocator
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator. Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...). This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.
Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack. Virtualized global variables are
tagged at compile-time, placing them in a special linker set, which is
loaded into a contiguous region of kernel memory. Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of the kernel linker.
Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy. Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address. When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.
This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.
Bump __FreeBSD_version and update UPDATING.
Portions submitted by: bz
Reviewed by: bz, zec
Discussed with: gnn, jamie, jeff, jhb, julian, sam
Suggested by: peter
Approved by: re (kensmith)
2009-07-14 22:48:30 +00:00
|
|
|
#include <net/vnet.h>
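The accessor scheme described in the commit message above (a statically initialized reference copy, one private copy per network stack instance, and a macro that turns an instance plus a symbol into an address) can be modelled in user space. This is a simplified sketch under stated assumptions: `struct vnet_ref_model`, `struct vnet_model`, `vnet_model_alloc`, `vnet_model_free`, and `VNET_MODEL` are all hypothetical names, a plain struct stands in for the linker set, and only `int` variables are supported. It is not the kernel's `VNET()` implementation.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the linker set holding the reference copies. */
struct vnet_ref_model {
	int	tcp_syncookies;
	int	tcp_syncookiesonly;
};

/* Statically initialized reference copy. */
static const struct vnet_ref_model vnet_ref = { 1, 0 };

/* Hypothetical stack instance: a base pointer to its private copy. */
struct vnet_model {
	unsigned char	*base;
};

/* Create an instance whose variables start as copies of the reference. */
struct vnet_model *
vnet_model_alloc(void)
{
	struct vnet_model *vn;

	vn = malloc(sizeof(*vn));
	if (vn == NULL)
		return (NULL);
	vn->base = malloc(sizeof(vnet_ref));
	if (vn->base == NULL) {
		free(vn);
		return (NULL);
	}
	memcpy(vn->base, &vnet_ref, sizeof(vnet_ref));
	return (vn);
}

/* Release an instance and its private copy. */
void
vnet_model_free(struct vnet_model *vn)
{

	free(vn->base);
	free(vn);
}

/* Accessor: instance base plus the symbol's offset in the reference copy. */
#define	VNET_MODEL(vn, name)						\
	(*(int *)(void *)((vn)->base + offsetof(struct vnet_ref_model, name)))
```

Two instances created with `vnet_model_alloc()` both start with `tcp_syncookies` equal to 1, and writing `VNET_MODEL(a, tcp_syncookies) = 0` leaves the value seen through the other instance unchanged.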
|
2001-11-22 04:50:44 +00:00
|
|
|
|
|
|
|
#include <netinet/in.h>
|
|
|
|
#include <netinet/in_systm.h>
|
|
|
|
#include <netinet/ip.h>
|
|
|
|
#include <netinet/in_var.h>
|
|
|
|
#include <netinet/in_pcb.h>
|
|
|
|
#include <netinet/ip_var.h>
|
2005-11-18 20:12:40 +00:00
|
|
|
#include <netinet/ip_options.h>
|
2001-11-22 04:50:44 +00:00
|
|
|
#ifdef INET6
|
|
|
|
#include <netinet/ip6.h>
|
|
|
|
#include <netinet/icmp6.h>
|
|
|
|
#include <netinet6/nd6.h>
|
|
|
|
#include <netinet6/ip6_var.h>
|
|
|
|
#include <netinet6/in6_pcb.h>
|
|
|
|
#endif
|
|
|
|
#include <netinet/tcp.h>
|
2015-12-24 19:09:48 +00:00
|
|
|
#ifdef TCP_RFC7413
|
|
|
|
#include <netinet/tcp_fastopen.h>
|
|
|
|
#endif
|
2001-11-22 04:50:44 +00:00
|
|
|
#include <netinet/tcp_fsm.h>
|
|
|
|
#include <netinet/tcp_seq.h>
|
|
|
|
#include <netinet/tcp_timer.h>
|
|
|
|
#include <netinet/tcp_var.h>
|
2007-07-27 00:57:06 +00:00
|
|
|
#include <netinet/tcp_syncache.h>
|
2001-11-22 04:50:44 +00:00
|
|
|
#ifdef INET6
|
|
|
|
#include <netinet6/tcp6_var.h>
|
|
|
|
#endif
|
2012-06-19 07:34:13 +00:00
|
|
|
#ifdef TCP_OFFLOAD
|
|
|
|
#include <netinet/toecore.h>
|
|
|
|
#endif
|
2001-11-22 04:50:44 +00:00
|
|
|
|
2017-02-06 08:49:57 +00:00
|
|
|
#include <netipsec/ipsec_support.h>
|
2002-10-16 02:25:05 +00:00
|
|
|
|
2001-11-22 04:50:44 +00:00
|
|
|
#include <machine/in_cksum.h>
|
|
|
|
|
2006-10-22 11:52:19 +00:00
|
|
|
#include <security/mac/mac_framework.h>
|
|
|
|
|
2010-11-22 19:32:54 +00:00
|
|
|
static VNET_DEFINE(int, tcp_syncookies) = 1;
|
2009-07-16 21:13:04 +00:00
|
|
|
#define V_tcp_syncookies VNET(tcp_syncookies)
|
2014-11-07 09:39:05 +00:00
|
|
|
SYSCTL_INT(_net_inet_tcp, OID_AUTO, syncookies, CTLFLAG_VNET | CTLFLAG_RW,
|
2009-07-14 22:48:30 +00:00
|
|
|
&VNET_NAME(tcp_syncookies), 0,
|
2001-12-19 06:12:14 +00:00
|
|
|
"Use TCP SYN cookies if the syncache overflows");
|
|
|
|
|
2010-11-22 19:32:54 +00:00
|
|
|
static VNET_DEFINE(int, tcp_syncookiesonly) = 0;
|
2010-04-29 11:52:42 +00:00
|
|
|
#define V_tcp_syncookiesonly VNET(tcp_syncookiesonly)
|
2014-11-07 09:39:05 +00:00
|
|
|
SYSCTL_INT(_net_inet_tcp, OID_AUTO, syncookies_only, CTLFLAG_VNET | CTLFLAG_RW,
|
2009-07-14 22:48:30 +00:00
|
|
|
&VNET_NAME(tcp_syncookiesonly), 0,
|
2006-09-13 13:08:27 +00:00
|
|
|
"Use only TCP SYN cookies");
|
|
|
|
|
2017-01-27 23:10:46 +00:00
|
|
|
static VNET_DEFINE(int, functions_inherit_listen_socket_stack) = 1;
|
|
|
|
#define V_functions_inherit_listen_socket_stack \
|
|
|
|
VNET(functions_inherit_listen_socket_stack)
|
|
|
|
SYSCTL_INT(_net_inet_tcp, OID_AUTO, functions_inherit_listen_socket_stack,
|
|
|
|
CTLFLAG_VNET | CTLFLAG_RW,
|
|
|
|
&VNET_NAME(functions_inherit_listen_socket_stack), 0,
|
|
|
|
"Inherit listen socket's stack");
|
|
|
|
|
2012-06-19 07:34:13 +00:00
|
|
|
#ifdef TCP_OFFLOAD
|
|
|
|
#define ADDED_BY_TOE(sc) ((sc)->sc_tod != NULL)
|
2007-12-17 07:56:27 +00:00
|
|
|
#endif
|
|
|
|
|
2001-11-22 04:50:44 +00:00
|
|
|
static void syncache_drop(struct syncache *, struct syncache_head *);
|
|
|
|
static void syncache_free(struct syncache *);
|
2001-12-19 06:12:14 +00:00
|
|
|
static void syncache_insert(struct syncache *, struct syncache_head *);
|
2016-04-29 07:23:08 +00:00
|
|
|
static int syncache_respond(struct syncache *, struct syncache_head *, int,
|
|
|
|
const struct mbuf *);
|
2004-08-16 18:32:07 +00:00
|
|
|
static struct socket *syncache_socket(struct syncache *, struct socket *,
|
2002-05-14 18:57:55 +00:00
|
|
|
struct mbuf *m);
|
2007-07-28 12:02:05 +00:00
|
|
|
static void syncache_timeout(struct syncache *sc, struct syncache_head *sch,
|
|
|
|
int docallout);
|
2001-11-22 04:50:44 +00:00
|
|
|
static void syncache_timer(void *);
|
2013-07-11 15:29:25 +00:00
|
|
|
|
|
|
|
static uint32_t syncookie_mac(struct in_conninfo *, tcp_seq, uint8_t,
|
|
|
|
uint8_t *, uintptr_t);
|
|
|
|
static tcp_seq syncookie_generate(struct syncache_head *, struct syncache *);
|
2006-06-17 17:49:11 +00:00
|
|
|
static struct syncache
|
2006-09-13 13:08:27 +00:00
|
|
|
*syncookie_lookup(struct in_conninfo *, struct syncache_head *,
|
2013-07-11 15:29:25 +00:00
|
|
|
struct syncache *, struct tcphdr *, struct tcpopt *,
|
2006-06-17 17:49:11 +00:00
|
|
|
struct socket *);
|
2013-07-11 15:29:25 +00:00
|
|
|
static void syncookie_reseed(void *);
|
|
|
|
#ifdef INVARIANTS
|
|
|
|
static int syncookie_cmp(struct in_conninfo *inc, struct syncache_head *sch,
|
|
|
|
struct syncache *sc, struct tcphdr *th, struct tcpopt *to,
|
|
|
|
struct socket *lso);
|
|
|
|
#endif
|
2001-11-22 04:50:44 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Transmit the SYN,ACK fewer times than TCP_MAXRXTSHIFT specifies.
|
2007-12-19 16:56:28 +00:00
|
|
|
* 3 retransmits corresponds to a timeout of 3 * (1 + 2 + 4 + 8) == 45 seconds,
|
2001-11-22 04:50:44 +00:00
|
|
|
* the odds are that the user has given up attempting to connect by then.
|
|
|
|
*/
|
|
|
|
#define SYNCACHE_MAXREXMTS 3
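The arithmetic in the comment above can be reproduced with a short helper. This is a sketch only: `synack_total_timeout` is a hypothetical function, and the 3-second base interval with pure doubling is an assumption read off the comment's own formula, not taken from the kernel's backoff tables.

```c
/*
 * Total time spent waiting, in seconds, when the initial SYN,ACK is
 * followed by maxrexmts retransmits and the interval doubles each
 * time, starting from base_secs.
 */
unsigned int
synack_total_timeout(unsigned int base_secs, unsigned int maxrexmts)
{
	unsigned int i, total;

	total = 0;
	/* One interval after the initial send, plus one per retransmit. */
	for (i = 0; i <= maxrexmts; i++)
		total += base_secs << i;
	return (total);
}
```

With a base of 3 seconds and 3 retransmits the intervals are 3, 6, 12, and 24 seconds, so `synack_total_timeout(3, 3)` returns 45, matching `3 * (1 + 2 + 4 + 8)`.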
|
|
|
|
|
|
|
|
/* Arbitrary values */
|
|
|
|
#define TCP_SYNCACHE_HASHSIZE 512
|
|
|
|
#define TCP_SYNCACHE_BUCKETLIMIT 30
|
|
|
|
|
2010-11-22 19:32:54 +00:00
|
|
|
static VNET_DEFINE(struct tcp_syncache, tcp_syncache);
|
2010-04-29 11:52:42 +00:00
|
|
|
#define V_tcp_syncache VNET(tcp_syncache)
|
|
|
|
|
2011-11-07 15:43:11 +00:00
|
|
|
static SYSCTL_NODE(_net_inet_tcp, OID_AUTO, syncache, CTLFLAG_RW, 0,
|
|
|
|
"TCP SYN cache");
|
2001-11-22 04:50:44 +00:00
|
|
|
|
2014-11-07 09:39:05 +00:00
|
|
|
SYSCTL_UINT(_net_inet_tcp_syncache, OID_AUTO, bucketlimit, CTLFLAG_VNET | CTLFLAG_RDTUN,
|
2009-07-14 22:48:30 +00:00
|
|
|
&VNET_NAME(tcp_syncache.bucket_limit), 0,
|
|
|
|
"Per-bucket hash limit for syncache");
|
2001-11-22 04:50:44 +00:00
|
|
|
|
2014-11-07 09:39:05 +00:00
|
|
|
SYSCTL_UINT(_net_inet_tcp_syncache, OID_AUTO, cachelimit, CTLFLAG_VNET | CTLFLAG_RDTUN,
|
2009-07-14 22:48:30 +00:00
|
|
|
&VNET_NAME(tcp_syncache.cache_limit), 0,
|
|
|
|
"Overall entry limit for syncache");
|
2001-11-22 04:50:44 +00:00
|
|
|
|
2014-02-07 14:31:51 +00:00
|
|
|
SYSCTL_UMA_CUR(_net_inet_tcp_syncache, OID_AUTO, count, CTLFLAG_VNET,
|
|
|
|
&VNET_NAME(tcp_syncache.zone), "Current number of entries in syncache");
|
2001-11-22 04:50:44 +00:00
|
|
|
|
2014-11-07 09:39:05 +00:00
|
|
|
SYSCTL_UINT(_net_inet_tcp_syncache, OID_AUTO, hashsize, CTLFLAG_VNET | CTLFLAG_RDTUN,
|
2009-07-14 22:48:30 +00:00
|
|
|
&VNET_NAME(tcp_syncache.hashsize), 0,
|
|
|
|
"Size of TCP syncache hashtable");
|
2001-11-22 04:50:44 +00:00
|
|
|
|
2014-11-07 09:39:05 +00:00
|
|
|
SYSCTL_UINT(_net_inet_tcp_syncache, OID_AUTO, rexmtlimit, CTLFLAG_VNET | CTLFLAG_RW,
|
2009-07-14 22:48:30 +00:00
|
|
|
&VNET_NAME(tcp_syncache.rexmt_limit), 0,
|
|
|
|
"Limit on SYN/ACK retransmissions");
|
2001-11-22 04:50:44 +00:00
|
|
|
|
2010-04-29 11:52:42 +00:00
|
|
|
VNET_DEFINE(int, tcp_sc_rst_sock_fail) = 1;
|
2014-11-07 09:39:05 +00:00
|
|
|
SYSCTL_INT(_net_inet_tcp_syncache, OID_AUTO, rst_on_sock_fail,
|
|
|
|
CTLFLAG_VNET | CTLFLAG_RW, &VNET_NAME(tcp_sc_rst_sock_fail), 0,
|
Build on Jeff Roberson's linker-set based dynamic per-CPU allocator
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator. Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...). This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.
Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack. Virtualized global variables are
tagged at compile-time, placing the in a special linker set, which is
loaded into a contiguous region of kernel memory. Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of the kernel linker.
Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy. Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address. When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.
This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.
Bump __FreeBSD_version and update UPDATING.
Portions submitted by: bz
Reviewed by: bz, zec
Discussed with: gnn, jamie, jeff, jhb, julian, sam
Suggested by: peter
Approved by: re (kensmith)
2009-07-14 22:48:30 +00:00
|
|
|
"Send reset on socket allocation failure");
|
2007-05-28 11:03:53 +00:00
|
|
|
|
2001-11-22 04:50:44 +00:00
|
|
|
static MALLOC_DEFINE(M_SYNCACHE, "syncache", "TCP syncache");
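The per-vnet storage scheme described in the commit message above (a statically initialized "reference copy", one private copy per network stack instance, and an accessor that turns an instance plus a symbol into an address) can be illustrated with a small user-space sketch. Everything named `toy_*` or `TOY_*` here is hypothetical and is not the kernel's actual allocator or macro set; the struct stands in for the linker set, and the value given to `rexmt_limit` is arbitrary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for the linker set: every virtualized global
 * lives in one contiguous "reference copy" with static initializers.
 */
struct toy_vnet_set {
	int	rst_on_sock_fail;
	int	rexmt_limit;
};

static const struct toy_vnet_set toy_reference = {
	.rst_on_sock_fail = 1,
	.rexmt_limit = 3,	/* arbitrary illustrative value */
};

/* One network stack instance: a pointer to its private copy. */
struct toy_vnet {
	unsigned char	*data;
};

/* Create an instance whose variables start from the reference values. */
static struct toy_vnet *
toy_vnet_alloc(void)
{
	struct toy_vnet *vnet = malloc(sizeof(*vnet));

	if (vnet == NULL)
		return (NULL);
	vnet->data = malloc(sizeof(toy_reference));
	if (vnet->data == NULL) {
		free(vnet);
		return (NULL);
	}
	memcpy(vnet->data, &toy_reference, sizeof(toy_reference));
	return (vnet);
}

static void
toy_vnet_free(struct toy_vnet *vnet)
{
	free(vnet->data);
	free(vnet);
}

/* Accessor: (instance, symbol) -> lvalue inside that instance's copy. */
#define	TOY_VNET(vnet, type, field)					\
	(*(type *)((vnet)->data + offsetof(struct toy_vnet_set, field)))
```

With two instances `a` and `b`, writing `TOY_VNET(a, int, rst_on_sock_fail) = 0` changes only `a`'s copy; `b` and the reference copy keep the statically initialized value. A build without virtualization would skip the indirection and use a plain global, which is the trade-off the commit message describes.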
|
|
|
|
|
2006-06-17 17:32:38 +00:00
|
|
|
#define SCH_LOCK(sch) mtx_lock(&(sch)->sch_mtx)
|
|
|
|
#define SCH_UNLOCK(sch) mtx_unlock(&(sch)->sch_mtx)
|
|
|
|
#define SCH_LOCK_ASSERT(sch) mtx_assert(&(sch)->sch_mtx, MA_OWNED)
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Requires the syncache entry to be already removed from the bucket list.
|
|
|
|
*/
|
2001-11-22 04:50:44 +00:00
|
|
|
static void
|
|
|
|
syncache_free(struct syncache *sc)
|
|
|
|
{
|
Step 1.5 of importing the network stack virtualization infrastructure
from the vimage project, as per plan established at devsummit 08/08:
http://wiki.freebsd.org/Image/Notes200808DevSummit
Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator
macros, and CURVNET_SET() context setting macros, all currently
resolving to NOPs.
Prepare for virtualization of selected SYSCTL objects by introducing a
family of SYSCTL_V_*() macros, currently resolving to their global
counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT().
Move selected #defines from sys/sys/vimage.h to newly introduced header
files specific to virtualized subsystems (sys/net/vnet.h,
sys/netinet/vinet.h etc.).
All the changes are verified to have zero functional impact at this
point in time by doing MD5 comparison between pre- and post-change
object files(*).
(*) netipsec/keysock.c did not validate depending on compile time options.
Implemented by: julian, bz, brooks, zec
Reviewed by: julian, bz, brooks, kris, rwatson, ...
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation
2008-10-02 15:37:58 +00:00
|
|
|
|
2001-11-22 04:50:44 +00:00
|
|
|
if (sc->sc_ipopts)
|
|
|
|
(void) m_free(sc->sc_ipopts);
|
2008-08-23 14:22:12 +00:00
|
|
|
if (sc->sc_cred)
|
|
|
|
crfree(sc->sc_cred);
|
2006-12-13 06:00:57 +00:00
|
|
|
#ifdef MAC
|
2007-10-25 14:37:37 +00:00
|
|
|
mac_syncache_destroy(&sc->sc_label);
|
2006-12-13 06:00:57 +00:00
|
|
|
#endif
|
2003-11-20 20:07:39 +00:00
|
|
|
|
Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).
This is the first in a series of commits over the course
of the next few weeks.
Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.
We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.
Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch
2008-08-17 23:27:27 +00:00
|
|
|
uma_zfree(V_tcp_syncache.zone, sc);
|
2001-11-22 04:50:44 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
syncache_init(void)
|
|
|
|
{
|
|
|
|
int i;
|
|
|
|
|
2008-08-17 23:27:27 +00:00
|
|
|
V_tcp_syncache.hashsize = TCP_SYNCACHE_HASHSIZE;
|
|
|
|
V_tcp_syncache.bucket_limit = TCP_SYNCACHE_BUCKETLIMIT;
|
|
|
|
V_tcp_syncache.rexmt_limit = SYNCACHE_MAXREXMTS;
|
|
|
|
V_tcp_syncache.hash_secret = arc4random();
|
2001-11-22 04:50:44 +00:00
|
|
|
|
2004-08-16 18:32:07 +00:00
|
|
|
TUNABLE_INT_FETCH("net.inet.tcp.syncache.hashsize",
|
2008-08-17 23:27:27 +00:00
|
|
|
&V_tcp_syncache.hashsize);
|
2004-08-16 18:32:07 +00:00
|
|
|
TUNABLE_INT_FETCH("net.inet.tcp.syncache.bucketlimit",
|
2008-08-17 23:27:27 +00:00
|
|
|
&V_tcp_syncache.bucket_limit);
|
2008-08-20 01:05:56 +00:00
|
|
|
if (!powerof2(V_tcp_syncache.hashsize) ||
|
|
|
|
V_tcp_syncache.hashsize == 0) {
|
2004-08-16 18:32:07 +00:00
|
|
|
printf("WARNING: syncache hash size is not a power of 2.\n");
|
2008-08-17 23:27:27 +00:00
|
|
|
V_tcp_syncache.hashsize = TCP_SYNCACHE_HASHSIZE;
|
2004-08-16 18:32:07 +00:00
|
|
|
}
|
2008-08-17 23:27:27 +00:00
|
|
|
V_tcp_syncache.hashmask = V_tcp_syncache.hashsize - 1;
|
2001-11-22 04:50:44 +00:00
|
|
|
|
2006-06-17 17:32:38 +00:00
|
|
|
/* Set limits. */
|
2008-08-17 23:27:27 +00:00
|
|
|
V_tcp_syncache.cache_limit =
|
|
|
|
V_tcp_syncache.hashsize * V_tcp_syncache.bucket_limit;
|
2006-06-17 17:32:38 +00:00
|
|
|
TUNABLE_INT_FETCH("net.inet.tcp.syncache.cachelimit",
|
2008-08-17 23:27:27 +00:00
|
|
|
&V_tcp_syncache.cache_limit);
|
2006-06-17 17:32:38 +00:00
|
|
|
|
2001-11-22 04:50:44 +00:00
|
|
|
/* Allocate the hash table. */
|
2008-10-23 20:26:15 +00:00
|
|
|
V_tcp_syncache.hashbase = malloc(V_tcp_syncache.hashsize *
|
|
|
|
sizeof(struct syncache_head), M_SYNCACHE, M_WAITOK | M_ZERO);
|
2001-11-22 04:50:44 +00:00
|
|
|
|
Permit building kernels with options VIMAGE, restricted to only a single
active network stack instance. Turning on options VIMAGE at compile
time yields the following changes relative to default kernel build:
1) V_ accessor macros for virtualized variables resolve to structure
fields via base pointers, instead of being resolved as fields in global
structs or plain global variables. As an example, V_ifnet becomes:
options VIMAGE: ((struct vnet_net *) vnet_net)->_ifnet
default build: vnet_net_0._ifnet
options VIMAGE_GLOBALS: ifnet
2) INIT_VNET_* macros will declare and set up base pointers to be used
by V_ accessor macros, instead of resolving to whitespace:
INIT_VNET_NET(ifp->if_vnet); becomes
struct vnet_net *vnet_net = (ifp->if_vnet)->mod_data[VNET_MOD_NET];
3) Memory for vnet modules registered via vnet_mod_register() is now
allocated at run time in sys/kern/kern_vimage.c, instead of per vnet
module structs being declared as globals. If required, vnet modules
can now request the framework to provide them with allocated bzeroed
memory by filling in the vmi_size field in their vmi_modinfo structures.
4) structs socket, ifnet, inpcbinfo, tcpcb and syncache_head are
extended to hold a pointer to the parent vnet. options VIMAGE builds
will fill in those fields as required.
5) curvnet is introduced as a new global variable in options VIMAGE
builds, always pointing to the default and only struct vnet.
6) struct sysctl_oid has been extended with two additional fields to
store major and minor virtualization module identifiers, oid_v_subs and
oid_v_mod. SYSCTL_V_* family of macros will fill in those fields
accordingly, and store the offset in the appropriate vnet container
struct in oid_arg1.
In sysctl handlers dealing with virtualized sysctls, the
SYSCTL_RESOLVE_V_ARG1() macro will compute the address of the target
variable and make it available in arg1 variable for further processing.
Unused fields in structs vnet_inet, vnet_inet6 and vnet_ipfw have
been deleted.
Reviewed by: bz, rwatson
Approved by: julian (mentor)
2009-04-30 13:36:26 +00:00
|
|
|
#ifdef VIMAGE
|
2013-07-11 15:29:25 +00:00
|
|
|
V_tcp_syncache.vnet = curvnet;
|
2009-04-30 13:36:26 +00:00
|
|
|
#endif
|
2013-07-11 15:29:25 +00:00
|
|
|
|
|
|
|
/* Initialize the hash buckets. */
|
|
|
|
for (i = 0; i < V_tcp_syncache.hashsize; i++) {
|
2008-08-17 23:27:27 +00:00
|
|
|
TAILQ_INIT(&V_tcp_syncache.hashbase[i].sch_bucket);
|
|
|
|
mtx_init(&V_tcp_syncache.hashbase[i].sch_mtx, "tcp_sc_head",
|
2006-06-17 17:32:38 +00:00
|
|
|
NULL, MTX_DEF);
|
2008-08-17 23:27:27 +00:00
|
|
|
callout_init_mtx(&V_tcp_syncache.hashbase[i].sch_timer,
|
|
|
|
&V_tcp_syncache.hashbase[i].sch_mtx, 0);
|
|
|
|
V_tcp_syncache.hashbase[i].sch_length = 0;
|
2013-07-11 15:29:25 +00:00
|
|
|
V_tcp_syncache.hashbase[i].sch_sc = &V_tcp_syncache;
|
2017-04-21 06:05:34 +00:00
|
|
|
V_tcp_syncache.hashbase[i].sch_last_overflow =
|
|
|
|
-(SYNCOOKIE_LIFETIME + 1);
|
2001-11-22 04:50:44 +00:00
|
|
|
}
|
|
|
|
|
2006-06-17 17:32:38 +00:00
|
|
|
/* Create the syncache entry zone. */
|
2008-08-17 23:27:27 +00:00
|
|
|
V_tcp_syncache.zone = uma_zcreate("syncache", sizeof(struct syncache),
|
2006-06-17 17:32:38 +00:00
|
|
|
NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0);
|
2013-02-01 14:21:09 +00:00
|
|
|
V_tcp_syncache.cache_limit = uma_zone_set_max(V_tcp_syncache.zone,
|
|
|
|
V_tcp_syncache.cache_limit);
|
2013-07-11 15:29:25 +00:00
|
|
|
|
|
|
|
/* Start the SYN cookie reseeder callout. */
|
|
|
|
callout_init(&V_tcp_syncache.secret.reseed, 1);
|
|
|
|
arc4rand(V_tcp_syncache.secret.key[0], SYNCOOKIE_SECRET_SIZE, 0);
|
|
|
|
arc4rand(V_tcp_syncache.secret.key[1], SYNCOOKIE_SECRET_SIZE, 0);
|
|
|
|
callout_reset(&V_tcp_syncache.secret.reseed, SYNCOOKIE_LIFETIME * hz,
|
|
|
|
syncookie_reseed, &V_tcp_syncache);
|
2001-11-22 04:50:44 +00:00
|
|
|
}
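The sizing logic in syncache_init() above depends on the hash size being a power of two, so that `hashsize - 1` can serve as a bit mask. A minimal user-space sketch of that reasoning follows. The `toy_*` names and the default of 512 are illustrative assumptions; only the `powerof2()` test itself mirrors the macro from FreeBSD's `<sys/param.h>`. Note that this test also accepts 0, which is why the code above checks for 0 separately.

```c
#include <assert.h>
#include <stdint.h>

/* Same expression as the powerof2() macro in FreeBSD's <sys/param.h>. */
#define	TOY_POWEROF2(x)	((((x) - 1) & (x)) == 0)

#define	TOY_DEFAULT_HASHSIZE	512	/* illustrative default */

/*
 * Accept a tunable only if it is a usable power of two; otherwise fall
 * back to the default.  TOY_POWEROF2(0) is true, so zero (and, in this
 * sketch, any negative value) is rejected explicitly first.
 */
static int
toy_pick_hashsize(int requested)
{
	if (requested <= 0 || !TOY_POWEROF2(requested))
		return (TOY_DEFAULT_HASHSIZE);
	return (requested);
}

/* With a power-of-two size, masking is equivalent to hash % hashsize. */
static uint32_t
toy_bucket_index(uint32_t hash, int hashsize)
{
	uint32_t hashmask = (uint32_t)hashsize - 1;

	return (hash & hashmask);
}

/* Default overall limit: every bucket row filled to its own limit. */
static int
toy_cache_limit(int hashsize, int bucket_limit)
{
	return (hashsize * bucket_limit);
}
```

For example, `toy_pick_hashsize(1000)` falls back to the default because 1000 is not a power of two, while `toy_bucket_index(h, 1024)` always yields a value in the range 0 to 1023.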
|
|
|
|
|
Introduce an infrastructure for dismantling vnet instances.
Vnet modules and protocol domains may now register destructor
functions to clean up and release per-module state. The destructor
mechanisms can be triggered by invoking "vimage -d", or a future
equivalent command which will be provided via the new jail framework.
While this patch introduces numerous placeholder destructor functions,
many of those are currently incomplete, thus leaking memory or (even
worse) failing to stop all running timers. Many such issues are
already known and will be fixed incrementally over the next weeks in
smaller commits.
Apart from introducing new fields in structs ifnet, domain, protosw
and vnet_net, which requires the kernel and modules to be rebuilt, this
change should have no impact on nooptions VIMAGE builds, since vnet
destructors can only be called in VIMAGE kernels. Moreover,
destructor functions should be in general compiled in only in
options VIMAGE builds, except for kernel modules which can be safely
kldunloaded at run time.
Bump __FreeBSD_version to 800097.
Reviewed by: bz, julian
Approved by: rwatson, kib (re), julian (mentor)
2009-06-08 17:15:40 +00:00
|
|
|
#ifdef VIMAGE
|
|
|
|
void
|
|
|
|
syncache_destroy(void)
|
|
|
|
{
|
2010-02-20 21:45:04 +00:00
|
|
|
struct syncache_head *sch;
|
|
|
|
struct syncache *sc, *nsc;
|
|
|
|
int i;
|
|
|
|
|
2016-04-09 10:51:07 +00:00
|
|
|
/*
|
|
|
|
* Stop the re-seed timer before freeing resources. No need to
|
|
|
|
* possibly schedule it another time.
|
|
|
|
*/
|
|
|
|
callout_drain(&V_tcp_syncache.secret.reseed);
|
|
|
|
|
2010-02-20 21:45:04 +00:00
|
|
|
/* Cleanup hash buckets: stop timers, free entries, destroy locks. */
|
|
|
|
for (i = 0; i < V_tcp_syncache.hashsize; i++) {
|
|
|
|
|
|
|
|
sch = &V_tcp_syncache.hashbase[i];
|
|
|
|
callout_drain(&sch->sch_timer);
|
|
|
|
|
|
|
|
SCH_LOCK(sch);
|
|
|
|
TAILQ_FOREACH_SAFE(sc, &sch->sch_bucket, sc_hash, nsc)
|
|
|
|
syncache_drop(sc, sch);
|
|
|
|
SCH_UNLOCK(sch);
|
|
|
|
KASSERT(TAILQ_EMPTY(&sch->sch_bucket),
|
|
|
|
("%s: sch->sch_bucket not empty", __func__));
|
|
|
|
KASSERT(sch->sch_length == 0, ("%s: sch->sch_length %d not 0",
|
|
|
|
__func__, sch->sch_length));
|
|
|
|
mtx_destroy(&sch->sch_mtx);
|
|
|
|
}
|
2009-06-08 17:15:40 +00:00
|
|
|
|
2012-10-28 18:07:34 +00:00
|
|
|
KASSERT(uma_zone_get_cur(V_tcp_syncache.zone) == 0,
|
|
|
|
("%s: cache_count not 0", __func__));
|
2009-06-08 17:15:40 +00:00
|
|
|
|
2010-02-20 21:45:04 +00:00
|
|
|
/* Free the allocated global resources. */
|
2009-06-08 17:15:40 +00:00
|
|
|
uma_zdestroy(V_tcp_syncache.zone);
|
2010-02-20 21:45:04 +00:00
|
|
|
free(V_tcp_syncache.hashbase, M_SYNCACHE);
|
2009-06-08 17:15:40 +00:00
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2006-06-17 17:32:38 +00:00
|
|
|
/*
|
|
|
|
* Inserts a syncache entry into the specified bucket row.
|
|
|
|
* Locks and unlocks the syncache_head autonomously.
|
|
|
|
*/
|
2001-12-19 06:12:14 +00:00
|
|
|
static void
|
2006-06-17 17:49:11 +00:00
|
|
|
syncache_insert(struct syncache *sc, struct syncache_head *sch)
|
2001-11-22 04:50:44 +00:00
|
|
|
{
|
|
|
|
struct syncache *sc2;
|
2003-11-11 17:54:47 +00:00
|
|
|
|
2006-06-17 17:32:38 +00:00
|
|
|
SCH_LOCK(sch);
|
2001-11-22 04:50:44 +00:00
|
|
|
|
|
|
|
/*
|
2006-06-17 17:32:38 +00:00
|
|
|
* Make sure that we don't overflow the per-bucket limit.
|
|
|
|
* If the bucket is full, toss the oldest element.
|
2001-11-22 04:50:44 +00:00
|
|
|
*/
|
2008-08-17 23:27:27 +00:00
|
|
|
if (sch->sch_length >= V_tcp_syncache.bucket_limit) {
|
2006-06-17 17:32:38 +00:00
|
|
|
KASSERT(!TAILQ_EMPTY(&sch->sch_bucket),
|
|
|
|
("sch->sch_length incorrect"));
|
|
|
|
sc2 = TAILQ_LAST(&sch->sch_bucket, sch_head);
|
2017-04-20 19:19:33 +00:00
|
|
|
sch->sch_last_overflow = time_uptime;
|
2001-11-22 04:50:44 +00:00
|
|
|
syncache_drop(sc2, sch);
|
2009-04-11 22:07:19 +00:00
|
|
|
TCPSTAT_INC(tcps_sc_bucketoverflow);
|
2001-11-22 04:50:44 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Put it into the bucket. */
|
2006-06-17 17:32:38 +00:00
|
|
|
TAILQ_INSERT_HEAD(&sch->sch_bucket, sc, sc_hash);
|
2001-11-22 04:50:44 +00:00
|
|
|
sch->sch_length++;
|
2006-06-17 17:32:38 +00:00
|
|
|
|
2012-06-19 07:34:13 +00:00
|
|
|
#ifdef TCP_OFFLOAD
|
|
|
|
if (ADDED_BY_TOE(sc)) {
|
|
|
|
struct toedev *tod = sc->sc_tod;
|
|
|
|
|
|
|
|
tod->tod_syncache_added(tod, sc->sc_todctx);
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2006-06-17 17:32:38 +00:00
|
|
|
/* Reinitialize the bucket row's timer. */
|
Fix bugs in the TCP syncache timeout code, including:
When system ticks are positive, for entries in the cache
bucket, syncache_timer() ran on every tick (doing nothing
useful) instead of the supposed 3, 6, 12, and 24 seconds
later (when it's time to retransmit SYN,ACK).
When ticks are negative, syncache_timer() was scheduled
too far in the future (up to ~25 days on systems with
HZ=1000), no SYN,ACK retransmits were attempted at all,
and syncache entries added in that period that correspond
to non-established connections stay there forever.
Only HEAD and RELENG_7 are affected.
Reviewed by: silby, kmacy (earlier version)
Submitted by: Maxim Dounin, ru
2007-12-19 16:56:28 +00:00
|
|
|
if (sch->sch_length == 1)
|
|
|
|
sch->sch_nextc = ticks + INT_MAX;
|
2007-07-28 12:02:05 +00:00
|
|
|
syncache_timeout(sc, sch, 1);
|
2006-06-17 17:32:38 +00:00
|
|
|
|
|
|
|
SCH_UNLOCK(sch);
|
|
|
|
|
2016-03-15 00:15:10 +00:00
|
|
|
TCPSTATES_INC(TCPS_SYN_RECEIVED);
|
2009-04-11 22:07:19 +00:00
|
|
|
TCPSTAT_INC(tcps_sc_added);
|
2001-11-22 04:50:44 +00:00
|
|
|
}
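The eviction policy in syncache_insert() above (new entries go to the head of the bucket row; when the row is already at its limit, the entry at the tail, which is the oldest, is dropped first) can be sketched without kernel dependencies. This is a simplified illustration with hypothetical `toy_*` names: it uses a hand-written doubly linked list in place of the `TAILQ` macros, omits locking and timers, and assumes the limit is at least 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct toy_entry {
	struct toy_entry *prev, *next;
	int	id;
};

struct toy_bucket {
	struct toy_entry *head, *tail;	/* head = newest, tail = oldest */
	int	length;
	int	limit;			/* must be >= 1 */
	int	overflows;		/* how many entries were evicted */
};

/* Unlink an entry from its bucket row and free it. */
static void
toy_bucket_remove(struct toy_bucket *b, struct toy_entry *e)
{
	if (e->prev != NULL)
		e->prev->next = e->next;
	else
		b->head = e->next;
	if (e->next != NULL)
		e->next->prev = e->prev;
	else
		b->tail = e->prev;
	b->length--;
	free(e);
}

/* Insert at the head, evicting the oldest entry if the row is full. */
static int
toy_bucket_insert(struct toy_bucket *b, int id)
{
	struct toy_entry *e = malloc(sizeof(*e));

	if (e == NULL)
		return (-1);
	e->id = id;

	if (b->length >= b->limit) {
		assert(b->tail != NULL);	/* length must match the list */
		toy_bucket_remove(b, b->tail);
		b->overflows++;
	}

	e->prev = NULL;
	e->next = b->head;
	if (b->head != NULL)
		b->head->prev = e;
	else
		b->tail = e;
	b->head = e;
	b->length++;
	return (0);
}
```

Inserting ids 1, 2, 3 into a row with a limit of 2 leaves 3 at the head and 2 at the tail, with one overflow recorded. Evicting from the tail keeps the cost of an insert constant regardless of how many entries an attacker tries to add.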
|
|
|
|
|
2006-06-17 17:32:38 +00:00
|
|
|
/*
|
|
|
|
* Remove and free entry from syncache bucket row.
|
|
|
|
* Expects locked syncache head.
|
|
|
|
*/
|
2001-11-22 04:50:44 +00:00
|
|
|
static void
|
2006-06-17 17:49:11 +00:00
|
|
|
syncache_drop(struct syncache *sc, struct syncache_head *sch)
|
2001-11-22 04:50:44 +00:00
|
|
|
{
|
|
|
|
|
2006-06-17 17:32:38 +00:00
|
|
|
SCH_LOCK_ASSERT(sch);
|
2001-11-22 04:50:44 +00:00
|
|
|
|
2016-03-15 00:15:10 +00:00
|
|
|
TCPSTATES_DEC(TCPS_SYN_RECEIVED);
|
2001-11-22 04:50:44 +00:00
|
|
|
TAILQ_REMOVE(&sch->sch_bucket, sc, sc_hash);
|
|
|
|
sch->sch_length--;
|
|
|
|
|
2012-06-19 07:34:13 +00:00
|
|
|
#ifdef TCP_OFFLOAD
|
|
|
|
if (ADDED_BY_TOE(sc)) {
|
|
|
|
struct toedev *tod = sc->sc_tod;
|
|
|
|
|
|
|
|
tod->tod_syncache_removed(tod, sc->sc_todctx);
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2001-11-22 04:50:44 +00:00
|
|
|
syncache_free(sc);
|
|
|
|
}
|
|
|
|
|
2007-07-28 12:02:05 +00:00
|
|
|
/*
|
|
|
|
* Engage/reengage timer on bucket row.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
syncache_timeout(struct syncache *sc, struct syncache_head *sch, int docallout)
|
|
|
|
{
|
|
|
|
sc->sc_rxttime = ticks +
|
2012-10-28 19:02:07 +00:00
|
|
|
TCPTV_RTOBASE * (tcp_syn_backoff[sc->sc_rxmits]);
|
2007-07-28 12:02:05 +00:00
|
|
|
sc->sc_rxmits++;
|
Fix bugs in the TCP syncache timeout code, including:
When system ticks are positive, for entries in the cache
bucket, syncache_timer() ran on every tick (doing nothing
useful) instead of the supposed 3, 6, 12, and 24 seconds
later (when it's time to retransmit SYN,ACK).
When ticks are negative, syncache_timer() was scheduled
too far in the future (up to ~25 days on systems with
HZ=1000), no SYN,ACK retransmits were attempted at all,
and syncache entries added in that period that correspond
to non-established connections stay there forever.
Only HEAD and RELENG_7 are affected.
Reviewed by: silby, kmacy (earlier version)
Submitted by: Maxim Dounin, ru
2007-12-19 16:56:28 +00:00
|
|
|
if (TSTMP_LT(sc->sc_rxttime, sch->sch_nextc)) {
|
2007-07-28 12:02:05 +00:00
|
|
|
sch->sch_nextc = sc->sc_rxttime;
|
2007-12-19 16:56:28 +00:00
|
|
|
if (docallout)
|
|
|
|
callout_reset(&sch->sch_timer, sch->sch_nextc - ticks,
|
|
|
|
syncache_timer, (void *)sch);
|
|
|
|
}
|
2007-07-28 12:02:05 +00:00
|
|
|
}
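The timeout code above compares tick values with TSTMP_LT() and uses `ticks + INT_MAX` as a "nothing scheduled" sentinel, which only works if comparisons survive the tick counter wrapping or going negative. A minimal user-space sketch of that idea follows. The helper names `tick_add`, `tick_before`, and `tick_far_future` are hypothetical and are not kernel APIs; the sketch also assumes the usual two's-complement conversion from unsigned to int.

```c
#include <limits.h>
#include <stdbool.h>

/*
 * Hypothetical user-space stand-ins for the kernel's tick arithmetic.
 * Sums and differences are taken in unsigned arithmetic, which wraps
 * without undefined behaviour, and are then converted back to int
 * (two's-complement conversion is assumed, as on mainstream compilers).
 */
static int
tick_add(int base, int delta)
{
	return ((int)((unsigned int)base + (unsigned int)delta));
}

/* True if tick value a lies before b, even across a counter wrap. */
static bool
tick_before(int a, int b)
{
	return ((int)((unsigned int)a - (unsigned int)b) < 0);
}

/* "No deadline armed" sentinel: as far ahead of now as can be expressed. */
static int
tick_far_future(int now)
{
	return (tick_add(now, INT_MAX));
}
```

With these helpers any real deadline compares as earlier than the sentinel, regardless of the sign of the current tick value, whereas a plain `<` on the raw integers gives the wrong answer once a deadline wraps past INT_MAX.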
|
|
|
|
|
2001-11-22 04:50:44 +00:00
|
|
|
/*
|
|
|
|
* Walk the timer queues, looking for SYN,ACKs that need to be retransmitted.
|
|
|
|
* If we have retransmitted an entry the maximum number of times, expire it.
|
2006-06-17 17:32:38 +00:00
|
|
|
* One separate timer for each bucket row.
|
2001-11-22 04:50:44 +00:00
|
|
|
*/
|
|
|
|
static void
|
2006-06-17 17:49:11 +00:00
|
|
|
syncache_timer(void *xsch)
|
2001-11-22 04:50:44 +00:00
|
|
|
{
|
2006-06-17 17:32:38 +00:00
|
|
|
struct syncache_head *sch = (struct syncache_head *)xsch;
|
2001-11-22 04:50:44 +00:00
|
|
|
struct syncache *sc, *nsc;
|
2006-06-17 17:32:38 +00:00
|
|
|
int tick = ticks;
|
2007-05-18 21:13:01 +00:00
|
|
|
char *s;
|
2001-11-22 04:50:44 +00:00
|
|
|
|
2013-07-11 15:29:25 +00:00
|
|
|
CURVNET_SET(sch->sch_sc->vnet);
|
2008-11-26 22:32:07 +00:00
|
|
|
|
2006-06-17 17:32:38 +00:00
|
|
|
/* NB: syncache_head has already been locked by the callout. */
|
|
|
|
SCH_LOCK_ASSERT(sch);
|
2001-11-22 04:50:44 +00:00
|
|
|
|
2007-12-19 16:56:28 +00:00
|
|
|
/*
|
|
|
|
* In the following cycle we may remove some entries and/or
|
|
|
|
* advance some timeouts, so re-initialize the bucket timer.
|
|
|
|
*/
|
|
|
|
sch->sch_nextc = tick + INT_MAX;
|
|
|
|
|
2006-06-17 17:32:38 +00:00
|
|
|
TAILQ_FOREACH_SAFE(sc, &sch->sch_bucket, sc_hash, nsc) {
|
|
|
|
/*
|
|
|
|
* We do not check if the listen socket still exists
|
|
|
|
* and accept the case where the listen socket may be
|
|
|
|
* gone by the time we resend the SYN/ACK. We do
|
|
|
|
* not expect this to happen often. If it does,
|
|
|
|
* then the RST will be sent by the time the remote
|
|
|
|
* host does the SYN/ACK->ACK.
|
|
|
|
*/
|
2007-12-19 16:56:28 +00:00
|
|
|
if (TSTMP_GT(sc->sc_rxttime, tick)) {
|
|
|
|
if (TSTMP_LT(sc->sc_rxttime, sch->sch_nextc))
|
2006-06-17 17:32:38 +00:00
|
|
|
sch->sch_nextc = sc->sc_rxttime;
|
|
|
|
continue;
|
|
|
|
}
|
Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).
This is the first in a series of commits over the course
of the next few weeks.
Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.
We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.
Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch
2008-08-17 23:27:27 +00:00
|
|
|
if (sc->sc_rxmits > V_tcp_syncache.rexmt_limit) {
|
2007-05-18 21:13:01 +00:00
|
|
|
if ((s = tcp_log_addrs(&sc->sc_inc, NULL, NULL, NULL))) {
|
2007-07-28 12:02:05 +00:00
|
|
|
log(LOG_DEBUG, "%s; %s: Retransmits exhausted, "
|
|
|
|
"giving up and removing syncache entry\n",
|
2007-05-18 21:13:01 +00:00
|
|
|
s, __func__);
|
|
|
|
free(s, M_TCPLOG);
|
|
|
|
}
|
2006-06-17 17:32:38 +00:00
|
|
|
syncache_drop(sc, sch);
|
2009-04-11 22:07:19 +00:00
|
|
|
TCPSTAT_INC(tcps_sc_stale);
|
2001-11-22 04:50:44 +00:00
|
|
|
continue;
|
|
|
|
}
|
2007-07-28 12:02:05 +00:00
|
|
|
if ((s = tcp_log_addrs(&sc->sc_inc, NULL, NULL, NULL))) {
|
|
|
|
log(LOG_DEBUG, "%s; %s: Response timeout, "
|
|
|
|
"retransmitting (%u) SYN|ACK\n",
|
|
|
|
s, __func__, sc->sc_rxmits);
|
|
|
|
free(s, M_TCPLOG);
|
|
|
|
}
|
2006-06-17 17:32:38 +00:00
|
|
|
|
2016-04-29 07:23:08 +00:00
|
|
|
syncache_respond(sc, sch, 1, NULL);
|
2009-04-11 22:07:19 +00:00
|
|
|
TCPSTAT_INC(tcps_sc_retransmitted);
|
2007-07-28 12:02:05 +00:00
|
|
|
syncache_timeout(sc, sch, 0);
|
2001-11-22 04:50:44 +00:00
|
|
|
}
|
2006-06-17 17:32:38 +00:00
|
|
|
if (!TAILQ_EMPTY(&(sch)->sch_bucket))
|
|
|
|
callout_reset(&(sch)->sch_timer, (sch)->sch_nextc - tick,
|
|
|
|
syncache_timer, (void *)(sch));
|
2008-11-26 22:32:07 +00:00
|
|
|
CURVNET_RESTORE();
|
2001-11-22 04:50:44 +00:00
|
|
|
}
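The per-bucket scan in syncache_timer() can be modelled outside the kernel as follows. This is a simplified sketch, not the kernel code: `struct sc_model`, `tick_diff`, and `bucket_scan` are hypothetical names, the bucket is a plain array rather than a TAILQ, a flat retransmit interval replaces the backoff table, and no locking or packet transmission is modelled.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model of one syncache entry. */
struct sc_model {
	int	rxttime;	/* tick at which the next SYN,ACK is due */
	int	rxmits;		/* transmissions so far */
	bool	live;		/* false once the entry has been dropped */
};

/* Wrap-safe signed distance from tick b to tick a. */
static int
tick_diff(int a, int b)
{
	return ((int)((unsigned int)a - (unsigned int)b));
}

/*
 * Scan one bucket at tick "now".  Returns the number of ticks until the
 * bucket needs servicing again, or -1 if no live entries remain.
 * rto_ticks is a flat retransmit interval; the kernel scales it with a
 * backoff table instead.
 */
static int
bucket_scan(struct sc_model *e, size_t n, int now, int rexmt_limit,
    int rto_ticks)
{
	/* Start from the "far future" sentinel and pull it in. */
	int nextc = (int)((unsigned int)now + (unsigned int)INT_MAX);
	bool any = false;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!e[i].live)
			continue;
		if (tick_diff(e[i].rxttime, now) <= 0) {
			/* Due: give up if the retransmit budget is spent. */
			if (e[i].rxmits > rexmt_limit) {
				e[i].live = false;
				continue;
			}
			/* Otherwise "retransmit" and re-arm the entry. */
			e[i].rxttime = (int)((unsigned int)now +
			    (unsigned int)rto_ticks);
			e[i].rxmits++;
		}
		/* Track the earliest pending deadline in the bucket. */
		if (tick_diff(e[i].rxttime, nextc) < 0)
			nextc = e[i].rxttime;
		any = true;
	}
	return (any ? tick_diff(nextc, now) : -1);
}
```

The return value plays the role of the `sch_nextc - tick` argument passed to callout_reset(): the bucket timer is re-armed for the earliest remaining deadline, and not at all once the bucket is empty.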
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Find an entry in the syncache.
|
2006-06-17 17:32:38 +00:00
|
|
|
* Always returns with the syncache_head locked, plus a matching entry or NULL.
|
2001-11-22 04:50:44 +00:00
|
|
|
*/
|
2014-05-24 15:03:36 +00:00
|
|
|
static struct syncache *
|
2006-06-17 17:49:11 +00:00
|
|
|
syncache_lookup(struct in_conninfo *inc, struct syncache_head **schp)
|
2001-11-22 04:50:44 +00:00
|
|
|
{
|
|
|
|
struct syncache *sc;
|
|
|
|
struct syncache_head *sch;
|
Use Jenkins hash for TCP syncache.
o Unlike xor, in Jenkins hash every bit of input affects virtually
every bit of output, thus salting the hash actually works. With
xor salting only provides a false sense of security, since if
hash(x) collides with hash(y), then of course, hash(x) ^ salt
would also collide with hash(y) ^ salt. [1]
o Jenkins provides much better distribution than xor, very close to
ideal.
TCP connection setup/teardown benchmark has shown a 10% increase
with default hash size, and with bigger hashes that still provide
possibility for collisions. With enormous hash size, when dataset is
by an order of magnitude smaller than hash size, the benchmark has
shown a 4% decrease in performance, which is expected and
acceptable.
Noticed by: Jeffrey Knockel <jeffk cs.unm.edu> [1]
Benchmarks by: jch
Reviewed by: jch, pkelsey, delphij
Security: strengthens protection against hash collision DoS
Sponsored by: Nginx, Inc.
2015-09-05 10:15:19 +00:00
|
|
|
uint32_t hash;
|
2003-11-11 17:54:47 +00:00
|
|
|
|
2015-09-05 10:15:19 +00:00
|
|
|
/*
|
|
|
|
* The hash is built on foreign port + local port + foreign address.
|
|
|
|
* We rely on the fact that struct in_conninfo starts with 16 bits
|
|
|
|
* of foreign port, then 16 bits of local port then followed by 128
|
|
|
|
* bits of foreign address. In case of IPv4 address, the first 3
|
|
|
|
* 32-bit words of the address always are zeroes.
|
|
|
|
*/
|
|
|
|
hash = jenkins_hash32((uint32_t *)&inc->inc_ie, 5,
|
|
|
|
V_tcp_syncache.hash_secret) & V_tcp_syncache.hashmask;
|
2006-06-17 17:32:38 +00:00
|
|
|
|
2015-09-05 10:15:19 +00:00
|
|
|
sch = &V_tcp_syncache.hashbase[hash];
|
|
|
|
*schp = sch;
|
|
|
|
SCH_LOCK(sch);
|
2006-06-17 17:32:38 +00:00
|
|
|
|
2015-09-05 10:15:19 +00:00
|
|
|
/* Circle through bucket row to find matching entry. */
|
|
|
|
TAILQ_FOREACH(sc, &sch->sch_bucket, sc_hash)
|
|
|
|
if (bcmp(&inc->inc_ie, &sc->sc_inc.inc_ie,
|
|
|
|
sizeof(struct in_endpoints)) == 0)
|
|
|
|
break;
|
2006-06-17 17:32:38 +00:00
|
|
|
|
2015-09-05 10:15:19 +00:00
|
|
|
return (sc); /* Always returns with locked sch. */
|
2001-11-22 04:50:44 +00:00
|
|
|
}
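The commit message for the Jenkins hash change argues that xor-ing a salt into an xor-based hash cannot separate inputs that already collide. A small sketch of that argument follows. Both functions are illustrative assumptions: `xor_hash` models the weak scheme, and `mix_hash` is an FNV-1a-style stand-in for a seeded mixing hash, not the kernel's jenkins_hash32().

```c
#include <stddef.h>
#include <stdint.h>

/* Xor-fold the words, then xor in the salt (the weak scheme). */
static uint32_t
xor_hash(const uint32_t *w, size_t n, uint32_t salt)
{
	uint32_t h = 0;
	size_t i;

	for (i = 0; i < n; i++)
		h ^= w[i];
	return (h ^ salt);
}

/*
 * Hypothetical stand-in for a seeded mixing hash: FNV-1a-style steps over
 * 32-bit words.  This is not the kernel's Jenkins hash; it only shows the
 * seed taking part in every mixing step.
 */
static uint32_t
mix_hash(const uint32_t *w, size_t n, uint32_t seed)
{
	uint32_t h = seed;
	size_t i;

	for (i = 0; i < n; i++)
		h = (h ^ w[i]) * 16777619u;	/* 32-bit FNV prime */
	return (h);
}
```

For the inputs {1, 2} and {2, 1}, `xor_hash` returns the same value for every salt, so an attacker who finds one colliding pair keeps it no matter which secret is chosen. `mix_hash` distinguishes this particular pair because the word order and the seed feed into each multiplication step.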
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This function is called when we get a RST for a
|
|
|
|
* non-existent connection, so that we can see if the
|
|
|
|
* connection is in the syn cache. If it is, zap it.
|
|
|
|
*/
|
|
|
|
void
|
2006-06-17 17:49:11 +00:00
|
|
|
syncache_chkrst(struct in_conninfo *inc, struct tcphdr *th)
|
2001-11-22 04:50:44 +00:00
|
|
|
{
|
|
|
|
struct syncache *sc;
|
|
|
|
struct syncache_head *sch;
|
2007-07-28 11:51:44 +00:00
|
|
|
char *s = NULL;
|
2001-11-22 04:50:44 +00:00
|
|
|
|
2006-06-17 17:32:38 +00:00
|
|
|
sc = syncache_lookup(inc, &sch); /* returns locked sch */
|
|
|
|
SCH_LOCK_ASSERT(sch);
|
2007-07-28 11:51:44 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Any RST to our SYN|ACK must not carry ACK, SYN or FIN flags.
|
|
|
|
* See RFC 793 page 65, section SEGMENT ARRIVES.
|
|
|
|
*/
|
|
|
|
if (th->th_flags & (TH_ACK|TH_SYN|TH_FIN)) {
|
|
|
|
if ((s = tcp_log_addrs(inc, th, NULL, NULL)))
|
|
|
|
log(LOG_DEBUG, "%s; %s: Spurious RST with ACK, SYN or "
|
|
|
|
"FIN flag set, segment ignored\n", s, __func__);
|
2009-04-11 22:07:19 +00:00
|
|
|
TCPSTAT_INC(tcps_badrst);
|
2007-07-28 11:51:44 +00:00
|
|
|
goto done;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* No corresponding connection was found in syncache.
|
|
|
|
* If syncookies are enabled and possibly exclusively
|
|
|
|
* used, or we are under memory pressure, a valid RST
|
|
|
|
* may not find a syncache entry. In that case we're
|
|
|
|
* done and no SYN|ACK retransmissions will happen.
|
2011-02-21 09:01:34 +00:00
|
|
|
* Otherwise the RST was misdirected or spoofed.
|
2007-07-28 11:51:44 +00:00
|
|
|
*/
|
|
|
|
if (sc == NULL) {
|
|
|
|
if ((s = tcp_log_addrs(inc, th, NULL, NULL)))
|
|
|
|
log(LOG_DEBUG, "%s; %s: Spurious RST without matching "
|
|
|
|
"syncache entry (possibly syncookie only), "
|
|
|
|
"segment ignored\n", s, __func__);
|
2009-04-11 22:07:19 +00:00
|
|
|
TCPSTAT_INC(tcps_badrst);
|
2006-06-17 17:32:38 +00:00
|
|
|
goto done;
|
2007-07-28 11:51:44 +00:00
|
|
|
}
|
2006-06-17 17:32:38 +00:00
|
|
|
|
2001-11-22 04:50:44 +00:00
|
|
|
/*
|
|
|
|
* If the RST bit is set, check the sequence number to see
|
|
|
|
* if this is a valid reset segment.
|
|
|
|
* RFC 793 page 37:
|
|
|
|
* In all states except SYN-SENT, all reset (RST) segments
|
|
|
|
* are validated by checking their SEQ-fields. A reset is
|
|
|
|
* valid if its sequence number is in the window.
|
|
|
|
*
|
|
|
|
* The sequence number in the reset segment is normally an
|
|
|
|
* echo of our outgoing acknowledgement numbers, but some hosts
|
|
|
|
* send a reset with the sequence number at the rightmost edge
|
|
|
|
* of our receive window, and we have to handle this case.
|
|
|
|
*/
|
|
|
|
if (SEQ_GEQ(th->th_seq, sc->sc_irs) &&
|
|
|
|
SEQ_LEQ(th->th_seq, sc->sc_irs + sc->sc_wnd)) {
|
|
|
|
syncache_drop(sc, sch);
|
2007-07-28 11:51:44 +00:00
|
|
|
if ((s = tcp_log_addrs(inc, th, NULL, NULL)))
|
|
|
|
log(LOG_DEBUG, "%s; %s: Our SYN|ACK was rejected, "
|
|
|
|
"connection attempt aborted by remote endpoint\n",
|
|
|
|
s, __func__);
|
2009-04-11 22:07:19 +00:00
|
|
|
TCPSTAT_INC(tcps_sc_reset);
|
2008-05-08 22:21:09 +00:00
|
|
|
} else {
|
|
|
|
if ((s = tcp_log_addrs(inc, th, NULL, NULL)))
|
|
|
|
log(LOG_DEBUG, "%s; %s: RST with invalid SEQ %u != "
|
|
|
|
"IRS %u (+WND %u), segment ignored\n",
|
|
|
|
s, __func__, th->th_seq, sc->sc_irs, sc->sc_wnd);
|
2009-04-11 22:07:19 +00:00
|
|
|
TCPSTAT_INC(tcps_badrst);
|
2001-11-22 04:50:44 +00:00
|
|
|
}
|
2007-07-28 11:51:44 +00:00
|
|
|
|
2006-06-17 17:32:38 +00:00
|
|
|
done:
|
2007-07-28 11:51:44 +00:00
|
|
|
if (s != NULL)
|
|
|
|
free(s, M_TCPLOG);
|
2006-06-17 17:32:38 +00:00
|
|
|
SCH_UNLOCK(sch);
|
2001-11-22 04:50:44 +00:00
|
|
|
}
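The RST acceptance test in syncache_chkrst() checks that the segment's sequence number lies between IRS and IRS plus the window, using modular comparisons so the check still holds when the sequence space wraps. A minimal sketch, assuming 32-bit sequence numbers; `seq_geq`, `seq_leq`, and `rst_seq_acceptable` are hypothetical names modelled on the SEQ_GEQ()/SEQ_LEQ() macros used above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical equivalents of SEQ_GEQ()/SEQ_LEQ(): modular comparisons. */
static bool
seq_geq(uint32_t a, uint32_t b)
{
	return ((int32_t)(a - b) >= 0);
}

static bool
seq_leq(uint32_t a, uint32_t b)
{
	return ((int32_t)(a - b) <= 0);
}

/* True if seq falls in [irs, irs + wnd], allowing for 32-bit wraparound. */
static bool
rst_seq_acceptable(uint32_t seq, uint32_t irs, uint32_t wnd)
{
	return (seq_geq(seq, irs) && seq_leq(seq, irs + wnd));
}
```

With irs = 0xFFFFFFF0 and wnd = 0x40 the window wraps through zero and ends at 0x30, so a sequence number of 0x10 is accepted while 0x31 and 0xFFFFFFEF are rejected.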
|
|
|
|
|
|
|
|
void
|
2006-06-17 17:49:11 +00:00
|
|
|
syncache_badack(struct in_conninfo *inc)
|
2001-11-22 04:50:44 +00:00
|
|
|
{
|
|
|
|
struct syncache *sc;
|
|
|
|
struct syncache_head *sch;
|
|
|
|
|
2006-06-17 17:32:38 +00:00
|
|
|
sc = syncache_lookup(inc, &sch); /* returns locked sch */
|
|
|
|
SCH_LOCK_ASSERT(sch);
|
2001-11-22 04:50:44 +00:00
|
|
|
if (sc != NULL) {
|
|
|
|
syncache_drop(sc, sch);
|
2009-04-11 22:07:19 +00:00
|
|
|
TCPSTAT_INC(tcps_sc_badack);
|
2001-11-22 04:50:44 +00:00
|
|
|
}
|
2006-06-17 17:32:38 +00:00
|
|
|
SCH_UNLOCK(sch);
|
2001-11-22 04:50:44 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
void
|
2017-06-03 21:53:58 +00:00
|
|
|
syncache_unreach(struct in_conninfo *inc, tcp_seq th_seq)
|
2001-11-22 04:50:44 +00:00
|
|
|
{
|
|
|
|
struct syncache *sc;
|
|
|
|
struct syncache_head *sch;
|
|
|
|
|
2006-06-17 17:32:38 +00:00
|
|
|
sc = syncache_lookup(inc, &sch); /* returns locked sch */
|
|
|
|
SCH_LOCK_ASSERT(sch);
|
2001-11-22 04:50:44 +00:00
|
|
|
if (sc == NULL)
|
2006-06-17 17:32:38 +00:00
|
|
|
goto done;
|
2001-11-22 04:50:44 +00:00
|
|
|
|
|
|
|
/* If the sequence number != sc_iss, then it's a bogus ICMP msg */
|
2017-06-03 21:53:58 +00:00
|
|
|
if (ntohl(th_seq) != sc->sc_iss)
|
2006-06-17 17:32:38 +00:00
|
|
|
goto done;
|
2001-11-22 04:50:44 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If we've retransmitted 3 times and this is our second error,
|
|
|
|
* we remove the entry. Otherwise, we allow it to continue on.
|
|
|
|
* This prevents us from incorrectly nuking an entry during a
|
|
|
|
* spurious network outage.
|
|
|
|
*
|
|
|
|
* See tcp_notify().
|
|
|
|
*/
|
2006-06-17 17:32:38 +00:00
|
|
|
if ((sc->sc_flags & SCF_UNREACH) == 0 || sc->sc_rxmits < 3 + 1) {
|
2001-11-22 04:50:44 +00:00
|
|
|
sc->sc_flags |= SCF_UNREACH;
|
2006-06-17 17:32:38 +00:00
|
|
|
goto done;
|
2001-11-22 04:50:44 +00:00
|
|
|
}
|
|
|
|
syncache_drop(sc, sch);
|
2009-04-11 22:07:19 +00:00
|
|
|
TCPSTAT_INC(tcps_sc_unreach);
|
2006-06-17 17:32:38 +00:00
|
|
|
done:
|
|
|
|
SCH_UNLOCK(sch);
|
2001-11-22 04:50:44 +00:00
|
|
|
}
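The decision made in syncache_unreach() can be summarised as a small state check: ignore ICMP errors whose sequence number does not match, remember the first matching error, and drop the entry only on a later error once enough transmissions have happened. The sketch below models just that decision. `struct unreach_model`, `enum unreach_action`, and `on_unreach` are hypothetical names, and the sequence number is assumed to be in host byte order already. The `3 + 1` threshold is copied from the code above; it presumably reflects sc_rxmits having been incremented once when the entry was first armed, which is an inference rather than something stated in the source.

```c
#include <stdbool.h>
#include <stdint.h>

enum unreach_action { UNREACH_IGNORE, UNREACH_MARK, UNREACH_DROP };

/* Hypothetical, simplified view of the fields the decision depends on. */
struct unreach_model {
	uint32_t iss;		/* our initial send sequence number */
	int	 rxmits;	/* transmission counter of the entry */
	bool	 unreach_seen;	/* models the SCF_UNREACH flag */
};

static enum unreach_action
on_unreach(struct unreach_model *sc, uint32_t seq_host_order)
{
	/* An error quoting a different sequence number is bogus. */
	if (seq_host_order != sc->iss)
		return (UNREACH_IGNORE);
	/* First error, or too few transmissions so far: only remember it. */
	if (!sc->unreach_seen || sc->rxmits < 3 + 1) {
		sc->unreach_seen = true;
		return (UNREACH_MARK);
	}
	/* Repeated error after enough transmissions: drop the entry. */
	return (UNREACH_DROP);
}
```

Requiring both conditions means a single unreachable message during a brief outage marks the entry but leaves it in the cache.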
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Build a new TCP socket structure from a syncache entry.
|
2015-08-03 12:13:54 +00:00
|
|
|
*
|
|
|
|
* On success return the newly created socket with its underlying inp locked.
|
2001-11-22 04:50:44 +00:00
|
|
|
*/
|
|
|
|
static struct socket *
|
2006-06-17 17:49:11 +00:00
|
|
|
syncache_socket(struct syncache *sc, struct socket *lso, struct mbuf *m)
|
2001-11-22 04:50:44 +00:00
|
|
|
{
|
2015-12-16 00:56:45 +00:00
|
|
|
struct tcp_function_block *blk;
|
2001-11-22 04:50:44 +00:00
|
|
|
struct inpcb *inp = NULL;
|
|
|
|
struct socket *so;
|
|
|
|
struct tcpcb *tp;
|
2010-08-15 13:07:08 +00:00
|
|
|
int error;
|
2007-05-18 21:13:01 +00:00
|
|
|
char *s;
|
2001-11-22 04:50:44 +00:00
|
|
|
|
2015-08-03 12:13:54 +00:00
|
|
|
INP_INFO_RLOCK_ASSERT(&V_tcbinfo);
|
2003-11-11 17:54:47 +00:00
|
|
|
|
2001-11-22 04:50:44 +00:00
|
|
|
/*
|
|
|
|
* Ok, create the full blown connection, and set things up
|
|
|
|
* as they would have been set up if we had created the
|
|
|
|
* connection when the SYN arrived. If we can't create
|
|
|
|
* the connection, abort it.
|
|
|
|
*/
|
2014-01-28 20:28:32 +00:00
|
|
|
so = sonewconn(lso, 0);
|
2001-11-22 04:50:44 +00:00
|
|
|
if (so == NULL) {
|
|
|
|
/*
|
2007-05-18 21:13:01 +00:00
|
|
|
* Drop the connection; we will either send a RST or
|
|
|
|
* have the peer retransmit its SYN again after its
|
|
|
|
* RTO and try again.
|
2001-11-22 04:50:44 +00:00
|
|
|
*/
|
2009-04-11 22:07:19 +00:00
|
|
|
TCPSTAT_INC(tcps_listendrop);
|
2007-05-18 21:13:01 +00:00
|
|
|
if ((s = tcp_log_addrs(&sc->sc_inc, NULL, NULL, NULL))) {
|
|
|
|
log(LOG_DEBUG, "%s; %s: Socket create failed "
|
|
|
|
"due to limits or memory shortage\n",
|
|
|
|
s, __func__);
|
|
|
|
free(s, M_TCPLOG);
|
|
|
|
}
|
2003-11-11 17:54:47 +00:00
|
|
|
goto abort2;
|
2001-11-22 04:50:44 +00:00
|
|
|
}
|
2002-07-31 19:06:49 +00:00
|
|
|
#ifdef MAC
|
2007-10-24 19:04:04 +00:00
|
|
|
mac_socketpeer_set_from_mbuf(m, so);
|
2002-07-31 19:06:49 +00:00
|
|
|
#endif
|
2001-11-22 04:50:44 +00:00
|
|
|
|
|
|
|
inp = sotoinpcb(so);
|
2009-07-28 19:43:27 +00:00
|
|
|
inp->inp_inc.inc_fibnum = so->so_fibnum;
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK(inp);
|
2015-08-03 12:13:54 +00:00
|
|
|
/*
|
|
|
|
* Exclusive pcbinfo lock is not required in syncache socket case even
|
|
|
|
* if two inpcb locks can be acquired simultaneously:
|
|
|
|
* - the inpcb in LISTEN state,
|
|
|
|
* - the newly created inp.
|
|
|
|
*
|
|
|
|
* In this case, an inp cannot be at same time in LISTEN state and
|
|
|
|
* just created by an accept() call.
|
|
|
|
*/
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need to acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
|
|
|
INP_HASH_WLOCK(&V_tcbinfo);
|
2001-11-22 04:50:44 +00:00
|
|
|
|
2006-06-17 17:32:38 +00:00
|
|
|
/* Insert new socket into PCB hash list. */
|
2008-12-17 12:52:34 +00:00
|
|
|
inp->inp_inc.inc_flags = sc->sc_inc.inc_flags;
|
2001-11-22 04:50:44 +00:00
|
|
|
#ifdef INET6
|
2008-12-17 12:52:34 +00:00
|
|
|
if (sc->sc_inc.inc_flags & INC_ISIPV6) {
|
2001-11-22 04:50:44 +00:00
|
|
|
inp->in6p_laddr = sc->sc_inc.inc6_laddr;
|
|
|
|
} else {
|
|
|
|
inp->inp_vflag &= ~INP_IPV6;
|
|
|
|
inp->inp_vflag |= INP_IPV4;
|
|
|
|
#endif
|
|
|
|
inp->inp_laddr = sc->sc_inc.inc_laddr;
|
|
|
|
#ifdef INET6
|
|
|
|
}
|
|
|
|
#endif
|
Implement a CPU-affine TCP and UDP connection lookup data structure,
struct inpcbgroup. pcbgroups, or "connection groups", supplement the
existing inpcbinfo connection hash table, which when pcbgroups are
enabled, might now be thought of more usefully as a per-protocol
4-tuple reservation table.
Connections are assigned to connection groups based on a hash of their
4-tuple; wildcard sockets require special handling, and are members
of all connection groups. During a connection lookup, a
per-connection group lock is employed rather than the global pcbinfo
lock. By aligning connection groups with input path processing,
connection groups take on an effective CPU affinity, especially when
aligned with RSS work placement (see a forthcoming commit for
details). This eliminates cache line migration associated with
global, protocol-layer data structures in steady state TCP and UDP
processing (with the exception of protocol-layer statistics; further
commit to follow).
Elements of this approach were inspired by Willman, Rixner, and Cox's
2006 USENIX paper, "An Evaluation of Network Stack Parallelization
Strategies in Modern Operating Systems". However, there are also
significant differences: we maintain the inpcb lock, rather than using
the connection group lock for per-connection state.
Likewise, the focus of this implementation is alignment with NIC
packet distribution strategies such as RSS, rather than pure software
strategies. Despite that focus, software distribution is supported
through the parallel netisr implementation, and works well in
configurations where the number of hardware threads is greater than
the number of NIC input queues, such as in the RMI XLR threaded MIPS
architecture.
Another important difference is the continued maintenance of existing
hash tables as "reservation tables" -- these are useful both to
distinguish the resource allocation aspect of protocol name management
and the more common-case lookup aspect. In configurations where
connection tables are aligned with hardware hashes, it is desirable to
use the traditional lookup tables for loopback or encapsulated traffic
rather than take the expense of hardware hashes that are hard to
implement efficiently in software (such as RSS Toeplitz).
Connection group support is enabled by compiling "options PCBGROUP"
into your kernel configuration; for the time being, this is an
experimental feature, and hence is not enabled by default.
Subject to the limited MFCability of change dependencies in inpcb,
and its change to the inpcbinfo init function signature, this change
in principle could be merged to FreeBSD 8.x.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
2011-06-06 12:55:02 +00:00
|
|
|
|
2014-01-18 23:48:20 +00:00
|
|
|
/*
|
|
|
|
* If there's an mbuf and it has a flowid, then let's initialise the
|
|
|
|
* inp with that particular flowid.
|
|
|
|
*/
|
2014-12-01 11:45:24 +00:00
|
|
|
if (m != NULL && M_HASHTYPE_GET(m) != M_HASHTYPE_NONE) {
|
2014-01-18 23:48:20 +00:00
|
|
|
inp->inp_flowid = m->m_pkthdr.flowid;
|
2014-05-18 22:34:06 +00:00
|
|
|
inp->inp_flowtype = M_HASHTYPE_GET(m);
|
2014-01-18 23:48:20 +00:00
|
|
|
}
|
|
|
|
|
Implement a CPU-affine TCP and UDP connection lookup data structure,
struct inpcbgroup. pcbgroups, or "connection groups", supplement the
existing inpcbinfo connection hash table, which when pcbgroups are
enabled, might now be thought of more usefully as a per-protocol
4-tuple reservation table.
Connections are assigned to connection groups base on a hash of their
4-tuple; wildcard sockets require special handling, and are members
of all connection groups. During a connection lookup, a
per-connection group lock is employed rather than the global pcbinfo
lock. By aligning connection groups with input path processing,
connection groups take on an effective CPU affinity, especially when
aligned with RSS work placement (see a forthcoming commit for
details). This eliminates cache line migration associated with
global, protocol-layer data structures in steady state TCP and UDP
processing (with the exception of protocol-layer statistics; further
commit to follow).
Elements of this approach were inspired by Willman, Rixner, and Cox's
2006 USENIX paper, "An Evaluation of Network Stack Parallelization
Strategies in Modern Operating Systems". However, there are also
significant differences: we maintain the inpcb lock, rather than using
the connection group lock for per-connection state.
Likewise, the focus of this implementation is alignment with NIC
packet distribution strategies such as RSS, rather than pure software
strategies. Despite that focus, software distribution is supported
through the parallel netisr implementation, and works well in
configurations where the number of hardware threads is greater than
the number of NIC input queues, such as in the RMI XLR threaded MIPS
architecture.
Another important difference is the continued maintenance of existing
hash tables as "reservation tables" -- these are useful both to
distinguish the resource allocation aspect of protocol name management
and the more common-case lookup aspect. In configurations where
connection tables are aligned with hardware hashes, it is desirable to
use the traditional lookup tables for loopback or encapsulated traffic
rather than take the expense of hardware hashes that are hard to
implement efficiently in software (such as RSS Toeplitz).
Connection group support is enabled by compiling "options PCBGROUP"
into your kernel configuration; for the time being, this is an
experimental feature, and hence is not enabled by default.
Subject to the limited MFCability of change dependencies in inpcb,
and its change to the inpcbinfo init function signature, this change
in principle could be merged to FreeBSD 8.x.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
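
The group assignment described in the commit message above can be sketched in user space as follows. This is a minimal illustration, not the kernel's implementation: `struct four_tuple`, `tuple_hash()`, and `tuple_to_group()` are hypothetical names, and the FNV-1a mix is only a stand-in for whatever hash (for example RSS Toeplitz) the NIC or stack provides. Wildcard sockets, which belong to every group, are not modelled here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 4-tuple; field names are illustrative, not the kernel's. */
struct four_tuple {
	uint32_t laddr;
	uint32_t faddr;
	uint16_t lport;
	uint16_t fport;
};

/*
 * Stand-in mixing function (FNV-1a over the tuple bytes). Assumption:
 * any deterministic hash works for the purpose of this sketch.
 */
static uint32_t
tuple_hash(const struct four_tuple *t)
{
	uint32_t words[3] = {
		t->laddr,
		t->faddr,
		((uint32_t)t->lport << 16) | t->fport
	};
	uint32_t h = 2166136261u;

	for (int i = 0; i < 3; i++) {
		for (int b = 0; b < 4; b++) {
			h ^= (words[i] >> (8 * b)) & 0xff;
			h *= 16777619u;
		}
	}
	return (h);
}

/* Map a fully specified connection to one of ngroups connection groups. */
static unsigned int
tuple_to_group(const struct four_tuple *t, unsigned int ngroups)
{
	assert(ngroups > 0);
	return (tuple_hash(t) % ngroups);
}
```

Because the mapping depends only on the 4-tuple, every packet of a given connection selects the same group, which is what lets a group take on an effective CPU affinity when the input path uses the same distribution.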
2011-06-06 12:55:02 +00:00
|
|
|
/*
|
|
|
|
* Install in the reservation hash table for now, but don't yet
|
|
|
|
* install a connection group since the full 4-tuple isn't yet
|
|
|
|
* configured.
|
|
|
|
*/
|
2001-11-22 04:50:44 +00:00
|
|
|
inp->inp_lport = sc->sc_inc.inc_lport;
|
2011-06-06 12:55:02 +00:00
|
|
|
if ((error = in_pcbinshash_nopcbgroup(inp)) != 0) {
|
2001-11-22 04:50:44 +00:00
|
|
|
/*
|
|
|
|
* Undo the assignments above if we failed to
|
|
|
|
* put the PCB on the hash lists.
|
|
|
|
*/
|
|
|
|
#ifdef INET6
|
2008-12-17 12:52:34 +00:00
|
|
|
if (sc->sc_inc.inc_flags & INC_ISIPV6)
|
2001-11-22 04:50:44 +00:00
|
|
|
inp->in6p_laddr = in6addr_any;
|
2004-08-16 18:32:07 +00:00
|
|
|
else
|
2001-11-22 04:50:44 +00:00
|
|
|
#endif
|
|
|
|
inp->inp_laddr.s_addr = INADDR_ANY;
|
|
|
|
inp->inp_lport = 0;
|
2010-08-15 09:30:13 +00:00
|
|
|
if ((s = tcp_log_addrs(&sc->sc_inc, NULL, NULL, NULL))) {
|
|
|
|
log(LOG_DEBUG, "%s; %s: in_pcbinshash failed "
|
|
|
|
"with error %i\n",
|
|
|
|
s, __func__, error);
|
|
|
|
free(s, M_TCPLOG);
|
|
|
|
}
|
Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags,
supplementing the existing INPLOOKUP_WILDCARD flag, are:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
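
The rule that callers must pass exactly one of the two lock flags can be sketched as a small check. This is a hedged illustration only: the numeric flag values below are assumed for the example and may differ from the kernel's definitions, and `SKETCH_LOCKMASK`, `lookup_flags_valid()`, and `lookup_flags_check()` are hypothetical names that do not appear in the source.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative values; the kernel's actual definitions may differ. */
#define	INPLOOKUP_WILDCARD	0x00000001
#define	INPLOOKUP_RLOCKPCB	0x00000002
#define	INPLOOKUP_WLOCKPCB	0x00000004

/* Hypothetical helper mask covering the two lock-mode flags. */
#define	SKETCH_LOCKMASK		(INPLOOKUP_RLOCKPCB | INPLOOKUP_WLOCKPCB)

/* Return true if exactly one of the two lock flags is set. */
static bool
lookup_flags_valid(int flags)
{
	int lock = flags & SKETCH_LOCKMASK;

	return (lock == INPLOOKUP_RLOCKPCB || lock == INPLOOKUP_WLOCKPCB);
}

/* Assertion-style wrapper, in the spirit of a kernel KASSERT. */
static void
lookup_flags_check(int flags)
{
	assert(lookup_flags_valid(flags));
}
```

INPLOOKUP_WILDCARD is independent of the lock mode, so it may be combined with either lock flag but is not sufficient on its own.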
2011-05-30 09:43:55 +00:00
|
|
|
INP_HASH_WUNLOCK(&V_tcbinfo);
|
2001-11-22 04:50:44 +00:00
|
|
|
goto abort;
|
|
|
|
}
|
|
|
|
#ifdef INET6
|
2008-12-17 12:52:34 +00:00
|
|
|
if (sc->sc_inc.inc_flags & INC_ISIPV6) {
|
2001-11-22 04:50:44 +00:00
|
|
|
struct inpcb *oinp = sotoinpcb(lso);
|
|
|
|
struct in6_addr laddr6;
|
2004-01-22 23:10:11 +00:00
|
|
|
struct sockaddr_in6 sin6;
|
2001-11-22 04:50:44 +00:00
|
|
|
/*
|
|
|
|
* Inherit socket options from the listening socket.
|
|
|
|
* Note that in6p_inputopts are not (and should not be)
|
|
|
|
* copied, since it stores previously received options and is
|
|
|
|
* used to detect if each new option is different than the
|
|
|
|
* previous one and hence should be passed to a user.
|
2004-08-16 18:32:07 +00:00
|
|
|
* If we copied in6p_inputopts, a user would not be able to
|
2001-11-22 04:50:44 +00:00
|
|
|
* receive options just after calling the accept system call.
|
|
|
|
*/
|
|
|
|
inp->inp_flags |= oinp->inp_flags & INP_CONTROLOPTS;
|
|
|
|
if (oinp->in6p_outputopts)
|
|
|
|
inp->in6p_outputopts =
|
|
|
|
ip6_copypktopts(oinp->in6p_outputopts, M_NOWAIT);
|
|
|
|
|
2004-01-22 23:10:11 +00:00
|
|
|
sin6.sin6_family = AF_INET6;
|
|
|
|
sin6.sin6_len = sizeof(sin6);
|
|
|
|
sin6.sin6_addr = sc->sc_inc.inc6_faddr;
|
|
|
|
sin6.sin6_port = sc->sc_inc.inc_fport;
|
|
|
|
sin6.sin6_flowinfo = sin6.sin6_scope_id = 0;
|
2001-11-22 04:50:44 +00:00
|
|
|
laddr6 = inp->in6p_laddr;
|
|
|
|
if (IN6_IS_ADDR_UNSPECIFIED(&inp->in6p_laddr))
|
|
|
|
inp->in6p_laddr = sc->sc_inc.inc6_laddr;
|
Add _mbuf() variants of various inpcb-related interfaces, including lookup,
hash install, etc. For now, these arguments are unused, but as we add
RSS support, we will want to use hashes extracted from mbufs, rather than
manually calculated hashes of header fields, due to the expense of the
software version of Toeplitz (and similar hashes).
Add notes that it would be nice to be able to pass mbufs into lookup
routines in pf(4), optimising firewall lookup in the same way, but the
code structure there doesn't facilitate that currently.
(In principle there is no reason this couldn't be MFCed -- the change
extends rather than modifies the KBI. However, it won't be useful without
other previous possibly less MFCable changes.)
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
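
The motivation for the _mbuf() variants can be sketched as follows: prefer a hash that already travels with the packet, and fall back to a software hash only when none is available. This is an assumption-laden user-space sketch, not the kernel interface: `struct pkt_meta`, `software_hash()`, and `lookup_hash()` are hypothetical names, and the cheap integer mix stands in for a real software Toeplitz implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for hash metadata a packet buffer might carry. */
struct pkt_meta {
	bool	 has_hw_hash;	/* did the NIC supply a hash? */
	uint32_t hw_hash;	/* value supplied by the NIC, if any */
};

/* Cheap placeholder for a (much more expensive) software hash. */
static uint32_t
software_hash(uint32_t laddr, uint32_t faddr, uint16_t lport, uint16_t fport)
{
	uint32_t h = laddr ^ faddr ^ (((uint32_t)lport << 16) | fport);

	h ^= h >> 16;
	h *= 0x45d9f3bu;
	h ^= h >> 16;
	return (h);
}

/* Use the packet's hash when present; otherwise compute one in software. */
static uint32_t
lookup_hash(const struct pkt_meta *m, uint32_t laddr, uint32_t faddr,
    uint16_t lport, uint16_t fport)
{
	if (m != NULL && m->has_hw_hash)
		return (m->hw_hash);
	return (software_hash(laddr, faddr, lport, fport));
}
```

Passing NULL models callers that have no packet at hand, in which case the software path is always taken.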
2011-06-04 16:33:06 +00:00
|
|
|
if ((error = in6_pcbconnect_mbuf(inp, (struct sockaddr *)&sin6,
|
|
|
|
thread0.td_ucred, m)) != 0) {
|
2001-11-22 04:50:44 +00:00
|
|
|
inp->in6p_laddr = laddr6;
|
2010-08-15 09:30:13 +00:00
|
|
|
if ((s = tcp_log_addrs(&sc->sc_inc, NULL, NULL, NULL))) {
|
|
|
|
log(LOG_DEBUG, "%s; %s: in6_pcbconnect failed "
|
|
|
|
"with error %i\n",
|
|
|
|
s, __func__, error);
|
|
|
|
free(s, M_TCPLOG);
|
|
|
|
}
|
2011-05-30 09:43:55 +00:00
|
|
|
INP_HASH_WUNLOCK(&V_tcbinfo);
|
2001-11-22 04:50:44 +00:00
|
|
|
goto abort;
|
|
|
|
}
|
2004-07-17 19:44:13 +00:00
|
|
|
/* Override flowlabel from in6_pcbconnect. */
|
Another step assimilating IPv[46] PCB code - directly use
the inpcb names rather than the following IPv6 compat macros:
in6pcb,in6p_sp, in6p_ip6_nxt,in6p_flowinfo,in6p_vflag,
in6p_flags,in6p_socket,in6p_lport,in6p_fport,in6p_ppcb and
sotoin6pcb().
Apart from removing duplicate code in netipsec, this is a pure
whitespace, not a functional change.
Discussed with: rwatson
Reviewed by: rwatson (version before review requested changes)
MFC after: 4 weeks (set the timer and see then)
2008-12-15 21:50:54 +00:00
|
|
|
inp->inp_flow &= ~IPV6_FLOWLABEL_MASK;
|
|
|
|
inp->inp_flow |= sc->sc_flowlabel;
|
2011-04-30 11:21:29 +00:00
|
|
|
}
|
|
|
|
#endif /* INET6 */
|
|
|
|
#if defined(INET) && defined(INET6)
|
|
|
|
else
|
2001-11-22 04:50:44 +00:00
|
|
|
#endif
|
2011-04-30 11:21:29 +00:00
|
|
|
#ifdef INET
|
2001-11-22 04:50:44 +00:00
|
|
|
{
|
|
|
|
struct in_addr laddr;
|
2004-01-22 23:10:11 +00:00
|
|
|
struct sockaddr_in sin;
|
2001-11-22 04:50:44 +00:00
|
|
|
|
2007-12-17 07:56:27 +00:00
|
|
|
inp->inp_options = (m) ? ip_srcroute(m) : NULL;
|
|
|
|
|
2001-11-22 04:50:44 +00:00
|
|
|
if (inp->inp_options == NULL) {
|
|
|
|
inp->inp_options = sc->sc_ipopts;
|
|
|
|
sc->sc_ipopts = NULL;
|
|
|
|
}
|
|
|
|
|
2004-01-22 23:10:11 +00:00
|
|
|
sin.sin_family = AF_INET;
|
|
|
|
sin.sin_len = sizeof(sin);
|
|
|
|
sin.sin_addr = sc->sc_inc.inc_faddr;
|
|
|
|
sin.sin_port = sc->sc_inc.inc_fport;
|
|
|
|
bzero((caddr_t)sin.sin_zero, sizeof(sin.sin_zero));
|
2001-11-22 04:50:44 +00:00
|
|
|
laddr = inp->inp_laddr;
|
|
|
|
if (inp->inp_laddr.s_addr == INADDR_ANY)
|
|
|
|
inp->inp_laddr = sc->sc_inc.inc_laddr;
|
2011-06-04 16:33:06 +00:00
|
|
|
if ((error = in_pcbconnect_mbuf(inp, (struct sockaddr *)&sin,
|
|
|
|
thread0.td_ucred, m)) != 0) {
|
2001-11-22 04:50:44 +00:00
|
|
|
inp->inp_laddr = laddr;
|
2010-08-15 09:30:13 +00:00
|
|
|
if ((s = tcp_log_addrs(&sc->sc_inc, NULL, NULL, NULL))) {
|
|
|
|
log(LOG_DEBUG, "%s; %s: in_pcbconnect failed "
|
|
|
|
"with error %i\n",
|
|
|
|
s, __func__, error);
|
|
|
|
free(s, M_TCPLOG);
|
|
|
|
}
|
2011-05-30 09:43:55 +00:00
|
|
|
INP_HASH_WUNLOCK(&V_tcbinfo);
|
2001-11-22 04:50:44 +00:00
|
|
|
goto abort;
|
|
|
|
}
|
|
|
|
}
|
2011-04-30 11:21:29 +00:00
|
|
|
#endif /* INET */
|
2017-02-06 08:49:57 +00:00
|
|
|
#if defined(IPSEC) || defined(IPSEC_SUPPORT)
|
|
|
|
/* Copy old policy into new socket's. */
|
|
|
|
if (ipsec_copy_pcbpolicy(sotoinpcb(lso), inp) != 0)
|
|
|
|
printf("syncache_socket: could not copy policy\n");
|
|
|
|
#endif
|
2011-05-30 09:43:55 +00:00
|
|
|
INP_HASH_WUNLOCK(&V_tcbinfo);
|
2001-11-22 04:50:44 +00:00
|
|
|
tp = intotcpcb(inp);
|
2013-08-25 21:54:41 +00:00
|
|
|
tcp_state_change(tp, TCPS_SYN_RECEIVED);
|
2001-11-22 04:50:44 +00:00
|
|
|
tp->iss = sc->sc_iss;
|
|
|
|
tp->irs = sc->sc_irs;
|
|
|
|
tcp_rcvseqinit(tp);
|
|
|
|
tcp_sendseqinit(tp);
|
2015-12-16 00:56:45 +00:00
|
|
|
blk = sototcpcb(lso)->t_fb;
|
2017-01-27 23:10:46 +00:00
|
|
|
if (V_functions_inherit_listen_socket_stack && blk != tp->t_fb) {
|
2015-12-16 00:56:45 +00:00
|
|
|
/*
|
|
|
|
* Our parent's t_fb was not the default,
|
|
|
|
* we need to release our ref on tp->t_fb and
|
|
|
|
* pickup one on the new entry.
|
|
|
|
*/
|
|
|
|
struct tcp_function_block *rblk;
|
|
|
|
|
|
|
|
rblk = find_and_ref_tcp_fb(blk);
|
|
|
|
KASSERT(rblk != NULL,
|
|
|
|
("cannot find blk %p out of syncache?", blk));
|
|
|
|
if (tp->t_fb->tfb_tcp_fb_fini)
|
2016-08-16 15:11:46 +00:00
|
|
|
(*tp->t_fb->tfb_tcp_fb_fini)(tp, 0);
|
2015-12-16 00:56:45 +00:00
|
|
|
refcount_release(&tp->t_fb->tfb_refcnt);
|
|
|
|
tp->t_fb = rblk;
|
|
|
|
if (tp->t_fb->tfb_tcp_fb_init) {
|
|
|
|
(*tp->t_fb->tfb_tcp_fb_init)(tp);
|
|
|
|
}
|
|
|
|
}
|
2001-11-22 04:50:44 +00:00
|
|
|
tp->snd_wl1 = sc->sc_irs;
|
2007-04-04 16:13:45 +00:00
|
|
|
tp->snd_max = tp->iss + 1;
|
|
|
|
tp->snd_nxt = tp->iss + 1;
|
2001-11-22 04:50:44 +00:00
|
|
|
tp->rcv_up = sc->sc_irs + 1;
|
|
|
|
tp->rcv_wnd = sc->sc_wnd;
|
|
|
|
tp->rcv_adv += tp->rcv_wnd;
|
2007-04-04 16:13:45 +00:00
|
|
|
tp->last_ack_sent = tp->rcv_nxt;
|
2001-11-22 04:50:44 +00:00
|
|
|
|
2002-02-20 16:47:11 +00:00
|
|
|
tp->t_flags = sototcpcb(lso)->t_flags & (TF_NOPUSH|TF_NODELAY);
|
2001-11-22 04:50:44 +00:00
|
|
|
if (sc->sc_flags & SCF_NOOPT)
|
|
|
|
tp->t_flags |= TF_NOOPT;
|
2006-06-26 16:14:19 +00:00
|
|
|
else {
|
|
|
|
if (sc->sc_flags & SCF_WINSCALE) {
|
|
|
|
tp->t_flags |= TF_REQ_SCALE|TF_RCVD_SCALE;
|
|
|
|
tp->snd_scale = sc->sc_requested_s_scale;
|
|
|
|
tp->request_r_scale = sc->sc_requested_r_scale;
|
|
|
|
}
|
|
|
|
if (sc->sc_flags & SCF_TIMESTAMP) {
|
|
|
|
tp->t_flags |= TF_REQ_TSTMP|TF_RCVD_TSTMP;
|
|
|
|
tp->ts_recent = sc->sc_tsreflect;
|
2012-02-15 16:09:56 +00:00
|
|
|
tp->ts_recent_age = tcp_ts_getticks();
|
2006-09-13 13:08:27 +00:00
|
|
|
tp->ts_offset = sc->sc_tsoff;
|
2006-06-26 16:14:19 +00:00
|
|
|
}
|
2017-02-06 08:49:57 +00:00
|
|
|
#if defined(IPSEC_SUPPORT) || defined(TCP_SIGNATURE)
|
2006-06-26 16:14:19 +00:00
|
|
|
if (sc->sc_flags & SCF_SIGNATURE)
|
|
|
|
tp->t_flags |= TF_SIGNATURE;
|
2004-02-13 18:21:45 +00:00
|
|
|
#endif
|
2007-05-06 15:56:31 +00:00
|
|
|
if (sc->sc_flags & SCF_SACK)
|
2006-06-26 16:14:19 +00:00
|
|
|
tp->t_flags |= TF_SACK_PERMIT;
|
2004-06-23 21:04:37 +00:00
|
|
|
}
|
2006-06-17 17:32:38 +00:00
|
|
|
|
2008-07-31 15:10:09 +00:00
|
|
|
if (sc->sc_flags & SCF_ECN)
|
|
|
|
tp->t_flags |= TF_ECN_PERMIT;
|
|
|
|
|
2003-11-20 20:07:39 +00:00
|
|
|
/*
|
|
|
|
* Set up MSS and get cached values from tcp_hostcache.
|
|
|
|
* This might overwrite some of the defaults we just set.
|
|
|
|
*/
|
2001-11-22 04:50:44 +00:00
|
|
|
tcp_mss(tp, sc->sc_peer_mss);
|
|
|
|
|
|
|
|
/*
|
2012-10-28 17:25:08 +00:00
|
|
|
* If the SYN,ACK was retransmitted, indicate that CWND is to be
|
|
|
|
* limited to one segment in cc_conn_init().
|
2010-07-30 21:45:53 +00:00
|
|
|
* NB: sc_rxmits counts all SYN,ACK transmits, not just retransmits.
|
2001-11-22 04:50:44 +00:00
|
|
|
*/
|
2010-07-30 21:45:53 +00:00
|
|
|
if (sc->sc_rxmits > 1)
|
2012-10-28 17:25:08 +00:00
|
|
|
tp->snd_cwnd = 1;
|
2012-02-05 16:53:02 +00:00
|
|
|
|
2012-06-19 07:34:13 +00:00
|
|
|
#ifdef TCP_OFFLOAD
|
|
|
|
/*
|
|
|
|
* Allow a TOE driver to install its hooks. Note that we hold the
|
|
|
|
* pcbinfo lock too and that prevents tcp_usr_accept from accepting a
|
|
|
|
* new connection before the TOE driver has done its thing.
|
|
|
|
*/
|
|
|
|
if (ADDED_BY_TOE(sc)) {
|
|
|
|
struct toedev *tod = sc->sc_tod;
|
|
|
|
|
|
|
|
tod->tod_offload_socket(tod, sc->sc_todctx, so);
|
|
|
|
}
|
|
|
|
#endif
|
2012-02-05 16:53:02 +00:00
|
|
|
/*
|
|
|
|
* Copy and activate timers.
|
|
|
|
*/
|
|
|
|
tp->t_keepinit = sototcpcb(lso)->t_keepinit;
|
|
|
|
tp->t_keepidle = sototcpcb(lso)->t_keepidle;
|
|
|
|
tp->t_keepintvl = sototcpcb(lso)->t_keepintvl;
|
|
|
|
tp->t_keepcnt = sototcpcb(lso)->t_keepcnt;
|
|
|
|
tcp_timer_activate(tp, TT_KEEP, TP_KEEPINIT(tp));
|
2001-11-22 04:50:44 +00:00
|
|
|
|
2009-04-11 22:07:19 +00:00
|
|
|
TCPSTAT_INC(tcps_accepts);
|
2001-11-22 04:50:44 +00:00
|
|
|
return (so);
|
|
|
|
|
|
|
|
abort:
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2003-11-11 17:54:47 +00:00
|
|
|
abort2:
|
2001-11-22 04:50:44 +00:00
|
|
|
if (so != NULL)
|
2006-03-16 07:03:14 +00:00
|
|
|
soabort(so);
|
2001-11-22 04:50:44 +00:00
|
|
|
return (NULL);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This function gets called when we receive an ACK for a
|
|
|
|
* socket in the LISTEN state. We look up the connection
|
|
|
|
* in the syncache, and if it's there, we pull it out of
|
|
|
|
* the cache and turn it into a full-blown connection in
|
|
|
|
* the SYN-RECEIVED state.
|
2015-08-03 12:13:54 +00:00
|
|
|
*
|
|
|
|
* On syncache_socket() success the newly created socket
|
|
|
|
* has its underlying inp locked.
|
2001-11-22 04:50:44 +00:00
|
|
|
*/
|
|
|
|
int
|
2006-09-13 13:08:27 +00:00
|
|
|
syncache_expand(struct in_conninfo *inc, struct tcpopt *to, struct tcphdr *th,
|
2006-06-17 17:49:11 +00:00
|
|
|
struct socket **lsop, struct mbuf *m)
|
2001-11-22 04:50:44 +00:00
|
|
|
{
|
|
|
|
struct syncache *sc;
|
|
|
|
struct syncache_head *sch;
|
2006-09-13 13:08:27 +00:00
|
|
|
struct syncache scs;
|
2007-05-18 21:13:01 +00:00
|
|
|
char *s;
|
2001-11-22 04:50:44 +00:00
|
|
|
|
2006-06-17 17:32:38 +00:00
|
|
|
/*
|
|
|
|
* Global TCP locks are held because we manipulate the PCB lists
|
|
|
|
* and create a new socket.
|
|
|
|
*/
|
2015-08-03 12:13:54 +00:00
|
|
|
INP_INFO_RLOCK_ASSERT(&V_tcbinfo);
|
2007-05-18 21:13:01 +00:00
|
|
|
KASSERT((th->th_flags & (TH_RST|TH_ACK|TH_SYN)) == TH_ACK,
|
|
|
|
("%s: can handle only ACK", __func__));
|
2003-11-11 17:54:47 +00:00
|
|
|
|
2006-06-17 17:32:38 +00:00
|
|
|
sc = syncache_lookup(inc, &sch); /* returns locked sch */
|
|
|
|
SCH_LOCK_ASSERT(sch);
|
2013-07-11 15:29:25 +00:00
|
|
|
|
|
|
|
#ifdef INVARIANTS
|
|
|
|
/*
|
|
|
|
* Test code for syncookies comparing the syncache stored
|
|
|
|
* values with the reconstructed values from the cookie.
|
|
|
|
*/
|
|
|
|
if (sc != NULL)
|
|
|
|
syncookie_cmp(inc, sch, sc, th, to, *lsop);
|
|
|
|
#endif
|
|
|
|
|
2001-12-19 06:12:14 +00:00
|
|
|
if (sc == NULL) {
|
|
|
|
/*
|
2004-08-16 18:32:07 +00:00
|
|
|
* There is no syncache entry, so see if this ACK is
|
2001-12-19 06:12:14 +00:00
|
|
|
* a returning syncookie. To do this, first:
|
2017-04-20 19:19:33 +00:00
|
|
|
* A. Check if syncookies are used in case of syncache
|
|
|
|
* overflows
|
|
|
|
* B. See if this socket has had a syncache entry dropped in
|
|
|
|
* the recent past. We don't want to accept a bogus
|
|
|
|
* syncookie if we've never received a SYN or accept it
|
|
|
|
* twice.
|
|
|
|
* C. check that the syncookie is valid. If it is, then
|
2001-12-19 06:12:14 +00:00
|
|
|
* cobble up a fake syncache entry, and return.
|
|
|
|
*/
|
2008-11-26 22:32:07 +00:00
|
|
|
if (!V_tcp_syncookies) {
|
2006-09-13 13:08:27 +00:00
|
|
|
SCH_UNLOCK(sch);
|
2007-05-18 21:13:01 +00:00
|
|
|
if ((s = tcp_log_addrs(inc, th, NULL, NULL)))
|
2007-05-28 23:27:44 +00:00
|
|
|
log(LOG_DEBUG, "%s; %s: Spurious ACK, "
|
|
|
|
"segment rejected (syncookies disabled)\n",
|
2007-05-18 21:13:01 +00:00
|
|
|
s, __func__);
|
2006-06-17 17:32:38 +00:00
|
|
|
goto failed;
|
2006-09-13 13:08:27 +00:00
|
|
|
}
|
2017-04-20 19:19:33 +00:00
|
|
|
if (!V_tcp_syncookiesonly &&
|
|
|
|
sch->sch_last_overflow < time_uptime - SYNCOOKIE_LIFETIME) {
|
|
|
|
SCH_UNLOCK(sch);
|
|
|
|
if ((s = tcp_log_addrs(inc, th, NULL, NULL)))
|
|
|
|
log(LOG_DEBUG, "%s; %s: Spurious ACK, "
|
|
|
|
"segment rejected (no syncache entry)\n",
|
|
|
|
s, __func__);
|
|
|
|
goto failed;
|
|
|
|
}
|
2006-09-13 13:08:27 +00:00
|
|
|
bzero(&scs, sizeof(scs));
|
2013-07-11 15:29:25 +00:00
|
|
|
sc = syncookie_lookup(inc, sch, &scs, th, to, *lsop);
|
2006-09-13 13:08:27 +00:00
|
|
|
SCH_UNLOCK(sch);
|
2007-05-18 21:13:01 +00:00
|
|
|
if (sc == NULL) {
|
|
|
|
if ((s = tcp_log_addrs(inc, th, NULL, NULL)))
|
|
|
|
log(LOG_DEBUG, "%s; %s: Segment failed "
|
2007-05-28 23:27:44 +00:00
|
|
|
"SYNCOOKIE authentication, segment rejected "
|
|
|
|
"(probably spoofed)\n", s, __func__);
|
2006-06-17 17:32:38 +00:00
|
|
|
goto failed;
|
2007-05-18 21:13:01 +00:00
|
|
|
}
|
2017-02-06 08:49:57 +00:00
|
|
|
#if defined(IPSEC_SUPPORT) || defined(TCP_SIGNATURE)
|
|
|
|
/* If received ACK has MD5 signature, check it. */
|
|
|
|
if ((to->to_flags & TOF_SIGNATURE) != 0 &&
|
|
|
|
(!TCPMD5_ENABLED() ||
|
|
|
|
TCPMD5_INPUT(m, th, to->to_signature) != 0)) {
|
|
|
|
/* Drop the ACK. */
|
|
|
|
if ((s = tcp_log_addrs(inc, th, NULL, NULL))) {
|
|
|
|
log(LOG_DEBUG, "%s; %s: Segment rejected, "
|
|
|
|
"MD5 signature doesn't match.\n",
|
|
|
|
s, __func__);
|
|
|
|
free(s, M_TCPLOG);
|
|
|
|
}
|
|
|
|
TCPSTAT_INC(tcps_sig_err_sigopt);
|
|
|
|
return (-1); /* Do not send RST */
|
|
|
|
}
|
|
|
|
#endif /* TCP_SIGNATURE */
|
2006-06-17 17:32:38 +00:00
|
|
|
} else {
|
2017-02-06 08:49:57 +00:00
|
|
|
#if defined(IPSEC_SUPPORT) || defined(TCP_SIGNATURE)
|
|
|
|
/*
|
|
|
|
* If listening socket requested TCP digests, check that
|
|
|
|
* received ACK has signature and it is correct.
|
|
|
|
* If not, drop the ACK and leave sc entry in the cache,
|
|
|
|
* because SYN was received with correct signature.
|
|
|
|
*/
|
|
|
|
if (sc->sc_flags & SCF_SIGNATURE) {
|
|
|
|
if ((to->to_flags & TOF_SIGNATURE) == 0) {
|
|
|
|
/* No signature */
|
|
|
|
TCPSTAT_INC(tcps_sig_err_nosigopt);
|
|
|
|
SCH_UNLOCK(sch);
|
|
|
|
if ((s = tcp_log_addrs(inc, th, NULL, NULL))) {
|
|
|
|
log(LOG_DEBUG, "%s; %s: Segment "
|
|
|
|
"rejected, MD5 signature wasn't "
|
|
|
|
"provided.\n", s, __func__);
|
|
|
|
free(s, M_TCPLOG);
|
|
|
|
}
|
|
|
|
return (-1); /* Do not send RST */
|
|
|
|
}
|
|
|
|
if (!TCPMD5_ENABLED() ||
|
|
|
|
TCPMD5_INPUT(m, th, to->to_signature) != 0) {
|
|
|
|
/* Doesn't match or no SA */
|
|
|
|
SCH_UNLOCK(sch);
|
|
|
|
if ((s = tcp_log_addrs(inc, th, NULL, NULL))) {
|
|
|
|
log(LOG_DEBUG, "%s; %s: Segment "
|
|
|
|
"rejected, MD5 signature doesn't "
|
|
|
|
"match.\n", s, __func__);
|
|
|
|
free(s, M_TCPLOG);
|
|
|
|
}
|
|
|
|
return (-1); /* Do not send RST */
|
|
|
|
}
|
|
|
|
}
|
|
|
|
#endif /* TCP_SIGNATURE */
|
2016-01-27 00:45:46 +00:00
|
|
|
/*
|
|
|
|
* Pull out the entry to unlock the bucket row.
|
|
|
|
*
|
|
|
|
* NOTE: We must decrease TCPS_SYN_RECEIVED count here, not
|
|
|
|
* tcp_state_change(). The tcpcb is not existent at this
|
|
|
|
* moment. A new one will be allocated via syncache_socket->
|
|
|
|
* sonewconn->tcp_usr_attach in TCPS_CLOSED state, then
|
|
|
|
* syncache_socket() will change it to TCPS_SYN_RECEIVED.
|
|
|
|
*/
|
2016-03-15 00:15:10 +00:00
|
|
|
TCPSTATES_DEC(TCPS_SYN_RECEIVED);
|
2006-06-17 17:32:38 +00:00
|
|
|
TAILQ_REMOVE(&sch->sch_bucket, sc, sc_hash);
|
|
|
|
sch->sch_length--;
|
2012-06-19 07:34:13 +00:00
|
|
|
#ifdef TCP_OFFLOAD
|
|
|
|
if (ADDED_BY_TOE(sc)) {
|
|
|
|
struct toedev *tod = sc->sc_tod;
|
|
|
|
|
|
|
|
tod->tod_syncache_removed(tod, sc->sc_todctx);
|
|
|
|
}
|
|
|
|
#endif
|
2006-06-17 17:32:38 +00:00
|
|
|
SCH_UNLOCK(sch);
|
2001-12-19 06:12:14 +00:00
|
|
|
}
|
2001-11-22 04:50:44 +00:00
|
|
|
|
|
|
|
/*
|
2007-05-18 21:13:01 +00:00
|
|
|
* Segment validation:
|
|
|
|
* ACK must match our initial sequence number + 1 (the SYN|ACK).
|
2001-11-22 04:50:44 +00:00
|
|
|
*/
|
2012-06-19 07:34:13 +00:00
|
|
|
if (th->th_ack != sc->sc_iss + 1) {
|
2007-05-18 21:13:01 +00:00
|
|
|
if ((s = tcp_log_addrs(inc, th, NULL, NULL)))
|
2007-05-28 23:27:44 +00:00
|
|
|
log(LOG_DEBUG, "%s; %s: ACK %u != ISS+1 %u, segment "
|
|
|
|
"rejected\n", s, __func__, th->th_ack, sc->sc_iss);
|
2006-06-17 17:32:38 +00:00
|
|
|
goto failed;
|
2007-05-18 21:13:01 +00:00
|
|
|
}
|
2008-08-05 21:59:20 +00:00
|
|
|
|
2007-05-18 21:42:25 +00:00
|
|
|
/*
|
2008-08-05 21:59:20 +00:00
|
|
|
* The SEQ must fall in the window starting at the received
|
|
|
|
* initial receive sequence number + 1 (the SYN).
|
2007-05-18 21:42:25 +00:00
|
|
|
*/
|
2012-06-19 07:34:13 +00:00
|
|
|
if (SEQ_LEQ(th->th_seq, sc->sc_irs) ||
|
|
|
|
SEQ_GT(th->th_seq, sc->sc_irs + sc->sc_wnd)) {
|
2007-05-18 21:42:25 +00:00
|
|
|
if ((s = tcp_log_addrs(inc, th, NULL, NULL)))
|
2007-05-28 23:27:44 +00:00
|
|
|
log(LOG_DEBUG, "%s; %s: SEQ %u != IRS+1 %u, segment "
|
2007-06-06 22:10:12 +00:00
|
|
|
"rejected\n", s, __func__, th->th_seq, sc->sc_irs);
|
2007-05-18 21:42:25 +00:00
|
|
|
goto failed;
|
|
|
|
}
|
2007-12-12 06:11:50 +00:00
|
|
|
|
2013-07-10 12:06:01 +00:00
|
|
|
/*
|
|
|
|
* If timestamps were not negotiated during SYN/ACK they
|
|
|
|
* must not appear on any segment during this session.
|
|
|
|
*/
|
2007-05-18 21:42:25 +00:00
|
|
|
if (!(sc->sc_flags & SCF_TIMESTAMP) && (to->to_flags & TOF_TS)) {
|
|
|
|
if ((s = tcp_log_addrs(inc, th, NULL, NULL)))
|
2007-05-28 23:27:44 +00:00
|
|
|
log(LOG_DEBUG, "%s; %s: Timestamp not expected, "
|
|
|
|
"segment rejected\n", s, __func__);
|
2007-05-18 21:42:25 +00:00
|
|
|
goto failed;
|
|
|
|
}
|
2013-07-10 12:06:01 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If timestamps were negotiated during SYN/ACK they should
|
|
|
|
* appear on every segment during this session.
|
|
|
|
* XXXAO: This is only informational as there have been unverified
|
|
|
|
* reports of non-compliant stacks.
|
|
|
|
*/
|
|
|
|
if ((sc->sc_flags & SCF_TIMESTAMP) && !(to->to_flags & TOF_TS)) {
|
2013-07-16 16:37:08 +00:00
|
|
|
if ((s = tcp_log_addrs(inc, th, NULL, NULL))) {
|
2013-07-10 12:06:01 +00:00
|
|
|
log(LOG_DEBUG, "%s; %s: Timestamp missing, "
|
|
|
|
"no action\n", s, __func__);
|
2013-07-16 16:37:08 +00:00
|
|
|
free(s, M_TCPLOG);
|
|
|
|
s = NULL;
|
|
|
|
}
|
2013-07-10 12:06:01 +00:00
|
|
|
}
|
|
|
|
|
2007-05-18 21:42:25 +00:00
|
|
|
/*
|
2016-11-21 20:53:11 +00:00
|
|
|
* If timestamps were negotiated, the reflected timestamp
|
|
|
|
* must be equal to what we actually sent in the SYN|ACK
|
|
|
|
* except in the case of 0. Some boxes are known for sending
|
|
|
|
* broken timestamp replies during the 3whs (and potentially
|
|
|
|
* during the connection also).
|
|
|
|
*
|
|
|
|
* Accept the final ACK of 3whs with reflected timestamp of 0
|
|
|
|
* instead of sending a RST and deleting the syncache entry.
|
2007-05-18 21:42:25 +00:00
|
|
|
*/
|
2016-11-21 20:53:11 +00:00
|
|
|
if ((to->to_flags & TOF_TS) && to->to_tsecr &&
|
|
|
|
to->to_tsecr != sc->sc_ts) {
|
2007-05-18 21:42:25 +00:00
|
|
|
if ((s = tcp_log_addrs(inc, th, NULL, NULL)))
|
2007-05-28 23:27:44 +00:00
|
|
|
log(LOG_DEBUG, "%s; %s: TSECR %u != TS %u, "
|
|
|
|
"segment rejected\n",
|
2007-05-18 21:42:25 +00:00
|
|
|
s, __func__, to->to_tsecr, sc->sc_ts);
|
|
|
|
goto failed;
|
|
|
|
}
|
2006-06-17 17:32:38 +00:00
|
|
|
|
2007-04-20 13:51:34 +00:00
|
|
|
*lsop = syncache_socket(sc, *lsop, m);
|
2001-11-22 04:50:44 +00:00
|
|
|
|
2007-04-20 13:51:34 +00:00
|
|
|
if (*lsop == NULL)
|
2009-04-11 22:07:19 +00:00
|
|
|
TCPSTAT_INC(tcps_sc_aborted);
|
2007-04-20 13:51:34 +00:00
|
|
|
else
|
2009-04-11 22:07:19 +00:00
|
|
|
TCPSTAT_INC(tcps_sc_completed);
|
2003-11-20 20:07:39 +00:00
|
|
|
|
Add code to allow the system to handle multiple routing tables.
This particular implementation is designed to be fully backwards compatible
and to be MFC-able to 7.x (and 6.x)
Currently the only protocol that can make use of the multiple tables is IPv4
Similar functionality exists in OpenBSD and Linux.
From my notes:
-----
One thing where FreeBSD has been falling behind, and which by chance I
have some time to work on is "policy based routing", which allows
different
packet streams to be routed by more than just the destination address.
Constraints:
------------
I want to make some form of this available in the 6.x tree
(and by extension 7.x) , but FreeBSD in general needs it so I might as
well do it in -current and back port the portions I need.
One of the ways that this can be done is to have the ability to
instantiate multiple kernel routing tables (which I will now
refer to as "Forwarding Information Bases" or "FIBs" for political
correctness reasons). Which FIB a particular packet uses to make
the next hop decision can be decided by a number of mechanisms.
The policies these mechanisms implement are the "Policies" referred
to in "Policy based routing".
One of the constraints I have if I try to back port this work to
6.x is that it must be implemented as an EXTENSION to the existing
ABIs in 6.x so that third party applications do not need to be
recompiled in the timespan of the branch.
This first version will not have some of the bells and whistles that
will come with later versions. It will, for example, be limited to 16
tables in the first commit.
Implementation method, Compatible version. (part 1)
-------------------------------
For this reason I have implemented a "sufficient subset" of a
multiple routing table solution in Perforce, and back-ported it
to 6.x. (also in Perforce though not always caught up with what I
have done in -current/P4). The subset allows a number of FIBs
to be defined at compile time (8 is sufficient for my purposes in 6.x)
and implements the changes needed to allow IPV4 to use them. I have not
done the changes for ipv6 simply because I do not need it, and I do not
have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it.
Other protocol families are left untouched and should there be
users with proprietary protocol families, they should continue to work
and be oblivious to the existence of the extra FIBs.
To understand how this is done, one must know that the current FIB
code starts everything off with a single dimensional array of
pointers to FIB head structures (One per protocol family), each of
which in turn points to the trie of routes available to that family.
The basic change in the ABI compatible version of the change is to
extend that array to be a 2 dimensional array, so that
instead of protocol family X looking at rt_tables[X] for the
table it needs, it looks at rt_tables[Y][X], where for all
protocol families except ipv4 Y is always 0.
Code that is unaware of the change always just sees the first row
of the table, which of course looks just like the one dimensional
array that existed before.
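The row/column arrangement described above can be sketched as follows. This is only an illustration under stated assumptions: the names demo_tables, demo_table_lookup(), DEMO_NUM_FIBS, DEMO_NUM_AF and DEMO_AF_INET are hypothetical and are not the kernel's identifiers, and the table heads are reduced to a pair of integers.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NUM_FIBS 16   /* hypothetical; mirrors the 16-table limit mentioned in the text */
#define DEMO_NUM_AF    4   /* hypothetical, small number of protocol families */
#define DEMO_AF_INET   2   /* hypothetical index standing in for IPv4 */

/* Stand-in for a per-family routing table head. */
struct demo_rib_head {
	unsigned fib;
	unsigned af;
};

static struct demo_rib_head demo_heads[DEMO_NUM_FIBS][DEMO_NUM_AF];
static struct demo_rib_head *demo_tables[DEMO_NUM_FIBS][DEMO_NUM_AF];

/* IPv4 gets a table in every row; every other family only in row 0. */
void
demo_tables_init(void)
{
	for (unsigned fib = 0; fib < DEMO_NUM_FIBS; fib++) {
		for (unsigned af = 0; af < DEMO_NUM_AF; af++) {
			demo_heads[fib][af].fib = fib;
			demo_heads[fib][af].af = af;
			demo_tables[fib][af] =
			    (af == DEMO_AF_INET || fib == 0) ?
			    &demo_heads[fib][af] : NULL;
		}
	}
}

/* FIB-aware lookup: families other than IPv4 are forced to row 0. */
struct demo_rib_head *
demo_table_lookup(unsigned fib, unsigned af)
{
	assert(fib < DEMO_NUM_FIBS);
	assert(af < DEMO_NUM_AF);
	if (af != DEMO_AF_INET)
		fib = 0;
	return (demo_tables[fib][af]);
}

/* Legacy-style entry point: unaware callers only ever see the first row. */
struct demo_rib_head *
demo_table_lookup_legacy(unsigned af)
{
	return (demo_table_lookup(0, af));
}
```

In this sketch a caller that never passes a FIB number behaves exactly as it would against a one dimensional array, which is the compatibility property the text relies on.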
The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
are all maintained, but refer only to the first row of the array,
so that existing callers in proprietary protocols can continue to
do the "right thing".
Some new entry points are added, for the exclusive use of ipv4 code
called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(),
which have an extra argument which refers the code to the correct row.
In addition, there are some new entry points (currently called
rtalloc_fib() and friends) that check the Address family being
looked up and call either rtalloc() (and friends) if the protocol
is not IPv4 forcing the action to row 0 or to the appropriate row
if it IS IPv4 (and that info is available). These are for calling
from code that is not specific to any particular protocol. The way
these are implemented would change in the non ABI preserving code
to be added later.
One feature of the first version of the code is that for ipv4,
the interface routes show up automatically on all the FIBs, so
that no matter what FIB you select you always have the basic
direct attached hosts available to you. (rtinit() does this
automatically).
You CAN delete an interface route from one FIB should you want
to but by default it's there. ARP information is also available
in each FIB. It's assumed that the same machine would have the
same MAC address, regardless of which FIB you are using to get
to it.
This brings us to how the correct FIB is selected for an outgoing
IPV4 packet.
Firstly, all packets have a FIB associated with them. If nothing
has been done to change it, it will be FIB 0. The FIB is changed
in the following ways.
Packets fall into one of a number of classes.
1/ locally generated packets, coming from a socket/PCB.
Such packets select a FIB from a number associated with the
socket/PCB. This in turn is inherited from the process,
but can be changed by a socket option. The process in turn
inherits it on fork. I have written a utility called setfib
that acts a bit like nice.
setfib -3 ping target.example.com # will use fib 3 for ping.
It is an obvious extension to make it a property of a jail
but I have not done so. It can be achieved by combining the setfib and
jail commands.
2/ packets received on an interface for forwarding.
By default these packets would use table 0,
(or possibly a number settable in a sysctl(not yet)).
but prior to routing the firewall can inspect them (see below).
(possibly in the future you may be able to associate a FIB
with packets received on an interface.. An ifconfig arg, but not yet.)
3/ packets inspected by a packet classifier, which can arbitrarily
associate a fib with it on a packet by packet basis.
A fib assigned to a packet by a packet classifier
(such as ipfw) would over-ride a fib associated by
a more default source. (such as cases 1 or 2).
4/ a tcp listen socket associated with a fib will generate
accept sockets that are associated with that same fib.
5/ Packets generated in response to some other packet (e.g. reset
or icmp packets). These should use the FIB associated with the
packet being responded to.
6/ Packets generated during encapsulation.
gif, tun and other tunnel interfaces will encapsulate using the FIB
that was in effect with the process that set up the tunnel.
Thus setfib 1 ifconfig gif0 [tunnel instructions]
will set the fib for the tunnel to use to be fib 1.
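The precedence between cases 1 to 3 above can be sketched roughly as below. This is a simplified illustration, not the kernel's code path: struct demo_pkt_ctx, demo_select_fib() and DEMO_NUM_FIBS are hypothetical names, and only the default, the socket-inherited FIB and the classifier override are modelled.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_NUM_FIBS 16	/* hypothetical; mirrors the 16-table limit in the text */

/* Hypothetical per-packet context carrying the possible FIB sources. */
struct demo_pkt_ctx {
	bool from_socket;		/* case 1: locally generated packet */
	unsigned socket_fib;		/* FIB inherited from the socket/PCB */
	bool classified;		/* case 3: a classifier tagged the packet */
	unsigned classifier_fib;	/* FIB assigned by the classifier */
};

/*
 * Pick the FIB for a packet: start from FIB 0 (case 2, forwarded
 * traffic), prefer the socket's FIB for local traffic, and let a
 * classifier assignment override either of the more default sources.
 */
unsigned
demo_select_fib(const struct demo_pkt_ctx *ctx)
{
	unsigned fib = 0;

	if (ctx->from_socket)
		fib = ctx->socket_fib;
	if (ctx->classified)
		fib = ctx->classifier_fib;
	assert(fib < DEMO_NUM_FIBS);
	return (fib);
}
```

Cases 4 to 6 amount to choosing which socket, packet or process supplies the starting value, so they would fit the same shape with a different initial source.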
Routing messages would be associated with their
process, and thus select one FIB or another.
messages from the kernel would be associated with the fib they
refer to and would only be received by a routing socket associated
with that fib. (not yet implemented)
In addition Netstat has been edited to be able to cope with the
fact that the array is now 2 dimensional. (It looks in system
memory using libkvm (!)). Old versions of netstat see only the first FIB.
In addition two sysctls are added to give:
a) the number of FIBs compiled in (active)
b) the default FIB of the calling process.
Early testing experience:
-------------------------
Basically our (IronPort's) appliance does this functionality already
using ipfw fwd but that method has some drawbacks.
For example,
It can't fully simulate a routing table because it can't influence the
socket's choice of local address when a connect() is done.
Testing during the generating of these changes has been
remarkably smooth so far. Multiple tables have co-existed
with no notable side effects, and packets have been routed
accordingly.
ipfw has grown 2 new keywords:
setfib N ip from any to any
count ip from any to any fib N
In pf there seems to be a requirement to be able to give symbolic names to the
fibs but I do not have that capacity. I am not sure if it is required.
SCTP has interestingly enough built in support for this, called VRFs
in Cisco parlance. It will be interesting to see how that handles it
when it suddenly actually does something.
Where to next:
--------------------
After committing the ABI compatible version and MFCing it, I'd
like to proceed in a forward direction in -current. This will
result in some roto-tilling in the routing code.
Firstly: the current code's idea of having a separate tree per
protocol family, all of the same format, and pointed to by the
1 dimensional array is a bit silly. Especially when one considers that
there is code that makes assumptions about every protocol having the
same internal structures there. Some protocols don't WANT that
sort of structure. (for example the whole idea of a netmask is foreign
to appletalk). This needs to be made opaque to the external code.
My suggested first change is to add routing method pointers to the
'domain' structure, along with information pointing to the data.
Instead of having an array of pointers to uniform structures,
there would be an array pointing to the 'domain' structures
for each protocol address domain (protocol family),
and the methods thus reached would be called. The methods would have
an argument that gives FIB number, but the protocol would be free
to ignore it.
When the ABI can be changed it raises the possibility of the
addition of a fib entry into the "struct route". Currently,
the structure contains the sockaddr of the destination, and the resulting
fib entry. To make this work fully, one could add a fib number
so that given an address and a fib, one can find the third element, the
fib entry.
Interaction with the ARP layer/ LL layer would need to be
revisited as well. Qing Li has been working on this already.
This work was sponsored by Ironport Systems/Cisco
Reviewed by: several including rwatson, bz and mlair (parts each)
Obtained from: Ironport systems/Cisco
2008-05-09 23:03:00 +00:00
|
|
|
/* how do we find the inp for the new socket? */
|
2006-09-13 13:08:27 +00:00
|
|
|
if (sc != &scs)
|
|
|
|
syncache_free(sc);
|
2001-11-22 04:50:44 +00:00
|
|
|
return (1);
|
2006-06-17 17:32:38 +00:00
|
|
|
failed:
|
2006-09-13 13:08:27 +00:00
|
|
|
if (sc != NULL && sc != &scs)
|
2006-06-17 17:32:38 +00:00
|
|
|
syncache_free(sc);
|
2007-05-18 21:13:01 +00:00
|
|
|
if (s != NULL)
|
|
|
|
free(s, M_TCPLOG);
|
2007-04-20 13:51:34 +00:00
|
|
|
*lsop = NULL;
|
2006-06-17 17:32:38 +00:00
|
|
|
return (0);
|
2001-11-22 04:50:44 +00:00
|
|
|
}
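The segment checks performed by syncache_expand() above (ACK equal to ISS + 1, SEQ inside the receive window, and the timestamp rules) can be condensed into a standalone sketch. The struct and function names here are hypothetical, the comparison macros are written the way wraparound-safe sequence comparisons are commonly defined rather than copied from the kernel headers, and locking, logging and statistics are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Wraparound-safe sequence comparisons (assumed definitions for this sketch). */
#define DEMO_SEQ_LEQ(a, b)	((int32_t)((a) - (b)) <= 0)
#define DEMO_SEQ_GT(a, b)	((int32_t)((a) - (b)) > 0)

/* Hypothetical reduction of a syncache entry to the validated fields. */
struct demo_syncache {
	uint32_t iss;		/* our initial send sequence number */
	uint32_t irs;		/* peer's initial sequence number */
	uint32_t wnd;		/* advertised receive window */
	uint32_t ts;		/* timestamp we sent in the SYN|ACK */
	bool ts_negotiated;	/* timestamps agreed during the handshake */
};

/* Hypothetical reduction of the incoming ACK segment. */
struct demo_segment {
	uint32_t seq;
	uint32_t ack;
	bool has_ts;		/* timestamp option present */
	uint32_t tsecr;		/* echoed timestamp value */
};

/* Returns true if the final ACK of the handshake passes all checks. */
bool
demo_final_ack_ok(const struct demo_syncache *sc, const struct demo_segment *seg)
{
	/* ACK must match our initial sequence number + 1. */
	if (seg->ack != sc->iss + 1)
		return (false);
	/* SEQ must lie in (irs, irs + wnd], modulo 2^32. */
	if (DEMO_SEQ_LEQ(seg->seq, sc->irs) ||
	    DEMO_SEQ_GT(seg->seq, sc->irs + sc->wnd))
		return (false);
	/* A timestamp must not appear if it was never negotiated. */
	if (!sc->ts_negotiated && seg->has_ts)
		return (false);
	/* A non-zero echoed timestamp must match what we sent. */
	if (seg->has_ts && seg->tsecr != 0 && seg->tsecr != sc->ts)
		return (false);
	return (true);
}
```

The signed cast is what keeps the window test correct when irs + wnd wraps past 2^32. As in the code above, a missing timestamp on a timestamp-enabled entry is tolerated, and so is an echoed value of 0.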
|
|
|
|
|
2015-12-24 19:09:48 +00:00
|
|
|
#ifdef TCP_RFC7413
|
|
|
|
static void
|
|
|
|
syncache_tfo_expand(struct syncache *sc, struct socket **lsop, struct mbuf *m,
|
|
|
|
uint64_t response_cookie)
|
|
|
|
{
|
|
|
|
struct inpcb *inp;
|
|
|
|
struct tcpcb *tp;
|
|
|
|
unsigned int *pending_counter;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Global TCP locks are held because we manipulate the PCB lists
|
|
|
|
* and create a new socket.
|
|
|
|
*/
|
|
|
|
INP_INFO_RLOCK_ASSERT(&V_tcbinfo);
|
|
|
|
|
|
|
|
pending_counter = intotcpcb(sotoinpcb(*lsop))->t_tfo_pending;
|
|
|
|
*lsop = syncache_socket(sc, *lsop, m);
|
|
|
|
if (*lsop == NULL) {
|
|
|
|
TCPSTAT_INC(tcps_sc_aborted);
|
|
|
|
atomic_subtract_int(pending_counter, 1);
|
|
|
|
} else {
|
|
|
|
inp = sotoinpcb(*lsop);
|
|
|
|
tp = intotcpcb(inp);
|
|
|
|
tp->t_flags |= TF_FASTOPEN;
|
|
|
|
tp->t_tfo_cookie = response_cookie;
|
|
|
|
tp->snd_max = tp->iss;
|
|
|
|
tp->snd_nxt = tp->iss;
|
|
|
|
tp->t_tfo_pending = pending_counter;
|
|
|
|
TCPSTAT_INC(tcps_sc_completed);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
#endif /* TCP_RFC7413 */
|
|
|
|
|
2001-11-22 04:50:44 +00:00
|
|
|
/*
|
|
|
|
* Given a LISTEN socket and an inbound SYN request, add
|
|
|
|
* this to the syn cache, and send back a segment:
|
|
|
|
* <SEQ=ISS><ACK=RCV_NXT><CTL=SYN,ACK>
|
|
|
|
* to the source.
|
|
|
|
*
|
|
|
|
* IMPORTANT NOTE: We do _NOT_ ACK data that might accompany the SYN.
|
|
|
|
* Doing so would require that we hold onto the data and deliver it
|
|
|
|
* to the application. However, if we are the target of a SYN-flood
|
|
|
|
* DoS attack, an attacker could send data which would eventually
|
|
|
|
* consume all available buffer space if it were ACKed. By not ACKing
|
|
|
|
* the data, we avoid this DoS scenario.
|
2015-12-24 19:09:48 +00:00
|
|
|
*
|
|
|
|
* The exception to the above is when a SYN with a valid TCP Fast Open (TFO)
|
2016-10-15 01:41:28 +00:00
|
|
|
* cookie is processed and a new socket is created. In this case, any data
|
|
|
|
* accompanying the SYN will be queued to the socket by tcp_input() and will
|
|
|
|
* be ACKed either when the application sends response data or the delayed
|
|
|
|
* ACK timer expires, whichever comes first.
|
2001-11-22 04:50:44 +00:00
|
|
|
*/
|
2015-12-24 19:09:48 +00:00
|
|
|
int
|
2012-06-19 07:34:13 +00:00
|
|
|
syncache_add(struct in_conninfo *inc, struct tcpopt *to, struct tcphdr *th,
|
|
|
|
struct inpcb *inp, struct socket **lsop, struct mbuf *m, void *tod,
|
|
|
|
void *todctx)
|
2001-11-22 04:50:44 +00:00
|
|
|
{
|
|
|
|
struct tcpcb *tp;
|
|
|
|
struct socket *so;
|
|
|
|
struct syncache *sc = NULL;
|
|
|
|
struct syncache_head *sch;
|
|
|
|
struct mbuf *ipopts = NULL;
|
2011-04-25 17:13:40 +00:00
|
|
|
u_int ltflags;
|
2016-12-21 22:47:10 +00:00
|
|
|
int win, ip_ttl, ip_tos;
|
2007-07-28 12:02:05 +00:00
|
|
|
char *s;
|
2015-12-24 19:09:48 +00:00
|
|
|
int rv = 0;
|
2006-06-17 18:42:07 +00:00
|
|
|
#ifdef INET6
|
|
|
|
int autoflowlabel = 0;
|
2006-12-13 06:00:57 +00:00
|
|
|
#endif
|
|
|
|
#ifdef MAC
|
|
|
|
struct label *maclabel;
|
2006-06-17 18:42:07 +00:00
|
|
|
#endif
|
2006-09-13 13:08:27 +00:00
|
|
|
struct syncache scs;
|
2008-08-23 14:22:12 +00:00
|
|
|
struct ucred *cred;
|
2015-12-24 19:09:48 +00:00
|
|
|
#ifdef TCP_RFC7413
|
|
|
|
uint64_t tfo_response_cookie;
|
2016-10-15 01:41:28 +00:00
|
|
|
unsigned int *tfo_pending = NULL;
|
2015-12-24 19:09:48 +00:00
|
|
|
int tfo_cookie_valid = 0;
|
|
|
|
int tfo_response_cookie_valid = 0;
|
|
|
|
#endif
|
2003-11-11 17:54:47 +00:00
|
|
|
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WLOCK_ASSERT(inp); /* listen socket */
|
2007-07-28 20:13:40 +00:00
|
|
|
KASSERT((th->th_flags & (TH_RST|TH_ACK|TH_SYN)) == TH_SYN,
|
2007-07-28 12:02:05 +00:00
|
|
|
("%s: unexpected tcp flags", __func__));
|
2001-11-22 04:50:44 +00:00
|
|
|
|
2006-06-17 17:32:38 +00:00
|
|
|
/*
|
|
|
|
* Combine all so/tp operations very early to drop the INP lock as
|
|
|
|
* soon as possible.
|
|
|
|
*/
|
|
|
|
so = *lsop;
|
Listening sockets improvements.
o Separate fields of struct socket that belong to listening from
fields that belong to normal dataflow, and unionize them. This
shrinks the structure a bit.
- Take out selinfo's from the socket buffers into the socket. The
first reason is to support braindamaged scenario when a socket is
added to kevent(2) and then listen(2) is cast on it. The second
reason is that there is future plan to make socket buffers pluggable,
so that for a dataflow socket a socket buffer can be changed, and
in this case we also want to keep same selinfos through the lifetime
of a socket.
- Remove struct so_accf. Since now listening stuff no longer
affects struct socket size, just move its fields into listening part
of the union.
- Provide sol_upcall field and enforce that so_upcall_set() may be called
only on a dataflow socket, which has buffers, and for listening sockets
provide solisten_upcall_set().
o Remove ACCEPT_LOCK() global.
- Add a mutex to socket, to be used instead of socket buffer lock to lock
fields of struct socket that don't belong to a socket buffer.
- Allow to acquire two socket locks, but the first one must belong to a
listening socket.
- Make soref()/sorele() to use atomic(9). This allows in some situations
to do soref() without owning socket lock. There is place for improvement
here, it is possible to make sorele() also to lock optionally.
- Most protocols aren't touched by this change, except UNIX local sockets.
See below for more information.
o Reduce copy-and-paste in kernel modules that accept connections from
listening sockets: provide function solisten_dequeue(), and use it in
the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4),
infiniband, rpc.
o UNIX local sockets.
- Removal of ACCEPT_LOCK() global uncovered several races in the UNIX
local sockets. Most races exist around spawning a new socket, when we
are connecting to a local listening socket. To cover them, we need to
hold locks on both PCBs when spawning a third one. This means holding
them across sonewconn(). This creates a LOR between pcb locks and
unp_list_lock.
- To fix the new LOR, abandon the global unp_list_lock in favor of global
unp_link_lock. Indeed, separating these two locks didn't provide us any
extra parallelism in the UNIX sockets.
- Now call into uipc_attach() may happen with unp_link_lock held, if we
are accepting, or without unp_link_lock in case we are just creating
a socket.
- Another problem in UNIX sockets is that uipc_close() basically did nothing
for a listening socket. The vnode remained opened for connections. This
is fixed by removing vnode in uipc_close(). Maybe the right way would be
to do it for all sockets (not only listening), simply move the vnode
teardown from uipc_detach() to uipc_close()?
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D9770
2017-06-08 21:30:34 +00:00
|
|
|
KASSERT(SOLISTENING(so), ("%s: %p not listening", __func__, so));
|
2001-11-22 04:50:44 +00:00
|
|
|
tp = sototcpcb(so);
|
2008-08-23 14:22:12 +00:00
|
|
|
cred = crhold(so->so_cred);
|
2001-11-22 04:50:44 +00:00
|
|
|
|
2006-06-17 17:32:38 +00:00
|
|
|
#ifdef INET6
|
2008-12-17 12:52:34 +00:00
|
|
|
if ((inc->inc_flags & INC_ISIPV6) &&
|
Another step assimilating IPv[46] PCB code - directly use
the inpcb names rather than the following IPv6 compat macros:
in6pcb,in6p_sp, in6p_ip6_nxt,in6p_flowinfo,in6p_vflag,
in6p_flags,in6p_socket,in6p_lport,in6p_fport,in6p_ppcb and
sotoin6pcb().
Apart from removing duplicate code in netipsec, this is a pure
whitespace, not a functional change.
Discussed with: rwatson
Reviewed by: rwatson (version before review requested changes)
MFC after: 4 weeks (set the timer and see then)
2008-12-15 21:50:54 +00:00
|
|
|
(inp->inp_flags & IN6P_AUTOFLOWLABEL))
|
2006-06-17 17:32:38 +00:00
|
|
|
autoflowlabel = 1;
|
|
|
|
#endif
|
|
|
|
ip_ttl = inp->inp_ip_ttl;
|
|
|
|
ip_tos = inp->inp_ip_tos;
|
2017-06-08 21:30:34 +00:00
|
|
|
win = so->sol_sbrcv_hiwat;
|
2011-04-25 17:13:40 +00:00
|
|
|
ltflags = (tp->t_flags & (TF_NOOPT | TF_SIGNATURE));
|
2006-06-17 17:32:38 +00:00
|
|
|
|
2015-12-24 19:09:48 +00:00
|
|
|
#ifdef TCP_RFC7413
|
2016-10-12 19:06:50 +00:00
|
|
|
if (V_tcp_fastopen_enabled && IS_FASTOPEN(tp->t_flags) &&
|
2015-12-24 19:09:48 +00:00
|
|
|
(tp->t_tfo_pending != NULL) && (to->to_flags & TOF_FASTOPEN)) {
|
|
|
|
/*
|
|
|
|
* Limit the number of pending TFO connections to
|
|
|
|
* approximately half of the queue limit. This prevents TFO
|
|
|
|
* SYN floods from starving the service by filling the
|
|
|
|
* listen queue with bogus TFO connections.
|
|
|
|
*/
|
|
|
|
if (atomic_fetchadd_int(tp->t_tfo_pending, 1) <=
|
Listening sockets improvements.
o Separate fields of struct socket that belong to listening from
fields that belong to normal dataflow, and unionize them. This
shrinks the structure a bit.
- Take out selinfo's from the socket buffers into the socket. The
first reason is to support braindamaged scenario when a socket is
added to kevent(2) and then listen(2) is cast on it. The second
reason is that there is future plan to make socket buffers pluggable,
so that for a dataflow socket a socket buffer can be changed, and
in this case we also want to keep same selinfos through the lifetime
of a socket.
- Remove struct struct so_accf. Since now listening stuff no longer
affects struct socket size, just move its fields into listening part
of the union.
- Provide sol_upcall field and enforce that so_upcall_set() may be called
only on a dataflow socket, which has buffers, and for listening sockets
provide solisten_upcall_set().
o Remove ACCEPT_LOCK() global.
- Add a mutex to socket, to be used instead of socket buffer lock to lock
fields of struct socket that don't belong to a socket buffer.
- Allow acquiring two socket locks at once, but the first one must belong
to a listening socket.
- Make soref()/sorele() use atomic(9). This allows soref() to be called
in some situations without owning the socket lock. There is room for
improvement here: sorele() could also be made to lock optionally.
- Most protocols aren't touched by this change, except UNIX local sockets.
See below for more information.
o Reduce copy-and-paste in kernel modules that accept connections from
listening sockets: provide function solisten_dequeue(), and use it in
the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4),
infiniband, rpc.
o UNIX local sockets.
- Removal of ACCEPT_LOCK() global uncovered several races in the UNIX
local sockets. Most races exist around spawning a new socket, when we
are connecting to a local listening socket. To cover them, we need to
hold locks on both PCBs when spawning a third one. This means holding
them across sonewconn(). This creates a LOR between pcb locks and
unp_list_lock.
- To fix the new LOR, abandon the global unp_list_lock in favor of the global
unp_link_lock. Indeed, separating these two locks didn't give us any
extra parallelism in the UNIX sockets.
- Now a call into uipc_attach() may happen with unp_link_lock held if we
are accepting, or without unp_link_lock if we are just creating
a socket.
- Another problem in UNIX sockets is that uipc_close() basically did nothing
for a listening socket. The vnode remained open for connections. This
is fixed by removing the vnode in uipc_close(). Maybe the right way would be
to do it for all sockets (not only listening ones) and simply move the vnode
teardown from uipc_detach() to uipc_close()?
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D9770
2017-06-08 21:30:34 +00:00
|
|
|
(so->sol_qlimit / 2)) {
|
2015-12-24 19:09:48 +00:00
|
|
|
int result;
|
|
|
|
|
|
|
|
result = tcp_fastopen_check_cookie(inc,
|
|
|
|
to->to_tfo_cookie, to->to_tfo_len,
|
|
|
|
&tfo_response_cookie);
|
|
|
|
tfo_cookie_valid = (result > 0);
|
|
|
|
tfo_response_cookie_valid = (result >= 0);
|
2016-10-15 01:41:28 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Remember the TFO pending counter as it will have to be
|
|
|
|
* decremented below if we don't make it to syncache_tfo_expand().
|
|
|
|
*/
|
|
|
|
tfo_pending = tp->t_tfo_pending;
|
2015-12-24 19:09:48 +00:00
|
|
|
}
|
|
|
|
#endif
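The TFO admission check in the block above can be sketched in portable C11. This is a minimal illustration under stated assumptions, not the kernel code: `toy_tfo_admit()` and `toy_tfo_release()` are hypothetical names, and `atomic_fetch_add()` stands in for the kernel's `atomic_fetchadd_int()`. As in the original, the counter is incremented unconditionally, so the caller has to decrement it later if the SYN does not become a connection.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for the pending-TFO check. The counter is bumped
 * first; the value seen before the bump is compared against half of the
 * listen queue limit, mirroring the "<= (qlimit / 2)" test above.
 */
static bool
toy_tfo_admit(atomic_uint *pending, unsigned int qlimit)
{
	unsigned int before = atomic_fetch_add(pending, 1);

	return (before <= qlimit / 2);
}

/* Undo one increment when the SYN did not become a connection. */
static void
toy_tfo_release(atomic_uint *pending)
{
	atomic_fetch_sub(pending, 1);
}
```

With a queue limit of 4, the first three calls are admitted and the fourth is refused, yet all four leave the counter incremented until released.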
|
|
|
|
|
2008-08-23 12:27:18 +00:00
|
|
|
/* By the time we drop the lock these should no longer be used. */
|
2006-06-17 17:32:38 +00:00
|
|
|
so = NULL;
|
|
|
|
tp = NULL;
|
|
|
|
|
2006-12-13 06:00:57 +00:00
|
|
|
#ifdef MAC
|
2007-10-25 14:37:37 +00:00
|
|
|
if (mac_syncache_init(&maclabel) != 0) {
|
2008-04-17 21:38:18 +00:00
|
|
|
INP_WUNLOCK(inp);
|
2007-04-20 13:30:08 +00:00
|
|
|
goto done;
|
2006-12-13 06:00:57 +00:00
|
|
|
} else
|
2007-10-25 14:37:37 +00:00
|
|
|
mac_syncache_create(maclabel, inp);
|
2006-12-13 06:00:57 +00:00
|
|
|
#endif
|
2015-12-24 19:09:48 +00:00
|
|
|
#ifdef TCP_RFC7413
|
|
|
|
if (!tfo_cookie_valid)
|
|
|
|
#endif
|
|
|
|
INP_WUNLOCK(inp);
|
2006-06-17 17:32:38 +00:00
|
|
|
|
2001-11-22 04:50:44 +00:00
|
|
|
/*
|
|
|
|
* Remember the IP options, if any.
|
|
|
|
*/
|
|
|
|
#ifdef INET6
|
2008-12-17 12:52:34 +00:00
|
|
|
if (!(inc->inc_flags & INC_ISIPV6))
|
2001-11-22 04:50:44 +00:00
|
|
|
#endif
|
2011-04-30 11:21:29 +00:00
|
|
|
#ifdef INET
|
2007-12-17 07:56:27 +00:00
|
|
|
ipopts = (m) ? ip_srcroute(m) : NULL;
|
2011-04-30 11:21:29 +00:00
|
|
|
#else
|
|
|
|
ipopts = NULL;
|
|
|
|
#endif
|
2001-11-22 04:50:44 +00:00
|
|
|
|
2017-02-06 08:49:57 +00:00
|
|
|
#if defined(IPSEC_SUPPORT) || defined(TCP_SIGNATURE)
|
|
|
|
/*
|
|
|
|
* If listening socket requested TCP digests, check that received
|
|
|
|
* SYN has signature and it is correct. If signature doesn't match
|
|
|
|
* or TCP_SIGNATURE support isn't enabled, drop the packet.
|
|
|
|
*/
|
|
|
|
if (ltflags & TF_SIGNATURE) {
|
|
|
|
if ((to->to_flags & TOF_SIGNATURE) == 0) {
|
|
|
|
TCPSTAT_INC(tcps_sig_err_nosigopt);
|
|
|
|
goto done;
|
|
|
|
}
|
|
|
|
if (!TCPMD5_ENABLED() ||
|
|
|
|
TCPMD5_INPUT(m, th, to->to_signature) != 0)
|
|
|
|
goto done;
|
|
|
|
}
|
|
|
|
#endif /* IPSEC_SUPPORT || TCP_SIGNATURE */
|
2001-11-22 04:50:44 +00:00
|
|
|
/*
|
|
|
|
* See if we already have an entry for this connection.
|
|
|
|
* If we do, resend the SYN,ACK, and reset the retransmit timer.
|
|
|
|
*
|
2006-06-17 17:49:11 +00:00
|
|
|
* XXX: should the syncache be re-initialized with the contents
|
2001-11-22 04:50:44 +00:00
|
|
|
* of the new SYN here (which may have different options?)
|
2007-07-28 12:02:05 +00:00
|
|
|
*
|
|
|
|
* XXX: We do not check the sequence number to see if this is a
|
|
|
|
* real retransmit or a new connection attempt. The question is
|
|
|
|
* how to handle such a case; either ignore it as spoofed, or
|
|
|
|
* drop the current entry and create a new one?
|
2001-11-22 04:50:44 +00:00
|
|
|
*/
|
2006-06-17 17:32:38 +00:00
|
|
|
sc = syncache_lookup(inc, &sch); /* returns locked entry */
|
|
|
|
SCH_LOCK_ASSERT(sch);
|
2001-11-22 04:50:44 +00:00
|
|
|
if (sc != NULL) {
|
2015-12-24 19:09:48 +00:00
|
|
|
#ifdef TCP_RFC7413
|
|
|
|
if (tfo_cookie_valid)
|
|
|
|
INP_WUNLOCK(inp);
|
|
|
|
#endif
|
2009-04-11 22:07:19 +00:00
|
|
|
TCPSTAT_INC(tcps_sc_dupsyn);
|
2001-11-22 04:50:44 +00:00
|
|
|
if (ipopts) {
|
|
|
|
/*
|
|
|
|
* If we were remembering a previous source route,
|
|
|
|
* forget it and use the new one we've been given.
|
|
|
|
*/
|
|
|
|
if (sc->sc_ipopts)
|
|
|
|
(void) m_free(sc->sc_ipopts);
|
|
|
|
sc->sc_ipopts = ipopts;
|
|
|
|
}
|
|
|
|
/*
|
|
|
|
* Update timestamp if present.
|
|
|
|
*/
|
2007-04-20 13:36:48 +00:00
|
|
|
if ((sc->sc_flags & SCF_TIMESTAMP) && (to->to_flags & TOF_TS))
|
2006-06-26 16:14:19 +00:00
|
|
|
sc->sc_tsreflect = to->to_tsval;
|
2007-04-20 13:36:48 +00:00
|
|
|
else
|
|
|
|
sc->sc_flags &= ~SCF_TIMESTAMP;
|
2006-12-13 06:00:57 +00:00
|
|
|
#ifdef MAC
|
|
|
|
/*
|
|
|
|
* Since we have already unconditionally allocated label
|
|
|
|
* storage, free it up. The syncache entry will already
|
|
|
|
* have an initialized label we can use.
|
|
|
|
*/
|
2007-10-25 14:37:37 +00:00
|
|
|
mac_syncache_destroy(&maclabel);
|
2006-12-13 06:00:57 +00:00
|
|
|
#endif
|
2007-07-28 12:02:05 +00:00
|
|
|
/* Retransmit SYN|ACK and reset retransmit count. */
|
|
|
|
if ((s = tcp_log_addrs(&sc->sc_inc, th, NULL, NULL))) {
|
2007-07-29 20:13:22 +00:00
|
|
|
log(LOG_DEBUG, "%s; %s: Received duplicate SYN, "
|
2007-07-28 12:02:05 +00:00
|
|
|
"resetting timer and retransmitting SYN|ACK\n",
|
|
|
|
s, __func__);
|
|
|
|
free(s, M_TCPLOG);
|
|
|
|
}
|
2016-04-29 07:23:08 +00:00
|
|
|
if (syncache_respond(sc, sch, 1, m) == 0) {
|
2007-07-28 12:02:05 +00:00
|
|
|
sc->sc_rxmits = 0;
|
|
|
|
syncache_timeout(sc, sch, 1);
|
2009-04-11 22:07:19 +00:00
|
|
|
TCPSTAT_INC(tcps_sndacks);
|
|
|
|
TCPSTAT_INC(tcps_sndtotal);
|
2001-11-22 04:50:44 +00:00
|
|
|
}
|
2006-06-17 17:32:38 +00:00
|
|
|
SCH_UNLOCK(sch);
|
|
|
|
goto done;
|
2001-11-22 04:50:44 +00:00
|
|
|
}
|
|
|
|
|
2015-12-24 19:09:48 +00:00
|
|
|
#ifdef TCP_RFC7413
|
|
|
|
if (tfo_cookie_valid) {
|
|
|
|
bzero(&scs, sizeof(scs));
|
|
|
|
sc = &scs;
|
|
|
|
goto skip_alloc;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).
This is the first in a series of commits over the course
of the next few weeks.
Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.
We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.
Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch
2008-08-17 23:27:27 +00:00
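The V_ convention described in this commit message can be illustrated with a small user-space sketch. The variable and function names below are hypothetical; the only point is that, at this stage, each `V_` name is a macro alias for the existing global, so behavior does not change.

```c
/* Hypothetical global tunable, standing in for a network-stack variable. */
static int toy_syncookies = 1;

/*
 * Step-1 style mapping: the V_ name expands to the plain global, so code
 * can be converted to the V_ spelling without changing what it does.
 * A later step could redirect the macro to per-instance storage.
 */
#define	V_toy_syncookies	toy_syncookies

static int
toy_syncookies_enabled(void)
{
	return (V_toy_syncookies != 0);
}
```

Because the macro is a pure alias, a write through `V_toy_syncookies` is a write to `toy_syncookies`.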
|
|
|
sc = uma_zalloc(V_tcp_syncache.zone, M_NOWAIT | M_ZERO);
|
2001-11-22 04:50:44 +00:00
|
|
|
if (sc == NULL) {
|
|
|
|
/*
|
|
|
|
* The zone allocator couldn't provide more entries.
|
2004-08-16 18:32:07 +00:00
|
|
|
* Treat this as if the cache was full; drop the oldest
|
2001-11-22 04:50:44 +00:00
|
|
|
* entry and insert the new one.
|
|
|
|
*/
|
2009-04-11 22:07:19 +00:00
|
|
|
TCPSTAT_INC(tcps_sc_zonefail);
|
2017-04-20 19:19:33 +00:00
|
|
|
if ((sc = TAILQ_LAST(&sch->sch_bucket, sch_head)) != NULL) {
|
|
|
|
sch->sch_last_overflow = time_uptime;
|
2007-04-17 15:25:14 +00:00
|
|
|
syncache_drop(sc, sch);
|
2017-04-20 19:19:33 +00:00
|
|
|
}
|
2008-08-17 23:27:27 +00:00
|
|
|
sc = uma_zalloc(V_tcp_syncache.zone, M_NOWAIT | M_ZERO);
|
2001-11-22 04:50:44 +00:00
|
|
|
if (sc == NULL) {
|
2008-11-26 22:32:07 +00:00
|
|
|
if (V_tcp_syncookies) {
|
2006-09-13 13:08:27 +00:00
|
|
|
bzero(&scs, sizeof(scs));
|
|
|
|
sc = &scs;
|
|
|
|
} else {
|
|
|
|
SCH_UNLOCK(sch);
|
|
|
|
if (ipopts)
|
|
|
|
(void) m_free(ipopts);
|
|
|
|
goto done;
|
|
|
|
}
|
2001-11-22 04:50:44 +00:00
|
|
|
}
|
2006-09-13 13:08:27 +00:00
|
|
|
}
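The allocation fallback above (allocate, evict the oldest bucket entry on failure, retry, then fall back to a cookie-only reply or drop) can be modeled as a small decision function. This is a sketch under simplifying assumptions: the zone is reduced to a free-slot counter, the bucket to an entry count, and every `toy_` name is hypothetical.

```c
#include <stdbool.h>

enum toy_outcome {
	TOY_ALLOCATED,		/* first allocation succeeded */
	TOY_EVICTED,		/* succeeded after dropping the oldest entry */
	TOY_COOKIE_ONLY,	/* no entry; answer from a stack-local one */
	TOY_DROPPED		/* no entry and no syncookies: give up */
};

struct toy_zone { int free_slots; };	/* stand-in for the UMA zone */
struct toy_bucket { int entries; };	/* stand-in for the hash bucket */

static bool
toy_zalloc(struct toy_zone *z)
{
	if (z->free_slots == 0)
		return (false);
	z->free_slots--;
	return (true);
}

static enum toy_outcome
toy_get_entry(struct toy_zone *z, struct toy_bucket *b, bool syncookies)
{
	if (toy_zalloc(z))
		return (TOY_ALLOCATED);
	/* Treat zone exhaustion like a full cache: drop the oldest entry. */
	if (b->entries > 0) {
		b->entries--;
		z->free_slots++;
	}
	/* Retry whether or not an entry was evicted, as the code above does. */
	if (toy_zalloc(z))
		return (TOY_EVICTED);
	return (syncookies ? TOY_COOKIE_ONLY : TOY_DROPPED);
}
```

The model leaves out locking and the later insertion of the new entry into the bucket; it only shows the order in which the fallbacks are tried.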
|
2015-12-24 19:09:48 +00:00
|
|
|
|
|
|
|
#ifdef TCP_RFC7413
|
|
|
|
skip_alloc:
|
|
|
|
if (!tfo_cookie_valid && tfo_response_cookie_valid)
|
|
|
|
sc->sc_tfo_cookie = &tfo_response_cookie;
|
|
|
|
#endif
|
|
|
|
|
2001-11-22 04:50:44 +00:00
|
|
|
/*
|
|
|
|
* Fill in the syncache values.
|
|
|
|
*/
|
2006-12-13 06:00:57 +00:00
|
|
|
#ifdef MAC
|
|
|
|
sc->sc_label = maclabel;
|
|
|
|
#endif
|
2008-08-23 14:22:12 +00:00
|
|
|
sc->sc_cred = cred;
|
|
|
|
cred = NULL;
|
2001-11-22 04:50:44 +00:00
|
|
|
sc->sc_ipopts = ipopts;
|
2006-06-26 16:14:19 +00:00
|
|
|
bcopy(inc, &sc->sc_inc, sizeof(struct in_conninfo));
|
2001-11-22 04:50:44 +00:00
|
|
|
#ifdef INET6
|
2008-12-17 12:52:34 +00:00
|
|
|
if (!(inc->inc_flags & INC_ISIPV6))
|
2001-11-22 04:50:44 +00:00
|
|
|
#endif
|
|
|
|
{
|
2006-06-17 17:32:38 +00:00
|
|
|
sc->sc_ip_tos = ip_tos;
|
|
|
|
sc->sc_ip_ttl = ip_ttl;
|
2001-11-22 04:50:44 +00:00
|
|
|
}
|
2012-06-19 07:34:13 +00:00
|
|
|
#ifdef TCP_OFFLOAD
|
|
|
|
sc->sc_tod = tod;
|
|
|
|
sc->sc_todctx = todctx;
|
2007-12-12 20:35:59 +00:00
|
|
|
#endif
|
2001-11-22 04:50:44 +00:00
|
|
|
sc->sc_irs = th->th_seq;
|
2006-09-13 13:08:27 +00:00
|
|
|
sc->sc_iss = arc4random();
|
2003-01-29 03:49:49 +00:00
|
|
|
sc->sc_flags = 0;
|
2004-07-17 19:44:13 +00:00
|
|
|
sc->sc_flowlabel = 0;
|
2001-11-22 04:50:44 +00:00
|
|
|
|
2006-06-17 17:32:38 +00:00
|
|
|
/*
|
|
|
|
* Initial receive window: clip sbspace to [0 .. TCP_MAXWIN].
|
|
|
|
* win was derived from socket earlier in the function.
|
|
|
|
*/
|
2001-11-22 04:50:44 +00:00
|
|
|
win = imax(win, 0);
|
|
|
|
win = imin(win, TCP_MAXWIN);
|
|
|
|
sc->sc_wnd = win;
|
|
|
|
|
2008-08-17 23:27:27 +00:00
|
|
|
if (V_tcp_do_rfc1323) {
|
2001-11-22 04:50:44 +00:00
|
|
|
/*
|
|
|
|
* A timestamp received in a SYN makes
|
|
|
|
* it ok to send timestamp requests and replies.
|
|
|
|
*/
|
|
|
|
if (to->to_flags & TOF_TS) {
|
2006-06-26 16:14:19 +00:00
|
|
|
sc->sc_tsreflect = to->to_tsval;
|
2012-02-15 16:09:56 +00:00
|
|
|
sc->sc_ts = tcp_ts_getticks();
|
2001-11-22 04:50:44 +00:00
|
|
|
sc->sc_flags |= SCF_TIMESTAMP;
|
|
|
|
}
|
|
|
|
if (to->to_flags & TOF_SCALE) {
|
|
|
|
int wscale = 0;
|
|
|
|
|
2007-02-01 17:39:18 +00:00
|
|
|
/*
|
2007-10-19 08:53:14 +00:00
|
|
|
* Pick the smallest possible scaling factor that
|
|
|
|
* will still allow us to scale up to sb_max, aka
|
|
|
|
* kern.ipc.maxsockbuf.
|
|
|
|
*
|
|
|
|
* We do this because there are broken firewalls that
|
|
|
|
* will corrupt the window scale option, leading to
|
|
|
|
* the other endpoint believing that our advertised
|
|
|
|
* window is unscaled. At scale factors larger than
|
|
|
|
* 5 the unscaled window will drop below 1500 bytes,
|
|
|
|
* leading to serious problems when traversing these
|
|
|
|
* broken firewalls.
|
|
|
|
*
|
|
|
|
* With the default maxsockbuf of 256K, a scale factor
|
|
|
|
* of 3 will be chosen by this algorithm. Those who
|
|
|
|
* choose a larger maxsockbuf should watch out
|
2016-05-03 18:05:43 +00:00
|
|
|
* for the compatibility problems mentioned above.
|
2007-03-15 15:59:28 +00:00
|
|
|
*
|
|
|
|
* RFC1323: The Window field in a SYN (i.e., a <SYN>
|
|
|
|
* or <SYN,ACK>) segment itself is never scaled.
|
2007-02-01 17:39:18 +00:00
|
|
|
*/
|
2001-11-22 04:50:44 +00:00
|
|
|
while (wscale < TCP_MAX_WINSHIFT &&
|
2007-10-19 08:53:14 +00:00
|
|
|
(TCP_MAXWIN << wscale) < sb_max)
|
2001-11-22 04:50:44 +00:00
|
|
|
wscale++;
|
2006-06-26 16:14:19 +00:00
|
|
|
sc->sc_requested_r_scale = wscale;
|
2007-03-15 15:59:28 +00:00
|
|
|
sc->sc_requested_s_scale = to->to_wscale;
|
2001-11-22 04:50:44 +00:00
|
|
|
sc->sc_flags |= SCF_WINSCALE;
|
|
|
|
}
|
|
|
|
}
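The window scale selection loop above can be reproduced in user space to check the arithmetic in its comment. This is a sketch: `toy_pick_wscale()` is a hypothetical name, and the constants are assumed to match `TCP_MAXWIN` (65535) and `TCP_MAX_WINSHIFT` (14).

```c
#include <stdint.h>

#define	TOY_MAXWIN		65535u	/* largest unscaled window (assumed) */
#define	TOY_MAX_WINSHIFT	14	/* largest permitted shift (assumed) */

/*
 * Pick the smallest shift such that (TOY_MAXWIN << shift) reaches sb_max,
 * capped at TOY_MAX_WINSHIFT. A 64-bit intermediate avoids overflow.
 */
static int
toy_pick_wscale(uint64_t sb_max)
{
	int wscale = 0;

	while (wscale < TOY_MAX_WINSHIFT &&
	    ((uint64_t)TOY_MAXWIN << wscale) < sb_max)
		wscale++;
	return (wscale);
}
```

For a 256K buffer limit the loop stops at 3, because 65535 << 2 is 262140 (still below 262144) while 65535 << 3 is 524280. That agrees with the scale factor of 3 quoted in the comment.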
|
2017-02-06 08:49:57 +00:00
|
|
|
#if defined(IPSEC_SUPPORT) || defined(TCP_SIGNATURE)
|
Initial import of RFC 2385 (TCP-MD5) digest support.
This is the first of two commits; bringing in the kernel support first.
This can be enabled by compiling a kernel with options TCP_SIGNATURE
and FAST_IPSEC.
For the uninitiated, this is a TCP option which provides for a means of
authenticating TCP sessions which came into being before IPSEC. It is
still relevant today, however, as it is used by many commercial router
vendors, particularly with BGP, and as such has become a requirement for
interconnect at many major Internet points of presence.
Several parts of the TCP and IP headers, including the segment payload,
are digested with MD5, including a shared secret. The PF_KEY interface
is used to manage the secrets using security associations in the SADB.
There is a limitation here in that as there is no way to map a TCP flow
per-port back to an SPI without polluting tcpcb or using the SPD; the
code to do the latter is unstable at this time. Therefore this code only
supports per-host keying granularity.
Whilst FAST_IPSEC is mutually exclusive with KAME IPSEC (and thus IPv6),
TCP_SIGNATURE applies only to IPv4. For the vast majority of prospective
users of this feature, this will not pose any problem.
This implementation is output-only; that is, the option is honoured when
responding to a host initiating a TCP session, but no effort is made
[yet] to authenticate inbound traffic. This is, however, sufficient to
interwork with Cisco equipment.
Tested with a Cisco 2501 running IOS 12.0(27), and Quagga 0.96.4 with
local patches. Patches for tcpdump to validate TCP-MD5 sessions are also
available from me upon request.
Sponsored by: sentex.net
2004-02-11 04:26:04 +00:00
|
|
|
/*
|
2017-02-06 08:49:57 +00:00
|
|
|
* If listening socket requested TCP digests, flag this in the
|
|
|
|
* syncache so that syncache_respond() will do the right thing
|
|
|
|
* with the SYN+ACK.
|
2004-02-11 04:26:04 +00:00
|
|
|
*/
|
2017-02-06 08:49:57 +00:00
|
|
|
if (ltflags & TF_SIGNATURE)
|
2005-09-14 15:06:22 +00:00
|
|
|
sc->sc_flags |= SCF_SIGNATURE;
|
2017-02-06 08:49:57 +00:00
|
|
|
#endif /* IPSEC_SUPPORT || TCP_SIGNATURE */
|
2007-12-04 07:11:13 +00:00
|
|
|
if (to->to_flags & TOF_SACKPERM)
|
2004-06-23 21:04:37 +00:00
|
|
|
sc->sc_flags |= SCF_SACK;
|
2006-09-13 13:08:27 +00:00
|
|
|
if (to->to_flags & TOF_MSS)
|
|
|
|
sc->sc_peer_mss = to->to_mss; /* peer mss may be zero */
|
2011-04-25 17:13:40 +00:00
|
|
|
if (ltflags & TF_NOOPT)
|
2006-06-18 13:03:42 +00:00
|
|
|
sc->sc_flags |= SCF_NOOPT;
|
2008-08-17 23:27:27 +00:00
|
|
|
if ((th->th_flags & (TH_ECE|TH_CWR)) && V_tcp_do_ecn)
|
2008-07-31 15:10:09 +00:00
|
|
|
sc->sc_flags |= SCF_ECN;
|
2004-06-23 21:04:37 +00:00
|
|
|
|
2013-07-11 15:29:25 +00:00
|
|
|
if (V_tcp_syncookies)
|
|
|
|
sc->sc_iss = syncookie_generate(sch, sc);
|
2006-09-13 13:08:27 +00:00
|
|
|
#ifdef INET6
|
2013-07-11 15:29:25 +00:00
|
|
|
if (autoflowlabel) {
|
|
|
|
if (V_tcp_syncookies)
|
|
|
|
sc->sc_flowlabel = sc->sc_iss;
|
|
|
|
else
|
|
|
|
sc->sc_flowlabel = ip6_randomflowlabel();
|
|
|
|
sc->sc_flowlabel = htonl(sc->sc_flowlabel) & IPV6_FLOWLABEL_MASK;
|
2006-09-13 13:08:27 +00:00
|
|
|
}
|
2013-07-11 15:29:25 +00:00
|
|
|
#endif
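The flow label choice above (reuse the ISS when syncookies are on, otherwise a random label, then mask to the label bits) can be sketched as follows. The function name is hypothetical, and the mask is written in host byte order and converted explicitly; in the kernel, `IPV6_FLOWLABEL_MASK` is assumed to be in network byte order already.

```c
#include <arpa/inet.h>
#include <stdbool.h>
#include <stdint.h>

/* 20-bit flow label mask in host byte order (assumption for this sketch). */
#define	TOY_FLOWLABEL_MASK_HOST	0x000fffffu

static uint32_t
toy_flowlabel(bool syncookies, uint32_t iss, uint32_t random_label)
{
	uint32_t label = syncookies ? iss : random_label;

	/* Convert to network order, then keep only the 20 label bits. */
	return (htonl(label) & htonl(TOY_FLOWLABEL_MASK_HOST));
}
```

The result is in network byte order, ready to be OR-ed into a header field that has had its label bits cleared.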
|
2006-09-13 13:08:27 +00:00
|
|
|
SCH_UNLOCK(sch);
|
|
|
|
|
2015-12-24 19:09:48 +00:00
|
|
|
#ifdef TCP_RFC7413
|
|
|
|
if (tfo_cookie_valid) {
|
|
|
|
syncache_tfo_expand(sc, lsop, m, tfo_response_cookie);
|
2016-10-15 01:41:28 +00:00
|
|
|
/* INP_WUNLOCK(inp) will be performed by the caller */
|
2015-12-24 19:09:48 +00:00
|
|
|
rv = 1;
|
2016-10-15 01:41:28 +00:00
|
|
|
goto tfo_expanded;
|
2015-12-24 19:09:48 +00:00
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2001-11-22 04:50:44 +00:00
|
|
|
/*
|
2004-11-02 22:22:22 +00:00
|
|
|
* Do a standard 3-way handshake.
|
2001-11-22 04:50:44 +00:00
|
|
|
*/
|
2016-04-29 07:23:08 +00:00
|
|
|
if (syncache_respond(sc, sch, 0, m) == 0) {
|
2008-11-26 22:32:07 +00:00
|
|
|
if (V_tcp_syncookies && V_tcp_syncookiesonly && sc != &scs)
|
2006-09-13 13:08:27 +00:00
|
|
|
syncache_free(sc);
|
|
|
|
else if (sc != &scs)
|
|
|
|
syncache_insert(sc, sch); /* locks and unlocks sch */
|
2009-04-11 22:07:19 +00:00
|
|
|
TCPSTAT_INC(tcps_sndacks);
|
|
|
|
TCPSTAT_INC(tcps_sndtotal);
|
2001-11-22 04:50:44 +00:00
|
|
|
} else {
|
2006-12-13 06:00:57 +00:00
|
|
|
if (sc != &scs)
|
|
|
|
syncache_free(sc);
|
2009-04-11 22:07:19 +00:00
|
|
|
TCPSTAT_INC(tcps_sc_dropped);
|
2001-11-22 04:50:44 +00:00
|
|
|
}
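What happens to the entry after syncache_respond() returns can be summarized as a decision table. The sketch below is a hypothetical model: `on_stack` corresponds to `sc == &scs`, and the enum and function names are invented for illustration.

```c
#include <stdbool.h>

enum toy_action {
	TOY_KEEP_NOTHING,	/* stack-local entry: nothing to free or insert */
	TOY_FREE_ENTRY,		/* give the allocated entry back */
	TOY_INSERT_ENTRY	/* keep the entry in the cache */
};

static enum toy_action
toy_after_respond(bool sent_ok, bool on_stack, bool cookies, bool cookies_only)
{
	if (on_stack)
		return (TOY_KEEP_NOTHING);
	if (!sent_ok)
		return (TOY_FREE_ENTRY);	/* SYN|ACK could not be sent */
	if (cookies && cookies_only)
		return (TOY_FREE_ENTRY);	/* the cookie carries the state */
	return (TOY_INSERT_ENTRY);
}
```

The statistics counters are left out; the model only captures which of free, insert, or neither is chosen.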
|
2006-06-17 17:32:38 +00:00
|
|
|
|
|
|
|
done:
|
2015-12-24 19:09:48 +00:00
|
|
|
if (m) {
|
|
|
|
*lsop = NULL;
|
|
|
|
m_freem(m);
|
|
|
|
}
|
|
|
|
#ifdef TCP_RFC7413
|
2016-10-15 01:41:28 +00:00
|
|
|
/*
|
|
|
|
* If tfo_pending is not NULL here, then a TFO SYN that did not
|
|
|
|
* result in a new socket was processed and the associated pending
|
|
|
|
* counter has not yet been decremented. All such TFO processing paths
|
|
|
|
* transit this point.
|
|
|
|
*/
|
|
|
|
if (tfo_pending != NULL)
|
|
|
|
tcp_fastopen_decrement_counter(tfo_pending);
|
|
|
|
|
|
|
|
tfo_expanded:
|
2015-12-24 19:09:48 +00:00
|
|
|
#endif
|
2008-08-23 14:22:12 +00:00
|
|
|
if (cred != NULL)
|
|
|
|
crfree(cred);
|
2007-04-20 13:30:08 +00:00
|
|
|
#ifdef MAC
|
|
|
|
if (sc == &scs)
|
2007-10-25 14:37:37 +00:00
|
|
|
mac_syncache_destroy(&maclabel);
|
2007-04-20 13:30:08 +00:00
|
|
|
#endif
|
2015-12-24 19:09:48 +00:00
|
|
|
return (rv);
|
2001-11-22 04:50:44 +00:00
|
|
|
}
|
|
|
|
|
2016-05-10 04:59:04 +00:00
|
|
|
/*
|
|
|
|
* Send SYN|ACK to the peer. Either in response to the peer's SYN,
|
|
|
|
* i.e. m0 != NULL, or upon 3WHS ACK timeout, i.e. m0 == NULL.
|
|
|
|
*/
|
2001-11-22 04:50:44 +00:00
|
|
|
static int
|
2016-04-29 07:23:08 +00:00
|
|
|
syncache_respond(struct syncache *sc, struct syncache_head *sch, int locked,
|
|
|
|
const struct mbuf *m0)
|
2001-11-22 04:50:44 +00:00
|
|
|
{
|
|
|
|
struct ip *ip = NULL;
|
2007-04-20 13:30:08 +00:00
|
|
|
struct mbuf *m;
|
2011-04-30 11:21:29 +00:00
|
|
|
struct tcphdr *th = NULL;
|
|
|
|
int optlen, error = 0; /* Make compiler happy */
|
2007-03-15 15:59:28 +00:00
|
|
|
u_int16_t hlen, tlen, mssopt;
|
|
|
|
struct tcpopt to;
|
2001-11-22 04:50:44 +00:00
|
|
|
#ifdef INET6
|
|
|
|
struct ip6_hdr *ip6 = NULL;
|
|
|
|
#endif
|
2003-11-20 20:07:39 +00:00
|
|
|
hlen =
|
2001-11-22 04:50:44 +00:00
|
|
|
#ifdef INET6
|
2008-12-17 12:52:34 +00:00
|
|
|
(sc->sc_inc.inc_flags & INC_ISIPV6) ? sizeof(struct ip6_hdr) :
|
2001-11-22 04:50:44 +00:00
|
|
|
#endif
|
2003-11-20 20:07:39 +00:00
|
|
|
sizeof(struct ip);
|
2007-03-15 15:59:28 +00:00
|
|
|
tlen = hlen + sizeof(struct tcphdr);
|
2003-11-20 20:07:39 +00:00
|
|
|
|
2006-06-17 17:49:11 +00:00
|
|
|
/* Determine the MSS we advertise to the other end of the connection. */
|
2003-11-20 20:07:39 +00:00
|
|
|
mssopt = tcp_mssopt(&sc->sc_inc);
|
2006-06-26 17:54:53 +00:00
|
|
|
if (sc->sc_peer_mss)
|
2008-08-17 23:27:27 +00:00
|
|
|
mssopt = max( min(sc->sc_peer_mss, mssopt), V_tcp_minmss);
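The MSS clamp above can be written out as a standalone helper to make the ordering of the two bounds explicit. This is a sketch with a hypothetical function name; `min_mss` plays the role of `V_tcp_minmss`, and a peer MSS of zero means the peer sent no usable MSS option.

```c
#include <stdint.h>

static uint16_t
toy_advertised_mss(uint16_t local_mss, uint16_t peer_mss, uint16_t min_mss)
{
	uint16_t mss = local_mss;

	if (peer_mss != 0) {
		/* Never advertise more than the peer said it can take... */
		if (peer_mss < mss)
			mss = peer_mss;
		/* ...but do not go below the configured floor. */
		if (mss < min_mss)
			mss = min_mss;
	}
	return (mss);
}
```

Note that the floor is applied only when the peer supplied an MSS, which matches the `if (sc->sc_peer_mss)` guard above.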
|
2001-11-22 04:50:44 +00:00
|
|
|
|
2007-03-15 15:59:28 +00:00
|
|
|
/* XXX: Assume that the entire packet will fit in a header mbuf. */
|
2007-04-20 15:08:09 +00:00
|
|
|
KASSERT(max_linkhdr + tlen + TCP_MAXOLEN <= MHLEN,
|
2007-03-15 15:59:28 +00:00
|
|
|
("syncache: mbuf too small"));
|
2001-11-22 04:50:44 +00:00
|
|
|
|
2006-06-17 17:32:38 +00:00
|
|
|
/* Create the IP+TCP header from scratch. */
|
2012-12-05 08:04:20 +00:00
|
|
|
m = m_gethdr(M_NOWAIT, MT_DATA);
|
2001-11-22 04:50:44 +00:00
|
|
|
if (m == NULL)
|
|
|
|
return (ENOBUFS);
|
2006-12-13 06:00:57 +00:00
|
|
|
#ifdef MAC
|
2007-10-25 14:37:37 +00:00
|
|
|
mac_syncache_create_mbuf(sc->sc_label, m);
|
2006-12-13 06:00:57 +00:00
|
|
|
#endif
|
2001-11-22 04:50:44 +00:00
|
|
|
m->m_data += max_linkhdr;
|
|
|
|
m->m_len = tlen;
|
|
|
|
m->m_pkthdr.len = tlen;
|
|
|
|
m->m_pkthdr.rcvif = NULL;
|
2006-06-17 17:32:38 +00:00
|
|
|
|
2001-11-22 04:50:44 +00:00
|
|
|
#ifdef INET6
|
2008-12-17 12:52:34 +00:00
|
|
|
if (sc->sc_inc.inc_flags & INC_ISIPV6) {
|
2001-11-22 04:50:44 +00:00
|
|
|
ip6 = mtod(m, struct ip6_hdr *);
|
|
|
|
ip6->ip6_vfc = IPV6_VERSION;
|
|
|
|
ip6->ip6_nxt = IPPROTO_TCP;
|
|
|
|
ip6->ip6_src = sc->sc_inc.inc6_laddr;
|
|
|
|
ip6->ip6_dst = sc->sc_inc.inc6_faddr;
|
|
|
|
ip6->ip6_plen = htons(tlen - hlen);
|
|
|
|
/* ip6_hlim is set after checksum */
|
2004-07-17 19:44:13 +00:00
|
|
|
ip6->ip6_flow &= ~IPV6_FLOWLABEL_MASK;
|
|
|
|
ip6->ip6_flow |= sc->sc_flowlabel;
|
2001-11-22 04:50:44 +00:00
|
|
|
|
|
|
|
th = (struct tcphdr *)(ip6 + 1);
|
2011-04-30 11:21:29 +00:00
|
|
|
}
|
2001-11-22 04:50:44 +00:00
|
|
|
#endif
|
2011-04-30 11:21:29 +00:00
|
|
|
#if defined(INET6) && defined(INET)
|
|
|
|
else
|
|
|
|
#endif
|
|
|
|
#ifdef INET
|
2001-11-22 04:50:44 +00:00
|
|
|
{
|
|
|
|
ip = mtod(m, struct ip *);
|
|
|
|
ip->ip_v = IPVERSION;
|
|
|
|
ip->ip_hl = sizeof(struct ip) >> 2;
|
2012-10-22 21:09:03 +00:00
|
|
|
ip->ip_len = htons(tlen);
|
2001-11-22 04:50:44 +00:00
|
|
|
ip->ip_id = 0;
|
|
|
|
ip->ip_off = 0;
|
|
|
|
ip->ip_sum = 0;
|
|
|
|
ip->ip_p = IPPROTO_TCP;
|
|
|
|
ip->ip_src = sc->sc_inc.inc_laddr;
|
|
|
|
ip->ip_dst = sc->sc_inc.inc_faddr;
|
2006-06-17 17:32:38 +00:00
|
|
|
ip->ip_ttl = sc->sc_ip_ttl;
|
|
|
|
ip->ip_tos = sc->sc_ip_tos;
|
2002-06-14 03:08:05 +00:00
|
|
|
|
|
|
|
/*
|
2002-12-20 11:24:02 +00:00
|
|
|
* See if we should do MTU discovery. Route lookups are
|
|
|
|
* expensive, so we leave the DF bit clear only if:
|
2002-08-05 22:34:15 +00:00
|
|
|
*
|
|
|
|
* 1) path_mtu_discovery is disabled
|
|
|
|
* 2) the SCF_UNREACH flag has been set
|
2002-06-14 03:08:05 +00:00
|
|
|
*/
|
2008-08-17 23:27:27 +00:00
|
|
|
if (V_path_mtu_discovery && ((sc->sc_flags & SCF_UNREACH) == 0))
|
2012-10-22 21:09:03 +00:00
|
|
|
ip->ip_off |= htons(IP_DF);
|
2001-11-22 04:50:44 +00:00
|
|
|
|
|
|
|
th = (struct tcphdr *)(ip + 1);
|
|
|
|
}
|
2011-04-30 11:21:29 +00:00
|
|
|
#endif /* INET */
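The DF decision in the IPv4 branch above reduces to two booleans. The sketch below uses a hypothetical function name and assumes the conventional value 0x4000 for the "don't fragment" flag in host byte order.

```c
#include <arpa/inet.h>
#include <stdbool.h>
#include <stdint.h>

#define	TOY_IP_DF	0x4000	/* "don't fragment" flag, host order (assumed) */

/* Return the ip_off field, in network byte order, for a fresh SYN|ACK. */
static uint16_t
toy_ip_off(bool pmtud_enabled, bool unreach_seen)
{
	uint16_t ip_off = 0;

	/* DF is set only when PMTU discovery is on and no unreach was seen. */
	if (pmtud_enabled && !unreach_seen)
		ip_off |= htons(TOY_IP_DF);
	return (ip_off);
}
```

Since `ip_off` starts at zero, the bit is never cleared explicitly; it is simply left unset in the two cases listed in the comment.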
|
2001-11-22 04:50:44 +00:00
|
|
|
th->th_sport = sc->sc_inc.inc_lport;
|
|
|
|
th->th_dport = sc->sc_inc.inc_fport;
|
|
|
|
|
|
|
|
th->th_seq = htonl(sc->sc_iss);
|
|
|
|
th->th_ack = htonl(sc->sc_irs + 1);
|
2007-03-15 15:59:28 +00:00
|
|
|
th->th_off = sizeof(struct tcphdr) >> 2;
|
2001-11-22 04:50:44 +00:00
|
|
|
th->th_x2 = 0;
|
|
|
|
th->th_flags = TH_SYN|TH_ACK;
|
|
|
|
th->th_win = htons(sc->sc_wnd);
|
|
|
|
th->th_urp = 0;
|
|
|
|
|
2008-07-31 15:10:09 +00:00
|
|
|
if (sc->sc_flags & SCF_ECN) {
|
|
|
|
th->th_flags |= TH_ECE;
|
2009-04-11 22:07:19 +00:00
|
|
|
TCPSTAT_INC(tcps_ecn_shs);
|
2008-07-31 15:10:09 +00:00
|
|
|
}
|
|
|
|
|
2001-11-22 04:50:44 +00:00
|
|
|
/* Tack on the TCP options. */
|
2007-03-15 15:59:28 +00:00
|
|
|
if ((sc->sc_flags & SCF_NOOPT) == 0) {
|
|
|
|
to.to_flags = 0;
|
2002-12-20 11:24:02 +00:00
|
|
|
|
2007-03-15 15:59:28 +00:00
|
|
|
to.to_mss = mssopt;
|
|
|
|
to.to_flags = TOF_MSS;
|
2002-12-20 11:24:02 +00:00
|
|
|
if (sc->sc_flags & SCF_WINSCALE) {
|
2007-03-15 15:59:28 +00:00
|
|
|
to.to_wscale = sc->sc_requested_r_scale;
|
|
|
|
to.to_flags |= TOF_SCALE;
|
2002-12-20 11:24:02 +00:00
|
|
|
}
|
|
|
|
if (sc->sc_flags & SCF_TIMESTAMP) {
|
2007-03-15 15:59:28 +00:00
|
|
|
/* Virgin timestamp or TCP cookie enhanced one. */
|
2007-05-18 21:42:25 +00:00
|
|
|
to.to_tsval = sc->sc_ts;
|
2007-03-15 15:59:28 +00:00
|
|
|
to.to_tsecr = sc->sc_tsreflect;
|
|
|
|
to.to_flags |= TOF_TS;
|
2002-12-20 11:24:02 +00:00
|
|
|
}
|
2007-03-15 15:59:28 +00:00
|
|
|
if (sc->sc_flags & SCF_SACK)
|
|
|
|
to.to_flags |= TOF_SACKPERM;
|
2017-02-06 08:49:57 +00:00
|
|
|
#if defined(IPSEC_SUPPORT) || defined(TCP_SIGNATURE)
|
|
|
|
if (sc->sc_flags & SCF_SIGNATURE)
|
|
|
|
to.to_flags |= TOF_SIGNATURE;
|
2007-03-15 15:59:28 +00:00
|
|
|
#endif
|
2015-12-24 19:09:48 +00:00
|
|
|
#ifdef TCP_RFC7413
|
|
|
|
if (sc->sc_tfo_cookie) {
|
|
|
|
to.to_flags |= TOF_FASTOPEN;
|
|
|
|
to.to_tfo_len = TCP_FASTOPEN_COOKIE_LEN;
|
|
|
|
to.to_tfo_cookie = sc->sc_tfo_cookie;
|
|
|
|
/* don't send cookie again when retransmitting response */
|
|
|
|
sc->sc_tfo_cookie = NULL;
|
|
|
|
}
|
|
|
|
#endif
|
2007-03-15 15:59:28 +00:00
|
|
|
optlen = tcp_addoptions(&to, (u_char *)(th + 1));
|
2004-06-23 21:04:37 +00:00
|
|
|
|
2007-03-15 15:59:28 +00:00
|
|
|
/* Adjust headers by option size. */
|
|
|
|
th->th_off = (sizeof(struct tcphdr) + optlen) >> 2;
|
|
|
|
m->m_len += optlen;
|
|
|
|
m->m_pkthdr.len += optlen;
|
|
|
|
#ifdef INET6
|
2008-12-17 12:52:34 +00:00
|
|
|
if (sc->sc_inc.inc_flags & INC_ISIPV6)
|
2007-03-15 15:59:28 +00:00
|
|
|
ip6->ip6_plen = htons(ntohs(ip6->ip6_plen) + optlen);
|
2007-03-17 06:40:09 +00:00
|
|
|
else
|
2007-03-15 15:59:28 +00:00
|
|
|
#endif
|
2012-10-22 21:09:03 +00:00
|
|
|
ip->ip_len = htons(ntohs(ip->ip_len) + optlen);
|
2017-02-06 08:49:57 +00:00
|
|
|
#if defined(IPSEC_SUPPORT) || defined(TCP_SIGNATURE)
|
|
|
|
if (sc->sc_flags & SCF_SIGNATURE) {
|
|
|
|
KASSERT(to.to_flags & TOF_SIGNATURE,
|
|
|
|
("tcp_addoptions() didn't set tcp_signature"));
|
|
|
|
|
|
|
|
/* NOTE: to.to_signature is inside of mbuf */
|
|
|
|
if (!TCPMD5_ENABLED() ||
|
|
|
|
TCPMD5_OUTPUT(m, th, to.to_signature) != 0) {
|
|
|
|
m_freem(m);
|
|
|
|
return (EACCES);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
#endif
|
2007-03-15 15:59:28 +00:00
|
|
|
} else
|
|
|
|
optlen = 0;
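The header adjustments made after tcp_addoptions() can be checked with plain arithmetic. This is a sketch with hypothetical names, assuming a 20-byte base TCP header and an option length that is already a multiple of four bytes (the data offset field counts 32-bit words).

```c
#include <stdint.h>

#define	TOY_TCPHDR_LEN	20	/* base TCP header without options */

struct toy_lengths {
	uint8_t		th_off;		/* TCP data offset, in 32-bit words */
	uint16_t	pkt_len;	/* IP header + TCP header + options */
};

static struct toy_lengths
toy_adjust(uint16_t ip_hlen, uint16_t optlen)
{
	struct toy_lengths l;

	l.th_off = (uint8_t)((TOY_TCPHDR_LEN + optlen) >> 2);
	l.pkt_len = (uint16_t)(ip_hlen + TOY_TCPHDR_LEN + optlen);
	return (l);
}
```

With a 20-byte IPv4 header and 20 bytes of options, the data offset becomes 10 words and the packet length 60 bytes; with no options they stay at 5 and 40.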
|
2001-11-22 04:50:44 +00:00
|
|
|
|
2009-07-28 19:43:27 +00:00
|
|
|
M_SETFIB(m, sc->sc_inc.inc_fibnum);
|
2012-05-25 02:23:26 +00:00
|
|
|
m->m_pkthdr.csum_data = offsetof(struct tcphdr, th_sum);
|
2016-05-10 04:59:04 +00:00
|
|
|
/*
|
|
|
|
* If we have peer's SYN and it has a flowid, then let's assign it to
|
|
|
|
* our SYN|ACK. ip6_output() and ip_output() will not assign flowid
|
|
|
|
* to SYN|ACK due to lack of inp here.
|
|
|
|
*/
|
2016-04-29 07:23:08 +00:00
|
|
|
if (m0 != NULL && M_HASHTYPE_GET(m0) != M_HASHTYPE_NONE) {
|
|
|
|
m->m_pkthdr.flowid = m0->m_pkthdr.flowid;
|
|
|
|
M_HASHTYPE_SET(m, M_HASHTYPE_GET(m0));
|
|
|
|
}
|
2001-11-22 04:50:44 +00:00
|
|
|
#ifdef INET6
|
2008-12-17 12:52:34 +00:00
|
|
|
if (sc->sc_inc.inc_flags & INC_ISIPV6) {
|
It turns out that too many drivers are not only parsing the L2/3/4
headers for TSO but also for generic checksum offloading. Ideally we
would only have one common function shared amongst all drivers, and
perhaps when updating them for IPv6 we should introduce that.
Eventually we should provide the meta information along with mbufs to
avoid (re-)parsing entirely.
To not break IPv6 (checksums and offload) and to be able to MFC the
changes without risking to hurt 3rd party drivers, duplicate the v4
framework, as other OSes have done as well.
Introduce interface capability flags for TX/RX checksum offload with
IPv6, to allow independent toggling (where possible). Add CSUM_*_IPV6
flags for UDP/TCP over IPv6, and reserve further for SCTP, and IPv6
fragmentation. Define CSUM_DELAY_DATA_IPV6 as we do for legacy IP and
add an alias for CSUM_DATA_VALID_IPV6.
This pretty much brings IPv6 handling in line with IPv4.
TSO is still handled in a different way and not via if_hwassist.
Update ifconfig to allow (un)setting of the new capability flags.
Update loopback to announce the new capabilities and if_hwassist flags.
Individual driver updates will have to follow, as will SCTP.
Reported by: gallatin, dim, ..
Reviewed by: gallatin (glanced at?)
MFC after: 3 days
X-MFC with: r235961,235959,235958
2012-05-28 09:30:13 +00:00
|
|
|
m->m_pkthdr.csum_flags = CSUM_TCP_IPV6;
|
2012-05-25 02:23:26 +00:00
|
|
|
th->th_sum = in6_cksum_pseudo(ip6, tlen + optlen - hlen,
|
|
|
|
IPPROTO_TCP, 0);
|
2003-11-20 20:07:39 +00:00
|
|
|
ip6->ip6_hlim = in6_selecthlim(NULL, NULL);
|
2013-01-25 22:16:35 +00:00
|
|
|
#ifdef TCP_OFFLOAD
|
|
|
|
if (ADDED_BY_TOE(sc)) {
|
|
|
|
struct toedev *tod = sc->sc_tod;
|
|
|
|
|
|
|
|
error = tod->tod_syncache_respond(tod, sc->sc_todctx, m);
|
|
|
|
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
#endif
|
2006-06-17 17:32:38 +00:00
|
|
|
error = ip6_output(m, NULL, NULL, 0, NULL, NULL, NULL);
|
2011-04-30 11:21:29 +00:00
|
|
|
}
|
2001-11-22 04:50:44 +00:00
|
|
|
#endif
|
2011-04-30 11:21:29 +00:00
|
|
|
#if defined(INET6) && defined(INET)
|
|
|
|
else
|
|
|
|
#endif
|
|
|
|
#ifdef INET
|
2001-11-22 04:50:44 +00:00
|
|
|
{
|
2012-05-28 09:30:13 +00:00
|
|
|
m->m_pkthdr.csum_flags = CSUM_TCP;
|
2004-08-16 18:32:07 +00:00
|
|
|
th->th_sum = in_pseudo(ip->ip_src.s_addr, ip->ip_dst.s_addr,
|
2007-03-15 15:59:28 +00:00
|
|
|
htons(tlen + optlen - hlen + IPPROTO_TCP));
|
2012-06-19 07:34:13 +00:00
|
|
|
#ifdef TCP_OFFLOAD
|
|
|
|
if (ADDED_BY_TOE(sc)) {
|
|
|
|
struct toedev *tod = sc->sc_tod;
|
|
|
|
|
|
|
|
error = tod->tod_syncache_respond(tod, sc->sc_todctx, m);
|
|
|
|
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
#endif
|
2006-06-17 17:32:38 +00:00
|
|
|
error = ip_output(m, sc->sc_ipopts, NULL, 0, NULL, NULL);
|
2001-11-22 04:50:44 +00:00
|
|
|
}
|
2011-04-30 11:21:29 +00:00
|
|
|
#endif
|
2001-11-22 04:50:44 +00:00
|
|
|
return (error);
|
|
|
|
}
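The header adjustment done above once the options are written can be checked in isolation. The following is a minimal sketch, not the kernel code: `tcp_data_offset()` and `ip_len_add()` are hypothetical helper names, the constant 20 stands in for `sizeof(struct tcphdr)`, and the sketch assumes the option length has already been padded to a multiple of 4 bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <arpa/inet.h>	/* htons(), ntohs() */

#define TCP_BASE_HDR_LEN 20u	/* assumed TCP header size without options */

/* Data offset in 32-bit words, the unit used by th_off. */
static unsigned
tcp_data_offset(unsigned optlen)
{
	/* TCP allows at most 40 bytes of options, padded to 4 bytes. */
	assert(optlen % 4 == 0 && optlen <= 40);
	return ((TCP_BASE_HDR_LEN + optlen) >> 2);
}

/* Grow a length field kept in network byte order by optlen bytes. */
static uint16_t
ip_len_add(uint16_t len_net, unsigned optlen)
{
	return (htons((uint16_t)(ntohs(len_net) + optlen)));
}
```

With 12 bytes of options the data offset becomes 8 words, and the IP length field has to grow by the same 12 bytes so that both headers agree on the packet size.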
|
2001-12-19 06:12:14 +00:00
|
|
|
|
|
|
|
/*
|
2013-07-11 15:29:25 +00:00
|
|
|
* The purpose of syncookies is to handle spoofed SYN flooding DoS attacks
|
|
|
|
* that exceed the capacity of the syncache by avoiding the storage of any
|
|
|
|
* of the SYNs we receive. Syncookies defend against blind SYN flooding
|
|
|
|
* attacks where the attacker does not have access to our responses.
|
2001-12-19 06:12:14 +00:00
|
|
|
*
|
2013-07-11 15:29:25 +00:00
|
|
|
* Syncookies encode and include all necessary information about the
|
|
|
|
* connection setup within the SYN|ACK that we send back. That way we
|
|
|
|
* can avoid keeping any local state until the ACK to our SYN|ACK returns
|
|
|
|
* (if ever). Normally the syncache and syncookies are running in parallel
|
|
|
|
* with the latter taking over when the former is exhausted. When a matching
|
|
|
|
* syncache entry is found the syncookie is ignored.
|
2006-09-13 13:08:27 +00:00
|
|
|
*
|
2016-05-03 18:05:43 +00:00
|
|
|
* The only reliable information persisting across the 3WHS is our initial sequence
|
2013-07-11 15:29:25 +00:00
|
|
|
* number ISS of 32 bits. Syncookies embed a cryptographically sufficiently
|
|
|
|
* strong hash (MAC) value and a few bits of TCP SYN options in the ISS
|
|
|
|
* of our SYN|ACK. The MAC can be recomputed when the ACK to our SYN|ACK
|
|
|
|
* returns and signifies a legitimate connection if it matches the ACK.
|
2006-09-13 13:08:27 +00:00
|
|
|
*
|
2013-07-11 15:29:25 +00:00
|
|
|
* The available space of 32 bits to store the hash and to encode the SYN
|
|
|
|
* option information is very tight and we should have at least 24 bits for
|
|
|
|
* the MAC to keep the number of guesses by blind spoofing reasonably high.
|
2006-09-13 13:08:27 +00:00
|
|
|
*
|
2013-07-11 15:29:25 +00:00
|
|
|
* SYN option information we have to encode to fully restore a connection:
|
|
|
|
* MSS: is important to choose an optimal segment size to avoid IP level
|
|
|
|
* fragmentation along the path. The common MSS values can be encoded
|
|
|
|
* in a 3-bit table. Uncommon values are captured by the next lower value
|
|
|
|
* in the table leading to a slight increase in packetization overhead.
|
|
|
|
* WSCALE: is necessary to allow large windows to be used for high delay-
|
|
|
|
* bandwidth product links. Not scaling the window when it was initially
|
|
|
|
* negotiated is bad for performance as lack of scaling further decreases
|
|
|
|
* the apparent available send window. We only need to encode the WSCALE
|
|
|
|
* we received from the remote end. Our end can be recalculated at any
|
|
|
|
* time. The common WSCALE values can be encoded in a 3-bit table.
|
|
|
|
* Uncommon values are captured by the next lower value in the table
|
|
|
|
* making us under-estimate the available window size halving our
|
|
|
|
* theoretically possible maximum throughput for that connection.
|
|
|
|
* SACK: Greatly assists in packet loss recovery and requires 1 bit.
|
|
|
|
* TIMESTAMP and SIGNATURE are not encoded because they are permanent options
|
|
|
|
* that are included in all segments on a connection. We enable them when
|
|
|
|
* the ACK has them.
|
2006-09-13 13:08:27 +00:00
|
|
|
*
|
2013-07-11 15:29:25 +00:00
|
|
|
* Security of syncookies and attack vectors:
|
2006-09-13 13:08:27 +00:00
|
|
|
*
|
2013-07-11 15:29:25 +00:00
|
|
|
* The MAC is computed over (faddr||laddr||fport||lport||irs||flags||secmod)
|
|
|
|
* together with the global secret to make it unique per connection attempt.
|
|
|
|
* Thus any change of any of those parameters results in a different MAC output
|
|
|
|
* in an unpredictable way unless a collision is encountered. 24 bits of the
|
|
|
|
* MAC are embedded into the ISS.
|
2006-09-13 13:08:27 +00:00
|
|
|
*
|
2013-07-11 15:29:25 +00:00
|
|
|
* To prevent replay attacks two rotating global secrets are updated with a
|
|
|
|
* new random value every 15 seconds. The life-time of a syncookie is thus
|
|
|
|
* 15-30 seconds.
|
2006-09-13 13:08:27 +00:00
|
|
|
*
|
2013-07-11 15:29:25 +00:00
|
|
|
* Vector 1: Attacking the secret. This requires finding a weakness in the
|
|
|
|
* MAC itself or the way it is used here. The attacker can do a chosen plain
|
|
|
|
* text attack by varying and testing all the parameters under his control.
|
|
|
|
* The strength depends on the size and randomness of the secret, and the
|
|
|
|
* cryptographic security of the MAC function. Due to the constant updating
|
|
|
|
* of the secret the attacker has at most 29.999 seconds to find the secret
|
|
|
|
* and launch spoofed connections. After that he has to start all over again.
|
2006-09-13 13:08:27 +00:00
|
|
|
*
|
2013-07-11 15:29:25 +00:00
|
|
|
* Vector 2: Collision attack on the MAC of a single ACK. With a 24 bit MAC
|
|
|
|
* size an average of 4,823 attempts are required for a 50% chance of success
|
|
|
|
* to spoof a single syncookie (birthday collision paradox). However the
|
|
|
|
* attacker is blind and doesn't know if one of his attempts succeeded unless
|
|
|
|
* he has a side channel to infer success from. A single connection setup
|
|
|
|
* success average of 90% requires 8,790 packets, 99.99% requires 17,578 packets.
|
|
|
|
* This many attempts are required for each one blind spoofed connection. For
|
|
|
|
* every additional spoofed connection he has to launch another N attempts.
|
|
|
|
* Thus for a sustained rate of 100 spoofed connections per second approximately
|
|
|
|
* 1,800,000 packets per second would have to be sent.
|
2001-12-19 06:12:14 +00:00
|
|
|
*
|
2013-07-11 15:29:25 +00:00
|
|
|
* NB: The MAC function should be fast so that it doesn't become a CPU
|
|
|
|
* exhaustion attack vector itself.
|
|
|
|
*
|
|
|
|
* References:
|
|
|
|
* RFC4987 TCP SYN Flooding Attacks and Common Mitigations
|
|
|
|
* SYN cookies were first proposed by cryptographer Dan J. Bernstein in 1996
|
|
|
|
* http://cr.yp.to/syncookies.html (overview)
|
|
|
|
* http://cr.yp.to/syncookies/archive (details)
|
|
|
|
*
|
|
|
|
*
|
|
|
|
* Schematic construction of a syncookie enabled Initial Sequence Number:
|
|
|
|
* 0 1 2 3
|
|
|
|
* 12345678901234567890123456789012
|
|
|
|
* |xxxxxxxxxxxxxxxxxxxxxxxxWWWMMMSP|
|
|
|
|
*
|
|
|
|
* x 24 MAC (truncated)
|
|
|
|
* W 3 Send Window Scale index
|
|
|
|
* M 3 MSS index
|
|
|
|
* S 1 SACK permitted
|
|
|
|
* P 1 Odd/even secret
|
2001-12-19 06:12:14 +00:00
|
|
|
*/
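The low eight bits of the schematic above can be sketched with explicit shifts. This is an illustration under stated assumptions, not the kernel's representation: the file uses a `union syncookie` with bitfields whose definition is not shown here, and the bit order below simply reads the schematic most significant bit first (`WWWMMMSP`). The names `struct sc_flags`, `sc_flags_pack()` and `sc_flags_unpack()` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Decoded form of the eight option bits carried in the ISS. */
struct sc_flags {
	unsigned wscale_idx;	/* W: 3 bits, index into the WSCALE table */
	unsigned mss_idx;	/* M: 3 bits, index into the MSS table */
	unsigned sack_ok;	/* S: 1 bit, SACK permitted */
	unsigned odd_even;	/* P: 1 bit, which of the two secrets */
};

/* Pack the fields into one byte, laid out as WWWMMMSP. */
static uint8_t
sc_flags_pack(struct sc_flags f)
{
	assert(f.wscale_idx < 8 && f.mss_idx < 8);
	assert(f.sack_ok < 2 && f.odd_even < 2);
	return ((uint8_t)((f.wscale_idx << 5) | (f.mss_idx << 2) |
	    (f.sack_ok << 1) | f.odd_even));
}

/* Inverse of sc_flags_pack(). */
static struct sc_flags
sc_flags_unpack(uint8_t b)
{
	struct sc_flags f;

	f.wscale_idx = (b >> 5) & 0x7;
	f.mss_idx = (b >> 2) & 0x7;
	f.sack_ok = (b >> 1) & 0x1;
	f.odd_even = b & 0x1;
	return (f);
}
```

Explicit shifts make the wire layout independent of the compiler, whereas a bitfield layout is implementation-defined. That difference does not matter to the kernel because the same host both generates and validates the cookie.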
|
|
|
|
|
2013-07-11 15:29:25 +00:00
|
|
|
/*
|
|
|
|
* Distribution and probability of certain MSS values. Those in between are
|
|
|
|
* rounded down to the next lower one.
|
|
|
|
* [An Analysis of TCP Maximum Segment Sizes, S. Alcock and R. Nelson, 2011]
|
|
|
|
* .2% .3% 5% 7% 7% 20% 15% 45%
|
|
|
|
*/
|
|
|
|
static int tcp_sc_msstab[] = { 216, 536, 1200, 1360, 1400, 1440, 1452, 1460 };
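The round-down behaviour of this table can be tried outside the kernel. The sketch below copies the table values from the line above and mirrors the downward search used later in this file; `msstab`, `NITEMS`, `mss_to_index()` and `index_to_mss()` are stand-in names for illustration only.

```c
#include <assert.h>
#include <stddef.h>

#define NITEMS(a)	(sizeof(a) / sizeof((a)[0]))

/* Same values as tcp_sc_msstab above. */
static const int msstab[] = { 216, 536, 1200, 1360, 1400, 1440, 1452, 1460 };

/* Largest table entry that does not exceed mss; index 0 if none fits. */
static unsigned
mss_to_index(int mss)
{
	unsigned i;

	for (i = NITEMS(msstab) - 1; msstab[i] > mss && i > 0; i--)
		;
	return (i);
}

/* Value restored from a 3-bit index when the cookie comes back. */
static int
index_to_mss(unsigned idx)
{
	assert(idx < NITEMS(msstab));
	return (msstab[idx]);
}
```

A peer MSS of 1300, for example, is stored as index 2 and restored as 1200, which costs some packetization efficiency but never exceeds what the peer announced. Values below 216 are the exception: they map to index 0 and are restored as 216.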
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Distribution and probability of certain WSCALE values. We have to map the
|
|
|
|
* (send) window scale (shift) option with a range of 0-14 from 4 bits into 3
|
|
|
|
* bits based on prevalence of certain values. Values without an exact
|
|
|
|
* match are rounded down to the next lower one letting us under-estimate
|
|
|
|
* the true available window. At the moment this would happen only for the
|
|
|
|
* very uncommon values 3, 5 and those above 8 (more than 16MB socket buffer
|
|
|
|
* and window size). The absence of the WSCALE option (no scaling in either
|
|
|
|
* direction) is encoded with index zero.
|
|
|
|
* [WSCALE values histograms, Allman, 2012]
|
|
|
|
* X 10 10 35 5 6 14 10% by host
|
|
|
|
* X 11 4 5 5 18 49 3% by connections
|
|
|
|
*/
|
|
|
|
static int tcp_sc_wstab[] = { 0, 0, 1, 2, 4, 6, 7, 8 };
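The WSCALE table differs from the MSS table in that index zero is reserved for "option absent", so a received shift of 0 has to land on index 1. The following sketch shows that mapping under the same search loop; `wstab`, `wscale_to_index()` and `index_to_wscale()` are hypothetical names, and the `have_wscale` parameter stands in for the `SCF_WINSCALE` flag check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NITEMS(a)	(sizeof(a) / sizeof((a)[0]))

/* Same values as tcp_sc_wstab above. */
static const int wstab[] = { 0, 0, 1, 2, 4, 6, 7, 8 };

/* Index 0 means "no WSCALE option"; otherwise round the shift down. */
static unsigned
wscale_to_index(bool have_wscale, int wscale)
{
	unsigned i;

	if (!have_wscale)
		return (0);
	assert(wscale >= 0 && wscale <= 14);
	for (i = NITEMS(wstab) - 1; wstab[i] > wscale && i > 0; i--)
		;
	return (i);
}

/* Returns false when scaling was not negotiated, else stores the shift. */
static bool
index_to_wscale(unsigned idx, int *wscale)
{
	assert(idx < NITEMS(wstab));
	if (idx == 0)
		return (false);
	*wscale = wstab[idx];
	return (true);
}
```

The duplicated 0 at the start of the table is what makes this work: the search stops at index 1 for a shift of 0, leaving index 0 free as the "not negotiated" marker.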
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Compute the MAC for the SYN cookie. SIPHASH-2-4 is chosen for its speed
|
|
|
|
* and good cryptographic properties.
|
|
|
|
*/
|
|
|
|
static uint32_t
|
|
|
|
syncookie_mac(struct in_conninfo *inc, tcp_seq irs, uint8_t flags,
|
|
|
|
uint8_t *secbits, uintptr_t secmod)
|
2001-12-19 06:12:14 +00:00
|
|
|
{
|
2013-07-11 15:29:25 +00:00
|
|
|
SIPHASH_CTX ctx;
|
|
|
|
uint32_t siphash[2];
|
|
|
|
|
|
|
|
SipHash24_Init(&ctx);
|
|
|
|
SipHash_SetKey(&ctx, secbits);
|
|
|
|
switch (inc->inc_flags & INC_ISIPV6) {
|
|
|
|
#ifdef INET
|
|
|
|
case 0:
|
|
|
|
SipHash_Update(&ctx, &inc->inc_faddr, sizeof(inc->inc_faddr));
|
|
|
|
SipHash_Update(&ctx, &inc->inc_laddr, sizeof(inc->inc_laddr));
|
|
|
|
break;
|
|
|
|
#endif
|
|
|
|
#ifdef INET6
|
|
|
|
case INC_ISIPV6:
|
|
|
|
SipHash_Update(&ctx, &inc->inc6_faddr, sizeof(inc->inc6_faddr));
|
|
|
|
SipHash_Update(&ctx, &inc->inc6_laddr, sizeof(inc->inc6_laddr));
|
|
|
|
break;
|
|
|
|
#endif
|
|
|
|
}
|
|
|
|
SipHash_Update(&ctx, &inc->inc_fport, sizeof(inc->inc_fport));
|
|
|
|
SipHash_Update(&ctx, &inc->inc_lport, sizeof(inc->inc_lport));
|
2015-01-30 17:29:07 +00:00
|
|
|
SipHash_Update(&ctx, &irs, sizeof(irs));
|
2013-07-11 15:29:25 +00:00
|
|
|
SipHash_Update(&ctx, &flags, sizeof(flags));
|
|
|
|
SipHash_Update(&ctx, &secmod, sizeof(secmod));
|
|
|
|
SipHash_Final((u_int8_t *)&siphash, &ctx);
|
|
|
|
|
|
|
|
return (siphash[0] ^ siphash[1]);
|
|
|
|
}
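The last step of the function above folds the 64-bit SipHash output into 32 bits by XORing its two halves. A minimal sketch of just that fold, with the hypothetical name `fold_digest64()` and `memcpy()` used instead of a pointer cast to stay clear of aliasing issues:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Fold an 8-byte digest into 32 bits by XORing its two 32-bit halves. */
static uint32_t
fold_digest64(const uint8_t digest[8])
{
	uint32_t w[2];

	assert(digest != NULL);
	memcpy(w, digest, sizeof(w));
	return (w[0] ^ w[1]);
}
```

Every output bit depends on both halves of the digest, so no part of the MAC is simply discarded before the result is truncated further to 24 bits.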
|
|
|
|
|
|
|
|
static tcp_seq
|
|
|
|
syncookie_generate(struct syncache_head *sch, struct syncache *sc)
|
|
|
|
{
|
|
|
|
u_int i, mss, secbit, wscale;
|
|
|
|
uint32_t iss, hash;
|
|
|
|
uint8_t *secbits;
|
|
|
|
union syncookie cookie;
|
2006-09-13 13:08:27 +00:00
|
|
|
|
|
|
|
SCH_LOCK_ASSERT(sch);
|
|
|
|
|
2013-07-11 15:29:25 +00:00
|
|
|
cookie.cookie = 0;
|
|
|
|
|
|
|
|
/* Map our computed MSS into the 3-bit index. */
|
|
|
|
mss = min(tcp_mssopt(&sc->sc_inc), max(sc->sc_peer_mss, V_tcp_minmss));
|
2016-04-20 16:19:44 +00:00
|
|
|
for (i = nitems(tcp_sc_msstab) - 1; tcp_sc_msstab[i] > mss && i > 0;
|
2013-07-11 15:29:25 +00:00
|
|
|
i--)
|
|
|
|
;
|
|
|
|
cookie.flags.mss_idx = i;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Map the send window scale into the 3-bit index but only if
|
|
|
|
* the wscale option was received.
|
|
|
|
*/
|
|
|
|
if (sc->sc_flags & SCF_WINSCALE) {
|
|
|
|
wscale = sc->sc_requested_s_scale;
|
2016-04-19 23:48:27 +00:00
|
|
|
for (i = nitems(tcp_sc_wstab) - 1;
|
2016-04-20 16:19:44 +00:00
|
|
|
tcp_sc_wstab[i] > wscale && i > 0;
|
2013-07-11 15:29:25 +00:00
|
|
|
i--)
|
|
|
|
;
|
|
|
|
cookie.flags.wscale_idx = i;
|
2001-12-19 06:12:14 +00:00
|
|
|
}
|
2006-09-13 13:08:27 +00:00
|
|
|
|
2013-07-11 15:29:25 +00:00
|
|
|
/* Can we do SACK? */
|
|
|
|
if (sc->sc_flags & SCF_SACK)
|
|
|
|
cookie.flags.sack_ok = 1;
|
2006-09-13 13:08:27 +00:00
|
|
|
|
2013-07-11 15:29:25 +00:00
|
|
|
/* Which of the two secrets to use. */
|
|
|
|
secbit = sch->sch_sc->secret.oddeven & 0x1;
|
|
|
|
cookie.flags.odd_even = secbit;
|
2006-09-13 13:08:27 +00:00
|
|
|
|
2013-07-11 15:29:25 +00:00
|
|
|
secbits = sch->sch_sc->secret.key[secbit];
|
|
|
|
hash = syncookie_mac(&sc->sc_inc, sc->sc_irs, cookie.cookie, secbits,
|
|
|
|
(uintptr_t)sch);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Put the flags into the hash and XOR them to get better ISS number
|
|
|
|
* variance. This doesn't enhance the cryptographic strength and is
|
|
|
|
* done to prevent the 8 cookie bits from showing up directly on the
|
|
|
|
* wire.
|
|
|
|
*/
|
|
|
|
iss = hash & ~0xff;
|
|
|
|
iss |= cookie.cookie ^ (hash >> 24);
|
|
|
|
|
|
|
|
/* Randomize the timestamp. */
|
2006-09-13 13:08:27 +00:00
|
|
|
if (sc->sc_flags & SCF_TIMESTAMP) {
|
2013-07-11 15:29:25 +00:00
|
|
|
sc->sc_ts = arc4random();
|
|
|
|
sc->sc_tsoff = sc->sc_ts - tcp_ts_getticks();
|
2007-05-18 21:42:25 +00:00
|
|
|
}
|
2006-09-13 13:08:27 +00:00
|
|
|
|
2009-04-11 22:07:19 +00:00
|
|
|
TCPSTAT_INC(tcps_sc_sendcookie);
|
2013-07-11 15:29:25 +00:00
|
|
|
return (iss);
|
2001-12-19 06:12:14 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static struct syncache *
|
2006-09-13 13:08:27 +00:00
|
|
|
syncookie_lookup(struct in_conninfo *inc, struct syncache_head *sch,
|
2013-07-11 15:29:25 +00:00
|
|
|
struct syncache *sc, struct tcphdr *th, struct tcpopt *to,
|
|
|
|
struct socket *lso)
|
2001-12-19 06:12:14 +00:00
|
|
|
{
|
2013-07-11 15:29:25 +00:00
|
|
|
uint32_t hash;
|
|
|
|
uint8_t *secbits;
|
2006-09-13 13:08:27 +00:00
|
|
|
tcp_seq ack, seq;
|
2013-07-11 15:29:25 +00:00
|
|
|
int wnd, wscale = 0;
|
|
|
|
union syncookie cookie;
|
2006-09-13 13:08:27 +00:00
|
|
|
|
|
|
|
SCH_LOCK_ASSERT(sch);
|
|
|
|
|
|
|
|
/*
|
2013-07-11 15:29:25 +00:00
|
|
|
* Pull information out of SYN-ACK/ACK and revert sequence number
|
|
|
|
* advances.
|
2006-09-13 13:08:27 +00:00
|
|
|
*/
|
|
|
|
ack = th->th_ack - 1;
|
|
|
|
seq = th->th_seq - 1;
|
|
|
|
|
|
|
|
/*
|
2013-07-11 15:29:25 +00:00
|
|
|
* Unpack the flags containing enough information to restore the
|
|
|
|
* connection.
|
2006-09-13 13:08:27 +00:00
|
|
|
*/
|
2013-07-11 15:29:25 +00:00
|
|
|
cookie.cookie = (ack & 0xff) ^ (ack >> 24);
|
2001-12-19 06:12:14 +00:00
|
|
|
|
2013-07-11 15:29:25 +00:00
|
|
|
/* Which of the two secrets to use. */
|
|
|
|
secbits = sch->sch_sc->secret.key[cookie.flags.odd_even];
|
2006-09-13 13:08:27 +00:00
|
|
|
|
2013-07-11 15:29:25 +00:00
|
|
|
hash = syncookie_mac(inc, seq, cookie.cookie, secbits, (uintptr_t)sch);
|
|
|
|
|
|
|
|
/* The recomputed hash matches the ACK if this was a genuine cookie. */
|
|
|
|
if ((ack & ~0xff) != (hash & ~0xff))
|
|
|
|
return (NULL);
|
2006-09-13 13:08:27 +00:00
|
|
|
|
|
|
|
/* Fill in the syncache values. */
|
2013-07-11 15:29:25 +00:00
|
|
|
sc->sc_flags = 0;
|
2006-06-26 16:14:19 +00:00
|
|
|
bcopy(inc, &sc->sc_inc, sizeof(struct in_conninfo));
|
2006-09-13 13:08:27 +00:00
|
|
|
sc->sc_ipopts = NULL;
|
|
|
|
|
|
|
|
sc->sc_irs = seq;
|
|
|
|
sc->sc_iss = ack;
|
|
|
|
|
2013-07-11 15:29:25 +00:00
|
|
|
switch (inc->inc_flags & INC_ISIPV6) {
|
|
|
|
#ifdef INET
|
|
|
|
case 0:
|
|
|
|
sc->sc_ip_ttl = sotoinpcb(lso)->inp_ip_ttl;
|
|
|
|
sc->sc_ip_tos = sotoinpcb(lso)->inp_ip_tos;
|
|
|
|
break;
|
|
|
|
#endif
|
2001-12-19 06:12:14 +00:00
|
|
|
#ifdef INET6
|
2013-07-11 15:29:25 +00:00
|
|
|
case INC_ISIPV6:
|
|
|
|
if (sotoinpcb(lso)->inp_flags & IN6P_AUTOFLOWLABEL)
|
|
|
|
sc->sc_flowlabel = sc->sc_iss & IPV6_FLOWLABEL_MASK;
|
|
|
|
break;
|
2001-12-19 06:12:14 +00:00
|
|
|
#endif
|
|
|
|
}
|
2006-09-13 13:08:27 +00:00
|
|
|
|
2013-07-11 15:29:25 +00:00
|
|
|
sc->sc_peer_mss = tcp_sc_msstab[cookie.flags.mss_idx];
|
|
|
|
|
|
|
|
/* We can simply recompute the receive window scale we sent earlier. */
|
|
|
|
while (wscale < TCP_MAX_WINSHIFT && (TCP_MAXWIN << wscale) < sb_max)
|
|
|
|
wscale++;
|
|
|
|
|
|
|
|
/* Only use wscale if it was enabled in the original SYN. */
|
|
|
|
if (cookie.flags.wscale_idx > 0) {
|
|
|
|
sc->sc_requested_r_scale = wscale;
|
|
|
|
sc->sc_requested_s_scale = tcp_sc_wstab[cookie.flags.wscale_idx];
|
|
|
|
sc->sc_flags |= SCF_WINSCALE;
|
|
|
|
}
|
|
|
|
|
Listening sockets improvements.
o Separate fields of struct socket that belong to listening from
fields that belong to normal dataflow, and unionize them. This
shrinks the structure a bit.
- Take out selinfo's from the socket buffers into the socket. The
first reason is to support braindamaged scenario when a socket is
added to kevent(2) and then listen(2) is cast on it. The second
reason is that there is future plan to make socket buffers pluggable,
so that for a dataflow socket a socket buffer can be changed, and
in this case we also want to keep same selinfos through the lifetime
of a socket.
- Remove struct so_accf. Since now listening stuff no longer
affects struct socket size, just move its fields into listening part
of the union.
- Provide sol_upcall field and enforce that so_upcall_set() may be called
only on a dataflow socket, which has buffers, and for listening sockets
provide solisten_upcall_set().
o Remove ACCEPT_LOCK() global.
- Add a mutex to socket, to be used instead of socket buffer lock to lock
fields of struct socket that don't belong to a socket buffer.
- Allow to acquire two socket locks, but the first one must belong to a
listening socket.
- Make soref()/sorele() to use atomic(9). This allows in some situations
to do soref() without owning socket lock. There is place for improvement
here, it is possible to make sorele() also to lock optionally.
- Most protocols aren't touched by this change, except UNIX local sockets.
See below for more information.
o Reduce copy-and-paste in kernel modules that accept connections from
listening sockets: provide function solisten_dequeue(), and use it in
the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4),
infiniband, rpc.
o UNIX local sockets.
- Removal of ACCEPT_LOCK() global uncovered several races in the UNIX
local sockets. Most races exist around spawning a new socket, when we
are connecting to a local listening socket. To cover them, we need to
hold locks on both PCBs when spawning a third one. This means holding
them across sonewconn(). This creates a LOR between pcb locks and
unp_list_lock.
- To fix the new LOR, abandon the global unp_list_lock in favor of global
unp_link_lock. Indeed, separating these two locks didn't provide us any
extra parallelism in the UNIX sockets.
- Now call into uipc_attach() may happen with unp_link_lock held, if we
are accepting, or without unp_link_lock if we are just creating
a socket.
- Another problem in UNIX sockets is that uipc_close() basically did nothing
for a listening socket. The vnode remained opened for connections. This
is fixed by removing vnode in uipc_close(). Maybe the right way would be
to do it for all sockets (not only listening), simply move the vnode
teardown from uipc_detach() to uipc_close()?
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D9770
2017-06-08 21:30:34 +00:00
|
|
|
wnd = lso->sol_sbrcv_hiwat;
|
2013-07-11 15:29:25 +00:00
|
|
|
wnd = imax(wnd, 0);
|
|
|
|
wnd = imin(wnd, TCP_MAXWIN);
|
|
|
|
sc->sc_wnd = wnd;
|
|
|
|
|
|
|
|
if (cookie.flags.sack_ok)
|
|
|
|
sc->sc_flags |= SCF_SACK;
|
|
|
|
|
|
|
|
if (to->to_flags & TOF_TS) {
|
2006-09-13 13:08:27 +00:00
|
|
|
sc->sc_flags |= SCF_TIMESTAMP;
|
|
|
|
sc->sc_tsreflect = to->to_tsval;
|
2007-05-18 21:42:25 +00:00
|
|
|
sc->sc_ts = to->to_tsecr;
|
2012-02-15 16:09:56 +00:00
|
|
|
sc->sc_tsoff = to->to_tsecr - tcp_ts_getticks();
|
2013-07-11 15:29:25 +00:00
|
|
|
}
|
2006-09-13 13:08:27 +00:00
|
|
|
|
2013-07-11 15:29:25 +00:00
|
|
|
if (to->to_flags & TOF_SIGNATURE)
|
|
|
|
sc->sc_flags |= SCF_SIGNATURE;
|
2006-09-13 13:08:27 +00:00
|
|
|
|
2006-06-17 17:32:38 +00:00
|
|
|
sc->sc_rxmits = 0;
|
2006-09-13 13:08:27 +00:00
|
|
|
|
2009-04-11 22:07:19 +00:00
|
|
|
TCPSTAT_INC(tcps_sc_recvcookie);
|
2001-12-19 06:12:14 +00:00
|
|
|
return (sc);
|
|
|
|
}
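The way syncookie_generate() hides the flag byte in the ISS and syncookie_lookup() recovers it can be shown as a round trip. This sketch takes the MAC as a plain argument instead of computing it, and the names `iss_embed()`, `iss_extract_flags()` and `iss_mac_matches()` are hypothetical. The caller is assumed to have already subtracted 1 from the acknowledgment number, as the lookup code does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Upper 24 bits: MAC. Lower 8 bits: flags XORed with the MAC's top byte. */
static uint32_t
iss_embed(uint32_t hash, uint8_t flags)
{
	uint32_t iss;

	iss = hash & ~0xffu;
	iss |= (flags ^ (hash >> 24)) & 0xffu;
	return (iss);
}

/* The top byte of the ISS equals the top byte of the MAC, so XOR undoes it. */
static uint8_t
iss_extract_flags(uint32_t iss)
{
	return ((uint8_t)((iss & 0xffu) ^ (iss >> 24)));
}

/* A cookie is accepted only if the upper 24 bits match the recomputed MAC. */
static bool
iss_mac_matches(uint32_t iss, uint32_t hash)
{
	assert((hash >> 24) <= 0xffu);
	return ((iss & ~0xffu) == (hash & ~0xffu));
}
```

The order of operations on the receiving side matters: the flags must be extracted first, because they are one of the inputs to the MAC that is then recomputed and compared.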
|
2007-07-27 00:57:06 +00:00
|
|
|
|
2013-07-11 15:29:25 +00:00
|
|
|
#ifdef INVARIANTS
|
|
|
|
static int
|
|
|
|
syncookie_cmp(struct in_conninfo *inc, struct syncache_head *sch,
|
|
|
|
struct syncache *sc, struct tcphdr *th, struct tcpopt *to,
|
|
|
|
struct socket *lso)
|
|
|
|
{
|
|
|
|
struct syncache scs, *scx;
|
|
|
|
char *s;
|
|
|
|
|
|
|
|
bzero(&scs, sizeof(scs));
|
|
|
|
scx = syncookie_lookup(inc, sch, &scs, th, to, lso);
|
|
|
|
|
|
|
|
if ((s = tcp_log_addrs(inc, th, NULL, NULL)) == NULL)
|
|
|
|
return (0);
|
|
|
|
|
|
|
|
if (scx != NULL) {
|
|
|
|
if (sc->sc_peer_mss != scx->sc_peer_mss)
|
|
|
|
log(LOG_DEBUG, "%s; %s: mss different %i vs %i\n",
|
|
|
|
s, __func__, sc->sc_peer_mss, scx->sc_peer_mss);
|
|
|
|
|
|
|
|
if (sc->sc_requested_r_scale != scx->sc_requested_r_scale)
|
|
|
|
log(LOG_DEBUG, "%s; %s: rwscale different %i vs %i\n",
|
|
|
|
s, __func__, sc->sc_requested_r_scale,
|
|
|
|
scx->sc_requested_r_scale);
|
|
|
|
|
|
|
|
if (sc->sc_requested_s_scale != scx->sc_requested_s_scale)
|
|
|
|
log(LOG_DEBUG, "%s; %s: swscale different %i vs %i\n",
|
|
|
|
s, __func__, sc->sc_requested_s_scale,
|
|
|
|
scx->sc_requested_s_scale);
|
|
|
|
|
|
|
|
if ((sc->sc_flags & SCF_SACK) != (scx->sc_flags & SCF_SACK))
|
|
|
|
log(LOG_DEBUG, "%s; %s: SACK different\n", s, __func__);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (s != NULL)
|
|
|
|
free(s, M_TCPLOG);
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
#endif /* INVARIANTS */
|
|
|
|
|
|
|
|
static void
|
|
|
|
syncookie_reseed(void *arg)
|
|
|
|
{
|
|
|
|
struct tcp_syncache *sc = arg;
|
|
|
|
uint8_t *secbits;
|
|
|
|
int secbit;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Reseeding the secret doesn't have to be protected by a lock.
|
|
|
|
* It only must be ensured that the new random values are visible
|
|
|
|
* to all CPUs in a SMP environment. The atomic with release
|
|
|
|
* semantics ensures that.
|
|
|
|
*/
|
|
|
|
secbit = (sc->secret.oddeven & 0x1) ? 0 : 1;
|
|
|
|
secbits = sc->secret.key[secbit];
|
|
|
|
arc4rand(secbits, SYNCOOKIE_SECRET_SIZE, 0);
|
|
|
|
atomic_add_rel_int(&sc->secret.oddeven, 1);
|
|
|
|
|
|
|
|
/* Reschedule ourself. */
|
|
|
|
callout_schedule(&sc->secret.reseed, SYNCOOKIE_LIFETIME * hz);
|
|
|
|
}
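The two-secret rotation above can be modelled in userland with C11 atomics. This is a sketch under assumptions, not the kernel mechanism: `struct secrets`, `secrets_init()`, `secrets_active()`, `secrets_reseed()` and `SECRET_SIZE` are hypothetical names, the release increment plays the role of `atomic_add_rel_int()`, and `fill_pattern()` is a deterministic stand-in for a real random source such as `arc4rand()`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SECRET_SIZE	16	/* assumed key size for this sketch */

struct secrets {
	atomic_uint	oddeven;		/* low bit selects the active key */
	uint8_t		key[2][SECRET_SIZE];
};

typedef void (*fill_fn)(uint8_t *buf, size_t len);

/* Deterministic stand-in for a random source; do not use for real keys. */
static void
fill_pattern(uint8_t *buf, size_t len)
{
	memset(buf, 0xA5, len);
}

static void
secrets_init(struct secrets *s)
{
	memset(s->key, 0, sizeof(s->key));
	atomic_init(&s->oddeven, 0);
}

/* Index of the key that new cookies are generated with. */
static unsigned
secrets_active(struct secrets *s)
{
	return (atomic_load_explicit(&s->oddeven, memory_order_acquire) & 0x1);
}

/* Refill the inactive key, then publish it by bumping the counter. */
static void
secrets_reseed(struct secrets *s, fill_fn fill)
{
	unsigned inactive = secrets_active(s) ^ 0x1;

	assert(fill != NULL);
	fill(s->key[inactive], SECRET_SIZE);
	atomic_fetch_add_explicit(&s->oddeven, 1, memory_order_release);
}
```

Because only the inactive key is overwritten, cookies issued under the previous key stay valid for one more period, which is where the 15 to 30 second lifetime mentioned earlier comes from.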
|
|
|
|
|
2007-07-27 00:57:06 +00:00
|
|
|
/*
|
|
|
|
* Exports the syncache entries to userland so that netstat can display
|
|
|
|
* them alongside the other sockets. This function is intended to be
|
|
|
|
* called only from tcp_pcblist.
|
|
|
|
*
|
|
|
|
* Due to concurrency on an active system, the number of pcbs exported
|
|
|
|
* may have no relation to max_pcbs. max_pcbs merely indicates the
|
|
|
|
* amount of space the caller allocated for this function to use.
|
|
|
|
*/
|
|
|
|
int
|
|
|
|
syncache_pcblist(struct sysctl_req *req, int max_pcbs, int *pcbs_exported)
|
|
|
|
{
|
|
|
|
struct xtcpcb xt;
|
|
|
|
struct syncache *sc;
|
|
|
|
struct syncache_head *sch;
|
|
|
|
int count, error, i;
|
|
|
|
|
Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).
This is the first in a series of commits over the course
of the next few weeks.
Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.
We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.
Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch
2008-08-17 23:27:27 +00:00
|
|
|
for (count = 0, error = 0, i = 0; i < V_tcp_syncache.hashsize; i++) {
|
|
|
|
sch = &V_tcp_syncache.hashbase[i];
|
2007-07-27 00:57:06 +00:00
|
|
|
SCH_LOCK(sch);
|
|
|
|
TAILQ_FOREACH(sc, &sch->sch_bucket, sc_hash) {
|
|
|
|
if (count >= max_pcbs) {
|
|
|
|
SCH_UNLOCK(sch);
|
|
|
|
goto exit;
|
|
|
|
}
|
2008-08-23 14:22:12 +00:00
|
|
|
if (cr_cansee(req->td->td_ucred, sc->sc_cred) != 0)
|
|
|
|
continue;
|
2007-07-27 00:57:06 +00:00
|
|
|
bzero(&xt, sizeof(xt));
|
|
|
|
xt.xt_len = sizeof(xt);
|
2008-12-17 12:52:34 +00:00
|
|
|
if (sc->sc_inc.inc_flags & INC_ISIPV6)
|
2007-07-27 00:57:06 +00:00
|
|
|
xt.xt_inp.inp_vflag = INP_IPV6;
|
|
|
|
else
|
|
|
|
xt.xt_inp.inp_vflag = INP_IPV4;
|
Hide struct inpcb, struct tcpcb from the userland.
This is a painful change, but it is needed. On the one hand, we avoid
modifying them, and this slows down some ideas, on the other hand we still
eventually modify them and tools like netstat(1) never work on next version of
FreeBSD. We maintain a ton of spares in them, and we already got some ifdef
hell at the end of tcpcb.
Details:
- Hide struct inpcb, struct tcpcb under _KERNEL || _WANT_FOO.
- Make struct xinpcb, struct xtcpcb pure API structures, not including
kernel structures inpcb and tcpcb inside. Export into these structures
the fields from inpcb and tcpcb that are known to be used, and put there
a ton of spare space.
- Make kernel and userland utilities compilable after these changes.
- Bump __FreeBSD_version.
Reviewed by: rrs, gnn
Differential Revision: D10018
2017-03-21 06:39:49 +00:00
|
|
|
bcopy(&sc->sc_inc, &xt.xt_inp.inp_inc,
|
|
|
|
sizeof (struct in_conninfo));
|
|
|
|
xt.t_state = TCPS_SYN_RECEIVED;
|
|
|
|
xt.xt_inp.xi_socket.xso_protocol = IPPROTO_TCP;
|
|
|
|
xt.xt_inp.xi_socket.xso_len = sizeof (struct xsocket);
|
|
|
|
xt.xt_inp.xi_socket.so_type = SOCK_STREAM;
|
|
|
|
xt.xt_inp.xi_socket.so_state = SS_ISCONNECTING;
|
2007-07-27 00:57:06 +00:00
|
|
|
error = SYSCTL_OUT(req, &xt, sizeof xt);
|
|
|
|
if (error) {
|
|
|
|
SCH_UNLOCK(sch);
|
|
|
|
goto exit;
|
|
|
|
}
|
|
|
|
count++;
|
|
|
|
}
|
|
|
|
SCH_UNLOCK(sch);
|
|
|
|
}
|
|
|
|
exit:
|
|
|
|
*pcbs_exported = count;
|
|
|
|
return (error);
|
|
|
|
}
|