Commit Graph

250 Commits

Author SHA1 Message Date
Bjoern A. Zeeb
3f58662dd9 The pr_destroy field does not allow us to run the teardown code in a
specific order.  VNET_SYSUNINITs, however, do exactly that.
Thus remove the VIMAGE conditional field from the domain(9) protosw
structure and replace it with VNET_SYSUNINITs.
This also allows us to change some ordering and to make the teardown functions
file-local static.
Also convert divert(4), as it uses the same mechanism ip(4) and ip6(4) use
internally.

Slightly reshuffle the SI_SUB_* entries in kernel.h and add new ones, e.g.,
for pfil consumers (firewalls), partially for this commit and for others
to come.

Reviewed by:		gnn, tuexen (sctp), jhb (kernel.h)
Obtained from:		projects/vnet
MFC after:		2 weeks
X-MFC:			do not remove pr_destroy
Sponsored by:		The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D6652
2016-06-01 10:14:04 +00:00
Gleb Smirnoff
f59d975e10 Tiny refactor of r294869/r296881: use defines to mask the VNET() macro.
Suggested by:	bz
2016-05-17 23:14:17 +00:00
Randall Stewart
5105a92c49 This small change adopts the excellent suggestion to use named
structure initializers when adding a new TCP stack, which came in late to me
via email after the last commit. It also makes it so that a new
stack may optionally get a callback during a retransmit
timeout. This allows the new stack to clear specific state (think
SACK scoreboards or other such structures).

Sponsored by:	Netflix Inc.
Differential Revision:	http://reviews.freebsd.org/D6303
2016-05-17 09:53:22 +00:00
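For illustration, a sketch of what registering an alternate stack with named (designated) initializers might look like; the structure, field, and function names below are assumptions for illustration, not taken from the commit.

/*
 * Hypothetical sketch: an alternate TCP stack declared with designated
 * ("named") initializers, so optional callbacks such as a retransmit-timeout
 * hook can be added without touching every existing stack definition.
 * All identifiers here are illustrative assumptions.
 */
static struct tcp_function_block example_stack = {
	.tfb_tcp_block_name = "example",               /* assumed field names */
	.tfb_tcp_output     = example_tcp_output,
	.tfb_tcp_do_segment = example_tcp_do_segment,
	.tfb_tcp_rexmit_tmr = example_rexmit_callback, /* optional timeout hook */
};

/* At module load (registration interface name assumed):
 *     error = register_tcp_functions(&example_stack, M_WAITOK);
 */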
Randall Stewart
e5ad64562a This cleans up the timers code in TCP to start using the new
async_drain functionality. This has been tested in NF as well as
by Verisign. Still to do in here is to remove all the old flags; they
are currently left maintained but probably are no longer needed.

Sponsored by:	Netflix Inc.
Differential Revision:	http://reviews.freebsd.org/D5924
2016-04-28 13:27:12 +00:00
Jonathan T. Looney
5d20f97461 to_flags is currently a 64-bit integer; however, we only use 7 bits.
Furthermore, there is no reason this needs to be a 64-bit integer
for the foreseeable future.

Also, there is an inconsistency between to_flags and the mask in
tcp_addoptions(). Before r195654, to_flags was a u_long and the mask in
tcp_addoptions() was a u_int. r195654 changed to_flags to be a u_int64_t
but left the mask in tcp_addoptions() as a u_int, meaning that these
variables will only be the same width on platforms with 64-bit integers.

Convert both to_flags and the mask in tcp_addoptions() to be explicitly
32-bit variables. This may save a few cycles on 32-bit platforms, and
avoids unnecessarily mixing types.

Differential Revision:	https://reviews.freebsd.org/D5584
Reviewed by:	hiren
MFC after:	2 weeks
Sponsored by:	Juniper Networks
2016-03-22 15:55:17 +00:00
Gleb Smirnoff
bf840a1707 Redo r294869. The array of counters for TCP states doesn't belong in
struct tcpstat, because the structure can be zeroed out by netstat(1) -z,
and of course running connection counts shouldn't be touched.

Place the running connection counts into a separate array, and provide a
separate read-only sysctl oid for it.
2016-03-15 00:15:10 +00:00
Gleb Smirnoff
2f06d2ab91 Comment fix: statistics are not read-only. 2016-03-14 18:06:59 +00:00
Gleb Smirnoff
57a78e3bae Augment struct tcpstat with tcps_states[], which is used for keeping track
of the number of TCP connections by state.  This provides a cheap way to get
a connection count without traversing the whole pcb list.

Sponsored by:	Netflix
2016-01-27 00:45:46 +00:00
Gleb Smirnoff
d17d4c6b2a Provide TCPSTAT_DEC() and TCPSTAT_FETCH() macros. 2016-01-27 00:20:07 +00:00
Gleb Smirnoff
0c39d38d21 Historically we have had two fields in the tcpcb to describe the sender MSS:
t_maxopd and t_maxseg. This dualism emerged with T/TCP, but was not properly cleaned
up after T/TCP removal. After all the permutations over the years, the result is
that t_maxopd stores the minimum of the peer-offered MSS and the MTU reduced by the
minimum protocol header. And t_maxseg stores (t_maxopd - TCPOLEN_TSTAMP_APPA) if
timestamps are in action, or is equal to t_maxopd otherwise. That's a very
rough estimate of the MSS reduced by the options length. Throughout the code it
was used in places where precision was not important, like cwnd or
ssthresh calculations.

With this change:

- t_maxopd goes away.
- t_maxseg now stores the MSS not adjusted by options.
- a new function, tcp_maxseg(), is provided that calculates the MSS reduced by
  the options length. The function gives a better estimate, since it takes
  SACK state into account as well.

Reviewed by:	jtl
Differential Revision:	https://reviews.freebsd.org/D3593
2016-01-07 00:14:42 +00:00
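For illustration, a rough kernel-side sketch of the arithmetic described above (the unadjusted MSS in t_maxseg minus expected option bytes); this is a simplification, not the actual tcp_maxseg(). The sack_blocks parameter stands in for the SACK state the real function consults.

/* Kernel context: <netinet/tcp.h> and <netinet/tcp_var.h> for the types used. */
static u_int
example_maxseg(const struct tcpcb *tp, int sack_blocks)
{
	u_int optlen = 0;

	if (tp->t_flags & TF_RCVD_TSTMP)	/* timestamps in use */
		optlen += TCPOLEN_TSTAMP_APPA;	/* 12 bytes */
	if (sack_blocks > 0)			/* SACK option: kind/len + blocks */
		optlen += 2 + sack_blocks * TCPOLEN_SACK;
	return (tp->t_maxseg - optlen);		/* MSS reduced by options length */
}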
Patrick Kelsey
281a0fd4f9 Implementation of server-side TCP Fast Open (TFO) [RFC7413].
TFO is disabled by default in the kernel build.  See the top comment
in sys/netinet/tcp_fastopen.c for implementation particulars.

Reviewed by:	gnn, jch, stas
MFC after:	3 days
Sponsored by:	Verisign, Inc.
Differential Revision:	https://reviews.freebsd.org/D4350
2015-12-24 19:09:48 +00:00
Randall Stewart
55bceb1e2b First cut of the modularization of our TCP stack. Still
to do is to clean up the timer handling using the async-drain.
Other optimizations may be coming to go with this. What's here
will allow different TCP implementations (one included).
Reviewed by:	jtl, hiren, transports
Sponsored by:	Netflix Inc.
Differential Revision:	D4055
2015-12-16 00:56:45 +00:00
Hiren Panchasara
4d16338223 Clean up unused bandwidth entry in the TCP hostcache.
Submitted by:		Jason Wolfe (j at nitrology dot com)
Reviewed by:		rrs, hiren
Sponsored by:		Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D4154
2015-12-11 06:22:58 +00:00
Hiren Panchasara
b87170f210 r290122 added 4 bytes and removed 8 in struct sackhint. Add a pad entry of 4
bytes to restore the size.

Spotted by:	rrs
Reviewed by:	rrs
X-MFC with:	r290122
Sponsored by:	Limelight Networks
2015-12-10 03:20:10 +00:00
Hiren Panchasara
021eaf7996 One of the ways to detect loss is to count duplicate acks coming back from the
other end until it reaches a predetermined threshold, which is 3 for us right now.
Once that happens, we trigger fast-retransmit to do loss recovery.

The main problem with the current implementation is that we don't honor SACK
information well when detecting whether an incoming ack is a dupack or not.
RFC 6675 has the latest recommendations for that. According to it, a dupack is a
segment that arrives carrying a SACK block that identifies previously unknown
information between snd_una and snd_max, even if it carries new data, changes the
advertised window, or moves the cumulative acknowledgment point.

With the prevalence of Selective ACK (SACK) these days, improper handling can
lead to delayed loss recovery.

With the fix, the new behavior looks like the following:

0) th_ack < snd_una --> ignore
Old acks are ignored.
1) th_ack == snd_una, !sack_changed --> ignore
Acks with SACK enabled but without any new SACK info in them are ignored.
2) th_ack == snd_una, window == old_window --> increment
Increment on a good dupack.
3) th_ack == snd_una, window != old_window, sack_changed --> increment
When SACK is enabled, it's okay for the advertised window to change if the ack
has new SACK info.
4) th_ack > snd_una --> reset to 0
Reset to 0 when left edge moves.
5) th_ack > snd_una, sack_changed --> increment
Increment if left edge moves but there is new SACK info.

Here, sack_changed is the indicator that the incoming ack has previously unknown
SACK info in it.

Note: This fix is not fully compliant with RFC 6675. That may require a few
changes to the current implementation in order to keep a per-sackhole dupack
counter and to change the way we mark/handle SACK holes.

PR:			203663
Reviewed by:		jtl
MFC after:		3 weeks
Sponsored by:		Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D4225
2015-12-08 21:21:48 +00:00
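For illustration, a compact sketch of one reading of the rules above; the parameter names and action values are illustrative, not the actual tcp_input() code, and sequence-space wraparound is ignored for brevity (the real code uses SEQ_LT/SEQ_GT).

#include <stdint.h>

enum dupack_action { DA_IGNORE, DA_INCREMENT, DA_RESET };

/* Classify an incoming ack per the rules in the commit message. */
static enum dupack_action
classify_ack(uint32_t th_ack, uint32_t snd_una, uint32_t window,
    uint32_t old_window, int sack_enabled, int sack_changed)
{
	if (th_ack < snd_una)		/* 0) old ack */
		return (DA_IGNORE);
	if (th_ack > snd_una)		/* 4) + 5) left edge moved */
		return (sack_changed ? DA_INCREMENT : DA_RESET);
	if (sack_enabled)		/* 1) + 3) new SACK info decides */
		return (sack_changed ? DA_INCREMENT : DA_IGNORE);
	return (window == old_window ?	/* 2) classic dupack check */
	    DA_INCREMENT : DA_IGNORE);
}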
Hiren Panchasara
12eeb81fc1 Calculate the correct amount of bytes that are in-flight for a connection as
suggested by RFC 6675.

Currently, different places in the stack try to guess this in suboptimal ways.
The main problem is that the current calculations don't take sacked bytes into
account. Sacked bytes are the bytes the receiver acked via the SACK option. This
is suboptimal because it assumes that the network has more outstanding (unacked)
bytes than it actually does, and thus sends less data by setting the congestion
window lower than what is possible, which in turn may cause slower recovery from
losses.

As an example, one of the current calculations looks something like this:
snd_nxt - snd_fack + sackhint.sack_bytes_rexmit
The new proposal from RFC 6675 is:
snd_max - snd_una - sackhint.sacked_bytes + sackhint.sack_bytes_rexmit
which takes sacked bytes into account via sacked_bytes, a new addition to the
sackhint struct. The only thing we are missing from RFC 6675 is IsLost(), i.e.
a segment being considered lost and pipe being adjusted based on that, which
makes this calculation a bit on the conservative side.

The approach is very simple. We already process each ack with SACK info in
tcp_sack_doack() and extract SACK blocks/holes out of it. We now also track a
new variable, sacked_bytes, which keeps track of the total sacked bytes reported.

One downside to this approach is that we may get an incorrect count of
sacked_bytes if the other end decides to drop SACK info from the ack because of
memory pressure or some other reason. But even in this (not very likely) case
the pipe calculation would be conservative, which is okay, as opposed to being
aggressive in sending packets into the network.

Next step is to use this more accurate pipe estimation to drive congestion
window adjustments.

In collaboration with:	rrs
Reviewed by:		jason_eggnet dot com, rrs
MFC after:		2 weeks
Sponsored by:		Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D3971
2015-10-28 22:57:51 +00:00
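For illustration, the old and new in-flight estimates from the message above written out as a small kernel-side sketch; the tcpcb/sackhint field names are those given in the commit, while the wrapper function itself is hypothetical.

/* Kernel context: <netinet/tcp_var.h> provides struct tcpcb and its sackhint. */
static int
example_pipe(const struct tcpcb *tp)
{
	/* Old, suboptimal style of estimate (ignores sacked bytes):
	 *     tp->snd_nxt - tp->snd_fack + tp->sackhint.sack_bytes_rexmit
	 */

	/* RFC 6675-style estimate using the new sacked_bytes field: */
	return (tp->snd_max - tp->snd_una -
	    tp->sackhint.sacked_bytes + tp->sackhint.sack_bytes_rexmit);
}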
Hiren Panchasara
356c7958a4 Add a sysctl tunable, net.inet.tcp.initcwnd_segments, to specify the initial
congestion window as a number of segments on the fly. It is set to 10 segments
by default.

Remove net.inet.tcp.experimental.initcwnd10, which is now redundant. Also remove
the parent node net.inet.tcp.experimental, as it is no longer needed and also
because it was not well thought out.

Differential Revision:	https://reviews.freebsd.org/D3858
In collaboration with:	lstewart
Reviewed by:		gnn (prev version), rwatson, allanjude, wblock (man page)
MFC after:		2 weeks
Relnotes:		yes
Sponsored by:		Limelight Networks
2015-10-27 09:43:05 +00:00
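For illustration, a small userland sketch that reads and sets the new tunable via sysctlbyname(3), equivalent to running "sysctl net.inet.tcp.initcwnd_segments=10" as root; the tunable is assumed to be int-sized.

#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>

int
main(void)
{
	int segs, newsegs = 10;		/* initial cwnd, in segments */
	size_t len = sizeof(segs);

	/* Read the old value and set the new one in a single call. */
	if (sysctlbyname("net.inet.tcp.initcwnd_segments", &segs, &len,
	    &newsegs, sizeof(newsegs)) == -1) {
		perror("sysctlbyname");
		return (1);
	}
	printf("initcwnd_segments: %d -> %d\n", segs, newsegs);
	return (0);
}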
Hiren Panchasara
86a996e6bd There are times when it would be really nice to have a record of the last few
packets and/or state transitions from each TCP socket. That would help with
narrowing down certain problems we see in the field that are hard to reproduce
without understanding the history of how we got into a certain state. This
change provides just that.

It saves copies of the last N packets in a list in the tcpcb. When the tcpcb is
destroyed, the list is freed. I thought this was likely to be more
performance-friendly than saving copies of the tcpcb. Plus, with the packets,
you should be able to reverse-engineer what happened to the tcpcb.

To enable the feature, you will need to compile a kernel with the TCPPCAP
option. Even then, the feature defaults to being deactivated. You can activate
it by setting a positive value for the number of captured packets. You can do
that on either a global basis or on a per-socket basis (via a setsockopt call).

There is no way to get the packets out of the kernel other than using kmem or
getting a coredump. I thought that would help some of the legal/privacy concerns
regarding such a feature. However, it should be possible to add a future effort
to export them in PCAP format.

I tested this at low scale, and found that there were no mbuf leaks and the peak
mbuf usage appeared to be unchanged with and without the feature.

The main performance concern I can envision is the number of mbufs that would be
used on systems with a large number of sockets. If you save five packets per
direction per socket and have 3,000 sockets, that will consume at least 30,000
mbufs just to keep these packets. I tried to reduce the concerns associated with
this by limiting the number of clusters (not mbufs) that could be used for this
feature. Again, in my testing, that appears to work correctly.

Differential Revision:	D3100
Submitted by:		Jonathan Looney <jlooney at juniper dot net>
Reviewed by:		gnn, hiren
2015-10-14 00:35:37 +00:00
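For illustration, a sketch of the per-socket activation path via setsockopt(2); the TCP_PCAP_OUT/TCP_PCAP_IN option names are assumptions (the commit does not name them), and the kernel must be built with the TCPPCAP option.

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>

/* Ask the kernel to retain the last npackets packets in each direction
 * for this socket.  Option names below are assumptions. */
static int
enable_tcp_pcap(int s, int npackets)
{
	if (setsockopt(s, IPPROTO_TCP, TCP_PCAP_OUT,	/* assumed name */
	    &npackets, sizeof(npackets)) == -1 ||
	    setsockopt(s, IPPROTO_TCP, TCP_PCAP_IN,	/* assumed name */
	    &npackets, sizeof(npackets)) == -1) {
		perror("setsockopt");
		return (-1);
	}
	return (0);
}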
Alexander V. Chernikov
1558cb2448 Eliminate nd6_nud_hint() and its TCP bindings.
Initially the function was introduced in r53541 (KAME initial commit) to
  "provide hints from upper layer protocols that indicate a connection
  is making "forward progress"" (quote from RFC 2461 7.3.1 Reachability
  Confirmation).
However, it was converted to do nothing (i.e., just return) in r122922
  (tcp_hostcache implementation) back in 2003. Some defines were moved
  to tcp_var.h in r169541. Then, it was broken (for non-corner cases)
  by r186119 (L2<>L3 split) in 2008 (NULL ifp in nd6_lookup). So,
  right now this code is broken and has no "real" base users.

Differential Revision:	https://reviews.freebsd.org/D3699
2015-09-27 05:29:34 +00:00
Gleb Smirnoff
24067db8ca Make tcp_mtudisc() static and void. No functional changes.
Sponsored by:	Nginx, Inc.
2015-09-04 12:02:12 +00:00
Hiren Panchasara
03041aaac8 Update snd_una description to make it more readable.
Differential Revision:	https://reviews.freebsd.org/D3179
Reviewed by:		gnn
Sponsored by:		Limelight Networks
2015-07-30 19:24:49 +00:00
Patrick Kelsey
4741bfcb57 Revert r265338, r271089 and r271123 as those changes do not handle
non-inline urgent data and introduce an mbuf exhaustion attack vector
similar to FreeBSD-SA-15:15.tcp, but not requiring VNETs.

Address the issue described in FreeBSD-SA-15:15.tcp.

Reviewed by:	glebius
Approved by:	so
Approved by:	jmallett (mentor)
Security:	FreeBSD-SA-15:15.tcp
Sponsored by:	Norse Corp, Inc.
2015-07-29 17:59:13 +00:00
Julien Charbon
5571f9cf81 Fix an old and well-documented use-after-free race condition in
TCP timers:
 - Add a reference from tcpcb to its inpcb
 - Defer tcpcb deletion until TCP timers have finished

Differential Revision:	https://reviews.freebsd.org/D2079
Submitted by:		jch, Marc De La Gueronniere <mdelagueronniere@verisign.com>
Reviewed by:		imp, rrs, adrian, jhb, bz
Approved by:		jhb
Sponsored by:		Verisign, Inc.
2015-04-16 10:00:06 +00:00
Julien Charbon
71da715374 Re-introduce padding fields removed with r264321 to keep
struct tcptw ABI unchanged.

Suggested by:	jhb
Approved by:	jhb (mentor)
MFC after:	1 day
X-MFC-With:	r264321
2014-11-17 14:56:02 +00:00
Hans Petter Selasky
4952ad427a Restore spares used in "struct tcpcb" and bump "__FreeBSD_version" to
indicate need for kernel module re-compilation.

Sponsored by:	Mellanox Technologies
2014-11-03 13:01:58 +00:00
Julien Charbon
cea40c4888 Fix a race condition in TCP timewait between tcp_tw_2msl_reuse() and
tcp_tw_2msl_scan().  This race condition drives unplanned timewait
timeout cancellation.  Also simplify implementation by holding inpcb
reference and removing tcptw reference counting.

Differential Revision:	https://reviews.freebsd.org/D826
Submitted by:		Marc De la Gueronniere <mdelagueronniere@verisign.com>
Submitted by:		jch
Reviewed By:		jhb (mentor), adrian, rwatson
Sponsored by:		Verisign, Inc.
MFC after:		2 weeks
X-MFC-With:		r264321
2014-10-30 08:53:56 +00:00
Sean Bruno
f6f6703f27 Implement PLPMTUD blackhole detection (RFC 4821), inspired by code
from xnu sources.  If we encounter a network where ICMP is blocked,
the Needs Frag indicator may not propagate back to us.  Attempt to
downshift the MSS once to a preconfigured value.

Default this feature to off for now while we do not have a full PLPMTUD
implementation in our stack.

Adds the following new sysctls for control:
net.inet.tcp.pmtud_blackhole_detection -- turns this feature on/off
net.inet.tcp.pmtud_blackhole_mss       -- MSS to try for IPv4
net.inet.tcp.v6pmtud_blackhole_mss     -- MSS to try for IPv6

Adds the following new sysctls for monitoring:
-- Number of times the code was activated to attempt an MSS downshift
net.inet.tcp.pmtud_blackhole_activated
-- Number of times the blackhole MSS was used in an attempt to downshift
net.inet.tcp.pmtud_blackhole_min_activated
-- Number of times that we failed to connect after we downshifted the MSS
net.inet.tcp.pmtud_blackhole_failed

Phabricator:	https://reviews.freebsd.org/D506
Reviewed by:	rpaulo bz
MFC after:	2 weeks
Relnotes:	yes
Sponsored by:	Limelight Networks
2014-10-07 21:50:28 +00:00
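For illustration, a short userland sketch that enables the feature and dumps the three monitoring counters listed above via sysctlbyname(3); the counter width is assumed to fit in 64 bits.

#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	const char *counters[] = {
		"net.inet.tcp.pmtud_blackhole_activated",
		"net.inet.tcp.pmtud_blackhole_min_activated",
		"net.inet.tcp.pmtud_blackhole_failed",
	};
	int on = 1;

	/* Turn the feature on (needs root); same effect as
	 * "sysctl net.inet.tcp.pmtud_blackhole_detection=1". */
	if (sysctlbyname("net.inet.tcp.pmtud_blackhole_detection", NULL, NULL,
	    &on, sizeof(on)) == -1)
		perror("pmtud_blackhole_detection");

	for (int i = 0; i < 3; i++) {
		uint64_t val = 0;
		size_t len = sizeof(val);	/* counter width assumed */

		if (sysctlbyname(counters[i], &val, &len, NULL, 0) == 0)
			printf("%s: %ju\n", counters[i], (uintmax_t)val);
	}
	return (0);
}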
Alexander V. Chernikov
29c47f18da * Split tcp_signature_compute() into 2 pieces:
  - tcp_get_sav() - SADB key lookup
  - tcp_signature_do_compute() - actual computation
* Fix the TCP signature case for listening sockets:
  do not assume EVERY connection coming to a socket
  with TCP_SIGNATURE set is MD5-signed regardless
  of SADB key existence for the particular address. This
  fixes the case of routing software having _some_
  BGP sessions secured by MD5.
* Simplify TCP_SIGNATURE handling in tcp_input()

MFC after:	2 weeks
2014-09-27 07:04:12 +00:00
Hans Petter Selasky
9fd573c39d Improve the transmit send offload (TSO) algorithm in general.
The current TSO limitation feature only takes the total number of
bytes in an mbuf chain into account and does not limit by the number
of mbufs in a chain. Some kinds of hardware are limited by two
factors. One is the fragment length and the second is the fragment
count. Both of these limits need to be taken into account when doing
TSO. Otherwise some kinds of hardware might have to drop completely valid
mbuf chains because they cannot be loaded into the given hardware's DMA
engine. The new way of doing TSO limitation has been made backwards
compatible, following input from other FreeBSD developers, and will use
defaults for values not set.

Reviewed by:	adrian, rmacklem
Sponsored by:	Mellanox Technologies
MFC after:	1 week
2014-09-22 08:27:27 +00:00
Kevin Lo
8f5a8818f5 Merge 'struct ip6protosw' and 'struct protosw' into one. Now we have
only one protocol switch structure that is shared between ipv4 and ipv6.

Phabric:	D476
Reviewed by:	jhb
2014-08-08 01:57:15 +00:00
Bjoern A. Zeeb
4fd2b4eb53 Make tcp_twrespond() file local private; this removes it from the
public KPI; it is not used anywhere else and seems it never was.

MFC after:	2 weeks
2014-05-24 14:01:18 +00:00
Bjoern A. Zeeb
255cd9fd58 Move the tcp_fields_to_host() and tcp_fields_to_net() (inline)
functions to the tcp_var.h header file in order to avoid further
duplication with upcoming commits.

Reviewed by:	np
MFC after:	2 weeks
2014-05-23 20:15:01 +00:00
Gleb Smirnoff
b1a4156614 Provide compatibility #define after r265408.
Suggested by:	truckman
2014-05-17 12:33:27 +00:00
Mike Silbersack
f1395664e5 Remove the function tcp_twrecycleable; it has been #if 0'd for
eight years.  The original concept was to improve the
corner case where you run out of ephemeral ports, but it
was causing performance problems and the mechanism
of limiting the number of time_wait sockets serves
the same purpose in the end.

Reviewed by:	bz
2014-05-16 01:38:38 +00:00
Gleb Smirnoff
c669105d17 - Remove the net.inet.tcp.reass.overflows sysctl. It counts exactly the
same events that tcpstat's tcps_rcvmemdrop counter counts.
- Rename tcps_rcvmemdrop to tcps_rcvreassfull and improve its
  description in netstat(1) output.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-05-06 00:00:07 +00:00
Gleb Smirnoff
e407b67be4 FreeBSD-SA-14:08.tcp was a lesson in not doing acrobatics
mixing on-stack memory and UMA memory in one linked list.

Thus, rewrite the TCP reassembly code in terms of memory usage. The
algorithm remains unchanged.

We actually do not need extra memory to build a reassembly queue.
Arriving mbufs are always packet header mbufs, so we get the data
length from pkthdr.len. We have m_nextpkt for linkage. And we need
only one pointer to point at the tcphdr; use PH_loc for that.

In the tcpcb the t_segq field becomes an mbuf pointer. The t_segqlen
field now counts bytes in the queue, not packets. This gives
us more precision when comparing against socket buffer limits.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-05-04 23:25:32 +00:00
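For illustration, a kernel-side sketch of walking a reassembly queue kept this way, using only the mbuf facilities named above (m_nextpkt linkage, pkthdr.len, PH_loc); the helper function itself is hypothetical.

/* Kernel context: <sys/mbuf.h> for struct mbuf, <netinet/tcp.h> for tcphdr. */
static int
example_segq_bytes(struct mbuf *head)
{
	struct mbuf *m;
	struct tcphdr *th;
	int bytes = 0;

	for (m = head; m != NULL; m = m->m_nextpkt) {
		th = m->m_pkthdr.PH_loc.ptr;	/* saved pointer to the tcphdr */
		bytes += m->m_pkthdr.len;	/* data bytes in this segment */
		(void)th;			/* unused in this sketch */
	}
	return (bytes);		/* what t_segqlen now accounts for, in bytes */
}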
John Baldwin
66eefb1eae Currently, the TCP slow timer can starve TCP input processing while it
walks the list of connections in TIME_WAIT closing expired connections
due to contention on the global TCP pcbinfo lock.

To remediate, introduce a new global lock to protect the list of
connections in TIME_WAIT.  Only acquire the TCP pcbinfo lock when
closing an expired connection.  This limits the window of time when
TCP input processing is stopped to the amount of time needed to close
a single connection.

Submitted by:	Julien Charbon <jcharbon@verisign.com>
Reviewed by:	rwatson, rrs, adrian
MFC after:	2 months
2014-04-10 18:15:35 +00:00
John Baldwin
5b26ea5df3 Remove more constants related to static sysctl nodes. The MAXID constants
were primarily used to size the sysctl name list macros that were removed
in r254295.  A few other constants either did not have an associated
sysctl node, or the associated node used OID_AUTO instead.

PR:		ports/184525 (exp-run)
2014-02-25 18:44:33 +00:00
Bjoern A. Zeeb
a5f44cd7a1 Introduce spares in the TCP syncache and timewait structures
so that fixed TCP_SIGNATURE handling can later be merged.

This is derived from follow-up work to SVN r183001 posted to
net@ on Sep 13 2008.

Approved by:	re (gjb)
2013-09-21 10:01:51 +00:00
John Baldwin
fd77bbb967 Remove most of the remaining sysctl name list macros. They were only
ever intended for use in sysctl(8) and it has not used them for many
years.

Reviewed by:	bde
Tested by:	exp-run by bdrewery
2013-08-26 18:16:05 +00:00
Mark Johnston
57f6086735 Implement the ip, tcp, and udp DTrace providers. The probe definitions use
dynamic translation so that their arguments match the definitions for
these providers in Solaris and illumos. Thus, existing scripts for these
providers should work unmodified on FreeBSD.

Tested by:	gnn, hiren
MFC after:	1 month
2013-08-25 21:54:41 +00:00
Andrey V. Elsukov
5da0521fce Use new macros to implement ipstat and tcpstat using PCPU counters.
Change the interface of kread_counters() to be similar to kread() in netstat(1).
2013-07-09 09:43:03 +00:00
Andre Oppermann
3c914c547e Allow drivers to specify a maximum TSO length in bytes if they are
limited in the amount of data they can handle at once.

Drivers can set ifp->if_hw_tsomax before calling ether_ifattach() to
change the limit.

The lowest allowable size is IP_MAXPACKET / 8 (8192 bytes), as anything
less wouldn't be very useful anymore.  The upper limit is still at
IP_MAXPACKET (65536 bytes).  Raising it requires further auditing of
the IPv4/v6 code paths, as the length field in the IP header would
overflow, leading to confusion in firewalls and other packet handlers
about the real size of the packet.

The placement into "struct ifnet" is a bit hackish but the best place
that was found.  When the stack/driver boundary is updated it should
be handled in a better way.

Submitted by:	cperciva (earlier version)
Reviewed by:	cperciva
Tested by:	cperciva
MFC after:	1 week (using spare struct members to preserve ABI)
2013-06-03 12:55:13 +00:00
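For illustration, a sketch of how a driver with a constrained DMA engine might use this; the attach routine and the 32 KB limit are hypothetical, while if_hw_tsomax and ether_ifattach() are as described in the commit.

/* Kernel context: <net/if_var.h> for struct ifnet, <net/ethernet.h>. */
static void
example_attach(struct ifnet *ifp, const uint8_t *lladdr)
{
	/* Hypothetical hardware limit of 32 KB per TSO burst; must stay
	 * within [IP_MAXPACKET / 8, IP_MAXPACKET]. */
	ifp->if_hw_tsomax = 32 * 1024;

	/* The limit must be set before ether_ifattach(). */
	ether_ifattach(ifp, lladdr);
}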
Gleb Smirnoff
5923c29332 Merge from projects/counters: TCP/IP stats.
Convert 'struct ipstat' and 'struct tcpstat' to counter(9).

  This speeds up IP forwarding at extreme packet rates, and
makes accounting more precise.

Sponsored by:	Nginx, Inc.
2013-04-08 19:57:21 +00:00
Andre Oppermann
09440655fe Increase the initial CWND to 10 segments as defined in IETF TCPM
draft-ietf-tcpm-initcwnd-05. It explains why the increased initial
window improves the overall performance of many web services without
risking congestion collapse.

As long as it remains a draft it is placed under a sysctl marking it
as experimental:
 net.inet.tcp.experimental.initcwnd10 = 1
When it becomes an official RFC soon the sysctl will be changed to
the RFC number and moved to net.inet.tcp.

This implementation differs from the RFC draft in that it is a bit
more conservative in the case of packet loss on SYN or SYN|ACK because
we haven't reduced the default RTO to 1 second yet.  Also the restart
window isn't yet increased as allowed.  Both will be adjusted with
upcoming changes.

It is enabled by default.  In Linux it has been enabled since kernel 3.0.

MFC after:	2 weeks
2012-10-28 19:47:46 +00:00
Navdeep Parhar
09fe63205c - Updated TOE support in the kernel.
- Stateful TCP offload drivers for Terminator 3 and 4 (T3 and T4) ASICs.
  These are available as t3_tom and t4_tom modules that augment cxgb(4)
  and cxgbe(4) respectively.  The cxgb/cxgbe drivers continue to work as
  usual with or without these extra features.

- iWARP driver for Terminator 3 ASIC (kernel verbs).  T4 iWARP is in the
  works and will follow soon.

Build-tested with make universe.

30s overview
============
What interfaces support TCP offload?  Look for TOE4 and/or TOE6 in the
capabilities of an interface:
# ifconfig -m | grep TOE

Enable/disable TCP offload on an interface (just like any other ifnet
capability):
# ifconfig cxgbe0 toe
# ifconfig cxgbe0 -toe

Which connections are offloaded?  Look for toe4 and/or toe6 in the
output of netstat and sockstat:
# netstat -np tcp | grep toe
# sockstat -46c | grep toe

Reviewed by:	bz, gnn
Sponsored by:	Chelsio communications.
MFC after:	~3 months (after 9.1, and after ensuring MFC is feasible)
2012-06-19 07:34:13 +00:00
Gleb Smirnoff
ef341ee1e3 When we receive an ICMP unreachable / fragmentation needed datagram, we take
the proposed MTU value from it and update the TCP host cache. Then
tcp_mss_update() is called on the corresponding tcpcb. It finds the
just allocated entry in the TCP host cache and updates the MSS on the
tcpcb. And then we do a fast retransmit of what we have in the TCP
send buffer.

This sequence gets broken if the TCP host cache is exhausted. In this
case the allocation fails, and the later call to tcp_mss_update() finds
nothing in the cache. The fast retransmit is done with an unreduced MSS and
is immediately answered by the remote host with new ICMP datagrams, and the
cycle repeats. This ping-pong can go up to wire speed.

To fix this:
- tcp_mss_update() gets a new parameter, mtuoffer, which is like
  offer but needs to have min_protoh subtracted.
- tcp_mtudisc() as a notification method is renamed to tcp_mtudisc_notify().
- tcp_mtudisc() now accepts not a useless error argument, but the proposed
  MTU value, which is passed to tcp_mss_update() as mtuoffer.

Reported by:	az
Reported by:	Andrey Zonov <andrey zonov.org>
Reviewed by:	andre (previous version of patch)
2012-04-16 13:49:03 +00:00
Gleb Smirnoff
9077f38738 Add new socket options: TCP_KEEPINIT, TCP_KEEPIDLE, TCP_KEEPINTVL and
TCP_KEEPCNT, which allow controlling the initial timeout, idle time, idle
re-send interval and idle send count on a per-socket basis.

Reviewed by:	andre, bz, lstewart
2012-02-05 16:53:02 +00:00
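For illustration, a userland sketch that applies all four options to a socket; the values are arbitrary (seconds for the timeouts, a probe count for TCP_KEEPCNT), and TCP_KEEPINIT only matters if set before the connection is established.

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>

static int
tune_keepalive(int s)
{
	int on = 1, init = 30, idle = 60, intvl = 10, cnt = 5;

	if (setsockopt(s, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) == -1 ||
	    setsockopt(s, IPPROTO_TCP, TCP_KEEPINIT, &init, sizeof(init)) == -1 ||
	    setsockopt(s, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) == -1 ||
	    setsockopt(s, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) == -1 ||
	    setsockopt(s, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt)) == -1) {
		perror("setsockopt");
		return (-1);
	}
	return (0);
}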
Andre Oppermann
9ec4a4cca5 Remove the ss_fltsz and ss_fltsz_local sysctls, which have
long been superseded by the RFC 3390 initial CWND sizing.

Also remove the remnants of TCP_METRICS_CWND which used the
TCP hostcache to set the initial CWND in a non-RFC compliant
way.

MFC after:	1 week
2011-10-16 20:06:44 +00:00
Andre Oppermann
e233e2acb3 VNET-virtualize tcp_sendspace/tcp_recvspace and change the
type to int.  A long is not necessary, as the TCP window is
limited to 2**30.  A larger initial window isn't useful.

MFC after:	1 week
2011-10-16 15:08:43 +00:00