Commit Graph

586 Commits

Author SHA1 Message Date
Michael Tuexen
560c058683 The receive buffer autoscaling for TCP is based on a linear growth, which
is acceptable in the congestion avoidance phase, but not during slow start.
The MTU is is also not taken into account.
Use a method instead, which is based on exponential growth working also in
slow start and being independent from the MTU.

This is joint work with rrs@.

Reviewed by:		rrs@, Richard Scheffenegger
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D18375
2019-02-21 10:35:32 +00:00
Michael Tuexen
116ef4d6e7 When handling SYN-ACK segments in the SYN-RCVD state, set tp->snd_wnd
consistently.

This inconsistency was observed when working on the bug reported in
PR 235256, although it does not fix the reported issue. The fix for
the PR will be a separate commit.

PR:			235256
Reviewed by:		rrs@, Richard Scheffenegger
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D19033
2019-02-01 12:33:00 +00:00
Michael Tuexen
bf7fcdb18a Fix the detection of ECN-setup SYN-ACK packets.
RFC 3168 defines an ECN-setup SYN-ACK packet as on with the ECE flags
set and the CWR flags not set. The code was only checking if ECE flag
is set. This patch adds the check to verify that the CWR flags is not
set.

Submitted by:		Richard Scheffenegger
Reviewed by:		tuexen@
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D18996
2019-01-28 12:45:31 +00:00
Michael Tuexen
7dc90a1de0 Fix a bug in the restart window computation of TCP New Reno
When implementing support for IW10, an update in the computation
of the restart window used after an idle phase was missed. To
minimize code duplication, implement the logic in tcp_compute_initwnd()
and call it. This fixes a bug in NewReno, which was not aware of
IW10.

Submitted by:		Richard Scheffenegger
Reviewed by:		tuexen@
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D18940
2019-01-25 13:57:09 +00:00
Michael Tuexen
93899d10b4 The handling of RST segments in the SYN-RCVD state exists in the
code paths. Both are not consistent and the one on the syn cache code
does not conform to the relevant specifications (Page 69 of RFC 793
and Section 4.2 of RFC 5961).

This patch fixes this:
* The sequence numbers checks are fixed as specified on
  page Page 69 RFC 793.
* The sysctl variable net.inet.tcp.insecure_rst is now honoured
  and the behaviour as specified in Section 4.2 of RFC 5961.

Approved by:		re (gjb@)
Reviewed by:		bz@, glebius@, rrs@,
Differential Revision:	https://reviews.freebsd.org/D17595
Sponsored by:		Netflix, Inc.
2018-10-18 19:21:18 +00:00
Andrey V. Elsukov
384a5c3c28 Add INP_INFO_WUNLOCK_ASSERT() macro and use it instead of
INP_INFO_UNLOCK_ASSERT() in TCP-related code. For encapsulated traffic
it is possible, that the code is running in net_epoch_preempt section,
and INP_INFO_UNLOCK_ASSERT() is very strict assertion for such case.

PR:		231428
Reviewed by:	mmacy, tuexen
Approved by:	re (kib)
Differential Revision:	https://reviews.freebsd.org/D17335
2018-10-01 10:46:00 +00:00
Michael Tuexen
5dff1c3845 Enabling the IPPROTO_IPV6 level socket option IPV6_USE_MIN_MTU on a TCP
socket resulted in sending fragmented IPV6 packets.

This is fixes by reducing the MSS to the appropriate value. In addtion,
if the socket option is set before the handshake happens, announce this
MSS to the peer. This is not stricly required, but done since TCP
is conservative.

PR:			173444
Reviewed by:		bz@, rrs@
MFC after:		1 month
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D16796
2018-08-21 14:12:30 +00:00
Randall Stewart
c28440db29 This change represents a substantial restructure of the way we
reassembly inbound tcp segments. The old algorithm just blindly
dropped in segments without coalescing. This meant that every
segment could take up greater and greater room on the linked list
of segments. This of course is now subject to a tighter limit (100)
of segments which in a high BDP situation will cause us to be a
lot more in-efficent as we drop segments beyond 100 entries that
we receive. What this restructure does is cause the reassembly
buffer to coalesce segments putting an emphasis on the two
common cases (which avoid walking the list of segments) i.e.
where we add to the back of the queue of segments and where we
add to the front. We also have the reassembly buffer supporting
a couple of debug options (black box logging as well as counters
for code coverage). These are compiled out by default but can
be added by uncommenting the defines.

Sponsored by:	Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D16626
2018-08-20 12:43:18 +00:00
Michael Tuexen
8db239dc6b Fix some TCP fast open issues.
The following issues are fixed:
* Whenever a TCP server with TCP fast open enabled, calls accept(),
  recv(), send(), and close() before the TCP-ACK segment has been received,
  the TCP connection is just dropped and the reception of the TCP-ACK
  segment triggers the sending of a TCP-RST segment.
* Whenever a TCP server with TCP fast open enabled, calls accept(), recv(),
  send(), send(), and close() before the TCP-ACK segment has been received,
  the first byte provided in the second send call is not transferred.
* Whenever a TCP client with TCP fast open enabled calls sendto() followed
  by close() the TCP connection is just dropped.

Reviewed by:		jtl@, kbowling@, rrs@
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D16485
2018-07-30 20:35:50 +00:00
Michael Tuexen
6138da62a9 Add missing send/recv dtrace probes for TCP.
These missing probe are mostly in the syncache and timewait code.

Reviewed by:		markj@, rrs@
MFC after:		1 month
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D16369
2018-07-30 20:13:38 +00:00
Michael Tuexen
a026a53a76 Use appropriate MSS value when populating the TCP FO client cookie cache
When a client receives a SYN-ACK segment with a TFP fast open cookie,
but without an MSS option, an MSS value from uninitialised stack memory is used.
This patch ensures that in case no MSS option is included in the SYN-ACK,
the appropriate value as given in RFC 7413 is used.

Reviewed by:		kbowling@
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D16175
2018-07-10 10:42:48 +00:00
Matt Macy
6573d7580b epoch(9): allow preemptible epochs to compose
- Add tracker argument to preemptible epochs
- Inline epoch read path in kernel and tied modules
- Change in_epoch to take an epoch as argument
- Simplify tfb_tcp_do_segment to not take a ti_locked argument,
  there's no longer any benefit to dropping the pcbinfo lock
  and trying to do so just adds an error prone branchfest to
  these functions
- Remove cases of same function recursion on the epoch as
  recursing is no longer free.
- Remove the the TAILQ_ENTRY and epoch_section from struct
  thread as the tracker field is now stack or heap allocated
  as appropriate.

Tested by: pho and Limelight Networks
Reviewed by: kbowling at llnw dot com
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16066
2018-07-04 02:47:16 +00:00
Matt Macy
9e58ff6ff9 convert inpcbinfo hash and info rwlocks to epoch + mutex
- Convert inpcbinfo info & hash locks to epoch for read and mutex for write
- Garbage collect code that handled INP_INFO_TRY_RLOCK failures as
  INP_INFO_RLOCK which can no longer fail

When running 64 netperfs sending minimal sized packets on a 2x8x2 reduces
unhalted core cycles samples in rwlock rlock/runlock in udp_send from 51% to
3%.

Overall packet throughput rate limited by CPU affinity and NIC driver design
choices.

On the receiver unhalted core cycles samples in in_pcblookup_hash went from
13% to to 1.6%

Tested by LLNW and pho@

Reviewed by: jtl
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D15686
2018-06-19 01:54:00 +00:00
Matt Macy
10d20c84ed Fix spurious retransmit recovery on low latency networks
TCP's smoothed RTT (SRTT) can be much larger than an actual observed RTT. This can be either because of hz restricting the calculable RTT to 10ms in VMs or 1ms using the default 1000hz or simply because SRTT recently incorporated a larger value.

If an ACK arrives before the calculated badrxtwin (now + SRTT):
tp->t_badrxtwin = ticks + (tp->t_srtt >> (TCP_RTT_SHIFT + 1));

We'll erroneously reset snd_una to snd_max. If multiple segments were dropped and this happens repeatedly the transmit rate will be limited to 1MSS per RTO until we've retransmitted all drops.

Reported by:	rstone
Reviewed by:	hiren, transport
Approved by:	sbruno
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D8556
2018-05-08 02:22:34 +00:00
Jonathan T. Looney
2529f56ed3 Add the "TCP Blackbox Recorder" which we discussed at the developer
summits at BSDCan and BSDCam in 2017.

The TCP Blackbox Recorder allows you to capture events on a TCP connection
in a ring buffer. It stores metadata with the event. It optionally stores
the TCP header associated with an event (if the event is associated with a
packet) and also optionally stores information on the sockets.

It supports setting a log ID on a TCP connection and using this to correlate
multiple connections that share a common log ID.

You can log connections in different modes. If you are doing a coordinated
test with a particular connection, you may tell the system to put it in
mode 4 (continuous dump). Or, if you just want to monitor for errors, you
can put it in mode 1 (ring buffer) and dump all the ring buffers associated
with the connection ID when we receive an error signal for that connection
ID. You can set a default mode that will be applied to a particular ratio
of incoming connections. You can also manually set a mode using a socket
option.

This commit includes only basic probes. rrs@ has added quite an abundance
of probes in his TCP development work. He plans to commit those soon.

There are user-space programs which we plan to commit as ports. These read
the data from the log device and output pcapng files, and then let you
analyze the data (and metadata) in the pcapng files.

Reviewed by:	gnn (previous version)
Obtained from:	Netflix, Inc.
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D11085
2018-03-22 09:40:08 +00:00
Patrick Kelsey
18a7530938 Greatly reduce the number of #ifdefs supporting the TCP_RFC7413 kernel option.
The conditional compilation support is now centralized in
tcp_fastopen.h and tcp_var.h. This doesn't provide the minimum
theoretical code/data footprint when TCP_RFC7413 is disabled, but
nearly all the TFO code should wind up being removed by the optimizer,
the additional footprint in the syncache entries is a single pointer,
and the additional overhead in the tcpcb is at the end of the
structure.

This enables the TCP_RFC7413 kernel option by default in amd64 and
arm64 GENERIC.

Reviewed by:	hiren
MFC after:	1 month
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D14048
2018-02-26 03:03:41 +00:00
Patrick Kelsey
c560df6f12 This is an implementation of the client side of TCP Fast Open (TFO)
[RFC7413]. It also includes a pre-shared key mode of operation in
which the server requires the client to be in possession of a shared
secret in order to successfully open TFO connections with that server.

The names of some existing fastopen sysctls have changed (e.g.,
net.inet.tcp.fastopen.enabled -> net.inet.tcp.fastopen.server_enable).

Reviewed by:	tuexen
MFC after:	1 month
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D14047
2018-02-26 02:53:22 +00:00
Andrey V. Elsukov
b2fe54bc8b Reinitialize IP header length after checksum calculation. It is used
later by TCP-MD5 code.

This fixes the problem with broken TCP-MD5 over IPv4 when NIC has
disabled TCP checksum offloading.

PR:		223835
MFC after:	1 week
2018-02-10 10:13:17 +00:00
Pedro F. Giffuni
51369649b0 sys: further adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
2017-11-20 19:43:44 +00:00
Gleb Smirnoff
3bdf4c4274 Declare more TCP globals in tcp_var.h, so that alternative TCP stacks
can use them.  Gather all TCP tunables in tcp_var.h in one place and
alphabetically sort them, to ease maintainance of the list.

Don't copy and paste declarations in tcp_stacks/fastpath.c.
2017-10-11 20:36:09 +00:00
Michael Tuexen
4da9052a05 Avoid TCP log messages which are false positives.
The check for timestamps are too early to handle SYN-ACK correctly.
So move it down after the corresponing processing has been done.

PR:		216832
Obtained from:	antonfb@hesiod.org
MFC after:	1 week
2017-08-23 14:50:08 +00:00
Sean Bruno
43053c125a Revert r307901 - Inform CC modules about loss events.
This was discussed between various transport@ members and it was
requested to be reverted and discussed.

Submitted by:	Kevin Bowling <kevin.bowling@kev009.com>
Reported by:	lawrence
Reviewed by:	hiren
Sponsored by:	Limelight Networks
2017-07-25 15:08:52 +00:00
Sean Bruno
5d53981a18 Revert r308180 - Set slow start threshold more accurrately on loss ...
This was discussed between various transport@ members and it was
requested to be reverted and discussed.

Submitted by:	kevin
Reported by:	lawerence
Reviewed by:	hiren
2017-07-25 15:03:05 +00:00
Michael Tuexen
98732609d5 Improve comments to describe what the code does.
Reported by:		jtl
Sponsored by:		Netflix, Inc.
2017-06-01 15:11:18 +00:00
Michael Tuexen
ebfd753408 When a SYN-ACK is received in SYN-SENT state, RFC 793 requires the
validation of SEG.ACK as the first step. If the ACK is not acceptable,
a RST segment should be sent and the segment should be dropped.
Up to now, the segment was partially processed.
This patch moves the check for the SEG.ACK validation up to the front
as required.
Reviewed by:		hiren, gnn
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D10424
2017-04-26 06:20:58 +00:00
Michael Tuexen
013f4df643 The sysctl variable net.inet.tcp.drop_synfin is not honored in all states,
for example not in SYN-SENT.
This patch adds code to check the sysctl variable in other states than
LISTEN.
Thanks to ae and gnn for providing comments.
Reviewed by:		gnn
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D9894
2017-04-12 20:27:15 +00:00
Steven Hartland
e44c1887fd Use estimated RTT for receive buffer auto resizing instead of timestamps
Switched from using timestamps to RTT estimates when performing TCP receive
buffer auto resizing, as not all hosts support / enable TCP timestamps.

Disabled reset of receive buffer auto scaling when not in bulk receive mode,
which gives an extra 20% performance increase.

Also extracted auto resizing to a common method shared between standard and
fastpath modules.

With this AWS S3 downloads at ~17ms latency on a 1Gbps connection jump from
~3MB/s to ~100MB/s using the default settings.

Reviewed by:    lstewart, gnn
MFC after:      2 weeks
Relnotes:       Yes
Sponsored by:   Multiplay
Differential Revision:  https://reviews.freebsd.org/D9668
2017-04-10 08:19:35 +00:00
Warner Losh
fbbd9655e5 Renumber copyright clause 4
Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.

Submitted by:	Jan Schaumann <jschauma@stevens.edu>
Pull Request:	https://github.com/freebsd/freebsd/pull/96
2017-02-28 23:42:47 +00:00
Ryan Stone
5ede40dcf2 Don't zero out srtt after excess retransmits
If the TCP stack has retransmitted more than 1/4 of the total
number of retransmits before a connection drop, it decides that
its current RTT estimate is hopelessly out of date and decides
to recalculate it from scratch starting with the next ACK.

Unfortunately, it implements this by zeroing out the current RTT
estimate.  Drop this hack entirely, as it makes it significantly more
difficult to debug connection issues.  Instead check for excessive
retransmits at the point where srtt is updated from an ACK being
received.  If we've exceeded 1/4 of the maximum retransmits,
discard the previous srtt estimate and replace it with the latest
rtt measurement.

Differential Revision:	https://reviews.freebsd.org/D9519
Reviewed by:	gnn
Sponsored by:	Dell EMC Isilon
2017-02-11 17:05:08 +00:00
Andrey V. Elsukov
fcf596178b Merge projects/ipsec into head/.
Small summary
 -------------

o Almost all IPsec releated code was moved into sys/netipsec.
o New kernel modules added: ipsec.ko and tcpmd5.ko. New kernel
  option IPSEC_SUPPORT added. It enables support for loading
  and unloading of ipsec.ko and tcpmd5.ko kernel modules.
o IPSEC_NAT_T option was removed. Now NAT-T support is enabled by
  default. The UDP_ENCAP_ESPINUDP_NON_IKE encapsulation type
  support was removed. Added TCP/UDP checksum handling for
  inbound packets that were decapsulated by transport mode SAs.
  setkey(8) modified to show run-time NAT-T configuration of SA.
o New network pseudo interface if_ipsec(4) added. For now it is
  build as part of ipsec.ko module (or with IPSEC kernel).
  It implements IPsec virtual tunnels to create route-based VPNs.
o The network stack now invokes IPsec functions using special
  methods. The only one header file <netipsec/ipsec_support.h>
  should be included to declare all the needed things to work
  with IPsec.
o All IPsec protocols handlers (ESP/AH/IPCOMP protosw) were removed.
  Now these protocols are handled directly via IPsec methods.
o TCP_SIGNATURE support was reworked to be more close to RFC.
o PF_KEY SADB was reworked:
  - now all security associations stored in the single SPI namespace,
    and all SAs MUST have unique SPI.
  - several hash tables added to speed up lookups in SADB.
  - SADB now uses rmlock to protect access, and concurrent threads
    can do SA lookups in the same time.
  - many PF_KEY message handlers were reworked to reflect changes
    in SADB.
  - SADB_UPDATE message was extended to support new PF_KEY headers:
    SADB_X_EXT_NEW_ADDRESS_SRC and SADB_X_EXT_NEW_ADDRESS_DST. They
    can be used by IKE daemon to change SA addresses.
o ipsecrequest and secpolicy structures were cardinally changed to
  avoid locking protection for ipsecrequest. Now we support
  only limited number (4) of bundled SAs, but they are supported
  for both INET and INET6.
o INPCB security policy cache was introduced. Each PCB now caches
  used security policies to avoid SP lookup for each packet.
o For inbound security policies added the mode, when the kernel does
  check for full history of applied IPsec transforms.
o References counting rules for security policies and security
  associations were changed. The proper SA locking added into xform
  code.
o xform code was also changed. Now it is possible to unregister xforms.
  tdb_xxx structures were changed and renamed to reflect changes in
  SADB/SPDB, and changed rules for locking and refcounting.

Reviewed by:	gnn, wblock
Obtained from:	Yandex LLC
Relnotes:	yes
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D9352
2017-02-06 08:49:57 +00:00
George V. Neville-Neil
2b9c998413 Fix DTrace TCP tracepoints to not use mtod() as it is both unnecessary and
dangerous.  Those wanting data from an mbuf should use DTrace itself to get
the data.

PR:	203409
Reviewed by:	hiren
MFC after:	1 week
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D9035
2017-01-04 02:19:13 +00:00
Michael Tuexen
4ddd5aadea Fix the handling of TCP FIN-segments in the CLOSED state
When a TCP segment with the FIN bit set was received in the CLOSED state,
a TCP RST-ACK-segment is sent. When computing SEG.ACK for this, the
FIN counts as one byte. This accounting was missing and is fixed by this
patch.

Reviewed by:		hiren
MFC after:		1 month
Sponsored by:		Netflix, Inc.
Differential Revision:	https://svn.freebsd.org/base/head
2016-12-02 08:02:31 +00:00
Michael Tuexen
35dfb8cb68 Ensure that TCP state changes to state-closing are reported via dtrace.
This does not cover state changes from TIME-WAIT.

Reviewed by:		gnn
MFC after:		3 weeks
Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D8443
2016-11-19 14:45:08 +00:00
Michael Tuexen
6779a1a101 Notify the use via setting errno when a TCP RST segment is received
either in the CLOSING or LAST-ACK state.

Reviewed by:		hiren
MFC after:		3 weeks
Sponsored by:		Netflix
Differential Revision:	https://reviews.freebsd.org/D8371
2016-11-17 08:15:02 +00:00
Hiren Panchasara
e04310d59b Set slow start threshold more accurately on loss to be flightsize/2 instead of
cwnd/2 as recommended by RFC5681. (spotted by mmacy at nextbsd dot org)

Restore pre-r307901 behavior of aligning ssthresh/cwnd on mss boundary. (spotted
by slawa at zxy dot spb dot ru)

Tested by:	    dim, Slawa <slawa at zxy dot spb dot ru>
MFC after:	    1 month
X-MFC with:	    r307901
Sponsored by:	    Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D8349
2016-11-01 21:08:37 +00:00
Hiren Panchasara
4e7f755377 FreeBSD tcp stack used to inform respective congestion control module about the
loss event but not use or obay the recommendations i.e. values set by it in some
cases.

Here is an attempt to solve that confusion by following relevant RFCs/drafts.
Stack only sets congestion window/slow start threshold values when there is no
CC module availalbe to take that action. All CC modules are inspected and
updated when needed to take appropriate action on loss.

tcp_stacks/fastpath module has been updated to adapt these changes.

Note: Probably, the most significant change would be to not bring congestion
window down to 1MSS on a loss signaled by 3-duplicate acks and letting
respective CC decide that value.

In collaboration with:	Matt Macy <mmacy at nextbsd dot org>
Discussed on:		transport@ mailing list
Reviewed by:		jtl
MFC after:		1 month
Sponsored by:		Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D8225
2016-10-25 05:45:47 +00:00
Hiren Panchasara
dd13b7d387 Undo r307899. It needs a bit more work and proper commit log. 2016-10-25 05:07:51 +00:00
Hiren Panchasara
95d8236011 In Collaboration with: Matt Macy <mmacy at nextbsd dot com>
Reviewed by:		    jtl
Sponsored by:		    Limelight Networks
Differential Revision:	    https://reviews.freebsd.org/D8225
2016-10-25 05:03:33 +00:00
Julien Charbon
f5cf1e5f5a Fix a double-free when an inp transitions to INP_TIMEWAIT state
after having been dropped.

This fixes enforces in_pcbdrop() logic in tcp_input():

"in_pcbdrop() is used by TCP to mark an inpcb as unused and avoid future packet
delivery or event notification when a socket remains open but TCP has closed."

PR:			203175
Reported by:		Palle Girgensohn, Slawa Olhovchenkov
Tested by:		Slawa Olhovchenkov
Reviewed by:		Slawa Olhovchenkov
Approved by:		gnn, Slawa Olhovchenkov
Differential Revision:	https://reviews.freebsd.org/D8211
MFC after:		1 week
Sponsored by:		Verisign, inc
2016-10-18 07:16:49 +00:00
Hiren Panchasara
784ce8fad2 Make sure tcp_mss() has the same check as tcp_mss_update() to have t_maxseg set
to at least 64.

This is still just a coverup to avoid kernel panic and not an actual fix.

PR:			213232
Reviewed by:		glebius
MFC after:		1 week
Sponsored by:		Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D8272
2016-10-18 02:40:25 +00:00
Patrick Kelsey
09c305eb65 Fix cases where the TFO pending counter would leak references, and eventually, memory.
Also renamed some tfo labels and added/reworked comments for clarity.

Based on an initial patch from jtl.

PR: 213424
Reviewed by:	jtl
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D8235
2016-10-15 01:41:28 +00:00
Jonathan T. Looney
6d172f58a2 The code currently resets the keepalive timer each time a packet is
received on a TCP session that has entered the ESTABLISHED state. This
results in a lot of calls to reset the keepalive timer.

This patch changes the behavior so we set the keepalive timer for the
keepalive idle time (TP_KEEPIDLE). When the keepalive timer fires, it will
first check to see if the session has been idle for TP_KEEPIDLE ticks. If
not, it will reschedule the keepalive timer for the time the session will
have been idle for TP_KEEPIDLE ticks.

For a session with regular communication, the keepalive timer should fire
approximately once every TP_KEEPIDLE ticks. For sessions with irregular
communication, the keepalive timer might fire more often. But, the
disruption from a periodic keepalive timer should be less than the regular
cost of resetting the keepalive timer on every packet.

(FWIW, this change saved approximately 1.73% of the busy CPU cycles on a
particular test system with a heavy TCP output load. Of course, the
actual impact is very specific to the particular hardware and workload.)

Reviewed by:	gallatin, rrs
MFC after:	2 weeks
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D8243
2016-10-14 14:57:43 +00:00
Jonathan T. Looney
68bd7ed102 The TFO server-side code contains some changes that are not conditioned on
the TCP_RFC7413 kernel option. This change removes those few instructions
from the packet processing path.

While not strictly necessary, for the sake of consistency, I applied the
new IS_FASTOPEN macro to all places in the packet processing path that
used the (t_flags & TF_FASTOPEN) check.

Reviewed by:	hiren
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D8219
2016-10-12 19:06:50 +00:00
Jonathan T. Looney
4527476029 Currently, when tcp_input() receives a packet on a session that matches a
TCPCB, it checks (so->so_options & SO_ACCEPTCONN) to determine whether or
not the socket is a listening socket. However, this causes the code to
access a different cacheline. If we first check if the socket is in the
LISTEN state, we can avoid accessing so->so_options when processing packets
received for ESTABLISHED sessions.

If INVARIANTS is defined, the code still needs to access both variables to
check that so->so_options is consistent with the state.

Reviewed by:	gallatin
MFC after:	1 week
Sponsored by:	Netflix
2016-10-12 02:30:33 +00:00
Jonathan T. Looney
bd79708dbf In the TCP stack, the hhook(9) framework provides hooks for kernel modules
to add actions that run when a TCP frame is sent or received on a TCP
session in the ESTABLISHED state. In the base tree, this functionality is
only used for the h_ertt module, which is used by the cc_cdg, cc_chd, cc_hd,
and cc_vegas congestion control modules.

Presently, we incur overhead to check for hooks each time a TCP frame is
sent or received on an ESTABLISHED TCP session.

This change adds a new compile-time option (TCP_HHOOK) to determine whether
to include the hhook(9) framework for TCP. To retain backwards
compatibility, I added the TCP_HHOOK option to every configuration file that
already defined "options INET". (Therefore, this patch introduces no
functional change. In order to see a functional difference, you need to
compile a custom kernel without the TCP_HHOOK option.) This change will
allow users to easily exclude this functionality from their kernel, should
they wish to do so.

Note that any users who use a custom kernel configuration and use one of the
congestion control modules listed above will need to add the TCP_HHOOK
option to their kernel configuration.

Reviewed by:	rrs, lstewart, hiren (previous version), sjg (makefiles only)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D8185
2016-10-12 02:16:42 +00:00
Jonathan T. Looney
3ac125068a Remove "long" variables from the TCP stack (not including the modular
congestion control framework).

Reviewed by:	gnn, lstewart (partial)
Sponsored by:	Juniper Networks, Netflix
Differential Revision:	(multiple)
Tested by:	Limelight, Netflix
2016-10-06 16:28:34 +00:00
Kurt Lidl
1d7ee746e6 Properly preserve ip_tos bits for IPv4 packets
Restructure code slightly to save ip_tos bits earlier.  Fix the bug
where the ip_tos field is zeroed out before assigning to the iptos
variable. Restore the ip_tos and ip_ver fields only if they have
been zeroed during the pseudo-header checksum calculation.

Reviewed by:	cem, gnn, hiren
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D8053
2016-09-29 19:45:24 +00:00
Lawrence Stewart
4b7b743c16 Pass the number of segments coalesced by LRO up the stack by repurposing the
tso_segsz pkthdr field during RX processing, and use the information in TCP for
more correct accounting and as a congestion control input. This is only a start,
and an audit of other uses for the data is left as future work.

Reviewed by:	gallatin, rrs
Sponsored by:	Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D7564
2016-08-25 13:33:32 +00:00
Andrey V. Elsukov
4c10540274 Cleanup unneded include "opt_ipfw.h".
It was used for conditional build IPFIREWALL_FORWARD support.
But IPFIREWALL_FORWARD option was removed a long time ago.
2016-06-09 05:48:34 +00:00
Don Lewis
883054b4c3 Change net.inet.tcp.ecn.enable sysctl mib from a binary off/on
control to a three way setting.
  0 - Totally disable ECN. (no change)
  1 - Enable ECN if incoming connections request it.  Outgoing
      connections will request ECN.  (no change from present != 0 setting)
  2 - Enable ECN if incoming connections request it.  Outgoing
      conections will not request ECN.

Change the default value of net.inet.tcp.ecn.enable from 0 to 2.

Linux version 2.4.20 and newer, Solaris, and Mac OS X 10.5 and newer have
similar capabilities.  The actual values above match Linux, and the default
matches the current Linux default.

Reviewed by:	eadler
MFC after:	1 month
MFH:		yes
Sponsored by:	https://reviews.freebsd.org/D6386
2016-05-19 22:20:35 +00:00