36 Commits

Author SHA1 Message Date
Sean Bruno
d7fb35d13a Update tcp_lro with tested bugfixes from Netflix and LLNW:
    rrs - Let's make the LRO code look for true dup-acks and window update acks
          fly on through and combine.
    rrs - Make the LRO engine a bit more aware of ack-only seq space. Let's not
          have it incorrectly wipe out newer acks for older acks when we have
          out-of-order acks (common in wifi environments).
    jeggleston - LRO eating window updates

Based on all of the above I think we are RFC compliant doing it this way:

https://tools.ietf.org/html/rfc1122

section 4.2.2.16

"Note that TCP has a heuristic to select the latest window update despite
possible datagram reordering; as a result, it may ignore a window update with
a smaller window than previously offered if neither the sequence number nor the
acknowledgment number is increased."

Submitted by:	Kevin Bowling <kevin.bowling@kev009.com>
Reviewed by:	rstone gallatin
Sponsored by:	Netflix and Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D14540
2018-03-09 00:08:43 +00:00
Pedro F. Giffuni
fe267a5590 sys: general adoption of SPDX licensing ID tags.
Mainly focus on files that use the BSD 2-Clause license. However, the tool I
was using misidentified many licenses, so this was mostly a manual (and
error-prone) task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
supersede or replace the license texts.

No functional change intended.
2017-11-27 15:23:17 +00:00
Navdeep Parhar
f8acc03ef1 Flush the LRO ctrl as soon as lro_mbufs fills up. There is no need to
wait for the next enqueue from the driver.

Reviewed by:	gnn@, hselasky@, gallatin@
MFC after:	1 week
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D10432
2017-04-24 22:35:00 +00:00
Navdeep Parhar
ea9a92f112 Frames that are not considered for LRO should not be counted in LRO statistics.
Reviewed by:	gnn@, hselasky@, gallatin@
MFC after:	1 week
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D10430
2017-04-24 22:31:56 +00:00
Navdeep Parhar
b0ca71f0a0 Free lro_hash unconditionally, just like lro_mbuf_data a few lines
later.  Fix whitespace nit while here.
2017-04-19 23:06:07 +00:00
Navdeep Parhar
a3927369fa Do not leak lro_hash on failure to allocate lro_mbuf_data.
MFC after:	1 week
2017-04-19 22:27:26 +00:00
Navdeep Parhar
3d24e03800 Remove redundant assignment. 2017-04-19 22:20:41 +00:00
Lawrence Stewart
4b7b743c16 Pass the number of segments coalesced by LRO up the stack by repurposing the
tso_segsz pkthdr field during RX processing, and use the information in TCP for
more correct accounting and as a congestion control input. This is only a start,
and an audit of other uses for the data is left as future work.

Reviewed by:	gallatin, rrs
Sponsored by:	Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D7564
2016-08-25 13:33:32 +00:00
Sepherosa Ziehau
8452c1b345 tcp/lro: Make # of LRO entries tunable
Reviewed by:	hps, gallatin
Obtained from:	rrs, gallatin
MFC after:	2 weeks
Sponsored by:	Netflix (rrs, gallatin), Microsoft (sephe)
Differential Revision:	https://reviews.freebsd.org/D7499
2016-08-16 06:40:27 +00:00
Sepherosa Ziehau
b9ec6f0b02 tcp/lro: If timestamps mismatch or it's a FIN, force flush.
This keeps the segments/ACK/FIN delivery order.

Before this patch, it was observed: if A sent FIN immediately after
an ACK, B would deliver FIN first to the TCP stack, then the ACK.
This out-of-order delivery causes B to send one unnecessary ACK.

Reviewed by:	gallatin, hps
Obtained from:  rrs, gallatin
Sponsored by:	Netflix (rrs, gallatin), Microsoft (sephe)
Differential Revision:	https://reviews.freebsd.org/D7415
2016-08-05 09:08:00 +00:00
Sepherosa Ziehau
05cde7efa6 tcp/lro: Implement hash table for LRO entries.
This significantly improves HTTP workload performance and reduces
HTTP workload latency.

Reviewed by:	rrs, gallatin, hps
Obtained from:	rrs, gallatin
Sponsored by:	Netflix (rrs, gallatin), Microsoft (sephe)
Differential Revision:	https://reviews.freebsd.org/D6689
2016-08-02 06:36:47 +00:00
Hans Petter Selasky
ec6689059d Use insertion sort instead of bubble sort in TCP LRO.
Replacing the bubble sort with insertion sort gives an 80% reduction
in runtime on average, with randomized keys, for small partitions.

If the keys are pre-sorted, insertion sort runs in linear time, and
even if the keys are reversed, insertion sort is faster than bubble
sort, although not by much.

Update comment describing "tcp_lro_sort()" while at it.

Differential Revision:	https://reviews.freebsd.org/D6619
Sponsored by:	Mellanox Technologies
Tested by:	Netflix
Suggested by:	Pieter de Goeje <pieter@degoeje.nl>
Reviewed by:	ed, gallatin, gnn, transport
2016-06-03 08:35:07 +00:00
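
The insertion sort described in ec6689059d can be sketched in user-space C as below. This is a minimal illustration, not the kernel's code: `struct sort_elem` and `insertion_sort()` are hypothetical names, and the element layout is an assumption. On pre-sorted input the inner loop never runs, which is the linear-time case the commit message mentions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a queued LRO element; the real kernel
 * structure is not reproduced here.
 */
struct sort_elem {
	uint64_t key;	/* sorting key */
	int	 tag;	/* placeholder for the payload */
};

/* Sort a[0..n-1] into ascending key order using insertion sort. */
static void
insertion_sort(struct sort_elem *a, size_t n)
{
	assert(a != NULL || n == 0);

	for (size_t i = 1; i < n; i++) {
		struct sort_elem tmp = a[i];
		size_t j = i;

		/* Shift every larger element one slot to the right. */
		while (j > 0 && a[j - 1].key > tmp.key) {
			a[j] = a[j - 1];
			j--;
		}
		a[j] = tmp;
	}
}
```

Because the comparison is strict, elements with equal keys keep their original relative order.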
Hans Petter Selasky
fc271df341 Use optimised complexity safe sorting routine instead of the kernel's
"qsort()".

The kernel's "qsort()" routine can in the worst case perform O(N*N)
comparisons before the input array is sorted. It can also recurse a
significant number of times, using up the kernel's interrupt thread
stack.

The custom sorting routine takes advantage of the fact that the sorting
key is only 64 bits. Based on set and cleared bits in the sorting key it
partitions the array until it is sorted. This process has a recursion
limit of 64 times, due to the number of set and cleared bits which can
occur. Compiled with -O2 the sorting routine was measured to use
64 bytes of stack. Multiplying this by 64 gives a maximum stack
consumption of 4096 bytes for AMD64. The same bound applies to the
execution time: the array to be sorted will not be traversed more than
64 times.

When serving roughly 80Gb/s with 80K TCP connections, the old method
consisting of "qsort()" and "tcp_lro_mbuf_compare_header()" used 1.4%
CPU, while the new "tcp_lro_sort()" used 1.1% for LRO related sorting
as measured by Intel Vtune. The testing was done using a sysctl to
toggle between "qsort()" and "tcp_lro_sort()".

Differential Revision:	https://reviews.freebsd.org/D6472
Sponsored by:	Mellanox Technologies
Tested by:	Netflix
Reviewed by:	gallatin, rrs, sephe, transport
2016-05-26 11:10:31 +00:00
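
The bit-partitioning idea in fc271df341 can be illustrated with a most-significant-bit-first binary partition sort. The sketch below is an assumption about the general technique, not a copy of "tcp_lro_sort()": the names `bit_partition_sort()` and `sort_u64()` are hypothetical, and the kernel routine may choose its partition bits differently. It shows why the recursion depth is bounded by the key width rather than by the array length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Partition keys[0..n-1] on one bit: keys with the bit clear end up in
 * front, keys with the bit set end up at the back. Then recurse into
 * both parts using the next lower bit. Each level consumes one bit, so
 * the recursion depth never exceeds the 64-bit key width.
 */
static void
bit_partition_sort(uint64_t *keys, size_t n, int bit)
{
	size_t lo = 0, hi = n;

	if (n < 2 || bit < 0)
		return;
	assert(keys != NULL);

	while (lo < hi) {
		if ((keys[lo] >> bit) & 1) {
			/* Bit set: swap the key into the back part. */
			uint64_t tmp = keys[lo];

			hi--;
			keys[lo] = keys[hi];
			keys[hi] = tmp;
		} else {
			/* Bit clear: the key is already in the front part. */
			lo++;
		}
	}
	bit_partition_sort(keys, lo, bit - 1);
	bit_partition_sort(keys + lo, n - lo, bit - 1);
}

/* Sort 64-bit keys in ascending order, starting at the top bit. */
static void
sort_u64(uint64_t *keys, size_t n)
{
	bit_partition_sort(keys, n, 63);
}
```

Each level of recursion touches every key at most once, which matches the commit's bound of at most 64 passes over the array.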
Sepherosa Ziehau
51e3c20d36 tcp/lro: Refactor the active list operation.
This eases further work on the active list, e.g. a hash table.

Reviewed by:	gallatin, rrs (earlier version)
Sponsored by:	Microsoft OSTC
Differential Revision:	https://reviews.freebsd.org/D6137
2016-05-03 08:13:25 +00:00
Sepherosa Ziehau
9b436b180c tcp/lro: Fix more typos.
Noticed by:	hiren
MFC after:	1 week
Sponsored by:	Microsoft OSTC
2016-04-28 01:43:18 +00:00
Sepherosa Ziehau
9e3db01282 tcp/lro: Fix typo.
MFC after:	1 week
Sponsored by:	Microsoft OSTC
2016-04-27 09:40:55 +00:00
Sepherosa Ziehau
1ea448225c tcp/lro: Change SLIST to LIST, so that removing an entry is O(1)
This is kinda critical to the performance when the CPU is slow and
network bandwidth is high, e.g. in the hypervisor.

Reviewed by:	rrs, gallatin, Dexuan Cui <decui microsoft com>
Sponsored by:	Microsoft OSTC
Differential Revision:	https://reviews.freebsd.org/D5765
2016-04-01 06:43:05 +00:00
Sepherosa Ziehau
6dd38b8716 tcp/lro: Use tcp_lro_flush_all in device drivers to avoid code duplication
And factor out tcp_lro_rx_done, which deduplicates the same logic with
netinet/tcp_lro.c

Reviewed by:	gallatin (1st version), hps, zbb, np, Dexuan Cui <decui microsoft com>
Sponsored by:	Microsoft OSTC
Differential Revision:	https://reviews.freebsd.org/D5725
2016-04-01 06:28:33 +00:00
Sepherosa Ziehau
489f0c3c17 tcp/lro: Return TCP_LRO_NO_ENTRIES if we are short of LRO entries.
So that callers could react accordingly.

Reviewed by:	gallatin (no objection)
MFC after:	1 week
Sponsored by:	Microsoft OSTC
Differential Revision:	https://reviews.freebsd.org/D5695
2016-03-25 02:54:13 +00:00
Sepherosa Ziehau
7ae3d4bf54 tcp/lro: Allow drivers to set the TCP ACK/data segment aggregation limit
ACK aggregation limit is append count based, while the TCP data segment
aggregation limit is length based.  Unless the network driver sets these
two limits, it's a NO-OP.

Reviewed by:	adrian, gallatin (previous version), hselasky (previous version)
Approved by:	adrian (mentor)
MFC after:	1 week
Sponsored by:	Microsoft OSTC
Differential Revision:	https://reviews.freebsd.org/D5185
2016-02-18 04:58:34 +00:00
Hans Petter Selasky
3e9470b721 Use a pair of ifs when comparing the 32-bit flowid integers so that
the sign bit doesn't cause an overflow. The overflow manifests itself
as a sorting index wrap around in the middle of the sorted array,
which is not a problem for the LRO code, but might be a problem for
the logic inside qsort().

Reviewed by:		gnn@
Sponsored by:		Mellanox Technologies
Differential Revision:	https://reviews.freebsd.org/D5239
2016-02-11 10:03:50 +00:00
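
The "pair of ifs" comparison from 3e9470b721 can be sketched as follows. Returning the difference of two 32-bit values as an int can misreport the order once the difference no longer fits in a signed integer; two explicit comparisons avoid that. The function names here are hypothetical and the comparator works on bare `uint32_t` values rather than the kernel's mbuf structures.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Three-way compare of two 32-bit flow IDs without any subtraction. */
static int
flowid_cmp(uint32_t a, uint32_t b)
{
	if (a > b)
		return (1);
	if (a < b)
		return (-1);
	return (0);
}

/* qsort()-compatible wrapper around flowid_cmp(). */
static int
flowid_qsort_cmp(const void *pa, const void *pb)
{
	assert(pa != NULL && pb != NULL);

	return (flowid_cmp(*(const uint32_t *)pa, *(const uint32_t *)pb));
}
```

With this comparator, a call such as `qsort(ids, n, sizeof(ids[0]), flowid_qsort_cmp)` yields a consistent total order even when flow IDs have the top bit set.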
Gleb Smirnoff
8ec07310fa These files were getting sys/malloc.h and vm/uma.h with header pollution
via sys/mbuf.h
2016-02-01 17:41:21 +00:00
Hans Petter Selasky
e936121d31 Add optimizing LRO wrapper:
- Add optimizing LRO wrapper which pre-sorts all incoming packets
  according to the hash type and flowid. This prevents exhaustion of
  the LRO entries due to too many connections at the same time.
  Testing using a larger number of higher bandwidth TCP connections
  showed that the incoming ACK packet aggregation rate increased from
  ~1.3:1 to almost 3:1. Another test showed that for a number of TCP
  connections greater than 16 per hardware receive ring, where 8 TCP
  connections was the LRO active entry limit, there was a significant
  improvement in throughput due to being able to fully aggregate more
  than 8 TCP streams. For very few very high bandwidth TCP streams, the
  optimizing LRO wrapper will add CPU usage instead of reducing CPU
  usage. This is expected. Network drivers which want to use the
  optimizing LRO wrapper need to call "tcp_lro_queue_mbuf()" instead
  of "tcp_lro_rx()" and "tcp_lro_flush_all()" instead of
  "tcp_lro_flush()". Further the LRO control structure must be
  initialized using "tcp_lro_init_args()" passing a non-zero number
  into the "lro_mbufs" argument.

- Make LRO statistics 64-bit. Previously 32-bit integers were used for
  statistics which can be prone to wrap-around. Fix this while at it
  and update all SYSCTL's which expose LRO statistics.

- Ensure all data is freed when destroying an LRO control structure,
  especially leftover LRO entries.

- Reduce number of memory allocations needed when setting up an LRO
  control structure by precomputing the total amount of memory needed.

- Add own memory allocation counter for LRO.

- Bump the FreeBSD version to force recompilation of all KLDs due to
  change of the LRO control structure size.

Sponsored by:	Mellanox Technologies
Reviewed by:	gallatin, sbruno, rrs, gnn, transport
Tested by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D4914
2016-01-19 15:33:28 +00:00
Navdeep Parhar
9523d1bfc3 Fix leak in tcp_lro_rx. Simply clearing M_PKTHDR isn't enough, any tags
hanging off the header need to be freed too.

Differential Revision:	https://reviews.freebsd.org/D2708
Reviewed by:	ae@, hiren@
2015-06-30 17:19:58 +00:00
Navdeep Parhar
7127e6acf0 Merge r254336 from user/np/cxl_tuning.
Add a last-modified timestamp to each LRO entry and provide an interface
to flush all inactive entries.  Drivers decide when to flush and what
the inactivity threshold should be.

Network drivers that process an rx queue to completion can enter a
livelock type situation when the rate at which packets are received
reaches equilibrium with the rate at which the rx thread is processing
them.  When this happens the final LRO flush (normally when the rx
routine is done) does not occur.  Pure ACKs and segments with total
payload < 64K can get stuck in an LRO entry.  Symptoms are that TCP
tx-mostly connections' performance falls off a cliff during heavy,
unrelated rx on the interface.

Flushing only inactive LRO entries works better than any of these
alternatives that I tried:
- don't LRO pure ACKs
- flush _all_ LRO entries periodically (every 'x' microseconds or every
  'y' descriptors)
- stop rx processing in the driver periodically and schedule remaining
  work for later.

Reviewed by:	andre
2013-08-28 23:00:34 +00:00
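
The inactivity-based flush described in 7127e6acf0 can be sketched roughly as below. This is a simplified user-space model under stated assumptions: `struct lro_entry_sketch`, `flush_inactive()`, and the microsecond timestamps are all hypothetical, and "flushing" is reduced to marking the entry inactive instead of handing an aggregated packet to the stack.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced model of an LRO entry. */
struct lro_entry_sketch {
	bool	 active;	/* entry currently holds aggregated data */
	uint64_t mtime_us;	/* time of the last merged segment */
};

/*
 * Flush every active entry that has been idle for at least timeout_us.
 * The caller (the driver, in the commit's design) picks both the moment
 * to call this and the threshold. Returns the number of entries flushed.
 */
static size_t
flush_inactive(struct lro_entry_sketch *e, size_t n, uint64_t now_us,
    uint64_t timeout_us)
{
	size_t flushed = 0;

	assert(e != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (!e[i].active || now_us < e[i].mtime_us)
			continue;
		if (now_us - e[i].mtime_us >= timeout_us) {
			e[i].active = false;	/* stand-in for the real flush */
			flushed++;
		}
	}
	return (flushed);
}
```

Entries that were updated recently stay in place and can keep aggregating, which is the difference from flushing everything periodically.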
Andrew Gallatin
e5ca1ffab5 Fix tcp_lro_rx_ipv4() for drivers that do not set CSUM_IP_CHECKED.
Specifically, in_cksum_hdr() returns 0 (not 0xffff) when the IPv4
checksum is correct. Without this fix, the tcp_lro code will reject
good IPv4 traffic from drivers that do not implement IPv4 header
hardware csum offload.

Sponsored by: Myricom Inc.

MFC after:	7 days
2013-02-21 17:00:35 +00:00
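
The return-value convention at issue in e5ca1ffab5 can be illustrated with a small user-space checksum routine. `ipv4_hdr_cksum()` is a hypothetical stand-in, not the kernel's in_cksum_hdr(); it assumes an even header length and sums the header as big-endian 16-bit words. Because the stored checksum field is included in the sum, a valid header folds to 0xffff and the returned one's complement is 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * One's-complement checksum over an IPv4 header, including the checksum
 * field itself. Returns 0 for a header whose stored checksum is correct.
 * len must be even; IPv4 header lengths are always multiples of 4.
 */
static uint16_t
ipv4_hdr_cksum(const uint8_t *hdr, size_t len)
{
	uint32_t sum = 0;

	assert(hdr != NULL && len % 2 == 0);

	/* Add up the header as big-endian 16-bit words. */
	for (size_t i = 0; i < len; i += 2)
		sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];

	/* Fold the carries back into the low 16 bits. */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return ((uint16_t)~sum);
}
```

A caller validating a received header therefore tests for a result of 0 and treats any other value as a checksum failure.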
Bjoern A. Zeeb
5fa2656e55 Make TCP LRO work properly with VIMAGE kernels rather than just panicking.
There's no VIMAGE context set there yet as this is before if_ethersubr.c.

MFC after:	3 days
X-MFC with:	r235981
2012-06-01 11:42:50 +00:00
Bjoern A. Zeeb
cace7064fc Trim the extra $FreeBSD$ from the comment below the license. We use
the __FBSDID() macro on the file now instead.

MFC after:	3 days
2012-05-26 10:28:11 +00:00
Bjoern A. Zeeb
31bfc56ecd In case forwarding is turned on for a given address family, refuse to
queue the packet for LRO and tell the driver to directly pass it on.
This avoids re-assembly and later re-fragmentation problems when
forwarding.

It's not the best solution but the simplest and most effective for
the moment.

Should have been done:	ages ago
Discussed with and by:	many
MFC after:		3 days
2012-05-25 08:17:59 +00:00
Bjoern A. Zeeb
62b5b6ecd0 MFp4 bz_ipv6_fast:
Significantly update tcp_lro for mostly two things:
  1) introduce basic support for IPv6 without extension headers.
  2) try hard to also get the incremental checksum updates right,
     especially also in the IPv4 case for the IP and TCP header.

  Move variables around for better locality, factor things out into
  functions, allow checksum updates to be compiled out, ...

  Leave a few comments on further things to look at in the future,
  though that is not the full list.

  Update drivers with appropriate #includes as needed for IPv6 data
  type in LRO.

  Sponsored by:	The FreeBSD Foundation
  Sponsored by:	iXsystems

Reviewed by:	gnn (as part of the whole)
MFC After:	3 days
2012-05-24 23:03:23 +00:00
Bjoern A. Zeeb
27f190a3ca Switch to a standard 2 clause BSD license (from bsd-style-copyright).
Approved by:	Myricom Inc. (gallatin)
Approved by:	Intel Corporation (jfv)
2012-05-15 13:23:44 +00:00
Colin Percival
ca7122622b Don't allow lro->len to exceed 65535, as this will result in overflow
when len is inserted back into the synthetic IP packet and cause a
multiple of 2^16 bytes of TCP "packet loss".

This improves Linux->FreeBSD netperf bandwidth by a factor of 300 in
testing on Amazon EC2.

Reviewed by:	jfv
MFC after:	2 weeks
2011-07-05 18:43:54 +00:00
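
The overflow fixed in ca7122622b comes from storing an aggregated length in the 16-bit IP length field. The sketch below shows the arithmetic only; `LEN_FIELD_MAX`, `lro_can_append()`, `lro_append()`, and `truncated_len()` are hypothetical names, not the kernel's, and the lengths are plain integers rather than fields of an LRO entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Largest value a 16-bit IP length field can hold. */
#define LEN_FIELD_MAX	65535u

/* True if a seg_len-byte segment can be merged without exceeding 16 bits. */
static bool
lro_can_append(uint32_t cur_len, uint32_t seg_len)
{
	return (cur_len <= LEN_FIELD_MAX &&
	    seg_len <= LEN_FIELD_MAX - cur_len);
}

/* Merge a segment; callers must have checked lro_can_append() first. */
static uint32_t
lro_append(uint32_t cur_len, uint32_t seg_len)
{
	assert(lro_can_append(cur_len, seg_len));
	return (cur_len + seg_len);
}

/* Value the 16-bit field would hold if the check were skipped. */
static uint16_t
truncated_len(uint32_t total_len)
{
	return ((uint16_t)total_len);	/* keeps only the low 16 bits */
}
```

For example, merging a 1448-byte segment into a 65000-byte aggregate would give 66448 bytes, which a 16-bit field stores as 912; the missing 65536 bytes are the "multiple of 2^16 bytes" of apparent loss the commit describes.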
Jack F Vogel
c31aa19c53 Port of the LRO fix from mxge driver to the generic
LRO code. Thanks to Andrew Gallatin for the change.

MFC after:  7 days
2011-04-07 21:20:26 +00:00
John Baldwin
79e955ed63 Trim extra spaces before tabs. 2011-01-07 21:40:34 +00:00
Kip Macy
4570959392 Don't calculate checksum if it has already been validated
Obtained from:	Chelsio Inc.
MFC after:	3 days
2008-08-24 02:31:09 +00:00
Jack F Vogel
6c5087a818 Add generic TCP LRO into netinet 2008-06-11 22:12:50 +00:00