Commit Graph

63 Commits

Author SHA1 Message Date
Michael Tuexen
bb63f59b22 When performing after_idle() or post_recovery(), don't disable the
DCTCP specific methods. Also fallthrough NewReno for non ECN capable
TCP connections and improve the integer arithmetic.

Obtained from:		Richard Scheffenegger
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D20550
2019-07-29 09:19:48 +00:00
Michael Tuexen
333ba164d6 * Improve input validation of sysctl parameters for DCTPC.
* Initialize the alpha parameter to a conservative value (like Linux)
* Improve handling of arithmetic.
* Improve man-page

Obtained from:		Richard Scheffenegger
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D20549
2019-07-29 08:50:35 +00:00
Michael Tuexen
5cc11a89db Prevent cwnd to collapse down to 1 MSS after exiting recovery.
This is descrined in RFC 6582, which updates RFC 3782.

Submitted by:		Richard Scheffenegger
Reviewed by:		lstewart@
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D17614
2019-05-09 07:11:08 +00:00
Michael Tuexen
aa36fbd6fa Ensure that when using the TCP CDG congestion control and setting the
sysctl variable net.inet.tcp.cc.cdg.smoothing_factor to 0, the smoothing
is disabled. Without this patch, a division by zero orrurs.

PR:			193762
Reviewed by:		lstewart@, rrs@
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D19071
2019-02-08 20:42:49 +00:00
Michael Tuexen
7dc90a1de0 Fix a bug in the restart window computation of TCP New Reno
When implementing support for IW10, an update in the computation
of the restart window used after an idle phase was missed. To
minimize code duplication, implement the logic in tcp_compute_initwnd()
and call it. This fixes a bug in NewReno, which was not aware of
IW10.

Submitted by:		Richard Scheffenegger
Reviewed by:		tuexen@
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D18940
2019-01-25 13:57:09 +00:00
Hiren Panchasara
51e712f865 Revert r331567 CC Cubic: fix underflow for cubic_cwnd()
This change is causing TCP connections using cubic to hang. Need to dig more to
find exact cause and fix it.

Reported by:	tj at mrsk dot me, Matt Garber (via twitter)
Discussed with:	sbruno (previously), allanjude, cperciva
MFC after:	3 days
2018-12-15 17:01:16 +00:00
Brooks Davis
855acb84ca Fix bugs in plugable CC algorithm and siftr sysctls.
Use the sysctl_handle_int() handler to write out the old value and read
the new value into a temporary variable. Use the temporary variable
for any checks of values rather than using the CAST_PTR_INT() macro on
req->newptr. The prior usage read directly from userspace memory if the
sysctl() was called correctly. This is unsafe and doesn't work at all on
some architectures (at least i386.)

In some cases, the code could also be tricked into reading from kernel
memory and leaking limited information about the contents or crashing
the system. This was true for CDG, newreno, and siftr on all platforms
and true for i386 in all cases. The impact of this bug is largest in
VIMAGE jails which have been configured to allow writing to these
sysctls.

Per discussion with the security officer, we will not be issuing an
advisory for this issue as root access and a non-default config are
required to be impacted.

Reviewed by:	markj, bz
Discussed with:	gordon (security officer)
MFC after:	3 days
Security:	kernel information leak, local DoS (both require root)
Differential Revision:	https://reviews.freebsd.org/D18443
2018-12-15 15:06:22 +00:00
Michael Tuexen
c8b53ced95 Limit option_len for the TCP_CCALGOOPT.
Limiting the length to 2048 bytes seems to be acceptable, since
the values used right now are using 8 bytes.

Reviewed by:		glebius, bz, rrs
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D18366
2018-11-30 10:50:07 +00:00
Andrew Turner
5f901c92a8 Use the new VNET_DEFINE_STATIC macro when we are defining static VNET
variables.

Reviewed by:	bz
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D16147
2018-07-24 16:35:52 +00:00
Matt Macy
2269988749 NULL out cc_data in pluggable TCP {cc}_cb_destroy
When ABE was added (rS331214) to NewReno and leak fixed (rS333699) , it now has
a destructor (newreno_cb_destroy) for per connection state. Other congestion
controls may allocate and free cc_data on entry and exit, but the field is
never explicitly NULLed if moving back to NewReno which only internally
allocates stateful data (no entry contstructor) resulting in a situation where
newreno_cb_destory might be called on a junk pointer.

 -    NULL out cc_data in the framework after calling {cc}_cb_destroy
 -    free(9) checks for NULL so there is no need to perform not NULL checks
     before calling free.
 -    Improve a comment about NewReno in tcp_ccalgounload

This is the result of a debugging session from Jason Wolfe, Jason Eggleston,
and mmacy@ and very helpful insight from lstewart@.

Submitted by: Kevin Bowling
Reviewed by: lstewart
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16282
2018-07-22 05:37:58 +00:00
Lawrence Stewart
9891578a40 Plug a memory leak and potential NULL-pointer dereference introduced in r331214.
Each TCP connection that uses the system default cc_newreno(4) congestion
control algorithm module leaks a "struct newreno" (8 bytes of memory) at
connection initialisation time. The NULL-pointer dereference is only germane
when using the ABE feature, which is disabled by default.

While at it:

- Defer the allocation of memory until it is actually needed given that ABE is
  optional and disabled by default.

- Document the ENOMEM errno in getsockopt(2)/setsockopt(2).

- Document ENOMEM and ENOBUFS in tcp(4) as being synonymous given that they are
  used interchangeably throughout the code.

- Fix a few other nits also accidentally omitted from the original patch.

Reported by:	Harsh Jain on freebsd-net@
Tested by:	tjh@
Differential Revision:	https://reviews.freebsd.org/D15358
2018-05-17 02:46:27 +00:00
Sean Bruno
68ff29affa cc_cubic:
- Update cubic parameters to draft-ietf-tcpm-cubic-04

Submitted by:	Matt Macy <mmacy@mattmacy.io>
Reviewed by:	lstewart
Differential Revision:	https://reviews.freebsd.org/D10556
2018-05-03 15:01:27 +00:00
Sean Bruno
e2041bfa7c CC Cubic: fix underflow for cubic_cwnd()
Singed calculations in cubic_cwnd() can result in negative cwnd
value which is then cast to an unsigned value. Values less than
1 mss are generally bad for other parts of the code, also fixed.

Submitted by:	Jason Eggleston <jason@eggnet.com>
Sponsored by:	Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D14141
2018-03-26 19:53:36 +00:00
Lawrence Stewart
370efe5ac8 Add support for the experimental Internet-Draft "TCP Alternative Backoff with
ECN (ABE)" proposal to the New Reno congestion control algorithm module.
ABE reduces the amount of congestion window reduction in response to
ECN-signalled congestion relative to the loss-inferred congestion response.

More details about ABE can be found in the Internet-Draft:
https://tools.ietf.org/html/draft-ietf-tcpm-alternativebackoff-ecn

The implementation introduces four new sysctls:

- net.inet.tcp.cc.abe defaults to 0 (disabled) and can be set to non-zero to
  enable ABE for ECN-enabled TCP connections.

- net.inet.tcp.cc.newreno.beta and net.inet.tcp.cc.newreno.beta_ecn set the
  multiplicative window decrease factor, specified as a percentage, applied to
  the congestion window in response to a loss-based or ECN-based congestion
  signal respectively. They default to the values specified in the draft i.e.
  beta=50 and beta_ecn=80.

- net.inet.tcp.cc.abe_frlossreduce defaults to 0 (disabled) and can be set to
  non-zero to enable the use of standard beta (50% by default) when repairing
  loss during an ECN-signalled congestion recovery episode. It enables a more
  conservative congestion response and is provided for the purposes of
  experimentation as a result of some discussion at IETF 100 in Singapore.

The values of beta and beta_ecn can also be set per-connection by way of the
TCP_CCALGOOPT TCP-level socket option and the new CC_NEWRENO_BETA or
CC_NEWRENO_BETA_ECN CC algo sub-options.

Submitted by:	Tom Jones <tj@enoti.me>
Tested by:	Tom Jones <tj@enoti.me>, Grenville Armitage <garmitage@swin.edu.au>
Relnotes:	Yes
Differential Revision:	https://reviews.freebsd.org/D11616
2018-03-19 16:37:47 +00:00
Pedro F. Giffuni
fe267a5590 sys: general adoption of SPDX licensing ID tags.
Mainly focus on files that use BSD 2-Clause license, however the tool I
was using misidentified many licenses so this was mostly a manual - error
prone - task.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

No functional change intended.
2017-11-27 15:23:17 +00:00
Ed Maste
1edbb54fe9 cc_cubic: restore braces around if-condition block
r307901 was reverted in r321480, restoring an incorrect block
delimitation bug present in the original cc_cubic commit. Restore
only the bugfix (brace addition) from r307901.

CID:		1090182
Approved by:	sbruno
2017-07-26 21:23:09 +00:00
Sean Bruno
43053c125a Revert r307901 - Inform CC modules about loss events.
This was discussed between various transport@ members and it was
requested to be reverted and discussed.

Submitted by:	Kevin Bowling <kevin.bowling@kev009.com>
Reported by:	lawrence
Reviewed by:	hiren
Sponsored by:	Limelight Networks
2017-07-25 15:08:52 +00:00
Sean Bruno
5d53981a18 Revert r308180 - Set slow start threshold more accurrately on loss ...
This was discussed between various transport@ members and it was
requested to be reverted and discussed.

Submitted by:	kevin
Reported by:	lawerence
Reviewed by:	hiren
2017-07-25 15:03:05 +00:00
Conrad Meyer
1d64db52f3 Fix a variety of cosmetic typos and misspellings
No functional change.

PR:		216096, 216097, 216098, 216101, 216102, 216106, 216109, 216110
Reported by:	Bulat <bltsrc at mail.ru>
Sponsored by:	Dell EMC Isilon
2017-01-15 18:00:45 +00:00
Hiren Panchasara
e04310d59b Set slow start threshold more accurately on loss to be flightsize/2 instead of
cwnd/2 as recommended by RFC5681. (spotted by mmacy at nextbsd dot org)

Restore pre-r307901 behavior of aligning ssthresh/cwnd on mss boundary. (spotted
by slawa at zxy dot spb dot ru)

Tested by:	    dim, Slawa <slawa at zxy dot spb dot ru>
MFC after:	    1 month
X-MFC with:	    r307901
Sponsored by:	    Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D8349
2016-11-01 21:08:37 +00:00
Hiren Panchasara
4e7f755377 FreeBSD tcp stack used to inform respective congestion control module about the
loss event but not use or obay the recommendations i.e. values set by it in some
cases.

Here is an attempt to solve that confusion by following relevant RFCs/drafts.
Stack only sets congestion window/slow start threshold values when there is no
CC module availalbe to take that action. All CC modules are inspected and
updated when needed to take appropriate action on loss.

tcp_stacks/fastpath module has been updated to adapt these changes.

Note: Probably, the most significant change would be to not bring congestion
window down to 1MSS on a loss signaled by 3-duplicate acks and letting
respective CC decide that value.

In collaboration with:	Matt Macy <mmacy at nextbsd dot org>
Discussed on:		transport@ mailing list
Reviewed by:		jtl
MFC after:		1 month
Sponsored by:		Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D8225
2016-10-25 05:45:47 +00:00
Hiren Panchasara
dd13b7d387 Undo r307899. It needs a bit more work and proper commit log. 2016-10-25 05:07:51 +00:00
Hiren Panchasara
95d8236011 In Collaboration with: Matt Macy <mmacy at nextbsd dot com>
Reviewed by:		    jtl
Sponsored by:		    Limelight Networks
Differential Revision:	    https://reviews.freebsd.org/D8225
2016-10-25 05:03:33 +00:00
Jonathan T. Looney
3ac125068a Remove "long" variables from the TCP stack (not including the modular
congestion control framework).

Reviewed by:	gnn, lstewart (partial)
Sponsored by:	Juniper Networks, Netflix
Differential Revision:	(multiple)
Tested by:	Limelight, Netflix
2016-10-06 16:28:34 +00:00
Lawrence Stewart
4b7b743c16 Pass the number of segments coalesced by LRO up the stack by repurposing the
tso_segsz pkthdr field during RX processing, and use the information in TCP for
more correct accounting and as a congestion control input. This is only a start,
and an audit of other uses for the data is left as future work.

Reviewed by:	gallatin, rrs
Sponsored by:	Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D7564
2016-08-25 13:33:32 +00:00
Brad Davis
439e76ec22 Fix the case for some sysctl descriptions.
Reviewed by:	gnn
2016-07-26 20:20:09 +00:00
Hiren Panchasara
7c375daa61 Add an option to use rfc6675 based pipe/inflight bytes calculation in htcp.
Submitted by:	Kevin Bowling <kevin.bowling@kev009.com>
MFC after:	1 week
Sponsored by:	Limelight Networks
2016-05-09 19:19:03 +00:00
Pedro F. Giffuni
a4641f4eaa sys/net*: minor spelling fixes.
No functional change.
2016-05-03 18:05:43 +00:00
Gleb Smirnoff
4644fda3f7 Rename netinet/tcp_cc.h to netinet/cc/cc.h.
Discussed with:	lstewart
2016-01-27 17:59:39 +00:00
Gleb Smirnoff
2de3e790f5 - Rename cc.h to more meaningful tcp_cc.h.
- Declare it a kernel only include, which it already is.
- Don't include tcp.h implicitly from tcp_cc.h
2016-01-21 22:34:51 +00:00
Gleb Smirnoff
b66d74c138 Cleanup TCP files from unnecessary interface related includes. 2016-01-21 22:24:20 +00:00
Hiren Panchasara
a934d06194 Add an option to use rfc6675 based pipe/inflight bytes calculation in newreno.
MFC after:	    3 weeks
Sponsored by:	    Limelight Networks
2015-12-09 08:53:41 +00:00
Hiren Panchasara
f81bc34eac Add an option to use rfc6675 based pipe/inflight bytes calculation in cubic.
Reviewed by:		gnn
MFC after:		3 weeks
Sponsored by:		Limelight Networks
Differential Revision:	https://reviews.freebsd.org/D4205
2015-12-09 07:56:40 +00:00
Hiren Panchasara
64807b300f DCTCP (Data Center TCP) implementation.
DCTCP congestion control algorithm aims to maximise throughput and minimise
latency in data center networks by utilising the proportion of Explicit
Congestion Notification (ECN) marked packets received from capable hardware as a
congestion signal.

Highlights:
Implemented as a mod_cc(4) module.
ECN (Explicit congestion notification) processing is done differently from
RFC3168.
Takes one-sided DCTCP into consideration where only one of the sides is using
DCTCP and other is using standard ECN.

IETF draft: http://tools.ietf.org/html/draft-bensley-tcpm-dctcp-00
Thesis report by Midori Kato: https://eggert.org/students/kato-thesis.pdf

Submitted by:	Midori Kato <katoon@sfc.wide.ad.jp> and
		Lars Eggert <lars@netapp.com>
		with help and modifications from
		hiren
Differential Revision:	https://reviews.freebsd.org/D604
Reviewed by:	gnn
2015-01-12 08:33:04 +00:00
Gleb Smirnoff
6df8a71067 Remove SYSCTL_VNET_* macros, and simply put CTLFLAG_VNET where needed.
Sponsored by:	Nginx, Inc.
2014-11-07 09:39:05 +00:00
Hans Petter Selasky
0e1152fcc2 The SYSCTL data pointers can come from userspace and must not be
directly accessed. Although this will work on some platforms, it can
throw an exception if the pointer is invalid and then panic the kernel.

Add a missing SYSCTL_IN() of "SCTP_BASE_STATS" structure.

MFC after:	3 days
Sponsored by:	Mellanox Technologies
2014-10-28 12:00:39 +00:00
Hans Petter Selasky
614b50ae8b Preserve limitation of "TCP_CA_NAME_MAX" when matching the algorithm
name.

MFC after:	3 days
Suggested by:	gnn @
2014-10-27 16:08:41 +00:00
Hans Petter Selasky
60a945f95d Make assignments to "net.inet.tcp.cc.algorithm" work by fixing a bad
string comparison.

MFC after:	3 days
Reported by:	Jukka Ukkonen <jau789@gmail.com>
Sponsored by:	Mellanox Technologies
2014-10-27 11:21:47 +00:00
Hans Petter Selasky
f0188618f2 Fix multiple incorrect SYSCTL arguments in the kernel:
- Wrong integer type was specified.

- Wrong or missing "access" specifier. The "access" specifier
sometimes included the SYSCTL type, which it should not, except for
procedural SYSCTL nodes.

- Logical OR where binary OR was expected.

- Properly assert the "access" argument passed to all SYSCTL macros,
using the CTASSERT macro. This applies to both static- and dynamically
created SYSCTLs.

- Properly assert the the data type for both static and dynamic
SYSCTLs. In the case of static SYSCTLs we only assert that the data
pointed to by the SYSCTL data pointer has the correct size, hence
there is no easy way to assert types in the C language outside a
C-function.

- Rewrote some code which doesn't pass a constant "access" specifier
when creating dynamic SYSCTL nodes, which is now a requirement.

- Updated "EXAMPLES" section in SYSCTL manual page.

MFC after:	3 days
Sponsored by:	Mellanox Technologies
2014-10-21 07:31:21 +00:00
Lawrence Stewart
8b0fe327e8 Destroy the "qdiffsample_zone" UMA zone on unload to avoid a use-after-unload
panic easily triggered by running "sysctl -a" after unload.

Reported and tested by:	Grenville Armitage <garmitage@swin.edu.au>
MFC after:	1 week
2014-08-19 02:19:53 +00:00
Hans Petter Selasky
e167cb89a2 Fix string length argument passed to "sysctl_handle_string()" so that
the complete string is returned by the function and not just only one
byte.

PR:	192544
MFC after:	2 weeks
2014-08-10 07:51:55 +00:00
Mikolaj Golub
db2f5a2461 Fixup for r261590 (vnet sysctl handlers cleanup).
Reviewed by:	glebius
2014-02-09 08:13:17 +00:00
Lawrence Stewart
92a0637f73 Import an implementation of the CAIA Delay-Gradient (CDG) congestion control
algorithm, which is based on the 2011 v0.1 patch release and described in the
paper "Revisiting TCP Congestion Control using Delay Gradients" by David Hayes
and Grenville Armitage. It is implemented as a kernel module compatible with the
modular congestion control framework.

CDG is a hybrid congestion control algorithm which reacts to both packet loss
and inferred queuing delay. It attempts to operate as a delay-based algorithm
where possible, but utilises heuristics to detect loss-based TCP cross traffic
and will compete effectively as required. CDG is therefore incrementally
deployable and suitable for use on shared networks.

In collaboration with:	David Hayes <david.hayes at ieee.org> and
		Grenville Armitage <garmitage at swin edu au>
MFC after:	4 days
Sponsored by:	Cisco University Research Program and FreeBSD Foundation
2013-07-02 08:44:56 +00:00
Sergey Kandaurov
6bed196c35 Staticize malloc types.
Approved by:	lstewart
MFC after:	1 week
2011-04-13 11:28:46 +00:00
Lawrence Stewart
891b8ed467 Use the full and proper company name for Swinburne University of Technology
throughout the source tree.

Requested by:	Grenville Armitage, Director of CAIA at Swinburne University of
			Technology
MFC after:	3 days
2011-04-12 08:13:18 +00:00
Lawrence Stewart
03f0843bdb Algorithm modules can define their own private congestion signal types in the
top 8 bits of the 32 bit signal bit field space for internal use. These private
signals should not be leaked outside of a module.

Given that many algorithm modules use the NewReno hook functions to simplify
their implementation, the obvious place such a leak would show up is in the
NewReno cong_signal hook function.

- Show the full number of significant bits in the signal type definitions in
  <netinet/cc.h>.

- Add a bitmask to simplify figuring out if a given signal is in the private or
  public bit range.

- Add a sanity check in newreno_cong_signal() to ensure private signals are not
  being leaked into the hook function.

Sponsored by:	FreeBSD Foundation
Discussed with:	David Hayes <dahayes at swin edu au>
MFC after:	1 week
X-MFC with:	r215166
2011-02-01 13:32:27 +00:00
Lawrence Stewart
ec943febbb Fix typo in comment: "course" -> "coarse"
Sponsored by:	FreeBSD Foundation
Submitted by:	jmallett
MFC after:	3 months
X-MFC with:	r218152
2011-02-01 07:10:13 +00:00
Lawrence Stewart
0927e1a18b Import an implementation of the CAIA-Hamilton-Delay (CHD) congestion control
algorithm described in the paper "Improved coexistence and loss tolerance for
delay based TCP congestion control" by Hayes and Armitage. It is implemented as
a kernel module compatible with the recently committed modular congestion
control framework.

CHD enhances the approach taken by the Hamilton-Delay (HD) algorithm to provide
tolerance to non-congestion related packet loss and improvements to coexistence
with loss-based congestion control algorithms. A key idea in improving
coexistence with loss-based congestion control algorithms is the use of a shadow
window, which attempts to track how NewReno's congestion window (cwnd) would
evolve. At the next packet loss congestion event, CHD uses the shadow window to
correct cwnd in a way that reduces the amount of unfairness CHD experiences when
competing with loss-based algorithms.

In collaboration with:	David Hayes <dahayes at swin edu au> and
				Grenville Armitage <garmitage at swin edu au>
Sponsored by:	FreeBSD Foundation
Reviewed by:	bz and others along the way
MFC after:	3 months
2011-02-01 07:05:14 +00:00
Lawrence Stewart
ac230a79e1 Import a clean-room implementation of the Hamilton-Delay (HD) congestion control
algorithm based on the paper "A strategy for fair coexistence of loss and
delay-based congestion control algorithms" by Budzisz, Stanojevic, Shorten and
Baker. It is implemented as a kernel module compatible with the recently
committed modular congestion control framework.

HD uses a probabilistic approach to reacting to delay-based congestion. The
probability of reducing cwnd is zero when the queuing delay is very small,
increasing to a maximum at a set threshold, then back down to zero again when
the queuing delay is high. Normal operation keeps the queuing delay below the
set threshold. However, since loss-based congestion control algorithms push the
queuing delay high when probing for bandwidth, having the probability of
reducing cwnd drop back to zero for high delays allows HD to compete with
loss-based algorithms.

In collaboration with:	David Hayes <dahayes at swin edu au> and
				Grenville Armitage <garmitage at swin edu au>
Sponsored by:	FreeBSD Foundation
Reviewed by:	bz and others along the way
MFC after:	3 months
2011-02-01 06:42:46 +00:00
Lawrence Stewart
1d4ed791d0 Import a clean-room implementation of the VEGAS congestion control algorithm
based on the paper "TCP Vegas: end to end congestion avoidance on a global
internet" by Brakmo and Peterson. It is implemented as a kernel module
compatible with the recently committed modular congestion control framework.

VEGAS uses network delay as a congestion indicator and unlike regular loss-based
algorithms, attempts to keep the network operating with stable queuing delays
and no congestion losses. By keeping network buffers used along the path within
a set range, queuing delays are kept low while maintaining high throughput.

In collaboration with:	David Hayes <dahayes at swin edu au> and
				Grenville Armitage <garmitage at swin edu au>
Sponsored by:	FreeBSD Foundation
Reviewed by:	bz and others along the way
MFC after:	3 months
2011-02-01 06:17:00 +00:00