DCTCP congestion control algorithm aims to maximise throughput and minimise
latency in data center networks by utilising the proportion of Explicit
Congestion Notification (ECN) marked packets received from capable hardware as a
congestion signal.
Highlights:
Implemented as a mod_cc(4) module.
ECN (Explicit congestion notification) processing is done differently from
RFC3168.
Takes one-sided DCTCP into consideration where only one of the sides is using
DCTCP and other is using standard ECN.
IETF draft: http://tools.ietf.org/html/draft-bensley-tcpm-dctcp-00
Thesis report by Midori Kato: https://eggert.org/students/kato-thesis.pdf
Submitted by: Midori Kato <katoon@sfc.wide.ad.jp> and
Lars Eggert <lars@netapp.com>
with help and modifications from
hiren
Differential Revision: https://reviews.freebsd.org/D604
Reviewed by: gnn
directly accessed. Although this will work on some platforms, it can
throw an exception if the pointer is invalid and then panic the kernel.
Add a missing SYSCTL_IN() of "SCTP_BASE_STATS" structure.
MFC after: 3 days
Sponsored by: Mellanox Technologies
- Wrong integer type was specified.
- Wrong or missing "access" specifier. The "access" specifier
sometimes included the SYSCTL type, which it should not, except for
procedural SYSCTL nodes.
- Logical OR where binary OR was expected.
- Properly assert the "access" argument passed to all SYSCTL macros,
using the CTASSERT macro. This applies to both static- and dynamically
created SYSCTLs.
- Properly assert the the data type for both static and dynamic
SYSCTLs. In the case of static SYSCTLs we only assert that the data
pointed to by the SYSCTL data pointer has the correct size, hence
there is no easy way to assert types in the C language outside a
C-function.
- Rewrote some code which doesn't pass a constant "access" specifier
when creating dynamic SYSCTL nodes, which is now a requirement.
- Updated "EXAMPLES" section in SYSCTL manual page.
MFC after: 3 days
Sponsored by: Mellanox Technologies
algorithm, which is based on the 2011 v0.1 patch release and described in the
paper "Revisiting TCP Congestion Control using Delay Gradients" by David Hayes
and Grenville Armitage. It is implemented as a kernel module compatible with the
modular congestion control framework.
CDG is a hybrid congestion control algorithm which reacts to both packet loss
and inferred queuing delay. It attempts to operate as a delay-based algorithm
where possible, but utilises heuristics to detect loss-based TCP cross traffic
and will compete effectively as required. CDG is therefore incrementally
deployable and suitable for use on shared networks.
In collaboration with: David Hayes <david.hayes at ieee.org> and
Grenville Armitage <garmitage at swin edu au>
MFC after: 4 days
Sponsored by: Cisco University Research Program and FreeBSD Foundation
top 8 bits of the 32 bit signal bit field space for internal use. These private
signals should not be leaked outside of a module.
Given that many algorithm modules use the NewReno hook functions to simplify
their implementation, the obvious place such a leak would show up is in the
NewReno cong_signal hook function.
- Show the full number of significant bits in the signal type definitions in
<netinet/cc.h>.
- Add a bitmask to simplify figuring out if a given signal is in the private or
public bit range.
- Add a sanity check in newreno_cong_signal() to ensure private signals are not
being leaked into the hook function.
Sponsored by: FreeBSD Foundation
Discussed with: David Hayes <dahayes at swin edu au>
MFC after: 1 week
X-MFC with: r215166
algorithm described in the paper "Improved coexistence and loss tolerance for
delay based TCP congestion control" by Hayes and Armitage. It is implemented as
a kernel module compatible with the recently committed modular congestion
control framework.
CHD enhances the approach taken by the Hamilton-Delay (HD) algorithm to provide
tolerance to non-congestion related packet loss and improvements to coexistence
with loss-based congestion control algorithms. A key idea in improving
coexistence with loss-based congestion control algorithms is the use of a shadow
window, which attempts to track how NewReno's congestion window (cwnd) would
evolve. At the next packet loss congestion event, CHD uses the shadow window to
correct cwnd in a way that reduces the amount of unfairness CHD experiences when
competing with loss-based algorithms.
In collaboration with: David Hayes <dahayes at swin edu au> and
Grenville Armitage <garmitage at swin edu au>
Sponsored by: FreeBSD Foundation
Reviewed by: bz and others along the way
MFC after: 3 months
algorithm based on the paper "A strategy for fair coexistence of loss and
delay-based congestion control algorithms" by Budzisz, Stanojevic, Shorten and
Baker. It is implemented as a kernel module compatible with the recently
committed modular congestion control framework.
HD uses a probabilistic approach to reacting to delay-based congestion. The
probability of reducing cwnd is zero when the queuing delay is very small,
increasing to a maximum at a set threshold, then back down to zero again when
the queuing delay is high. Normal operation keeps the queuing delay below the
set threshold. However, since loss-based congestion control algorithms push the
queuing delay high when probing for bandwidth, having the probability of
reducing cwnd drop back to zero for high delays allows HD to compete with
loss-based algorithms.
In collaboration with: David Hayes <dahayes at swin edu au> and
Grenville Armitage <garmitage at swin edu au>
Sponsored by: FreeBSD Foundation
Reviewed by: bz and others along the way
MFC after: 3 months
based on the paper "TCP Vegas: end to end congestion avoidance on a global
internet" by Brakmo and Peterson. It is implemented as a kernel module
compatible with the recently committed modular congestion control framework.
VEGAS uses network delay as a congestion indicator and unlike regular loss-based
algorithms, attempts to keep the network operating with stable queuing delays
and no congestion losses. By keeping network buffers used along the path within
a set range, queuing delays are kept low while maintaining high throughput.
In collaboration with: David Hayes <dahayes at swin edu au> and
Grenville Armitage <garmitage at swin edu au>
Sponsored by: FreeBSD Foundation
Reviewed by: bz and others along the way
MFC after: 3 months
write to the buffer causes it to overflow. We therefore can't hold the CC list
rwlock over a call to sbuf_printf() for an sbuf configured with SBUF_AUTOEXTEND.
Switch to a fixed length sbuf which should be of sufficient size except in the
very unlikely event that the sysctl is being processed as one or more new
algorithms are loaded. If that happens, we accept the race and may fail the
sysctl gracefully if there is insufficient room to print the names of all the
algorithms.
This should address a WITNESS warning and the potential panic that would occur
if the sbuf call to malloc did sleep whilst holding the CC list rwlock.
Sponsored by: FreeBSD Foundation
Reported by: Nick Hibma
Reviewed by: bz
MFC after: 3 weeks
X-MFC with: r215166
- The mean RTT is updated at the end of each congestion epoch, but if we switch
to congestion avoidance within the first epoch (e.g. if ssthresh was primed
from the hostcache), we'll trigger a divide by zero panic in
cubic_ack_received(). Set the mean to the min in cubic_record_rtt() if the
mean is less than the min to ensure we have a sane mean for use in this
situation. This fixes the panic reported by Nick Hibma.
- Adjust conditions under which we update the mean RTT in cubic_post_recovery()
to ensure a low latency path won't yield an RTT of less than 1. This avoids
another potential divide by zero panic when running CUBIC in networks with
sub-millisecond latencies.
- Remove the "safety" assignment of min into mean when we don't update the mean
because of failed conditions. The above change to the conditions for updating
the mean ensures the safety issue is addressed and I feel it is better to keep
our previous mean estimate around if we can't update than to revert to the
min.
- Initialise the mean RTT to 1 on connection startup to act as a safety belt if
a situation we haven't considered and addressed with the above changes were to
crop up in the wild.
Sponsored by: FreeBSD Foundation
Reported and tested by: Nick Hibma
Discussed with: David Hayes <dahayes at swin edu au>
MFC after: 5 weeks
X-MFC with: r216114
algorithm based on the Internet-Draft "draft-leith-tcp-htcp-06.txt". It is
implemented as a kernel module compatible with the recently committed modular
congestion control framework.
H-TCP was designed to provide increased throughput in fast and long-distance
networks. It attempts to maintain fairness when competing with legacy NewReno
TCP in lower speed scenarios where NewReno is able to operate adequately. The
paper "H-TCP: A framework for congestion control in high-speed and long-distance
networks" provides additional detail.
In collaboration with: David Hayes <dahayes at swin edu au> and
Grenville Armitage <garmitage at swin edu au>
Sponsored by: FreeBSD Foundation
Reviewed by: rpaulo (older patch from a few weeks ago)
MFC after: 3 months
algorithm based on the Internet-Draft "draft-rhee-tcpm-cubic-02.txt". It is
implemented as a kernel module compatible with the recently committed modular
congestion control framework.
CUBIC was designed for provide increased throughput in fast and long-distance
networks. It attempts to maintain fairness when competing with legacy NewReno
TCP in lower speed scenarios where NewReno is able to operate adequately. The
paper "CUBIC: A New TCP-Friendly High-Speed TCP Variant" provides additional
detail.
In collaboration with: David Hayes <dahayes at swin edu au> and
Grenville Armitage <garmitage at swin edu au>
Sponsored by: FreeBSD Foundation
Reviewed by: rpaulo (older patch from a few weeks ago)
MFC after: 3 months
somewhere along the way due to mismerging r211464 in our development tree.
- Capture the essence of r211464 in NewReno's after_idle() hook. We don't
use V_ss_fltsz/V_ss_fltsz_local yet which needs to be revisited.
Sponsored by: FreeBSD Foundation
Submitted by: David Hayes <dahayes at swin edu au>
MFC after: 9 weeks
X-MFC with: r215166
vnets to select their own default CC algorithm independent of each other and the
base system. If the base system or a vnet has set a default which gets unloaded,
we reset that netstack's default to NewReno.
Sponsored by: FreeBSD Foundation
Tested by: Mikolaj Golub <to.my.trociny at gmail com>
Reviewed by: bz (briefly)
MFC after: 3 months
is small, so there is no good reason not to declare the buffer at the top.
- Fix a whitespace nit.
Sponsored by: FreeBSD Foundation
MFC after: 11 weeks
X-MFC with: r215166
Any found to be using the algorithm that is about to go away are switched back
to NewReno to avoid leaving dangling pointers which would trigger a panic. For
VIMAGE kernels, there is a list per vnet to walk, yet the implementation was
only examining one of the vnet lists.
Fix the implementation of the above feature for VIMAGE kernels by looping
through all active TCP control blocks across all vnets.
Sponsored by: FreeBSD Foundation
Tested by: Mikolaj Golub <to.my.trociny at gmail com>
Reviewed by: bz (briefly)
MFC after: 11 weeks
runs on boot and each time a vnet jail is created. Running cc_init() multiple
times results in a panic when attempting to initialise the cc_list lock again,
and so r215166 effectively broke the use of vnet jails.
Switch to using a SYSINIT to run cc_init() on boot. CC algorithm modules loaded
on boot register in the same SI_SUB_PROTO_IFATTACHDOMAIN category as is used in
this patch, so cc_init() is run at SI_ORDER_FIRST to ensure the framework is
initialised before module registration is attempted.
Sponsored by: FreeBSD Foundation
Reported and tested by: Mikolaj Golub <to.my.trociny at gmail com>
MFC after: 11 weeks
X-MFC with: r215166
Control Algorithms for FreeBSD" FreeBSD Foundation funded project. More details
about the project are available at: http://caia.swin.edu.au/freebsd/5cc/
- Add a KPI and supporting infrastructure to allow modular congestion control
algorithms to be used in the net stack. Algorithms can maintain per-connection
state if required, and connections maintain their own algorithm pointer, which
allows different connections to concurrently use different algorithms. The
TCP_CONGESTION socket option can be used with getsockopt()/setsockopt() to
programmatically query or change the congestion control algorithm respectively
from within an application at runtime.
- Integrate the framework with the TCP stack in as least intrusive a manner as
possible. Care was also taken to develop the framework in a way that should
allow integration with other congestion aware transport protocols (e.g. SCTP)
in the future. The hope is that we will one day be able to share a single set
of congestion control algorithm modules between all congestion aware transport
protocols.
- Introduce a new congestion recovery (TF_CONGRECOVERY) state into the TCP stack
and use it to decouple the meaning of recovery from a congestion event and
recovery from packet loss (TF_FASTRECOVERY) a la RFC2581. ECN and delay based
congestion control protocols don't generally need to recover from packet loss
and need a different way to note a congestion recovery episode within the
stack.
- Remove the net.inet.tcp.newreno sysctl, which simplifies some portions of code
and ensures the stack always uses the appropriate mechanisms for recovering
from packet loss during a congestion recovery episode.
- Extract the NewReno congestion control algorithm from the TCP stack and
massage it into module form. NewReno is always built into the kernel and will
remain the default algorithm for the forseeable future. Implementations of
additional different algorithms will become available in the near future.
- Bump __FreeBSD_version to 900025 and note in UPDATING that rebuilding code
that relies on the size of "struct tcpcb" is required.
Many thanks go to the Cisco University Research Program Fund at Community
Foundation Silicon Valley and the FreeBSD Foundation. Their support of our work
at the Centre for Advanced Internet Architectures, Swinburne University of
Technology is greatly appreciated.
In collaboration with: David Hayes <dahayes at swin edu au> and
Grenville Armitage <garmitage at swin edu au>
Sponsored by: Cisco URP, FreeBSD Foundation
Reviewed by: rpaulo
Tested by: David Hayes (and many others over the years)
MFC after: 3 months