b8d60729de
NOTE: HEADS UP! Read the note below if your kernel configuration does not include GENERIC!

This patch does a bit of cleanup on the TCP congestion control modules. There were some rather interesting surprises one could get: for example, using a socket option to change from one CC (say cc_cubic) to another CC (say cc_vegas) could in theory hit a memory allocation failure and leave the connection on cc_newreno. This is not what one would expect. The new code fixes this by requiring a cc_data_sz() function, so that we can malloc with M_WAITOK and pass preallocated memory in to the init function. The CC init is expected in this case *not* to fail, but if it does, and a module breaks the "no fail with memory given" contract, we fall back to the CC that was in place at the time.

This also fixes up a set of common newreno utilities so that they can be shared amongst other CC modules, instead of the other CC modules reaching into newreno and executing what they think is a "common and understood" function. Let's put these functions in cc.c so that we have a common place that is easily findable by future developers or bug fixers. This also allows newreno to evolve and grow support for its features (i.e. ABE and HYSTART++) without having to jump through hoops for other CC modules; instead, both newreno and the other modules just call into the common functions if they desire that behavior, or roll their own if that makes more sense.

Note: This commit changes the kernel configuration!! If you are not using GENERIC in some form, you must add a CC module option (one of CC_NEWRENO, CC_VEGAS, CC_CUBIC, CC_CDG, CC_CHD, CC_DCTCP, CC_HTCP, CC_HD). You can define more than one if you desire. Note that a kernel configuration that includes INET or INET6 but does not define a congestion control module will break the kernel compile.

You also need to define a default. GENERIC adds 'options CC_DEFAULT=\"newreno\"', but you can specify any string that names a CC module (the same names that show up in the CC module list under net.inet.tcp.cc). If you fail to add the CC_DEFAULT option to your kernel configuration, the kernel build will also break.

Reviewed by: Michael Tuexen
Sponsored by: Netflix Inc.
RELNOTES: YES
Differential Revision: https://reviews.freebsd.org/D32693
423 lines
13 KiB
Groff
.\"
.\" Copyright (c) 2008-2009 Lawrence Stewart <lstewart@FreeBSD.org>
.\" Copyright (c) 2010-2011 The FreeBSD Foundation
.\" All rights reserved.
.\"
.\" Portions of this documentation were written at the Centre for Advanced
.\" Internet Architectures, Swinburne University of Technology, Melbourne,
.\" Australia by David Hayes and Lawrence Stewart under sponsorship from the
.\" FreeBSD Foundation.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE FOR
.\" ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" $FreeBSD$
.\"
.Dd May 13, 2021
.Dt MOD_CC 9
.Os
.Sh NAME
.Nm mod_cc ,
.Nm DECLARE_CC_MODULE ,
.Nm CCV
.Nd Modular Congestion Control
.Sh SYNOPSIS
.In netinet/tcp.h
.In netinet/cc/cc.h
.In netinet/cc/cc_module.h
.Fn DECLARE_CC_MODULE "ccname" "ccalgo"
.Fn CCV "ccv" "what"
.Sh DESCRIPTION
The
.Nm
framework allows congestion control algorithms to be implemented as dynamically
loadable kernel modules via the
.Xr kld 4
facility.
Transport protocols can select from the list of available algorithms on a
connection-by-connection basis, or use the system default (see
.Xr mod_cc 4
for more details).
.Pp
.Nm
modules are identified by an
.Xr ascii 7
name and set of hook functions encapsulated in a
.Vt "struct cc_algo" ,
which has the following members:
.Bd -literal -offset indent
struct cc_algo {
	char	name[TCP_CA_NAME_MAX];
	int	(*mod_init) (void);
	int	(*mod_destroy) (void);
	size_t	(*cc_data_sz)(void);
	int	(*cb_init) (struct cc_var *ccv, void *ptr);
	void	(*cb_destroy) (struct cc_var *ccv);
	void	(*conn_init) (struct cc_var *ccv);
	void	(*ack_received) (struct cc_var *ccv, uint16_t type);
	void	(*cong_signal) (struct cc_var *ccv, uint32_t type);
	void	(*post_recovery) (struct cc_var *ccv);
	void	(*after_idle) (struct cc_var *ccv);
	int	(*ctl_output)(struct cc_var *, struct sockopt *, void *);
	void	(*rttsample)(struct cc_var *, uint32_t, uint32_t, uint32_t);
	void	(*newround)(struct cc_var *, uint32_t);
};
.Ed
.Pp
The
.Va name
field identifies the unique name of the algorithm, and should be no longer than
TCP_CA_NAME_MAX-1 characters in length (the TCP_CA_NAME_MAX define lives in
.In netinet/tcp.h
for compatibility reasons).
.Pp
The
.Va mod_init
function is called when a new module is loaded into the system but before the
registration process is complete.
It should be implemented if a module needs to set up some global state prior to
being available for use by new connections.
Returning a non-zero value from
.Va mod_init
will cause the loading of the module to fail.
.Pp
The
.Va mod_destroy
function is called prior to unloading an existing module from the kernel.
It should be implemented if a module needs to clean up any global state before
being removed from the kernel.
The return value is currently ignored.
.Pp
The
.Va cc_data_sz
function is called by the socket option code to get the size of the
data that the
.Va cb_init
function needs.
The socket option code then preallocates the module's memory so that the
.Va cb_init
function will not fail (the socket option code uses M_WAITOK with
no locks held to do this).
.Pp
The
.Va cb_init
function is called when a TCP control block
.Vt struct tcpcb
is created.
It should be implemented if a module needs to allocate memory for storing
private per-connection state.
Returning a non-zero value from
.Va cb_init
will cause the connection set up to be aborted, terminating the connection as a
result.
Note that the
.Fa ptr
argument passed to the function should be checked to see if it is non-NULL.
If so, it points to preallocated memory that the
.Va cb_init
function must use instead of calling malloc itself.
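.Pp
The
.Va cc_data_sz
and
.Va cb_init
contract described above can be sketched in userland as follows.
This is an illustration under stated assumptions, not code from any shipped module:
.Vt struct example_state
and the
.Fn example_*
functions are hypothetical, the
.Vt struct cc_var
shown is a one-field stand-in for the kernel structure, and
.Xr malloc 3
stands in for a non-sleeping kernel allocation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Userland stand-in for the kernel's struct cc_var; only cc_data is modelled. */
struct cc_var {
	void	*cc_data;
};

/* Hypothetical per-connection state for an imaginary "example" algorithm. */
struct example_state {
	unsigned int	acked_bytes;
};

/* cc_data_sz hook: report how much memory cb_init needs. */
static size_t
example_data_sz(void)
{
	return (sizeof(struct example_state));
}

/*
 * cb_init hook: use the preallocated buffer when one is supplied,
 * otherwise allocate. Plain malloc(3) is an assumption standing in
 * for a non-sleeping kernel allocation.
 */
static int
example_cb_init(struct cc_var *ccv, void *ptr)
{
	struct example_state *st;

	assert(ccv != NULL);
	if (ptr != NULL)
		st = ptr;
	else if ((st = malloc(sizeof(*st))) == NULL)
		return (ENOMEM);
	st->acked_bytes = 0;
	ccv->cc_data = st;
	return (0);
}

/* cb_destroy hook: release the per-connection state. */
static void
example_cb_destroy(struct cc_var *ccv)
{
	free(ccv->cc_data);
	ccv->cc_data = NULL;
}
```

.Pp
When a buffer of
.Fn example_data_sz
bytes is handed in, the function has no allocation left that could fail, which is the property the socket option code relies on.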
.Pp
The
.Va cb_destroy
function is called when a TCP control block
.Vt struct tcpcb
is destroyed.
It should be implemented if a module needs to free memory allocated in
.Va cb_init .
.Pp
The
.Va conn_init
function is called when a new connection has been established and variables are
being initialised.
It should be implemented to initialise congestion control algorithm variables
for the newly established connection.
.Pp
The
.Va ack_received
function is called when a TCP acknowledgement (ACK) packet is received.
Modules use the
.Fa type
argument as an input to their congestion management algorithms.
The ACK types currently reported by the stack are CC_ACK and CC_DUPACK.
CC_ACK indicates the received ACK acknowledges previously unacknowledged data.
CC_DUPACK indicates the received ACK acknowledges data we have already received
an ACK for.
.Pp
The
.Va cong_signal
function is called when a congestion event is detected by the TCP stack.
Modules use the
.Fa type
argument as an input to their congestion management algorithms.
The congestion event types currently reported by the stack are CC_ECN, CC_RTO,
CC_RTO_ERR and CC_NDUPACK.
CC_ECN is reported when the TCP stack receives an explicit congestion notification
(RFC3168).
CC_RTO is reported when the retransmission timeout timer fires.
CC_RTO_ERR is reported if the retransmission timeout timer fired in error.
CC_NDUPACK is reported if N duplicate ACKs have been received back-to-back,
where N is the fast retransmit duplicate ACK threshold (N=3 currently as per
RFC5681).
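.Pp
To illustrate how a module might dispatch on the
.Fa type
argument of
.Va cong_signal ,
the following userland sketch applies a simplified Reno-style reduction.
It is not the logic of any shipped module.
The constant values are placeholders (the real definitions live in
.In netinet/cc/cc.h ) ,
and the flat
.Va cwnd ,
.Va ssthresh
and
.Va mss
fields are assumptions made to keep the example self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder values; the real definitions live in netinet/cc/cc.h. */
#define	CC_ECN		0x1
#define	CC_RTO		0x2
#define	CC_RTO_ERR	0x4
#define	CC_NDUPACK	0x100

/*
 * Userland stand-in for struct cc_var. The flat fields below are an
 * assumption made to keep the sketch self-contained; a real module
 * reaches transport variables through the CCV() accessor.
 */
struct cc_var {
	uint32_t	cwnd;		/* congestion window, bytes */
	uint32_t	ssthresh;	/* slow start threshold, bytes */
	uint32_t	mss;		/* maximum segment size, bytes */
};

static uint32_t
max_u32(uint32_t a, uint32_t b)
{
	return (a > b ? a : b);
}

/* cong_signal hook: react to the event type reported by the stack. */
static void
example_cong_signal(struct cc_var *ccv, uint32_t type)
{
	assert(ccv != NULL);
	switch (type) {
	case CC_NDUPACK:
	case CC_ECN:
		/* Halve the window, but keep at least two segments. */
		ccv->ssthresh = max_u32(ccv->cwnd / 2, 2 * ccv->mss);
		ccv->cwnd = ccv->ssthresh;
		break;
	case CC_RTO:
		/* Timeout: remember half the window, restart from one segment. */
		ccv->ssthresh = max_u32(ccv->cwnd / 2, 2 * ccv->mss);
		ccv->cwnd = ccv->mss;
		break;
	case CC_RTO_ERR:
	default:
		/* Spurious timeout or unknown event: leave state unchanged. */
		break;
	}
}
```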
.Pp
The
.Va post_recovery
function is called after the TCP connection has recovered from a congestion event.
It should be implemented to adjust state as required.
.Pp
The
.Va after_idle
function is called when data transfer resumes after an idle period.
It should be implemented to adjust state as required.
.Pp
The
.Va ctl_output
function is called when
.Xr getsockopt 2
or
.Xr setsockopt 2
is called on a
.Xr tcp 4
socket with the
.Va struct sockopt
pointer forwarded unmodified from the TCP control, and a
.Va void *
pointer to an algorithm-specific argument.
.Pp
The
.Va rttsample
function is called to pass round trip time information to the
congestion controller.
The additional arguments to the function include the microsecond RTT
that is being noted, the number of times that the data being
acknowledged was retransmitted, and the flightsize at send.
For transports that do not track flightsize at send, this variable
will be the current cwnd at the time of the call.
.Pp
The
.Va newround
function is called each time a new round trip time begins.
The monotonically increasing round number is also passed to the
congestion controller.
This can be used for various purposes by the congestion controller (e.g., HyStart++).
.Pp
Note that currently not all TCP stacks call the
.Va rttsample
and
.Va newround
functions, so whether a module can depend on them is
dependent upon which TCP stack is in use.
.Pp
The
.Fn DECLARE_CC_MODULE
macro provides a convenient wrapper around the
.Xr DECLARE_MODULE 9
macro, and is used to register a
.Nm
module with the
.Nm
framework.
The
.Fa ccname
argument specifies the module's name.
The
.Fa ccalgo
argument points to the module's
.Vt struct cc_algo .
.Pp
.Nm
modules must instantiate a
.Vt struct cc_algo ,
but are only required to set the name field, and optionally any of the function
pointers.
Note that if a module defines the
.Va cb_init
function it also must define a
.Va cc_data_sz
function.
This is because when switching from one congestion control
module to another the socket option code will preallocate memory for the
.Va cb_init
function.
If no memory is allocated by the module's
.Va cb_init
then the
.Va cc_data_sz
function should return 0.
.Pp
The stack will skip calling any function pointer which is NULL, so there is no
requirement to implement any of the function pointers (with the exception of
the cb_init <-> cc_data_sz dependency noted above).
Using the C99 designated initialiser feature to set fields is encouraged.
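.Pp
A minimal userland sketch of such an instantiation is shown below.
The
.Vt struct cc_algo
here is a trimmed stand-in for the kernel structure, the value given for TCP_CA_NAME_MAX is an assumption, and the
.Dq example
algorithm and
.Fn call_conn_init
helper are hypothetical.
The helper only mimics the way the stack skips NULL hooks.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed value for this sketch; the real define lives in netinet/tcp.h. */
#define	TCP_CA_NAME_MAX	16

struct cc_var;	/* opaque in this sketch */

/* Trimmed userland stand-in for struct cc_algo. */
struct cc_algo {
	char	name[TCP_CA_NAME_MAX];
	int	(*mod_init)(void);
	size_t	(*cc_data_sz)(void);
	int	(*cb_init)(struct cc_var *ccv, void *ptr);
	void	(*conn_init)(struct cc_var *ccv);
};

/* The name, including its terminating NUL, must fit in the name field. */
_Static_assert(sizeof("example") <= TCP_CA_NAME_MAX, "name too long");

static int conn_init_calls;

/* Hypothetical conn_init hook that only counts its invocations. */
static void
example_conn_init(struct cc_var *ccv)
{
	(void)ccv;
	conn_init_calls++;
}

/* Only the name and one hook are set; all other pointers default to NULL. */
static struct cc_algo example_cc_algo = {
	.name = "example",
	.conn_init = example_conn_init,
};

/*
 * In a kernel module the structure would then be registered with:
 *	DECLARE_CC_MODULE(example, &example_cc_algo);
 */

/* Mimics how the stack skips hooks that are NULL. */
static void
call_conn_init(const struct cc_algo *algo, struct cc_var *ccv)
{
	assert(algo != NULL);
	if (algo->conn_init != NULL)
		algo->conn_init(ccv);
}
```

.Pp
Because neither
.Va cb_init
nor
.Va cc_data_sz
is set in this sketch, the pairing rule described above is satisfied trivially.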
.Pp
Each function pointer which deals with congestion control state is passed a
pointer to a
.Vt struct cc_var ,
which has the following members:
.Bd -literal -offset indent
struct cc_var {
	void		*cc_data;
	int		bytes_this_ack;
	tcp_seq		curack;
	uint32_t	flags;
	int		type;
	union ccv_container {
		struct tcpcb		*tcp;
		struct sctp_nets	*sctp;
	} ccvc;
	uint16_t	nsegs;
	uint8_t		labc;
};
.Ed
.Pp
.Vt struct cc_var
groups congestion control related variables into a single, embeddable structure
and adds a layer of indirection to accessing transport protocol control blocks.
The eventual goal is to allow a single set of
.Nm
modules to be shared between all congestion aware transport protocols, though
currently only
.Xr tcp 4
is supported.
.Pp
To aid the eventual transition towards this goal, direct use of variables from
the transport protocol's data structures is strongly discouraged.
However, access to some of these variables is currently unavoidable, and so the
.Fn CCV
macro exists as a convenience accessor.
The
.Fa ccv
argument points to the
.Vt struct cc_var
passed into the function by the
.Nm
framework.
The
.Fa what
argument specifies the name of the variable to access.
.Pp
Apart from the
.Va type
and
.Va ccv_container
fields, the remaining fields in
.Vt struct cc_var
are for use by
.Nm
modules.
.Pp
The
.Va cc_data
field is available for algorithms requiring additional per-connection state to
attach a dynamic memory pointer to.
The memory should be allocated and attached in the module's
.Va cb_init
hook function.
.Pp
The
.Va bytes_this_ack
field specifies the number of new bytes acknowledged by the most recently
received ACK packet.
It is only valid in the
.Va ack_received
hook function.
.Pp
The
.Va curack
field specifies the sequence number of the most recently received ACK packet.
It is only valid in the
.Va ack_received ,
.Va cong_signal
and
.Va post_recovery
hook functions.
.Pp
The
.Va flags
field is used to pass useful information from the stack to a
.Nm
module.
The CCF_ABC_SENTAWND flag is relevant in
.Va ack_received
and is set when appropriate byte counting (RFC3465) has counted that a window's
worth of bytes has been sent.
It is the module's responsibility to clear the flag after it has processed the
signal.
The CCF_CWND_LIMITED flag is relevant in
.Va ack_received
and is set when the connection's ability to send data is currently constrained
by the value of the congestion window.
Algorithms should use the absence of this flag being set to avoid accumulating
a large difference between the congestion window and send window.
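.Pp
The handling of these two flags in
.Va ack_received
can be sketched in userland as follows.
The window growth rule is a deliberately simplified illustration rather than the behaviour of any shipped module, the constant values are placeholders (the real definitions live in
.In netinet/cc/cc.h ) ,
and the flat
.Va cwnd
and
.Va mss
fields are assumptions standing in for variables a real module would reach through
.Fn CCV .

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder values; the real definitions live in netinet/cc/cc.h. */
#define	CC_ACK			0x0001
#define	CC_DUPACK		0x0002
#define	CCF_ABC_SENTAWND	0x0001
#define	CCF_CWND_LIMITED	0x0002

/* Userland stand-in for struct cc_var; cwnd and mss are assumed fields. */
struct cc_var {
	uint32_t	flags;
	uint32_t	cwnd;	/* congestion window, bytes */
	uint32_t	mss;	/* maximum segment size, bytes */
};

/* ack_received hook: grow the window only when it is the bottleneck. */
static void
example_ack_received(struct cc_var *ccv, uint16_t type)
{
	assert(ccv != NULL);
	/* Ignore duplicate ACKs and ACKs seen while not cwnd-limited. */
	if (type != CC_ACK || (ccv->flags & CCF_CWND_LIMITED) == 0)
		return;
	if ((ccv->flags & CCF_ABC_SENTAWND) != 0) {
		/* A window's worth of bytes was counted: add one segment. */
		ccv->cwnd += ccv->mss;
		/* The module, not the stack, clears the flag once handled. */
		ccv->flags &= ~(uint32_t)CCF_ABC_SENTAWND;
	}
}
```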
.Pp
The
.Va nsegs
variable is used to pass in how much compression was done by the local
LRO system.
For example, if LRO pushed three in-order acknowledgements into
one acknowledgement, the variable would be set to three.
.Pp
The
.Va labc
variable is used in conjunction with the CCF_USE_LOCAL_ABC flag
to override what labc variable the congestion controller will use
for this particular acknowledgement.
.Sh SEE ALSO
.Xr cc_cdg 4 ,
.Xr cc_chd 4 ,
.Xr cc_cubic 4 ,
.Xr cc_dctcp 4 ,
.Xr cc_hd 4 ,
.Xr cc_htcp 4 ,
.Xr cc_newreno 4 ,
.Xr cc_vegas 4 ,
.Xr mod_cc 4 ,
.Xr tcp 4
.Sh ACKNOWLEDGEMENTS
Development and testing of this software were made possible in part by grants
from the FreeBSD Foundation and Cisco University Research Program Fund at
Community Foundation Silicon Valley.
.Sh FUTURE WORK
Integrate with
.Xr sctp 4 .
.Sh HISTORY
The modular Congestion Control (CC) framework first appeared in
.Fx 9.0 .
.Pp
The framework was first released in 2007 by James Healy and Lawrence Stewart
whilst working on the NewTCP research project at Swinburne University of
Technology's Centre for Advanced Internet Architectures, Melbourne, Australia,
which was made possible in part by a grant from the Cisco University Research
Program Fund at Community Foundation Silicon Valley.
More details are available at:
.Pp
http://caia.swin.edu.au/urp/newtcp/
.Sh AUTHORS
.An -nosplit
The
.Nm
framework was written by
.An Lawrence Stewart Aq Mt lstewart@FreeBSD.org ,
.An James Healy Aq Mt jimmy@deefa.com
and
.An David Hayes Aq Mt david.hayes@ieee.org .
.Pp
This manual page was written by
.An David Hayes Aq Mt david.hayes@ieee.org
and
.An Lawrence Stewart Aq Mt lstewart@FreeBSD.org .