with the latest socket API ID. Especially it can be disabled.
Full compliance needs changing the structure used in the
socket option. Since this breaks the API, it will be a
seperate commit which will not be MFCed to stable/8.
MFC after: 3 months.
Khelp/Hhook KPIs to hook into the TCP stack and maintain a per-connection, low
noise estimate of the instantaneous RTT. ERTT's implementation is robust even in
the face of delayed acknowledgements and/or TSO being in use for a connection.
A high quality, low noise RTT estimate is a requirement for applications such as
delay-based congestion control, for which we will be importing some algorithm
implementations shortly.
In collaboration with: David Hayes <dahayes at swin edu au> and
Grenville Armitage <garmitage at swin edu au>
Sponsored by: FreeBSD Foundation
Reviewed by: bz and others along the way
MFC after: 3 months
write to the buffer causes it to overflow. We therefore can't hold the CC list
rwlock over a call to sbuf_printf() for an sbuf configured with SBUF_AUTOEXTEND.
Switch to a fixed length sbuf which should be of sufficient size except in the
very unlikely event that the sysctl is being processed as one or more new
algorithms are loaded. If that happens, we accept the race and may fail the
sysctl gracefully if there is insufficient room to print the names of all the
algorithms.
This should address a WITNESS warning and the potential panic that would occur
if the sbuf call to malloc did sleep whilst holding the CC list rwlock.
Sponsored by: FreeBSD Foundation
Reported by: Nick Hibma
Reviewed by: bz
MFC after: 3 weeks
X-MFC with: r215166
- The mean RTT is updated at the end of each congestion epoch, but if we switch
to congestion avoidance within the first epoch (e.g. if ssthresh was primed
from the hostcache), we'll trigger a divide by zero panic in
cubic_ack_received(). Set the mean to the min in cubic_record_rtt() if the
mean is less than the min to ensure we have a sane mean for use in this
situation. This fixes the panic reported by Nick Hibma.
- Adjust conditions under which we update the mean RTT in cubic_post_recovery()
to ensure a low latency path won't yield an RTT of less than 1. This avoids
another potential divide by zero panic when running CUBIC in networks with
sub-millisecond latencies.
- Remove the "safety" assignment of min into mean when we don't update the mean
because of failed conditions. The above change to the conditions for updating
the mean ensures the safety issue is addressed and I feel it is better to keep
our previous mean estimate around if we can't update than to revert to the
min.
- Initialise the mean RTT to 1 on connection startup to act as a safety belt if
a situation we haven't considered and addressed with the above changes were to
crop up in the wild.
Sponsored by: FreeBSD Foundation
Reported and tested by: Nick Hibma
Discussed with: David Hayes <dahayes at swin edu au>
MFC after: 5 weeks
X-MFC with: r216114
udp endpoint may end up echoing back to the sender
even with OUT joining the multi-cast group.
Reviewed by: gnn, bms, bz?
Obtained from: deischen (with help from)
packets.
*) Reject requests with a protocol length not equal to 4. This is IPv4
and there is no reason to accept anything else.
*) Reject packets that have a multicast source hardware address.
*) Drop requests where the hardware address length is not equal
to the hardware address length of the interface.
Pointed out by: Rozhuk Ivan
MFC after: 1 week
hint is 0 when no SACK data is received to update the hint with. This was
accidentally omitted from r216753.
Sponsored by: FreeBSD Foundation
MFC after: 10 weeks
X-MFC with: 216753
an unbound socket, regardless of any multicast options.
If an address is specified via a multicast option, then
let it override normal the source address selection.
This fixes a bug where source address selection was
not being performed when multicast options were present
but without an interface being specified.
Reviewed by: bz
MFC after: 1 day
(also test for negative MTUs if checking it anyway).
An MTU of 0 is arguably a bug elsewhere, but this at least gives us some
more debugging hints.
Sponsored by: ISPsystem (Early 2010)
MFC after: 1 week
access inbound/outbound events and associated data for established TCP
connections. The hooks only run if at least one hook function is registered
for the hook point, ensuring the impact on the stack is effectively nil when
no TCP Khelp modules are loaded. struct tcp_hhook_data is passed as contextual
data to any registered Khelp module hook functions.
- Add an OSD (Object Specific Data) pointer to struct tcpcb to allow Khelp
modules to associate per-connection data with the TCP control block.
- Bump __FreeBSD_version and add a note to UPDATING regarding to ABI changes
introduced by this commit and r216753.
In collaboration with: David Hayes <dahayes at swin edu au> and
Grenville Armitage <garmitage at swin edu au>
Sponsored by: FreeBSD Foundation
Reviewed by: bz, others along the way
MFC after: 3 months
This will be used by the incoming Enhanced RTT Khelp module.
Sponsored by: FreeBSD Foundation
Submitted by: David Hayes <dahayes at swin edu au>
Reviewed by: bz and others (as part of a larger patch)
MFC after: 3 months
Keep three lines disabled which I am unsure if they had been used at all.
This will allow us to seek testers and possibly bring it all back.
Discussed with: rwatson
MFC after: 7 weeks
algorithm based on the Internet-Draft "draft-leith-tcp-htcp-06.txt". It is
implemented as a kernel module compatible with the recently committed modular
congestion control framework.
H-TCP was designed to provide increased throughput in fast and long-distance
networks. It attempts to maintain fairness when competing with legacy NewReno
TCP in lower speed scenarios where NewReno is able to operate adequately. The
paper "H-TCP: A framework for congestion control in high-speed and long-distance
networks" provides additional detail.
In collaboration with: David Hayes <dahayes at swin edu au> and
Grenville Armitage <garmitage at swin edu au>
Sponsored by: FreeBSD Foundation
Reviewed by: rpaulo (older patch from a few weeks ago)
MFC after: 3 months
algorithm based on the Internet-Draft "draft-rhee-tcpm-cubic-02.txt". It is
implemented as a kernel module compatible with the recently committed modular
congestion control framework.
CUBIC was designed for provide increased throughput in fast and long-distance
networks. It attempts to maintain fairness when competing with legacy NewReno
TCP in lower speed scenarios where NewReno is able to operate adequately. The
paper "CUBIC: A New TCP-Friendly High-Speed TCP Variant" provides additional
detail.
In collaboration with: David Hayes <dahayes at swin edu au> and
Grenville Armitage <garmitage at swin edu au>
Sponsored by: FreeBSD Foundation
Reviewed by: rpaulo (older patch from a few weeks ago)
MFC after: 3 months
somewhere along the way due to mismerging r211464 in our development tree.
- Capture the essence of r211464 in NewReno's after_idle() hook. We don't
use V_ss_fltsz/V_ss_fltsz_local yet which needs to be revisited.
Sponsored by: FreeBSD Foundation
Submitted by: David Hayes <dahayes at swin edu au>
MFC after: 9 weeks
X-MFC with: r215166
DPCPU_DEFINE and VNET_DEFINE macros, as these cause problems for various
people working on the affected files. A better long-term solution is
still being considered. This reversal may give some modules empty
set_pcpu or set_vnet sections, but these are harmless.
Changes reverted:
------------------------------------------------------------------------
r215318 | dim | 2010-11-14 21:40:55 +0100 (Sun, 14 Nov 2010) | 4 lines
Instead of unconditionally emitting .globl's for the __start_set_xxx and
__stop_set_xxx symbols, only emit them when the set_vnet or set_pcpu
sections are actually defined.
------------------------------------------------------------------------
r215317 | dim | 2010-11-14 21:38:11 +0100 (Sun, 14 Nov 2010) | 3 lines
Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout
the tree.
------------------------------------------------------------------------
r215316 | dim | 2010-11-14 21:23:02 +0100 (Sun, 14 Nov 2010) | 2 lines
Add macros to define static instances of VNET_DEFINE and DPCPU_DEFINE.
the SIFTR pfil(9) hook functions to or from all network stacks. This patch
allows packets inbound or outbound from a vnet to be "seen" by SIFTR.
Additional work is required to allow SIFTR to actually generate log messages for
all vnet related packets because the siftr_findinpcb() function does not yet
search for inpcbs across all vnets. This issue will be fixed separately.
Reported and tested by: David Hayes <dahayes at swin edu au>
MFC after: 3 days
Retransmitted Packets
Zero Window Advertisements
Out of Order Receives
These statistics are available via the -T argument to
netstat(1).
MFC after: 2 weeks
vnets to select their own default CC algorithm independent of each other and the
base system. If the base system or a vnet has set a default which gets unloaded,
we reset that netstack's default to NewReno.
Sponsored by: FreeBSD Foundation
Tested by: Mikolaj Golub <to.my.trociny at gmail com>
Reviewed by: bz (briefly)
MFC after: 3 months
is small, so there is no good reason not to declare the buffer at the top.
- Fix a whitespace nit.
Sponsored by: FreeBSD Foundation
MFC after: 11 weeks
X-MFC with: r215166
Any found to be using the algorithm that is about to go away are switched back
to NewReno to avoid leaving dangling pointers which would trigger a panic. For
VIMAGE kernels, there is a list per vnet to walk, yet the implementation was
only examining one of the vnet lists.
Fix the implementation of the above feature for VIMAGE kernels by looping
through all active TCP control blocks across all vnets.
Sponsored by: FreeBSD Foundation
Tested by: Mikolaj Golub <to.my.trociny at gmail com>
Reviewed by: bz (briefly)
MFC after: 11 weeks
runs on boot and each time a vnet jail is created. Running cc_init() multiple
times results in a panic when attempting to initialise the cc_list lock again,
and so r215166 effectively broke the use of vnet jails.
Switch to using a SYSINIT to run cc_init() on boot. CC algorithm modules loaded
on boot register in the same SI_SUB_PROTO_IFATTACHDOMAIN category as is used in
this patch, so cc_init() is run at SI_ORDER_FIRST to ensure the framework is
initialised before module registration is attempted.
Sponsored by: FreeBSD Foundation
Reported and tested by: Mikolaj Golub <to.my.trociny at gmail com>
MFC after: 11 weeks
X-MFC with: r215166
When a fast machine first brings up some non TCP networking program
it is quite possible that we will drop packets due to the fact that
only one packet can be held per ARP entry. This leads to packets
being missed when a program starts or restarts if the ARP data is
not currently in the ARP cache.
This code adds a new sysctl, net.link.ether.inet.maxhold, which defines
a system wide maximum number of packets to be held in each ARP entry.
Up to maxhold packets are queued until an ARP reply is received or
the ARP times out. The default setting is the old value of 1
which has been part of the BSD networking code since time
immemorial.
Expose the time we hold an incomplete ARP entry by adding
the sysctl net.link.ether.inet.wait, which defaults to 20
seconds, the value used when the new ARP code was added..
Reviewed by: bz, rpaulo
MFC after: 3 weeks
the "sockarg" ipfw option matches packets associated to
a local socket and with a non-zero so_user_cookie value.
The value is made available as tablearg, so it can be used
as a skipto target or pipe number in ipfw/dummynet rules.
Code by Paul Joe, manpage by me.
Submitted by: Paul Joe
MFC after: 1 week
Control Algorithms for FreeBSD" FreeBSD Foundation funded project. More details
about the project are available at: http://caia.swin.edu.au/freebsd/5cc/
- Add a KPI and supporting infrastructure to allow modular congestion control
algorithms to be used in the net stack. Algorithms can maintain per-connection
state if required, and connections maintain their own algorithm pointer, which
allows different connections to concurrently use different algorithms. The
TCP_CONGESTION socket option can be used with getsockopt()/setsockopt() to
programmatically query or change the congestion control algorithm respectively
from within an application at runtime.
- Integrate the framework with the TCP stack in as least intrusive a manner as
possible. Care was also taken to develop the framework in a way that should
allow integration with other congestion aware transport protocols (e.g. SCTP)
in the future. The hope is that we will one day be able to share a single set
of congestion control algorithm modules between all congestion aware transport
protocols.
- Introduce a new congestion recovery (TF_CONGRECOVERY) state into the TCP stack
and use it to decouple the meaning of recovery from a congestion event and
recovery from packet loss (TF_FASTRECOVERY) a la RFC2581. ECN and delay based
congestion control protocols don't generally need to recover from packet loss
and need a different way to note a congestion recovery episode within the
stack.
- Remove the net.inet.tcp.newreno sysctl, which simplifies some portions of code
and ensures the stack always uses the appropriate mechanisms for recovering
from packet loss during a congestion recovery episode.
- Extract the NewReno congestion control algorithm from the TCP stack and
massage it into module form. NewReno is always built into the kernel and will
remain the default algorithm for the forseeable future. Implementations of
additional different algorithms will become available in the near future.
- Bump __FreeBSD_version to 900025 and note in UPDATING that rebuilding code
that relies on the size of "struct tcpcb" is required.
Many thanks go to the Cisco University Research Program Fund at Community
Foundation Silicon Valley and the FreeBSD Foundation. Their support of our work
at the Centre for Advanced Internet Architectures, Swinburne University of
Technology is greatly appreciated.
In collaboration with: David Hayes <dahayes at swin edu au> and
Grenville Armitage <garmitage at swin edu au>
Sponsored by: Cisco URP, FreeBSD Foundation
Reviewed by: rpaulo
Tested by: David Hayes (and many others over the years)
MFC after: 3 months
tree in preparation for another large code import. Swinburne University is the
legal entity that owns copyright and the 2-clause BSD licence is acceptable.
even if there is no route out to that mcast address. The code in
in_pcb inadvertantly would error (no route) even though
the user may have specified the address with the
proper socket option (to specify the egress interface).
Thanks bz for reminding me I forgot to commit this ;-)
Reviewed by: bz
MFC after: 1 week
function from the timer code to util, rename it appropriately and
also fix a bug in sctp_get_prev_mtu(), where calling it with a
value existing in the MTU table did not return a smaller one.
MFC after: 3 days.
r198301 itself. It also broke the logic of not sending more than one
ARP request per second, that consequently lead to a potential problem
of flooding network with broadcast packets.
MFC after: 1 week
legacy and IPv6 route destination address.
Previously in case of IPv6, there was a memory overwrite due to not enough
space for the IPv6 address.
PR: kern/122565
MFC After: 2 weeks
Make it harder to exploit certain in_control() related races between the
intiial lookup at the beginning and the time we will remove the entry
from the lists by re-checking that entry is still in the list before
trying to remove it.
(*) It is believed that with the current code and locking strategy we
cannot completely fix all race.
Reported by: Nima Misaghian (nima_misa hotmail.com) on net@ 20100817
Tested by: Nima Misaghian (nima_misa hotmail.com) (original version)
PR: kern/146250
Submitted by: Mikolaj Golub (to.my.trociny gmail.com) (different version)
MFC after: 1 week