Centre for Advanced Internet Architectures
Implementing AQM in FreeBSD
* Overview <http://caia.swin.edu.au/freebsd/aqm/index.html>
* Articles, Papers and Presentations
<http://caia.swin.edu.au/freebsd/aqm/papers.html>
* Patches and Tools <http://caia.swin.edu.au/freebsd/aqm/downloads.html>
Overview
Recent years have seen a resurgence of interest in better managing
the depth of bottleneck queues in routers, switches and other places
that get congested. Solutions include transport protocol enhancements
at the end-hosts (such as delay-based or hybrid congestion control
schemes) and active queue management (AQM) schemes applied within
bottleneck queues.
The notion of AQM has been around since at least the late 1990s
(e.g. RFC 2309). In recent years the proliferation of oversized
buffers in all sorts of network devices (aka bufferbloat) has
stimulated keen community interest in four new AQM schemes -- CoDel,
FQ-CoDel, PIE and FQ-PIE.
The IETF AQM working group is looking to document these schemes,
and independent implementations are a corner-stone of the IETF's
process for confirming the clarity of publicly available protocol
descriptions. While significant development work on all three schemes
has occured in the Linux kernel, there is very little in FreeBSD.
Project Goals
This project began in late 2015, and aims to design and implement
functionally-correct versions of CoDel, FQ-CoDel, PIE and FQ_PIE
in FreeBSD (with code BSD-licensed as much as practical). We have
chosen to do this as extensions to FreeBSD's ipfw/dummynet firewall
and traffic shaper. Implementation of these AQM schemes in FreeBSD
will:
* Demonstrate whether the publicly available documentation is
sufficient to enable independent, functionally equivalent implementations
* Provide a broader suite of AQM options for sections the networking
community that rely on FreeBSD platforms
Program Members:
* Rasool Al Saadi (developer)
* Grenville Armitage (project lead)
Acknowledgements:
This project has been made possible in part by a gift from the
Comcast Innovation Fund.
Submitted by: Rasool Al-Saadi <ralsaadi@swin.edu.au>
X-No objection: core
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D6388
Not all mbufs passed up from device drivers are M_WRITABLE(). In
particular, the Chelsio T4/T5 driver uses a feature called "buffer packing"
to receive multiple frames in a single receive buffer. The mbufs for
these frames all share the same external storage so are treated as
read-only by the rest of the stack when multiple frames are in flight.
Previously tcp_respond() would blindly overwrite read-only mbufs when
INVARIANTS was disabled or panic with an assertion failure if INVARIANTS
was enabled. Note that the new case is a bit of a mix of the two other
cases in tcp_respond(). The TCP and IP headers must be copied explicitly
into the new mbuf instead of being inherited (similar to the m == NULL
case), but the addresses and ports must be swapped in the reply (similar
to the m != NULL case).
Reviewed by: glebius
"qsort()".
The kernel's "qsort()" routine can in worst case spend O(N*N) amount of
comparisons before the input array is sorted. It can also recurse a
significant amount of times using up the kernel's interrupt thread
stack.
The custom sorting routine takes advantage of that the sorting key is
only 64 bits. Based on set and cleared bits in the sorting key it
partitions the array until it is sorted. This process has a recursion
limit of 64 times, due to the number of set and cleared bits which can
occur. Compiled with -O2 the sorting routine was measured to use
64-bytes of stack. Multiplying this by 64 gives a maximum stack
consumption of 4096 bytes for AMD64. The same applies to the execution
time, that the array to be sorted will not be traversed more than 64
times.
When serving roughly 80Gb/s with 80K TCP connections, the old method
consisting of "qsort()" and "tcp_lro_mbuf_compare_header()" used 1.4%
CPU, while the new "tcp_lro_sort()" used 1.1% for LRO related sorting
as measured by Intel Vtune. The testing was done using a sysctl to
toggle between "qsort()" and "tcp_lro_sort()".
Differential Revision: https://reviews.freebsd.org/D6472
Sponsored by: Mellanox Technologies
Tested by: Netflix
Reviewed by: gallatin, rrs, sephe, transport
* include the SCTP common header, if possible
* include the first 8 bytes of the INIT chunk, if possible
This provides the necesary information for the receiver of the ICMP
packet to process it.
MFC after: 1 week
control to a three way setting.
0 - Totally disable ECN. (no change)
1 - Enable ECN if incoming connections request it. Outgoing
connections will request ECN. (no change from present != 0 setting)
2 - Enable ECN if incoming connections request it. Outgoing
conections will not request ECN.
Change the default value of net.inet.tcp.ecn.enable from 0 to 2.
Linux version 2.4.20 and newer, Solaris, and Mac OS X 10.5 and newer have
similar capabilities. The actual values above match Linux, and the default
matches the current Linux default.
Reviewed by: eadler
MFC after: 1 month
MFH: yes
Sponsored by: https://reviews.freebsd.org/D6386
structures in the add of a new tcp-stack that came in late to me
via email after the last commit. It also makes it so that a new
stack may optionally get a callback during a retransmit
timeout. This allows the new stack to clear specific state (think
sack scoreboards or other such structures).
Sponsored by: Netflix Inc.
Differential Revision: http://reviews.freebsd.org/D6303
objects with the same name in different sets.
Add optional manage_sets() callback to objects rewriting framework.
It is intended to implement handler for moving and swapping named
object's sets. Add ipfw_obj_manage_sets() function that implements
generic sets handler. Use new callback to implement sets support for
lookup tables.
External actions objects are global and they don't support sets.
Modify eaction_findbyname() to reflect this.
ipfw(8) now may fail to move rules or sets, because some named objects
in target set may have conflicting names.
Note that ipfw_obj_ntlv type was changed, but since lookup tables
actually didn't support sets, this change is harmless.
Obtained from: Yandex LLC
Sponsored by: Yandex LLC
There was the requirement that two structures are in sync,
which is not valid anymore. Therefore don't rely on this
in the code anymore.
Thanks to Radek Malcic for reporting the issue. He found this
when using the userland stack.
MFC after: 1 week
Ease more work concerning active list, e.g. hash table etc.
Reviewed by: gallatin, rrs (earlier version)
Sponsored by: Microsoft OSTC
Differential Revision: https://reviews.freebsd.org/D6137
chunk, enable UDP encapsulation for all those addresses.
This helps clients using a userland stack to support multihoming if
they are not behind a NAT.
MFC after: 1 week
This is currently only a code change without any functional
change. But this allows to set the remote encapsulation port
in a more detailed way, which will be provided in a follow-up
commit.
MFC after: 1 week
So the underlying drivers can use it to select the sending queue
properly for SYN|ACK instead of rolling their own hash.
Sponsored by: Microsoft OSTC
Differential Revision: https://reviews.freebsd.org/D6120
interested in having tunneled UDP and finding out about the
ICMP (tested by Michael Tuexen with SCTP.. soon to be using
this feature).
Differential Revision: http://reviews.freebsd.org/D5875
async_drain functionality. This as been tested in NF as well as
by Verisign. Still to do in here is to remove all the old flags. They
are currently left being maintained but probably are no longer needed.
Sponsored by: Netflix Inc.
Differential Revision: http://reviews.freebsd.org/D5924
In principle n is only used to carry a copy of ipi_count, which is
unsigned, in the non-VIMAGE case, however ipi_count can be used
directly so it is not needed at all. Removing it makes things look
cleaner.
The disgusting macro INP_WLOCK_RECHECK may early-return. In
tcp_default_ctloutput() the TCP_CCALGOOPT case allocates memory before invoking
this macro, which may leak memory.
Add a _CLEANUP variant that takes a code argument to perform variable cleanup
in the early return path. Use it to free the 'pbuf' allocated in
tcp_default_ctloutput().
I am not especially happy with this macro, but I reckon it's not any worse than
INP_WLOCK_RECHECK already was.
Reported by: Coverity
CID: 1350286
Sponsored by: EMC / Isilon Storage Division
tp->snd_wnd. This can happen, for example, when the remote side responds to
a window probe by ACKing the one byte it contains.
Differential Revision: https://reviews.freebsd.org/D5625
Reviewed by: hiren
Obtained from: Juniper Networks (earlier version)
MFC after: 2 weeks
Sponsored by: Juniper Networks