"qsort()".
The kernel's "qsort()" routine can in worst case spend O(N*N) amount of
comparisons before the input array is sorted. It can also recurse a
significant amount of times using up the kernel's interrupt thread
stack.
The custom sorting routine takes advantage of that the sorting key is
only 64 bits. Based on set and cleared bits in the sorting key it
partitions the array until it is sorted. This process has a recursion
limit of 64 times, due to the number of set and cleared bits which can
occur. Compiled with -O2 the sorting routine was measured to use
64-bytes of stack. Multiplying this by 64 gives a maximum stack
consumption of 4096 bytes for AMD64. The same applies to the execution
time, that the array to be sorted will not be traversed more than 64
times.
When serving roughly 80Gb/s with 80K TCP connections, the old method
consisting of "qsort()" and "tcp_lro_mbuf_compare_header()" used 1.4%
CPU, while the new "tcp_lro_sort()" used 1.1% for LRO related sorting
as measured by Intel Vtune. The testing was done using a sysctl to
toggle between "qsort()" and "tcp_lro_sort()".
Differential Revision: https://reviews.freebsd.org/D6472
Sponsored by: Mellanox Technologies
Tested by: Netflix
Reviewed by: gallatin, rrs, sephe, transport
* include the SCTP common header, if possible
* include the first 8 bytes of the INIT chunk, if possible
This provides the necesary information for the receiver of the ICMP
packet to process it.
MFC after: 1 week
control to a three way setting.
0 - Totally disable ECN. (no change)
1 - Enable ECN if incoming connections request it. Outgoing
connections will request ECN. (no change from present != 0 setting)
2 - Enable ECN if incoming connections request it. Outgoing
conections will not request ECN.
Change the default value of net.inet.tcp.ecn.enable from 0 to 2.
Linux version 2.4.20 and newer, Solaris, and Mac OS X 10.5 and newer have
similar capabilities. The actual values above match Linux, and the default
matches the current Linux default.
Reviewed by: eadler
MFC after: 1 month
MFH: yes
Sponsored by: https://reviews.freebsd.org/D6386
structures in the add of a new tcp-stack that came in late to me
via email after the last commit. It also makes it so that a new
stack may optionally get a callback during a retransmit
timeout. This allows the new stack to clear specific state (think
sack scoreboards or other such structures).
Sponsored by: Netflix Inc.
Differential Revision: http://reviews.freebsd.org/D6303
objects with the same name in different sets.
Add optional manage_sets() callback to objects rewriting framework.
It is intended to implement handler for moving and swapping named
object's sets. Add ipfw_obj_manage_sets() function that implements
generic sets handler. Use new callback to implement sets support for
lookup tables.
External actions objects are global and they don't support sets.
Modify eaction_findbyname() to reflect this.
ipfw(8) now may fail to move rules or sets, because some named objects
in target set may have conflicting names.
Note that ipfw_obj_ntlv type was changed, but since lookup tables
actually didn't support sets, this change is harmless.
Obtained from: Yandex LLC
Sponsored by: Yandex LLC
There was the requirement that two structures are in sync,
which is not valid anymore. Therefore don't rely on this
in the code anymore.
Thanks to Radek Malcic for reporting the issue. He found this
when using the userland stack.
MFC after: 1 week
Ease more work concerning active list, e.g. hash table etc.
Reviewed by: gallatin, rrs (earlier version)
Sponsored by: Microsoft OSTC
Differential Revision: https://reviews.freebsd.org/D6137
chunk, enable UDP encapsulation for all those addresses.
This helps clients using a userland stack to support multihoming if
they are not behind a NAT.
MFC after: 1 week
This is currently only a code change without any functional
change. But this allows to set the remote encapsulation port
in a more detailed way, which will be provided in a follow-up
commit.
MFC after: 1 week
So the underlying drivers can use it to select the sending queue
properly for SYN|ACK instead of rolling their own hash.
Sponsored by: Microsoft OSTC
Differential Revision: https://reviews.freebsd.org/D6120
interested in having tunneled UDP and finding out about the
ICMP (tested by Michael Tuexen with SCTP.. soon to be using
this feature).
Differential Revision: http://reviews.freebsd.org/D5875
async_drain functionality. This as been tested in NF as well as
by Verisign. Still to do in here is to remove all the old flags. They
are currently left being maintained but probably are no longer needed.
Sponsored by: Netflix Inc.
Differential Revision: http://reviews.freebsd.org/D5924
In principle n is only used to carry a copy of ipi_count, which is
unsigned, in the non-VIMAGE case, however ipi_count can be used
directly so it is not needed at all. Removing it makes things look
cleaner.
The disgusting macro INP_WLOCK_RECHECK may early-return. In
tcp_default_ctloutput() the TCP_CCALGOOPT case allocates memory before invoking
this macro, which may leak memory.
Add a _CLEANUP variant that takes a code argument to perform variable cleanup
in the early return path. Use it to free the 'pbuf' allocated in
tcp_default_ctloutput().
I am not especially happy with this macro, but I reckon it's not any worse than
INP_WLOCK_RECHECK already was.
Reported by: Coverity
CID: 1350286
Sponsored by: EMC / Isilon Storage Division
tp->snd_wnd. This can happen, for example, when the remote side responds to
a window probe by ACKing the one byte it contains.
Differential Revision: https://reviews.freebsd.org/D5625
Reviewed by: hiren
Obtained from: Juniper Networks (earlier version)
MFC after: 2 weeks
Sponsored by: Juniper Networks
It allows implementing loadable kernel modules with new actions and
without needing to modify kernel headers and ipfw(8). The module
registers its action handler and keyword string, that will be used
as action name. Using generic syntax user can add rules with this
action. Also ipfw(8) can be easily modified to extend basic syntax
for external actions, that become a part base system.
Sample modules will coming soon.
Obtained from: Yandex LLC
Sponsored by: Yandex LLC
containing an INIT chunk. These need to be handled in case the peer
does not support SCTP and returns an ICMP messages indicating destination
unreachable, protocol unreachable.
MFC after: 1 week
the outer IP header, the ICMP header, the inner IP header and the
first n bytes are stored in contgous memory. The ctlinput functions
currently rely on this for n = 8. This fixes a bug in case the inner IP
header had options.
While there, remove the options from the outer header and provide a
way to increase n to allow improved ICMP handling for SCTP. This will
be added in another commit.
MFC after: 1 week
It does not cause any real issues because the variable is overwritten
only when the packet is forwarded (and the variable is not used anymore).
Obtained from: pfSense
MFC after: 2 weeks
Sponsored by: Rubicon Communications (Netgate)
is required to check the verification tag. However, this
requires the verification tag to be not 0. Enforce this.
For packets with a verification tag of 0, we need to
check it it contains an INIT chunk and use the initiate
tag for the validation. This will be a separate commit,
since it touches also other code.
MFC after: 1 week
It looks like as with the safety belt of DELAY() fastened (*) we can
completely tear down and free all memory for TCP (after r281599).
(*) in theory a few ticks should be good enough to make sure the timers
are all really gone. Could we use a better matric here and check a
tcbcb count as an optimization?
PR: 164763
Reviewed by: gnn, emaste
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D5734
The tcp_inpcb (pcbinfo) zone should be safe to destroy.
PR: 164763
Reviewed by: gnn
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D5732
We attach the "counter" to the tcpcbs. Thus don't free the
TCP Fastopen zone before the tcpcbs are gone, as otherwise
the zone won't be empty.
With that it should be safe to destroy the "tfo" zone without
leaking the memory.
PR: 164763
Reviewed by: gnn
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D5731
While there is no dependency interaction, stopping the timer before
freeing the rest of the resources seems more natural and avoids it
being scheduled an extra time when it is no longer needed.
Reviewed by: gnn, emaste
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D5733
No need to keep type stability on raw sockets zone.
We've also been running with a KASSERT since r222488 to make sure the
ipi_count is 0 on destroy.
PR: 164763
Reviewed by: gnn
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D5735
adds the new I-Data (Interleaved Data) message. This allows a user
to be able to have complete freedom from Head Of Line blocking that
was previously there due to the in-ability to send multiple large
messages without the TSN's being in sequence. The code as been
tested with Michaels various packet drill scripts as well as
inter-networking between the IETF's location in Argentina and Germany.
This is kinda critical to the performance when the CPU is slow and
network bandwidth is high, e.g. in the hypervisor.
Reviewed by: rrs, gallatin, Dexuan Cui <decui microsoft com>
Sponsored by: Microsoft OSTC
Differential Revision: https://reviews.freebsd.org/D5765
And factor out tcp_lro_rx_done, which deduplicates the same logic with
netinet/tcp_lro.c
Reviewed by: gallatin (1st version), hps, zbb, np, Dexuan Cui <decui microsoft com>
Sponsored by: Microsoft OSTC
Differential Revision: https://reviews.freebsd.org/D5725
So that callers could react accordingly.
Reviewed by: gallatin (no objection)
MFC after: 1 week
Sponsored by: Microsoft OSTC
Differential Revision: https://reviews.freebsd.org/D5695
- properly V_irtualise variable access unbreaking VIMAGE kernels.
- remove the volatile from the function return type to make architecture
using gcc happy [-Wreturn-type]
"type qualifiers ignored on function return type"
I am not entirely happy with this solution putting the u_int there
but it will do for now.
route caching for TCP, with some improvements. In particular, invalidate
the route cache if a new route is added, which might be a better match.
The cache is automatically invalidated if the old route is deleted.
Submitted by: Mike Karels
Reviewed by: gnn
Differential Revision: https://reviews.freebsd.org/D4306
Furthermore, there is no reason this needs to be a 64-bit integer
for the forseeable future.
Also, there is an inconsistency between to_flags and the mask in
tcp_addoptions(). Before r195654, to_flags was a u_long and the mask in
tcp_addoptions() was a u_int. r195654 changed to_flags to be a u_int64_t
but left the mask in tcp_addoptions() as a u_int, meaning that these
variables will only be the same width on platforms with 64-bit integers.
Convert both to_flags and the mask in tcp_addoptions() to be explicitly
32-bit variables. This may save a few cycles on 32-bit platforms, and
avoids unnecessarily mixing types.
Differential Revision: https://reviews.freebsd.org/D5584
Reviewed by: hiren
MFC after: 2 weeks
Sponsored by: Juniper Networks
struct tcpstat, because the structure can be zeroed out by netstat(1) -z,
and of course running connection counts shouldn't be touched.
Place running connection counts into separate array, and provide
separate read-only sysctl oid for it.
stack is not compliant with RFC 7323, which requires that TCP stacks send
a timestamp option on all packets (except, optionally, RSTs) after the
session is established.
This patch adds that support. It also adds a TCP signature option to the
packet, if appropriate.
PR: 206047
Differential Revision: https://reviews.freebsd.org/D4808
Reviewed by: hiren
MFC after: 2 weeks
Sponsored by: Juniper Networks
- Reorder variables by size
- Move initializer closer to where it is used
- Remove unneeded variable
Differential Revision: https://reviews.freebsd.org/D4808
Reviewed by: hiren
MFC after: 2 weeks
Sponsored by: Juniper Networks
for output and drop; connect didn't always fire a user probe
some probes were missing in fastpath
Submitted by: Hannes Mehnert
Sponsored by: REMS, EPSRC
Differential Revision: https://reviews.freebsd.org/D5525
included in loader.conf. It also fixes it so that no matter if some one incorrectly
specifies a load order, the lists and such will be initialized on demand at that
time so no one can make that mistake.
Reviewed by: hiren
Differential Revision: D5189
ACK aggregation limit is append count based, while the TCP data segment
aggregation limit is length based. Unless the network driver sets these
two limits, it's an NO-OP.
Reviewed by: adrian, gallatin (previous version), hselasky (previous version)
Approved by: adrian (mentor)
MFC after: 1 week
Sponsored by: Microsoft OSTC
Differential Revision: https://reviews.freebsd.org/D5185
Fix a panic that occurs when a vnet interface is unavailable at the time the
vnet jail referencing said interface is stopped.
Sponsored by: FIS Global, Inc.