We hold the SOCKBUF_LOCK so use soroverflow_locked here.
This bug may manifest as a non-killable process stuck in [*so_rcv].
Approved by: scottl
Reviewed by: Roy Marples <roy@marples.name>
Fixes: 7045b1603bdf
MFC after: 10 days
Differential Revision: https://reviews.freebsd.org/D31374
SO_RERROR indicates that receive buffer overflows should be handled as
errors. Historically receive buffer overflows have been ignored and
programs could not tell if they missed messages or messages had been
truncated because of overflows. Since programs historically do not
expect to get receive overflow errors, this behavior is not the
default.
This is really really important for programs that use route(4) to keep
in sync with the system. If we loose a message then we need to reload
the full system state, otherwise the behaviour from that point is
undefined and can lead to chasing bogus bug reports.
Reviewed by: philip (network), kbowling (transport), gbe (manpages)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D26652
If the socket is configured such that the sender is expected to supply
the IP header, then we need to verify that it actually did so.
Reported by: syzkaller+KMSAN
Reviewed by: donner
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31302
Import OpenBSD's syncookie support for pf. This feature help pf resist
TCP SYN floods by only creating states once the remote host completes
the TCP handshake rather than when the initial SYN packet is received.
This is accomplished by using the initial sequence numbers to encode a
cookie (hence the name) in the SYN+ACK response and verifying this on
receipt of the client ACK.
Reviewed by: kbowling
Obtained from: OpenBSD
MFC after: 1 week
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D31138
Fix a bug in VNET handling, which occurs when using specific NICs.
PR: 257195
Reviewed by: rrs
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D31212
There is a bug in the error path where rack_bbr_common does a m_pullup() and the pullup fails.
There is a stray mfree(m) after m is set to NULL. This is not a good idea :-)
Reviewed by: tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D31194
Currently the LRO parser, if given a packet that say has ETH+IP header but the TCP header
is in the next mbuf (split), would walk garbage. Lets make sure we keep track as we
parse of the length and return NULL anytime we exceed the length of the mbuf.
Reviewed by: tuexen, hselasky
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D31195
In reviewing tcp_lro.c we have a possibility that some drives may send a mbuf into
LRO without making sure that the checksum passes. Some drivers actually are
aware of this and do not call lro when the csum failed, others do not do this and
thus could end up sending data up that we think has a checksum passing when
it does not.
This change will fix that situation by properly verifying that the mbuf
has the correct markings (CSUM VALID bits as well as csum in mbuf header
is set to 0xffff).
Reviewed by: tuexen, hselasky, gallatin
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D31155
The packet_limit can fall to 0, leading to a divide by zero abort in
the "packets % packet_limit".
An possible solution would be to apply a lower limit of 1 after the
calculation of packet_limit, but since any number modulo 1 gives 0,
the more efficient solution is to skip the modulo operation for
packet_limit <= 1.
Since this is a fix for a panic observed in stable/12, merging this
fix to stable/12 and stable/13 before expiry of the 3 day waiting
period might be justified, if it works for the reporter of the issue.
Reported by: Karl Denninger <karl@denninger.net>
MFC after: 3 days
This fixes the incorrect use of a sysctl add to u64. It
was for a useconds time, but on 32 bit platforms its
not a u64. Instead use the long directive.
Reviewed by: tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D31107
When fixing another bug, I noticed that the alternate
TCP stacks do not build when various combinations of
ipv4 and ipv6 are disabled.
Reviewed by: rrs, tuexen
Differential Revision: https://reviews.freebsd.org/D31094
Sponsored by: Netflix
HPTS drives both rack and bbr, and yet there have been many complaints
about performance. This bit of work restructures hpts to help reduce CPU
overhead. It does this by now instead of relying on the timer/callout to
drive it instead use user return from a system call as well as lro flushes
to drive hpts. The timer becomes a backstop that dynamically adjusts
based on how "late" we are.
Reviewed by: tuexen, glebius
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D31083
There are several cases where we make a goodput measurement and we are running
out of data when we decide to make the measurement. In reality we should not make
such a measurement if there is no chance we can have "enough" data. There is also
some corner case TLP's that end up not registering as a TLP like they should, we
fix this by pushing the doing_tlp setup to the actual timeout that knows it did
a TLP. This makes it so we always have the appropriate flag on the sendmap
indicating a TLP being done as well as count correctly so we make no more
that two TLP's.
In addressing the goodput lets also add a "quality" metric that can be viewed via
blackbox logs so that a casual observer does not have to figure out how good
of a measurement it is. This is needed due to the fact that we may still make
a measurement that is of a poorer quality as we run out of data but still have
a minimal amount of data to make a measurement.
Reviewed by: tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D31076
Ifnet (inline) hw kTLS NICs typically keep state within
a TLS record, so that when transmitting in-order,
they can continue encryption on each segment sent without
DMA'ing extra state from the host.
This breaks down when transmits are out of order (eg,
TCP retransmits). In this case, the NIC must re-DMA
the entire TLS record up to and including the segment
being retransmitted. This means that when re-transmitting
the last 1448 byte segment of a TLS record, the NIC will
have to re-DMA the entire 16KB TLS record. This can lead
to the NIC running out of PCIe bus bandwidth well before
it saturates the network link if a lot of TCP connections have
a high retransmoit rate.
This change introduces a new sysctl (kern.ipc.tls.ifnet_max_rexmit_pct),
where TCP connections with higher retransmit rate will be
switched to SW kTLS so as to conserve PCIe bandwidth.
Reviewed by: hselasky, markj, rrs
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D30908
The kernel part of ipfw(8) does initialize LibAlias uncondistionally
with an zeroized port range (allowed ports from 0 to 0). During
restucturing of libalias, port ranges are used everytime and are
therefor initialized with different values than zero. The secondary
initialization from ipfw (and probably others) overrides the new
default values and leave the instance in an unfunctional state. The
obvious solution is to detect such reinitializations and use the new
default value instead.
MFC after: 3 days
The expiration time of direct address mappings is explicitly
uninitialized. Expire times are always compared during housekeeping.
Despite the uninitialized value does not harm, it's simpler to just
set it to a reasonable default. This was detected during valgrinding
the test suite.
MFC after: 3 days
Comparing elements in a tree requires transitiviy. If a < b and b < c
then a must be smaller than c. This way the tree elements are always
pairwise comparable.
Tristate comparsion functions returning values lower, equal, or
greater than zero, are usually implemented by a simple subtraction of
the operands. If the size of the operands are equal to the size of
the result, integer modular arithmetics kick in and violates the
transitivity.
Example:
Working on byte with 0, 120, and 240. Now computing the differences:
120 - 0 = 120
240 - 120 = 120
240 - 0 = -16
MFC after: 3 days
Some TCP stacks negotiate TS support, but do not send TS at all
or not for keep-alive segments. Since this includes modern widely
deployed stacks, tolerate the violation of RFC 7323 per default.
Reviewed by: rgrimes, rrs, rscheff
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D30740
Sponsored by: Netflix, Inc.
Hardware TLS is now supported in some interface cards and it works well. Except that
when we have connections that retransmit a lot we get into trouble with all the retransmits.
This prep step makes way for change that Drew will be making so that we can "kick out" a
session from hardware TLS.
Reviewed by: mtuexen, gallatin
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30895
There were two bugs that prevented V4 sockets from connecting to
a rack server running a V4/V6 socket. As well as a bug that stops the
mapped v4 in V6 address from working.
Reviewed by: mtuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D30885
Compiling libalias results in warnings about unused functions.
Those warnings are caused by clang's heuristic to consider an inline
function as in use, iff the declaration is in a *.c file.
Declarations in *.h files do not emit those warnings.
Hence the declarations must be moved to an extra *.h file.
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D30844
tcp_twcheck now expects a read lock on the inp for the SYN case
instead of a write lock.
Reviewed by: np
Fixes: 1db08fbe3ffa tcp_input: always request read-locking of PCB for any pure SYN segment.
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D30782
are unneccessary and used to be there before TFO as an invariant. With
TFO and after 8d5719aa74f the "so" value is still needed.
Reported & tested by: tuexen
Fixes: 8d5719aa74f1d1441ee5ee365d45d53f934e81d6
Current data structure is using a hash of unordered lists. Those
unordered lists are quite efficient, because the least recently
inserted entries are most likely to be used again. In order to avoid
long search times in other cases, the lists are hashed into many
buckets. Unfortunatly a search for a miss needs an exhaustive
inspection and a careful definition of the hash.
Splay trees offer a similar feature: Almost O(1) for access of the
least recently used entries, and amortized O(ln(n)) for almost all
other cases. Get rid of the hash.
Now the data structure should able to quickly react to external
packets without eating CPU cycles for breakfast, preventing a DoS.
PR: 192888
Discussed with: Dimitry Luhtionov
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D30536
Current data structure is using a hash of unordered lists. Those
unordered lists are quite efficient, because the least recently
inserted entries are most likely to be used again. In order to avoid
long search times in other cases, the lists are hashed into many
buckets. Unfortunatly a search for a miss needs an exhaustive
inspection and a careful definition of the hash.
Splay trees offer a similar feature - almost O(1) for access of the
least recently used entries), and amortized O(ln(n) - for almost all
other cases. Get rid of the hash.
Discussed with: Dimitry Luhtionov
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D30516
The entry deleteAllLinks in the struct libalias is only used to signal
a state between internal calls. It's not used between API calls.
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D30604
Get rid of PORT_BASE, replace by AliasRange. Simplify code.
Factor out the search for a new port. Improves the perfomance a bit.
Discussed with: Dimitry Luhtionov
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D30581
Let PPTP use its own data structure.
Regroup and rename other lists, which are not PPTP.
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D30580
Reorder incoming links by grouping of common search terms.
Significant performance improvement for incoming (missing) flows.
Remove LSNAT from outgoing search.
Slight speedup due to less comparsions in the loop.
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D30574
Factor out the outgoing search function.
Preparation for a new data structure.
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D30572
Separate the partially specified links into a separate data structure.
This would causes a major parformance impact, if there are many of
them. Use a (smaller) hash table to speed up the partially link
access.
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D30570
This completes PRR cwnd reduction in all circumstances
for the base TCP stack (SACK loss recovery, ECN window reduction,
non-SACK loss recovery), preventing the arriving ACKs to
clock out new data at the old, too high rate. This
reduces the chance to induce additional losses while
recovering from loss (during congested network conditions).
For non-SACK loss recovery, each ACK is assumed to have
one MSS delivered. In order to prevent ACK-split attacks,
only one window worth of ACKs is considered to actually
have delivered new data.
MFC after: 6 weeks
Reviewed By: rrs, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D29441
Search fully specified links first. Some performance loss due to need
to revisit the db twice, if not found.
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D30569
Factor out the common Out and In filter
Slightly better performance due to eager skip of search loop
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D30568
Summary:
- Use LibAliasTime as a real global variable for central timekeeping.
- Reduce number of syscalls in user space considerably.
- Dynamically adjust the packet counters to match the second resolution.
- Only check the first few packets after a time increase for expiry.
Discussed with: hselasky
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D30566
Stats counters are used as unsigned valued (i.e. printf("%u")) but are
defined as signed int. This causes trouble later, so fix it early.
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D30587