Calculate the correct amount of bytes that are in-flight for a connection as
suggested by RFC 6675.
MFC: r292046
r290122 added 4 bytes and removed 8 in struct sackhint. Add a pad entry of 4
bytes to restore the size.
Implementation of server-side TCP Fast Open (TFO) [RFC7413].
TFO is disabled by default in the kernel build. See the top comment
in sys/netinet/tcp_fastopen.c for implementation particulars.
Differential Revision: https://reviews.freebsd.org/D4350
Sponsored by: Verisign, Inc.
In the same way fix the problem described in r291578 for IGMPv3.
In case when router has a lot of multicast groups, the reply can take
several packets due to MTU limitation.
Also we have a limit IGMP_MAX_RESPONSE_BURST == 4, that limits the number
of packets we send in one shot. Then we recalculate the timer value and
schedule the remaining packets for sending.
The problem is that when we call igmp_v3_dispatch_general_query() to send
remaining packets, we queue new reply in the same mbuf queue. And when
number of packets is bigger than IGMP_MAX_RESPONSE_BURST, we get endless
reply of IGMPv3 reports.
To fix this, add the check for remaining packets in the queue.
The r241129 description was wrong that the scenario is possible
only for read locks on pcbs. The same race can happen with write
lock semantics as well.
The race scenario:
- Two threads (1 and 2) locate pcb with writer semantics (INPLOOKUP_WLOCKPCB)
and do in_pcbref() on it.
- 1 and 2 both drop the inp hash lock.
- Another thread (3) grabs the inp hash lock. Then it runs in_pcbfree(),
which wlocks the pcb. They must happen faster than 1 or 2 come INP_WLOCK()!
- 1 and 2 congest in INP_WLOCK().
- 3 does in_pcbremlists(), drops hash lock, and runs in_pcbrele_wlocked(),
which doesn't free the pcb due to two references on it.
Then it unlocks the pcb.
- 1 (or 2) gets wlock on the pcb, runs in_pcbrele_wlocked(), which doesn't
report inp as freed, due to 2 (or 1) still helding extra reference on it.
The thread tries to do smth with a disconnected pcb and crashes.
Submitted by: emeric.poupon@stormshield.eu
Reviewed by: glebius@
Sponsored by: Stormshield
Tested by: Cassiano Peixoto, Stormshield
Turning on IPSEC used to introduce a slight amount of performance
degradation (7%) for host host TCP connections over 10Gbps links,
even when there were no secuirty policies in place. There is no
change in performance on 1Gbps network links. Testing GENERIC vs.
GENERIC-NOIPSEC vs. GENERIC with this change shows that the new
code removes any overhead introduced by having IPSEC always in the
kernel.
Differential Revision: D3993
Sponsored by: Rubicon Communications (Netgate)
Fix an unnecessarily aggressive behavior where mtu clamping begins on first
retransmission timeout (rto) when blackhole detection is enabled. Make
sure it only happens when the second attempt to send the same segment also fails
with rto.
Also make sure that each mtu probing stage (usually 1448 -> 1188 -> 524) follows
the same pattern and gets 2 chances (rto) before further clamping down.
Note: RFC4821 doesn't specify implementation details on how this situation
should be handled.
Update TSO limits to include all headers.
To make driver programming easier the TSO limits are changed to
reflect the values used in the BUSDMA tag a network adapter driver is
using. The TCP/IP network stack will subtract space for all linklevel
and protocol level headers and ensure that the full mbuf chain passed
to the network adapter fits within the given limits. See r287775
for a more detailed description.
Differential Revision: https://reviews.freebsd.org/D3477
Reviewed by: rmacklem
Fix wrong formatting of 0.0.0.0/X table records in ipfw(8).
Add `flags` u16 field to the hole in ipfw_table_xentry structure.
Kernel has been guessing address family for supplied record based
on xent length size.
Userland, however, has been getting fixed-size ipfw_table_xentry structures
guessing address family by checking address by IN6_IS_ADDR_V4COMPAT().
Fix this behavior by providing specific IPFW_TCF_INET flag for IPv4 records.
PR: bin/189471,kern/200169
Since the IETF has redefined the meaning of the tos field to accommodate
a set of differentiated services, set IPTOS_PREC_* macros using
IPTOS_DSCP_* macro definitions.
While here, add IPTOS_DSCP_VA macro according to RFC 5865.
Differential Revision: https://reviews.freebsd.org/D3119
Reviewed by: gnn
Avoid a situation where we do not set persist timer after a zero window
condition.
If you send a 0-length packet, but there is data is the socket buffer, and
neither the rexmt or persist timer is already set, then activate the persist
timer.
PR: 192599
Approved by: re (delphij)
Check TCP timestamp option flag so that the automatic receive buffer
scaling code does not use an uninitialized timestamp echo reply value
from the stack when timestamps are not enabled.
Approved by: re (gjb)
Extend fixes made in r278103 and r38754 by copying the complete packet
header and not only partial flags and fields. Firewalls can attach
classification tags to the outgoing mbufs which should be copied to
all the new fragments. Else only the first fragment will be let
through by the firewall. This can easily be tested by sending a large
ping packet through a firewall. It was also discovered that VLAN
related flags and fields should be copied for packets traversing
through VLANs. This is all handled by "m_dup_pkthdr()".
Regarding the MAC policy check in ip_fragment(), the tag provided by
the originating mbuf is copied instead of using the default one
provided by m_gethdr().
Tested by: Karim Fodil-Lemelin <fodillemlinkarim at gmail.com>
Sponsored by: Mellanox Technologies
PR: 7802
In case of an output error, continue with the next net, don't try to
continue sending on the same net.
This fixes a bug where an invalid mbuf chain was constructed, if a
full size frame of control chunks should be sent and there is a
output error.
Based on a discussion with rrs@, change move to the next net. This fixes
the bug and improves the behaviour.
Thanks to Irene Ruengeler for spending a lot of time in narrowing this
problem down.
Fix a callout race condition introduced in TCP timers callouts with r281599.
In TCP timer context, it is not enough to check callout_stop() return value
to decide if a callout is still running or not, previous callout_reset()
return values have also to be checked.
Change struct attribute to avoid aligned operations mismatch
Previous __alignment(4) allowed compiler to assume that operations are
performed on aligned region. On ARM processor, this led to alignment fault
Remove in_gif.h and in6_gif.h files. They only contain function
declarations used by gif(4). Instead declare these functions in C files.
Also make some variables static.
MFC r276215:
Extern declarations in C files loses compile-time checking that
the functions' calls match their definitions. Move them to header files.
Overhaul if_gre(4).
Split it into two modules: if_gre(4) for GRE encapsulation and
if_me(4) for minimal encapsulation within IP.
gre(4) changes:
* convert to if_transmit;
* rework locking: protect access to softc with rmlock,
protect from concurrent ioctls with sx lock;
* correct interface accounting for outgoing datagramms (count only payload size);
* implement generic support for using IPv6 as delivery header;
* make implementation conform to the RFC 2784 and partially to RFC 2890;
* add support for GRE checksums - calculate for outgoing datagramms and check
for inconming datagramms;
* add support for sending sequence number in GRE header;
* remove support of cached routes. This fixes problem, when gre(4) doesn't
work at system startup. But this also removes support for having tunnels with
the same addresses for inner and outer header.
* deprecate support for various GREXXX ioctls, that doesn't used in FreeBSD.
Use our standard ioctls for tunnels.
me(4):
* implementation conform to RFC 2004;
* use if_transmit;
* use the same locking model as gre(4);
PR: 164475
MFC r274289 (by bz):
gcc requires variables to be initialised in two places. One of them
is correctly used only under the same conditional though.
For module builds properly check if the kernel supports INET or INET6,
as otherwise various mips kernels without IPv6 support would fail to build.
MFC r274964:
Add ip_gre.h to ObsoleteFiles.inc.
- Virtualize interface cloner for gre(4). This fixes a panic when destroying
a vnet jail which has a gre(4) interface.
- Make net.link.gre.max_nesting vnet-local.
Remove route chaching support from ipsec code. It isn't used for some time.
* remove sa_route_union declaration and route_cache member from struct secashead;
* remove key_sa_routechange() call from ICMP and ICMPv6 code;
* simplify ip_ipsec_mtu();
* remove #include <net/route.h>;
Sponsored by: Yandex LLC
Add an ability accept encapsulated packets from different sources by one
gif(4) interface. Add new option "ignore_source" for gif(4) interface.
When it is enabled, gif's encapcheck function requires match only for
packet's destination address.
Differential Revision: https://reviews.freebsd.org/D2004
Sponsored by: Yandex LLC
Take source and destination address into account when determining
the scope.
This fixes a problem when a client with a global address
connects to a server with a private address.
Thanks to Irene Ruengeler in helping me to find the issue.
Fix a bug where messages would not be sent in SHUTDOWN_RECEIVED state.
This problem was reported by Mark Bonnekessel and Markus Boese.
Thanks to Irene Ruengeler for helping me to fix the cause of
the problem. It can be tested with the following packetdrill script:
+0.0 socket(..., SOCK_STREAM, IPPROTO_SCTP) = 3
+0.0 fcntl(3, F_GETFL) = 0x2 (flags O_RDWR)
+0.0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
// Check the handshake with an empty(!) cookie
+0.1 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
+0.0 > sctp: INIT[flgs=0, tag=1, a_rwnd=..., os=..., is=..., tsn=0, ...]
+0.1 < sctp: INIT_ACK[flgs=0, tag=2, a_rwnd=10000, os=1, is=1, tsn=0, STATE_COOKIE[len=4, val=...]]
+0.0 > sctp: COOKIE_ECHO[flgs=0, len=4, val=...]
+0.1 < sctp: COOKIE_ACK[flgs=0]
+0.0 getsockopt(3, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
+0.0 write(3, ..., 1024) = 1024
+0.0 > sctp: DATA[flgs=BE, len=1040, tsn=0, sid=0, ssn=0, ppid=0]
+0.0 write(3, ..., 1024) = 1024 // Pending due to Nagle
+0.0 < sctp: SHUTDOWN[flgs=0, cum_tsn=0]
+0.0 > sctp: DATA[flgs=BE, len=1040, tsn=1, sid=0, ssn=1, ppid=0]
+0.0 < sctp: SACK[flgs=0, cum_tsn=1, a_rwnd=10000, gaps=[], dups=[]] // Do we need another SHUTDOWN here?
+0.0 > sctp: SHUTDOWN_ACK[flgs=0]
+0.0 < sctp: SHUTDOWN_COMPLETE[flgs=0]
+0.0 close(3) = 0
Ensure that the COOKIE-ACK can be sent over UDP if the COOKIE-ECHO was
received over UDP.
Thanks to Felix Weinrank for makeing me aware of the problem and to
Irene Ruengeler for providing the fix.