The in_pcbdrop() KPI, which is used solely by TCP, allows to remove a
pcb from hash list and mark it as dropped. The comment suggests that
such pcb won't be returned by lookups. Indeed, every call to
in_pcblookup*() is accompanied by a check for INP_DROPPED. Do what
comment suggests: never return such pcbs and remove unnecessary checks.
Reviewed by: tuexen
Differential revision: https://reviews.freebsd.org/D37061
This adds the capability for a modular congestion control
to select which variant of ECN-capable-transport it wants to use
when sending out elegible segments. As an initial CC to utilize
this, DCTCP was selected.
Event: IETF 115 Hackathon
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D24869
Keep all ECN related code in (mostly) one place.
No functional change.
Event: IETF 115 Hackathon
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D37285
Accurate ECN asks to conservatively estimate, when the
ACE counter may have wrapped due to a single ACK covering a larger
number of segments. This is described in Annex A.2 of the
accurate-ecn draft.
Event: IETF 115 Hackathon
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D37281
Provide diagnostics information around AccECN into
the tcpinfo struct.
Event: IETF 115 Hackathon
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D37280
Currently SO_REUSEPORT_LB silently does nothing when set by a jailed
process. It is trivial to support this option in VNET jails, but it's
also useful in traditional jails.
This patch enables LB groups in jails with the following semantics:
- all PCBs in a group must belong to the same jail,
- PCB lookup prefers jailed groups to non-jailed groups
This is a straightforward extension of the semantics used for individual
listening sockets. One pre-existing quirk of the lbgroup implementation
is that non-jailed lbgroups are searched before jailed listening
sockets; that is preserved with this change.
Discussed with: glebius
MFC after: 1 month
Sponsored by: Modirum MDPay
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D37029
If a memory allocation failure causes bind to fail, we should take the
inpcb back out of its LB group since it's not prepared to handle
connections.
Reviewed by: glebius
MFC after: 2 weeks
Sponsored by: Modirum MDPay
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D37027
Some auditing of the code shows that "cred" is never non-NULL in these
functions, either because all callers pass a non-NULL reference or
because they unconditionally dereference "cred". So, let's simplify the
code a bit and remove NULL checks. No functional change intended.
Reviewed by: glebius
MFC after: 1 week
Sponsored by: Modirum MDPay
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D37025
It is called only from tcp_input() and always has valid parameter.
Reviewed by: rscheff, tuexen
Differential revision: https://reviews.freebsd.org/D37115
The suppresion was added in 5f311da2cc with no explanation in the
commit message of the exact problem that was fixed. In the BSDCan
2006 talk [1], slides 12 to 14, we can find that it seems that there
was some problem with the TIME_WAIT state not properly being handled
on the remote side (also FreeBSD!), and this switching off the
suppression had hidden the problem. The rationale of the change was
that other stacks may also be buggy wrt the TIME_WAIT.
I did not find the actual problem in TIME_WAIT that the suppression
has hidden, neither a commit that would fix it. However, since that
time we started to handle SYNs with RFC5961 instead of RFC793, see
3220a2121c. We also now have the tcp-testsuite [2], that has full
coverage of all possible scenarios of receiving SYN in TIME_WAIT.
This effectively reverts 5f311da2cc
and 6ee79c59d2.
[1] https://www.bsdcan.org/2006/papers/ImprovingTCPIP.pdf
[2] https://github.com/freebsd-net/tcp-testsuite
Reviewed by: rscheff
Discussed with: rscheff, rrs, tuexen
Differential revision: https://reviews.freebsd.org/D37042
The assertion was incorrectly removed in 0d7445193a. The leak of
a TIME-WAIT state into tfb_do_segment_nounlock method was fixed in
31bc602ff8. The TIME-WAIT connections are processed by the main
tcp_input() always.
We recently got rid of the explicit INP_TIMEWAIT state, this has caused some
minor breakage to both rack and bbr. Basically the timewait check that was
in tcp_lro.c is now gone. This means that compressed_ack and mbuf_queued
packets will arrive at TCP without going through tcp_input_with_port(). We need
to expand the check that was stripped to look at the tcp_state (t_state) and
not "LRO" packets that are in the TCPS_TIMEWAIT state.
Reviewed by: tuexen, gliebus
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D37080
tcp_respond is another function to build a tcp control packet
quickly. With ECN++ and AccECN, both the IP ECN header, and
the TCP ECN flags are supposed to reflect the correct state.
Also ensure that on receiving multiple ECN SYN-ACKs, the
responses triggered will reflect the latest state.
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D36973
- Use C99 types uintXX_t instead of u_intXX_t.
- Try to make space/tab usage a little bit more consistent.
- Shorten comments to fit into 80 chars.
Not a functional change, just making future changes easier to read.
Ensure that the initiate tag of the INIT-ACK chunk is used as the
verification tag of the packet containing the ABORT chunk.
Reported by: Suganya Dharma
MFC after: 1 week
Similar to 2cd6ad766e for inet6 drop ifma_restart use, creating more
problems than solving. It is no longer needed after epoch introduction.
While there, add NULL check for ifma_ifp in igmp_change_state(), that
sometimes caused panics on interface destruction.
MFC after: 2 weeks
When multiple <SYN> segments are received, update the <SYN,ACK>
sent in response to the latest IP ECN and TCP ECN information.
On retransmitting the <SYN,ACK>, once ECN maxtries are done, not
only disable RFC3168 ECN, but AccECN also.
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D36875
On passive sessions, honor the local settings disabling or
enabling window scaling and timestamp options.
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D36874
Mechanically cleanup INP_TIMEWAIT from the kernel sources. After
0d7445193a, this commit shall not cause any functional changes.
Note: this flag was very often checked together with INP_DROPPED.
If we modify in_pcblookup*() not to return INP_DROPPED pcbs, we
will be able to remove most of this checks and turn them to
assertions. Some of them can be turned into assertions right now,
but that should be carefully done on a case by case basis.
Differential revision: https://reviews.freebsd.org/D36400
The memory savings the tcptw brought back in 2003 (see 340c35de6a) no
longer justify the complexity required to maintain it. For longer
explanation please check out the email [1].
Surpisingly through almost 20 years the TCP stack functionality of
handling the TIME_WAIT state with a normal tcpcb did not bitrot. The
existing tcp_input() properly handles a tcpcb in TCPS_TIME_WAIT state,
which is confirmed by the packetdrill tcp-testsuite [2].
This change just removes tcptw and leaves INP_TIMEWAIT. The flag will
be removed in a separate commit. This makes it easier to review and
possibly debug the changes.
[1] https://lists.freebsd.org/archives/freebsd-net/2022-January/001206.html
[2] https://github.com/freebsd-net/tcp-testsuite
Differential revision: https://reviews.freebsd.org/D36398
TCP has an idle-reduce feature that allows a connection to reduce its
cwnd after it has been idle more than an RTT. This feature only works
for a sending side connection. It does this by at output checking the
idle time (t_rcvtime vs ticks) to see if its more than the RTO timeout.
The problem comes if you are a web server. You get a request and
then send out all the data.. then go idle. The next time you would
send is in response to a request from the peer asking for more data.
But the thing is you updated t_rcvtime when the request came in so
you never reduce.
The fix is to do the idle reduce check also on inbound.
Reviewed by: tuexen, rscheff
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D36721
Two functions to call tcp_drop() and tcp_close() from a callout context.
Garbage collect tcp_inpinfo_lock_del(), it has a single use now.
Differential revision: https://reviews.freebsd.org/D36397
In the original design of the network stack from the protocol control
input method pr_ctlinput was used notify the protocols about two very
different kinds of events: internal system events and receival of an
ICMP messages from outside. These events were coded with PRC_ codes.
Today these methods are removed from the protosw(9) and are isolated
to IPv4 and IPv6 stacks and are called only from icmp*_input(). The
PRC_ codes now just create a shim layer between ICMP codes and errors
or actions taken by protocols.
- Change ipproto_ctlinput_t to pass just pointer to ICMP header. This
allows protocols to not deduct it from the internal IP header.
- Change ip6proto_ctlinput_t to pass just struct ip6ctlparam pointer.
It has all the information needed to the protocols. In the structure,
change ip6c_finaldst fields to sockaddr_in6. The reason is that
icmp6_input() already has this address wrapped in sockaddr, and the
protocols want this address as sockaddr.
- For UDP tunneling control input, as well as for IPSEC control input,
change the prototypes to accept a transparent union of either ICMP
header pointer or struct ip6ctlparam pointer.
- In icmp_input() and icmp6_input() do only validation of ICMP header and
count bad packets. The translation of ICMP codes to errors/actions is
done by protocols.
- Provide icmp_errmap() and icmp6_errmap() as substitute to inetctlerrmap,
inet6ctlerrmap arrays.
- In protocol ctlinput methods either trust what icmp_errmap() recommend,
or do our own logic based on the ICMP header.
Differential revision: https://reviews.freebsd.org/D36731
Now these functions are called only from icmp*_input(). The pointer
to the ICMP data is never NULL and cmd has a limited set of values.
In the past the functions were demultiplexing control messages from
ICMP layer, as well as internally generated events. In the latter
case the the pointer to IP would be NULL.
Differential revision: https://reviews.freebsd.org/D36729
and mark those PRC_* codes, that are used. The rest are dead code.
This is not a functional change, but illustrative to make easier
review of following changes.
After decoupling of protosw(9) and IP wire protocols in 78b1fc05b2 for
IPv4 we got vector ip_ctlprotox[] that is executed only and only from
icmp_input() and respectively for IPv6 we got ip6_ctlprotox[] executed
only and only from icmp6_input(). This allows to use protocol specific
argument types in these methods instead of struct sockaddr and void.
Reviewed by: melifaro
Differential revision: https://reviews.freebsd.org/D36727
The netinet/ipprotosw.h and netinet6/ip6protosw.h were KAME relics, with
the former removed in f0ffb944d2 in 2001 and the latter survived until
today. It has been reduced down to only one useful declaration that
moves to ip6_var.h
Reviewed by: melifaro
Differential revision: https://reviews.freebsd.org/D36726
With this change one can make a forward declaration of a function
that is of UDP tunneling type.
Reviewed by: melifaro
Differential revision: https://reviews.freebsd.org/D36724
Ensure that an RFC3168 ECN reaction only occurs on non-SYN
segments.
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D36867