When the TCP_LOG option is used to enable logging on a listening
socket, inherit this if the listener is not auto selected and does
not have a log id set.
Reviewed by: cc
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D38436
These flags are TCP specific. While here, make also several LRO
internal functions to pass tcpcb pointer instead of inpcb one.
Reviewed by: rrs
Differential Revision: https://reviews.freebsd.org/D39698
We need to make the syncache aware of the flag and not do ECN if its set. Note that this
is not 100% full proof but the best we can do (i.e. its still possible that you can get in a
situation where the peer try's to do ecn).
Reviewed by: tuexen, glebius, rscheff
Sponsored by: Netflix Inc
Differential Revision:https://reviews.freebsd.org/D39672
Avoid magic numbers when handling the IPv6 flow ID for
DSCP and ECN fields and use the named variable instead.
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D39503
So stack switching as always been a bit of a issue. We currently use a break before make setup which means that
if something goes wrong you have to try to get back to a stack. This patch among a lot of other things changes that so
that it is a make before break. We also expand some of the function blocks in prep for new features in rack that will allow
more controlled pacing. We also add other abilities such as the pathway for a stack to query a previous stack to acquire from
it critical state information so things in flight don't get dropped or mis-handled when switching stacks. We also add the
concept of a timer granularity. This allows an alternate stack to change from the old ticks granularity to microseconds and
of course this even gives us a pathway to go to nanosecond timekeeping if we need to (something for the data center to consider
for sure).
Once all this lands I will then update rack to begin using all these new features.
Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D39210
syncache_socket() does some unnecessary work: before connecting the PCB,
it saves the local address on the stack and restores it before freeing
the PCB in case of an error. However:
- There's no need to restore the old address in the error case.
- The PCB's local address will always be equal to that of the syncache
entry anyway.
So just remove this unnecessary code, which appears to date from the
introduction of the syncache 20+ years ago.
No functional change intended.
Reviewed by: tuexen, glebius
Sponsored by: Klara, Inc.
Sponsored by: Modirum MDPay
Differential Revision: https://reviews.freebsd.org/D38391
Do the cast from sockaddr to either IPv4 or IPv6 sockaddr in the
protocol's pr_connect method and from there on go down the call
stack with family specific argument.
Reviewed by: markj
Differential revision: https://reviews.freebsd.org/D38356
The SYN_RCVD state count is tricky here due to default code path and TFO
being so different. In the default case the count is incremented when a
syncache entry is added to the the database in syncache_insert(). Later
when connection transitions from syncache entry to a socket in
syncache_expand(), this counter is inherited by the tcpcb. If socket or
tcpcb allocation failed in syncache_socket() failed the syncache_expand()
is responsible for decrement. In the TFO case the syncache entry is not
inserted into database and count of SYN_RCVD is first incremented in the
syncache_tfo_expand() after successful socket allocation. Thus, inside
syncache_socket() we can't tell whether we need to decrement in a case of
a failure or not. The caller is responsible for this book keeping.
Fixes: 07285bb4c2
Differential revision: https://reviews.freebsd.org/D37610
They are shared between UDP over IPv4 and over IPv6. To prevent all
possible kernel build failures wrap them in #ifdef _SYS_PROTOSW_H_.
Prompted by feedback from jhb@ and jrtc27@ on c93db4abf4.
When multiple <SYN> segments are received, update the <SYN,ACK>
sent in response to the latest IP ECN and TCP ECN information.
On retransmitting the <SYN,ACK>, once ECN maxtries are done, not
only disable RFC3168 ECN, but AccECN also.
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D36875
On passive sessions, honor the local settings disabling or
enabling window scaling and timestamp options.
Reviewed By: tuexen, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D36874
Rack has had the ability to timeout connections that just sit idle automatically. This
feature of course is off by default and requires the user set it on (though the socket option
has been missing in tcp_usrreq.c). Lets get the progress timeout fully supported in
the base stack as well as rack.
Reviewed by: tuexen
Sponsored by: Netflix Inc
Differential Revision: https://reviews.freebsd.org/D36716
- The soisconnected() call on transition from SYN_RCVD to ESTABLISHED
is also necessary for a half-synchronized connection. Fix that
just setting the flag, when we transfer SYN-SENT -> SYN-RECEIVED.
- Provide a comment that explains at what conditions the call to
soisconnected() is necessary.
- Hence mechanically rename the TF_INCQUEUE flag to TF_SONOTCONN.
- Extend the change to the BBR and RACK stacks.
Note: the interaction between the accept_filter(9) and the socket layer
is not fully consistent, yet. For most accept filters this call to
soisconnected() will not move the connection from the incomplete queue
to the complete. The move would happen only when the filter has received
the desired data, and soisconnected() would be called once again from
sorwakeup(). Ideally, we should mark socket as connected only there,
and leave the soisconnected() from SYN_RCVD->ESTABLISHED only for the
simultaneous open case. However, this doesn't yet work.
Reviewed by: rscheff, tuexen, rrs
Differential revision: https://reviews.freebsd.org/D36641
This call existed since pre-FreeBSD times, and it is hard to understand
why it was there in the first place. After 6f3caa6d81 it definitely
became necessary always and commit message from f1ee30ccd6 confirms that.
Now that 6f3caa6d81 is effectively backed out by 07285bb4c2, the call
appears to be useful only for sockets that landed on the incomplete queue,
e.g. sockets that have accept_filter(9) enabled on them.
Provide a new TCP flag to mark connections that are known to be on the
incomplete queue, and call soisconnected() only for those connections.
Reviewed by: rrs, tuexen
Differential revision: https://reviews.freebsd.org/D36488
This streamlines cloning of a socket from a listener. Now we do not
drop the inpcb lock during creation of a new socket, do not do useless
state transitions, and put a fully initialized socket+inpcb+tcpcb into
the listen queue.
Before this change, first we would allocate the socket and inpcb+tcpcb via
tcp_usr_attach() as TCPS_CLOSED, link them into global list of pcbs, unlock
pcb and put this onto incomplete queue (see 6f3caa6d81). Then, after
sonewconn() we would lock it again, transition into TCPS_SYN_RECEIVED,
insert into inpcb hash, finalize initialization of tcpcb. And then, in
call into tcp_do_segment() and upon transition to TCPS_ESTABLISHED call
soisconnected(). This call would lock the listening socket once again
with a LOR protection sequence and then we would relocate the socket onto
the complete queue and only now it is ready for accept(2).
Reviewed by: rrs, tuexen
Differential revision: https://reviews.freebsd.org/D36064
Improve the "syncache: mbuf too small" assertion message with various
variables (some not actually needed) but enough that it will be obvious
if (a) we use IPv4 or IPv6, (b) if UDP tunneling is on, (c) what
max_linkhdr is, and (d) what MHLEN is.
This should help diagnostics in the future.
The case was hit with wireless drivers setting a large ic_headroom
and using IPv6.
Reviewed by: gallatin, tuexen, rscheff
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D34217
Reduce the burden to maintain correct and
extensible ECN related code across multiple
stacks and codepaths.
Formally no functional change.
Incidentially this establishes correct
ECN operation in one instance.
Reviewed By: rrs, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D34162
Reduce the burden to maintain correct and
extensible ECN related code across multiple
stacks and codepaths.
Formally no functional change.
Incidentially this establishes correct
ECN operation in one instance.
Reviewed By: rrs, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D34162
In order to consistently provide access to all
(including reserved) TCP header flag bits,
use an accessor function tcp_get_flags and
tcp_set_flags. Also expand any flag variable from
uint8_t / char to uint16_t.
Reviewed By: hselasky, tuexen, glebius, #transport
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D34130
When TCP_MD5SIG is set on a socket, all packets are dropped that don't
contain an MD5 signature. Relax this behavior to accept a non-signed
packet when a security association doesn't exist with the peer.
This is useful when a listen socket set with TCP_MD5SIG wants to handle
connections protected with and without MD5 signatures.
Reviewed by: bz (previous version)
Sponsored by: nepustil.net
Sponsored by: Klara Inc.
Differential Revision: https://reviews.freebsd.org/D33227
This reverts commit 266f97b5e9, reversing
changes made to a10253cffe.
A mismerge of a merge to catch up to main resulted in files being
committed which should not have been.
With upcoming changes to the inpcb synchronisation it is going to be
broken. Even its current status after the move of PCB synchronization
to the network epoch is very questionable.
This experimental feature was sponsored by Juniper but ended never to
be used in Juniper and doesn't exist in their source tree [sjg@, stevek@,
jtl@]. In the past (AFAIK, pre-epoch times) it was tried out at Netflix
[gallatin@, rrs@] with no positive result and at Yandex [ae@, melifaro@].
I'm up to resurrecting it back if there is any interest from anybody.
Reviewed by: rrs
Differential revision: https://reviews.freebsd.org/D33020
are unneccessary and used to be there before TFO as an invariant. With
TFO and after 8d5719aa74 the "so" value is still needed.
Reported & tested by: tuexen
Fixes: 8d5719aa74
Adding support for TCP over UDP allows communication with
TCP stacks which can be implemented in userspace without
requiring special priviledges or specific support by the OS.
This is joint work with rrs.
Reviewed by: rrs
Sponsored by: Netflix, Inc.
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D29469
A security feature from c06f087ccb appeared to be a huge bottleneck
under SYN flood. To mitigate that add a sysctl that would make
syncache(4) globally visible, ignoring UID/GID, jail(2) and mac(4)
checks. When turned on, we won't need to call crhold() on the listening
socket credential for every incoming SYN packet.
Reviewed by: bz
When packet is a SYN packet, we don't need to modify any existing PCB.
Normally SYN arrives on a listening socket, we either create a syncache
entry or generate syncookie, but we don't modify anything with the
listening socket or associated PCB. Thus create a new PCB lookup
mode - rlock if listening. This removes the primary contention point
under SYN flood - the listening socket PCB.
Sidenote: when SYN arrives on a synchronized connection, we still
don't need write access to PCB to send a challenge ACK or just to
drop. There is only one exclusion - tcptw recycling. However,
existing entanglement of tcp_input + stacks doesn't allow to make
this change small. Consider this patch as first approach to the problem.
Reviewed by: rrs
Differential revision: https://reviews.freebsd.org/D29576
the negotiation of TCP features. This affects most TCP options but
adherance to RFC7323 with the timestamp option will prevent a session
from getting established.
PR: 253576
Reviewed By: tuexen, #transport
MFC after: 3 days
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D28652
When timestamp support has been negotiated, TCP segements received
without a timestamp should be discarded. However, there are broken
TCP implementations (for example, stacks used by Omniswitch 63xx and
64xx models), which send TCP segments without timestamps although
they negotiated timestamp support.
This patch adds a sysctl variable which tolerates such TCP segments
and allows to interoperate with broken stacks.
Reviewed by: jtl@, rscheff@
Differential Revision: https://reviews.freebsd.org/D28142
Sponsored by: Netflix, Inc.
PR: 252449
MFC after: 1 week
* TCP segments without timestamps should be dropped when support for
the timestamp option has been negotiated.
* TCP segments with timestamps should be processed normally if support
for the timestamp option has not been negotiated.
This patch enforces the above.
PR: 250499
Reviewed by: gnn, rrs
MFC after: 1 week
Sponsored by: Netflix, Inc
Differential Revision: https://reviews.freebsd.org/D27148
* Let the accepted TCP/IPv4 socket inherit the configured TTL and
TOS value.
* Let the accepted TCP/IPv6 socket inherit the configured Hop Limit.
* Use the configured Hop Limit and Traffic Class when sending
IPv6 packets.
Reviewed by: rrs, lutz_donnerhacke.de
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D25909
sure that
* ECN is disabled if the client sends an non-ECN-setup SYN segment.
* ECN is disabled is the ECN-setup SYN-ACK segment is retransmitted more
than net.inet.tcp.ecn.maxretries times.
Reviewed by: rscheff
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D26008
cookies, use the same flow label for the segments sent during the
handshake and after the handshake.
This fixes a bug by making sure that sc_flowlabel is always stored in
network byte order.
Reviewed by: bz@
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D23957
sending a TCP segment from the TCP SYN cache (like a SYN-ACK).
This fix initialises it to zero. This is correct for the ECN bits,
but is does not honor the DSCP what an application might have set via
the IPPROTO_IPV6 level socket options IPV6_TCLASS. That will be
fixed separately.
Reviewed by: Richard Scheffenegger
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D23900
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.
This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.
Mark all obvious cases as MPSAFE. All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT
Approved by: kib (mentor, blanket)
Commented by: kib, gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D23718
the RFC and only enable ECN when both the
CWR and ECT bits our set within the SYN packet.
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D23645
When expanding a SYN-cache entry to a socket/inp a two step approach was
taken:
1) The local address was filled in, then the inp was added to the hash
table.
2) The remote address was filled in and the inp was relocated in the
hash table.
Before the epoch changes, a write lock was held when this happens and
the code looking up entries was holding a corresponding read lock.
Since the read lock is gone away after the introduction of the
epochs, the half populated inp was found during lookup.
This resulted in processing TCP segments in the context of the wrong
TCP connection.
This patch changes the above procedure in a way that the inp is fully
populated before inserted into the hash table.
Thanks to Paul <devgs@ukr.net> for reporting the issue on the net@
mailing list and for testing the patch!
Reviewed by: rrs@
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D22971
This allows adding more ECN related flags in the future.
No functional change intended.
Submitted by: Richard Scheffenegger
Reviewed by: rrs@, tuexen@
Differential Revision: https://reviews.freebsd.org/D22497
improvements, the ECN bits need to be exposed to the TCP SYNcache.
This change is a minimal modification to the function headers, without any
functional change intended.
Submitted by: Richard Scheffenegger
Reviewed by: rgrimes@, rrs@, tuexen@
Differential Revision: https://reviews.freebsd.org/D22436