An IPv4 address is embedded into an ifaddr which is freed
via epoch. And the in_ifaddrhead is already a CK list. Use
the network epoch to protect against use after free.
Next step would be to CK-ify the in_addr hash and get rid of the...
Reviewed by: melifaro
Differential Revision: https://reviews.freebsd.org/D32434
Tracking the number of unused holes in the trie and the range table
was a bad metric based on which full trie and / or range rebuilds
were triggered, which would happen in vain by far too frequently,
particularly with live BGP feeds.
Instead, track the total unused space inside the trie and range table
structures, and trigger rebuilds if the percentage of unused space
exceeds a sysctl-tunable threshold.
MFC after: 3 days
PR: 257965
DSACK accounting has been for quite some time under a NETFLIX_STATS ifdef. Statistics
on DSACKs however are very useful in figuring out how much bad retransmissions you
are doing. This is further complicated, however, by stacks that do TLP. A TLP
when discovering a lost ack in the reverse path will cause the generation
of a DSACK. For this situation we introduce a new dsack-tlp-bytes as well
as the more traditional dsack-bytes and dsack-packets. These will now
all display in netstat -p tcp -s. This also updates all stacks that
are currently built to keep track of these stats.
Reviewed by: tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D32158
Traversing a single list of unused range chunks in search for a block
of optimal size was suboptimal.
The experience with real-world BGP workloads has shown that on average
unused range chunks are tiny, mostly in length from 1 to 4 or 5, when
DXR is configured with K = 20 which is the current default (D16X4R).
Therefore, introduce a limited amount of buckets to accomodate descriptors
of empty blocks of fixed (small) size, so that those can be found in O(1)
time. If no empty chunks of the requested size can be found in fixed-size
buckets, the search continues in an unsorted list of empty chunks of
variable lengths, which should only happen infrequently.
This change should permit us to manage significantly more empty range
chunks without sacrifying the speed of incremental range table updating.
MFC after: 3 days
The compressed ack path of rack is not following proper procedures in updating
the peers window. It should be checking the seq and ack values before updating and
instead it is blindly updating the values. This could in theory get the wrong window
in the connection for some length of time.
Reviewed by: tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D32082
In extensive testing in NF we have found two issues inside
the rack stack.
1) An incorrect offset is being generated by the fast send path when a fast send is initiated on
the end of the socket buffer and before the fast send runs, the sb_compress macro adds data to the trailing socket.
This fools the fast send code into thinking the sb offset changed and it miscalculates a "updated offset".
It should only do that when the mbuf in question got smaller.. i.e. an ack was processed. This can lead to
a panic deref'ing a NULL mbuf if that packet is ever retransmitted. At the best case it leads to invalid data being
sent to the client which usually terminates the connection. The fix is to have the proper logic (that is in the rsm fast path)
to make sure we only update the offset when the mbuf shrinks.
2) The other issue is more bothersome. The timestamp check in rack needs to use the msec timestamp when
comparing the timestamp echo to now. It was using a microsecond timestamp which ends up giving error
prone results but causes only small harm in trying to identify which send to use in RTT calculations if its a retransmit.
Reviewed by: tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D32062
Callers are getting the stcb send lock, so just KASSERT that.
No need to signal this when calling stream scheduler functions.
No functional change intended.
MFC after: 1 week
ktls_get_(rx|tx)_mode() can return an errno value or a TLS mode, so
errors are effectively hidden. Fix this by using a separate output
parameter. Convert to the new socket buffer locking macros while here.
Note that the socket buffer lock is not needed to synchronize the
SOLISTENING check here, we can rely on the PCB lock.
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31977
The address with a host part of all zeros was used as a broadcast long
ago, but the default has been all ones since 4.3BSD and RFC1122. Until
now, we would broadcast the host zero address as well as the configured
address. Change to not broadcasting that address by default, but add a
sysctl (net.inet.ip.broadcast_lowest) to re-enable it. Note that the
correct way to use the zero address for broadcast would be to configure
it as the broadcast address for the network.
See https:/datatracker.ietf.org/doc/draft-schoen-intarea-lowest-address/
and the discussion in https://reviews.freebsd.org/D19316. Note, Linux
now implements this.
Reviewed by: rgrimes, tuexen; melifaro (previous version)
MFC after: 1 month
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D31861
A division by zero would occur if DXR would be activated on a vnet
with no IP addresses configured on any interfaces.
PR: 257965
MFC after: 3 days
Reported by: Raul Munoz
If the "len" variable is non-zero, we can assume that the sum of
"tp->t_snd_rxt_bytes + tp->t_sndbytes" is also non-zero.
It is also assumed that the 64-bit byte counters will never wrap around.
Differential Revision: https://reviews.freebsd.org/D31959
Reviewed by: gallatin, rrs and tuexen
Found by: "I told you so", also called hselasky
MFC after: 1 week
Sponsored by: NVIDIA Networking
Move the type and function pointers for operations on existing send
tags (modify, query, next, free) out of 'struct ifnet' and into a new
'struct if_snd_tag_sw'. A pointer to this structure is added to the
generic part of send tags and is initialized by m_snd_tag_init()
(which now accepts a switch structure as a new argument in place of
the type).
Previously, device driver ifnet methods switched on the type to call
type-specific functions. Now, those type-specific functions are saved
in the switch structure and invoked directly. In addition, this more
gracefully permits multiple implementations of the same tag within a
driver. In particular, NIC TLS for future Chelsio adapters will use a
different implementation than the existing NIC TLS support for T6
adapters.
Reviewed by: gallatin, hselasky, kib (older version)
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D31572
There are two flags to request a non-blocking receive on a socket:
MSG_NBIO and MSG_DONTWAIT. They are handled a bit differently in that
soreceive_generic() and soreceive_stream() will block on the socket I/O
lock when MSG_NBIO is set, but not if MSG_DONTWAIT is set. In general,
MSG_NBIO seems to mean, "don't block if there is no data to receive" and
MSG_DONTWAIT means "don't go to sleep for any reason".
SCTP's soreceive implementation did not allow blocking on the I/O lock
if either flag is set, but this violates an assumption in
aio_process_sb(), which specifies MSG_NBIO but nonetheless
expects to make progress if data is available to read. Change
sctp_sorecvmsg() to block on the I/O lock only if MSG_DONTWAIT
is not set.
Reported by: syzbot+c7d22dbbb9aef509421d@syzkaller.appspotmail.com
Reviewed by: tuexen
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31915
All callers of sctp_aloc_assoc() mark the PCB as connected after a
successful call (for one-to-one-style sockets). In all cases this is
done without the PCB lock, so the PCB's flags can be corrupted. We also
do not atomically check whether a one-to-one-style socket is a listening
socket, which violates various assumptions in solisten_proto().
We need to hold the PCB lock across all of sctp_aloc_assoc() to fix
this. In order to do that without introducing lock order reversals, we
have to hold the global info lock as well.
So:
- Convert sctp_aloc_assoc() so that the inp and info locks are
consistently held. It returns with the association lock held, as
before.
- Fix an apparent bug where we failed to remove an association from a
global hash if sctp_add_remote_addr() fails.
- sctp_select_a_tag() is called when initializing an association, and it
acquires the global info lock. To avoid lock recursion, push locking
into its callers.
- Introduce sctp_aloc_assoc_connected(), which atomically checks for a
listening socket and sets SCTP_PCB_FLAGS_CONNECTED.
There is still one edge case in sctp_process_cookie_new() where we do
not update PCB/socket state correctly.
Reviewed by: tuexen
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31908
Free memory before return from arprequest_internal(). In in_arpinput(),
if arp_fillheader() fails, it should use goto drop.
Reviewed by: melifaro, imp, markj
MFC after: 1 week
Pull Request: https://github.com/freebsd/freebsd-src/pull/534
When port reuse is enabled in a one-to-one-style socket, sctp_listen()
may call sctp_swap_inpcb_for_listen() to move the PCB out of the "TCP
pool". In so doing it will drop the PCB lock, yielding an LOR since we
now hold several socket locks. Reorder sctp_listen() so that it
performs this operation before beginning the conversion to a listening
socket. Also modify sctp_swap_inpcb_for_listen() to return with PCB
write-locked, since that's what sctp_listen() expects now.
Reviewed by: tuexen
Fixes: bd4a39cc93d9 ("socket: Properly interlock when transitioning to a listening socket")
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31879
After commit bd4a39cc93d9 we now hold the global inp info lock across
the call to sctp_swap_inpcb_for_listen(), which attempts to acquire it
again. Since sctp_swap_inpcb_for_listen()'s sole caller is
sctp_listen(), we can simply change it to not try to acquire the lock.
Reported by: syzbot+a76b19ea2f8e1190c451@syzkaller.appspotmail.com
Reported by: syzbot+a1b6cef257ad145b7187@syzkaller.appspotmail.com
Reviewed by: tuexen
Fixes: bd4a39cc93d9 ("socket: Properly interlock when transitioning to a listening socket")
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31878
Current logic always selects an IFA of the same family from the
outgoing interfaces. In IPv4 over IPv6 setup there can be just
single non-127.0.0.1 ifa, attached to the loopback interface.
Create a separate rt_getifa_family() to handle entire ifa selection
for the IPv4 over IPv6.
Differential Revision: https://reviews.freebsd.org/D31868
MFC after: 1 week
... when applied to one-to-one-style sockets. sctp_listen() cannot be
used to toggle the listening state of such a socket. See RFC 6458's
description of expected listen(2) semantics for one-to-one- and
one-to-many-style sockets.
Reviewed by: tuexen
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31774
Currently, most protocols implement pru_listen with something like the
following:
SOCK_LOCK(so);
error = solisten_proto_check(so);
if (error) {
SOCK_UNLOCK(so);
return (error);
}
solisten_proto(so);
SOCK_UNLOCK(so);
solisten_proto_check() fails if the socket is connected or connecting.
However, the socket lock is not used during I/O, so this pattern is
racy.
The change modifies solisten_proto_check() to additionally acquire
socket buffer locks, and the calling thread holds them until
solisten_proto() or solisten_proto_abort() is called. Now that the
socket buffer locks are preserved across a listen(2), this change allows
socket I/O paths to properly interlock with listen(2).
This fixes a large number of syzbot reports, only one is listed below
and the rest will be dup'ed to it.
Reported by: syzbot+9fece8a63c0e27273821@syzkaller.appspotmail.com
Reviewed by: tuexen, gallatin
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31659
In preparation for moving sockbuf locks into the containing socket,
provide alternative macros for the sockbuf I/O locks:
SOCK_IO_SEND_(UN)LOCK() and SOCK_IO_RECV_(UN)LOCK(). These operate on a
socket rather than a socket buffer. Note that these locks are used only
to prevent concurrent readers and writters from interleaving I/O.
When locking for I/O, return an error if the socket is a listening
socket. Currently the check is racy since the sockbuf sx locks are
destroyed during the transition to a listening socket, but that will no
longer be true after some follow-up changes.
Modify a few places to check for errors from
sblock()/SOCK_IO_(SEND|RECV)_LOCK() where they were not before. In
particular, add checks to sendfile() and sorflush().
Reviewed by: tuexen, gallatin
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31657
- The SCTP_PCB_FLAGS_SND_ITERATOR_UP check was racy, since two threads
could observe that the flag is not set and then both set it. I'm not
sure if this is actually a problem in practice, i.e., maybe there's no
problem having multiple sends for a single PCB in the iterator list?
- sctp_sendall() was modifying sctp_flags without the inp lock held.
The change simply acquires the PCB write lock before toggling the flag,
fixing both problems.
Reviewed by: tuexen
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31813
This appears to be unused in usrsctp as well. No functional change
intended.
Reviewed by: tuexen
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31812
sctp_close() and sctp_abort() disassociate the PCB from its socket.
As a part of this, they attempt to free the PCB, which may end up
lingering. Fix some bugs in this area:
- For some reason, sctp_close() and sctp_abort() set
SCTP_PCB_FLAGS_SOCKET_GONE using an atomic compare-and-set without the
PCB lock held. This is racy since sctp_flags is normally updated
without atomics, using the PCB lock to synchronize. So, the update
can be lost, which can cause all sort of races with other SCTP
components which look for the _GONE flag. Fix the problem simply by
acquiring the PCB lock in order to set the flag. Note that we have to
drop and re-acquire the lock again in sctp_inpcb_free(), but I don't
see a good way around that for now. If it's a real problem, the _GONE
flag could be split out of sctp_flags and into a dedicated sctp_inpcb
field.
- In sctp_inpcb_free(), load sctp_socket after acquiring the PCB lock,
to avoid possible races with parallel sctp_inpcb_free() calls.
- Add an assertion sctp_inpcb_free() to verify that _ALLGONE is not set.
Reviewed by: tuexen
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31811
With the new FIB_ALGO infrastructure, nearly all subsystems use
fib[46]_lookup() functions, which provides lockless lookups.
A number of places remains that uses old-style lookup functions, that
still requires RIB read lock to return the result. One of such places
is arp processing code.
FIB_ALGO implementation makes some tradeoffs, resulting in (relatively)
prolonged periods of holding RIB_WLOCK. If the lock is held and datapath
competes for it, the RX ring may get blocked, ending in traffic delays and losses.
As currently arp processing is performed directly in the interrupt handler,
handling ARP replies triggers the problem descibed above when the amount of
ARP replies is high.
To be more specific, prior to creating new ARP entry, routing lookup for the entry
address in interface fib is executed. The following conditions are the verified:
1. If lookup returns an empty result, or the resulting prefix is non-directly-reachable,
failure is returned. The only exception are host routes w/ gateway==address.
2. If the routing lookup returns different interface and non-host route,
we want to support the use case of having multiple interfaces with the same prefix.
In fact, the current code just checks if the returned prefix covers target address
(always true) and effectively allow allocating ARP entries for any directly-reachable prefix,
regardless of its interface.
Change the code to perform the following:
1) use fib4_lookup() to get the nexthop, instead of requesting exact prefix.
2) Rewrite first condition check using nexthop flags (1:1 match)
3) Rewrite second condition to check for interface addresses matching target address on
the input interface.
Differential Revision: https://reviews.freebsd.org/D31824
Reviewed by: ae
MFC after: 1 week
PR: 257965
We previously did this only in the normal case where no association
exists yet. However, it is not safe to process COOKIE-ECHO even if an
association exists, as sctp_process_cookie_existing() may dereference
the socket pointer.
See also commit 0c7dc84076b64ef74c24f04400d572f75ef61bb4.
Reviewed by: tuexen
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D31755