Commit Graph

7278 Commits

Author SHA1 Message Date
Gleb Smirnoff
9a8cf950b2 carp: fix send error demotion recovery
The problem is that carp(4) would clear the error counter on first
successful send, and stop counting successes after that.  Fix this
logic and document it in human language.

PR:			260499
Differential revision:	https://reviews.freebsd.org/D33536
2021-12-18 17:19:26 -08:00
Gleb Smirnoff
75add59a8e tcp: allocate statistics in the main tcp_init()
No reason to have a separate SYSINIT.
2021-12-17 10:50:56 -08:00
Gleb Smirnoff
d8b45c8e14 inpcb: don't leak the port zone in in_pcbinfo_destroy() 2021-12-16 15:15:02 -08:00
Mark Johnston
014f98b119 udp: Fix a use-after-free in udp_multi_input()
"ip" is a pointer into the input mbuf chain, so we shouldn't access it
after the chain is freed.

Fix style at the call site while here.

Reported by:	syzbot+7c8258509722af1b6145@syzkaller.appspotmail.com
Reviewed by:	tuexen, glebius
Fixes:		de2d47842e ("SMR protection for inpcbs")
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D33473
2021-12-16 09:17:05 -05:00
Randall Stewart
9b60296531 tcp: Rack in a rare case we can get stuck sending a very small amount.
If a tlp sending new data fails, and then the peer starts
talking to us again, we can be in a situation where the
tlp_new_data count is set, we are not in recovery and
we always send one packet every RTT. The failure
has to occur when we send the TLP initially from the ip_output()
which is rare. But if it occurs you are basically stuck.

This fixes it so we use the new_data count and clear it so
we know it will be cleared. If a failure occurs the tlp timer
will regenerate a new amount anyway so it is un-needed to
carry the value on.

Reviewed by: Michael Tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D33325
2021-12-15 09:41:33 -05:00
Gleb Smirnoff
185e659c40 inpcb: use locked variant of prison_check_ip*()
The pcb lookup always happens in the network epoch and in SMR section.
We can't block on a mutex due to the latter.  Right now this patch opens
up a race.  But soon that will be addressed by D33339.

Reviewed by:		markj, jamie
Differential revision:	https://reviews.freebsd.org/D33340
Fixes:			de2d47842e
2021-12-14 09:38:52 -08:00
Gleb Smirnoff
d74b7baeb0 ifnet_byindex() actually requires network epoch
Sweep over potentially unsafe calls to ifnet_byindex() and wrap them
in epoch.  Most of the code touched remains unsafe, as the returned
pointer is being used after epoch exit.  Mark that with a comment.

Validate the index argument inside the function, reducing argument
validation requirement from the callers and making V_if_index
private to if.c.

Reviewed by:		melifaro
Differential revision:	https://reviews.freebsd.org/D33263
2021-12-06 09:32:31 -08:00
Randall Stewart
dadbc04250 tcp: rack fails to send out a TLP after a MTU change
When rack sends out a TLP it sets up various state to make sure
it avoids the cwnd (its been more than 1 RTT since our last send) and
it may at times send new data. If an MTU change as occurred
and our cwnd has collapsed we can have a situation where must_retran
flag is set and we obey the cwnd thus never sending the TLP and then
sitting stuck.

This one line fix addresses that problem
Reviewed by: Michael Tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D33231
2021-12-06 09:56:09 -05:00
Gleb Smirnoff
eb93b99d69 in_pcb: delay crfree() down into UMA dtor
inpcb lookups, which check inp_cred, work with pcbs that potentially went
through in_pcbfree().  So inp_cred should stay valid until SMR guarantees
its invisibility to lookups.

While here, put the whole inpcb destruction sequence of in_pcbfree(),
inpcb_dtor() and inpcb_fini() sequentially.

Submitted by:		markj
Differential revision:	https://reviews.freebsd.org/D33273
2021-12-05 10:46:37 -08:00
Michael Tuexen
54912d47b6 sctp: unbreak NOINET6 builds.
PR:		260119
Reported by:	kostikbel
MFC after:	1 week
2021-12-04 19:16:18 +01:00
Michael Tuexen
d79676fb13 sctp: inherit IP level socket options from listening socket
Ensure that TTL and TOS values set on a listener get inheritet
to the accepted sockets.

PR:		260119
MFC after:	1 week
2021-12-03 22:44:01 +01:00
Gleb Smirnoff
36f42c5ebf tcp_ccalgounload(): initialize the inpcb iterator when curvnet is set
Pointy hat to:	glebius
Fixes:		de2d47842e
2021-12-03 12:39:56 -08:00
Peter Lei
4c018b5aed in_pcb: limit the effect of wraparound in TCP random port allocation check
The check to see if TCP port allocation should change from random to
sequential port allocation mode may incorrectly cause a false positive
due to negative wraparound.
Example:
    V_ipport_tcpallocs = 2147483585 (0x7fffffc1)
    V_ipport_tcplastcount = 2147483553 (0x7fffffa1)
    V_ipport_randomcps = 100
The original code would compare (2147483585 <= -2147483643) and thus
incorrectly move to sequential allocation mode.

Compute the delta first before comparing against the desired limit to
limit the wraparound effect (since tcplastcount is always a snapshot
of a previous tcpallocs).
2021-12-03 12:38:12 -08:00
Michael Tuexen
f32357be53 sctp: use the correct traffic class when sending SCTP/IPv6 packets
When sending packets the stcb was used to access the inp and then
access the endpoint specific IPv6 level options. This fails when
there exists an inp, but no stcb yet. This is the case for sending
an INIT-ACK in response to an INIT when no association already
exists. Fix this by just providing the inp instead of the stcb.

PR:		260120
MFC after:	1 week
2021-12-03 21:36:44 +01:00
Peter Lei
13e3f3349f in_pcb: fix TCP local ephemeral port accounting
Fix logic error causing UDP(-Lite) local ephemeral port bindings
to count against the TCP allocation counter, potentially causing
TCP to go from random to sequential port allocation mode prematurely.
2021-12-03 12:30:21 -08:00
Gleb Smirnoff
12ae3476f3 tcp_drain(): initialize the inpcb iterator when curvnet is set
Reported by:	cy
Pointy hat to:	glebius
Fixes:		de2d47842e
2021-12-02 21:08:30 -08:00
Gleb Smirnoff
651a545143 udp_detach(): fix set but not used warning 2021-12-02 20:12:40 -08:00
Gleb Smirnoff
bd1d085045 udp_multi_input(): the UDP header is only needed for probes
Reported by:	kib
Fixes:		de2d47842e
2021-12-02 20:12:40 -08:00
Cy Schubert
db0ac6ded6 Revert "wpa: Import wpa_supplicant/hostapd commit 14ab4a816"
This reverts commit 266f97b5e9, reversing
changes made to a10253cffe.

A mismerge of a merge to catch up to main resulted in files being
committed which should not have been.
2021-12-02 14:45:04 -08:00
Cy Schubert
266f97b5e9 wpa: Import wpa_supplicant/hostapd commit 14ab4a816
This is the November update to vendor/wpa committed upstream 2021-11-26.

MFC after:      1 month
2021-12-02 13:35:14 -08:00
Gleb Smirnoff
3cce6164ab ip_input: remove pointless check in INP_RECVIF handling
An mbuf rcvif pointer is supposed to be valid and doesn't
need extra checks.  The code appeared in d314ad7b73.
2021-12-02 11:15:04 -08:00
Gleb Smirnoff
2e27230ff9 tcp_hpts: rewrite inpcb synchronization
Just trust the pcb database, that if we did in_pcbref(), no way
an inpcb can go away.  And if we never put a dropped inpcb on
our queue, and tcp_discardcb() always removes an inpcb to be
dropped from the queue, then any inpcb on the queue is valid.

Now, to solve LOR between inpcb lock and HPTS queue lock do the
following trick.  When we are about to process a certain time
slot, take the full queue of the head list into on stack list,
drop the HPTS lock and work on our queue.  This of course opens
a race when an inpcb is being removed from the on stack queue,
which was already mentioned in comments.  To address this race
introduce generation count into queues.  If we want to remove
an inpcb with generation count mismatch, we can't do that, we
can only mark it with desired new time slot or -1 for remove.

Reviewed by:		rrs
Differential revision:	https://reviews.freebsd.org/D33026
2021-12-02 10:48:49 -08:00
Gleb Smirnoff
f971e79139 tcp_hpts: rename input queue to drop queue and trim dead code
The HPTS input queue is in reality used only for "delayed drops".
When a TCP stack decides to drop a connection on the output path
it can't do that due to locking protocol between main tcp_output()
and stacks.  So, rack/bbr utilize HPTS to drop the connection in
a different context.

In the past the queue could also process input packets in context
of HPTS thread, but now no stack uses this, so remove this
functionality.

Reviewed by:		rrs
Differential revision:	https://reviews.freebsd.org/D33025
2021-12-02 10:48:48 -08:00
Gleb Smirnoff
b0a7c008cb tcp_hpts: make struct tcp_hpts_entry private to the module.
Also, make some of the functions also private to the module. Remove
unused functions discovered after that.

Reviewed by:		rrs
Differential revision:	https://reviews.freebsd.org/D33024
2021-12-02 10:48:48 -08:00
Gleb Smirnoff
50f081ecb7 tcp_hpts: provide tcp_in_hpts().
It will hide some internal HPTS knowledge from the consumers.

Reviewed by:		rrs
Differential revision:	https://reviews.freebsd.org/D33023
2021-12-02 10:48:48 -08:00
Gleb Smirnoff
de2d47842e SMR protection for inpcbs
With introduction of epoch(9) synchronization to network stack the
inpcb database became protected by the network epoch together with
static network data (interfaces, addresses, etc).  However, inpcb
aren't static in nature, they are created and destroyed all the
time, which creates some traffic on the epoch(9) garbage collector.

Fairly new feature of uma(9) - Safe Memory Reclamation allows to
safely free memory in page-sized batches, with virtually zero
overhead compared to uma_zfree().  However, unlike epoch(9), it
puts stricter requirement on the access to the protected memory,
needing the critical(9) section to access it.  Details:

- The database is already build on CK lists, thanks to epoch(9).
- For write access nothing is changed.
- For a lookup in the database SMR section is now required.
  Once the desired inpcb is found we need to transition from SMR
  section to r/w lock on the inpcb itself, with a check that inpcb
  isn't yet freed.  This requires some compexity, since SMR section
  itself is a critical(9) section.  The complexity is hidden from
  KPI users in inp_smr_lock().
- For a inpcb list traversal (a pcblist sysctl, or broadcast
  notification) also a new KPI is provided, that hides internals of
  the database - inp_next(struct inp_iterator *).

Reviewed by:		rrs
Differential revision:	https://reviews.freebsd.org/D33022
2021-12-02 10:48:48 -08:00
Gleb Smirnoff
565655f4e3 inpcb: reduce some aliased functions after removal of PCBGROUP.
Reviewed by:		rrs
Differential revision:	https://reviews.freebsd.org/D33021
2021-12-02 10:48:48 -08:00
Gleb Smirnoff
93c67567e0 Remove "options PCBGROUP"
With upcoming changes to the inpcb synchronisation it is going to be
broken. Even its current status after the move of PCB synchronization
to the network epoch is very questionable.

This experimental feature was sponsored by Juniper but ended never to
be used in Juniper and doesn't exist in their source tree [sjg@, stevek@,
jtl@]. In the past (AFAIK, pre-epoch times) it was tried out at Netflix
[gallatin@, rrs@] with no positive result and at Yandex [ae@, melifaro@].

I'm up to resurrecting it back if there is any interest from anybody.

Reviewed by:		rrs
Differential revision:	https://reviews.freebsd.org/D33020
2021-12-02 10:48:48 -08:00
Gleb Smirnoff
1cec1c5831 Allow to compile RSS without PCBGROUP.
Reviewed by:		rrs
Differential revision:	https://reviews.freebsd.org/D33019
2021-12-02 10:48:48 -08:00
Randall Stewart
dcf2dfed26 tcp: unloading a module that is set to default should error.
I just discovered that the return of the EBUSY error was incorrectly
rigged so that you could unload a CC module that was set to default.
Its supposed to be an EBUSY error. Make it so.

Reviewed by: Michael Tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D33229
2021-12-02 06:12:16 -05:00
Michael Tuexen
13c196a41e sctp: improve handling of assoc ids in socket options
For socket options related to local and remote addresses providing
generic association ids does not make sense. Report EINVAL in this
case.

MFC after:	1 week
2021-12-01 14:54:55 +01:00
Michael Tuexen
a01b8859cb sctp: cleanup, no functional change intended.
MFC after:	1 week
2021-12-01 09:19:40 +01:00
Gordon Bergling
1dadeab367 netinet: Fix a common typo in source code comments
- s/segement/segment/

MFC after:	3 days
2021-11-30 10:37:20 +01:00
Gordon Bergling
27c4abc7cd inet(3): Fix two typos in sysctl descriptions
- s/sequental/sequential/

MFC after:	3 days
2021-11-30 10:21:47 +01:00
Gordon Bergling
b4aa9cb217 tcp(4): Fix a typo in a sysctl description
- s/entires/entries/

MFC after:	3 days
2021-11-30 07:17:30 +01:00
Michael Tuexen
147bf5e930 tcp: Don't try to upgrade a read lock just for logging
Reviewed by:		glebius, lstewart, rrs
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D33098
2021-11-29 13:48:40 +01:00
Michael Tuexen
3c1ba6f394 sctp: improve consistency, no functional change intended 2021-11-26 12:53:43 +01:00
Michael Tuexen
0906362646 sctp: add some asserts, no functional changes intended
This might help in narrowing down
https://syzkaller.appspot.com/bug?id=fbd79abaec55f5aede63937182f4247006ea883b
2021-11-26 12:19:33 +01:00
Mark Johnston
44775b163b netinet: Remove unneeded mb_unmapped_to_ext() calls
in_cksum_skip() now handles unmapped mbufs on platforms where they're
permitted.

Reviewed by:	glebius, jhb
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D33097
2021-11-24 13:31:16 -05:00
Mark Johnston
0d9c3423f5 netinet: Implement in_cksum_skip() using m_apply()
This allows it to work with unmapped mbufs.  In particular,
in_cksum_skip() calls no longer need to be preceded by calls to
mb_unmapped_to_ext() to avoid a page fault.

PR:		259645
Reviewed by:	gallatin, glebius, jhb
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D33096
2021-11-24 13:31:16 -05:00
Mark Johnston
ecbbe83144 netinet: Deduplicate most in_cksum() implementations
in_cksum() and related routines are implemented separately for each
platform, but only i386 and arm have optimized versions.  Other
platforms' copies of in_cksum.c are identical except for style
differences and support for big-endian CPUs.

Deduplicate the implementations for the rest of the platforms.  This
will make it easier to implement in_cksum() for unmapped mbufs.  On arm
and i386, define HAVE_MD_IN_CKSUM to mean that the MI implementation is
not to be compiled.

No functional change intended.

Reviewed by:	kp, glebius
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D33095
2021-11-24 13:31:16 -05:00
Mark Johnston
5195bcc212 netinet: Remove in_cksum.c
It does not get compiled into the kernel.  No functional change
inteneded.

Reviewed by:	kp, glebius, cy
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D33094
2021-11-24 13:31:16 -05:00
Gordon Bergling
b4fbc855a5 cc_newreno(4): Fix a typo in a source code comment
- s/conditons/conditions/

MFC after:	3 days
2021-11-19 19:16:02 +01:00
Gleb Smirnoff
ff94500855 Add tcp_freecb() - single place to free tcpcb.
Until this change there were two places where we would free tcpcb -
tcp_discardcb() in case if all timers are drained and tcp_timer_discard()
otherwise.  They were pretty much copy-n-paste, except that in the
default case we would run tcp_hc_update().  Merge this into single
function tcp_freecb() and move new short version of tcp_timer_discard()
to tcp_timer.c and make it static.

Reviewed by:		rrs, hselasky
Differential revision:	https://reviews.freebsd.org/D32965
2021-11-18 20:27:45 -08:00
Gleb Smirnoff
fb8588d2cb tcp_timewait: use on stack struct tcptw as last resort
In case we failed to uma_zalloc() and also failed to reuse with
tcp_tw_2msl_scan(), then just use on stack tcptw.  This will allow
to run through tcp_twrespond() and standard tcpcb discard routine.

Reviewed by:		rrs
Differential revision:	https://reviews.freebsd.org/D32965
2021-11-18 20:27:45 -08:00
Randall Stewart
97e28f0f58 tcp: Rack ack war with a mis-behaving firewall or nat with resets.
Previously we added ack-war prevention for misbehaving firewalls. This is
where the f/w or nat messes up its sequence numbers and causes an ack-war.
There is yet another type of ack war that we have found in the wild that is
like unto this. Basically the f/w or nat gets a ack (keep-alive probe or such)
and instead of turning the ack/seq around and adding a TH_RST it does something
real stupid and sends a new packet with seq=0. This of course triggers the challenge
ack in the reset processing which then sends in a challenge ack (if the seq=0 is within
the range of possible sequence numbers allowed by the challenge) and then we rinse-repeat.

This will add the needed tweaks (similar to the last ack-war prevention using the same sysctls and counters)
to prevent it and allow say 5 per second by default.

Reviewed by: Michael Tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D32938
2021-11-17 09:45:51 -05:00
Mark Johnston
756bb50b6a sctp: Remove now-unneeded mb_unmapped_to_ext() calls
sctp_delayed_checksum() now handles unmapped mbufs, thanks to m_apply().

No functional change intended.

Reviewed by:	tuexen
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D32942
2021-11-16 13:38:09 -05:00
Mark Johnston
b4d758a0cc sctp: Use m_apply() to calcuate a checksum for an mbuf chain
m_apply() works on unmapped mbufs, so this will let us elide
mb_unmapped_to_ext() calls preceding sctp_calculate_cksum() calls in
the network stack.

Modify sctp_calculate_cksum() to assume it's passed an mbuf header.
This assumption appears to be true in practice, and we need to know the
full length of the chain.

No functional change intended.

Reviewed by:	tuexen, jhb
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D32941
2021-11-16 13:36:30 -05:00
Mike Karels
2f35e7d9fa kernel: partially revert e9efb1125a15, default inet mask
When no mask is supplied to the ioctl adding an Internet interface
address, revert to using the historical class mask rather than a
single default.  Similarly for the NFS bootp code.

MFC after:	3 weeks
Reviewed by:	melifaro glebius
Differential Revision: https://reviews.freebsd.org/D32951
2021-11-14 14:12:25 -06:00
Michael Tuexen
2f62f92e37 tcp: Fix a locking issue related to logging
tcp_respond() is sometimes called with only a read lock.
The logging however, requires a write lock. So either
try to upgrade the lock if needed, or don't log the packet.

Reported by:		syzbot+8151ef969c170f76706b@syzkaller.appspotmail.com
Reported by:		syzbot+eb679adb3304c511c1e4@syzkaller.appspotmail.com
Reviewed by:		markj, rrs
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D32983
2021-11-14 15:04:27 +01:00
Gleb Smirnoff
ef396441ce tcp_usr_detach: revert debugging piece from f5cf1e5f5a.
The code was probably useful during the problem being chased down,
but for brevity makes sense just to return to the original KASSERT.

Reviewed by:		rrs
Differential revision:	https://reviews.freebsd.org/D32968
2021-11-13 08:33:32 -08:00
Gleb Smirnoff
9a06a82455 tcp_timers: check for (INP_TIMEWAIT | INP_DROPPED) only once
All timers keep inpcb locked through their execution.  We need to
check these flags only once.  Checking for INP_TIMEWAIT earlier is
is also safer, since such inpcbs point into tcptw rather than tcpcb,
and any dereferences of inp_ppcb as tcpcb are erroneous.

Reviewed by:		rrs, hselasky
Differential revision:	https://reviews.freebsd.org/D32967
2021-11-13 08:32:06 -08:00
Michael Tuexen
df07bfda67 tcp: Fix a locking issue
INP_WLOCK_RECHECK_CLEANUP() and INP_WLOCK_RECHECK() might return
from the function, so any locks held must be released.

Reported by:		syzbot+b1a888df08efaa7b4bf1@syzkaller.appspotmail.com
Reviewed by:		markj
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D32975
2021-11-12 22:13:50 +01:00
Mark Johnston
034a924009 tcp: Ensure that vnets have an initialized V_default_cc_ptr
This causes new vnets to inherit the cc algorithm from vnet0. This is a
temporary patch to fix vnet jail creation.

With encouragement from: glebius
Fixes: b8d60729de ("tcp: Congestion control cleanup.")
Differential Revision: https://reviews.freebsd.org/D32970
2021-11-12 12:18:12 -07:00
Warner Losh
7e3c9ec906 tcp: better congestion control defaults
Define CC_NEWRENO in all the appropriate DEFAULTS and std.* config
files. It's the default congestion control algorithm.  Add code to cc.c
so that CC_DEFAULT is "newreno" if it's not overriden in the config
file.

Sponsored by: Netflix
Fixes: b8d60729de ("tcp: Congestion control cleanup.")
Revired by: manu, hselasky, jhb, glebius, tuexen
Differential Revision:	https://reviews.freebsd.org/D32964
2021-11-12 12:16:11 -07:00
Gleb Smirnoff
2ce85919bb Add net.inet.ip.source_address_validation
Drop packets arriving from the network that have our source IP
address.  If maliciously crafted they can create evil effects
like an RST exchange between two of our listening TCP ports.
Such packets just can't be legitimate.  Enable the tunable
by default.  Long time due for a modern Internet host.

Reviewed by:		donner, melifaro
Differential revision:	https://reviews.freebsd.org/D32914
2021-11-12 09:00:33 -08:00
Gleb Smirnoff
9c89392f12 Add in_localip_fib(), in6_localip_fib().
Check if given address/FIB exists locally.

Reviewed by:		melifaro
Differential revision:	https://reviews.freebsd.org/D32913
2021-11-12 08:59:42 -08:00
Gleb Smirnoff
81674f121e ip_input: packet filters shall not modify m_pkthdr.rcvif
Quick review confirms that they do not, also IPv6 doesn't expect
such a change in mbuf.  In IPv4 this appeared in 0aade26e6d,
which doesn't seem to have a valid explanation why.

Reviewed by:		donner, kp, melifaro
Differential revision:	https://reviews.freebsd.org/D32913
2021-11-12 08:58:27 -08:00
Gleb Smirnoff
94df3271d6 Rename net.inet.ip.check_interface to rfc1122_strong_es and document it.
This very questionable feature was enabled in FreeBSD for a very short
time.  It was disabled very soon upon merging to RELENG_4 - 23d7f14119.
And in HEAD was also disabled pretty soon - 4bc37f9836.

The tunable has very vague name. Check interface for what? Given that
it was never documented and almost never enabled, I think it is fine
to rename it together with documenting it.

Also, count packets dropped by this tunable as ips_badaddr, otherwise
they fall down to ips_cantforward counter, which is misleading, as
packet was not supposed to be forwarded, it was destined locally.

Reviewed by:		donner, kp
Differential revision:	https://reviews.freebsd.org/D32912
2021-11-12 08:57:06 -08:00
Mateusz Guzik
0359e7a5e4 net: sprinkle __predict_false in ip_input on error conditions
While here rearrange the RVSP check to inspect proto first and avoid
evaluating V_rsvp in the common case to begin with (most notably avoid
the expensive read).

Reviewed by:	glebius
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D32929
2021-11-12 15:40:28 +00:00
Randall Stewart
26cbd0028c tcp: Rack may still calculate long RTT on persists probes.
When a persists probe is lost, we will end up calculating a long
RTT based on the initial probe and when the response comes from the
second probe (or third etc). This means we have a minimum of a
confidence level of 3 on a incorrect probe. This commit will change it
so that we have one of two options
a) Just not count RTT of probes where we had a loss
<or>
b) Count them still but degrade the confidence to 0.

I have set in this the default being to just not measure them, but I am open
to having the default be otherwise.

Reviewed by: Michael Tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D32897
2021-11-11 06:35:51 -05:00
Randall Stewart
b8d60729de tcp: Congestion control cleanup.
NOTE: HEADS UP read the note below if your kernel config is not including GENERIC!!

This patch does a bit of cleanup on TCP congestion control modules. There were some rather
interesting surprises that one could get i.e. where you use a socket option to change
from one CC (say cc_cubic) to another CC (say cc_vegas) and you could in theory get
a memory failure and end up on cc_newreno. This is not what one would expect. The
new code fixes this by requiring a cc_data_sz() function so we can malloc with M_WAITOK
and pass in to the init function preallocated memory. The CC init is expected in this
case *not* to fail but if it does and a module does break the
"no fail with memory given" contract we do fall back to the CC that was in place at the time.

This also fixes up a set of common newreno utilities that can be shared amongst other
CC modules instead of the other CC modules reaching into newreno and executing
what they think is a "common and understood" function. Lets put these functions in
cc.c and that way we have a common place that is easily findable by future developers or
bug fixers. This also allows newreno to evolve and grow support for its features i.e. ABE
and HYSTART++ without having to dance through hoops for other CC modules, instead
both newreno and the other modules just call into the common functions if they desire
that behavior or roll there own if that makes more sense.

Note: This commit changes the kernel configuration!! If you are not using GENERIC in
some form you must add a CC module option (one of CC_NEWRENO, CC_VEGAS, CC_CUBIC,
CC_CDG, CC_CHD, CC_DCTCP, CC_HTCP, CC_HD). You can have more than one defined
as well if you desire. Note that if you create a kernel configuration that does not
define a congestion control module and includes INET or INET6 the kernel compile will
break. Also you need to define a default, generic adds 'options CC_DEFAULT=\"newreno\"
but you can specify any string that represents the name of the CC module (same names
that show up in the CC module list under net.inet.tcp.cc). If you fail to add the
options CC_DEFAULT in your kernel configuration the kernel build will also break.

Reviewed by: Michael Tuexen
Sponsored by: Netflix Inc.
RELNOTES:YES
Differential Revision: https://reviews.freebsd.org/D32693
2021-11-11 06:28:18 -05:00
John Baldwin
e3ba94d4f3 Don't require the socket lock for sorele().
Previously, sorele() always required the socket lock and dropped the
lock if the released reference was not the last reference.  Many
callers locked the socket lock just before calling sorele() resulting
in a wasted lock/unlock when not dropping the last reference.

Move the previous implementation of sorele() into a new
sorele_locked() function and use it instead of sorele() for various
places in uipc_socket.c that called sorele() while already holding the
socket lock.

The sorele() macro now uses refcount_release_if_not_last() try to drop
the socket reference without locking the socket.  If that shortcut
fails, it locks the socket and calls sorele_locked().

Reviewed by:	kib, markj
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D32741
2021-11-09 10:50:12 -08:00
Mike Karels
20d5940396 kernel: deprecate Internet Class A/B/C
Hide historical Class A/B/C macros unless IN_HISTORICAL_NETS is defined;
define it for user level.  Define IN_MULTICAST separately from IN_CLASSD,
and use it in pf instead of IN_CLASSD.  Stop using class for setting
default masks when not specified; instead, define new default mask
(24 bits).  Warn when an Internet address is set without a mask.

MFC after:	1 month
Reviewed by:	cy
Differential Revision: https://reviews.freebsd.org/D32708
2021-11-09 09:32:38 -06:00
Randall Stewart
477aeb3dd4 tcp: Printf should be removed.
There is a printf when a socket option down to the CC module fails, this really
should not be a printf. In fact this whole option needs to be re-thought in coordination
with some other changes in the CC modules (its just not right but its ok what it
does here if it fails since it will just use the ECN beta).

Reviewed by: Michael Tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D32894
2021-11-08 11:49:34 -05:00
Hans Petter Selasky
10a62eb109 Use layer five checksum flags in the mbuf packet header to pass on crypto state.
The mbuf protocol flags get cleared between layers, and also it was discovered
that M_DECRYPTED conflicts with M_HASFCS when receiving ethernet patckets.

Add the proper CSUM_TLS_MASK and CSUM_TLS_DECRYPTED defines, and start using
these instead of M_DECRYPTED inside the TCP LRO code.

This change is needed by coming TLS RX hardware offload support patches.

Suggested by:	kib@
Reviewed by:	jhb@
MFC after:	1 week
Sponsored by:	NVIDIA Networking
2021-11-04 18:52:06 +01:00
Allan Jude
34d8fffff3 SIFTR: Fix compilation with -DSIFTR_IPV6
A few pieces of the SIFTR code that are behind #ifdef SIFTR_IPV6 have
not been updated as APIs have changed, etc.

Reported by:	Alexander Sideropoulos <Alexander.Sideropoulos@netapp.com>
Reviewed by:	rscheff, lstewart
Sponsored by:	NetApp
Sponsored by:	Klara Inc.
Differential Revision:	https://reviews.freebsd.org/D32698
2021-11-04 00:32:17 +00:00
Gleb Smirnoff
3ea9a7cf7b blackhole(4): disable for locally originated TCP/UDP packets
In most cases blackholing for locally originated packets is undesired,
leads to different kind of lags and delays. Provide sysctls to enforce
it, e.g. for debugging purposes.

Reviewed by:		rrs
Differential revision:	https://reviews.freebsd.org/D32718
2021-11-03 13:02:44 -07:00
Gordon Bergling
c28e39c3d6 Fix a common typo in syctl descriptions
- s/maxiumum/maximum/

MFC after:	3 days
2021-11-03 20:49:24 +01:00
Gleb Smirnoff
3358df2973 udp_input: remove a BSD stack relict
I should had removed it 9 years ago in 8ad458a471.  That commit
left save_ip as a write-only variable.

With save_ip removed we got one case when IP header can be modified:
the calculation of IP checksum with zeroed out header.  This place
already has had a header saver char b[9].  However, the b[9] saver
didn't cover the ip_sum field, which we explicitly overwrite aliased
as (struct ipovly *)->ih_len.  This was fine in cb34210012, since
checksum doesn't need to be restored if packet is consumed.  Now we
need to extend up to ip_sum field.

In collaboration with:	ae
Differential revision:	https://reviews.freebsd.org/D32719
2021-11-03 10:39:34 -07:00
Gordon Bergling
bb91496a85 netinet: Fix a common typo in source code comments
- s/writting/writing/

MFC after:	3 days
2021-11-03 16:21:49 +01:00
Andrey V. Elsukov
4a9e95286c ip_divert: calculate delayed checksum for IPv6 adress family
Before passing an IPv6 packet to application apply delayed checksum
calculation. Mbuf flags will be lost when divert listener will return a
packet back, so we will not be able to do delayed checksum calculation
later. Also an application will get a packet with correct checksum.

Reviewed by:	donner
MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D32807
2021-11-03 15:20:51 +03:00
Mateusz Guzik
8e27968786 inet: remove tcp_debug from netinet/tcp_debug.h
It was a hack only needed for trpt, which can just define it locally.

This makes it possible to fix up systat which also includes the file.

Sponsored by:	Rubicon Communications, LLC ("Netgate")
2021-11-01 23:10:30 +00:00
Marius Halden
1019354b54 carp: deal with negative net.inet.carp.demotion
Given nodes 1 and 2, where node 1 has an advskew of 0 and node 2 has an
advskew of 100, making them master and backup respectively.

If net.inet.carp.demotion is set to a negative value on node 1, node 2
might become master while node 1 still retains it master status. Wether
or not node 2 becomes master seems to depend on the nodes advskew and
what the demotion sysctl was set to on node 1.

The reason for node 2 becoming master seems to be that the calculated
advskew taking demotion into account is truncated to a single unsigned
byte when copied into the carp header for sending, and node 1 stays
master since it takes uses the whole non-truncated calculated advskew
when deciding wether to stay master.

PR:		259528
Reviewed by:	donner, glebius
MFC after:	3 weeks
Sponsored by:	Modirum MDPay
Differential Revision:	https://reviews.freebsd.org/D32759
2021-11-01 17:08:23 +01:00
Randall Stewart
141a53cd58 tcp: Rack might retransmit forever.
If we get a Sacked peer with an MTU change we can retransmit forever if the
last bytes are sacked and the client goes away (think power off). Then we
never see the end condition and continually retransmit.

Reviewed by: Michael Tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D32671
2021-10-29 17:37:49 -04:00
Randall Stewart
aeda852782 tcp: Rack at times can miscalculate the RTT from what it thinks is a persists probe respone.
Turns out that if a peer sends in a window update right after rack fires off
a persists probe, we can mis-interpret the window update and calculate
a bogus RTT (very short). We still process the window update and send
the data but we incorrectly generate an RTT. We should be only doing
the RTT stuff if the rwnd is still small and has not changed.

Reviewed by: Michael Tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D32717
2021-10-29 03:17:43 -04:00
Gleb Smirnoff
92b3e07229 Enable net.inet.tcp.nolocaltimewait.
This feature has been used for many years at large sites and
didn't show any pitfalls.
2021-10-28 15:34:00 -07:00
Wojciech Macek
8a727c3df8 mroute: add missing WUNLOCK
Add missing WNLOCK as in all other error cases.

Reported by:		Stormshield
Obtained from:		Semihalf
2021-10-28 07:12:23 +02:00
Wojciech Macek
fb3854845f mroute: fix memory leak
Add MFC to linked list to store incoming packets
before MCAST JOIN was captured.

Sponsored by:		Stormshield
Obtained from:		Semihalf
MFC after:		2 weeks
2021-10-28 07:12:16 +02:00
Gleb Smirnoff
5d3bf5b1d2 rack: Update the fast send block on setsockopt(2)
Rack caches TCP/IP header for fast send, so it doesn't call
tcpip_fillheaders().  After certain socket option changes,
namely IPV6_TCLASS, IP_TOS and IP_TTL it needs to update
its fast block to be in sync with the inpcb.

Reviewed by:		rrs
Differential Revision:	https://reviews.freebsd.org/D32655
2021-10-27 08:22:00 -07:00
Gleb Smirnoff
f581a26e46 Factor out tcp6_use_min_mtu() to handle IPV6_USE_MIN_MTU by TCP.
Pass control for IP/IP6 level options from generic tcp_ctloutput_set()
down to per-stack ctloutput.

Call tcp6_use_min_mtu() from tcp stack tcp_default_ctloutput().

Reviewed by:		rrs
Differential Revision:	https://reviews.freebsd.org/D32655
2021-10-27 08:22:00 -07:00
Gleb Smirnoff
de156263a5 Several IP level socket options may affect TCP.
After handling them in IP level ctloutput, pass them down to TCP
ctloutput.

We already have a hack to handle IPV6_USE_MIN_MTU. Leave it in place
for now, but comment out how it should be handled.

For IPv4 we are interested in IP_TOS and IP_TTL.

Reviewed by:		rrs
Differential Revision:	https://reviews.freebsd.org/D32655
2021-10-27 08:21:59 -07:00
Gleb Smirnoff
fc4d53cc2e Split tcp_ctloutput() into set/get parts.
Reviewed by:		rrs
Differential Revision:	https://reviews.freebsd.org/D32655
2021-10-27 08:21:59 -07:00
Peter Lei
e28330832b tcp: socket option to get stack alias name
TCP stack sysctl nodes are currently inserted using the stack
name alias. Allow the user to get the current stack's alias to
allow for programatic sysctl access.

Obtained from:	Netflix
2021-10-27 08:21:59 -07:00
Randall Stewart
12752978d3 tcp: The rack stack can incorrectly have an overflow when calculating a burst delay.
If the congestion window is very large the fact that we multiply it by 1000 (for microseconds) can
cause the uint32_t to overflow and we incorrectly calculate a very small divisor. This will then
cause the burst timer to be very large when it should be 0. Instead lets make the three variables
uint64_t and avoid the issue.

Reviewed by: Michael Tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D32668
2021-10-26 13:17:58 -04:00
Michael Tuexen
b15b053596 tcp: allow new reno functions to be called from other CC modules
Some new reno functions use the internal data, but are also called
from functions of other CC modules. Ensure that in this case, the
internal data is not accessed.

Reported by:		syzbot+1d219ea351caa5109d4b@syzkaller.appspotmail.com
Reported by:    	syzbot+b08144f8cad9c67258c5@syzkaller.appspotmail.com
Reviewed by:		rrs
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D32649
2021-10-25 22:53:49 +02:00
Gleb Smirnoff
f2d266f3b0 Don't run ip_ctloutput() for divert socket.
It was here since divert(4) was introduced, probably just came with a
protocol definition boilerplate.  There is no useful socket option
that can be set or get for a divert socket.

Reviewed by:		donner
Differential Revision:	https://reviews.freebsd.org/D32608
2021-10-25 11:16:59 -07:00
Gleb Smirnoff
d89c820b0d Remove div_ctlinput().
This function does nothing since 97d8d152c2. It was introduced
in 252f24a2cf with a sidenote "may not be needed".

Reviewed by:		donner
Differential Revision:	https://reviews.freebsd.org/D32608
2021-10-25 11:16:49 -07:00
Gleb Smirnoff
c8ee75f231 Use network epoch to protect local IPv4 addresses hash.
The modification to the hash are already naturally locked by
in_control_sx.  Convert the hash lists to CK lists. Remove the
in_ifaddr_rmlock. Assert the network epoch where necessary.

Most cases when the hash lookup is done the epoch is already entered.
Cover a few cases, that need entering the epoch, which mostly is
initial configuration of tunnel interfaces and multicast addresses.

Reviewed by:		melifaro
Differential revision:	https://reviews.freebsd.org/D32584
2021-10-22 14:40:53 -07:00
Randall Stewart
4e4c84f8d1 tcp: Add hystart-plus to cc_newreno and rack.
TCP Hystart draft version -03:
https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-hystartplusplus

Is a new version of hystart that allows one to carefully exit slow start if the RTT
spikes too much. The newer version has a slower-slow-start so to speak that then
kicks in for five round trips. To see if you exited too early, if not into congestion avoidance.
This commit will add that feature to our newreno CC and add the needed bits in rack to
be able to enable it.

Reviewed by: tuexen
Sponsored by: Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D32373
2021-10-22 07:10:28 -04:00
Roy Marples
5c5340108e net: Allow binding of unspecified address without address existance
Previously in_pcbbind_setup returned EADDRNOTAVAIL for empty
V_in_ifaddrhead (i.e., no IPv4 addresses configured) and in6_pcbbind
did the same for empty V_in6_ifaddrhead (no IPv6 addresses).

An equivalent test has existed since 4.4-Lite.  It was presumably done
to avoid extra work (assuming the address isn't going to be found
later).

In normal system operation *_ifaddrhead will not be empty: they will
at least have the loopback address(es).  In practice no work will be
avoided.

Further, this case caused net/dhcpd to fail when run early in boot
before assignment of any addresses.  It should be possible to bind the
unspecified address even if no addresses have been configured yet, so
just remove the tests.

The now-removed "XXX broken" comments were added in 59562606b9,
which converted the ifaddr lists to TAILQs.  As far as I (emaste) can
tell the brokenness is the issue described above, not some aspect of
the TAILQ conversion.

PR:		253166
Reviewed by:	ae, bz, donner, emaste, glebius
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D32563
2021-10-20 19:25:51 -04:00
Gleb Smirnoff
9b7501e797 in_mcast: garbage collect inp_gcmoptions()
It is is used only once, merge it into inp_freemoptions().
2021-10-18 11:36:07 -07:00
Gleb Smirnoff
0f617ae48a Add in_pcb_var.h for KPIs that are private to in_pcb.c and in6_pcb.c. 2021-10-18 10:19:57 -07:00
Gleb Smirnoff
744a64bd92 in_pcb: garbage collect in_pcbrele() 2021-10-18 10:07:16 -07:00
Gleb Smirnoff
5a78df20ce in_pcb: garbage collect unused structure in_pcblist 2021-10-18 10:06:39 -07:00
Maxim Sobolev
461e6f23db Fix fragmented UDP packets handling since rev.360967.
Consider IP_MF flag when checking length of the UDP packet to
match the declared value.

Sponsored by:	Sippy Software, Inc.
Differential Revision:	https://reviews.freebsd.org/D32363
MFC after:	2 weeks
2021-10-15 16:48:12 -07:00
Gleb Smirnoff
2144431c11 Remove in_ifaddr_lock acquisiton to access in_ifaddrhead.
An IPv4 address is embedded into an ifaddr which is freed
via epoch. And the in_ifaddrhead is already a CK list. Use
the network epoch to protect against use after free.

Next step would be to CK-ify the in_addr hash and get rid of the...

Reviewed by:		melifaro
Differential Revision:	https://reviews.freebsd.org/D32434
2021-10-13 10:04:46 -07:00
Marko Zec
bc8b8e106b [fib_algo][dxr] Retire counters which are no longer used
The number of chunks can still be tracked via vmstat -z|fgrep dxr.

MFC after:	3 days
2021-10-09 13:47:10 +02:00
Marko Zec
1549575f22 [fib_algo][dxr] Improve incremental updating strategy
Tracking the number of unused holes in the trie and the range table
was a bad metric based on which full trie and / or range rebuilds
were triggered, which would happen in vain by far too frequently,
particularly with live BGP feeds.

Instead, track the total unused space inside the trie and range table
structures, and trigger rebuilds if the percentage of unused space
exceeds a sysctl-tunable threshold.

MFC after:	3 days
PR:		257965
2021-10-09 13:22:27 +02:00
Michael Tuexen
bd19202c92 sctp: improve KASSERT messages
MFC after:	1 week
2021-10-08 11:33:56 +02:00
Michael Tuexen
3ff3733991 sctp: don't keep being locked on a stream which is removed
Reported by:	syzbot+f5f551e8a3a0302a4914@syzkaller.appspotmail.com
MFC after:	1 week
2021-10-02 00:48:01 +02:00
Randall Stewart
a36230f75e tcp: Make dsack stats available in netstat and also make sure its aware of TLP's.
DSACK accounting has been for quite some time under a NETFLIX_STATS ifdef. Statistics
on DSACKs however are very useful in figuring out how much bad retransmissions you
are doing. This is further complicated, however, by stacks that do TLP. A TLP
when discovering a lost ack in the reverse path will cause the generation
of a DSACK. For this situation we introduce a new dsack-tlp-bytes as well
as the more traditional dsack-bytes and dsack-packets. These will now
all display in netstat -p tcp -s. This also updates all stacks that
are currently built to keep track of these stats.

Reviewed by: tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D32158
2021-10-01 10:36:27 -04:00
Michael Tuexen
28ea947078 sctp: provide a specific stream scheduler function for FCFS
A KASSERT in the genric routine does not apply and triggers
incorrectly.

Reported by:	syzbot+8435af157238c6a11430@syzkaller.appspotmail.com
MFC after:	1 week
2021-09-29 02:08:37 +02:00
Michael Tuexen
fa947a3687 sctp: cleanup and adding KASSERT()s, no functional change
MFC after:	1 week
2021-09-28 20:31:12 +02:00
Michael Tuexen
5b53e749a9 sctp: fix usage of stream scheduler functions
sctp_ss_scheduled() should only be called for streams that are
scheduled. So call sctp_ss_remove_from_stream() before it.
This bug was uncovered by the earlier cleanup.

Reported by:	syzbot+bbf739922346659df4b2@syzkaller.appspotmail.com
Reported by:	syzbot+0a0857458f4a7b0507c8@syzkaller.appspotmail.com
Reported by:	syzbot+a0b62c6107b34a04e54d@syzkaller.appspotmail.com
Reported by:	syzbot+0aa0d676429ebcd53299@syzkaller.appspotmail.com
Reported by:	syzbot+104cc0c1d3ccf2921c1d@syzkaller.appspotmail.com
MFC after:	1 week
2021-09-28 05:25:58 +02:00
Michael Tuexen
171633765c sctp: avoid locking an already locked mutex
Reported by:	syzbot+f048680690f2e8d7ddad@syzkaller.appspotmail.com
Reported by:	syzbot+0725c712ba89d123c2e9@syzkaller.appspotmail.com
MFC after:	1 week
2021-09-28 05:17:03 +02:00
Gordon Bergling
d2e616147d sctp: Fix a typo in a comment
- s/assue/assume/

MFC after:	3 days
2021-09-26 15:15:39 +02:00
Marko Zec
43880c511c [fib_algo][dxr] Split unused range chunk list in multiple buckets
Traversing a single list of unused range chunks in search for a block
of optimal size was suboptimal.

The experience with real-world BGP workloads has shown that on average
unused range chunks are tiny, mostly in length from 1 to 4 or 5, when
DXR is configured with K = 20 which is the current default (D16X4R).

Therefore, introduce a limited amount of buckets to accomodate descriptors
of empty blocks of fixed (small) size, so that those can be found in O(1)
time.  If no empty chunks of the requested size can be found in fixed-size
buckets, the search continues in an unsorted list of empty chunks of
variable lengths, which should only happen infrequently.

This change should permit us to manage significantly more empty range
chunks without sacrifying the speed of incremental range table updating.

MFC after:	3 days
2021-09-25 06:29:48 +02:00
Randall Stewart
1ca931a540 tcp: Rack compressed ack path updates the recv window too easily
The compressed ack path of rack is not following proper procedures in updating
the peers window. It should be checking the seq and ack values before updating and
instead it is blindly updating the values. This could in theory get the wrong window
in the connection for some length of time.

Reviewed by: tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D32082
2021-09-23 11:43:29 -04:00
Randall Stewart
fd69939e79 tcp: Two bugs in rack one of which can lead to a panic.
In extensive testing in NF we have found two issues inside
the rack stack.

1) An incorrect offset is being generated by the fast send path when a fast send is initiated on
   the end of the socket buffer and before the fast send runs, the sb_compress macro adds data to the trailing socket.
   This fools the fast send code into thinking the sb offset changed and it miscalculates a "updated offset".
   It should only do that when the mbuf in question got smaller.. i.e. an ack was processed. This can lead to
   a panic deref'ing a NULL mbuf if that packet is ever retransmitted. At the best case it leads to invalid data being
   sent to the client which usually terminates the connection. The fix is to have the proper logic (that is in the rsm fast path)
   to make sure we only update the offset when the mbuf shrinks.
2) The other issue is more bothersome. The timestamp check in rack needs to use the msec timestamp when
   comparing the timestamp echo to now. It was using a microsecond timestamp which ends up giving error
   prone results but causes only small harm in trying to identify which send to use in RTT calculations if its a retransmit.

Reviewed by: tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D32062
2021-09-23 10:54:23 -04:00
Michael Tuexen
414499b3f9 sctp: Cleanup stream schedulers.
No functional change intended.

MFC after:	1 week
2021-09-23 14:16:56 +02:00
Michael Tuexen
762ae0ec8d sctp: Simplify stream scheduler usage
Callers are getting the stcb send lock, so just KASSERT that.
No need to signal this when calling stream scheduler functions.
No functional change intended.

MFC after:	1 week
2021-09-21 17:13:57 +02:00
Michael Tuexen
0b79a76f84 sctp: improve consistency when calling stream scheduler
Hold always the stcb send lock when calling sctp_ss_init() and
sctp_ss_remove_from_stream().

MFC after:	1 week
2021-09-21 00:54:13 +02:00
Michael Tuexen
34b1efcea1 sctp: use a valid outstream when adding it to the scheduler
Without holding the stcb send lock, the outstreams might get
reallocated if the number of streams are increased.

Reported by:	syzbot+4a5431d7caa666f2c19c@syzkaller.appspotmail.com
Reported by:	syzbot+aa2e3b013a48870e193d@syzkaller.appspotmail.com
Reported by:	syzbot+e4368c3bde07cd2fb29f@syzkaller.appspotmail.com
Reported by:	syzbot+fe2f110e34811ea91690@syzkaller.appspotmail.com
Reported by:	syzbot+ed6e8de942351d0309f4@syzkaller.appspotmail.com
MFC after:	1 week
2021-09-20 15:52:10 +02:00
Marko Zec
2ac039f7be [fib_algo][dxr] Merge adjacent empty range table chunks.
MFC after:	3 days
2021-09-20 06:30:45 +02:00
Michael Tuexen
e19d93b19d sctp: fix FCFS stream scheduler
Reported by:	syzbot+c6793f0f0ce698bce230@syzkaller.appspotmail.com
MFC after:	1 week
2021-09-19 11:56:26 +02:00
Mark Johnston
bf25678226 ktls: Fix error/mode confusion in TCP_*TLS_MODE getsockopt handlers
ktls_get_(rx|tx)_mode() can return an errno value or a TLS mode, so
errors are effectively hidden.  Fix this by using a separate output
parameter.  Convert to the new socket buffer locking macros while here.

Note that the socket buffer lock is not needed to synchronize the
SOLISTENING check here, we can rely on the PCB lock.

Reviewed by:	jhb
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31977
2021-09-17 14:19:05 -04:00
Mike Karels
fd0765933c Change lowest address on subnet (host 0) not to broadcast by default.
The address with a host part of all zeros was used as a broadcast long
ago, but the default has been all ones since 4.3BSD and RFC1122.  Until
now, we would broadcast the host zero address as well as the configured
address.  Change to not broadcasting that address by default, but add a
sysctl (net.inet.ip.broadcast_lowest) to re-enable it.  Note that the
correct way to use the zero address for broadcast would be to configure
it as the broadcast address for the network.

See https:/datatracker.ietf.org/doc/draft-schoen-intarea-lowest-address/
and the discussion in https://reviews.freebsd.org/D19316.  Note, Linux
now implements this.

Reviewed by:	rgrimes, tuexen; melifaro (previous version)
MFC after:	1 month
Relnotes:	yes
Differential Revision: https://reviews.freebsd.org/D31861
2021-09-16 19:42:20 -05:00
Marko Zec
eb3148cc4d [fib algo][dxr] Fix division by zero.
A division by zero would occur if DXR would be activated on a vnet
with no IP addresses configured on any interfaces.

PR:		257965
MFC after:	3 days
Reported by:	Raul Munoz
2021-09-16 16:34:05 +02:00
Marko Zec
b51f8bae57 [fib algo][dxr] Optimize trie updating.
Don't rebuild in vain trie parts unaffected by accumulated incremental
RIB updates.

PR:		257965
Tested by:	Konrad Kreciwilk
MFC after:	3 days
2021-09-15 22:42:49 +02:00
Marko Zec
442c8a245e [fib algo][dxr] Fix undefined behavior.
The result of shifting uint32_t by 32 (or more) is undefined: fix it.
2021-09-15 22:42:48 +02:00
Hans Petter Selasky
e3e7d95332 tcp: Avoid division by zero when KERN_TLS is enabled in tcp_account_for_send().
If the "len" variable is non-zero, we can assume that the sum of
"tp->t_snd_rxt_bytes + tp->t_sndbytes" is also non-zero.

It is also assumed that the 64-bit byte counters will never wrap around.

Differential Revision:	https://reviews.freebsd.org/D31959
Reviewed by:	gallatin, rrs and tuexen
Found by:	"I told you so", also called hselasky
MFC after:	1 week
Sponsored by:	NVIDIA Networking
2021-09-15 18:05:31 +02:00
Michael Tuexen
4542164685 sctp: cleanup, no functional change intended
MFC after:	1 week
2021-09-15 10:18:11 +02:00
John Baldwin
c782ea8bb5 Add a switch structure for send tags.
Move the type and function pointers for operations on existing send
tags (modify, query, next, free) out of 'struct ifnet' and into a new
'struct if_snd_tag_sw'.  A pointer to this structure is added to the
generic part of send tags and is initialized by m_snd_tag_init()
(which now accepts a switch structure as a new argument in place of
the type).

Previously, device driver ifnet methods switched on the type to call
type-specific functions.  Now, those type-specific functions are saved
in the switch structure and invoked directly.  In addition, this more
gracefully permits multiple implementations of the same tag within a
driver.  In particular, NIC TLS for future Chelsio adapters will use a
different implementation than the existing NIC TLS support for T6
adapters.

Reviewed by:	gallatin, hselasky, kib (older version)
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D31572
2021-09-14 11:43:41 -07:00
Mark Johnston
e6c19aa94d sctp: Allow blocking on I/O locks even with non-blocking sockets
There are two flags to request a non-blocking receive on a socket:
MSG_NBIO and MSG_DONTWAIT.  They are handled a bit differently in that
soreceive_generic() and soreceive_stream() will block on the socket I/O
lock when MSG_NBIO is set, but not if MSG_DONTWAIT is set.  In general,
MSG_NBIO seems to mean, "don't block if there is no data to receive" and
MSG_DONTWAIT means "don't go to sleep for any reason".

SCTP's soreceive implementation did not allow blocking on the I/O lock
if either flag is set, but this violates an assumption in
aio_process_sb(), which specifies MSG_NBIO but nonetheless
expects to make progress if data is available to read.  Change
sctp_sorecvmsg() to block on the I/O lock only if MSG_DONTWAIT
is not set.

Reported by:	syzbot+c7d22dbbb9aef509421d@syzkaller.appspotmail.com
Reviewed by:	tuexen
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31915
2021-09-14 09:02:05 -04:00
Michael Tuexen
29545986bd sctp: avoid LOR
Don't lock the inp-info lock while holding an stcb lock.

MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D31921
2021-09-12 21:11:14 +02:00
Michael Tuexen
4181fa2a20 sctp: minor cleanup, no functional change
MFC after:	1 week
2021-09-12 19:21:15 +02:00
Mark Johnston
2d5c48eccd sctp: Tighten up locking around sctp_aloc_assoc()
All callers of sctp_aloc_assoc() mark the PCB as connected after a
successful call (for one-to-one-style sockets).  In all cases this is
done without the PCB lock, so the PCB's flags can be corrupted.  We also
do not atomically check whether a one-to-one-style socket is a listening
socket, which violates various assumptions in solisten_proto().

We need to hold the PCB lock across all of sctp_aloc_assoc() to fix
this.  In order to do that without introducing lock order reversals, we
have to hold the global info lock as well.

So:
- Convert sctp_aloc_assoc() so that the inp and info locks are
  consistently held.  It returns with the association lock held, as
  before.
- Fix an apparent bug where we failed to remove an association from a
  global hash if sctp_add_remote_addr() fails.
- sctp_select_a_tag() is called when initializing an association, and it
  acquires the global info lock.  To avoid lock recursion, push locking
  into its callers.
- Introduce sctp_aloc_assoc_connected(), which atomically checks for a
  listening socket and sets SCTP_PCB_FLAGS_CONNECTED.

There is still one edge case in sctp_process_cookie_new() where we do
not update PCB/socket state correctly.

Reviewed by:	tuexen
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31908
2021-09-11 10:15:21 -04:00
orange30
f5777c123a net: Fix memory leaks upon arp_fillheader() failures
Free memory before return from arprequest_internal().  In in_arpinput(),
if arp_fillheader() fails, it should use goto drop.

Reviewed by:	melifaro, imp, markj
MFC after:	1 week
Pull Request:	https://github.com/freebsd/freebsd-src/pull/534
2021-09-10 09:45:26 -04:00
Michael Tuexen
3ea2cdd45e sctp: add explicit cast, no functional change intended
MFC after:	3 days
2021-09-09 19:13:47 +02:00
Michael Tuexen
0c1a20beb4 sctp: use appropriate argument when freeing association
Reported by:	syzbot+7fe26e26911344e7211d@syzkaller.appspotmail.com
MFC after:	3 days
2021-09-09 18:01:35 +02:00
Mark Johnston
4250aa1188 sctp: Clear assoc socket references when freeing a PCB
This restores behaviour present in the first import of SCTP.  Commit
ceaad40ae7 commented this out and commit
62fb761ff2 removed it.  However, once
sctp_inpcb_free() returns, the socket reference is gone no matter what,
so we need to clear it.

Reported by:	syzbot+30dd69297fcbc5f0e10a@syzkaller.appspotmail.com
Reported by:	syzbot+7b2f9d4bcac1c9569291@syzkaller.appspotmail.com
Reported by:	syzbot+ed3e651f7d040af480a6@syzkaller.appspotmail.com
Reviewed by:	tuexen
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31886
2021-09-09 08:33:26 -04:00
Michael Tuexen
58a7bf124c sctp: cleanup timewait handling for vtags
MFC after:	1 week
2021-09-09 01:18:58 +02:00
Mark Johnston
ee4731179c sctp: Fix a lock order reversal in sctp_swap_inpcb_for_listen()
When port reuse is enabled in a one-to-one-style socket, sctp_listen()
may call sctp_swap_inpcb_for_listen() to move the PCB out of the "TCP
pool".  In so doing it will drop the PCB lock, yielding an LOR since we
now hold several socket locks.  Reorder sctp_listen() so that it
performs this operation before beginning the conversion to a listening
socket.  Also modify sctp_swap_inpcb_for_listen() to return with PCB
write-locked, since that's what sctp_listen() expects now.

Reviewed by:	tuexen
Fixes:	bd4a39cc93 ("socket: Properly interlock when transitioning to a listening socket")
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31879
2021-09-08 11:41:19 -04:00
Mark Johnston
6e3af6321b sctp: Fix lock recursion in sctp_swap_inpcb_for_listen()
After commit bd4a39cc93 we now hold the global inp info lock across
the call to sctp_swap_inpcb_for_listen(), which attempts to acquire it
again.  Since sctp_swap_inpcb_for_listen()'s sole caller is
sctp_listen(), we can simply change it to not try to acquire the lock.

Reported by:	syzbot+a76b19ea2f8e1190c451@syzkaller.appspotmail.com
Reported by:	syzbot+a1b6cef257ad145b7187@syzkaller.appspotmail.com
Reviewed by:	tuexen
Fixes:	bd4a39cc93 ("socket: Properly interlock when transitioning to a listening socket")
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31878
2021-09-08 11:41:18 -04:00
Michael Tuexen
aab1d593b2 sctp: minor cleanups, no functional change intended 2021-09-08 15:13:49 +02:00
Alexander V. Chernikov
4b631fc832 routing: fix source address selection rules for IPv4 over IPv6.
Current logic always selects an IFA of the same family from the
 outgoing interfaces. In IPv4 over IPv6 setup there can be just
 single non-127.0.0.1 ifa, attached to the loopback interface.

Create a separate rt_getifa_family() to handle entire ifa selection
 for the IPv4 over IPv6.

Differential Revision: https://reviews.freebsd.org/D31868
MFC after:	1 week
2021-09-07 21:41:05 +00:00
Mark Johnston
c4b44adcf0 sctp: Remove special handling for a listen(2) backlog of 0
... when applied to one-to-one-style sockets.  sctp_listen() cannot be
used to toggle the listening state of such a socket.  See RFC 6458's
description of expected listen(2) semantics for one-to-one- and
one-to-many-style sockets.

Reviewed by:	tuexen
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31774
2021-09-07 17:12:09 -04:00
Mark Johnston
bd4a39cc93 socket: Properly interlock when transitioning to a listening socket
Currently, most protocols implement pru_listen with something like the
following:

	SOCK_LOCK(so);
	error = solisten_proto_check(so);
	if (error) {
		SOCK_UNLOCK(so);
		return (error);
	}
	solisten_proto(so);
	SOCK_UNLOCK(so);

solisten_proto_check() fails if the socket is connected or connecting.
However, the socket lock is not used during I/O, so this pattern is
racy.

The change modifies solisten_proto_check() to additionally acquire
socket buffer locks, and the calling thread holds them until
solisten_proto() or solisten_proto_abort() is called.  Now that the
socket buffer locks are preserved across a listen(2), this change allows
socket I/O paths to properly interlock with listen(2).

This fixes a large number of syzbot reports, only one is listed below
and the rest will be dup'ed to it.

Reported by:	syzbot+9fece8a63c0e27273821@syzkaller.appspotmail.com
Reviewed by:	tuexen, gallatin
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31659
2021-09-07 17:11:43 -04:00
Mark Johnston
f94acf52a4 socket: Rename sb(un)lock() and interlock with listen(2)
In preparation for moving sockbuf locks into the containing socket,
provide alternative macros for the sockbuf I/O locks:
SOCK_IO_SEND_(UN)LOCK() and SOCK_IO_RECV_(UN)LOCK().  These operate on a
socket rather than a socket buffer.  Note that these locks are used only
to prevent concurrent readers and writters from interleaving I/O.

When locking for I/O, return an error if the socket is a listening
socket.  Currently the check is racy since the sockbuf sx locks are
destroyed during the transition to a listening socket, but that will no
longer be true after some follow-up changes.

Modify a few places to check for errors from
sblock()/SOCK_IO_(SEND|RECV)_LOCK() where they were not before.  In
particular, add checks to sendfile() and sorflush().

Reviewed by:	tuexen, gallatin
MFC after:	1 month
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31657
2021-09-07 15:06:48 -04:00
Mark Johnston
173a7a4ee4 sctp: Fix iterator synchronization in sctp_sendall()
- The SCTP_PCB_FLAGS_SND_ITERATOR_UP check was racy, since two threads
  could observe that the flag is not set and then both set it.  I'm not
  sure if this is actually a problem in practice, i.e., maybe there's no
  problem having multiple sends for a single PCB in the iterator list?
- sctp_sendall() was modifying sctp_flags without the inp lock held.

The change simply acquires the PCB write lock before toggling the flag,
fixing both problems.

Reviewed by:	tuexen
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31813
2021-09-07 11:19:29 -04:00
Mark Johnston
e8e23ec127 sctp: Remove an unused sctp_inpcb field
This appears to be unused in usrsctp as well.  No functional change
intended.

Reviewed by:	tuexen
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31812
2021-09-07 11:19:29 -04:00
Mark Johnston
c17b531bed sctp: Fix races around sctp_inpcb_free()
sctp_close() and sctp_abort() disassociate the PCB from its socket.
As a part of this, they attempt to free the PCB, which may end up
lingering.  Fix some bugs in this area:

- For some reason, sctp_close() and sctp_abort() set
  SCTP_PCB_FLAGS_SOCKET_GONE using an atomic compare-and-set without the
  PCB lock held.  This is racy since sctp_flags is normally updated
  without atomics, using the PCB lock to synchronize.  So, the update
  can be lost, which can cause all sort of races with other SCTP
  components which look for the _GONE flag.  Fix the problem simply by
  acquiring the PCB lock in order to set the flag.  Note that we have to
  drop and re-acquire the lock again in sctp_inpcb_free(), but I don't
  see a good way around that for now.  If it's a real problem, the _GONE
  flag could be split out of sctp_flags and into a dedicated sctp_inpcb
  field.
- In sctp_inpcb_free(), load sctp_socket after acquiring the PCB lock,
  to avoid possible races with parallel sctp_inpcb_free() calls.
- Add an assertion sctp_inpcb_free() to verify that _ALLGONE is not set.

Reviewed by:	tuexen
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31811
2021-09-07 11:19:29 -04:00
Alexander V. Chernikov
936f4a42fa lltable: do not require prefix lookup when checking lle allocation rules.
With the new FIB_ALGO infrastructure, nearly all subsystems use
 fib[46]_lookup() functions, which provides lockless lookups.
A number of places remains that uses old-style lookup functions, that
 still requires RIB read lock to return the result. One of such places
 is arp processing code.
FIB_ALGO implementation makes some tradeoffs, resulting in (relatively)
 prolonged periods of holding RIB_WLOCK. If the lock is held and datapath
 competes for it, the RX ring may get blocked, ending in traffic delays and losses.
As currently arp processing is performed directly in the interrupt handler,
 handling ARP replies triggers the problem descibed above when the amount of
 ARP replies is high.

To be more specific, prior to creating new ARP entry, routing lookup for the entry
 address in interface fib is executed. The following conditions are the verified:

1. If lookup returns an empty result, or the resulting prefix is non-directly-reachable,
 failure is returned. The only exception are host routes w/ gateway==address.
2. If the routing lookup returns different interface and non-host route,
 we want to support the use case of having multiple interfaces with the same prefix.
 In fact, the current code just checks if the returned prefix covers target address
 (always true) and effectively allow allocating ARP entries for any directly-reachable prefix,
 regardless of its interface.

Change the code to perform the following:

1) use fib4_lookup() to get the nexthop, instead of requesting exact prefix.
2) Rewrite first condition check using nexthop flags (1:1 match)
3) Rewrite second condition to check for interface addresses matching target address on
 the input interface.

Differential Revision: https://reviews.freebsd.org/D31824
Reviewed by:	ae
MFC after:	1 week
PR:	257965
2021-09-06 21:03:22 +00:00
Gordon Bergling
631504fb34 Fix a common typo in source code comments
- s/existant/existent/

MFC after:	3 days
2021-09-04 12:56:57 +02:00
Mark Johnston
c98bf2a45e sctp: Always check for a vanishing inpcb when processing COOKIE-ECHO
We previously did this only in the normal case where no association
exists yet.  However, it is not safe to process COOKIE-ECHO even if an
association exists, as sctp_process_cookie_existing() may dereference
the socket pointer.

See also commit 0c7dc84076.

Reviewed by:	tuexen
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31755
2021-09-01 10:28:17 -04:00
Mark Johnston
d35be50f57 sctp: Hold association locks across socket wakeups when freeing
At this point we do not hold the inpcb lock, so the only thing holding
the socket reference live is the TCB lock, which needs to be acquired by
sctp_inpcb_free() in order to destroy associations.  Defer the unlock to
until after we dereference the socket reference.

Reported by:	syzbot+1d0f2c4675de76a4cf1e@syzkaller.appspotmail.com
Reported by:	syzbot+fabee77954fe69d3a5ad@syzkaller.appspotmail.com
Reviewed by:	tuexen
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31754
2021-09-01 10:27:51 -04:00
Mark Johnston
65f30a39e1 sctp: Release the socket reference when detaching an association
Later in sctp_free_assoc(), when we clean up chunk lists,
sctp_free_spbufspace() is used to reset the byte count in the socket
send buffer.  However, if the PCB is going away, the socket may already
have been detached from the PCB, in which case this becomes a use-after
free.  Clear the socket reference from the association before detaching
it from the PCB, if the PCB has already lost its socket reference.

Reviewed by:	tuexen
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31753
2021-09-01 10:27:31 -04:00
Mark Johnston
457abbb857 sctp: Implement sctp_inpcb_bind_locked()
This will be used by sctp_listen() to avoid dropping locks when
performing an implicit bind.  No functional change intended.

Reviewed by:	tuexen
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31757
2021-09-01 10:06:18 -04:00
Mark Johnston
be8ee77e9e sctp: Add macros to assert on inp info lock state
Reviewed by:	tuexen
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D31756
2021-09-01 10:06:18 -04:00