7145 Commits

Author SHA1 Message Date
Gordon Bergling
27c4abc7cd inet(3): Fix two typos in sysctl descriptions
- s/sequental/sequential/

MFC after:	3 days
2021-11-30 10:21:47 +01:00
Gordon Bergling
b4aa9cb217 tcp(4): Fix a typo in a sysctl description
- s/entires/entries/

MFC after:	3 days
2021-11-30 07:17:30 +01:00
Michael Tuexen
147bf5e930 tcp: Don't try to upgrade a read lock just for logging
Reviewed by:		glebius, lstewart, rrs
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D33098
2021-11-29 13:48:40 +01:00
Michael Tuexen
3c1ba6f394 sctp: improve consistency, no functional change intended
2021-11-26 12:53:43 +01:00
Michael Tuexen
0906362646 sctp: add some asserts, no functional changes intended
This might help in narrowing down
https://syzkaller.appspot.com/bug?id=fbd79abaec55f5aede63937182f4247006ea883b
2021-11-26 12:19:33 +01:00
Mark Johnston
44775b163b netinet: Remove unneeded mb_unmapped_to_ext() calls
in_cksum_skip() now handles unmapped mbufs on platforms where they're
permitted.

Reviewed by:	glebius, jhb
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D33097
2021-11-24 13:31:16 -05:00
Mark Johnston
0d9c3423f5 netinet: Implement in_cksum_skip() using m_apply()
This allows it to work with unmapped mbufs.  In particular,
in_cksum_skip() calls no longer need to be preceded by calls to
mb_unmapped_to_ext() to avoid a page fault.

PR:		259645
Reviewed by:	gallatin, glebius, jhb
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D33096
2021-11-24 13:31:16 -05:00
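The idea behind 0d9c3423f5 (walk the chain segment by segment and feed each piece to a callback) can be sketched in userland C. This is only an illustration under stated assumptions: `struct seg`, `chain_apply()`, `cksum_seg()` and `chain_cksum()` are hypothetical names invented here, not the kernel's mbuf or m_apply() API, and the byte-at-a-time loop favors clarity over speed. The point it shows is that the running state must remember whether an odd number of bytes has been consumed, so that segment boundaries falling in the middle of a 16-bit word do not change the result.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an mbuf: one contiguous piece of a chain. */
struct seg {
    const uint8_t *data;
    size_t len;
    const struct seg *next;
};

/* State carried from one callback invocation to the next. */
struct cksum_state {
    uint64_t sum;   /* wide accumulator of 16-bit big-endian words */
    int odd;        /* nonzero once an odd number of bytes was consumed */
};

/* Per-segment callback: add the bytes of one segment to the running sum. */
static int
cksum_seg(void *arg, const uint8_t *p, size_t len)
{
    struct cksum_state *st = arg;

    for (size_t i = 0; i < len; i++) {
        /* Bytes at even offsets form the high half of a 16-bit word. */
        st->sum += st->odd ? (uint64_t)p[i] : (uint64_t)p[i] << 8;
        st->odd = !st->odd;
    }
    return (0);
}

/* Call fn on each contiguous piece of [off, off + len) within the chain. */
static int
chain_apply(const struct seg *s, size_t off, size_t len,
    int (*fn)(void *, const uint8_t *, size_t), void *arg)
{
    /* Skip whole segments that lie before the starting offset. */
    while (s != NULL && off >= s->len) {
        off -= s->len;
        s = s->next;
    }
    while (len > 0 && s != NULL) {
        size_t n = s->len - off;
        int rv;

        if (n > len)
            n = len;
        rv = fn(arg, s->data + off, n);
        if (rv != 0)
            return (rv);
        len -= n;
        off = 0;
        s = s->next;
    }
    return (len == 0 ? 0 : -1);     /* -1: chain shorter than requested */
}

/* Ones-complement checksum of len bytes starting skip bytes into the chain. */
static uint16_t
chain_cksum(const struct seg *s, size_t skip, size_t len)
{
    struct cksum_state st = { 0, 0 };

    (void)chain_apply(s, skip, len, cksum_seg, &st);
    while (st.sum >> 16)            /* fold the carries back in */
        st.sum = (st.sum & 0xffff) + (st.sum >> 16);
    return ((uint16_t)~st.sum);
}
```

Because the callback never needs the whole packet in one mapped buffer, the same structure works whether a segment's bytes come from ordinary memory or are made addressable one piece at a time.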
Mark Johnston
ecbbe83144 netinet: Deduplicate most in_cksum() implementations
in_cksum() and related routines are implemented separately for each
platform, but only i386 and arm have optimized versions.  Other
platforms' copies of in_cksum.c are identical except for style
differences and support for big-endian CPUs.

Deduplicate the implementations for the rest of the platforms.  This
will make it easier to implement in_cksum() for unmapped mbufs.  On arm
and i386, define HAVE_MD_IN_CKSUM to mean that the MI implementation is
not to be compiled.

No functional change intended.

Reviewed by:	kp, glebius
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D33095
2021-11-24 13:31:16 -05:00
Mark Johnston
5195bcc212 netinet: Remove in_cksum.c
It does not get compiled into the kernel.  No functional change
intended.

Reviewed by:	kp, glebius, cy
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D33094
2021-11-24 13:31:16 -05:00
Gordon Bergling
b4fbc855a5 cc_newreno(4): Fix a typo in a source code comment
- s/conditons/conditions/

MFC after:	3 days
2021-11-19 19:16:02 +01:00
Gleb Smirnoff
ff94500855 Add tcp_freecb() - single place to free tcpcb.
Until this change there were two places where we would free a tcpcb:
tcp_discardcb() if all timers are drained, and tcp_timer_discard()
otherwise.  They were pretty much copy-and-paste, except that in the
default case we would run tcp_hc_update().  Merge them into a single
function, tcp_freecb(), move the new short version of
tcp_timer_discard() to tcp_timer.c, and make it static.

Reviewed by:		rrs, hselasky
Differential revision:	https://reviews.freebsd.org/D32965
2021-11-18 20:27:45 -08:00
Gleb Smirnoff
fb8588d2cb tcp_timewait: use on stack struct tcptw as last resort
In case we failed to uma_zalloc() and also failed to reuse with
tcp_tw_2msl_scan(), then just use on stack tcptw.  This will allow
to run through tcp_twrespond() and standard tcpcb discard routine.

Reviewed by:		rrs
Differential revision:	https://reviews.freebsd.org/D32965
2021-11-18 20:27:45 -08:00
Randall Stewart
97e28f0f58 tcp: Rack ack war with a mis-behaving firewall or nat with resets.
Previously we added ack-war prevention for misbehaving firewalls. This is
where the f/w or NAT messes up its sequence numbers and causes an ack-war.
There is yet another type of ack war that we have found in the wild that is
similar. The f/w or NAT gets an ack (a keep-alive probe or such) and, instead
of turning the ack/seq around and adding a TH_RST, it sends a new packet with
seq=0. This triggers the challenge ack in the reset processing, which sends a
challenge ack (if the seq=0 is within the range of possible sequence numbers
allowed by the challenge), and then we rinse and repeat.

This adds the needed tweaks (similar to the last ack-war prevention, using the
same sysctls and counters) to prevent it and allow, say, 5 per second by default.

Reviewed by: Michael Tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D32938
2021-11-17 09:45:51 -05:00
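The "allow say 5 per second" limit in 97e28f0f58 can be pictured as a small per-connection budget. The sketch below is a generic fixed-window limiter, not the code from the commit: `struct ack_limiter`, `ack_allowed()` and the millisecond clock argument are assumptions made for illustration, and the real stack may track its windows and counters differently.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-connection limiter state. */
struct ack_limiter {
    uint64_t window_start_ms;   /* start of the current one-second window */
    unsigned int sent;          /* challenge acks sent in this window */
    unsigned int limit;         /* maximum allowed per window, e.g. 5 */
};

/* Return true if one more challenge ack may be sent at time now_ms. */
static bool
ack_allowed(struct ack_limiter *l, uint64_t now_ms)
{
    /* Open a new window once a full second has passed. */
    if (now_ms - l->window_start_ms >= 1000) {
        l->window_start_ms = now_ms;
        l->sent = 0;
    }
    if (l->sent >= l->limit)
        return (false);         /* over budget: suppress, breaking the loop */
    l->sent++;
    return (true);
}
```

With a limit of 5, a peer that keeps provoking challenge acks gets at most five replies per second, which bounds the cost of the exchange without disabling challenge acks altogether.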
Mark Johnston
756bb50b6a sctp: Remove now-unneeded mb_unmapped_to_ext() calls
sctp_delayed_checksum() now handles unmapped mbufs, thanks to m_apply().

No functional change intended.

Reviewed by:	tuexen
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D32942
2021-11-16 13:38:09 -05:00
Mark Johnston
b4d758a0cc sctp: Use m_apply() to calculate a checksum for an mbuf chain
m_apply() works on unmapped mbufs, so this will let us elide
mb_unmapped_to_ext() calls preceding sctp_calculate_cksum() calls in
the network stack.

Modify sctp_calculate_cksum() to assume it's passed an mbuf header.
This assumption appears to be true in practice, and we need to know the
full length of the chain.

No functional change intended.

Reviewed by:	tuexen, jhb
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D32941
2021-11-16 13:36:30 -05:00
Mike Karels
2f35e7d9fa kernel: partially revert e9efb1125a15, default inet mask
When no mask is supplied to the ioctl adding an Internet interface
address, revert to using the historical class mask rather than a
single default.  Similarly for the NFS bootp code.

MFC after:	3 weeks
Reviewed by:	melifaro glebius
Differential Revision: https://reviews.freebsd.org/D32951
2021-11-14 14:12:25 -06:00
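For readers unfamiliar with the "historical class mask" that 2f35e7d9fa falls back to, the classful rule derives the mask from the leading bits of the address. The function below is a minimal sketch of that rule; `classful_mask()` is a name made up here (the kernel uses its own macros), and the address is assumed to be in host byte order.

```c
#include <stdint.h>

/*
 * Return the historical classful netmask for an IPv4 address given in
 * host byte order, or 0 if the address is not in class A, B or C.
 */
static uint32_t
classful_mask(uint32_t addr)
{
    if ((addr & 0x80000000u) == 0)
        return (0xff000000u);   /* class A: leading bit 0, 8-bit net */
    if ((addr & 0xc0000000u) == 0x80000000u)
        return (0xffff0000u);   /* class B: leading bits 10, 16-bit net */
    if ((addr & 0xe0000000u) == 0xc0000000u)
        return (0xffffff00u);   /* class C: leading bits 110, 24-bit net */
    return (0);                 /* class D/E: no historical network mask */
}
```

For example, 10.0.0.1 falls in class A and gets 255.0.0.0, while 192.168.1.1 falls in class C and gets 255.255.255.0.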
Michael Tuexen
2f62f92e37 tcp: Fix a locking issue related to logging
tcp_respond() is sometimes called with only a read lock.
The logging however, requires a write lock. So either
try to upgrade the lock if needed, or don't log the packet.

Reported by:		syzbot+8151ef969c170f76706b@syzkaller.appspotmail.com
Reported by:		syzbot+eb679adb3304c511c1e4@syzkaller.appspotmail.com
Reviewed by:		markj, rrs
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D32983
2021-11-14 15:04:27 +01:00
Gleb Smirnoff
ef396441ce tcp_usr_detach: revert debugging piece from f5cf1e5f5a.
The code was probably useful while the problem was being chased down,
but for brevity it makes sense just to return to the original KASSERT.

Reviewed by:		rrs
Differential revision:	https://reviews.freebsd.org/D32968
2021-11-13 08:33:32 -08:00
Gleb Smirnoff
9a06a82455 tcp_timers: check for (INP_TIMEWAIT | INP_DROPPED) only once
All timers keep inpcb locked through their execution.  We need to
check these flags only once.  Checking for INP_TIMEWAIT earlier is
also safer, since such inpcbs point into tcptw rather than tcpcb,
and any dereferences of inp_ppcb as tcpcb are erroneous.

Reviewed by:		rrs, hselasky
Differential revision:	https://reviews.freebsd.org/D32967
2021-11-13 08:32:06 -08:00
Michael Tuexen
df07bfda67 tcp: Fix a locking issue
INP_WLOCK_RECHECK_CLEANUP() and INP_WLOCK_RECHECK() might return
from the function, so any locks held must be released.

Reported by:		syzbot+b1a888df08efaa7b4bf1@syzkaller.appspotmail.com
Reviewed by:		markj
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D32975
2021-11-12 22:13:50 +01:00
Mark Johnston
034a924009 tcp: Ensure that vnets have an initialized V_default_cc_ptr
This causes new vnets to inherit the cc algorithm from vnet0. This is a
temporary patch to fix vnet jail creation.

With encouragement from: glebius
Fixes: b8d60729de ("tcp: Congestion control cleanup.")
Differential Revision: https://reviews.freebsd.org/D32970
2021-11-12 12:18:12 -07:00
Warner Losh
7e3c9ec906 tcp: better congestion control defaults
Define CC_NEWRENO in all the appropriate DEFAULTS and std.* config
files. It's the default congestion control algorithm.  Add code to cc.c
so that CC_DEFAULT is "newreno" if it's not overridden in the config
file.

Sponsored by: Netflix
Fixes: b8d60729de ("tcp: Congestion control cleanup.")
Reviewed by: manu, hselasky, jhb, glebius, tuexen
Differential Revision:	https://reviews.freebsd.org/D32964
2021-11-12 12:16:11 -07:00
Gleb Smirnoff
2ce85919bb Add net.inet.ip.source_address_validation
Drop packets arriving from the network that have our source IP
address.  If maliciously crafted they can create evil effects
like an RST exchange between two of our listening TCP ports.
Such packets just can't be legitimate.  Enable the tunable
by default.  Long time due for a modern Internet host.

Reviewed by:		donner, melifaro
Differential revision:	https://reviews.freebsd.org/D32914
2021-11-12 09:00:33 -08:00
Gleb Smirnoff
9c89392f12 Add in_localip_fib(), in6_localip_fib().
Check if given address/FIB exists locally.

Reviewed by:		melifaro
Differential revision:	https://reviews.freebsd.org/D32913
2021-11-12 08:59:42 -08:00
Gleb Smirnoff
81674f121e ip_input: packet filters shall not modify m_pkthdr.rcvif
Quick review confirms that they do not, also IPv6 doesn't expect
such a change in mbuf.  In IPv4 this appeared in 0aade26e6d,
which doesn't seem to have a valid explanation why.

Reviewed by:		donner, kp, melifaro
Differential revision:	https://reviews.freebsd.org/D32913
2021-11-12 08:58:27 -08:00
Gleb Smirnoff
94df3271d6 Rename net.inet.ip.check_interface to rfc1122_strong_es and document it.
This very questionable feature was enabled in FreeBSD for a very short
time.  It was disabled very soon upon merging to RELENG_4 - 23d7f14119.
And in HEAD was also disabled pretty soon - 4bc37f9836.

The tunable has a very vague name. Check interface for what? Given that
it was never documented and almost never enabled, I think it is fine
to rename it together with documenting it.

Also, count packets dropped by this tunable as ips_badaddr, otherwise
they fall down to ips_cantforward counter, which is misleading, as
packet was not supposed to be forwarded, it was destined locally.

Reviewed by:		donner, kp
Differential revision:	https://reviews.freebsd.org/D32912
2021-11-12 08:57:06 -08:00
Mateusz Guzik
0359e7a5e4 net: sprinkle __predict_false in ip_input on error conditions
While here rearrange the RSVP check to inspect proto first and avoid
evaluating V_rsvp in the common case to begin with (most notably avoid
the expensive read).

Reviewed by:	glebius
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D32929
2021-11-12 15:40:28 +00:00
Randall Stewart
26cbd0028c tcp: Rack may still calculate long RTT on persists probes.
When a persists probe is lost, we will end up calculating a long
RTT based on the initial probe and when the response comes from the
second probe (or third etc). This means we have a minimum of a
confidence level of 3 on an incorrect probe. This commit will change it
so that we have one of two options
a) Just not count RTT of probes where we had a loss
<or>
b) Count them still but degrade the confidence to 0.

I have set in this the default being to just not measure them, but I am open
to having the default be otherwise.

Reviewed by: Michael Tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D32897
2021-11-11 06:35:51 -05:00
Randall Stewart
b8d60729de tcp: Congestion control cleanup.
NOTE: HEADS UP read the note below if your kernel config is not including GENERIC!!

This patch does a bit of cleanup on TCP congestion control modules. There were some rather
interesting surprises that one could get i.e. where you use a socket option to change
from one CC (say cc_cubic) to another CC (say cc_vegas) and you could in theory get
a memory failure and end up on cc_newreno. This is not what one would expect. The
new code fixes this by requiring a cc_data_sz() function so we can malloc with M_WAITOK
and pass in to the init function preallocated memory. The CC init is expected in this
case *not* to fail but if it does and a module does break the
"no fail with memory given" contract we do fall back to the CC that was in place at the time.

This also fixes up a set of common newreno utilities that can be shared amongst other
CC modules instead of the other CC modules reaching into newreno and executing
what they think is a "common and understood" function. Let's put these functions in
cc.c and that way we have a common place that is easily findable by future developers or
bug fixers. This also allows newreno to evolve and grow support for its features i.e. ABE
and HYSTART++ without having to dance through hoops for other CC modules, instead
both newreno and the other modules just call into the common functions if they desire
that behavior or roll their own if that makes more sense.

Note: This commit changes the kernel configuration!! If you are not using GENERIC in
some form you must add a CC module option (one of CC_NEWRENO, CC_VEGAS, CC_CUBIC,
CC_CDG, CC_CHD, CC_DCTCP, CC_HTCP, CC_HD). You can have more than one defined
as well if you desire. Note that if you create a kernel configuration that does not
define a congestion control module and includes INET or INET6 the kernel compile will
break. Also you need to define a default; GENERIC adds 'options CC_DEFAULT=\"newreno\"'
but you can specify any string that represents the name of the CC module (same names
that show up in the CC module list under net.inet.tcp.cc). If you fail to add the
options CC_DEFAULT in your kernel configuration the kernel build will also break.

Reviewed by: Michael Tuexen
Sponsored by: Netflix Inc.
RELNOTES:YES
Differential Revision: https://reviews.freebsd.org/D32693
2021-11-11 06:28:18 -05:00
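The "ask for the size, allocate up front, then initialize with the preallocated memory" contract described in b8d60729de can be sketched as follows. Everything here is hypothetical: `struct cc_algo`, `struct conn`, `conn_switch_cc()` and the two toy modules are invented for illustration and do not mirror the kernel's structures. Userland malloc() can also fail, unlike an M_WAITOK allocation, so the sketch treats that as one more reason to stay on the current module.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified view of a congestion control module. */
struct cc_algo {
    const char *name;
    size_t (*data_sz)(void);    /* private state the module needs */
    int (*init)(void *mem);     /* set up preallocated state, 0 on success */
};

/* Hypothetical per-connection state. */
struct conn {
    const struct cc_algo *algo;
    void *cc_data;
};

/* Switch to newalgo; on any failure the connection keeps its old module. */
static int
conn_switch_cc(struct conn *c, const struct cc_algo *newalgo)
{
    size_t sz = newalgo->data_sz();
    void *mem = malloc(sz > 0 ? sz : 1);

    if (mem == NULL)
        return (-1);
    if (newalgo->init(mem) != 0) {
        /* The module broke the "no fail with memory given" contract. */
        free(mem);
        return (-1);
    }
    free(c->cc_data);           /* old state is released only on success */
    c->algo = newalgo;
    c->cc_data = mem;
    return (0);
}

/* Two toy modules used only to exercise the sketch. */
static size_t toy_data_sz(void) { return (sizeof(int)); }
static int toy_init_ok(void *mem) { *(int *)mem = 0; return (0); }
static int toy_init_bad(void *mem) { (void)mem; return (-1); }

static const struct cc_algo toy_good = { "toy_good", toy_data_sz, toy_init_ok };
static const struct cc_algo toy_bad = { "toy_bad", toy_data_sz, toy_init_bad };
```

The design choice to note is the ordering: the old state is released only after the new module has initialized successfully, so a failed switch never leaves the connection on a third, unrequested algorithm.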
John Baldwin
e3ba94d4f3 Don't require the socket lock for sorele().
Previously, sorele() always required the socket lock and dropped the
lock if the released reference was not the last reference.  Many
callers locked the socket lock just before calling sorele() resulting
in a wasted lock/unlock when not dropping the last reference.

Move the previous implementation of sorele() into a new
sorele_locked() function and use it instead of sorele() for various
places in uipc_socket.c that called sorele() while already holding the
socket lock.

The sorele() macro now uses refcount_release_if_not_last() to try to drop
the socket reference without locking the socket.  If that shortcut
fails, it locks the socket and calls sorele_locked().

Reviewed by:	kib, markj
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D32741
2021-11-09 10:50:12 -08:00
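The fast path described in e3ba94d4f3 can be illustrated with C11 atomics. This is a userland sketch, not the kernel code: `release_if_not_last()` only mimics the behavior the commit message attributes to refcount_release_if_not_last(), and `struct obj`, `obj_release()` and `obj_release_locked()` are hypothetical. To keep the example single-threaded and self-contained, the lock is modeled as a counter of acquisitions rather than a real mutex.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical reference-counted object. */
struct obj {
    atomic_uint refs;
    unsigned int lock_acquisitions; /* stand-in for a real mutex */
    bool destroyed;
};

/* Drop one reference unless it is the last one; report whether we did. */
static bool
release_if_not_last(atomic_uint *count)
{
    unsigned int old = atomic_load(count);

    while (old > 1) {
        /* On failure, old is refreshed with the current value. */
        if (atomic_compare_exchange_weak(count, &old, old - 1))
            return (true);
    }
    return (false);
}

/* Slow path: the caller already "holds the lock". */
static void
obj_release_locked(struct obj *o)
{
    if (atomic_fetch_sub(&o->refs, 1) == 1)
        o->destroyed = true;    /* last reference: tear the object down */
    /* A real implementation would drop the lock here if the object lives. */
}

/* Lock-free when other references remain, locked slow path otherwise. */
static void
obj_release(struct obj *o)
{
    if (release_if_not_last(&o->refs))
        return;
    o->lock_acquisitions++;     /* stands in for acquiring the lock */
    obj_release_locked(o);
}
```

Releasing a reference that is not the last one never touches the lock, which is the wasted lock/unlock pair the commit sets out to avoid.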
Mike Karels
20d5940396 kernel: deprecate Internet Class A/B/C
Hide historical Class A/B/C macros unless IN_HISTORICAL_NETS is defined;
define it for user level.  Define IN_MULTICAST separately from IN_CLASSD,
and use it in pf instead of IN_CLASSD.  Stop using class for setting
default masks when not specified; instead, define new default mask
(24 bits).  Warn when an Internet address is set without a mask.

MFC after:	1 month
Reviewed by:	cy
Differential Revision: https://reviews.freebsd.org/D32708
2021-11-09 09:32:38 -06:00
Randall Stewart
477aeb3dd4 tcp: Printf should be removed.
There is a printf when a socket option down to the CC module fails; this really
should not be a printf. In fact this whole option needs to be re-thought in coordination
with some other changes in the CC modules (it's just not right, but what it
does here if it fails is OK, since it will just use the ECN beta).

Reviewed by: Michael Tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D32894
2021-11-08 11:49:34 -05:00
Hans Petter Selasky
10a62eb109 Use layer five checksum flags in the mbuf packet header to pass on crypto state.
The mbuf protocol flags get cleared between layers, and also it was discovered
that M_DECRYPTED conflicts with M_HASFCS when receiving ethernet packets.

Add the proper CSUM_TLS_MASK and CSUM_TLS_DECRYPTED defines, and start using
these instead of M_DECRYPTED inside the TCP LRO code.

This change is needed by coming TLS RX hardware offload support patches.

Suggested by:	kib@
Reviewed by:	jhb@
MFC after:	1 week
Sponsored by:	NVIDIA Networking
2021-11-04 18:52:06 +01:00
Allan Jude
34d8fffff3 SIFTR: Fix compilation with -DSIFTR_IPV6
A few pieces of the SIFTR code that are behind #ifdef SIFTR_IPV6 have
not been updated as APIs have changed, etc.

Reported by:	Alexander Sideropoulos <Alexander.Sideropoulos@netapp.com>
Reviewed by:	rscheff, lstewart
Sponsored by:	NetApp
Sponsored by:	Klara Inc.
Differential Revision:	https://reviews.freebsd.org/D32698
2021-11-04 00:32:17 +00:00
Gleb Smirnoff
3ea9a7cf7b blackhole(4): disable for locally originated TCP/UDP packets
In most cases blackholing for locally originated packets is undesired and
leads to various kinds of lags and delays. Provide sysctls to enforce
it, e.g. for debugging purposes.

Reviewed by:		rrs
Differential revision:	https://reviews.freebsd.org/D32718
2021-11-03 13:02:44 -07:00
Gordon Bergling
c28e39c3d6 Fix a common typo in sysctl descriptions
- s/maxiumum/maximum/

MFC after:	3 days
2021-11-03 20:49:24 +01:00
Gleb Smirnoff
3358df2973 udp_input: remove a BSD stack relict
I should have removed it 9 years ago in 8ad458a471.  That commit
left save_ip as a write-only variable.

With save_ip removed we got one case when IP header can be modified:
the calculation of IP checksum with zeroed out header.  This place
already has had a header saver char b[9].  However, the b[9] saver
didn't cover the ip_sum field, which we explicitly overwrite aliased
as (struct ipovly *)->ih_len.  This was fine in cb34210012, since
checksum doesn't need to be restored if packet is consumed.  Now we
need to extend up to ip_sum field.

In collaboration with:	ae
Differential revision:	https://reviews.freebsd.org/D32719
2021-11-03 10:39:34 -07:00
Gordon Bergling
bb91496a85 netinet: Fix a common typo in source code comments
- s/writting/writing/

MFC after:	3 days
2021-11-03 16:21:49 +01:00
Andrey V. Elsukov
4a9e95286c ip_divert: calculate delayed checksum for IPv6 address family
Before passing an IPv6 packet to an application, apply delayed checksum
calculation. Mbuf flags will be lost when the divert listener returns the
packet, so we will not be able to do delayed checksum calculation
later. Also, the application will get a packet with a correct checksum.

Reviewed by:	donner
MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D32807
2021-11-03 15:20:51 +03:00
Mateusz Guzik
8e27968786 inet: remove tcp_debug from netinet/tcp_debug.h
It was a hack only needed for trpt, which can just define it locally.

This makes it possible to fix up systat which also includes the file.

Sponsored by:	Rubicon Communications, LLC ("Netgate")
2021-11-01 23:10:30 +00:00
Marius Halden
1019354b54 carp: deal with negative net.inet.carp.demotion
Given nodes 1 and 2, where node 1 has an advskew of 0 and node 2 has an
advskew of 100, making them master and backup respectively.

If net.inet.carp.demotion is set to a negative value on node 1, node 2
might become master while node 1 still retains its master status. Whether
or not node 2 becomes master seems to depend on the nodes' advskew and
what the demotion sysctl was set to on node 1.

The reason for node 2 becoming master seems to be that the calculated
advskew taking demotion into account is truncated to a single unsigned
byte when copied into the carp header for sending, and node 1 stays
master since it uses the whole non-truncated calculated advskew
when deciding whether to stay master.

PR:		259528
Reviewed by:	donner, glebius
MFC after:	3 weeks
Sponsored by:	Modirum MDPay
Differential Revision:	https://reviews.freebsd.org/D32759
2021-11-01 17:08:23 +01:00
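The truncation described in 1019354b54 is easy to reproduce in isolation. In the sketch below, `wire_advskew()` shows what happens when a signed sum is copied into a single unsigned byte, and `clamped_advskew()` shows one possible way to keep the local and on-wire values consistent. Both function names and the `max` parameter are assumptions for illustration; the commit message does not say how the actual fix bounds the value.

```c
#include <stdint.h>

/* What the peer sees: the sum squeezed into one unsigned byte. */
static uint8_t
wire_advskew(int advskew, int demotion)
{
    return ((uint8_t)(advskew + demotion));
}

/* One possible remedy: bound the sum to [0, max] before any use. */
static int
clamped_advskew(int advskew, int demotion, int max)
{
    int v = advskew + demotion;

    if (v < 0)
        return (0);
    if (v > max)
        return (max);
    return (v);
}
```

With advskew 0 and a demotion of -10, node 1 reasons locally with -10 and stays master, but the byte it advertises is 246. Node 2, with advskew 100, sees a larger advertised skew than its own and takes over as well. Using a single bounded value for both the local decision and the header removes the disagreement.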
Randall Stewart
141a53cd58 tcp: Rack might retransmit forever.
If we get a Sacked peer with an MTU change we can retransmit forever if the
last bytes are sacked and the client goes away (think power off). Then we
never see the end condition and continually retransmit.

Reviewed by: Michael Tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D32671
2021-10-29 17:37:49 -04:00
Randall Stewart
aeda852782 tcp: Rack at times can miscalculate the RTT from what it thinks is a persists probe response.
Turns out that if a peer sends in a window update right after rack fires off
a persists probe, we can mis-interpret the window update and calculate
a bogus RTT (very short). We still process the window update and send
the data but we incorrectly generate an RTT. We should be only doing
the RTT stuff if the rwnd is still small and has not changed.

Reviewed by: Michael Tuexen
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D32717
2021-10-29 03:17:43 -04:00
Gleb Smirnoff
92b3e07229 Enable net.inet.tcp.nolocaltimewait.
This feature has been used for many years at large sites and
didn't show any pitfalls.
2021-10-28 15:34:00 -07:00
Wojciech Macek
8a727c3df8 mroute: add missing WUNLOCK
Add missing WUNLOCK as in all other error cases.

Reported by:		Stormshield
Obtained from:		Semihalf
2021-10-28 07:12:23 +02:00
Wojciech Macek
fb3854845f mroute: fix memory leak
Add MFC to linked list to store incoming packets
before MCAST JOIN was captured.

Sponsored by:		Stormshield
Obtained from:		Semihalf
MFC after:		2 weeks
2021-10-28 07:12:16 +02:00
Gleb Smirnoff
5d3bf5b1d2 rack: Update the fast send block on setsockopt(2)
Rack caches TCP/IP header for fast send, so it doesn't call
tcpip_fillheaders().  After certain socket option changes,
namely IPV6_TCLASS, IP_TOS and IP_TTL it needs to update
its fast block to be in sync with the inpcb.

Reviewed by:		rrs
Differential Revision:	https://reviews.freebsd.org/D32655
2021-10-27 08:22:00 -07:00
Gleb Smirnoff
f581a26e46 Factor out tcp6_use_min_mtu() to handle IPV6_USE_MIN_MTU by TCP.
Pass control for IP/IP6 level options from generic tcp_ctloutput_set()
down to per-stack ctloutput.

Call tcp6_use_min_mtu() from tcp stack tcp_default_ctloutput().

Reviewed by:		rrs
Differential Revision:	https://reviews.freebsd.org/D32655
2021-10-27 08:22:00 -07:00
Gleb Smirnoff
de156263a5 Several IP level socket options may affect TCP.
After handling them in IP level ctloutput, pass them down to TCP
ctloutput.

We already have a hack to handle IPV6_USE_MIN_MTU. Leave it in place
for now, but comment out how it should be handled.

For IPv4 we are interested in IP_TOS and IP_TTL.

Reviewed by:		rrs
Differential Revision:	https://reviews.freebsd.org/D32655
2021-10-27 08:21:59 -07:00
Gleb Smirnoff
fc4d53cc2e Split tcp_ctloutput() into set/get parts.
Reviewed by:		rrs
Differential Revision:	https://reviews.freebsd.org/D32655
2021-10-27 08:21:59 -07:00