6570 Commits

Author SHA1 Message Date
Michael Tuexen
ae7cc6c9f8 Make the message size limit used for SCTP_SENDALL configurable via
a sysctl variable instead of a compiled in constant.

This is based on a patch provided by nwhitehorn@.
2020-01-04 20:33:12 +00:00
Mark Johnston
31069f383a Take the ifnet's address lock in igmp_v3_cancel_link_timers().
inm_rele_locked() may remove the multicast address associated with inm.

Reported by:	syzbot+871c5d1fd5fac6c28f52@syzkaller.appspotmail.com
Reviewed by:	hselasky
MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D23009
2020-01-03 17:03:10 +00:00
Michael Tuexen
0eeb0d180e Remove empty line which was added in r356270 by accident.
MFC after:		1 week
2020-01-02 14:04:16 +00:00
Michael Tuexen
ac1d75d23a Improve input validation of the spp_pathmtu field in the
SCTP_PEER_ADDR_PARAMS socket option. The code in the stack assumes
sane values for the MTU.

This issue was found by running an instance of syzkaller.

MFC after:		1 week
2020-01-02 13:55:10 +00:00
Gleb Smirnoff
8d5c56dab1 In r343631 error code for a packet blocked by a firewall was
changed from EACCES to EPERM.  This change was not intentional,
so fix that.  Return EACCESS if a firewall forbids sending.

Noticed by:	ae
2020-01-01 17:31:43 +00:00
Alexander Motin
8c3fbf3c20 Relax locking of carp_forus().
This fixes deadlock between CARP and bridge.  Bridge calls this function
taking CARP lock while holding bridge lock.  Same time CARP tries to send
its announcements via the bridge while holding CARP lock.

Use of CARP_LOCK() here does not solve anything, since sc_addr is constant
while race on sc_state is harmless and use of the lock does not close it.

Reviewed by:	glebius
MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2019-12-31 18:58:29 +00:00
Michael Tuexen
e11c9783e1 Fix delayed ACK generation for DCTCP.
Submitted by:		Richard Scheffenegger
Reviewed by:		chengc@netapp.com, rgrimes@, tuexen@
Differential Revision:	https://reviews.freebsd.org/D22644
2019-12-31 16:15:47 +00:00
Michael Tuexen
493c98c6d2 Add flags for upcoming patches related to improved ECN handling.
No functional change.
Submitted by:		Richard Scheffenegger
Reviewed by:		rgrimes@, tuexen@
Differential Revision:	https://reviews.freebsd.org/D22429
2019-12-31 14:32:48 +00:00
Michael Tuexen
83a2839fb9 Clear the flag indicating that the last received packet was marked CE also
in the case where a packet not marked was received.

Submitted by:		Richard Scheffenegger
Reviewed by:		rgrimes@, tuexen@
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D19143
2019-12-31 14:23:52 +00:00
Michael Tuexen
7d87664a04 Add curly braces missed in https://svnweb.freebsd.org/changeset/base/354773
Sponsored by:		Netflix, Inc.
CID:			1407649
2019-12-31 12:29:01 +00:00
Michael Tuexen
6088175a18 Improve input validation for some parameters having a too small
reported length.

Thanks to Natalie Silvanovich from Google for finding one of these
issues in the SCTP userland stack and reporting it.

MFC after:		1 week
2019-12-20 15:25:08 +00:00
Alexander V. Chernikov
bdb214a4a4 Remove useless code from in6_rmx.c
The code in questions walks IPv6 tree every 60 seconds and looks into
 the routes with non-zero expiration time (typically, redirected routes).
For each such route it sets RTF_PROBEMTU flag at the expiration time.
No other part of the kernel checks for RTF_PROBEMTU flag.

RTF_PROBEMTU was defined 21 years ago, 30 Jun 1999, as RTF_PROTO1.
RTF_PROTO1 is a de-facto standard indication of a route installed
 by a routing daemon for a last decade.

Reviewed by:	bz, ae
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D22865
2019-12-18 22:10:56 +00:00
Hans Petter Selasky
a4c5668d12 Leave multicast group before reaping and committing state for both
IPv4 and IPv6.

This fixes a regression issue after r349369. When trying to exit a
multicast group before closing the socket, a multicast leave packet
should be sent.

Differential Revision:	https://reviews.freebsd.org/D22848
PR: 242677
Reviewed by:	bz (network)
Tested by:	Aleksandr Fedorov <aleksandr.fedorov@itglobal.com>
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2019-12-18 12:06:34 +00:00
Randall Stewart
1cf55767b8 This commit is a bit of a re-arrange of deck chairs. It
gets both rack and bbr ready for the completion of the STATs
framework in FreeBSD. For now if you don't have both NF_stats and
stats on it disables them. As soon as the rest of the stats framework
lands we can remove that restriction and then just uses stats when
defined.

Sponsored by:	Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D22479
2019-12-17 16:08:07 +00:00
John Baldwin
5773ac113c Use callout_func_t instead of the deprecated timeout_t.
Reviewed by:	kib, imp
Differential Revision:	https://reviews.freebsd.org/D22752
2019-12-10 22:06:53 +00:00
Bjoern A. Zeeb
646e34881c Remove the extra epoch tracker change sneaked into r355449 and was not part
of the originally reviewed or described change.

Pointyhat to:   bz
Reported by:    glebius
2019-12-06 22:20:26 +00:00
Bjoern A. Zeeb
dad68fc301 carp: replace caddr_t with char *
Change the remaining caddr_t usages to char * following the removal
of the KAME macros

No functional change.

Requested by:	glebius
Reviewed by:	glebius
MFC after:	2 weeks
Sponsored by:	Netflix (originally)
Differential Revision:	https://reviews.freebsd.org/D22399
2019-12-06 16:35:48 +00:00
Gleb Smirnoff
e5a084d020 Fix regression from r354484. Don't leak pcb lock if cr_canseeinpcb()
returns non-zero.

PR:		242415
2019-12-04 22:41:52 +00:00
Bjoern A. Zeeb
0700d2c3f0 Make icmp6_reflect() static.
icmp6_reflect() is not used anywhere outside icmp6.c, no reason to
export it.

Sponsored by:	Netflix
2019-12-03 14:46:38 +00:00
Hans Petter Selasky
5b64b824b9 Use refcount from "in_joingroup_locked()" when joining multicast
groups. Do not acquire additional references. This makes the IPv4 IGMP
code in line with the IPv6 MLD code.

Background:
The IPv4 multicast code puts an extra reference on the in_multi struct
when joining groups.  This becomes visible when using daemons like
igmpproxy from ports, that multicast entries do not disappear from the
output of ifmcstat(8) when multicast streams are disconnected.

This fixes a regression issue after r349762.

While at it factor the ip_mfilter_insert() and ip6_mfilter_insert() calls
to avoid repeated "is_new" check.

Differential Revision:	https://reviews.freebsd.org/D22595
Tested by:	Guido van Rooij <guido@gvr.org>
Reviewed by:	rgrimes (network)
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2019-12-03 08:46:59 +00:00
Edward Tomasz Napierala
adc56f5a38 Make use of the stats(3) framework in the TCP stack.
This makes it possible to retrieve per-connection statistical
information such as the receive window size, RTT, or goodput,
using a newly added TCP_STATS getsockopt(3) option, and extract
them using the stats_voistat_fetch(3) API.

See the net/tcprtt port for an example consumer of this API.

Compared to the existing TCP_INFO system, the main differences
are that this mechanism is easy to extend without breaking ABI,
and provides statistical information instead of raw "snapshots"
of values at a given point in time.  stats(3) is more generic
and can be used in both userland and the kernel.

Reviewed by:	thj
Tested by:	thj
Obtained from:	Netflix
Relnotes:	yes
Sponsored by:	Klara Inc, Netflix
Differential Revision:	https://reviews.freebsd.org/D20655
2019-12-02 20:58:04 +00:00
Michael Tuexen
3cf38784e2 Move all ECN related flags from the flags to the flags2 field.
This allows adding more ECN related flags in the future.
No functional change intended.

Submitted by:		Richard Scheffenegger
Reviewed by:		rrs@, tuexen@
Differential Revision:	https://reviews.freebsd.org/D22497
2019-12-01 21:01:33 +00:00
Michael Tuexen
77aabfd94f Make the TF_* flags easier readable by humans by adding leading zeroes
to make them aligned.

Submitted by:		Richard Scheffenegger
Reviewed by:		rgrimes@, rrs@, tuexen@
Differential Revision:	https://reviews.freebsd.org/D22428
2019-12-01 20:45:48 +00:00
Michael Tuexen
b72e56e758 This is an initial step in implementing the new congestion window
validation as specified in RFC 7661.

Submitted by:		Richard Scheffenegger
Reviewed by:		rrs@, tuexen@
Differential Revision:	https://reviews.freebsd.org/D21798
2019-12-01 20:35:41 +00:00
Michael Tuexen
8df12ffcc2 Make the IPTOS value available to all substate handlers. This will allow
to add support for L4S or SCE, which require processing of the IP TOS
field.

Submitted by:		Richard Scheffenegger
Reviewed by:		rgrimes@, rrs@, tuexen@
Differential Revision:	https://reviews.freebsd.org/D22426
2019-12-01 18:47:53 +00:00
Michael Tuexen
fa49a96419 In order for the TCP Handshake to support ECN++, and further ECN-related
improvements, the ECN bits need to be exposed to the TCP SYNcache.
This change is a minimal modification to the function headers, without any
functional change intended.

Submitted by:		Richard Scheffenegger
Reviewed by:		rgrimes@, rrs@, tuexen@
Differential Revision:	https://reviews.freebsd.org/D22436
2019-12-01 18:05:02 +00:00
Michael Tuexen
669a285ffb When changing the MTU of an SCTP path, not only cancel all ongoing
RTT measurements, but also scheldule new ones for the future.

Submitted by:		Julius Flohr
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D22547
2019-12-01 17:35:36 +00:00
Bjoern A. Zeeb
a4adf6cc65 Fix m_pullup() problem after removing PULLDOWN_TESTs and KAME EXT_*macros.
r354748-354750 replaced the KAME macros with m_pulldown() calls.
Contrary to the rest of the network stack m_len checks before m_pulldown()
were not put in placed (see r354748).
Put these m_len checks in place for now (to go along with the style of the
network stack since the initial commits).  These are not put in for
performance but to avoid an error scenario (even though it also will help
performance at the moment as it avoid allocating an extra mbuf; not because
of the unconditional function call).

The observed error case went like this:
(1) an mbuf with M_EXT arrives and we call m_pullup() unconditionally on it.
(2) m_pullup() will call m_get() unless the requested length is larger than
MHLEN (in which case it'll m_freem() the perfectly fine mbuf) and migrate the
requested length of data and pkthdr into the new mbuf.
(3) If m_get() succeeds, a further m_pullup() call going over MHLEN will fail.
This was observed with failing auto-configuration as an RA packet of
200 bytes exceeded MHLEN and the m_pullup() called from nd6_ra_input()
dropped the mbuf.
(Re-)adding the m_len checks before m_pullup() calls avoids this problems
with mbufs using external storage for now.

MFC after:	3 weeks
Sponsored by:	Netflix
2019-12-01 00:22:04 +00:00
Michael Tuexen
f727fee546 Really ignore the SCTP association identifier on 1-to-1 style sockets
as requiresd by the socket API specification.
Thanks to Inaki Baz Castillo, who found this bug running the userland
stack with valgrind and reported the issue in
https://github.com/sctplab/usrsctp/issues/408

MFC after:		1 week
2019-11-28 12:50:25 +00:00
Michael Tuexen
645f3a1cd1 Plug two mbuf leaks during INIT-ACK handling.
One leak happens when there is not enough memory to allocate the
the resources for streams. The other leak happens if the are
unknown parameters in the received INIT-ACK chunk which require
reporting and the INIT-ACK requires sending an ABORT due to illegal
parameter combinations.
Hopefully this fixes
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=19083

MFC after:		1 week
2019-11-27 19:32:29 +00:00
Ryan Libby
200f3ac6f7 in_mcast.c: need if_addr_lock around inm_release_deferred
Apply a similar fix as for in6_mcast.c.

Reviewed by:	hselasky
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D20740
2019-11-25 22:25:34 +00:00
Bjoern A. Zeeb
1a117215c7 Reduce the vnet_set module size of ip_mroute to allow loading as a module.
With VIMAGE kernels modules get special treatment as they need
to also keep the original values and make copies for each instance.
For that a few pages of vnet modspace are provided and the
kernel-linker and the VNET framework know how to deal with things.
When the modspace is (almost) full, other modules which would
overflow the modspace cannot be loaded and kldload will fail.

ip_mroute uses a lot of variable space, mostly be four big arrays:
set_vnet 0000000000000510 vnet_entry_multicast_register_if
set_vnet 0000000000000700 vnet_entry_viftable
set_vnet 0000000000002000 vnet_entry_bw_meter_timers
set_vnet 0000000000002800 vnet_entry_bw_upcalls

Dynamically malloc the three big ones for each instance we need
and free them again on vnet teardown (the 4th is an ifnet).
That way they only need module space for a single pointer and
allow a lot more modules using virtualized variables to be loaded
on a VNET kernel.

PR:		206583
Reviewed by:	hselasky, kp
MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D22443
2019-11-19 15:38:55 +00:00
Michael Tuexen
c968c769af Add boundary and overflow checks to the formulas used in the TCP CUBIC
congestion control module.

Submitted by:		Richard Scheffenegger
Reviewed by:		rgrimes@
Differential Revision:	https://reviews.freebsd.org/D19118
2019-11-16 12:00:22 +00:00
Michael Tuexen
b0c1a13e4e Improve TCP CUBIC specific after idle reaction.
The adjustments are inspired by the Linux stack, which has had a
functionally equivalent implementation for more than a decade now.

Submitted by:		Richard Scheffenegger
Reviewed by:		Cheng Cui
Differential Revision:	https://reviews.freebsd.org/D18982
2019-11-16 11:57:12 +00:00
Michael Tuexen
35cd141b4b Implement a tCP CUBIC-specific after idle reaction.
This patch addresses a very common case of frequent application stalls,
where TCP runs idle and looses the state of the network.

Submitted by:		Richard Scheffenegger
Reviewed by:		Cheng Cui
Differential Revision:	https://reviews.freebsd.org/D18954
2019-11-16 11:37:26 +00:00
Michael Tuexen
453e633384 Revert https://svnweb.freebsd.org/changeset/base/354708
I used the wrong Differential Revision, so back it out and do it right
in a follow-up commit.
2019-11-16 11:10:09 +00:00
Bjoern A. Zeeb
b141dd5ddf Remove now unused IPv6 macros and update docs.
After r354748-354750 all uses of the IP6_EXTHDR_CHECK() and
IP6_EXTHDR_GET() macros are gone from the kernel.  IP6_EXTHDR_GET0()
was unused.  Remove the macros and update the documentation.

Sponsored by:	Netflix
2019-11-15 21:55:41 +00:00
Bjoern A. Zeeb
4e619b17c5 IP6_EXTHDR_CHECK(): remove the last instances
While r354748 removed almost all IP6_EXTHDR_CHECK() calls, these
are not part of the PULLDOWN_TESTS.
Equally convert these IP6_EXTHDR_CHECK()s here to m_pullup() and remove
the extra check and m_pullup() in tcp_input() under isipv6 given
tcp6_input() has done exactly that pullup already.

MFC after:	8 weeks
Sponsored by:	Netflix
2019-11-15 21:51:43 +00:00
Bjoern A. Zeeb
63abacc204 netinet*: replace IP6_EXTHDR_GET()
In a few places we have IP6_EXTHDR_GET() left in upper layer protocols.
The IP6_EXTHDR_GET() macro might perform an m_pulldown() in case the data
fragment is not contiguous.

Convert these last remaining instances into m_pullup()s instead.
In CARP, for example, we will a few lines later call m_pullup() anyway,
the IPsec code coming from OpenBSD would otherwise have done the m_pullup()
and are copying the data a bit later anyway, so pulling it in seems no
better or worse.

Note: this leaves very few m_pulldown() cases behind in the tree and we
might want to consider removing them as well to make mbuf management
easier again on a path to variable size mbufs, especially given
m_pulldown() still has an issue not re-checking M_WRITEABLE().

Reviewed by:	gallatin
MFC after:	8 weeks
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D22335
2019-11-15 21:44:17 +00:00
Michael Tuexen
730cbbc10d For idle TCP sessions using the CUBIC congestio control, reset ssthresh
to the higher of the previous ssthresh or 3/4 of the prior cwnd.

Submitted by:		Richard Scheffenegger
Reviewed by:		Cheng Cui
Differential Revision:	https://reviews.freebsd.org/D18982
2019-11-14 16:28:02 +00:00
Bjoern A. Zeeb
a8fe77d877 netinet*: update *mp to pass the proper value back
In ip6_[direct_]input() we are looping over the extension headers
to deal with the next header.  We pass a pointer to an mbuf pointer
to the handling functions.  In certain cases the mbuf can be updated
there and we need to pass the new one back.  That missing in
dest6_input() and route6_input().  In tcp6_input() we should also
update it before we call tcp_input().

In addition to that mark the mbuf NULL all the times when we return
that we are done with handling the packet and no next header should
be checked (IPPROTO_DONE).  This will eventually allow us to assert
proper behaviour and catch the above kind of errors more easily,
expecting *mp to always be set.

This change is extracted from a larger patch and not an exhaustive
change across the entire stack yet.

PR:			240135
Reported by:		prabhakar.lakhera gmail.com
MFC after:		3 weeks
Sponsored by:		Netflix
2019-11-12 15:46:28 +00:00
Gleb Smirnoff
273b2e4c55 Remove now unused INP_INFO_RLOCK macros. 2019-11-07 22:26:54 +00:00
Gleb Smirnoff
43e8b279b8 In TCP HPTS enter the epoch in tcp_hpts_thread() and assert it in
the leaf functions.
2019-11-07 21:30:27 +00:00
Gleb Smirnoff
aed553598d Mechanically convert INP_INFO_RLOCK() to NET_EPOCH_ENTER() in TCP
timewait manipulation leaf functions.
2019-11-07 21:29:38 +00:00
Gleb Smirnoff
a81e3ecf46 Since pfslowtimo() runs in the network epoch, tcp_slowtimo()
also does.  This allows to simplify tcp_tw_2msl_scan() and
always require the network epoch in it.
2019-11-07 21:28:46 +00:00
Gleb Smirnoff
032677ceb5 Now that there is no R/W lock on PCB list the pcblist sysctls
handlers can be greatly simplified.  All the previous double
cycling and complex locking was added to avoid these functions
holding global PCB locks for extended period of time, preventing
addition of new entries.
2019-11-07 21:27:32 +00:00
Gleb Smirnoff
d40c0d47cd Now that all of the tcp_input() and all its branches are executed
in the network epoch, we can greatly simplify synchronization.
Remove all unneccesary epoch enters hidden under INP_INFO_RLOCK macro.
Remove some unneccesary assertions and convert necessary ones into the
NET_EPOCH_ASSERT macro.
2019-11-07 21:23:07 +00:00
Gleb Smirnoff
80577e5583 Remove unnecessary recursive epoch enter via INP_INFO_RLOCK
macro in udp_input().  It shall always run in the network epoch.
2019-11-07 21:08:49 +00:00
Gleb Smirnoff
5015a05f0b Remove now unused INP_HASH_RLOCK() macros. 2019-11-07 21:03:15 +00:00
Gleb Smirnoff
2435e507de Now with epoch synchronized PCB lookup tables we can greatly simplify
locking in udp_output() and udp6_output().

First, we select if we need read or write lock in PCB itself, we take
the lock and enter network epoch.  Then, we proceed for the rest of
the function.  In case if we need to modify PCB hash, we would take
write lock on it for a short piece of code.

We could exit the epoch before allocating an mbuf, but with this
patch we are keeping it all the way into ip_output()/ip6_output().
Today this creates an epoch recursion, since ip_output() enters epoch
itself.  However, once all protocols are reviewed, ip_output() and
ip6_output() would require epoch instead of entering it.

Note: I'm not 100% sure that in udp6_output() the epoch is required.
We don't do PCB hash lookup for a bound socket.  And all branches of
in6_select_src() don't require epoch, at least they lack assertions.
Today inet6 address list is protected by rmlock, although it is CKLIST.
AFAIU, the future plan is to protect it by network epoch.  That would
require epoch in in6_select_src().  Anyway, in future ip6_output()
would require epoch, udp6_output() would need to enter it.
2019-11-07 21:01:36 +00:00