6350 Commits

Author SHA1 Message Date
Michael Tuexen
776cd558f0 Separate out SCTP related dtrace code.
This is based on work done by markj@.

Discussed with:		markj@
MFC after:		3 days
2019-10-14 20:32:11 +00:00
Randall Stewart
8ee1cf039e if_hw_tsomaxsegsize needs to be initialized to zero, just
like in bbr.c and tcp_output.c
2019-10-14 13:10:29 +00:00
Michael Tuexen
fcfd8ad537 Rename sctp_dtrace_declare.h to sctp_kdtrace.h for consistentcy.
MFC after:		3 days
2019-10-14 13:02:49 +00:00
Gleb Smirnoff
5df91cbe02 Revert r353313. It is not needed with r353357 and is actually incorrect. 2019-10-14 04:10:00 +00:00
Michael Tuexen
d6e23cf0cf Use an event handler to notify the SCTP about IP address changes
instead of calling an SCTP specific function from the IP code.
This is a requirement of supporting SCTP as a kernel loadable module.
This patch was developed by markj@, I tweaked a bit the SCTP related
code.

Submitted by:		markj@
MFC after:		3 days
2019-10-13 18:17:08 +00:00
Mark Johnston
671d68fad9 Move SCTP DTrace probe definitions into a .c file.
Previously they were defined in a header which was included exactly
once.  Change this to follow the usual practice of putting definitions
in C files.  No functional change intended.

Discussed with:	tuexen
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2019-10-13 16:14:04 +00:00
Michael Tuexen
1b6ddd9404 Ensure that local variables are reset to their initial value when
dealing with error cases in a loop over all remote addresses.
This issue was found and reported by OSS_Fuzz in:
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=18080
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=18086
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=18121
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=18163

MFC after:		3 days
2019-10-12 17:57:03 +00:00
Gleb Smirnoff
1e5db73d07 The divert(4) module must always be running in network epoch, thus
call to if_addr_rlock() isn't needed.
2019-10-10 23:48:42 +00:00
Warner Losh
b23b156e2e Fix casting error from newer gcc
Cast the pointers to (uintptr_t) before assigning to type
uint64_t. This eliminates an error from gcc when we cast the pointer
to a larger integer type.
2019-10-09 21:02:06 +00:00
Hans Petter Selasky
eabddb25a3 Factor out TCP rateset destruction code.
Ensure the epoch_call() function is not called more than one time
before the callback has been executed, by always checking the
RS_FUNERAL_SCHD flag before invoking epoch_call().

The "rs_number_dead" is balanced again after r353353.

Discussed with:	rrs@
Sponsored by:	Mellanox Technologies
2019-10-09 17:08:40 +00:00
Gleb Smirnoff
0732ac0eff Revert most of the multicast changes from r353292. This needs a more
accurate approach.
2019-10-09 17:03:20 +00:00
Hans Petter Selasky
24be13533b Fix locking order reversal in the TCP ratelimit code by moving
destructors outside the rsmtx mutex.

Witness message:
lock order reversal: (sleepable after non-sleepable)
   1st tcp_rs_mtx (rsmtx) @ sys/netinet/tcp_ratelimit.c:242
   2nd sysctl lock (sysctl lock) @ sys/kern/kern_sysctl.c:607

Backtrace:
witness_debugger
witness_checkorder
_rm_wlock_debug
sysctl_ctx_free
rs_destroy
epoch_call_task
gtaskqueue_run_locked
gtaskqueue_thread_loop

Discussed with:	rrs@
Sponsored by:	Mellanox Technologies
2019-10-09 16:48:48 +00:00
John Baldwin
9e14430d46 Add a TOE KTLS mode and a TOE hook for allocating TLS sessions.
This adds the glue to allocate TLS sessions and invokes it from
the TLS enable socket option handler.  This also adds some counters
for active TOE sessions.

The TOE KTLS mode is returned by getsockopt(TLSTX_TLS_MODE) when
TOE KTLS is in use on a socket, but cannot be set via setsockopt().

To simplify various checks, a TLS session now includes an explicit
'mode' member set to the value returned by TLSTX_TLS_MODE.  Various
places that used to check 'sw_encrypt' against NULL to determine
software vs ifnet (NIC) TLS now check 'mode' instead.

Reviewed by:	np, gallatin
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D21891
2019-10-08 21:34:06 +00:00
Gleb Smirnoff
e4c40a8a71 Quickly plug another regression from r353292. Again, multicast locking needs
lots of work...

Reported by:	pho
2019-10-08 16:59:17 +00:00
Michael Tuexen
953b78bed9 Validate length before use it, not vice versa.
r353060 should have contained this...
This fixes
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=18070
MFC after:		3 days
2019-10-08 11:07:16 +00:00
Gleb Smirnoff
b8a6e03fac Widen NET_EPOCH coverage.
When epoch(9) was introduced to network stack, it was basically
dropped in place of existing locking, which was mutexes and
rwlocks. For the sake of performance mutex covered areas were
as small as possible, so became epoch covered areas.

However, epoch doesn't introduce any contention, it just delays
memory reclaim. So, there is no point to minimise epoch covered
areas in sense of performance. Meanwhile entering/exiting epoch
also has non-zero CPU usage, so doing this less often is a win.

Not the least is also code maintainability. In the new paradigm
we can assume that at any stage of processing a packet, we are
inside network epoch. This makes coding both input and output
path way easier.

On output path we already enter epoch quite early - in the
ip_output(), in the ip6_output().

This patch does the same for the input path. All ISR processing,
network related callouts, other ways of packet injection to the
network stack shall be performed in net_epoch. Any leaf function
that walks network configuration now asserts epoch.

Tricky part is configuration code paths - ioctls, sysctls. They
also call into leaf functions, so some need to be changed.

This patch would introduce more epoch recursions (see EPOCH_TRACE)
than we had before. They will be cleaned up separately, as several
of them aren't trivial. Note, that unlike a lock recursion the
epoch recursion is safe and just wastes a bit of resources.

Reviewed by:	gallatin, hselasky, cy, adrian, kristof
Differential Revision:	https://reviews.freebsd.org/D19111
2019-10-07 22:40:05 +00:00
Michael Tuexen
746c7ae563 In r343587 a simple port filter as sysctl tunable was added to siftr.
The new sysctl was not added to the siftr.4 man page at the time.
This updates the man page, and removes one left over trailing whitespace.

Submitted by:		Richard Scheffenegger
Reviewed by:		bcr@
MFC after:		3 days
Differential Revision:	https://reviews.freebsd.org/D21619
2019-10-07 20:35:04 +00:00
Randall Stewart
5b63b22075 Brad Davis identified a problem with the new LRO code, VLAN's
no longer worked. The problem was that the defines used the
same space as the VLAN id. This commit does three things.
1) Move the LRO used fields to the PH_per fields. This is
   safe since the entire PH_per is used for IP reassembly
   which LRO code will not hit.
2) Remove old unused pace fields that are not used in mbuf.h
3) The VLAN processing is not in the mbuf queueing code. Consequently
   if a VLAN submits to Rack or BBR we need to bypass the mbuf queueing
   for now until rack_bbr_common is updated to handle the VLAN properly.

Reported by:	Brad Davis
2019-10-06 22:29:02 +00:00
Michael Tuexen
63fb39ba7b Plumb an mbuf leak in a code path that should not be taken. Also avoid
that this path is taken by setting the tail pointer correctly.
There is still bug related to handling unordered unfragmented messages
which were delayed in deferred handling.
This issue was found by OSS-Fuzz testing the usrsctp stack and reported in
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=17794

MFC after:		3 days
2019-10-06 08:47:10 +00:00
Michael Tuexen
2560cb1eb8 Fix a use after free bug when removing remote addresses.
This bug was found by OSS-Fuzz and reported in
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=18004

MFC after:		3 days
2019-10-05 13:28:01 +00:00
Michael Tuexen
0941b9dc37 Plumb an mbuf leak found by Mark Wodrich from Google by fuzz testing the
userland stack and reporting it in:
https://github.com/sctplab/usrsctp/issues/396

MFC after:		3 days
2019-10-05 12:34:50 +00:00
Michael Tuexen
44f788d793 Fix the adding of padding to COOKIE-ECHO chunks.
Thanks to Mark Wodrich who found this issue while fuzz testing the
usrsctp stack and reported the issue in
https://github.com/sctplab/usrsctp/issues/382

MFC after:		3 days
2019-10-05 09:46:11 +00:00
Michael Tuexen
aac50dab6d When skipping the address parameter, take the padding into account.
MFC after:		3 days
2019-10-03 20:47:57 +00:00
Michael Tuexen
5989470c37 Cleanup sctp_asconf_error_response() and ensure that the parameter
is padded as required. This fixes the followig bug reported by
OSS-Fuzz for the usersctp stack:
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=17790

MFC after:		3 days
2019-10-03 20:39:17 +00:00
Michael Tuexen
967e1a5333 Add missing input validation. This could result in reading from
uninitialized memory.
The issue was found by OSS-Fuzz for usrsctp  and reported in
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=17780

MFC after:		3 days
2019-10-03 18:36:54 +00:00
Michael Tuexen
2974e263c3 Don't use stack memory which is not initialized.
Thanks to Mark Wodrich for reporting this issue for the userland stack in
https://github.com/sctplab/usrsctp/issues/380
This issue was also found for usrsctp by OSS-fuzz in
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=17778

MFC after:		3 days
2019-09-30 12:06:57 +00:00
Michael Tuexen
12a43d0d5d RFC 7112 requires a host to put the complete IP header chain
including the TCP header in the first IP packet.
Enforce this in tcp_output(). In addition make sure that at least
one byte payload fits in the TCP segement to allow making progress.
Without this check, a kernel with INVARIANTS will panic.
This issue was found by running an instance of syzkaller.

Reviewed by:		jtl@
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D21665
2019-09-29 10:45:13 +00:00
Michael Tuexen
71e85612a9 Replacing MD5 by SipHash improves the performance of the TCP time stamp
initialisation, which is important when the host is dealing with a
SYN flood.
This affects the computation of the initial TCP sequence number for
the client side.
This has been discussed with secteam@.

Reviewed by:		gallatin@
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D21616
2019-09-28 13:13:23 +00:00
Michael Tuexen
79c2a2a07b Ensure that the INP lock is released before leaving [gs]etsockopt()
for RACK specific socket options.
These issues were found by a syzkaller instance.
Reviewed by:		rrs@
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D21825
2019-09-28 13:05:37 +00:00
Jonathan T. Looney
0b18fb0798 Add new functionality to switch to using cookies exclusively when we the
syn cache overflows. Whether this is due to an attack or due to the system
having more legitimate connections than the syn cache can hold, this
situation can quickly impact performance.

To make the system perform better during these periods, the code will now
switch to exclusively using cookies until the syn cache stops overflowing.
In order for this to occur, the system must be configured to use the syn
cache with syn cookie fallback. If syn cookies are completely disabled,
this change should have no functional impact.

When the system is exclusively using syn cookies (either due to
configuration or the overflow detection enabled by this change), the
code will now skip acquiring a lock on the syn cache bucket. Additionally,
the code will now skip lookups in several places (such as when the system
receives a RST in response to a SYN|ACK frame).

Reviewed by:	rrs, gallatin (previous version)
Discussed with:	tuexen
Sponsored by:	Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D21644
2019-09-26 15:18:57 +00:00
Jonathan T. Looney
0bee4d631a Access the syncache secret directly from the V_tcp_syncache variable,
rather than indirectly through the backpointer to the tcp_syncache
structure stored in the hashtable bucket.

This also allows us to remove the requirement in syncookie_generate()
and syncookie_lookup() that the syncache hashtable bucket must be
locked.

Reviewed by:	gallatin, rrs
Sponsored by:	Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D21644
2019-09-26 15:06:46 +00:00
Jonathan T. Looney
867e98f8ee Remove the unused sch parameter to the syncache_respond() function. The
use of this parameter was removed in r313330. This commit now removes
passing this now-unused parameter.

Reviewed by:	gallatin, rrs
Sponsored by:	Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D21644
2019-09-26 15:02:34 +00:00
Randall Stewart
ac7bd23a7a lets put (void) in a couple of functions to keep older platforms that
are stuck with gcc happy (ppc). The changes are needed in both bbr and
rack.

Obtained from:	Michael Tuexen (mtuexen@)
2019-09-24 20:36:43 +00:00
Randall Stewart
da99b33b17 don't call in_ratelmit detach when RATELIMIT is not
compiled in the kernel.
2019-09-24 20:11:55 +00:00
Randall Stewart
2f1cc984db Fix the ifdefs in tcp_ratelimit.h. They were reversed so
that instead of functions only being inside the _KERNEL and
the absence of RATELIMIT causing us to have NULL/error returning
interfaces we ended up with non-kernel getting the error path.
opps..
2019-09-24 20:04:31 +00:00
Randall Stewart
35c7bb3407 This commit adds BBR (Bottleneck Bandwidth and RTT) congestion control. This
is a completely separate TCP stack (tcp_bbr.ko) that will be built only if
you add the make options WITH_EXTRA_TCP_STACKS=1 and also include the option
TCPHPTS. You can also include the RATELIMIT option if you have a NIC interface that
supports hardware pacing, BBR understands how to use such a feature.

Note that this commit also adds in a general purpose time-filter which
allows you to have a min-filter or max-filter. A filter allows you to
have a low (or high) value for some period of time and degrade slowly
to another value has time passes. You can find out the details of
BBR by looking at the original paper at:

https://queue.acm.org/detail.cfm?id=3022184

or consult many other web resources you can find on the web
referenced by "BBR congestion control". It should be noted that
BBRv1 (which this is) does tend to unfairness in cases of small
buffered paths, and it will usually get less bandwidth in the case
of large BDP paths(when competing with new-reno or cubic flows). BBR
is still an active research area and we do plan on  implementing V2
of BBR to see if it is an improvement over V1.

Sponsored by:	Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D21582
2019-09-24 18:18:11 +00:00
Michael Tuexen
2b861c1538 Plumb a memory leak.
Thnanks to Felix Weinrank for finding this issue using fuzz testing
and reporting it for the userland stack:
https://github.com/sctplab/usrsctp/issues/378

MFC after:		3 days
2019-09-24 13:15:24 +00:00
Michael Tuexen
1325a0de13 Don't hold the info lock when calling sctp_select_a_tag().
This avoids a double lock bug in the NAT colliding state processing
of SCTP. Thanks to Felix Weinrank for finding and reporting this issue in
https://github.com/sctplab/usrsctp/issues/374
He found this bug using fuzz testing.

MFC after:		3 days
2019-09-22 11:11:01 +00:00
Michael Tuexen
44f2a3272e Cleanup the RTO calculation and perform some consistency checks
before computing the RTO.
This should fix an overflow issue reported by Felix Weinrank in
https://github.com/sctplab/usrsctp/issues/375
for the userland stack and found by running a fuzz tester.

MFC after:		3 days
2019-09-22 10:40:15 +00:00
Michael Tuexen
e6b3bd22d8 Fix the handling of invalid parameters in ASCONF chunks.
Thanks to Mark Wodrich from Google for reproting the issue in
https://github.com/sctplab/usrsctp/issues/376
for the userland stack.

MFC after:		3 days
2019-09-20 08:20:20 +00:00
Michael Tuexen
dd3121a895 When the RACK stack computes the space for user data in a TCP segment,
it wasn't taking the IP level options into account. This patch fixes this.
In addition, it also corrects a KASSERT and adds protection code to assure
that the IP header chain and the TCP head fit in the first fragment as
required by RFC 7112.

Reviewed by:		rrs@
MFC after:		3 days
Sponsored by:		Nertflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D21666
2019-09-19 10:27:47 +00:00
Michael Tuexen
3c19311544 Only allow a SCTP-AUTH shared key to be updated by the application
if it is not deactivated and not used.
This avoids a use-after-free problem.

Reported by:		da_cheng_shao@yeah.net
MFC after:		3 days
2019-09-17 09:46:42 +00:00
Michael Tuexen
5b66b7f11b Don't write to memory outside of the allocated array for SACK blocks.
Obtained from:		rrs@
MFC after:		3 days
Sponsored by:		Netflix, Inc.
2019-09-16 08:18:05 +00:00
Michael Tuexen
52f6a8090e When the IP layer calls back into the SCTP layer to perform the SCTP
checksum computation, do not assume that the IP header chain and the
SCTP common header are in contiguous memory although the SCTP lays
out the mbuf chains that way. If there are IP-level options inserted
by the IP layer, the constraint is not fulfilled anymore.

This issues was found by running syzkaller. Thanks to markj@ who is
running an instance which also provides kernel dumps. This allowed me
to find this issue.

MFC after:		3 days
2019-09-15 18:29:45 +00:00
Andrew Gallatin
d2e6258258 Avoid unneeded call to arc4random() in syncache_add()
Don't call arc4random() unconditionally to initialize sc_iss, and
then when syncookies are enabled, just overwrite it with the
return value from from syncookie_generate(). Instead, only call
arc4random() to initialize sc_iss when syncookies are not
enabled.

Note that on a system under a syn flood attack, arc4random()
becomes quite expensive, and the chacha_poly crypto that it calls
is one of the more expensive things happening on the
system. Removing this unneeded arc4random() call reduces CPU from
about 40% to about 35% in my test scenario (Broadwell Xeon, 6Mpps
syn flood attack).

Reviewed by:	rrs, tuxen, bz
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21591
2019-09-11 18:48:26 +00:00
Randall Stewart
6f32ca1936 With the recent commit of ktls, we no longer have a
sb_tls_flags, its just the sb_flags. Also the ratelimit
code, now that the defintion is in sockbuf.h, does not
need the ktls.h file (or its predecessor).

Sponsored by:	Netflix Inc
2019-09-11 15:41:36 +00:00
Michael Tuexen
6d261981ee Only update SACK/DSACK lists when a non-empty segment was received.
This fixes hitting a KASSERT with a valid packet exchange.

Reviewed by:		rrs@, Richard Scheffenegger
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D21567
2019-09-09 16:07:47 +00:00
Conrad Meyer
373013b05f Fix build after r351934
tcp_queue_pkts() is only used if TCPHPTS is defined (and it is not by
default).

Reported by:	gcc
2019-09-06 18:33:39 +00:00
Randall Stewart
af9b9e0d9f This adds in the missing counter initialization which
I had forgotten to bring over.. opps.

Differential Revision: https://reviews.freebsd.org/D21127
2019-09-06 18:29:48 +00:00
Warner Losh
82e837f803 Initialize if_hw_tsomaxsegsize to 0 to appease gcc's flow analysis as a
fail-safe.
2019-09-06 18:25:42 +00:00