6257 Commits

Author SHA1 Message Date
tuexen
5f8f65d380 Don't use stack memory which is not initialized.
Thanks to Mark Wodrich for reporting this issue for the userland stack in
https://github.com/sctplab/usrsctp/issues/380
This issue was also found for usrsctp by OSS-fuzz in
https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=17778

MFC after:		3 days
2019-09-30 12:06:57 +00:00
tuexen
5f6d43232d RFC 7112 requires a host to put the complete IP header chain
including the TCP header in the first IP packet.
Enforce this in tcp_output(). In addition make sure that at least
one byte payload fits in the TCP segement to allow making progress.
Without this check, a kernel with INVARIANTS will panic.
This issue was found by running an instance of syzkaller.

Reviewed by:		jtl@
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D21665
2019-09-29 10:45:13 +00:00
tuexen
dc3d530b6c Replacing MD5 by SipHash improves the performance of the TCP time stamp
initialisation, which is important when the host is dealing with a
SYN flood.
This affects the computation of the initial TCP sequence number for
the client side.
This has been discussed with secteam@.

Reviewed by:		gallatin@
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D21616
2019-09-28 13:13:23 +00:00
tuexen
09c42850dd Ensure that the INP lock is released before leaving [gs]etsockopt()
for RACK specific socket options.
These issues were found by a syzkaller instance.
Reviewed by:		rrs@
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D21825
2019-09-28 13:05:37 +00:00
jtl
9b94fc3ad9 Add new functionality to switch to using cookies exclusively when we the
syn cache overflows. Whether this is due to an attack or due to the system
having more legitimate connections than the syn cache can hold, this
situation can quickly impact performance.

To make the system perform better during these periods, the code will now
switch to exclusively using cookies until the syn cache stops overflowing.
In order for this to occur, the system must be configured to use the syn
cache with syn cookie fallback. If syn cookies are completely disabled,
this change should have no functional impact.

When the system is exclusively using syn cookies (either due to
configuration or the overflow detection enabled by this change), the
code will now skip acquiring a lock on the syn cache bucket. Additionally,
the code will now skip lookups in several places (such as when the system
receives a RST in response to a SYN|ACK frame).

Reviewed by:	rrs, gallatin (previous version)
Discussed with:	tuexen
Sponsored by:	Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D21644
2019-09-26 15:18:57 +00:00
jtl
3497c0f2a2 Access the syncache secret directly from the V_tcp_syncache variable,
rather than indirectly through the backpointer to the tcp_syncache
structure stored in the hashtable bucket.

This also allows us to remove the requirement in syncookie_generate()
and syncookie_lookup() that the syncache hashtable bucket must be
locked.

Reviewed by:	gallatin, rrs
Sponsored by:	Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D21644
2019-09-26 15:06:46 +00:00
jtl
7ca5024880 Remove the unused sch parameter to the syncache_respond() function. The
use of this parameter was removed in r313330. This commit now removes
passing this now-unused parameter.

Reviewed by:	gallatin, rrs
Sponsored by:	Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D21644
2019-09-26 15:02:34 +00:00
rrs
804541b4e4 lets put (void) in a couple of functions to keep older platforms that
are stuck with gcc happy (ppc). The changes are needed in both bbr and
rack.

Obtained from:	Michael Tuexen (mtuexen@)
2019-09-24 20:36:43 +00:00
rrs
5c5e5b0ec5 don't call in_ratelmit detach when RATELIMIT is not
compiled in the kernel.
2019-09-24 20:11:55 +00:00
rrs
b84f1569d1 Fix the ifdefs in tcp_ratelimit.h. They were reversed so
that instead of functions only being inside the _KERNEL and
the absence of RATELIMIT causing us to have NULL/error returning
interfaces we ended up with non-kernel getting the error path.
opps..
2019-09-24 20:04:31 +00:00
rrs
7648feb4d9 This commit adds BBR (Bottleneck Bandwidth and RTT) congestion control. This
is a completely separate TCP stack (tcp_bbr.ko) that will be built only if
you add the make options WITH_EXTRA_TCP_STACKS=1 and also include the option
TCPHPTS. You can also include the RATELIMIT option if you have a NIC interface that
supports hardware pacing, BBR understands how to use such a feature.

Note that this commit also adds in a general purpose time-filter which
allows you to have a min-filter or max-filter. A filter allows you to
have a low (or high) value for some period of time and degrade slowly
to another value has time passes. You can find out the details of
BBR by looking at the original paper at:

https://queue.acm.org/detail.cfm?id=3022184

or consult many other web resources you can find on the web
referenced by "BBR congestion control". It should be noted that
BBRv1 (which this is) does tend to unfairness in cases of small
buffered paths, and it will usually get less bandwidth in the case
of large BDP paths(when competing with new-reno or cubic flows). BBR
is still an active research area and we do plan on  implementing V2
of BBR to see if it is an improvement over V1.

Sponsored by:	Netflix Inc.
Differential Revision:	https://reviews.freebsd.org/D21582
2019-09-24 18:18:11 +00:00
tuexen
eceade6db5 Plumb a memory leak.
Thnanks to Felix Weinrank for finding this issue using fuzz testing
and reporting it for the userland stack:
https://github.com/sctplab/usrsctp/issues/378

MFC after:		3 days
2019-09-24 13:15:24 +00:00
tuexen
0de4a3a2bf Don't hold the info lock when calling sctp_select_a_tag().
This avoids a double lock bug in the NAT colliding state processing
of SCTP. Thanks to Felix Weinrank for finding and reporting this issue in
https://github.com/sctplab/usrsctp/issues/374
He found this bug using fuzz testing.

MFC after:		3 days
2019-09-22 11:11:01 +00:00
tuexen
fcac53bb44 Cleanup the RTO calculation and perform some consistency checks
before computing the RTO.
This should fix an overflow issue reported by Felix Weinrank in
https://github.com/sctplab/usrsctp/issues/375
for the userland stack and found by running a fuzz tester.

MFC after:		3 days
2019-09-22 10:40:15 +00:00
tuexen
f8c515c35e Fix the handling of invalid parameters in ASCONF chunks.
Thanks to Mark Wodrich from Google for reproting the issue in
https://github.com/sctplab/usrsctp/issues/376
for the userland stack.

MFC after:		3 days
2019-09-20 08:20:20 +00:00
tuexen
5d8f9a3242 When the RACK stack computes the space for user data in a TCP segment,
it wasn't taking the IP level options into account. This patch fixes this.
In addition, it also corrects a KASSERT and adds protection code to assure
that the IP header chain and the TCP head fit in the first fragment as
required by RFC 7112.

Reviewed by:		rrs@
MFC after:		3 days
Sponsored by:		Nertflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D21666
2019-09-19 10:27:47 +00:00
tuexen
9ba53a7f96 Only allow a SCTP-AUTH shared key to be updated by the application
if it is not deactivated and not used.
This avoids a use-after-free problem.

Reported by:		da_cheng_shao@yeah.net
MFC after:		3 days
2019-09-17 09:46:42 +00:00
tuexen
8246248351 Don't write to memory outside of the allocated array for SACK blocks.
Obtained from:		rrs@
MFC after:		3 days
Sponsored by:		Netflix, Inc.
2019-09-16 08:18:05 +00:00
tuexen
e7003e64b0 When the IP layer calls back into the SCTP layer to perform the SCTP
checksum computation, do not assume that the IP header chain and the
SCTP common header are in contiguous memory although the SCTP lays
out the mbuf chains that way. If there are IP-level options inserted
by the IP layer, the constraint is not fulfilled anymore.

This issues was found by running syzkaller. Thanks to markj@ who is
running an instance which also provides kernel dumps. This allowed me
to find this issue.

MFC after:		3 days
2019-09-15 18:29:45 +00:00
gallatin
b799855a78 Avoid unneeded call to arc4random() in syncache_add()
Don't call arc4random() unconditionally to initialize sc_iss, and
then when syncookies are enabled, just overwrite it with the
return value from from syncookie_generate(). Instead, only call
arc4random() to initialize sc_iss when syncookies are not
enabled.

Note that on a system under a syn flood attack, arc4random()
becomes quite expensive, and the chacha_poly crypto that it calls
is one of the more expensive things happening on the
system. Removing this unneeded arc4random() call reduces CPU from
about 40% to about 35% in my test scenario (Broadwell Xeon, 6Mpps
syn flood attack).

Reviewed by:	rrs, tuxen, bz
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21591
2019-09-11 18:48:26 +00:00
rrs
364ba42474 With the recent commit of ktls, we no longer have a
sb_tls_flags, its just the sb_flags. Also the ratelimit
code, now that the defintion is in sockbuf.h, does not
need the ktls.h file (or its predecessor).

Sponsored by:	Netflix Inc
2019-09-11 15:41:36 +00:00
tuexen
f0fb0ad52b Only update SACK/DSACK lists when a non-empty segment was received.
This fixes hitting a KASSERT with a valid packet exchange.

Reviewed by:		rrs@, Richard Scheffenegger
MFC after:		3 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D21567
2019-09-09 16:07:47 +00:00
cem
74b4315b69 Fix build after r351934
tcp_queue_pkts() is only used if TCPHPTS is defined (and it is not by
default).

Reported by:	gcc
2019-09-06 18:33:39 +00:00
rrs
7b6425ea8b This adds in the missing counter initialization which
I had forgotten to bring over.. opps.

Differential Revision: https://reviews.freebsd.org/D21127
2019-09-06 18:29:48 +00:00
imp
f62ee993b4 Initialize if_hw_tsomaxsegsize to 0 to appease gcc's flow analysis as a
fail-safe.
2019-09-06 18:25:42 +00:00
rrs
0bfa32c032 This adds the final tweaks to LRO that will now allow me
to add BBR. These changes make it so you can get an
array of timestamps instead of a compressed ack/data segment.
BBR uses this to aid with its delivery estimates. We also
now (via Drew's suggestions) will not go to the expense of
the tcb lookup if no stack registers to want this feature. If
HPTS is not present the feature is not present either and you
just get the compressed behavior.

Sponsored by:	Netflix Inc
Differential Revision: https://reviews.freebsd.org/D21127
2019-09-06 14:25:41 +00:00
tuexen
5c20a329a1 Fix the SACK block generation in the base TCP stack by bringing it in
sync with the RACK stack.

Reviewed by:		rrs@
MFC after:		5 days
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D21513
2019-09-04 04:38:31 +00:00
tuexen
97c1567f72 Fix two TCP RACK issues:
* Convert the TCP delayed ACK timer from ms to ticks as required.
  This fixes the timer on platforms with hz != 1000.
* Don't delay acknowledgements which report duplicate data using
  DSACKs.

Reviewed by:		rrs@
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D21512
2019-09-03 19:48:02 +00:00
tuexen
040eca9a1d This patch improves the DSACK handling to conform with RFC 2883.
The lowest SACK block is used when multiple Blocks would be elegible as
DSACK blocks ACK blocks get reordered - while maintaining the ordering of
SACK blocks not relevant in the DSACK context is maintained.

Reviewed by:		rrs@, tuexen@
Obtained from:		Richard Scheffenegger
MFC after:		1 week
Differential Revision:	https://reviews.freebsd.org/D21038
2019-09-02 19:04:02 +00:00
tuexen
a1d2b51873 Fix initialization of top_fsn.
MFC after:		3 days
2019-09-01 10:39:16 +00:00
tuexen
209e505321 Improve the handling of state cookie parameters in INIT-ACK chunks.
This fixes problem with parameters indicating a zero length or partial
parameters after an unknown parameter indicating to stop processing. It
also fixes a problem with state cookie parameters after unknown
parametes indicating to stop porcessing.
Thanks to Mark Wodrich from Google for finding two of these issues
by fuzz testing the userland stack and reporting them in
https://github.com/sctplab/usrsctp/issues/355
and
https://github.com/sctplab/usrsctp/issues/352

MFC after:		3 days
2019-09-01 10:09:53 +00:00
tuexen
f17d8ccdcd Improve function definition.
MFC after:		3 days
2019-08-31 13:13:40 +00:00
tuexen
d0f0e21769 Improve the handling of illegal sequence number combinations in received
data chunks. Abort the association if there are data chunks with larger
fragement sequence numbers than the fragement sequence of the last
fragment.
Thanks to Mark Wodrich from Google who found this issue by fuzz testing
the userland stack and reporting this issue in
https://github.com/sctplab/usrsctp/issues/355

MFC after:		3 days
2019-08-31 08:18:49 +00:00
jhb
1cf31620c8 Add kernel-side support for in-kernel TLS.
KTLS adds support for in-kernel framing and encryption of Transport
Layer Security (1.0-1.2) data on TCP sockets.  KTLS only supports
offload of TLS for transmitted data.  Key negotation must still be
performed in userland.  Once completed, transmit session keys for a
connection are provided to the kernel via a new TCP_TXTLS_ENABLE
socket option.  All subsequent data transmitted on the socket is
placed into TLS frames and encrypted using the supplied keys.

Any data written to a KTLS-enabled socket via write(2), aio_write(2),
or sendfile(2) is assumed to be application data and is encoded in TLS
frames with an application data type.  Individual records can be sent
with a custom type (e.g. handshake messages) via sendmsg(2) with a new
control message (TLS_SET_RECORD_TYPE) specifying the record type.

At present, rekeying is not supported though the in-kernel framework
should support rekeying.

KTLS makes use of the recently added unmapped mbufs to store TLS
frames in the socket buffer.  Each TLS frame is described by a single
ext_pgs mbuf.  The ext_pgs structure contains the header of the TLS
record (and trailer for encrypted records) as well as references to
the associated TLS session.

KTLS supports two primary methods of encrypting TLS frames: software
TLS and ifnet TLS.

Software TLS marks mbufs holding socket data as not ready via
M_NOTREADY similar to sendfile(2) when TLS framing information is
added to an unmapped mbuf in ktls_frame().  ktls_enqueue() is then
called to schedule TLS frames for encryption.  In the case of
sendfile_iodone() calls ktls_enqueue() instead of pru_ready() leaving
the mbufs marked M_NOTREADY until encryption is completed.  For other
writes (vn_sendfile when pages are available, write(2), etc.), the
PRUS_NOTREADY is set when invoking pru_send() along with invoking
ktls_enqueue().

A pool of worker threads (the "KTLS" kernel process) encrypts TLS
frames queued via ktls_enqueue().  Each TLS frame is temporarily
mapped using the direct map and passed to a software encryption
backend to perform the actual encryption.

(Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if
someone wished to make this work on architectures without a direct
map.)

KTLS supports pluggable software encryption backends.  Internally,
Netflix uses proprietary pure-software backends.  This commit includes
a simple backend in a new ktls_ocf.ko module that uses the kernel's
OpenCrypto framework to provide AES-GCM encryption of TLS frames.  As
a result, software TLS is now a bit of a misnomer as it can make use
of hardware crypto accelerators.

Once software encryption has finished, the TLS frame mbufs are marked
ready via pru_ready().  At this point, the encrypted data appears as
regular payload to the TCP stack stored in unmapped mbufs.

ifnet TLS permits a NIC to offload the TLS encryption and TCP
segmentation.  In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS)
is allocated on the interface a socket is routed over and associated
with a TLS session.  TLS records for a TLS session using ifnet TLS are
not marked M_NOTREADY but are passed down the stack unencrypted.  The
ip_output_send() and ip6_output_send() helper functions that apply
send tags to outbound IP packets verify that the send tag of the TLS
record matches the outbound interface.  If so, the packet is tagged
with the TLS send tag and sent to the interface.  The NIC device
driver must recognize packets with the TLS send tag and schedule them
for TLS encryption and TCP segmentation.  If the the outbound
interface does not match the interface in the TLS send tag, the packet
is dropped.  In addition, a task is scheduled to refresh the TLS send
tag for the TLS session.  If a new TLS send tag cannot be allocated,
the connection is dropped.  If a new TLS send tag is allocated,
however, subsequent packets will be tagged with the correct TLS send
tag.  (This latter case has been tested by configuring both ports of a
Chelsio T6 in a lagg and failing over from one port to another.  As
the connections migrated to the new port, new TLS send tags were
allocated for the new port and connections resumed without being
dropped.)

ifnet TLS can be enabled and disabled on supported network interfaces
via new '[-]txtls[46]' options to ifconfig(8).  ifnet TLS is supported
across both vlan devices and lagg interfaces using failover, lacp with
flowid enabled, or lacp with flowid enabled.

Applications may request the current KTLS mode of a connection via a
new TCP_TXTLS_MODE socket option.  They can also use this socket
option to toggle between software and ifnet TLS modes.

In addition, a testing tool is available in tools/tools/switch_tls.
This is modeled on tcpdrop and uses similar syntax.  However, instead
of dropping connections, -s is used to force KTLS connections to
switch to software TLS and -i is used to switch to ifnet TLS.

Various sysctls and counters are available under the kern.ipc.tls
sysctl node.  The kern.ipc.tls.enable node must be set to true to
enable KTLS (it is off by default).  The use of unmapped mbufs must
also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS.

KTLS is enabled via the KERN_TLS kernel option.

This patch is the culmination of years of work by several folks
including Scott Long and Randall Stewart for the original design and
implementation; Drew Gallatin for several optimizations including the
use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records
awaiting software encryption, and pluggable software crypto backends;
and John Baldwin for modifications to support hardware TLS offload.

Reviewed by:	gallatin, hselasky, rrs
Obtained from:	Netflix
Sponsored by:	Netflix, Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D21277
2019-08-27 00:01:56 +00:00
tuexen
06332e8121 Don't hold the rs_mtx lock while calling malloc().
Reviewed by:		rrs@
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D21416
2019-08-26 16:23:47 +00:00
rrs
6afcaa6dcb Fix an issue when TSO and Rack play together. Basically
an retransmission of the initial SYN (with data) would
cause us to strip the SYN and decrement/increase offset/len
which then caused us a -1 offset and a panic.

Reported by:	Larry Rosenman
(Michael Tuexen helped me debug this at the IETF)
2019-08-21 10:45:28 +00:00
markj
d9c996b3fc Fix netdump buffering after r348473.
nd_buf is used to buffer headers (for both the kernel dump itself and
for EKCD) before the final call to netdump_dumper(), which flushes
residual data in nd_buf.  As a result, a small portion of the residual
data would be corrupted.  This manifests when kernel dump compression
is enabled since both zstd and zlib detect the corruption during
decompression.

Reviewed by:	cem
Differential Revision:	https://reviews.freebsd.org/D21294
2019-08-19 16:29:51 +00:00
ae
cdfc67a5b0 Save ip_ttl value and restore it after checksum calculation.
Since ipvoly is used for checksum calculation, part of original IP
header is zeroed. This part includes ip_ttl field, that can be used
later in IP_MINTTL socket option handling.

PR:		239799
MFC after:	1 week
2019-08-13 12:47:53 +00:00
rrs
ce02104e38 Place back in the dependency on HPTS via module depends versus
a fatal error in compiling. This was taken out by mistake
when I mis-merged from the 18q22p2 sources of rack in NF. Opps.

Reported by:	sbruno
2019-08-13 12:41:15 +00:00
thj
e262ca6be6 Rename IPPROTO 33 from SEP to DCCP
IPPROTO 33 is DCCP in the IANA Registry:
https://www.iana.org/assignments/protocol-numbers/protocol-numbers.xhtml

IPPROTO_SEP was added about 20 years ago in r33804. The entries were added
straight from RFC1700, without regard to whether they were used.

The reference in RFC1700 for SEP is '[JC120] <mystery contact>', this is an
indication that the protocol number was probably in use in a private network.

As RFC1700 is no longer the authoritative list of internet numbers and that
IANA assinged 33 to DCCP in RFC4340, change the header to the actual
authoritative source.

Reviewed by:	Richard Scheffenegger, bz
Approved by:	bz (mentor)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D21178
2019-08-08 11:43:09 +00:00
tuexen
8822cfdea8 Fix a typo.
Submitted by:		Thomas Dreibholz
MFC after:		1 week
2019-08-08 08:23:27 +00:00
tuexen
0d946942e2 Fix a locking issue in sctp_accept.
PR:			238520
Reported by:		pho@
MFC after:		1 week
2019-08-06 10:29:19 +00:00
tuexen
622372367b Fix build issues for the userland stack on Raspbian. 2019-08-06 08:33:21 +00:00
tuexen
d76ac3bf32 Improve consistency. No functional change.
MFC after:		3 days
2019-08-05 13:22:15 +00:00
delphij
67daa950c9 Fix !INET build. 2019-08-02 22:43:09 +00:00
rrs
1ef9414634 Fix one more atomic for i86
Obtained from:	mtuexen@freebsd.org
2019-08-02 11:17:07 +00:00
bz
f66d5bcdd2 IPv6 cleanup: kernel
Finish what was started a few years ago and harmonize IPv6 and IPv4
kernel names.  We are down to very few places now that it is feasible
to do the change for everything remaining with causing too much disturbance.

Remove "aliases" for IPv6 names which confusingly could indicate
that we are talking about a different data structure or field or
have two fields, one for each address family.
Try to follow common conventions used in FreeBSD.

* Rename sin6p to sin6 as that is how it is spelt in most places.
* Remove "aliases" (#defines) for:
  - in6pcb which really is an inpcb and nothing separate
  - sotoin6pcb which is sotoinpcb (as per above)
  - in6p_sp which is inp_sp
  - in6p_flowinfo which is inp_flow
* Try to use ia6 for in6_addr rather than in6p.
* With all these gone  also rename the in6p variables to inp as
  that is what we call it in most of the network stack including
  parts of netinet6.

The reasons behind this cleanup are that we try to further
unify netinet and netinet6 code where possible and that people
will less ignore one or the other protocol family when doing
code changes as they may not have spotted places due to different
names for the same thing.

No functional changes.

Discussed with:		tuexen (SCTP changes)
MFC after:		3 months
Sponsored by:		Netflix
2019-08-02 07:41:36 +00:00
rrs
5f0d15ff5a Opps use fetchadd_u64 not long to keep old 32 bit platforms
happy.
2019-08-01 20:26:27 +00:00
tuexen
922cab2076 Fix the reporting of multiple unknown parameters in an received INIT
chunk. This also plugs an potential mbuf leak.
Thanks to Felix Weinrank for reporting this issue found by fuzz-testing
the userland stack.

MFC after:		3 days
2019-08-01 19:45:34 +00:00
tuexen
bc4fcf2068 When responding with an ABORT to an INIT chunk containing a
HOSTNAME parameter or a parameter with an illegal length, only
include an error cause indicating why the ABORT was sent.
This also fixes an mbuf leak which could occur.

MFC after:		3 days
2019-08-01 17:36:15 +00:00