Commit Graph

6556 Commits

Author SHA1 Message Date
Randall Stewart
e570d231f4 This change does a small prepratory step in getting the
latest rack and bbr in from the NF repo. When those come
in the OOB data handling will be fixed where Skyzaller crashes.

Differential Revision:	https://reviews.freebsd.org/D24575
2020-04-27 16:30:29 +00:00
Alexander V. Chernikov
55f57ca9ac Convert debugnet to the new routing KPI.
Introduce new fib[46]_lookup_debugnet() functions serving as a
special interface for the crash-time operations. Underlying
implementation will try to return lookup result if
datastructures are not corrupted, avoding locking.

Convert debugnet to use fib4_lookup_debugnet() and switch it
to use nexthops instead of rtentries.

Reviewed by:	cem
Differential Revision:	https://reviews.freebsd.org/D24555
2020-04-26 18:42:38 +00:00
Alexander V. Chernikov
17cb6ddba8 Fix order of arguments in fib[46]_lookup calls in SCTP.
r360292 introduced the wrong order, resulting in returned
 nhops not being referenced, despite the fact that references
 were requested. That lead to random GPF after using SCTP sockets.

Special defined macro like IPV[46]_SCOPE_GLOBAL will be introduced
 soon to reduce the chance of putting arguments in wrong order.

Reported-by: syzbot+5c813c01096363174684@syzkaller.appspotmail.com
2020-04-26 13:02:42 +00:00
Alexander V. Chernikov
454d389645 Fix LINT build #2 after r360292.
Pointyhat to: melifaro
2020-04-25 11:35:38 +00:00
Alexander V. Chernikov
ac99fd86d4 Fix LINT build broken by r360292. 2020-04-25 10:31:56 +00:00
Alexander V. Chernikov
983066f05b Convert route caching to nexthop caching.
This change is build on top of nexthop objects introduced in r359823.

Nexthops are separate datastructures, containing all necessary information
 to perform packet forwarding such as gateway interface and mtu. Nexthops
 are shared among the routes, providing more pre-computed cache-efficient
 data while requiring less memory. Splitting the LPM code and the attached
 data solves multiple long-standing problems in the routing layer,
 drastically reduces the coupling with outher parts of the stack and allows
 to transparently introduce faster lookup algorithms.

Route caching was (re)introduced to minimise (slow) routing lookups, allowing
 for notably better performance for large TCP senders. Caching works by
 acquiring rtentry reference, which is protected by per-rtentry mutex.
 If the routing table is changed (checked by comparing the rtable generation id)
 or link goes down, cache record gets withdrawn.

Nexthops have the same reference counting interface, backed by refcount(9).
This change merely replaces rtentry with the actual forwarding nextop as a
 cached object, which is mostly mechanical. Other moving parts like cache
 cleanup on rtable change remains the same.

Differential Revision:	https://reviews.freebsd.org/D24340
2020-04-25 09:06:11 +00:00
Alexander V. Chernikov
9e88f47c8f Unbreak LINT-NOINET[6] builds broken in r360191.
Reported by:	np
2020-04-23 06:55:33 +00:00
Michael Tuexen
8262311cbe Improve input validation when processing AUTH chunks.
Thanks to Natalie Silvanovich from Google for finding and reporting the
issue found by her in the SCTP userland stack.

MFC after:		3 days
X-MFC with:		https://svnweb.freebsd.org/changeset/base/360193
2020-04-22 21:22:33 +00:00
Michael Tuexen
97feba891d Improve input validation when processing AUTH chunks.
Thanks to Natalie Silvanovich from Google for finding and reporting the
issue found by her in the SCTP userland stack.

MFC after:		3 days
2020-04-22 12:47:46 +00:00
Alexander V. Chernikov
8d6708ba80 Convert TOE routing lookups to the new routing KPI.
Reviewed by:	np
Differential Revision:	https://reviews.freebsd.org/D24388
2020-04-22 07:53:43 +00:00
Richard Scheffenegger
bb410f9ff2 revert rS360143 - Correctly set up initial cwnd
due to syzkaller panics found

Reported by:	tuexen
Approved by:	tuexen (mentor)
Sponsored by:	NetApp, Inc.
2020-04-22 00:16:42 +00:00
Richard Scheffenegger
73b7696693 Correctly set up the initial TCP congestion window
in all cases, by adjust snd_una right after the
connection initialization, to include the one byte
in sequence space occupied by the SYN bit.

This does not change the regular ACK processing,
while making the BYTES_THIS_ACK macro to work properly.

PR:		235256
Reviewed by:	tuexen (mentor), rgrimes (mentor)
Approved by:	tuexen (mentor), rgrimes (mentor)
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D19000
2020-04-21 13:05:44 +00:00
Jonathan T. Looney
5d6e356cb0 Avoid calling protocol drain routines more than once per reclamation event.
mb_reclaim() calls the protocol drain routines for each protocol in each
domain. Some protocols exist in more than one domain and share drain
routines. In the case of SCTP, it also uses the same drain routine for
its SOCK_SEQPACKET and SOCK_STREAM entries in the same domain.

On systems with INET, INET6, and SCTP all defined, mb_reclaim() calls
sctp_drain() four times. On systems with INET and INET6 defined,
mb_reclaim() calls tcp_drain() twice. mb_reclaim() is the only in-tree
caller of the pr_drain protocol entry.

Eliminate this duplication by ensuring that each pr_drain routine is only
specified for one protocol entry in one domain.

Reviewed by:	tuexen
MFC after:	2 weeks
Sponsored by:	Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D24418
2020-04-16 20:17:24 +00:00
Alexander V. Chernikov
539642a29d Add nhop parameter to rti_filter callback.
One of the goals of the new routing KPI defined in r359823 is to
 entirely hide`struct rtentry` from the consumers. It will allow to
 improve routing subsystem internals and deliver more features much faster.
This change is one of the ongoing changes to eliminate direct
 struct rtentry field accesses.

Additionally, with the followup multipath changes, single rtentry can point
 to multiple nexthops.

With that in mind, convert rti_filter callback used when traversing the
 routing table to accept pair (rt, nhop) instead of nexthop.

Reviewed by:	ae
Differential Revision:	https://reviews.freebsd.org/D24440
2020-04-16 17:20:18 +00:00
Richard Scheffenegger
d7ca3f780d Reduce default TCP delayed ACK timeout to 40ms.
Reviewed by:	kbowling, tuexen
Approved by:	tuexen (mentor)
MFC after:	2 weeks
Sponsored by:	NetApp, Inc.
Differential Revision:	https://reviews.freebsd.org/D23281
2020-04-16 15:59:23 +00:00
Alexander V. Chernikov
9ac7c6cfed Convert IP/IPv6 forwarding, ICMP processing and IP PCB laddr selection to
the new routing KPI.

Reviewed by:	ae
Differential Revision:	https://reviews.freebsd.org/D24245
2020-04-14 23:06:25 +00:00
Michael Tuexen
b89af8e16d Improve the TCP blackhole detection. The principle is to reduce the
MSS in two steps and try each candidate two times. However, if two
candidates are the same (which is the case in TCP/IPv6), this candidate
was tested four times. This patch ensures that each candidate actually
reduced the MSS and is only tested 2 times. This reduces the time window
of missclassifying a temporary outage as an MTU issue.

Reviewed by:		jtl
MFC after:		1 week
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D24308
2020-04-14 16:35:05 +00:00
Andrew Gallatin
23feb56348 KTLS: Re-work unmapped mbufs to carry ext_pgs in the mbuf itself.
While the original implementation of unmapped mbufs was a large
step forward in terms of reducing cache misses by enabling mbufs
to carry more than a single page for sendfile, they are rather
cache unfriendly when accessing the ext_pgs metadata and
data. This is because the ext_pgs part of the mbuf is allocated
separately, and almost guaranteed to be cold in cache.

This change takes advantage of the fact that unmapped mbufs
are never used at the same time as pkthdr mbufs. Given this
fact, we can overlap the ext_pgs metadata with the mbuf
pkthdr, and carry the ext_pgs meta directly in the mbuf itself.
Similarly, we can carry the ext_pgs data (TLS hdr/trailer/array
of pages) directly after the existing m_ext.

In order to be able to carry 5 pages (which is the minimum
required for a 16K TLS record which is not perfectly aligned) on
LP64, I've had to steal ext_arg2. The only user of this in the
xmit path is sendfile, and I've adjusted it to use arg1 when
using unmapped mbufs.

This change is almost entirely mechanical, except that we
change mb_alloc_ext_pgs() to no longer allow allocating
pkthdrs, the change to avoid ext_arg2 as mentioned above,
and the removal of the ext_pgs zone,

This change saves roughly 2% "raw" CPU (~59% -> 57%), or over
3% "scaled" CPU on a Netflix 100% software kTLS workload at
90+ Gb/s on Broadwell Xeons.

In a follow-on commit, I plan to remove some hacks to avoid
access ext_pgs fields of mbufs, since they will now be in
cache.

Many thanks to glebius for helping to make this better in
the Netflix tree.

Reviewed by:	hselasky, jhb, rrs, glebius (early version)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D24213
2020-04-14 14:46:06 +00:00
Alexander V. Chernikov
6722086045 Plug netmask NULL check during route addition causing kernel panic.
This bug was introduced by the r359823.

Reported by:	hselasky
2020-04-14 13:12:22 +00:00
Kristof Provost
1d126e9b94 carp: Widen epoch coverage
Fix panics related to calling code which expects to be running inside
the NET_EPOCH from outside that epoch.

This leads to panics (with INVARIANTS) such as this one:

    panic: Assertion in_epoch(net_epoch_preempt) failed at /usr/src/sys/netinet/if_ether.c:373
    cpuid = 7
    time = 1586095719
    KDB: stack backtrace:
    db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0090819700
    vpanic() at vpanic+0x182/frame 0xfffffe0090819750
    panic() at panic+0x43/frame 0xfffffe00908197b0
    arprequest_internal() at arprequest_internal+0x59e/frame 0xfffffe00908198c0
    arp_announce_ifaddr() at arp_announce_ifaddr+0x20/frame 0xfffffe00908198e0
    carp_master_down_locked() at carp_master_down_locked+0x10d/frame 0xfffffe0090819910
    carp_master_down() at carp_master_down+0x79/frame 0xfffffe0090819940
    softclock_call_cc() at softclock_call_cc+0x13f/frame 0xfffffe00908199f0
    softclock() at softclock+0x7c/frame 0xfffffe0090819a20
    ithread_loop() at ithread_loop+0x279/frame 0xfffffe0090819ab0
    fork_exit() at fork_exit+0x80/frame 0xfffffe0090819af0
    fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0090819af0
    --- trap 0, rip = 0, rsp = 0, rbp = 0 ---

Widen the NET_EPOCH to cover the relevant (callback / task) code.

Differential Revision:	https://reviews.freebsd.org/D24302
2020-04-12 16:09:21 +00:00
Alexander V. Chernikov
a666325282 Introduce nexthop objects and new routing KPI.
This is the foundational change for the routing subsytem rearchitecture.
 More details and goals are available in https://reviews.freebsd.org/D24141 .

This patch introduces concept of nexthop objects and new nexthop-based
 routing KPI.

Nexthops are objects, containing all necessary information for performing
 the packet output decision. Output interface, mtu, flags, gw address goes
 there. For most of the cases, these objects will serve the same role as
 the struct rtentry is currently serving.
Typically there will be low tens of such objects for the router even with
 multiple BGP full-views, as these objects will be shared between routing
 entries. This allows to store more information in the nexthop.

New KPI:

struct nhop_object *fib4_lookup(uint32_t fibnum, struct in_addr dst,
  uint32_t scopeid, uint32_t flags, uint32_t flowid);
struct nhop_object *fib6_lookup(uint32_t fibnum, const struct in6_addr *dst6,
  uint32_t scopeid, uint32_t flags, uint32_t flowid);

These 2 function are intended to replace all all flavours of
 <in_|in6_>rtalloc[1]<_ign><_fib>, mpath functions  and the previous
 fib[46]-generation functions.

Upon successful lookup, they return nexthop object which is guaranteed to
 exist within current NET_EPOCH. If longer lifetime is desired, one can
 specify NHR_REF as a flag and get a referenced version of the nexthop.
 Reference semantic closely resembles rtentry one, allowing sed-style conversion.

Additionally, another 2 functions are introduced to support uRPF functionality
 inside variety of our firewalls. Their primary goal is to hide the multipath
 implementation details inside the routing subsystem, greatly simplifying
 firewalls implementation:

int fib4_lookup_urpf(uint32_t fibnum, struct in_addr dst, uint32_t scopeid,
  uint32_t flags, const struct ifnet *src_if);
int fib6_lookup_urpf(uint32_t fibnum, const struct in6_addr *dst6, uint32_t scopeid,
  uint32_t flags, const struct ifnet *src_if);

All functions have a separate scopeid argument, paving way to eliminating IPv6 scope
 embedding and allowing to support IPv4 link-locals in the future.

Structure changes:
 * rtentry gets new 'rt_nhop' pointer, slightly growing the overall size.
 * rib_head gets new 'rnh_preadd' callback pointer, slightly growing overall sz.

Old KPI:
During the transition state old and new KPI will coexists. As there are another 4-5
 decent-sized conversion patches, it will probably take a couple of weeks.
To support both KPIs, fields not required by the new KPI (most of rtentry) has to be
 kept, resulting in the temporary size increase.
Once conversion is finished, rtentry will notably shrink.

More details:
* architectural overview: https://reviews.freebsd.org/D24141
* list of the next changes: https://reviews.freebsd.org/D24232

Reviewed by:	ae,glebius(initial version)
Differential Revision:	https://reviews.freebsd.org/D24232
2020-04-12 14:30:00 +00:00
Michael Tuexen
07ddae2822 Revert https://svnweb.freebsd.org/changeset/base/359809
The intended change was
	sp->next.tqe_next = NULL;
	sp->next.tqe_prev = NULL;
which doesn't fix the issue I'm seeing and the committed fix is
not the intended fix due to copy-and-paste.

Thanks a lot to Conrad Meyer for making me aware of the problem.

Reported by:		cem
2020-04-12 09:31:36 +00:00
Michael Tuexen
9803dbb3ea Zero out pointers for consistency.
This was found by running syzkaller on an INVARIANTS kernel.

MFC after:		3 days
2020-04-11 20:36:54 +00:00
Alexander V. Chernikov
4684d3cbcb Remove per-AF radix_mpath initializtion functions.
Split their functionality by moving random seed allocation
 to SYSINIT and calling (new) generic multipath function from
 standard IPv4/IPv5 RIB init handlers.

Differential Revision:	https://reviews.freebsd.org/D24356
2020-04-11 07:37:08 +00:00
Warner Losh
28540ab153 Fix copyright year and eliminate the obsolete all rights reserved line.
Reviewed by: rrs@
2020-04-08 17:55:45 +00:00
Michael Tuexen
f4cb790a35 Do more argument validation under INVARIANTS when starting/stopping
an SCTP timer.

MFC after:		1 week
2020-04-06 13:58:13 +00:00
Alexander V. Chernikov
66bc03d415 Use interface fib for proxyarp checks.
Before the change, proxyarp checks for src and dst addresses
 were performed using default fib, breaking multi-fib scenario.

PR:		245181
Submitted by:	Scott Aitken (original version)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D24244
2020-04-02 20:06:37 +00:00
Michael Tuexen
413c3db101 Allow the TCP backhole detection to be disabled at all, enabled only
for IPv4, enabled only for IPv6, and enabled for IPv4 and IPv6.
The current blackhole detection might classify a temporary outage as
an MTU issue and reduces permanently the MSS. Since the consequences of
such a reduction due to a misclassification are much more drastically
for IPv4 than for IPv6, allow the administrator to enable it for IPv6 only.

Reviewed by:		bcr@ (man page), Richard Scheffenegger
Sponsored by:		Netflix, Inc.
Differential Revision:	https://reviews.freebsd.org/D24219
2020-03-31 15:54:54 +00:00
Mark Johnston
9b1d850be8 Remove the "config" taskqgroup and its KPIs.
Equivalent functionality is already provided by taskqueue(9), just use
that instead.

MFC after:	2 weeks
Sponsored by:	The FreeBSD Foundation
2020-03-30 14:24:03 +00:00
Michael Tuexen
9aca687811 Small cleanup by using a variable just assigned.
MFC after:		1 week
2020-03-28 22:35:04 +00:00
Michael Tuexen
25ec355353 Handle integer overflows correctly when converting msecs and secs to
ticks and vice versa.
These issues were caught by recently added panic() calls on INVARIANTS
systems.

Reported by:		syzbot+b44787b4be7096cd1590@syzkaller.appspotmail.com
Reported by:		syzbot+35f82d22805c1e899685@syzkaller.appspotmail.com
MFC after:		1 week
2020-03-28 20:25:45 +00:00
Ed Maste
c012cfe68a sys/netinet: remove spurious doubled ;s 2020-03-27 23:10:18 +00:00
Michael Tuexen
d5d190f2f9 Some more uint32_t cleanups, no functional change.
MFC after:		1 week
2020-03-27 21:48:52 +00:00
Michael Tuexen
239e5865df Use uint32_t where it is expected to be used. No functional change.
MFC after:		1 week
2020-03-27 11:08:11 +00:00
Michael Tuexen
7c63520c42 Remove an optimization, which was incorrect a couple of times and
therefore doesn't seem worth to be there.
In this case COOKIE where not retransmitted anymore, when the
socket was already closed.

MFC after:		1 week
2020-03-25 18:20:37 +00:00
Michael Tuexen
37686ccf08 Improve consistency in debug output.
MFC after:		1 week
2020-03-25 18:14:12 +00:00
Michael Tuexen
24187cfe72 Revert https://svnweb.freebsd.org/changeset/base/357829
This introduces a regression reported by koobs@ when running a pyhton
test suite on a loaded system.

This patch resulted in a failing accept() call, when the association
was setup and gracefully shutdown by the peer before accept was called.
So the following packetdrill script would fail:

+0.0 socket(..., SOCK_STREAM, IPPROTO_SCTP) = 3
+0.0 bind(3, ..., ...) = 0
+0.0 listen(3, 1) = 0
+0.0 < sctp: INIT[flgs=0, tag=1, a_rwnd=15000, os=1, is=1, tsn=1]
+0.0 > sctp: INIT_ACK[flgs=0, tag=2, a_rwnd=..., os=..., is=..., tsn=1, ...]
+0.1 < sctp: COOKIE_ECHO[flgs=0, len=..., val=...]
+0.0 > sctp: COOKIE_ACK[flgs=0]
+0.0 < sctp: DATA[flgs=BE, len=116, tsn=1, sid=0, ssn=0, ppid=0]
+0.0 > sctp: SACK[flgs=0, cum_tsn=1, a_rwnd=..., gaps=[], dups=[]]
+0.0 < sctp: SHUTDOWN[flgs=0, cum_tsn=0]
+0.0 > sctp: SHUTDOWN_ACK[flgs=0]
+0.0 < sctp: SHUTDOWN_COMPLETE[flgs=0]
+0.0 accept(3, ..., ...) = 4
+0.0 close(3) = 0
+0.0 recv(4, ..., 4096, 0) = 100
+0.0 recv(4, ..., 4096, 0) = 0
+0.0 close(4) = 0

Reported by:		koops@
2020-03-25 15:29:01 +00:00
Michael Tuexen
23e3c0880d Use consistent debug output.
MFC after:		1 week
2020-03-25 13:19:41 +00:00
Michael Tuexen
e056fafd92 Don't restore the vnet too early in error cases.
MFC after:		1 week
2020-03-25 13:18:37 +00:00
Michael Tuexen
7522682e5e Only call panic when building with INVARIANTS.
MFC after:		1 week
2020-03-24 23:04:07 +00:00
Michael Tuexen
a412576e36 Another cleanup of the timer code. Also be more pedantic about the
parameters of the timer start and stop routines. Several inconsistencies
have been fixed in earlier commits. Now they will be catched when running
an INVARIANTS system.

MFC after:	1 week
2020-03-24 22:44:36 +00:00
Michael Tuexen
d084818d9d Cleanup the file and add two ASSERT variants for locks, which will be
used shortly.

MFC after:		1 week
2020-03-23 12:17:13 +00:00
Michael Tuexen
a57fb68b92 More timer cleanups, no functional change.
MFC after:		1 week
2020-03-21 16:12:19 +00:00
Michael Tuexen
fa8ceba9ca Remove a set, but unused variable.
MFC after:		1 week
2020-03-20 14:49:44 +00:00
Michael Tuexen
2bdebd0ce3 A a missing NET_EPOCH_ENTER/NET_EPOCH_EXIT pair. This was affecting
implicit connection setups via sendmsg().

Reported by:		syzbot+febbe3383a0e9b700c1b@syzkaller.appspotmail.com
Reported by:		syzbot+dca98631455d790223ca@syzkaller.appspotmail.com
Reported by:		syzbot+5a71a7760d6bcf11b8cd@syzkaller.appspotmail.com
Reported by:		syzbot+da64217e140444c49f00@syzkaller.appspotmail.com
2020-03-19 23:07:52 +00:00
Michael Tuexen
6fb7b4fbdb Consistently provide arguments for timer start and stop routines.
This is another step in cleaning up timer handling.
MFC after:		1 week
2020-03-19 21:01:16 +00:00
Michael Tuexen
e95b3d7faf Cleanup the stream reset and asconf timer.
MFC after:		1 week
2020-03-19 18:55:54 +00:00
Michael Tuexen
42078d5ada The MTU candidates MUST be a multiple of 4, so make them so.
MFC after:		1 week
2020-03-19 14:37:28 +00:00
Michael Tuexen
0554e01d8b Handle the timers in a consistent sequence according to the definition
of the timer type.
Just a cleanup, no functional change intended.

MFC after:		1 week
2020-03-17 19:20:12 +00:00
Andrew Gallatin
ee7a9e506e Avoid a cache miss accessing an mbuf ext_pgs pointer when doing SW kTLS.
For a Netflix 90Gb/s 100% TLS software kTLS workload, this reduces
the CPI of tcp_m_copym() from ~3.5 to ~2.5 as reported by vtune.

Reviewed by:	jtl, rrs
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D23998
2020-03-16 14:03:27 +00:00