freebsd-dev

Author	SHA1	Message	Date
Alexander V. Chernikov	4043ee3cd7	Convert rtalloc_mpath_fib() users to the new KPI. New fib[46]_lookup() functions support multipath transparently. Given that, switch the last rtalloc_mpath_fib() calls to fib4_lookup() and eliminate the function itself. Note: proper flowid generation (especially for the outbound traffic) is a bigger topic and will be handled in a separate review. This change leaves flowid generation intact. Differential Revision: https://reviews.freebsd.org/D24595	2020-04-28 08:06:56 +00:00
Alexander V. Chernikov	1b0051bada	Eliminate now-unused parts of old routing KPI. r360292 switched most of the remaining routing customers to a new KPI, leaving a bunch of wrappers for old routing lookup functions unused. Remove them from the tree as a part of routing cleanup. Differential Revision: https://reviews.freebsd.org/D24569	2020-04-28 07:25:34 +00:00
John Baldwin	f1f9347546	Initial support for kernel offload of TLS receive. - Add a new TCP_RXTLS_ENABLE socket option to set the encryption and authentication algorithms and keys as well as the initial sequence number. - When reading from a socket using KTLS receive, applications must use recvmsg(). Each successful call to recvmsg() will return a single TLS record. A new TCP control message, TLS_GET_RECORD, will contain the TLS record header of the decrypted record. The regular message buffer passed to recvmsg() will receive the decrypted payload. This is similar to the interface used by Linux's KTLS RX except that Linux does not return the full TLS header in the control message. - Add plumbing to the TOE KTLS interface to request either transmit or receive KTLS sessions. - When a socket is using receive KTLS, redirect reads from soreceive_stream() into soreceive_generic(). - Note that this interface is currently only defined for TLS 1.1 and 1.2, though I believe we will be able to reuse the same interface and structures for 1.3.	2020-04-27 23:17:19 +00:00
John Baldwin	ec1db6e13d	Add the initial sequence number to the TLS enable socket option. This will be needed for KTLS RX. Reviewed by: gallatin Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D24451	2020-04-27 22:31:42 +00:00
Randall Stewart	e570d231f4	This change does a small preparatory step in getting the latest rack and bbr in from the NF repo. When those come in the OOB data handling will be fixed where syzkaller crashes. Differential Revision: https://reviews.freebsd.org/D24575	2020-04-27 16:30:29 +00:00
Alexander V. Chernikov	55f57ca9ac	Convert debugnet to the new routing KPI. Introduce new fib[46]_lookup_debugnet() functions serving as a special interface for the crash-time operations. Underlying implementation will try to return lookup result if data structures are not corrupted, avoiding locking. Convert debugnet to use fib4_lookup_debugnet() and switch it to use nexthops instead of rtentries. Reviewed by: cem Differential Revision: https://reviews.freebsd.org/D24555	2020-04-26 18:42:38 +00:00
Alexander V. Chernikov	17cb6ddba8	Fix order of arguments in fib[46]_lookup calls in SCTP. r360292 introduced the wrong order, resulting in returned nhops not being referenced, despite the fact that references were requested. That led to random GPF after using SCTP sockets. Specially defined macros like IPV[46]_SCOPE_GLOBAL will be introduced soon to reduce the chance of putting arguments in wrong order. Reported-by: syzbot+5c813c01096363174684@syzkaller.appspotmail.com	2020-04-26 13:02:42 +00:00
Alexander V. Chernikov	454d389645	Fix LINT build #2 after r360292. Pointyhat to: melifaro	2020-04-25 11:35:38 +00:00
Alexander V. Chernikov	ac99fd86d4	Fix LINT build broken by r360292.	2020-04-25 10:31:56 +00:00
Alexander V. Chernikov	983066f05b	Convert route caching to nexthop caching. This change is built on top of nexthop objects introduced in r359823. Nexthops are separate data structures, containing all necessary information to perform packet forwarding such as gateway interface and mtu. Nexthops are shared among the routes, providing more pre-computed cache-efficient data while requiring less memory. Splitting the LPM code and the attached data solves multiple long-standing problems in the routing layer, drastically reduces the coupling with other parts of the stack and allows faster lookup algorithms to be introduced transparently. Route caching was (re)introduced to minimise (slow) routing lookups, allowing for notably better performance for large TCP senders. Caching works by acquiring rtentry reference, which is protected by per-rtentry mutex. If the routing table is changed (checked by comparing the rtable generation id) or link goes down, cache record gets withdrawn. Nexthops have the same reference counting interface, backed by refcount(9). This change merely replaces rtentry with the actual forwarding nexthop as a cached object, which is mostly mechanical. Other moving parts like cache cleanup on rtable change remain the same. Differential Revision: https://reviews.freebsd.org/D24340	2020-04-25 09:06:11 +00:00
Alexander V. Chernikov	9e88f47c8f	Unbreak LINT-NOINET[6] builds broken in r360191. Reported by: np	2020-04-23 06:55:33 +00:00
Michael Tuexen	8262311cbe	Improve input validation when processing AUTH chunks. Thanks to Natalie Silvanovich from Google for finding and reporting the issue found by her in the SCTP userland stack. MFC after: 3 days X-MFC with: https://svnweb.freebsd.org/changeset/base/360193	2020-04-22 21:22:33 +00:00
Michael Tuexen	97feba891d	Improve input validation when processing AUTH chunks. Thanks to Natalie Silvanovich from Google for finding and reporting the issue found by her in the SCTP userland stack. MFC after: 3 days	2020-04-22 12:47:46 +00:00
Alexander V. Chernikov	8d6708ba80	Convert TOE routing lookups to the new routing KPI. Reviewed by: np Differential Revision: https://reviews.freebsd.org/D24388	2020-04-22 07:53:43 +00:00
Richard Scheffenegger	bb410f9ff2	revert rS360143 - Correctly set up initial cwnd due to syzkaller panics found Reported by: tuexen Approved by: tuexen (mentor) Sponsored by: NetApp, Inc.	2020-04-22 00:16:42 +00:00
Richard Scheffenegger	73b7696693	Correctly set up the initial TCP congestion window in all cases, by adjusting snd_una right after the connection initialization, to include the one byte in sequence space occupied by the SYN bit. This does not change the regular ACK processing, while making the BYTES_THIS_ACK macro work properly. PR: 235256 Reviewed by: tuexen (mentor), rgrimes (mentor) Approved by: tuexen (mentor), rgrimes (mentor) MFC after: 2 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D19000	2020-04-21 13:05:44 +00:00
Jonathan T. Looney	5d6e356cb0	Avoid calling protocol drain routines more than once per reclamation event. mb_reclaim() calls the protocol drain routines for each protocol in each domain. Some protocols exist in more than one domain and share drain routines. In the case of SCTP, it also uses the same drain routine for its SOCK_SEQPACKET and SOCK_STREAM entries in the same domain. On systems with INET, INET6, and SCTP all defined, mb_reclaim() calls sctp_drain() four times. On systems with INET and INET6 defined, mb_reclaim() calls tcp_drain() twice. mb_reclaim() is the only in-tree caller of the pr_drain protocol entry. Eliminate this duplication by ensuring that each pr_drain routine is only specified for one protocol entry in one domain. Reviewed by: tuexen MFC after: 2 weeks Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D24418	2020-04-16 20:17:24 +00:00
Alexander V. Chernikov	539642a29d	Add nhop parameter to rti_filter callback. One of the goals of the new routing KPI defined in r359823 is to entirely hide `struct rtentry` from the consumers. It will allow improving routing subsystem internals and delivering more features much faster. This change is one of the ongoing changes to eliminate direct struct rtentry field accesses. Additionally, with the followup multipath changes, single rtentry can point to multiple nexthops. With that in mind, convert rti_filter callback used when traversing the routing table to accept pair (rt, nhop) instead of nexthop. Reviewed by: ae Differential Revision: https://reviews.freebsd.org/D24440	2020-04-16 17:20:18 +00:00
Richard Scheffenegger	d7ca3f780d	Reduce default TCP delayed ACK timeout to 40ms. Reviewed by: kbowling, tuexen Approved by: tuexen (mentor) MFC after: 2 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D23281	2020-04-16 15:59:23 +00:00
Alexander V. Chernikov	9ac7c6cfed	Convert IP/IPv6 forwarding, ICMP processing and IP PCB laddr selection to the new routing KPI. Reviewed by: ae Differential Revision: https://reviews.freebsd.org/D24245	2020-04-14 23:06:25 +00:00
Michael Tuexen	b89af8e16d	Improve the TCP blackhole detection. The principle is to reduce the MSS in two steps and try each candidate two times. However, if two candidates are the same (which is the case in TCP/IPv6), this candidate was tested four times. This patch ensures that each candidate actually reduces the MSS and is only tested 2 times. This reduces the time window of misclassifying a temporary outage as an MTU issue. Reviewed by: jtl MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D24308	2020-04-14 16:35:05 +00:00
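The candidate-skipping rule described in the blackhole detection commit (b89af8e16d) can be sketched in C roughly as follows. This is a minimal illustration and not the code from D24308: the function name `next_lower_mss()` is hypothetical, and a return value of 0 is an assumed convention meaning "no further candidate".

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: return the first candidate that is strictly
 * smaller than the current MSS, or 0 if no candidate would reduce it.
 * A candidate equal to the current MSS is skipped, so it is never
 * retried as if it were a new step.
 */
uint32_t
next_lower_mss(uint32_t cur_mss, const uint32_t *cand, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (cand[i] < cur_mss)
			return (cand[i]);
	}
	return (0);
}
```

With two equal candidates, as the commit says is the case for TCP/IPv6, the second step finds nothing lower than the first, so the shared value is only tried for one step instead of two.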
Andrew Gallatin	23feb56348	KTLS: Re-work unmapped mbufs to carry ext_pgs in the mbuf itself. While the original implementation of unmapped mbufs was a large step forward in terms of reducing cache misses by enabling mbufs to carry more than a single page for sendfile, they are rather cache unfriendly when accessing the ext_pgs metadata and data. This is because the ext_pgs part of the mbuf is allocated separately, and almost guaranteed to be cold in cache. This change takes advantage of the fact that unmapped mbufs are never used at the same time as pkthdr mbufs. Given this fact, we can overlap the ext_pgs metadata with the mbuf pkthdr, and carry the ext_pgs meta directly in the mbuf itself. Similarly, we can carry the ext_pgs data (TLS hdr/trailer/array of pages) directly after the existing m_ext. In order to be able to carry 5 pages (which is the minimum required for a 16K TLS record which is not perfectly aligned) on LP64, I've had to steal ext_arg2. The only user of this in the xmit path is sendfile, and I've adjusted it to use arg1 when using unmapped mbufs. This change is almost entirely mechanical, except that we change mb_alloc_ext_pgs() to no longer allow allocating pkthdrs, the change to avoid ext_arg2 as mentioned above, and the removal of the ext_pgs zone. This change saves roughly 2% "raw" CPU (~59% -> 57%), or over 3% "scaled" CPU on a Netflix 100% software kTLS workload at 90+ Gb/s on Broadwell Xeons. In a follow-on commit, I plan to remove some hacks to avoid accessing ext_pgs fields of mbufs, since they will now be in cache. Many thanks to glebius for helping to make this better in the Netflix tree. Reviewed by: hselasky, jhb, rrs, glebius (early version) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D24213	2020-04-14 14:46:06 +00:00
Alexander V. Chernikov	6722086045	Plug netmask NULL check during route addition causing kernel panic. This bug was introduced by the r359823. Reported by: hselasky	2020-04-14 13:12:22 +00:00
Kristof Provost	1d126e9b94	carp: Widen epoch coverage Fix panics related to calling code which expects to be running inside the NET_EPOCH from outside that epoch. This leads to panics (with INVARIANTS) such as this one: panic: Assertion in_epoch(net_epoch_preempt) failed at /usr/src/sys/netinet/if_ether.c:373 cpuid = 7 time = 1586095719 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0090819700 vpanic() at vpanic+0x182/frame 0xfffffe0090819750 panic() at panic+0x43/frame 0xfffffe00908197b0 arprequest_internal() at arprequest_internal+0x59e/frame 0xfffffe00908198c0 arp_announce_ifaddr() at arp_announce_ifaddr+0x20/frame 0xfffffe00908198e0 carp_master_down_locked() at carp_master_down_locked+0x10d/frame 0xfffffe0090819910 carp_master_down() at carp_master_down+0x79/frame 0xfffffe0090819940 softclock_call_cc() at softclock_call_cc+0x13f/frame 0xfffffe00908199f0 softclock() at softclock+0x7c/frame 0xfffffe0090819a20 ithread_loop() at ithread_loop+0x279/frame 0xfffffe0090819ab0 fork_exit() at fork_exit+0x80/frame 0xfffffe0090819af0 fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0090819af0 --- trap 0, rip = 0, rsp = 0, rbp = 0 --- Widen the NET_EPOCH to cover the relevant (callback / task) code. Differential Revision: https://reviews.freebsd.org/D24302	2020-04-12 16:09:21 +00:00
Alexander V. Chernikov	a666325282	Introduce nexthop objects and new routing KPI. This is the foundational change for the routing subsystem rearchitecture. More details and goals are available in https://reviews.freebsd.org/D24141 . This patch introduces concept of nexthop objects and new nexthop-based routing KPI. Nexthops are objects, containing all necessary information for performing the packet output decision. Output interface, mtu, flags, gw address goes there. For most of the cases, these objects will serve the same role as the struct rtentry is currently serving. Typically there will be low tens of such objects for the router even with multiple BGP full-views, as these objects will be shared between routing entries. This allows to store more information in the nexthop. New KPI: struct nhop_object fib4_lookup(uint32_t fibnum, struct in_addr dst, uint32_t scopeid, uint32_t flags, uint32_t flowid); struct nhop_object fib6_lookup(uint32_t fibnum, const struct in6_addr dst6, uint32_t scopeid, uint32_t flags, uint32_t flowid); These 2 functions are intended to replace all flavours of <in_\|in6_>rtalloc[1]<_ign><_fib>, mpath functions and the previous fib[46]-generation functions. Upon successful lookup, they return nexthop object which is guaranteed to exist within current NET_EPOCH. If longer lifetime is desired, one can specify NHR_REF as a flag and get a referenced version of the nexthop. Reference semantic closely resembles rtentry one, allowing sed-style conversion. Additionally, another 2 functions are introduced to support uRPF functionality inside variety of our firewalls.
Their primary goal is to hide the multipath implementation details inside the routing subsystem, greatly simplifying firewalls implementation: int fib4_lookup_urpf(uint32_t fibnum, struct in_addr dst, uint32_t scopeid, uint32_t flags, const struct ifnet src_if); int fib6_lookup_urpf(uint32_t fibnum, const struct in6_addr dst6, uint32_t scopeid, uint32_t flags, const struct ifnet src_if); All functions have a separate scopeid argument, paving way to eliminating IPv6 scope embedding and allowing to support IPv4 link-locals in the future. Structure changes: * rtentry gets new 'rt_nhop' pointer, slightly growing the overall size. * rib_head gets new 'rnh_preadd' callback pointer, slightly growing overall sz. Old KPI: During the transition state old and new KPI will coexist. As there are another 4-5 decent-sized conversion patches, it will probably take a couple of weeks. To support both KPIs, fields not required by the new KPI (most of rtentry) have to be kept, resulting in the temporary size increase. Once conversion is finished, rtentry will notably shrink. More details: * architectural overview: https://reviews.freebsd.org/D24141 * list of the next changes: https://reviews.freebsd.org/D24232 Reviewed by: ae,glebius(initial version) Differential Revision: https://reviews.freebsd.org/D24232	2020-04-12 14:30:00 +00:00
Michael Tuexen	07ddae2822	Revert https://svnweb.freebsd.org/changeset/base/359809 The intended change was sp->next.tqe_next = NULL; sp->next.tqe_prev = NULL; which doesn't fix the issue I'm seeing and the committed fix is not the intended fix due to copy-and-paste. Thanks a lot to Conrad Meyer for making me aware of the problem. Reported by: cem	2020-04-12 09:31:36 +00:00
Michael Tuexen	9803dbb3ea	Zero out pointers for consistency. This was found by running syzkaller on an INVARIANTS kernel. MFC after: 3 days	2020-04-11 20:36:54 +00:00
Alexander V. Chernikov	4684d3cbcb	Remove per-AF radix_mpath initialization functions. Split their functionality by moving random seed allocation to SYSINIT and calling (new) generic multipath function from standard IPv4/IPv6 RIB init handlers. Differential Revision: https://reviews.freebsd.org/D24356	2020-04-11 07:37:08 +00:00
Warner Losh	28540ab153	Fix copyright year and eliminate the obsolete all rights reserved line. Reviewed by: rrs@	2020-04-08 17:55:45 +00:00
Michael Tuexen	f4cb790a35	Do more argument validation under INVARIANTS when starting/stopping an SCTP timer. MFC after: 1 week	2020-04-06 13:58:13 +00:00
Alexander V. Chernikov	66bc03d415	Use interface fib for proxyarp checks. Before the change, proxyarp checks for src and dst addresses were performed using default fib, breaking multi-fib scenario. PR: 245181 Submitted by: Scott Aitken (original version) MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D24244	2020-04-02 20:06:37 +00:00
Michael Tuexen	413c3db101	Allow the TCP blackhole detection to be disabled completely, enabled only for IPv4, enabled only for IPv6, or enabled for both IPv4 and IPv6. The current blackhole detection might classify a temporary outage as an MTU issue and permanently reduce the MSS. Since the consequences of such a reduction due to a misclassification are much more drastic for IPv4 than for IPv6, allow the administrator to enable it for IPv6 only. Reviewed by: bcr@ (man page), Richard Scheffenegger Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D24219	2020-03-31 15:54:54 +00:00
Mark Johnston	9b1d850be8	Remove the "config" taskqgroup and its KPIs. Equivalent functionality is already provided by taskqueue(9), just use that instead. MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2020-03-30 14:24:03 +00:00
Michael Tuexen	9aca687811	Small cleanup by using a variable just assigned. MFC after: 1 week	2020-03-28 22:35:04 +00:00
Michael Tuexen	25ec355353	Handle integer overflows correctly when converting msecs and secs to ticks and vice versa. These issues were caught by recently added panic() calls on INVARIANTS systems. Reported by: syzbot+b44787b4be7096cd1590@syzkaller.appspotmail.com Reported by: syzbot+35f82d22805c1e899685@syzkaller.appspotmail.com MFC after: 1 week	2020-03-28 20:25:45 +00:00
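One common way to avoid this class of overflow (commit 25ec355353) is to do the multiplication in a wider type and saturate the result. The sketch below illustrates that idea under stated assumptions: the function name `msecs_to_ticks_sat()`, the round-up behaviour, and the choice to saturate at `UINT32_MAX` are hypothetical and are not taken from the SCTP code in the commit.

```c
#include <stdint.h>

/*
 * Hypothetical helper: convert milliseconds to clock ticks at 'hz'
 * ticks per second, rounding up. The product is computed in 64 bits,
 * where msecs * hz + 999 cannot overflow for 32-bit inputs, and the
 * result is clamped to UINT32_MAX instead of wrapping around.
 */
uint32_t
msecs_to_ticks_sat(uint32_t msecs, uint32_t hz)
{
	uint64_t ticks;

	ticks = ((uint64_t)msecs * hz + 999) / 1000;
	if (ticks > UINT32_MAX)
		return (UINT32_MAX);
	return ((uint32_t)ticks);
}
```

A plain 32-bit `msecs * hz` would wrap once the product exceeds `UINT32_MAX`, which at hz = 1000 already happens after roughly 4295 seconds' worth of milliseconds.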
Ed Maste	c012cfe68a	sys/netinet: remove spurious doubled ;s	2020-03-27 23:10:18 +00:00
Michael Tuexen	d5d190f2f9	Some more uint32_t cleanups, no functional change. MFC after: 1 week	2020-03-27 21:48:52 +00:00
Michael Tuexen	239e5865df	Use uint32_t where it is expected to be used. No functional change. MFC after: 1 week	2020-03-27 11:08:11 +00:00
Michael Tuexen	7c63520c42	Remove an optimization, which was incorrect a couple of times and therefore doesn't seem worth keeping. In this case COOKIE chunks were not retransmitted anymore, when the socket was already closed. MFC after: 1 week	2020-03-25 18:20:37 +00:00
Michael Tuexen	37686ccf08	Improve consistency in debug output. MFC after: 1 week	2020-03-25 18:14:12 +00:00
Michael Tuexen	24187cfe72	Revert https://svnweb.freebsd.org/changeset/base/357829 This introduces a regression reported by koobs@ when running a python test suite on a loaded system. This patch resulted in a failing accept() call, when the association was setup and gracefully shutdown by the peer before accept was called. So the following packetdrill script would fail: +0.0 socket(..., SOCK_STREAM, IPPROTO_SCTP) = 3 +0.0 bind(3, ..., ...) = 0 +0.0 listen(3, 1) = 0 +0.0 < sctp: INIT[flgs=0, tag=1, a_rwnd=15000, os=1, is=1, tsn=1] +0.0 > sctp: INIT_ACK[flgs=0, tag=2, a_rwnd=..., os=..., is=..., tsn=1, ...] +0.1 < sctp: COOKIE_ECHO[flgs=0, len=..., val=...] +0.0 > sctp: COOKIE_ACK[flgs=0] +0.0 < sctp: DATA[flgs=BE, len=116, tsn=1, sid=0, ssn=0, ppid=0] +0.0 > sctp: SACK[flgs=0, cum_tsn=1, a_rwnd=..., gaps=[], dups=[]] +0.0 < sctp: SHUTDOWN[flgs=0, cum_tsn=0] +0.0 > sctp: SHUTDOWN_ACK[flgs=0] +0.0 < sctp: SHUTDOWN_COMPLETE[flgs=0] +0.0 accept(3, ..., ...) = 4 +0.0 close(3) = 0 +0.0 recv(4, ..., 4096, 0) = 100 +0.0 recv(4, ..., 4096, 0) = 0 +0.0 close(4) = 0 Reported by: koobs@	2020-03-25 15:29:01 +00:00
Michael Tuexen	23e3c0880d	Use consistent debug output. MFC after: 1 week	2020-03-25 13:19:41 +00:00
Michael Tuexen	e056fafd92	Don't restore the vnet too early in error cases. MFC after: 1 week	2020-03-25 13:18:37 +00:00
Michael Tuexen	7522682e5e	Only call panic when building with INVARIANTS. MFC after: 1 week	2020-03-24 23:04:07 +00:00
Michael Tuexen	a412576e36	Another cleanup of the timer code. Also be more pedantic about the parameters of the timer start and stop routines. Several inconsistencies have been fixed in earlier commits. Now they will be caught when running an INVARIANTS system. MFC after: 1 week	2020-03-24 22:44:36 +00:00
Michael Tuexen	d084818d9d	Cleanup the file and add two ASSERT variants for locks, which will be used shortly. MFC after: 1 week	2020-03-23 12:17:13 +00:00
Michael Tuexen	a57fb68b92	More timer cleanups, no functional change. MFC after: 1 week	2020-03-21 16:12:19 +00:00
Michael Tuexen	fa8ceba9ca	Remove a set, but unused variable. MFC after: 1 week	2020-03-20 14:49:44 +00:00
Michael Tuexen	2bdebd0ce3	Add a missing NET_EPOCH_ENTER/NET_EPOCH_EXIT pair. This was affecting implicit connection setups via sendmsg(). Reported by: syzbot+febbe3383a0e9b700c1b@syzkaller.appspotmail.com Reported by: syzbot+dca98631455d790223ca@syzkaller.appspotmail.com Reported by: syzbot+5a71a7760d6bcf11b8cd@syzkaller.appspotmail.com Reported by: syzbot+da64217e140444c49f00@syzkaller.appspotmail.com	2020-03-19 23:07:52 +00:00
Michael Tuexen	6fb7b4fbdb	Consistently provide arguments for timer start and stop routines. This is another step in cleaning up timer handling. MFC after: 1 week	2020-03-19 21:01:16 +00:00
Michael Tuexen	e95b3d7faf	Cleanup the stream reset and asconf timer. MFC after: 1 week	2020-03-19 18:55:54 +00:00
Michael Tuexen	42078d5ada	The MTU candidates MUST be a multiple of 4, so make them so. MFC after: 1 week	2020-03-19 14:37:28 +00:00
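Making a value a multiple of 4 (commit 42078d5ada) is typically done by clearing the two low-order bits. The sketch below shows that technique only; the function name `mtu_round_down4()` is hypothetical, and whether the commit rounds at run time or simply edits the candidate table is not stated in the message.

```c
#include <stdint.h>

/*
 * Hypothetical helper: round an MTU candidate down to the nearest
 * multiple of 4 by masking off the two least significant bits.
 */
uint32_t
mtu_round_down4(uint32_t mtu)
{
	return (mtu & ~(uint32_t)3);
}
```

For example, a candidate of 1499 would become 1496, while 1500 is already a multiple of 4 and stays unchanged.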
Michael Tuexen	0554e01d8b	Handle the timers in a consistent sequence according to the definition of the timer type. Just a cleanup, no functional change intended. MFC after: 1 week	2020-03-17 19:20:12 +00:00
Andrew Gallatin	ee7a9e506e	Avoid a cache miss accessing an mbuf ext_pgs pointer when doing SW kTLS. For a Netflix 90Gb/s 100% TLS software kTLS workload, this reduces the CPI of tcp_m_copym() from ~3.5 to ~2.5 as reported by vtune. Reviewed by: jtl, rrs Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D23998	2020-03-16 14:03:27 +00:00
Michael Tuexen	7ca6e2963f	Use KMOD_TCPSTAT_INC instead of TCPSTAT_INC for RACK and BBR, since these are kernel modules. Also add a KMOD_TCPSTAT_ADD and use that instead of TCPSTAT_ADD. Reviewed by: jtl@, rrs@ MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D23904	2020-03-12 15:37:41 +00:00
Andrew Gallatin	98085bae8c	make lacp's use_numa hashing aware of send tags When I did the use_numa support, I missed the fact that there is a separate hash function for send tag nic selection. So when use_numa is enabled, ktls offload does not work properly, as it does not reliably allocate a send tag on the proper egress nic since different egress nics are selected for send-tag allocation and packet transmit. To fix this, this change: - refactors lacp_select_tx_port_by_hash() and lacp_select_tx_port() to make lacp_select_tx_port_by_hash() always called by lacp_select_tx_port() - pre-shifts flowids to convert them to hashes when calling lacp_select_tx_port_by_hash() - adds a numa_domain field to if_snd_tag_alloc_params - plumbs the numa domain into places where we allocate send tags In testing with NIC TLS setup on a NUMA machine, I see thousands of output errors before the change when enabling kern.ipc.tls.ifnet.permitted=1. After the change, I see no errors, and I see the NIC sysctl counters showing active TLS offload sessions. Reviewed by: rrs, hselasky, jhb Sponsored by: Netflix	2020-03-09 13:44:51 +00:00
Hiroki Sato	d726e6331b	Fix an issue of net.inet.igmp.stats handler. The header of (struct igmpstat) could be cleared by sysctl(3). This can be reproduced by "netstat -s -z -p igmp". PR: 244584 MFC after: 1 week	2020-03-07 08:41:10 +00:00
Michael Tuexen	9c04fdfd34	When using automatically generated flow labels and using TCP SYN cookies, use the same flow label for the segments sent during the handshake and after the handshake. This fixes a bug by making sure that sc_flowlabel is always stored in network byte order. Reviewed by: bz@ MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D23957	2020-03-04 16:41:25 +00:00
Bjoern A. Zeeb	d2b8fd0da1	Add new ICMPv6 counters for Anti-DoS limits. Add four new counters for ND6 related Anti-DoS measures. We split these out into a separate upfront commit so that we only change the struct size one time. Implementations using them will follow. PR: 157410 Reviewed by: melifaro MFC after: 2 weeks X-MFC: cannot really MFC this without breaking netstat Sponsored by: Netflix (initially) Differential Revision: https://reviews.freebsd.org/D22711	2020-03-04 16:20:59 +00:00
Michael Tuexen	6605e5791f	Don't send an uninitialised traffic class in the IPv6 header, when sending a TCP segment from the TCP SYN cache (like a SYN-ACK). This fix initialises it to zero. This is correct for the ECN bits, but it does not honor the DSCP that an application might have set via the IPPROTO_IPV6 level socket option IPV6_TCLASS. That will be fixed separately. Reviewed by: Richard Scheffenegger MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D23900	2020-03-04 12:22:53 +00:00
Bjoern A. Zeeb	4e1a3ff884	tcp_hpts: make RSS kernel compile again. Add proper #includes, and #ifdefs and some style fixes to make RSS kernels compile again. There are still possible issues with uint16_t vs. uint_t cpuid which I am not going near. Reviewed by: gallatin Differential Revision: https://reviews.freebsd.org/D23726	2020-03-03 14:15:30 +00:00
Michael Tuexen	7e1e491f60	Remove stale definitions. The removed definitions are not used right now and are incompatible with the correct ones in RFC 3168. Submitted by: Richard Scheffenegger Differential Revision: https://reviews.freebsd.org/D23903	2020-03-01 12:34:27 +00:00
Pawel Biernacki	7029da5c36	Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many) r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are still not MPSAFE (or already are but aren’t properly marked). Use it in preparation for a general review of all nodes. This is a non-functional change that adds annotations to SYSCTL_NODE and SYSCTL_PROC nodes using one of the soon-to-be-required flags. Mark all obvious cases as MPSAFE. All entries that haven't been marked as MPSAFE before are by default marked as NEEDGIANT. Approved by: kib (mentor, blanket) Commented by: kib, gallatin, melifaro Differential Revision: https://reviews.freebsd.org/D23718	2020-02-26 14:26:36 +00:00
Randall Stewart	d7313dc6f5	This commit expands tcp_ratelimit to be able to handle cards like the mlx-c5 and c6 that require a "setup" routine before the tcp_ratelimit code can declare and use a rate. I add the setup routine to if_var as well as fix tcp_ratelimit to call it. I also revisit the rates so that in the case of a mlx card of type c5/6 we will use about 100 rates concentrated in the range where the most gain can be had (1-200Mbps). Note that I have tested these on a c5 and they work and perform well. In fact in an unloaded system they pace right to the correct rate (great job mlx!). There will be a further commit here from Hans that will add the respective changes to the mlx driver to support this work (which I was testing with). Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D23647	2020-02-26 13:48:33 +00:00
Pawel Biernacki	295a18d184	Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (14 of many) r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are still not MPSAFE (or already are but aren’t properly marked). Use it in preparation for a general review of all nodes. This is a non-functional change that adds annotations to SYSCTL_NODE and SYSCTL_PROC nodes using one of the soon-to-be-required flags. Approved by: kib (mentor, blanket) Differential Revision: https://reviews.freebsd.org/D23639	2020-02-24 10:47:18 +00:00
Pawel Biernacki	10b49b2302	Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (6 of many) r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are still not MPSAFE (or already are but aren’t properly marked). Use it in preparation for a general review of all nodes. Mark all nodes in pf, pfsync and carp as MPSAFE. Reviewed by: kp Approved by: kib (mentor, blanket) Differential Revision: https://reviews.freebsd.org/D23634	2020-02-21 16:23:00 +00:00
Michael Tuexen	64f29eb1df	Remove an unused timer type. MFC after: 1 week	2020-02-20 15:37:44 +00:00
Michael Tuexen	868b51f234	Epochify SCTP.	2020-02-18 21:25:17 +00:00
Michael Tuexen	ba0d525006	Remove unused function.	2020-02-18 19:41:55 +00:00
Michael Tuexen	a610bb2120	Fix the non-default stream schedulers such that they do not interleave user messages when it is not allowed. Thanks to Christian Wright for reporting the issue for the userland stack and providing a fix for the priority scheduler. MFC after: 1 week	2020-02-17 18:05:03 +00:00
Michael Tuexen	6b8fba3c5c	Don't use uninitialised stack memory if the sysctl variable net.inet.tcp.hostcache.enable is set to 0. The bug resulted in using possibly a too small MSS value or wrong initial retransmission timer settings. Possibly the value used for ssthresh was also wrong. Submitted by: Richard Scheffenegger Reviewed by: Cheng Cui, rgrimes@, tuexen@ MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D23687	2020-02-17 14:54:21 +00:00
Hans Petter Selasky	bacb11c9ed	Fix kernel panic while trying to read multicast stream. When VIMAGE is enabled make sure the "m_pkthdr.rcvif" pointer is set for all mbufs being input by the IGMP/MLD6 code. Else there will be a NULL-pointer dereference in the netisr code when trying to set the VNET based on the incoming mbuf. Add an assert to catch this when queueing mbufs on a netisr to make debugging of similar cases easier. Found by: Vladislav V. Prodan PR: 244002 Reviewed by: bz@ MFC after: 1 week Sponsored by: Mellanox Technologies	2020-02-17 09:46:32 +00:00
Mateusz Guzik	6b25673f3f	sctp: use new capsicum helpers	2020-02-15 01:29:40 +00:00
Michael Tuexen	a357466592	sack_newdata and snd_recover hold the same value. Therefore, use only a single instance: use snd_recover also where sack_newdata was used. Submitted by: Richard Scheffenegger Differential Revision: https://reviews.freebsd.org/D18811	2020-02-13 15:14:46 +00:00
Michael Tuexen	33f8cfdfe4	Whitespace cleanup. No functional change. Sponsored by: Netflix, Inc.	2020-02-13 13:58:34 +00:00
Michael Tuexen	56ccb48fd6	Don't panic under INVARIANTS when we can't allocate memory for storing a vtag in time wait. This issue was found by running syzkaller. MFC after: 1 week	2020-02-12 17:05:10 +00:00
Michael Tuexen	ca3de626ec	Mark the socket as disconnected when freeing the association the first time. This issue was found by running syzkaller. MFC after: 1 week	2020-02-12 17:02:15 +00:00
Randall Stewart	348404bce1	Let's get the real correct version.. geesh. I need more coffee evidently. Sponsored by: Netflix	2020-02-12 15:26:56 +00:00
Randall Stewart	b8f8a6b719	Oops, committed the wrong ratelimit version in the whitespace cleanup.. Restore it to the proper version. Sponsored by: Netflix Inc.	2020-02-12 13:37:53 +00:00
Randall Stewart	481be5de9d	White space cleanup -- remove trailing tabs or spaces from any line. Sponsored by: Netflix Inc.	2020-02-12 13:31:36 +00:00
Randall Stewart	df341f5986	Whitespace, remove from three files trailing white space (leftover presents from emacs). Sponsored by: Netflix Inc.	2020-02-12 13:07:09 +00:00
Randall Stewart	596ae436ef	This small fix makes it so we properly follow the RFC and only enable ECN when both the CWR and ECT bits are set within the SYN packet. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D23645	2020-02-12 13:04:19 +00:00
Randall Stewart	3fba40d9f2	Remove all trailing white space from the BBR/Rack fold. Bits left around by emacs (thanks emacs).	2020-02-12 12:40:06 +00:00
Randall Stewart	d2517ab04b	Now that all of the stats framework is in FreeBSD, the bits that disabled stats when netflix-stats is not defined are no longer needed. Let's remove these bits so that we will properly use stats per its definition in BBR and Rack. Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D23088	2020-02-12 12:36:55 +00:00
Michael Tuexen	8803350d6d	Revert https://svnweb.freebsd.org/changeset/base/357761 This was suggested by cem@	2020-02-11 20:02:20 +00:00
Michael Tuexen	9803f01cdb	Don't start an SCTP timer using a net, which has been removed. Submitted by: Taylor Brandstetter MFC after: 1 week	2020-02-11 18:15:57 +00:00
Michael Tuexen	95d27478d2	Use an int instead of a bool variable, since bool is not supported on all platforms the stack is running on in userland.	2020-02-11 14:00:27 +00:00
Michael Tuexen	6a34ec63ab	Stop the PMTU and HB timer when removing a net, not when freeing it. Submitted by: Taylor Brandstetter MFC after: 1 week	2020-02-09 22:40:05 +00:00
Michael Tuexen	5555400aa5	Cleanup timer handling. Submitted by: Taylor Brandstetter MFC after: 1 week	2020-02-09 22:05:41 +00:00
Ed Maste	5aa0576b33	Miscellaneous typo fixes Submitted by: Gordon Bergling <gbergling_gmail.com> Differential Revision: https://reviews.freebsd.org/D23453	2020-02-07 19:53:07 +00:00
Michael Tuexen	f799ff82fb	Remove unused timer. Submitted by: Taylor Brandstetter	2020-02-04 14:01:07 +00:00
Michael Tuexen	bbf9f080e9	Improve numbering of debug information. Submitted by: Taylor Brandstetter MFC after: 1 week	2020-02-04 12:34:16 +00:00
Conrad Meyer	8e6b06be14	netinet/libalias: Fix typo in debug message No functional change. PR: 243831 Submitted by: Neel Chauhan <neel AT neelc DOT org> Differential Revision: https://reviews.freebsd.org/D23365	2020-02-03 05:19:44 +00:00
Gleb Smirnoff	42ce79378d	Fix missing NET_EPOCH_ENTER() when compiled with TCP_OFFLOAD. Reported by: Coverity CID: 1413162	2020-01-29 22:48:18 +00:00
Michael Tuexen	dc13edbc7d	Fix build issues for the userland stack on 32-bit platforms. Reported by: Felix Weinrank MFC after: 1 week	2020-01-28 10:09:05 +00:00
Alexander V. Chernikov	75831a1c95	Fix NOINET6 build after r357038. Reported by: AN <andy at neu.net>	2020-01-26 11:54:21 +00:00
Michael Tuexen	9cc711c9ff	Sending CWR after an RTO is according to RFC 3168 generally required and not only for the DCTCP congestion control. Submitted by: Richard Scheffenegger Reviewed by: rgrimes, tuexen@, Cheng Cui MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D23119	2020-01-25 13:45:10 +00:00
Michael Tuexen	47e2c17c12	Don't set the ECT codepoint on retransmitted packets during SACK loss recovery. This is required by RFC 3168. Submitted by: Richard Scheffenegger Reviewed by: rgrimes@, tuexen@, Cheng Cui MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D23118	2020-01-25 13:34:29 +00:00
Michael Tuexen	a2d59694be	As a TCP client only enable ECN when the corresponding sysctl variable indicates that ECN should be negotiated for the client side. Submitted by: Richard Scheffenegger Reviewed by: rgrimes@, tuexen@ MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D23228	2020-01-25 13:11:14 +00:00
Michael Tuexen	ee97681e5c	Don't delay the ACK for a TCP segment with the CWR flag set. This allows the data sender to increase the CWND faster. Submitted by: Richard Scheffenegger Reviewed by: rgrimes@, tuexen@, Cheng Cui MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D22670	2020-01-24 22:50:23 +00:00
Michael Tuexen	8f63a52bdb	The server side of TCP fast open relies on the delayed ACK timer to allow including user data in the SYN-ACK. When DSACK support was added in r347382, an immediate ACK was sent even for the received SYN with user data. This patch fixes that and allows again to send user data with the SYN-ACK. Reported by: Jeremy Harris Reviewed by: Richard Scheffenegger, rrs@ MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D23212	2020-01-24 22:37:53 +00:00
Gleb Smirnoff	e1d2b46953	Enter the network epoch when rack_output() is called in setsockopt(2).	2020-01-24 21:56:10 +00:00
Alexander V. Chernikov	75b893375f	Add support for RFC 6598/Carrier Grade NAT subnets to libalias and ipfw. In libalias, a new flag PKT_ALIAS_UNREGISTERED_RFC6598 is added. This is like PKT_ALIAS_UNREGISTERED_ONLY, but also is RFC 6598 aware. Also, we add a new NAT option to ipfw called unreg_cgn, which is like unreg_only, but also is RFC 6598-aware. The reason for the new flags/options is to avoid breaking existing networks, especially those which rely on RFC 6598 as an external address. Submitted by: Neel Chauhan <neel AT neelc DOT org> MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D22877	2020-01-24 20:35:41 +00:00
Alexander V. Chernikov	ab15488f12	Bring indentation back to normal after r357038. No functional changes. MFC after: 3 weeks	2020-01-23 09:46:45 +00:00
Alexander V. Chernikov	5533ec4806	Fix epoch-related panic in ipdivert, ensuring in_broadcast() is called within epoch. Simplify gigantic div_output() by splitting it into 3 functions, handling preliminary setup, remote "ip[6]_output" case and local "netisr" case. Leave original indenting in most parts to ease diff comparison. Indentation will be fixed by a followup commit. Reported by: Nick Hibma <nick at van-laarhoven.org> Reviewed by: glebius Differential Revision: https://reviews.freebsd.org/D23317	2020-01-23 09:14:28 +00:00
Gleb Smirnoff	a3b0db5b0a	Plug possible calls into ip6?_output() without network epoch from SCTP bluntly adding epoch entrance into the macro that SCTP uses to call ip6?_output(). This definitely will introduce several epoch recursions. Reported by: https://syzkaller.appspot.com/bug?id=79f03f574594a5be464997310896765c458ed80a Reported by: https://syzkaller.appspot.com/bug?id=07c6f52106cddbe356cc2b2f3664a1c51cc0dadf	2020-01-22 17:19:53 +00:00
Bjoern A. Zeeb	7754e281c0	Fix NOINET kernels after r356983. All gotos to the label are within the #ifdef INET section, which leaves us with an unused label. Cover the label under #ifdef INET as well to avoid the warning and compile time error.	2020-01-22 15:06:59 +00:00
Alexander V. Chernikov	34a5582c47	Bring back redirect route expiration. Redirect (and temporal) route expiration was broken a while ago. This change brings route expiration back, with unified IPv4/IPv6 handling code. It introduces net.inet.icmp.redirtimeout sysctl, allowing to set an expiration time for redirected routes. It defaults to 10 minutes, analogous to net.inet6.icmp6.redirtimeout. Implementation uses separate file, route_temporal.c, as route.c is already bloated with tons of different functions. Internally, expiration is implemented as a per-rnh callout scheduled when route with non-zero rt_expire time is added or rt_expire is changed. It does not add any overhead when no temporal routes are present. Callout traverses entire routing tree under wlock, scheduling expired routes for deletion and calculating the next time it needs to be run. The rationale for such implementation is the following: typically workloads requiring large amount of routes have redirects turned off already, while the systems with small amount of routes will not incur large overhead during tree traversal. This change also fixes netstat -rn display of route expiration time, which has been broken since the conversion from kread() to sysctl. Reviewed by: bz MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D23075	2020-01-22 13:53:18 +00:00
Gleb Smirnoff	c1604fe4d2	Make in_pcbladdr() require network epoch entered by its callers. Together with this widen network epoch coverage up to tcp_connect() and udp_connect(). Revisions from r356974 and up to this revision cover D23187. Differential Revision: https://reviews.freebsd.org/D23187	2020-01-22 06:10:41 +00:00
Gleb Smirnoff	e2636f0a78	Remove extraneous NET_EPOCH_ASSERT - the full function is covered.	2020-01-22 06:07:27 +00:00
Gleb Smirnoff	3fed74e90f	Re-absorb tcp_detach() back into tcp_usr_detach() as the comment suggests. Not a functional change.	2020-01-22 06:06:27 +00:00
Gleb Smirnoff	5fc8df3c49	Don't enter network epoch in tcp_usr_detach. A PCB removal doesn't require that.	2020-01-22 06:04:56 +00:00
Gleb Smirnoff	5c722e2ad3	The network epoch changes in the TCP stack combined with old r286227, actually make removal of a PCB not needing ipi_lock in any form. The ipi_list_lock is sufficient.	2020-01-22 06:03:45 +00:00
Gleb Smirnoff	7669c586da	tcp_usr_attach() doesn't need network epoch. in_pcbfree() and in_pcbdetach() perform all necessary synchronization themselves.	2020-01-22 06:01:26 +00:00
Gleb Smirnoff	6a2954a17d	Relax locking requirements for in_pcballoc(). All pcbinfo fields modified by this function are protected by the PCB list lock that is acquired inside the function. This could have been done even before epoch changes, after r286227.	2020-01-22 05:58:29 +00:00
Gleb Smirnoff	0f6385e705	Inline tcp_attach() into tcp_usr_attach(). Not a functional change.	2020-01-22 05:54:58 +00:00
Gleb Smirnoff	109eb549e1	Make tcp_output() require network epoch. Enter the epoch before calling into tcp_output() from those functions, that didn't do that before. This eliminates a bunch of epoch recursions in TCP.	2020-01-22 05:53:16 +00:00
Gleb Smirnoff	b955545386	Make ip6_output() and ip_output() require network epoch. All callers that previously may have called into these functions without network epoch now must enter it.	2020-01-22 05:51:22 +00:00
Gleb Smirnoff	0452a1f3ef	Add documenting NET_EPOCH_ASSERT() to tcp_drop().	2020-01-22 02:38:46 +00:00
Gleb Smirnoff	bab98355f9	Add some documenting NET_EPOCH_ASSERTs.	2020-01-22 02:37:47 +00:00
Michael Tuexen	6745815d25	Remove debug code not needed anymore. Submitted by: Richard Scheffenegger Reviewed by: tuexen@ MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D23208	2020-01-16 17:15:06 +00:00
Gleb Smirnoff	ed0282f46a	A miss from r356754.	2020-01-15 06:12:39 +00:00
Gleb Smirnoff	2a4bd982d0	Introduce NET_EPOCH_CALL() macro and use it everywhere where we free data based on the network epoch. The macro reverses the argument order of epoch_call(9) - first function, then its argument. NFC	2020-01-15 06:05:20 +00:00
Gleb Smirnoff	b1328235b4	Use official macro to enter/exit the network epoch. NFC	2020-01-15 05:48:36 +00:00
Gleb Smirnoff	97168be809	Mechanically substitute assertion of in_epoch(net_epoch_preempt) to NET_EPOCH_ASSERT(). NFC	2020-01-15 05:45:27 +00:00
Gleb Smirnoff	fae994f636	Stop header pollution and don't include if_var.h via in_pcb.h.	2020-01-15 03:41:15 +00:00
Gleb Smirnoff	8fd73e9160	Since this code dereferences struct ifnet, it must include if_var.h explicitly, not via header pollution. While here move TCPSTATES declaration right above the include that is going to make use of it.	2020-01-15 03:40:32 +00:00
Gleb Smirnoff	9cdc43b16e	The non-preemptible network epoch identified by net_epoch isn't used. This code definitely meant net_epoch_preempt.	2020-01-15 03:30:33 +00:00
Gleb Smirnoff	4c69f60a8e	Fix yet another regression from r354484. Error code from cr_cansee() aliases with hard error from other operations. Reported by: flo	2020-01-13 21:12:10 +00:00
Michael Tuexen	fe1274ee39	Fix race when accepting TCP connections. When expanding a SYN-cache entry to a socket/inp a two step approach was taken: 1) The local address was filled in, then the inp was added to the hash table. 2) The remote address was filled in and the inp was relocated in the hash table. Before the epoch changes, a write lock was held when this happens and the code looking up entries was holding a corresponding read lock. Since the read lock is gone away after the introduction of the epochs, the half populated inp was found during lookup. This resulted in processing TCP segments in the context of the wrong TCP connection. This patch changes the above procedure in a way that the inp is fully populated before inserted into the hash table. Thanks to Paul <devgs@ukr.net> for reporting the issue on the net@ mailing list and for testing the patch! Reviewed by: rrs@ MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D22971	2020-01-12 17:52:32 +00:00
Michael Tuexen	fc0eb7637c	Fix division by zero issue. Thanks to Stas Denisov for reporting the issue for the userland stack and providing a fix. MFC after: 3 days	2020-01-12 15:45:27 +00:00
Mateusz Guzik	879e0604ee	Add KERNEL_PANICKED macro for use in place of direct panicstr tests	2020-01-12 06:07:54 +00:00
Alexander V. Chernikov	ead85fe415	Add fibnum, family and vnet pointer to each rib head. Having metadata such as fibnum or vnet in the struct rib_head is handy as it eases building functionality in the routing space. This change is required to properly bring back route redirect support. Reviewed by: bz MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D23047	2020-01-09 17:21:00 +00:00
Bjoern A. Zeeb	334fc5822b	vnet: virtualise more network stack sysctls. Virtualise tcp_always_keepalive, TCP and UDP log_in_vain. All three are set in the netoptions startup script, which we would love to run for VNETs as well [1]. While virtualising the log_in_vain sysctls seems pointless at first for as long as the kernel message buffer is not virtualised, it at least allows an administrator to debug the base system or an individual jail if needed without turning the logging on for all jails running on a system. PR: 243193 [1] MFC after: 2 weeks	2020-01-08 23:30:26 +00:00
Ed Maste	ee92463aca	Do not define TCPOUTFLAGS in rack_bbr_common tcp_outflags isn't used in this source file and compilation failed with external GCC on sparc64. I'm not sure why only that case failed (perhaps inconsistent -Werror config) but it is a legitimate issue to fix. Reviewed by: tuexen Differential Revision: https://reviews.freebsd.org/D23068	2020-01-07 17:57:08 +00:00
Randall Stewart	4ad2473790	This catches rack up in the recent changes to ECN and also commonizes the functions that both the freebsd and rack stack uses. Sponsored by:Netflix Inc Differential Revision: https://reviews.freebsd.org/D23052	2020-01-06 15:29:14 +00:00
Randall Stewart	a9a08eced6	This change adds a small feature to the tcp logging code. Basically a connection can now have a separate tag added to the id. Obtained from: Lawrence Stewart Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D22866	2020-01-06 12:48:06 +00:00
Michael Tuexen	97a8ab398e	Don't mark the sendall iterator as being up if it could not be started. MFC after: 1 week	2020-01-05 14:08:01 +00:00
Michael Tuexen	4b66d476b3	Return -1 consistently if an error occurs. MFC after: 1 week	2020-01-05 14:06:40 +00:00
Michael Tuexen	397b1c945f	Ensure that we don't miss a trigger for kicking off the SCTP iterator. Reported by: nwhitehorn@ MFC after: 1 week	2020-01-05 13:56:32 +00:00
Michael Tuexen	ae7cc6c9f8	Make the message size limit used for SCTP_SENDALL configurable via a sysctl variable instead of a compiled in constant. This is based on a patch provided by nwhitehorn@.	2020-01-04 20:33:12 +00:00
Mark Johnston	31069f383a	Take the ifnet's address lock in igmp_v3_cancel_link_timers(). inm_rele_locked() may remove the multicast address associated with inm. Reported by: syzbot+871c5d1fd5fac6c28f52@syzkaller.appspotmail.com Reviewed by: hselasky MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D23009	2020-01-03 17:03:10 +00:00
Michael Tuexen	0eeb0d180e	Remove empty line which was added in r356270 by accident. MFC after: 1 week	2020-01-02 14:04:16 +00:00
Michael Tuexen	ac1d75d23a	Improve input validation of the spp_pathmtu field in the SCTP_PEER_ADDR_PARAMS socket option. The code in the stack assumes sane values for the MTU. This issue was found by running an instance of syzkaller. MFC after: 1 week	2020-01-02 13:55:10 +00:00
Gleb Smirnoff	8d5c56dab1	In r343631 error code for a packet blocked by a firewall was changed from EACCES to EPERM. This change was not intentional, so fix that. Return EACCES if a firewall forbids sending. Noticed by: ae	2020-01-01 17:31:43 +00:00
Alexander Motin	8c3fbf3c20	Relax locking of carp_forus(). This fixes deadlock between CARP and bridge. Bridge calls this function taking CARP lock while holding bridge lock. Same time CARP tries to send its announcements via the bridge while holding CARP lock. Use of CARP_LOCK() here does not solve anything, since sc_addr is constant while race on sc_state is harmless and use of the lock does not close it. Reviewed by: glebius MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2019-12-31 18:58:29 +00:00
Michael Tuexen	e11c9783e1	Fix delayed ACK generation for DCTCP. Submitted by: Richard Scheffenegger Reviewed by: chengc@netapp.com, rgrimes@, tuexen@ Differential Revision: https://reviews.freebsd.org/D22644	2019-12-31 16:15:47 +00:00
Michael Tuexen	493c98c6d2	Add flags for upcoming patches related to improved ECN handling. No functional change. Submitted by: Richard Scheffenegger Reviewed by: rgrimes@, tuexen@ Differential Revision: https://reviews.freebsd.org/D22429	2019-12-31 14:32:48 +00:00
Michael Tuexen	83a2839fb9	Clear the flag indicating that the last received packet was marked CE also in the case where a packet not marked was received. Submitted by: Richard Scheffenegger Reviewed by: rgrimes@, tuexen@ MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D19143	2019-12-31 14:23:52 +00:00
Michael Tuexen	7d87664a04	Add curly braces missed in https://svnweb.freebsd.org/changeset/base/354773 Sponsored by: Netflix, Inc. CID: 1407649	2019-12-31 12:29:01 +00:00
Michael Tuexen	6088175a18	Improve input validation for some parameters having a too small reported length. Thanks to Natalie Silvanovich from Google for finding one of these issues in the SCTP userland stack and reporting it. MFC after: 1 week	2019-12-20 15:25:08 +00:00
Alexander V. Chernikov	bdb214a4a4	Remove useless code from in6_rmx.c The code in question walks IPv6 tree every 60 seconds and looks into the routes with non-zero expiration time (typically, redirected routes). For each such route it sets RTF_PROBEMTU flag at the expiration time. No other part of the kernel checks for RTF_PROBEMTU flag. RTF_PROBEMTU was defined 21 years ago, 30 Jun 1999, as RTF_PROTO1. RTF_PROTO1 is a de-facto standard indication of a route installed by a routing daemon for the last decade. Reviewed by: bz, ae MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D22865	2019-12-18 22:10:56 +00:00
Hans Petter Selasky	a4c5668d12	Leave multicast group before reaping and committing state for both IPv4 and IPv6. This fixes a regression issue after r349369. When trying to exit a multicast group before closing the socket, a multicast leave packet should be sent. Differential Revision: https://reviews.freebsd.org/D22848 PR: 242677 Reviewed by: bz (network) Tested by: Aleksandr Fedorov <aleksandr.fedorov@itglobal.com> MFC after: 1 week Sponsored by: Mellanox Technologies	2019-12-18 12:06:34 +00:00
Randall Stewart	1cf55767b8	This commit is a bit of a re-arrange of deck chairs. It gets both rack and bbr ready for the completion of the STATs framework in FreeBSD. For now if you don't have both NF_stats and stats on it disables them. As soon as the rest of the stats framework lands we can remove that restriction and then just uses stats when defined. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D22479	2019-12-17 16:08:07 +00:00
John Baldwin	5773ac113c	Use callout_func_t instead of the deprecated timeout_t. Reviewed by: kib, imp Differential Revision: https://reviews.freebsd.org/D22752	2019-12-10 22:06:53 +00:00
Bjoern A. Zeeb	646e34881c	Remove the extra epoch tracker change sneaked into r355449 and was not part of the originally reviewed or described change. Pointyhat to: bz Reported by: glebius	2019-12-06 22:20:26 +00:00
Bjoern A. Zeeb	dad68fc301	carp: replace caddr_t with char * Change the remaining caddr_t usages to char * following the removal of the KAME macros No functional change. Requested by: glebius Reviewed by: glebius MFC after: 2 weeks Sponsored by: Netflix (originally) Differential Revision: https://reviews.freebsd.org/D22399	2019-12-06 16:35:48 +00:00
Gleb Smirnoff	e5a084d020	Fix regression from r354484. Don't leak pcb lock if cr_canseeinpcb() returns non-zero. PR: 242415	2019-12-04 22:41:52 +00:00
Bjoern A. Zeeb	0700d2c3f0	Make icmp6_reflect() static. icmp6_reflect() is not used anywhere outside icmp6.c, no reason to export it. Sponsored by: Netflix	2019-12-03 14:46:38 +00:00
Hans Petter Selasky	5b64b824b9	Use refcount from "in_joingroup_locked()" when joining multicast groups. Do not acquire additional references. This makes the IPv4 IGMP code in line with the IPv6 MLD code. Background: The IPv4 multicast code puts an extra reference on the in_multi struct when joining groups. This becomes visible when using daemons like igmpproxy from ports, that multicast entries do not disappear from the output of ifmcstat(8) when multicast streams are disconnected. This fixes a regression issue after r349762. While at it factor the ip_mfilter_insert() and ip6_mfilter_insert() calls to avoid repeated "is_new" check. Differential Revision: https://reviews.freebsd.org/D22595 Tested by: Guido van Rooij <guido@gvr.org> Reviewed by: rgrimes (network) MFC after: 1 week Sponsored by: Mellanox Technologies	2019-12-03 08:46:59 +00:00
Edward Tomasz Napierala	adc56f5a38	Make use of the stats(3) framework in the TCP stack. This makes it possible to retrieve per-connection statistical information such as the receive window size, RTT, or goodput, using a newly added TCP_STATS getsockopt(2) option, and extract them using the stats_voistat_fetch(3) API. See the net/tcprtt port for an example consumer of this API. Compared to the existing TCP_INFO system, the main differences are that this mechanism is easy to extend without breaking ABI, and provides statistical information instead of raw "snapshots" of values at a given point in time. stats(3) is more generic and can be used in both userland and the kernel. Reviewed by: thj Tested by: thj Obtained from: Netflix Relnotes: yes Sponsored by: Klara Inc, Netflix Differential Revision: https://reviews.freebsd.org/D20655	2019-12-02 20:58:04 +00:00
Michael Tuexen	3cf38784e2	Move all ECN related flags from the flags to the flags2 field. This allows adding more ECN related flags in the future. No functional change intended. Submitted by: Richard Scheffenegger Reviewed by: rrs@, tuexen@ Differential Revision: https://reviews.freebsd.org/D22497	2019-12-01 21:01:33 +00:00
Michael Tuexen	77aabfd94f	Make the TF_* flags easier readable by humans by adding leading zeroes to make them aligned. Submitted by: Richard Scheffenegger Reviewed by: rgrimes@, rrs@, tuexen@ Differential Revision: https://reviews.freebsd.org/D22428	2019-12-01 20:45:48 +00:00
Michael Tuexen	b72e56e758	This is an initial step in implementing the new congestion window validation as specified in RFC 7661. Submitted by: Richard Scheffenegger Reviewed by: rrs@, tuexen@ Differential Revision: https://reviews.freebsd.org/D21798	2019-12-01 20:35:41 +00:00
Michael Tuexen	8df12ffcc2	Make the IPTOS value available to all substate handlers. This will allow to add support for L4S or SCE, which require processing of the IP TOS field. Submitted by: Richard Scheffenegger Reviewed by: rgrimes@, rrs@, tuexen@ Differential Revision: https://reviews.freebsd.org/D22426	2019-12-01 18:47:53 +00:00
Michael Tuexen	fa49a96419	In order for the TCP Handshake to support ECN++, and further ECN-related improvements, the ECN bits need to be exposed to the TCP SYNcache. This change is a minimal modification to the function headers, without any functional change intended. Submitted by: Richard Scheffenegger Reviewed by: rgrimes@, rrs@, tuexen@ Differential Revision: https://reviews.freebsd.org/D22436	2019-12-01 18:05:02 +00:00
Michael Tuexen	669a285ffb	When changing the MTU of an SCTP path, not only cancel all ongoing RTT measurements, but also schedule new ones for the future. Submitted by: Julius Flohr MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D22547	2019-12-01 17:35:36 +00:00
Bjoern A. Zeeb	a4adf6cc65	Fix m_pullup() problem after removing PULLDOWN_TESTs and KAME EXT_*macros. r354748-354750 replaced the KAME macros with m_pulldown() calls. Contrary to the rest of the network stack m_len checks before m_pulldown() were not put in place (see r354748). Put these m_len checks in place for now (to go along with the style of the network stack since the initial commits). These are not put in for performance but to avoid an error scenario (even though it also will help performance at the moment as it avoids allocating an extra mbuf; not because of the unconditional function call). The observed error case went like this: (1) an mbuf with M_EXT arrives and we call m_pullup() unconditionally on it. (2) m_pullup() will call m_get() unless the requested length is larger than MHLEN (in which case it'll m_freem() the perfectly fine mbuf) and migrate the requested length of data and pkthdr into the new mbuf. (3) If m_get() succeeds, a further m_pullup() call going over MHLEN will fail. This was observed with failing auto-configuration as an RA packet of 200 bytes exceeded MHLEN and the m_pullup() called from nd6_ra_input() dropped the mbuf. (Re-)adding the m_len checks before m_pullup() calls avoids these problems with mbufs using external storage for now. MFC after: 3 weeks Sponsored by: Netflix	2019-12-01 00:22:04 +00:00
Michael Tuexen	f727fee546	Really ignore the SCTP association identifier on 1-to-1 style sockets as required by the socket API specification. Thanks to Inaki Baz Castillo, who found this bug running the userland stack with valgrind and reported the issue in https://github.com/sctplab/usrsctp/issues/408 MFC after: 1 week	2019-11-28 12:50:25 +00:00
Michael Tuexen	645f3a1cd1	Plug two mbuf leaks during INIT-ACK handling. One leak happens when there is not enough memory to allocate the resources for streams. The other leak happens if there are unknown parameters in the received INIT-ACK chunk which require reporting and the INIT-ACK requires sending an ABORT due to illegal parameter combinations. Hopefully this fixes https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=19083 MFC after: 1 week	2019-11-27 19:32:29 +00:00
Ryan Libby	200f3ac6f7	in_mcast.c: need if_addr_lock around inm_release_deferred Apply a similar fix as for in6_mcast.c. Reviewed by: hselasky Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D20740	2019-11-25 22:25:34 +00:00
Bjoern A. Zeeb	1a117215c7	Reduce the vnet_set module size of ip_mroute to allow loading as a module. With VIMAGE kernels modules get special treatment as they need to also keep the original values and make copies for each instance. For that a few pages of vnet modspace are provided and the kernel-linker and the VNET framework know how to deal with things. When the modspace is (almost) full, other modules which would overflow the modspace cannot be loaded and kldload will fail. ip_mroute uses a lot of variable space, mostly by four big arrays: set_vnet 0000000000000510 vnet_entry_multicast_register_if set_vnet 0000000000000700 vnet_entry_viftable set_vnet 0000000000002000 vnet_entry_bw_meter_timers set_vnet 0000000000002800 vnet_entry_bw_upcalls Dynamically malloc the three big ones for each instance we need and free them again on vnet teardown (the 4th is an ifnet). That way they only need module space for a single pointer and allow a lot more modules using virtualized variables to be loaded on a VNET kernel. PR: 206583 Reviewed by: hselasky, kp MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D22443	2019-11-19 15:38:55 +00:00
Michael Tuexen	c968c769af	Add boundary and overflow checks to the formulas used in the TCP CUBIC congestion control module. Submitted by: Richard Scheffenegger Reviewed by: rgrimes@ Differential Revision: https://reviews.freebsd.org/D19118	2019-11-16 12:00:22 +00:00
Michael Tuexen	b0c1a13e4e	Improve TCP CUBIC specific after idle reaction. The adjustments are inspired by the Linux stack, which has had a functionally equivalent implementation for more than a decade now. Submitted by: Richard Scheffenegger Reviewed by: Cheng Cui Differential Revision: https://reviews.freebsd.org/D18982	2019-11-16 11:57:12 +00:00
Michael Tuexen	35cd141b4b	Implement a TCP CUBIC-specific after idle reaction. This patch addresses a very common case of frequent application stalls, where TCP runs idle and loses the state of the network. Submitted by: Richard Scheffenegger Reviewed by: Cheng Cui Differential Revision: https://reviews.freebsd.org/D18954	2019-11-16 11:37:26 +00:00
Michael Tuexen	453e633384	Revert https://svnweb.freebsd.org/changeset/base/354708 I used the wrong Differential Revision, so back it out and do it right in a follow-up commit.	2019-11-16 11:10:09 +00:00
Bjoern A. Zeeb	b141dd5ddf	Remove now unused IPv6 macros and update docs. After r354748-354750 all uses of the IP6_EXTHDR_CHECK() and IP6_EXTHDR_GET() macros are gone from the kernel. IP6_EXTHDR_GET0() was unused. Remove the macros and update the documentation. Sponsored by: Netflix	2019-11-15 21:55:41 +00:00
Bjoern A. Zeeb	4e619b17c5	IP6_EXTHDR_CHECK(): remove the last instances While r354748 removed almost all IP6_EXTHDR_CHECK() calls, these are not part of the PULLDOWN_TESTS. Equally convert these IP6_EXTHDR_CHECK()s here to m_pullup() and remove the extra check and m_pullup() in tcp_input() under isipv6 given tcp6_input() has done exactly that pullup already. MFC after: 8 weeks Sponsored by: Netflix	2019-11-15 21:51:43 +00:00
Bjoern A. Zeeb	63abacc204	netinet*: replace IP6_EXTHDR_GET() In a few places we have IP6_EXTHDR_GET() left in upper layer protocols. The IP6_EXTHDR_GET() macro might perform an m_pulldown() in case the data fragment is not contiguous. Convert these last remaining instances into m_pullup()s instead. In CARP, for example, we will a few lines later call m_pullup() anyway, the IPsec code coming from OpenBSD would otherwise have done the m_pullup() and are copying the data a bit later anyway, so pulling it in seems no better or worse. Note: this leaves very few m_pulldown() cases behind in the tree and we might want to consider removing them as well to make mbuf management easier again on a path to variable size mbufs, especially given m_pulldown() still has an issue not re-checking M_WRITEABLE(). Reviewed by: gallatin MFC after: 8 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D22335	2019-11-15 21:44:17 +00:00
Michael Tuexen	730cbbc10d	For idle TCP sessions using the CUBIC congestion control, reset ssthresh to the higher of the previous ssthresh or 3/4 of the prior cwnd. Submitted by: Richard Scheffenegger Reviewed by: Cheng Cui Differential Revision: https://reviews.freebsd.org/D18982	2019-11-14 16:28:02 +00:00
Bjoern A. Zeeb	a8fe77d877	netinet: update mp to pass the proper value back In ip6_[direct_]input() we are looping over the extension headers to deal with the next header. We pass a pointer to an mbuf pointer to the handling functions. In certain cases the mbuf can be updated there and we need to pass the new one back. That was missing in dest6_input() and route6_input(). In tcp6_input() we should also update it before we call tcp_input(). In addition to that mark the mbuf NULL all the times when we return that we are done with handling the packet and no next header should be checked (IPPROTO_DONE). This will eventually allow us to assert proper behaviour and catch the above kind of errors more easily, expecting *mp to always be set. This change is extracted from a larger patch and not an exhaustive change across the entire stack yet. PR: 240135 Reported by: prabhakar.lakhera gmail.com MFC after: 3 weeks Sponsored by: Netflix	2019-11-12 15:46:28 +00:00
Gleb Smirnoff	273b2e4c55	Remove now unused INP_INFO_RLOCK macros.	2019-11-07 22:26:54 +00:00
Gleb Smirnoff	43e8b279b8	In TCP HPTS enter the epoch in tcp_hpts_thread() and assert it in the leaf functions.	2019-11-07 21:30:27 +00:00
Gleb Smirnoff	aed553598d	Mechanically convert INP_INFO_RLOCK() to NET_EPOCH_ENTER() in TCP timewait manipulation leaf functions.	2019-11-07 21:29:38 +00:00
Gleb Smirnoff	a81e3ecf46	Since pfslowtimo() runs in the network epoch, tcp_slowtimo() also does. This allows to simplify tcp_tw_2msl_scan() and always require the network epoch in it.	2019-11-07 21:28:46 +00:00
Gleb Smirnoff	032677ceb5	Now that there is no R/W lock on PCB list the pcblist sysctls handlers can be greatly simplified. All the previous double cycling and complex locking was added to avoid these functions holding global PCB locks for extended period of time, preventing addition of new entries.	2019-11-07 21:27:32 +00:00
Gleb Smirnoff	d40c0d47cd	Now that all of the tcp_input() and all its branches are executed in the network epoch, we can greatly simplify synchronization. Remove all unnecessary epoch enters hidden under INP_INFO_RLOCK macro. Remove some unnecessary assertions and convert necessary ones into the NET_EPOCH_ASSERT macro.	2019-11-07 21:23:07 +00:00
Gleb Smirnoff	80577e5583	Remove unnecessary recursive epoch enter via INP_INFO_RLOCK macro in udp_input(). It shall always run in the network epoch.	2019-11-07 21:08:49 +00:00
Gleb Smirnoff	5015a05f0b	Remove now unused INP_HASH_RLOCK() macros.	2019-11-07 21:03:15 +00:00
Gleb Smirnoff	2435e507de	Now with epoch synchronized PCB lookup tables we can greatly simplify locking in udp_output() and udp6_output(). First, we select if we need read or write lock in PCB itself, we take the lock and enter network epoch. Then, we proceed for the rest of the function. In case if we need to modify PCB hash, we would take write lock on it for a short piece of code. We could exit the epoch before allocating an mbuf, but with this patch we are keeping it all the way into ip_output()/ip6_output(). Today this creates an epoch recursion, since ip_output() enters epoch itself. However, once all protocols are reviewed, ip_output() and ip6_output() would require epoch instead of entering it. Note: I'm not 100% sure that in udp6_output() the epoch is required. We don't do PCB hash lookup for a bound socket. And all branches of in6_select_src() don't require epoch, at least they lack assertions. Today inet6 address list is protected by rmlock, although it is CKLIST. AFAIU, the future plan is to protect it by network epoch. That would require epoch in in6_select_src(). Anyway, in future ip6_output() would require epoch, udp6_output() would need to enter it.	2019-11-07 21:01:36 +00:00
Gleb Smirnoff	5a1264335d	Add INP_UNLOCK() which will do whatever R/W unlock is required.	2019-11-07 20:57:51 +00:00
Gleb Smirnoff	d797164a86	Since r353292 on input path we are always in network epoch, when we lookup PCBs. Thus, do not enter epoch recursively in in_pcblookup_hash() and in6_pcblookup_hash(). Same applies to tcp_ctlinput() and tcp6_ctlinput(). This leaves several sysctl(9) handlers that return PCB credentials unprotected. Add epoch enter/exit to all of them. Differential Revision: https://reviews.freebsd.org/D22197	2019-11-07 20:49:56 +00:00
Gleb Smirnoff	de537d63c2	Remove unnecessary recursive epoch enter via INP_INFO_RLOCK macro in divert_packet(). This function is called only from pfil(9) filters, which in their place always run in the network epoch.	2019-11-07 20:44:34 +00:00
Gleb Smirnoff	f42347c39a	Remove unnecessary recursive epoch enter via INP_INFO_RLOCK macro in raw input functions for IPv4 and IPv6. They shall always run in the network epoch.	2019-11-07 20:40:44 +00:00
Bjoern A. Zeeb	503f4e4736	netinet*: variable cleanup In preparation for another change factor out various variable cleanups. These mainly include: (1) do not assign values to variables during declaration: this makes the code more readable and does allow for better grouping of variable declarations, (2) do not assign values to variables before need; e.g., if a variable is only used in the 2nd half of a function and we have multiple return paths before that, then do not set it before it is needed, and (3) try to avoid assigning the same value multiple times. MFC after: 3 weeks Sponsored by: Netflix	2019-11-07 18:29:51 +00:00
Gleb Smirnoff	58d94bd0d9	TCP timers are executed in callout context, so they need to enter network epoch to look into PCB lists. Mechanically convert INP_INFO_RLOCK() to NET_EPOCH_ENTER(). No functional change here.	2019-11-07 00:27:23 +00:00
Gleb Smirnoff	97a95ee134	Mechanically convert INP_INFO_RLOCK() to NET_EPOCH_ENTER() in TCP functions that are executed in syscall context. No functional change here.	2019-11-07 00:10:14 +00:00
Gleb Smirnoff	1a49612526	Mechanically convert INP_INFO_RLOCK() to NET_EPOCH_ENTER(). Remove few outdated comments and extraneous assertions. No functional change here.	2019-11-07 00:08:34 +00:00
Bjoern A. Zeeb	6e6b5143f5	Properly set VNET when nuking recvif from fragment queues. In theory the eventhandler invoke should be in the same VNET as the current interface. We however cannot guarantee that for all cases in the future. So before checking if the fragmentation handling for this VNET is active, switch the VNET to the VNET of the interface to always get the one we want. Reviewed by: hselasky MFC after: 3 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D22153	2019-10-25 18:54:06 +00:00
Michael Tuexen	4a91aa8fc9	Ensure that the flags indicating IPv4/IPv6 are not changed by failing bind() calls. This would lead to inconsistent state resulting in a panic. A fix for stable/11 was committed in https://svnweb.freebsd.org/base?view=revision&revision=338986 An accelerated MFC is planned as discussed with emaste@. Reported by: syzbot+2609a378d89264ff5a42@syzkaller.appspotmail.com Obtained from: jtl@ MFC after: 1 day Sponsored by: Netflix, Inc.	2019-10-24 20:05:10 +00:00
Michael Tuexen	9f36ec8bba	Store a handle for the event handler. This will be used when unloading the SCTP as a module. Obtained from: markj@	2019-10-24 09:22:23 +00:00
Randall Stewart	9992c365b6	Fix a small bug in bbr when running under a VM. Basically what happens is we are more delayed in the pacer calling in so we remove the stack from the pacer and recalculate how much time is left after all data has been acknowledged. However the comparison was backwards so we end up with a negative value in the last_pacing_delay time which causes us to add in a huge value to the next pacing time thus stalling the connection. Reported by: vm2.finance@gmail.com	2019-10-24 05:54:30 +00:00
Michael Tuexen	4ad8cb6813	Fix compile issues when building a kernel without the VIMAGE option. Thanks to cem@ for discussing the issue which resulted in this patch. Reviewed by: cem@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D22089	2019-10-19 20:48:53 +00:00
Conrad Meyer	e9c6962599	debugnet(4): Add optional full-duplex mode It remains unattached to any client protocol. Netdump is unaffected (remaining half-duplex). The intended consumer is NetGDB. Submitted by: John Reimer <john.reimer AT emc.com> (earlier version) Discussed with: markj Differential Revision: https://reviews.freebsd.org/D21541	2019-10-17 20:25:15 +00:00
Conrad Meyer	fde2cf65ce	debugnet(4): Infer non-server connection parameters Loosen requirements for connecting to debugnet-type servers. Only require a destination address; the rest can theoretically be inferred from the routing table. Relax corresponding constraints in netdump(4) and move ifp validation to debugnet connection time. Submitted by: John Reimer <john.reimer AT emc.com> (earlier version) Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D21482	2019-10-17 20:10:32 +00:00
Conrad Meyer	8270d35eca	Add ddb(4) 'netdump' command to netdump a core without preconfiguration Add a 'X -s <server> -c <client> [-g <gateway>] -i <interface>' subroutine to the generic debugnet code. The imagined use is both netdump, shown here, and NetGDB (vaporware). It uses the ddb(4) lexer, with some new extensions, to parse out IPv4 addresses. 'Netdump' uses the generic debugnet routine to load a configuration and start a dump, without any netdump configuration prior to panic. Loosely derived from work by: John Reimer <john.reimer AT emc.com> Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D21460	2019-10-17 19:49:20 +00:00
Gleb Smirnoff	c0ebee428e	Quickly fix up r353683: enter the epoch before calling into netisr_dispatch().	2019-10-17 17:02:50 +00:00
Conrad Meyer	7790c8c199	Split out a more generic debugnet(4) from netdump(4) Debugnet is a simplistic and specialized panic- or debug-time reliable datagram transport. It can drive a single connection at a time and is currently unidirectional (debug/panic machine transmit to remote server only). It is mostly a verbatim code lift from netdump(4). Netdump(4) remains the only consumer (until the rest of this patch series lands). The INET-specific logic has been extracted somewhat more thoroughly than previously in netdump(4), into debugnet_inet.c. UDP-layer logic and up, as much as possible as is protocol-independent, remains in debugnet.c. The separation is not perfect and future improvement is welcome. Supporting INET6 is a long-term goal. Much of the diff is "gratuitous" renaming from 'netdump_' or 'nd_' to 'debugnet_' or 'dn_' -- sorry. I thought keeping the netdump name on the generic module would be more confusing than the refactoring. The only functional change here is the mbuf allocation / tracking. Instead of initiating solely on netdump-configured interface(s) at dumpon(8) configuration time, we watch for any debugnet-enabled NIC for link activation and query it for mbuf parameters at that time. If they exceed the existing high-water mark allocation, we re-allocate and track the new high-water mark. Otherwise, we leave the pre-panic mbuf allocation alone. In a future patch in this series, this will allow initiating netdump from panic ddb(4) without pre-panic configuration. No other functional change intended. Reviewed by: markj (earlier version) Some discussion with: emaste, jhb Objection from: marius Differential Revision: https://reviews.freebsd.org/D21421	2019-10-17 16:23:03 +00:00
Gleb Smirnoff	756368b68b	igmp_v1v2_queue_report() doesn't require epoch.	2019-10-17 16:02:34 +00:00
Hans Petter Selasky	a55383e720	Fix panic in network stack due to use after free when receiving partial fragmented packets before a network interface is detached. When sending IPv4 or IPv6 fragmented packets and a fragment is lost before the network device is freed, the mbuf making up the fragment will remain in the temporary hashed fragment list and cause a panic when it times out due to accessing a freed network interface structure. 1) Make sure the m_pkthdr.rcvif always points to a valid network interface. Else the rcvif field should be set to NULL. 2) Use the rcvif of the last received fragment as m_pkthdr.rcvif for the fully defragged packet, instead of the first received fragment. Panic backtrace for IPv6: panic() icmp6_reflect() # tries to access rcvif->if_afdata[AF_INET6]->xxx icmp6_error() frag6_freef() frag6_slowtimo() pfslowtimo() softclock_call_cc() softclock() ithread_loop() Reviewed by: bz Differential Revision: https://reviews.freebsd.org/D19622 MFC after: 1 week Sponsored by: Mellanox Technologies	2019-10-16 09:11:49 +00:00
Michael Tuexen	776cd558f0	Separate out SCTP related dtrace code. This is based on work done by markj@. Discussed with: markj@ MFC after: 3 days	2019-10-14 20:32:11 +00:00
Randall Stewart	8ee1cf039e	if_hw_tsomaxsegsize needs to be initialized to zero, just like in bbr.c and tcp_output.c	2019-10-14 13:10:29 +00:00
Michael Tuexen	fcfd8ad537	Rename sctp_dtrace_declare.h to sctp_kdtrace.h for consistency. MFC after: 3 days	2019-10-14 13:02:49 +00:00
Gleb Smirnoff	5df91cbe02	Revert r353313. It is not needed with r353357 and is actually incorrect.	2019-10-14 04:10:00 +00:00
Michael Tuexen	d6e23cf0cf	Use an event handler to notify the SCTP about IP address changes instead of calling an SCTP specific function from the IP code. This is a requirement of supporting SCTP as a kernel loadable module. This patch was developed by markj@, I tweaked a bit the SCTP related code. Submitted by: markj@ MFC after: 3 days	2019-10-13 18:17:08 +00:00
Mark Johnston	671d68fad9	Move SCTP DTrace probe definitions into a .c file. Previously they were defined in a header which was included exactly once. Change this to follow the usual practice of putting definitions in C files. No functional change intended. Discussed with: tuexen MFC after: 1 week Sponsored by: The FreeBSD Foundation	2019-10-13 16:14:04 +00:00
Michael Tuexen	1b6ddd9404	Ensure that local variables are reset to their initial value when dealing with error cases in a loop over all remote addresses. This issue was found and reported by OSS_Fuzz in: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=18080 https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=18086 https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=18121 https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=18163 MFC after: 3 days	2019-10-12 17:57:03 +00:00
Gleb Smirnoff	1e5db73d07	The divert(4) module must always be running in network epoch, thus call to if_addr_rlock() isn't needed.	2019-10-10 23:48:42 +00:00
Warner Losh	b23b156e2e	Fix casting error from newer gcc Cast the pointers to (uintptr_t) before assigning to type uint64_t. This eliminates an error from gcc when we cast the pointer to a larger integer type.	2019-10-09 21:02:06 +00:00
Hans Petter Selasky	eabddb25a3	Factor out TCP rateset destruction code. Ensure the epoch_call() function is not called more than one time before the callback has been executed, by always checking the RS_FUNERAL_SCHD flag before invoking epoch_call(). The "rs_number_dead" is balanced again after r353353. Discussed with: rrs@ Sponsored by: Mellanox Technologies	2019-10-09 17:08:40 +00:00
Gleb Smirnoff	0732ac0eff	Revert most of the multicast changes from r353292. This needs a more accurate approach.	2019-10-09 17:03:20 +00:00
Hans Petter Selasky	24be13533b	Fix locking order reversal in the TCP ratelimit code by moving destructors outside the rsmtx mutex. Witness message: lock order reversal: (sleepable after non-sleepable) 1st tcp_rs_mtx (rsmtx) @ sys/netinet/tcp_ratelimit.c:242 2nd sysctl lock (sysctl lock) @ sys/kern/kern_sysctl.c:607 Backtrace: witness_debugger witness_checkorder _rm_wlock_debug sysctl_ctx_free rs_destroy epoch_call_task gtaskqueue_run_locked gtaskqueue_thread_loop Discussed with: rrs@ Sponsored by: Mellanox Technologies	2019-10-09 16:48:48 +00:00
John Baldwin	9e14430d46	Add a TOE KTLS mode and a TOE hook for allocating TLS sessions. This adds the glue to allocate TLS sessions and invokes it from the TLS enable socket option handler. This also adds some counters for active TOE sessions. The TOE KTLS mode is returned by getsockopt(TLSTX_TLS_MODE) when TOE KTLS is in use on a socket, but cannot be set via setsockopt(). To simplify various checks, a TLS session now includes an explicit 'mode' member set to the value returned by TLSTX_TLS_MODE. Various places that used to check 'sw_encrypt' against NULL to determine software vs ifnet (NIC) TLS now check 'mode' instead. Reviewed by: np, gallatin Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D21891	2019-10-08 21:34:06 +00:00
Gleb Smirnoff	e4c40a8a71	Quickly plug another regression from r353292. Again, multicast locking needs lots of work... Reported by: pho	2019-10-08 16:59:17 +00:00
Michael Tuexen	953b78bed9	Validate length before using it, not vice versa. r353060 should have contained this... This fixes https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=18070 MFC after: 3 days	2019-10-08 11:07:16 +00:00
Gleb Smirnoff	b8a6e03fac	Widen NET_EPOCH coverage. When epoch(9) was introduced to network stack, it was basically dropped in place of existing locking, which was mutexes and rwlocks. For the sake of performance mutex covered areas were as small as possible, so became epoch covered areas. However, epoch doesn't introduce any contention, it just delays memory reclaim. So, there is no point to minimise epoch covered areas in sense of performance. Meanwhile entering/exiting epoch also has non-zero CPU usage, so doing this less often is a win. Not the least is also code maintainability. In the new paradigm we can assume that at any stage of processing a packet, we are inside network epoch. This makes coding both input and output path way easier. On output path we already enter epoch quite early - in the ip_output(), in the ip6_output(). This patch does the same for the input path. All ISR processing, network related callouts, other ways of packet injection to the network stack shall be performed in net_epoch. Any leaf function that walks network configuration now asserts epoch. Tricky part is configuration code paths - ioctls, sysctls. They also call into leaf functions, so some need to be changed. This patch would introduce more epoch recursions (see EPOCH_TRACE) than we had before. They will be cleaned up separately, as several of them aren't trivial. Note, that unlike a lock recursion the epoch recursion is safe and just wastes a bit of resources. Reviewed by: gallatin, hselasky, cy, adrian, kristof Differential Revision: https://reviews.freebsd.org/D19111	2019-10-07 22:40:05 +00:00
Michael Tuexen	746c7ae563	In r343587 a simple port filter as sysctl tunable was added to siftr. The new sysctl was not added to the siftr.4 man page at the time. This updates the man page, and removes one left over trailing whitespace. Submitted by: Richard Scheffenegger Reviewed by: bcr@ MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D21619	2019-10-07 20:35:04 +00:00
Randall Stewart	5b63b22075	Brad Davis identified a problem with the new LRO code, VLAN's no longer worked. The problem was that the defines used the same space as the VLAN id. This commit does three things. 1) Move the LRO used fields to the PH_per fields. This is safe since the entire PH_per is used for IP reassembly which LRO code will not hit. 2) Remove old unused pace fields that are not used in mbuf.h 3) The VLAN processing is not in the mbuf queueing code. Consequently if a VLAN submits to Rack or BBR we need to bypass the mbuf queueing for now until rack_bbr_common is updated to handle the VLAN properly. Reported by: Brad Davis	2019-10-06 22:29:02 +00:00
Michael Tuexen	63fb39ba7b	Plumb an mbuf leak in a code path that should not be taken. Also avoid that this path is taken by setting the tail pointer correctly. There is still a bug related to handling unordered unfragmented messages which were delayed in deferred handling. This issue was found by OSS-Fuzz testing the usrsctp stack and reported in https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=17794 MFC after: 3 days	2019-10-06 08:47:10 +00:00
Michael Tuexen	2560cb1eb8	Fix a use after free bug when removing remote addresses. This bug was found by OSS-Fuzz and reported in https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=18004 MFC after: 3 days	2019-10-05 13:28:01 +00:00
Michael Tuexen	0941b9dc37	Plumb an mbuf leak found by Mark Wodrich from Google by fuzz testing the userland stack and reporting it in: https://github.com/sctplab/usrsctp/issues/396 MFC after: 3 days	2019-10-05 12:34:50 +00:00
Michael Tuexen	44f788d793	Fix the adding of padding to COOKIE-ECHO chunks. Thanks to Mark Wodrich who found this issue while fuzz testing the usrsctp stack and reported the issue in https://github.com/sctplab/usrsctp/issues/382 MFC after: 3 days	2019-10-05 09:46:11 +00:00
Michael Tuexen	aac50dab6d	When skipping the address parameter, take the padding into account. MFC after: 3 days	2019-10-03 20:47:57 +00:00
Michael Tuexen	5989470c37	Cleanup sctp_asconf_error_response() and ensure that the parameter is padded as required. This fixes the following bug reported by OSS-Fuzz for the usrsctp stack: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=17790 MFC after: 3 days	2019-10-03 20:39:17 +00:00
Michael Tuexen	967e1a5333	Add missing input validation. This could result in reading from uninitialized memory. The issue was found by OSS-Fuzz for usrsctp and reported in https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=17780 MFC after: 3 days	2019-10-03 18:36:54 +00:00
Michael Tuexen	2974e263c3	Don't use stack memory which is not initialized. Thanks to Mark Wodrich for reporting this issue for the userland stack in https://github.com/sctplab/usrsctp/issues/380 This issue was also found for usrsctp by OSS-fuzz in https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=17778 MFC after: 3 days	2019-09-30 12:06:57 +00:00
Michael Tuexen	12a43d0d5d	RFC 7112 requires a host to put the complete IP header chain including the TCP header in the first IP packet. Enforce this in tcp_output(). In addition make sure that at least one byte payload fits in the TCP segment to allow making progress. Without this check, a kernel with INVARIANTS will panic. This issue was found by running an instance of syzkaller. Reviewed by: jtl@ MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D21665	2019-09-29 10:45:13 +00:00
Michael Tuexen	71e85612a9	Replacing MD5 by SipHash improves the performance of the TCP time stamp initialisation, which is important when the host is dealing with a SYN flood. This affects the computation of the initial TCP sequence number for the client side. This has been discussed with secteam@. Reviewed by: gallatin@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D21616	2019-09-28 13:13:23 +00:00
Michael Tuexen	79c2a2a07b	Ensure that the INP lock is released before leaving [gs]etsockopt() for RACK specific socket options. These issues were found by a syzkaller instance. Reviewed by: rrs@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D21825	2019-09-28 13:05:37 +00:00
Jonathan T. Looney	0b18fb0798	Add new functionality to switch to using cookies exclusively when the syn cache overflows. Whether this is due to an attack or due to the system having more legitimate connections than the syn cache can hold, this situation can quickly impact performance. To make the system perform better during these periods, the code will now switch to exclusively using cookies until the syn cache stops overflowing. In order for this to occur, the system must be configured to use the syn cache with syn cookie fallback. If syn cookies are completely disabled, this change should have no functional impact. When the system is exclusively using syn cookies (either due to configuration or the overflow detection enabled by this change), the code will now skip acquiring a lock on the syn cache bucket. Additionally, the code will now skip lookups in several places (such as when the system receives a RST in response to a SYN|ACK frame). Reviewed by: rrs, gallatin (previous version) Discussed with: tuexen Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D21644	2019-09-26 15:18:57 +00:00
Jonathan T. Looney	0bee4d631a	Access the syncache secret directly from the V_tcp_syncache variable, rather than indirectly through the backpointer to the tcp_syncache structure stored in the hashtable bucket. This also allows us to remove the requirement in syncookie_generate() and syncookie_lookup() that the syncache hashtable bucket must be locked. Reviewed by: gallatin, rrs Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D21644	2019-09-26 15:06:46 +00:00
Jonathan T. Looney	867e98f8ee	Remove the unused sch parameter to the syncache_respond() function. The use of this parameter was removed in r313330. This commit now removes passing this now-unused parameter. Reviewed by: gallatin, rrs Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D21644	2019-09-26 15:02:34 +00:00
Randall Stewart	ac7bd23a7a	lets put (void) in a couple of functions to keep older platforms that are stuck with gcc happy (ppc). The changes are needed in both bbr and rack. Obtained from: Michael Tuexen (mtuexen@)	2019-09-24 20:36:43 +00:00
Randall Stewart	da99b33b17	don't call in_ratelmit detach when RATELIMIT is not compiled in the kernel.	2019-09-24 20:11:55 +00:00
Randall Stewart	2f1cc984db	Fix the ifdefs in tcp_ratelimit.h. They were reversed so that instead of functions only being inside the _KERNEL and the absence of RATELIMIT causing us to have NULL/error returning interfaces we ended up with non-kernel getting the error path. oops..	2019-09-24 20:04:31 +00:00
Randall Stewart	35c7bb3407	This commit adds BBR (Bottleneck Bandwidth and RTT) congestion control. This is a completely separate TCP stack (tcp_bbr.ko) that will be built only if you add the make options WITH_EXTRA_TCP_STACKS=1 and also include the option TCPHPTS. You can also include the RATELIMIT option if you have a NIC interface that supports hardware pacing, BBR understands how to use such a feature. Note that this commit also adds in a general purpose time-filter which allows you to have a min-filter or max-filter. A filter allows you to have a low (or high) value for some period of time and degrade slowly to another value as time passes. You can find out the details of BBR by looking at the original paper at: https://queue.acm.org/detail.cfm?id=3022184 or consult many other web resources you can find on the web referenced by "BBR congestion control". It should be noted that BBRv1 (which this is) does tend to unfairness in cases of small buffered paths, and it will usually get less bandwidth in the case of large BDP paths (when competing with new-reno or cubic flows). BBR is still an active research area and we do plan on implementing V2 of BBR to see if it is an improvement over V1. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D21582	2019-09-24 18:18:11 +00:00
Michael Tuexen	2b861c1538	Plumb a memory leak. Thanks to Felix Weinrank for finding this issue using fuzz testing and reporting it for the userland stack: https://github.com/sctplab/usrsctp/issues/378 MFC after: 3 days	2019-09-24 13:15:24 +00:00
Michael Tuexen	1325a0de13	Don't hold the info lock when calling sctp_select_a_tag(). This avoids a double lock bug in the NAT colliding state processing of SCTP. Thanks to Felix Weinrank for finding and reporting this issue in https://github.com/sctplab/usrsctp/issues/374 He found this bug using fuzz testing. MFC after: 3 days	2019-09-22 11:11:01 +00:00
Michael Tuexen	44f2a3272e	Cleanup the RTO calculation and perform some consistency checks before computing the RTO. This should fix an overflow issue reported by Felix Weinrank in https://github.com/sctplab/usrsctp/issues/375 for the userland stack and found by running a fuzz tester. MFC after: 3 days	2019-09-22 10:40:15 +00:00
Michael Tuexen	e6b3bd22d8	Fix the handling of invalid parameters in ASCONF chunks. Thanks to Mark Wodrich from Google for reporting the issue in https://github.com/sctplab/usrsctp/issues/376 for the userland stack. MFC after: 3 days	2019-09-20 08:20:20 +00:00

6760 Commits