freebsd-dev

Author	SHA1	Message	Date
Andrew Gallatin	086a35562f	tcp: enter network epoch when calling tfb_tcp_fb_fini We need to enter the network epoch when calling into tfb_tcp_fb_fini. I noticed this when I hit an assert running the latest rack Differential Revision: https://reviews.freebsd.org/D30407 Reviewed by: rrs, tuexen Sponsored by: Netflix	2021-05-25 13:45:37 -04:00
Randall Stewart	13c0e198ca	tcp: Fix bugs related to the PUSH bit and rack and an ack war Michaels testing with UDP tunneling found an issue with the push bit, which was only partly fixed in the last commit. The problem is the left edge gets transmitted before the adjustments are done to the send_map, this means that right edge bits must be considered to be added only if the entire RSM is being retransmitted. Now syzkaller also continued to find a crash, which Michael sent me the reproducer for. Turns out that the reproducer on default (freebsd) stack made the stack get into an ack-war with itself. After fixing the reference issues in rack the same ack-war was found in rack (and bbr). Basically what happens is we go into the reassembly code and lose the FIN bit. The trick here is we should not be going into the reassembly code if tlen == 0 i.e. the peer never sent you anything. That then gets the proper action on the FIN bit but then you end up in LAST_ACK with no timers running. This is because the usrclosed function gets called and the FIN's and such have already been exchanged. So when we should be entering FIN_WAIT2 (or even FIN_WAIT1) we get stuck in LAST_ACK. Fixing this means tweaking the usrclosed function so that we properly recognize the condition and drop into FIN_WAIT2 where a timer will allow at least TP_MAXIDLE before closing (to allow time for the peer to retransmit its FIN if the ack is lost). Setting the fast_finwait2 timer can speed this up in testing. Reviewed by: mtuexen,rscheff Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30451	2021-05-25 13:23:31 -04:00
Michael Tuexen	9bbd1a8fcb	tcp: fix a RACK socket buffer lock issue Fix a missing socket buffer unlocking of the socket receive buffer. Reviewed by: gallatin, rrs MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D30402	2021-05-24 20:31:23 +02:00
Randall Stewart	631449d5d0	tcp: Fix an issue with the PUSH bit as well as fill in the missing mtu change for fsb's The push bit itself was also not actually being properly moved to the right edge. The FIN bit was incorrectly on the left edge. We fix these two issues as well as plumb in the mtu_change for alternate stacks. Reviewed by: mtuexen Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30413	2021-05-24 14:42:15 -04:00
Zhenlei Huang	03b0505b8f	ip_forward: Restore RFC reference Add RFC reference lost in `3d846e4822` PR: 255388 Reviewed By: rgrimes, donner, karels, marcus, emaste MFC after: 27 days Differential Revision: https://reviews.freebsd.org/D30374	2021-05-23 00:01:37 +02:00
Michael Tuexen	8923ce6304	tcp: Handle stack switch while processing socket options Handle the case where during socket option processing, the user switches a stack such that processing the stack specific socket option does not make sense anymore. Return an error in this case. MFC after: 1 week Reviewed by: markj Reported by: syzbot+a6e1d91f240ad5d72cd1@syzkaller.appspotmail.com Sponsored by: Netflix, Inc. Differential revision: https://reviews.freebsd.org/D30395	2021-05-22 14:39:36 +02:00
Richard Scheffenegger	3975688563	rack: honor prior socket buffer lock when doing the upcall While partially reverting D24237 with D29690, due to introducing some unintended effects for in-kernel TCP consumers, the preexisting lock on the socket send buffer was not considered properly. Found by: markj MFC after: 2 weeks Reviewed By: tuexen, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D30390	2021-05-22 00:09:59 +02:00
Mark Johnston	7d2608a5d2	tcp: Make error handling in tcp_usr_send() more consistent - Free the input mbuf in a single place instead of in every error path. - Handle PRUS_NOTREADY consistently. - Flush the socket's send buffer if an implicit connect fails. At that point the mbuf has already been enqueued but we don't want to keep it in the send buffer. Reviewed by: gallatin, tuexen Discussed with: jhb MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30349	2021-05-21 17:45:18 -04:00
Richard Scheffenegger	032bf749fd	[tcp] Keep socket buffer locked until upcall r367492 would unlock the socket buffer before eventually calling the upcall. This leads to problematic interaction with NFS kernel server/client components (MP threads) accessing the socket buffer with potentially not correctly updated state. Reported by: rmacklem Reviewed By: tuexen, #transport Tested by: rmacklem, otis MFC after: 2 weeks Sponsored By: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D29690	2021-05-21 11:07:51 +02:00
Michael Tuexen	500eb6dd80	tcp: Fix sending of TCP segments with IP level options When bringing in TCP over UDP support in https://cgit.FreeBSD.org/src/commit/?id=9e644c23000c2f5028b235f6263d17ffb24d3605, the length of IP level options was considered when locating the transport header. This was incorrect and is fixed by this patch. X-MFC with: https://cgit.FreeBSD.org/src/commit/?id=9e644c23000c2f5028b235f6263d17ffb24d3605 MFC after: 3 days Reviewed by: markj, rscheff Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D30358	2021-05-21 09:49:45 +02:00
Wojciech Macek	eedbbec3fd	ip_mroute: remove unused declarations fix build for non-x86 targets	2021-05-21 08:01:26 +02:00
Wojciech Macek	741afc6233	ip_mroute: refactor bw_meter API API should work as following: - periodicaly report Lower-or-EQual bandwidth (LEQ) connections over kernel socket, if user application registered for such per-flow notifications - report Grater-or-EQual (GEQ) bandwidth as soon as it reaches specified value in configured time window Custom implementation of callouts was removed. There is no point of doing calout-wheel here as generic callouts are doing exactly the same. The performance is not critical for such reporting, so the biggest concern should be to have a code which can be easily maintained. This is ia preparation for locking rework which is highly inefficient. Approved by: mw Sponsored by: Stormshield Obtained from: Semihalf Differential Revision: https://reviews.freebsd.org/D30210	2021-05-21 06:43:41 +02:00
Wojciech Macek	787845c0e8	Revert "ip_mroute: refactor bw_meter API" This reverts commit `d1cd99b147`.	2021-05-20 12:14:58 +02:00
Wojciech Macek	d1cd99b147	ip_mroute: refactor bw_meter API API should work as following: - periodicaly report Lower-or-EQual bandwidth (LEQ) connections over kernel socket, if user application registered for such per-flow notifications - report Grater-or-EQual (GEQ) bandwidth as soon as it reaches specified value in configured time window Custom implementation of callouts was removed. There is no point of doing calout-wheel here as generic callouts are doing exactly the same. The performance is not critical for such reporting, so the biggest concern should be to have a code which can be easily maintained. This is ia preparation for locking rework which is highly inefficient. Approved by: mw Sponsored by: Stormshield Obtained from: Semihalf Differential Revision: https://reviews.freebsd.org/D30210	2021-05-20 10:13:55 +02:00
Zhenlei Huang	3d846e4822	Do not forward datagrams originated by link-local addresses The current implement of ip_input() reject packets destined for 169.254.0.0/16, but not those original from 169.254.0.0/16 link-local addresses. Fix to fully respect RFC 3927 section 2.7. PR: 255388 Reviewed by: donner, rgrimes, karels MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D29968	2021-05-18 22:59:46 +02:00
Lutz Donnerhacke	2e6b07866f	libalias: Ensure ASSERT behind varable declarations At some places the ASSERT was inserted before variable declarations are finished. This is fixed now. Reported by: kib Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D30282	2021-05-16 02:28:36 +02:00
Lutz Donnerhacke	189f8eea13	libalias: replace placeholder with static constant The field nullAddress in struct libalias is never set and never used. It exists as a placeholder for an unused argument only. Reviewed by: hselasky MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D30253	2021-05-15 09:05:30 +02:00
Lutz Donnerhacke	effc8e57fb	libalias: Style cleanup libalias is a convolut of various coding styles modified by a series of different editors enforcing interesting convetions on spacing and comments. This patch is a baseline to start with a perfomance rework of libalias. Upcoming patches should be focus on the code, not on the style. That's why most annoying style errors should be fixed beforehand. Reviewed by: hselasky Discussed by: emaste MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D30259	2021-05-15 08:57:55 +02:00
Randall Stewart	02cffbc250	tcp: Incorrect KASSERT causes a panic in rack Skyzall found an interesting panic in rack. When a SYN and FIN are both sent together a KASSERT gets tripped where it is validating that a mbuf pointer is in the sendmap. But a SYN and FIN often will not have a mbuf pointer. So the fix is two fold a) make sure that the SYN and FIN split the right way when cloning an RSM SYN on left edge and FIN on right. And also make sure the KASSERT properly accounts for the case that we have a SYN or FIN so we don't panic. Reviewed by: mtuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D30241	2021-05-13 07:36:04 -04:00
Michael Tuexen	eec6aed5b8	sctp: fix another locking bug in COOKIE handling Thanks to Tolya Korniltsev for reporting the issue for the userland stack and testing the fix. MFC after: 3 days	2021-05-12 23:05:28 +02:00
Mark Johnston	d8acd2681b	Fix mbuf leaks in various pru_send implementations The various protocol implementations are not very consistent about freeing mbufs in error paths. In general, all protocols must free both "m" and "control" upon an error, except if PRUS_NOTREADY is specified (this is only implemented by TCP and unix(4) and requires further work not handled in this diff), in which case "control" still must be freed. This diff plugs various leaks in the pru_send implementations. Reviewed by: tuexen MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30151	2021-05-12 13:00:09 -04:00
Michael Tuexen	251842c639	tcp rack: improve initialisation of retransmit timeout When the TCP is in the front states, don't take the slop variable into account. This improves consistency with the base stack. Reviewed by: rrs@ Differential Revision: https://reviews.freebsd.org/D30230 MFC after: 1 week Sponsored by: Netflix, Inc.	2021-05-12 18:02:21 +02:00
Michael Tuexen	12dda000ed	sctp: fix locking in case of error handling during a restart Thanks to Taylor Brandstetter for finding the issue and providing a patch for the userland stack. MFC after: 3 days	2021-05-12 15:29:06 +02:00
Randall Stewart	4b86a24a76	tcp: In rack, we must only convert restored rtt when the hostcache does restore them. Rack now after the previous commit is very careful to translate any value in the hostcache for srtt/rttvar into its proper format. However there is a snafu here in that if tp->srtt is 0 is the only time that the HC will actually restore the srtt. We need to then only convert the srtt restored when it is actually restored. We do this by making sure it was zero before the call to cc_conn_init and it is non-zero afterwards. Reviewed by: Michael Tuexen Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30213	2021-05-11 08:15:05 -04:00
Wojciech Macek	0b103f7237	mrouter: do not loopback packets unconditionally Looping back router multicast traffic signifficantly stresses network stack. Add possibility to disable or enable loopbacked based on sysctl value. Reported by: Daniel Deville Reviewed by: mw Differential Revision: https://reviews.freebsd.org/D29947	2021-05-11 12:36:07 +02:00
Wojciech Macek	65634ae748	mroute: fix race condition during mrouter shutting down There is a race condition between V_ip_mrouter de-init and ip_mforward handling. It might happen that mrouted is cleaned up after V_ip_mrouter check and before processing packet in ip_mforward. Use epoch call aproach, similar to IPSec which also handles such case. Reported by: Damien Deville Obtained from: Stormshield Reviewed by: mw Differential Revision: https://reviews.freebsd.org/D29946	2021-05-11 12:34:20 +02:00
Richard Scheffenegger	0471a8c734	tcp: SACK Lost Retransmission Detection (LRD) Recover from excessive losses without reverting to a retransmission timeout (RTO). Disabled by default, enable with sysctl net.inet.tcp.do_lrd=1 Reviewed By: #transport, rrs, tuexen, #manpages Sponsored by: Netapp, Inc. Differential Revision: https://reviews.freebsd.org/D28931	2021-05-10 19:06:20 +02:00
Randall Stewart	9867224bab	tcp:Host cache and rack ending up with incorrect values. The hostcache up to now as been updated in the discard callback but without checking if we are all done (the race where there are more than one calls and the counter has not yet reached zero). This means that when the race occurs, we end up calling the hc_upate more than once. Also alternate stacks can keep there srtt/rttvar in different formats (example rack keeps its values in microseconds). Since we call the hc_update before the stack fini() then the values will be in the wrong format. Rack on the other hand, needs to convert items pulled from the hostcache into its internal format else it may end up with very much incorrect values from the hostcache. In the process lets commonize the update mechanism for srtt/rttvar since we now have more than one place that needs to call it. Reviewed by: Michael Tuexen Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30172	2021-05-10 11:25:51 -04:00
Randall Stewart	5a4333a537	This takes Warners suggested approach to making it so that platforms that for whatever reason cannot include the RATELIMIT option can still work with rack. It adds two dummy functions that rack will call and find out that the highest hw supported b/w is 0 (which kinda makes sense and rack is already prepared to handle). Reviewed by: Michael Tuexen, Warner Losh Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30163	2021-05-07 17:32:32 -04:00
Mark Johnston	a1fadf7de2	divert: Fix mbuf ownership confusion in div_output() div_output_outbound() and div_output_inbound() relied on the caller to free the mbuf if an error occurred. However, this is contrary to the semantics of their callees, ip_output(), ip6_output() and netisr_queue_src(), which always consume the mbuf. So, if one of these functions returned an error, that would get propagated up to div_output(), resulting in a double free. Fix the problem by making div_output_outbound() and div_output_inbound() responsible for freeing the mbuf in all cases. Reported by: Michael Schmiedgen <schmiedgen@gmx.net> Tested by: Michael Schmiedgen Reviewed by: donner MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30129	2021-05-07 14:31:08 -04:00
Randall Stewart	a16cee0218	Fix a UDP tunneling issue with rack. Basically there are two issues. A) Not enough hdrlen was being calculated when a UDP tunnel is in place. and B) Not enough memory is allocated in racks fsb. We need to overbook the fsb to include a udphdr just in case. Submitted by: Peter Lei Reviewed by: Michael Tuexen Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30157	2021-05-07 14:06:43 -04:00
Gleb Smirnoff	be578b67b5	tcp_twcheck(): use correct unlock macro. This crippled in due to conflict between two last commits `1db08fbe3f` and `9e644c2300`. Submitted by: Peter Lei	2021-05-06 10:19:21 -07:00
Randall Stewart	5d8fd932e4	This brings into sync FreeBSD with the netflix versions of rack and bbr. This fixes several breakages (panics) since the tcp_lro code was committed that have been reported. Quite a few new features are now in rack (prefecting of DGP -- Dynamic Goodput Pacing among the largest). There is also support for ack-war prevention. Documents comming soon on rack.. Sponsored by: Netflix Reviewed by: rscheff, mtuexen Differential Revision: https://reviews.freebsd.org/D30036	2021-05-06 11:22:26 -04:00
Michael Tuexen	d1cb8d11b0	sctp: improve consistency when handling chunks of wrong size MFC after: 3 days	2021-05-06 01:02:41 +02:00
Mark Johnston	6c34dde83e	igmp: Avoid an out-of-bounds access when zeroing counters When verifying, byte-by-byte, that the user-supplied counters are zero-filled, sysctl_igmp_stat() would check for zero before checking the loop bound. Perform the checks in the correct order. Reported by: KASAN MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-05-05 17:12:51 -04:00
Marko Zec	2aca58e16f	Introduce DXR as an IPv4 longest prefix matching / FIB module DXR maintains compressed lookup structures with a trivial search procedure. A two-stage trie is indexed by the more significant bits of the search key (IPv4 address), while the remaining bits are used for finding the next hop in a sorted array. The tradeoff between memory footprint and search speed depends on the split between the trie and the remaining binary search. The default of 20 bits of the key being used for trie indexing yields good performance (see below) with footprints of around 2.5 Bytes per prefix with current BGP snapshots. Rebuilding lookup structures takes some time, which is compensated for by batching several RIB change requests into a single FIB update, i.e. FIB synchronization with the RIB may be delayed for a fraction of a second. RIB to FIB synchronization, next-hop table housekeeping, and lockless lookup capability is provided by the FIB_ALGO infrastructure. DXR works well on modern CPUs with several MBytes of caches, especially in VMs, where is outperforms other currently available IPv4 FIB algorithms by a large margin. Synthetic single-thread LPM throughput test method: kldload test_lookup; kldload dpdk_lpm4; kldload fib_dxr sysctl net.route.test.run_lps_rnd=N sysctl net.route.test.run_lps_seq=N where N is the number of randomly generated keys (IPv4 addresses) which should be chosen so that each test iteration runs for several seconds. Each reported score represents the best of three runs, in million lookups per second (MLPS), for two bechmarks (RND & SEQ) with two FIBs: host: single interface address, local subnet route + default route BGP: snapshot from linx.routeviews.org, 887957 prefixes, 496 next hops Bhyve VM on an Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60 GHz: inet.algo host, RND host, SEQ BGP, RND BGP, SEQ bsearch4 40.6 20.2 N/A N/A radix4 7.8 3.8 1.2 0.6 radix4_lockless 18.0 9.0 1.6 0.8 dpdk_lpm4 14.4 5.0 14.6 5.0 dxr 70.3 34.7 43.0 19.5 Intel(R) Core(TM) i5-5300U CPU @ 2.30 GHz: inet.algo host, RND host, SEQ BGP, RND BGP, SEQ bsearch4 47.0 23.1 N/A N/A radix4 8.5 4.2 1.9 1.0 radix4_lockless 19.2 9.5 2.5 1.2 dpdk_lpm4 31.2 9.4 31.6 9.3 dxr 84.9 41.4 51.7 23.6 Intel(R) Core(TM) i7-4771 CPU @ 3.50 GHz: inet.algo host, RND host, SEQ BGP, RND BGP, SEQ bsearch4 59.5 29.4 N/A N/A radix4 10.8 5.5 2.5 1.3 radix4_lockless 24.7 12.0 3.1 1.6 dpdk_lpm4 29.1 9.0 30.2 9.1 dxr 101.3 49.9 69.8 32.5 AMD Ryzen 7 3700X 8-Core Processor @ 3.60 GHz: inet.algo host, RND host, SEQ BGP, RND BGP, SEQ bsearch4 70.8 35.4 N/A N/A radix4 14.4 7.2 2.8 1.4 radix4_lockless 30.2 15.1 3.7 1.8 dpdk_lpm4 29.9 9.0 30.0 8.9 dxr 163.3 81.5 99.5 44.4 AMD Ryzen 5 5600X 6-Core Processor @ 3.70 GHz: inet.algo host, RND host, SEQ BGP, RND BGP, SEQ bsearch4 93.6 46.7 N/A N/A radix4 18.9 9.3 4.3 2.1 radix4_lockless 37.2 18.6 5.3 2.7 dpdk_lpm4 51.8 15.1 51.6 14.9 dxr 218.2 103.3 114.0 49.0 Reviewed by: melifaro MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D29821	2021-05-05 13:45:52 +02:00
Michael Tuexen	b621fbb1bf	sctp: drop packet with SHUTDOWN-ACK chunks with wrong vtags MFC after: 3 days	2021-05-04 18:43:31 +02:00
Mark Johnston	f161d294b9	Add missing sockaddr length and family validation to various protocols Several protocol methods take a sockaddr as input. In some cases the sockaddr lengths were not being validated, or were validated after some out-of-bounds accesses could occur. Add requisite checking to various protocol entry points, and convert some existing checks to assertions where appropriate. Reported by: syzkaller+KASAN Reviewed by: tuexen, melifaro MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D29519	2021-05-03 13:35:19 -04:00
Michael Tuexen	8b3d0f6439	sctp: improve address list scanning If the alternate address has to be removed, force the stack to find a new one, if it is still needed. MFC after: 3 days	2021-05-03 02:50:05 +02:00
Michael Tuexen	a89481d328	sctp: improve restart handling This fixes in particular a possible use after free bug reported Anatoly Korniltsev and Taylor Brandstetter for the userland stack. MFC after: 3 days	2021-05-03 02:20:24 +02:00
Alexander Motin	655c200cc8	Fix build after `5f2e183505`.	2021-05-02 20:07:38 -04:00
Michael Tuexen	5f2e183505	sctp: improve error handling in INIT/INIT-ACK processing When processing INIT and INIT-ACK information, also during COOKIE processing, delete the current association, when it would end up in an inconsistent state. MFC after: 3 days	2021-05-02 22:41:35 +02:00
Michael Tuexen	e010d20032	sctp: update the vtag for INIT and INIT-ACK chunks This is needed in case of responding with an ABORT to an INIT-ACK.	2021-04-30 13:33:16 +02:00
Michael Tuexen	eb79855920	sctp: fix SCTP_PEER_ADDR_PARAMS socket option Ignore spp_pathmtu if it is 0, when setting the IPPROTO_SCTP level socket option SCTP_PEER_ADDR_PARAMS as required by RFC 6458. MFC after: 1 week	2021-04-30 12:31:09 +02:00
Michael Tuexen	eecdf5220b	sctp: use RTO.Initial of 1 second as specified in RFC 4960bis	2021-04-30 00:45:56 +02:00
Michael Tuexen	9de7354bb8	sctp: improve consistency in handling chunks with wrong size Just skip the chunk, if no other handling is required by the specification.	2021-04-28 18:11:06 +02:00
Richard Scheffenegger	48be5b976e	tcp: stop spurious rescue retransmissions and potential asserts Reported by: pho@ MFC after: 3 days Reviewed By: tuexen, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D29970	2021-04-28 15:01:10 +02:00
Michael Tuexen	059ec2225c	sctp: cleanup verification of INIT and INIT-ACK chunks	2021-04-27 12:45:43 +02:00
Michael Tuexen	c70d1ef15d	sctp: improve handling of illegal packets containing INIT chunks Stop further processing of a packet when detecting that it contains an INIT chunk, which is too small or is not the only chunk in the packet. Still allow to finish the processing of chunks before the INIT chunk. Thanks to Antoly Korniltsev and Taylor Brandstetter for reporting an issue with the userland stack, which made me aware of this issue. MFC after: 3 days	2021-04-26 10:43:58 +02:00
Michael Tuexen	163153c2a0	sctp: small cleanup, no functional change MFC: 3 days	2021-04-26 02:56:48 +02:00

1 2 3 4 5 ...

6932 Commits