freebsd-dev

Author	SHA1	Message	Date
Alexander V. Chernikov	8482aa7748	Use lltable calculated header when sending lle holdchain after successful lle resolution. Subscribers: imp, ae, bz Differential Revision: https://reviews.freebsd.org/D31391	2021-08-05 20:44:36 +00:00
Michael Tuexen	3f1f6b6ef7	tcp, udp: improve input validation in handling bind() Reported by: syzbot+24fcfd8057e9bc339295@syzkaller.appspotmail.com Reported by: syzbot+6e90ceb5c89285b2655b@syzkaller.appspotmail.com Reviewed by: markj, rscheff MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D31422	2021-08-05 13:48:44 +02:00
Alexander V. Chernikov	f3a3b06121	[lltable] Unify datapath feedback mechamism. Use newly-create llentry_request_feedback(), llentry_mark_used() and llentry_get_hittime() to request datapatch usage check and fetch the results in the same fashion both in IPv4 and IPv6. While here, simplify llentry_provide_feedback() wrapper by eliminating 1 condition check. MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D31390	2021-08-04 22:52:43 +00:00
Konstantin Kukushkin	a61c24ddb7	udp: Fix soroverflow SOCKBUF unlocking We hold the SOCKBUF_LOCK so use soroverflow_locked here. This bug may manifest as a non-killable process stuck in [*so_rcv]. Approved by: scottl Reviewed by: Roy Marples <roy@marples.name> Fixes: `7045b1603b` MFC after: 10 days Differential Revision: https://reviews.freebsd.org/D31374	2021-08-01 08:07:33 -07:00
Roy Marples	7045b1603b	socket: Implement SO_RERROR SO_RERROR indicates that receive buffer overflows should be handled as errors. Historically receive buffer overflows have been ignored and programs could not tell if they missed messages or messages had been truncated because of overflows. Since programs historically do not expect to get receive overflow errors, this behavior is not the default. This is really really important for programs that use route(4) to keep in sync with the system. If we loose a message then we need to reload the full system state, otherwise the behaviour from that point is undefined and can lead to chasing bogus bug reports. Reviewed by: philip (network), kbowling (transport), gbe (manpages) MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D26652	2021-07-28 09:35:09 -07:00
Bryan Drewery	a573243370	netdump: Fix leaking debugnet state on errors. Reviewed by: cem, markj Sponsored by: Dell EMC Differential Revision: https://reviews.freebsd.org/D31319	2021-07-27 09:06:23 -07:00
Mark Johnston	ba21825202	rip: Add missing minimum length validation in rip_output() If the socket is configured such that the sender is expected to supply the IP header, then we need to verify that it actually did so. Reported by: syzkaller+KMSAN Reviewed by: donner MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D31302	2021-07-26 16:39:37 -04:00
Kristof Provost	8e1864ed07	pf: syncookie support Import OpenBSD's syncookie support for pf. This feature help pf resist TCP SYN floods by only creating states once the remote host completes the TCP handshake rather than when the initial SYN packet is received. This is accomplished by using the initial sequence numbers to encode a cookie (hence the name) in the SYN+ACK response and verifying this on receipt of the client ACK. Reviewed by: kbowling Obtained from: OpenBSD MFC after: 1 week Sponsored by: Modirum MDPay Differential Revision: https://reviews.freebsd.org/D31138	2021-07-20 10:36:13 +02:00
Michael Tuexen	a730d82378	tcp: fix RACK and BBR when using VIMAGE enabled kernel Fix a bug in VNET handling, which occurs when using specific NICs. PR: 257195 Reviewed by: rrs MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D31212	2021-07-20 00:29:18 +02:00
Randall Stewart	db4d2d7222	tcp: When rack or bbr get a pullup failure in the common code, don't free the NULL mbuf. There is a bug in the error path where rack_bbr_common does a m_pullup() and the pullup fails. There is a stray mfree(m) after m is set to NULL. This is not a good idea :-) Reviewed by: tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D31194	2021-07-16 13:59:57 -04:00
Randall Stewart	1d171e5ab9	tcp: Lro needs to validate that it does not go beyond the end of the mbuf as it parses. Currently the LRO parser, if given a packet that say has ETH+IP header but the TCP header is in the next mbuf (split), would walk garbage. Lets make sure we keep track as we parse of the length and return NULL anytime we exceed the length of the mbuf. Reviewed by: tuexen, hselasky Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D31195	2021-07-16 06:07:13 -04:00
Randall Stewart	ca1a7e1021	tcp: TCP_LRO getting bad checksums and sending it in to TCP incorrectly. In reviewing tcp_lro.c we have a possibility that some drives may send a mbuf into LRO without making sure that the checksum passes. Some drivers actually are aware of this and do not call lro when the csum failed, others do not do this and thus could end up sending data up that we think has a checksum passing when it does not. This change will fix that situation by properly verifying that the mbuf has the correct markings (CSUM VALID bits as well as csum in mbuf header is set to 0xffff). Reviewed by: tuexen, hselasky, gallatin Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D31155	2021-07-13 12:45:15 -04:00
Stefan Eßer	58080fbca0	libalias: fix divide by zero causing panic The packet_limit can fall to 0, leading to a divide by zero abort in the "packets % packet_limit". An possible solution would be to apply a lower limit of 1 after the calculation of packet_limit, but since any number modulo 1 gives 0, the more efficient solution is to skip the modulo operation for packet_limit <= 1. Since this is a fix for a panic observed in stable/12, merging this fix to stable/12 and stable/13 before expiry of the 3 day waiting period might be justified, if it works for the reporter of the issue. Reported by: Karl Denninger <karl@denninger.net> MFC after: 3 days	2021-07-10 13:08:18 +02:00
Michael Tuexen	105b68b42d	sctp: Fix errno in case of association setup failures Do not report always ETIMEDOUT, but only when appropriate. In other cases report ECONNABORTED. MFC after: 3 days	2021-07-09 23:19:25 +02:00
Michael Tuexen	ce64352a70	sctp: provide consistent stream information in case of early errors While there, make sure the function is called correctly. MFC after: 3 days	2021-07-09 14:16:59 +02:00
Michael Tuexen	84992a3251	sctp: provide sac_error also for ABORT chunk being sent Thanks to Florent Castelli for bringing this issue up for the userland stack and providing an initial patch. MFC: 3 days	2021-07-09 13:46:27 +02:00
Randall Stewart	7312e4e5cf	tcp: Fix 32 bit platform breakage This fixes the incorrect use of a sysctl add to u64. It was for a useconds time, but on 32 bit platforms its not a u64. Instead use the long directive. Reviewed by: tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D31107	2021-07-08 08:16:45 -04:00
Andrew Gallatin	b1e806c0ed	tcp: fix alternate stack build with LINT-NO{INET,INET6,IP} When fixing another bug, I noticed that the alternate TCP stacks do not build when various combinations of ipv4 and ipv6 are disabled. Reviewed by: rrs, tuexen Differential Revision: https://reviews.freebsd.org/D31094 Sponsored by: Netflix	2021-07-07 13:02:08 -04:00
Randall Stewart	d7955cc0ff	tcp: HPTS performance enhancements HPTS drives both rack and bbr, and yet there have been many complaints about performance. This bit of work restructures hpts to help reduce CPU overhead. It does this by now instead of relying on the timer/callout to drive it instead use user return from a system call as well as lro flushes to drive hpts. The timer becomes a backstop that dynamically adjusts based on how "late" we are. Reviewed by: tuexen, glebius Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D31083	2021-07-07 07:22:35 -04:00
Randall Stewart	e834f9a44a	tcp: Address goodput and TLP edge cases. There are several cases where we make a goodput measurement and we are running out of data when we decide to make the measurement. In reality we should not make such a measurement if there is no chance we can have "enough" data. There is also some corner case TLP's that end up not registering as a TLP like they should, we fix this by pushing the doing_tlp setup to the actual timeout that knows it did a TLP. This makes it so we always have the appropriate flag on the sendmap indicating a TLP being done as well as count correctly so we make no more that two TLP's. In addressing the goodput lets also add a "quality" metric that can be viewed via blackbox logs so that a casual observer does not have to figure out how good of a measurement it is. This is needed due to the fact that we may still make a measurement that is of a poorer quality as we run out of data but still have a minimal amount of data to make a measurement. Reviewed by: tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D31076	2021-07-06 15:26:37 -04:00
Andrew Gallatin	28d0a740dd	ktls: auto-disable ifnet (inline hw) kTLS Ifnet (inline) hw kTLS NICs typically keep state within a TLS record, so that when transmitting in-order, they can continue encryption on each segment sent without DMA'ing extra state from the host. This breaks down when transmits are out of order (eg, TCP retransmits). In this case, the NIC must re-DMA the entire TLS record up to and including the segment being retransmitted. This means that when re-transmitting the last 1448 byte segment of a TLS record, the NIC will have to re-DMA the entire 16KB TLS record. This can lead to the NIC running out of PCIe bus bandwidth well before it saturates the network link if a lot of TCP connections have a high retransmoit rate. This change introduces a new sysctl (kern.ipc.tls.ifnet_max_rexmit_pct), where TCP connections with higher retransmit rate will be switched to SW kTLS so as to conserve PCIe bandwidth. Reviewed by: hselasky, markj, rrs Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D30908	2021-07-06 10:28:32 -04:00
Lutz Donnerhacke	4060e77f49	libalias: Remove a stray directive Removal of a preprocessor line was missed during development. Do it now and MFC it together with the other patches. MFC after: 2 days	2021-07-04 17:54:45 +02:00
Lutz Donnerhacke	2f4d91f9cb	libalias: Rewrite HISTORY Fix the history entry (wrong year) and add the missing recent work. MFC together with the other patches. MFC after: 2 days	2021-07-04 17:46:47 +02:00
Lutz Donnerhacke	f284553444	libalias: Fix API bug on initialization The kernel part of ipfw(8) does initialize LibAlias uncondistionally with an zeroized port range (allowed ports from 0 to 0). During restucturing of libalias, port ranges are used everytime and are therefor initialized with different values than zero. The secondary initialization from ipfw (and probably others) overrides the new default values and leave the instance in an unfunctional state. The obvious solution is to detect such reinitializations and use the new default value instead. MFC after: 3 days	2021-07-03 23:03:07 +02:00
Lutz Donnerhacke	b50a4dce18	libalias: Avoid uninitialized expiration The expiration time of direct address mappings is explicitly uninitialized. Expire times are always compared during housekeeping. Despite the uninitialized value does not harm, it's simpler to just set it to a reasonable default. This was detected during valgrinding the test suite. MFC after: 3 days	2021-07-03 01:09:18 +02:00
Lutz Donnerhacke	25392fac94	libalias: Fix splay comparsion bug Comparing elements in a tree requires transitiviy. If a < b and b < c then a must be smaller than c. This way the tree elements are always pairwise comparable. Tristate comparsion functions returning values lower, equal, or greater than zero, are usually implemented by a simple subtraction of the operands. If the size of the operands are equal to the size of the result, integer modular arithmetics kick in and violates the transitivity. Example: Working on byte with 0, 120, and 240. Now computing the differences: 120 - 0 = 120 240 - 120 = 120 240 - 0 = -16 MFC after: 3 days	2021-07-03 00:31:53 +02:00
Michael Tuexen	c7f048ab35	sctp: initialize sequence numbers for ECN correctly MFC after: 3 days Reported by: Junseok Yang (for the userland stack)	2021-06-27 20:14:48 +02:00
Michael Tuexen	6587a2bd1e	sctp: Fix length check for ECNE chunks MFC after: 3 days	2021-06-27 16:10:39 +02:00
Michael Tuexen	870af3f4dc	tcp: tolerate missing timestamps Some TCP stacks negotiate TS support, but do not send TS at all or not for keep-alive segments. Since this includes modern widely deployed stacks, tolerate the violation of RFC 7323 per default. Reviewed by: rgrimes, rrs, rscheff MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D30740 Sponsored by: Netflix, Inc.	2021-06-27 16:03:57 +02:00
Randall Stewart	9e4d9e4c4d	tcp: Preparation for allowing hardware TLS to be able to kick a tcp connection that is retransmitting too much out of hardware and back to software. Hardware TLS is now supported in some interface cards and it works well. Except that when we have connections that retransmit a lot we get into trouble with all the retransmits. This prep step makes way for change that Drew will be making so that we can "kick out" a session from hardware TLS. Reviewed by: mtuexen, gallatin Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30895	2021-06-25 09:30:54 -04:00
Randall Stewart	66aec14a53	tcp: Rack not being very friendly with V6:4 socket and having a connection from V4 There were two bugs that prevented V4 sockets from connecting to a rack server running a V4/V6 socket. As well as a bug that stops the mapped v4 in V6 address from working. Reviewed by: mtuexen Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30885	2021-06-24 14:42:21 -04:00
Wojciech Macek	17ac6d94db	ip_mroute: initialize vif ifnet properly Use if_alloc to ensure all fields of ifnet are allocated properly Reported by: Damien Deville Sponsored by: Stormshield Obtained from: Semihalf Reviewed by: mw Differential revision: https://reviews.freebsd.org/D30608	2021-06-23 10:13:52 +02:00
Lutz Donnerhacke	f70c98a2f5	libalias: Fix compile time warning about unused functions Compiling libalias results in warnings about unused functions. Those warnings are caused by clang's heuristic to consider an inline function as in use, iff the declaration is in a .c file. Declarations in .h files do not emit those warnings. Hence the declarations must be moved to an extra *.h file. MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D30844	2021-06-23 10:06:04 +02:00
John Baldwin	a7f6c6fd94	toe: Read-lock the inp in toe_4tuple_check(). tcp_twcheck now expects a read lock on the inp for the SYN case instead of a write lock. Reviewed by: np Fixes: `1db08fbe3f` tcp_input: always request read-locking of PCB for any pure SYN segment. Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D30782	2021-06-22 16:31:01 -07:00
Gleb Smirnoff	c4804b6b0b	Unbreak TFO, that was broken with `8d5719aa74`. These two assignments are unneccessary and used to be there before TFO as an invariant. With TFO and after `8d5719aa74` the "so" value is still needed. Reported & tested by: tuexen Fixes: `8d5719aa74`	2021-06-22 16:03:44 -07:00
Lutz Donnerhacke	d261e57dea	libalias: Switch to efficient data structure for incoming traffic Current data structure is using a hash of unordered lists. Those unordered lists are quite efficient, because the least recently inserted entries are most likely to be used again. In order to avoid long search times in other cases, the lists are hashed into many buckets. Unfortunatly a search for a miss needs an exhaustive inspection and a careful definition of the hash. Splay trees offer a similar feature: Almost O(1) for access of the least recently used entries, and amortized O(ln(n)) for almost all other cases. Get rid of the hash. Now the data structure should able to quickly react to external packets without eating CPU cycles for breakfast, preventing a DoS. PR: 192888 Discussed with: Dimitry Luhtionov MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D30536	2021-06-19 22:12:28 +02:00
Lutz Donnerhacke	935fc93af1	libalias: Switch to efficient data structure for outgoing traffic Current data structure is using a hash of unordered lists. Those unordered lists are quite efficient, because the least recently inserted entries are most likely to be used again. In order to avoid long search times in other cases, the lists are hashed into many buckets. Unfortunatly a search for a miss needs an exhaustive inspection and a careful definition of the hash. Splay trees offer a similar feature - almost O(1) for access of the least recently used entries), and amortized O(ln(n) - for almost all other cases. Get rid of the hash. Discussed with: Dimitry Luhtionov MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D30516	2021-06-19 22:09:44 +02:00
Lutz Donnerhacke	d989935b5b	libalias: Restructure - Finalize Note, that the restructuring is done. MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D30582	2021-06-19 21:58:56 +02:00
Lutz Donnerhacke	fe83900f9f	libalias: Restructure - Remove temporary state deleteAllLinks from global struct The entry deleteAllLinks in the struct libalias is only used to signal a state between internal calls. It's not used between API calls. MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D30604	2021-06-19 21:55:11 +02:00
Lutz Donnerhacke	9efcad61d8	libalias: Restructure - Use AliasRange instead of PORT_BASE Get rid of PORT_BASE, replace by AliasRange. Simplify code. Factor out the search for a new port. Improves the perfomance a bit. Discussed with: Dimitry Luhtionov MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D30581	2021-06-19 21:40:09 +02:00
Lutz Donnerhacke	1178dda53d	libalias: Restructure - Table for PPTP Let PPTP use its own data structure. Regroup and rename other lists, which are not PPTP. MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D30580	2021-06-19 21:26:31 +02:00
Lutz Donnerhacke	7b44ff4c52	libalias: Restructure - Group expire handling entries Reorder the internal structure semantically. MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D30575	2021-06-19 21:12:27 +02:00
Lutz Donnerhacke	492d3b7109	libalias: Restructure - Group incoming links Reorder incoming links by grouping of common search terms. Significant performance improvement for incoming (missing) flows. Remove LSNAT from outgoing search. Slight speedup due to less comparsions in the loop. MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D30574	2021-06-19 21:03:47 +02:00
Lutz Donnerhacke	d4ab07d2ae	libalias: Restructure - Cleanup and Use for links Factor out a common idiom to return found links. MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D30573	2021-06-19 20:28:53 +02:00
Lutz Donnerhacke	d541903438	libalias: Restructure - Outgoing search Factor out the outgoing search function. Preparation for a new data structure. MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D30572	2021-06-19 20:25:08 +02:00
Lutz Donnerhacke	19dcc4f225	libalias: Restructure - Cleanup _FindLinkIn Simplify program flow in function _FindLinkIn. MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D30571	2021-06-19 20:19:16 +02:00
Lutz Donnerhacke	cac129e603	libalias: Restructure - Table for partially links Separate the partially specified links into a separate data structure. This would causes a major parformance impact, if there are many of them. Use a (smaller) hash table to speed up the partially link access. MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D30570	2021-06-19 20:03:08 +02:00
Richard Scheffenegger	74d7fc8753	tcp: Add PRR cwnd reduction for non-SACK loss This completes PRR cwnd reduction in all circumstances for the base TCP stack (SACK loss recovery, ECN window reduction, non-SACK loss recovery), preventing the arriving ACKs to clock out new data at the old, too high rate. This reduces the chance to induce additional losses while recovering from loss (during congested network conditions). For non-SACK loss recovery, each ACK is assumed to have one MSS delivered. In order to prevent ACK-split attacks, only one window worth of ACKs is considered to actually have delivered new data. MFC after: 6 weeks Reviewed By: rrs, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D29441	2021-06-19 19:25:22 +02:00
Lutz Donnerhacke	32f9c2ceb3	libalias: Restructure - Separate fully qualified search Search fully specified links first. Some performance loss due to need to revisit the db twice, if not found. MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D30569	2021-06-19 19:21:05 +02:00
Lutz Donnerhacke	d41044ddfd	libalias: Restructure - Common search terms Factor out the common Out and In filter Slightly better performance due to eager skip of search loop MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D30568	2021-06-19 18:58:52 +02:00
Lutz Donnerhacke	ef828d39be	libalias: Promote per instance global variable timeStamp Summary: - Use LibAliasTime as a real global variable for central timekeeping. - Reduce number of syscalls in user space considerably. - Dynamically adjust the packet counters to match the second resolution. - Only check the first few packets after a time increase for expiry. Discussed with: hselasky MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D30566	2021-06-19 18:25:44 +02:00
Lutz Donnerhacke	3fd20a79e7	libalias: Stats are unsigned Stats counters are used as unsigned valued (i.e. printf("%u")) but are defined as signed int. This causes trouble later, so fix it early. MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D30587	2021-06-19 18:21:17 +02:00
Mark Johnston	a100217489	Consistently use the SOCKBUF_MTX() and SOCK_MTX() macros This makes it easier to change the socket locking protocols. No functional change intended. MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-06-14 17:32:32 -04:00
Mark Johnston	f4bb1869dd	Consistently use the SOLISTENING() macro Some code was using it already, but in many places we were testing SO_ACCEPTCONN directly. As a small step towards fixing some bugs involving synchronization with listen(2), make the kernel consistently use SOLISTENING(). No functional change intended. MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-06-14 17:32:27 -04:00
Michael Tuexen	f1536bb538	tcp: remove debug output from RACK Reported by: iron.udjin@gmail.com, Marek Zarychta Reviewed by: rrs PR: 256538 MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D30723 Sponsored by: Netflix, Inc.	2021-06-11 20:23:39 +02:00
Randall Stewart	ba1b3e48f5	tcp: Missing mfree in rack and bbr Recently (Nov) we added logic that protects against a peer negotiating a timestamp, and then not including a timestamp. This involved in the input path doing a goto done_with_input label. Now I suspect the code was cribbed from one in Rack that has to do with the SYN. This had a bug, i.e. it should have a m_freem(m) before going to the label (bbr had this missing m_freem() but rack did not). This then caused the missing m_freem to show up in both BBR and Rack. Also looking at the code referencing m->m_pkthdr.lro_nsegs later (after processing) is not a good idea, even though its only for logging. Best to copy that off before any frees can take place. Reviewed by: mtuexen Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30727	2021-06-11 11:38:08 -04:00
Michael Tuexen	fa3746be42	tcp: fix two bugs in new reno * Completely initialise the CC module specific data * Use beta_ecn in case of an ECN event whenever ABE is enabled or it is requested by the stack. Reviewed by: rscheff, rrs MFC after: 3 days Sponsored by: Netflix, Inc.	2021-06-11 15:40:34 +02:00
Michael Tuexen	224cf7b35b	tcp: fix compilation of IPv4-only builds PR: 256538 Reported by: iron.udjin@gmail.com MFC after: 3 days Sponsored by: Netflix, Inc.	2021-06-11 09:50:46 +02:00
Lutz Donnerhacke	294799c6b0	libalias: tidy up housekeeping Replace current expensive, but sparsly called housekeeping by a single, repetive action. This is part of a larger restructure of libalias in order to switch to more efficient data structures. The whole restructure process is split into 15 reviews to ease reviewing. All those steps will be squashed into a single commit for MFC in order to hide the intermediate states from production systems. Reviewed by: hselasky MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D30277	2021-06-10 23:30:10 +02:00
Randall Stewart	67e892819b	tcp: Mbuf leak while holding a socket buffer lock. When running at NF the current Rack and BBR changes with the recent commits from Richard that cause the socket buffer lock to be held over the ip_output() call and then finally culminating in a call to tcp_handle_wakeup() we get a lot of leaked mbufs. I don't think that this leak is actually caused by holding the lock or what Richard has done, but is exposing some other bug that has probably been lying dormant for a long time. I will continue to look (using his changes) at what is going on to try to root cause out the issue. In the meantime I can't leave the leaks out for everyone else. So this commit will revert all of Richards changes and move both Rack and BBR back to just doing the old sorwakeup_locked() calls after messing with the so_rcv buffer. We may want to look at adding back in Richards changes after I have pinpointed the root cause of the mbuf leak and fixed it. Reviewed by: mtuexen,rscheff Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30704	2021-06-10 08:33:57 -04:00
Randall Stewart	b45daaea95	tcp: LRO timestamps have lost their previous precision Recently we had a rewrite to tcp_lro.c that was tested but one subtle change was the move to a less precise timestamp. This causes all kinds of chaos in tcp's that do pacing and needs to be fixed to use the more precise time that was there before. Reviewed by: mtuexen, gallatin, hselasky Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30695	2021-06-09 13:58:54 -04:00
Randall Stewart	4747500dea	tcp: A better fix for the previously attempted fix of the ack-war issue with tcp. So it turns out that my fix before was not correct. It ended with us failing some of the "improved" SYN tests, since we are not in the correct states. With more digging I have figured out the root of the problem is that when we receive a SYN\|FIN the reassembly code made it so we create a segq entry to hold the FIN. In the established state where we were not in order this would be correct i.e. a 0 len with a FIN would need to be accepted. But if you are in a front state we need to strip the FIN so we correctly handle the ACK but ignore the FIN. This gets us into the proper states and avoids the previous ack war. I back out some of the previous changes but then add a new change here in tcp_reass() that fixes the root cause of the issue. We still leave the rack panic fixes in place however. Reviewed by: mtuexen Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30627	2021-06-04 05:26:43 -04:00
Mark Johnston	f96603b56f	tcp, udp: Permit binding with AF_UNSPEC if the address is INADDR_ANY Prior to commit `f161d294b` we only checked the sockaddr length, but now we verify the address family as well. This breaks at least ttcp. Relax the check to avoid breaking compatibility too much: permit AF_UNSPEC if the address is INADDR_ANY. Fixes: `f161d294b` Reported by: Bakul Shah <bakul@iitbombay.org> Reviewed by: tuexen MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30539	2021-05-31 18:53:34 -04:00
Lutz Donnerhacke	bec0a5dca7	libalias: Remove LibAliasCheckNewLink Finally drop the function in 14-CURRENT. Discussed with: kp Differential Revision: https://reviews.freebsd.org/D30275	2021-05-31 13:04:11 +02:00
Lutz Donnerhacke	bfd41ba1fe	libalias: Remove unused function LibAliasCheckNewLink The functionality to detect a newly created link after processing a single packet is decoupled from the packet processing. Every new packet is processed asynchronously and will reset the indicator, hence the function is unusable. I made a Google search for third party code, which uses the function, and failed to find one. That's why the function should be removed: It unusable and unused. A much simplified API/ABI will remain in anything below 14. Discussed with: kp Reviewed by: manpages (bcr) MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D30275	2021-05-31 12:53:57 +02:00
Wojciech Macek	d40cd26a86	ip_mroute: rework ip_mroute Approved by: mw Obtained from: Semihalf Sponsored by: Stormshield Differential Revision: https://reviews.freebsd.org/D30354 Changes: 1. add spinlock to bw_meter If two contexts read and modify bw_meter values it might happen that these are corrupted. Guard only code fragments which do read-and-modify. Context which only do "reads" are not done inside spinlock block. The only sideffect that can happen is an 1-p;acket outdated value reported back to userspace. 2. replace all locks with a single RWLOCK Multiple locks caused a performance issue in routing hot path, when two of them had to be taken. All locks were replaced with single RWLOCK which makes the hot path able to take only shared access to lock most of the times. All configuration routines have to take exclusive lock (as it was done before) but these operation are very rare compared to packet routing. 3. redesign MFC expire and UPCALL expire Use generic kthread and cv_wait/cv_signal for deferring work. Previously, upcalls could be sent from two contexts which complicated the design. All upcall sending is now done in a kthread which allows hot path to work more efficient in some rare cases. 4. replace mutex-guarded linked list with lock free buf_ring All message and data is now passed over lockless buf_ring. This allowed to remove some heavy locking when linked lists were used.	2021-05-31 05:48:15 +02:00
Lutz Donnerhacke	b03a41befe	libalias: Fix nameing and initialization of a constant The commit `189f8eea` contains a refactorisation of a constant. During later review D30283 the naming of the constant was improved and the initialization became explicit. Put this into the tree, in order to MFC the correct naming.	2021-05-30 15:47:29 +02:00
Randall Stewart	8c69d988a8	tcp: When we have an out-of-order FIN we do want to strip off the FIN bit. The last set of commits fixed both a panic (in rack) and an ACK-war (in freebsd and bbr). However there was a missing case, i.e. where we get an out-of-order FIN by itself. In such a case we don't want to leave the FIN bit set, otherwise we will do the wrong thing and ack the FIN incorrectly. Instead we need to go through the tcp_reasm() code and that way the FIN will be stripped and all will be well. Reviewed by: mtuexen,rscheff Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30497	2021-05-27 10:50:32 -04:00
Richard Scheffenegger	c358f1857f	tcp: Use local CC data only in the correct context Most CC algos do use local data, and when calling newreno_cong_signal from there, the latter misinterprets the data as its own struct, leading to incorrect behavior. Reported by: chengc_netapp.com Reviewed By: chengc_netapp.com, tuexen, #transport MFC after: 3 days Sponsored By: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D30470	2021-05-26 20:15:53 +02:00
Randall Stewart	4f3addd94b	tcp: Add a socket option to rack so we can test various changes to the slop value in timers. Timer_slop, in TCP, has been 200ms for a long time. This value dates back a long time when delayed ack timers were longer and links were slower. A 200ms timer slop allows 1 MSS to be sent over a 60kbps link. Its possible that lowering this value to something more in line with todays delayed ack values (40ms) might improve TCP. This bit of code makes it so rack can, via a socket option, adjust the timer slop. Reviewed by: mtuexen Sponsered by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30249	2021-05-26 06:43:30 -04:00
Andrew Gallatin	086a35562f	tcp: enter network epoch when calling tfb_tcp_fb_fini We need to enter the network epoch when calling into tfb_tcp_fb_fini. I noticed this when I hit an assert running the latest rack Differential Revision: https://reviews.freebsd.org/D30407 Reviewed by: rrs, tuexen Sponsored by: Netflix	2021-05-25 13:45:37 -04:00
Randall Stewart	13c0e198ca	tcp: Fix bugs related to the PUSH bit and rack and an ack war Michaels testing with UDP tunneling found an issue with the push bit, which was only partly fixed in the last commit. The problem is the left edge gets transmitted before the adjustments are done to the send_map, this means that right edge bits must be considered to be added only if the entire RSM is being retransmitted. Now syzkaller also continued to find a crash, which Michael sent me the reproducer for. Turns out that the reproducer on default (freebsd) stack made the stack get into an ack-war with itself. After fixing the reference issues in rack the same ack-war was found in rack (and bbr). Basically what happens is we go into the reassembly code and lose the FIN bit. The trick here is we should not be going into the reassembly code if tlen == 0 i.e. the peer never sent you anything. That then gets the proper action on the FIN bit but then you end up in LAST_ACK with no timers running. This is because the usrclosed function gets called and the FIN's and such have already been exchanged. So when we should be entering FIN_WAIT2 (or even FIN_WAIT1) we get stuck in LAST_ACK. Fixing this means tweaking the usrclosed function so that we properly recognize the condition and drop into FIN_WAIT2 where a timer will allow at least TP_MAXIDLE before closing (to allow time for the peer to retransmit its FIN if the ack is lost). Setting the fast_finwait2 timer can speed this up in testing. Reviewed by: mtuexen,rscheff Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30451	2021-05-25 13:23:31 -04:00
Michael Tuexen	9bbd1a8fcb	tcp: fix a RACK socket buffer lock issue Fix a missing socket buffer unlocking of the socket receive buffer. Reviewed by: gallatin, rrs MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D30402	2021-05-24 20:31:23 +02:00
Randall Stewart	631449d5d0	tcp: Fix an issue with the PUSH bit as well as fill in the missing mtu change for fsb's The push bit itself was also not actually being properly moved to the right edge. The FIN bit was incorrectly on the left edge. We fix these two issues as well as plumb in the mtu_change for alternate stacks. Reviewed by: mtuexen Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30413	2021-05-24 14:42:15 -04:00
Zhenlei Huang	03b0505b8f	ip_forward: Restore RFC reference Add RFC reference lost in `3d846e4822` PR: 255388 Reviewed By: rgrimes, donner, karels, marcus, emaste MFC after: 27 days Differential Revision: https://reviews.freebsd.org/D30374	2021-05-23 00:01:37 +02:00
Michael Tuexen	8923ce6304	tcp: Handle stack switch while processing socket options Handle the case where during socket option processing, the user switches a stack such that processing the stack specific socket option does not make sense anymore. Return an error in this case. MFC after: 1 week Reviewed by: markj Reported by: syzbot+a6e1d91f240ad5d72cd1@syzkaller.appspotmail.com Sponsored by: Netflix, Inc. Differential revision: https://reviews.freebsd.org/D30395	2021-05-22 14:39:36 +02:00
Richard Scheffenegger	3975688563	rack: honor prior socket buffer lock when doing the upcall While partially reverting D24237 with D29690, due to introducing some unintended effects for in-kernel TCP consumers, the preexisting lock on the socket send buffer was not considered properly. Found by: markj MFC after: 2 weeks Reviewed By: tuexen, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D30390	2021-05-22 00:09:59 +02:00
Mark Johnston	7d2608a5d2	tcp: Make error handling in tcp_usr_send() more consistent - Free the input mbuf in a single place instead of in every error path. - Handle PRUS_NOTREADY consistently. - Flush the socket's send buffer if an implicit connect fails. At that point the mbuf has already been enqueued but we don't want to keep it in the send buffer. Reviewed by: gallatin, tuexen Discussed with: jhb MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30349	2021-05-21 17:45:18 -04:00
Richard Scheffenegger	032bf749fd	[tcp] Keep socket buffer locked until upcall r367492 would unlock the socket buffer before eventually calling the upcall. This leads to problematic interaction with NFS kernel server/client components (MP threads) accessing the socket buffer with potentially not correctly updated state. Reported by: rmacklem Reviewed By: tuexen, #transport Tested by: rmacklem, otis MFC after: 2 weeks Sponsored By: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D29690	2021-05-21 11:07:51 +02:00
Michael Tuexen	500eb6dd80	tcp: Fix sending of TCP segments with IP level options When bringing in TCP over UDP support in https://cgit.FreeBSD.org/src/commit/?id=9e644c23000c2f5028b235f6263d17ffb24d3605, the length of IP level options was considered when locating the transport header. This was incorrect and is fixed by this patch. X-MFC with: https://cgit.FreeBSD.org/src/commit/?id=9e644c23000c2f5028b235f6263d17ffb24d3605 MFC after: 3 days Reviewed by: markj, rscheff Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D30358	2021-05-21 09:49:45 +02:00
Wojciech Macek	eedbbec3fd	ip_mroute: remove unused declarations fix build for non-x86 targets	2021-05-21 08:01:26 +02:00
Wojciech Macek	741afc6233	ip_mroute: refactor bw_meter API API should work as following: - periodicaly report Lower-or-EQual bandwidth (LEQ) connections over kernel socket, if user application registered for such per-flow notifications - report Grater-or-EQual (GEQ) bandwidth as soon as it reaches specified value in configured time window Custom implementation of callouts was removed. There is no point of doing calout-wheel here as generic callouts are doing exactly the same. The performance is not critical for such reporting, so the biggest concern should be to have a code which can be easily maintained. This is ia preparation for locking rework which is highly inefficient. Approved by: mw Sponsored by: Stormshield Obtained from: Semihalf Differential Revision: https://reviews.freebsd.org/D30210	2021-05-21 06:43:41 +02:00
Wojciech Macek	787845c0e8	Revert "ip_mroute: refactor bw_meter API" This reverts commit `d1cd99b147`.	2021-05-20 12:14:58 +02:00
Wojciech Macek	d1cd99b147	ip_mroute: refactor bw_meter API API should work as following: - periodicaly report Lower-or-EQual bandwidth (LEQ) connections over kernel socket, if user application registered for such per-flow notifications - report Grater-or-EQual (GEQ) bandwidth as soon as it reaches specified value in configured time window Custom implementation of callouts was removed. There is no point of doing calout-wheel here as generic callouts are doing exactly the same. The performance is not critical for such reporting, so the biggest concern should be to have a code which can be easily maintained. This is ia preparation for locking rework which is highly inefficient. Approved by: mw Sponsored by: Stormshield Obtained from: Semihalf Differential Revision: https://reviews.freebsd.org/D30210	2021-05-20 10:13:55 +02:00
Zhenlei Huang	3d846e4822	Do not forward datagrams originated by link-local addresses The current implement of ip_input() reject packets destined for 169.254.0.0/16, but not those original from 169.254.0.0/16 link-local addresses. Fix to fully respect RFC 3927 section 2.7. PR: 255388 Reviewed by: donner, rgrimes, karels MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D29968	2021-05-18 22:59:46 +02:00
Lutz Donnerhacke	2e6b07866f	libalias: Ensure ASSERT behind varable declarations At some places the ASSERT was inserted before variable declarations are finished. This is fixed now. Reported by: kib Reviewed by: kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D30282	2021-05-16 02:28:36 +02:00
Lutz Donnerhacke	189f8eea13	libalias: replace placeholder with static constant The field nullAddress in struct libalias is never set and never used. It exists as a placeholder for an unused argument only. Reviewed by: hselasky MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D30253	2021-05-15 09:05:30 +02:00
Lutz Donnerhacke	effc8e57fb	libalias: Style cleanup libalias is a convolut of various coding styles modified by a series of different editors enforcing interesting convetions on spacing and comments. This patch is a baseline to start with a perfomance rework of libalias. Upcoming patches should be focus on the code, not on the style. That's why most annoying style errors should be fixed beforehand. Reviewed by: hselasky Discussed by: emaste MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D30259	2021-05-15 08:57:55 +02:00
Randall Stewart	02cffbc250	tcp: Incorrect KASSERT causes a panic in rack Skyzall found an interesting panic in rack. When a SYN and FIN are both sent together a KASSERT gets tripped where it is validating that a mbuf pointer is in the sendmap. But a SYN and FIN often will not have a mbuf pointer. So the fix is two fold a) make sure that the SYN and FIN split the right way when cloning an RSM SYN on left edge and FIN on right. And also make sure the KASSERT properly accounts for the case that we have a SYN or FIN so we don't panic. Reviewed by: mtuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D30241	2021-05-13 07:36:04 -04:00
Michael Tuexen	eec6aed5b8	sctp: fix another locking bug in COOKIE handling Thanks to Tolya Korniltsev for reporting the issue for the userland stack and testing the fix. MFC after: 3 days	2021-05-12 23:05:28 +02:00
Mark Johnston	d8acd2681b	Fix mbuf leaks in various pru_send implementations The various protocol implementations are not very consistent about freeing mbufs in error paths. In general, all protocols must free both "m" and "control" upon an error, except if PRUS_NOTREADY is specified (this is only implemented by TCP and unix(4) and requires further work not handled in this diff), in which case "control" still must be freed. This diff plugs various leaks in the pru_send implementations. Reviewed by: tuexen MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30151	2021-05-12 13:00:09 -04:00
Michael Tuexen	251842c639	tcp rack: improve initialisation of retransmit timeout When the TCP is in the front states, don't take the slop variable into account. This improves consistency with the base stack. Reviewed by: rrs@ Differential Revision: https://reviews.freebsd.org/D30230 MFC after: 1 week Sponsored by: Netflix, Inc.	2021-05-12 18:02:21 +02:00
Michael Tuexen	12dda000ed	sctp: fix locking in case of error handling during a restart Thanks to Taylor Brandstetter for finding the issue and providing a patch for the userland stack. MFC after: 3 days	2021-05-12 15:29:06 +02:00
Randall Stewart	4b86a24a76	tcp: In rack, we must only convert restored rtt when the hostcache does restore them. Rack now after the previous commit is very careful to translate any value in the hostcache for srtt/rttvar into its proper format. However there is a snafu here in that if tp->srtt is 0 is the only time that the HC will actually restore the srtt. We need to then only convert the srtt restored when it is actually restored. We do this by making sure it was zero before the call to cc_conn_init and it is non-zero afterwards. Reviewed by: Michael Tuexen Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30213	2021-05-11 08:15:05 -04:00
Wojciech Macek	0b103f7237	mrouter: do not loopback packets unconditionally Looping back router multicast traffic signifficantly stresses network stack. Add possibility to disable or enable loopbacked based on sysctl value. Reported by: Daniel Deville Reviewed by: mw Differential Revision: https://reviews.freebsd.org/D29947	2021-05-11 12:36:07 +02:00
Wojciech Macek	65634ae748	mroute: fix race condition during mrouter shutting down There is a race condition between V_ip_mrouter de-init and ip_mforward handling. It might happen that mrouted is cleaned up after V_ip_mrouter check and before processing packet in ip_mforward. Use epoch call aproach, similar to IPSec which also handles such case. Reported by: Damien Deville Obtained from: Stormshield Reviewed by: mw Differential Revision: https://reviews.freebsd.org/D29946	2021-05-11 12:34:20 +02:00
Richard Scheffenegger	0471a8c734	tcp: SACK Lost Retransmission Detection (LRD) Recover from excessive losses without reverting to a retransmission timeout (RTO). Disabled by default, enable with sysctl net.inet.tcp.do_lrd=1 Reviewed By: #transport, rrs, tuexen, #manpages Sponsored by: Netapp, Inc. Differential Revision: https://reviews.freebsd.org/D28931	2021-05-10 19:06:20 +02:00
Randall Stewart	9867224bab	tcp:Host cache and rack ending up with incorrect values. The hostcache up to now as been updated in the discard callback but without checking if we are all done (the race where there are more than one calls and the counter has not yet reached zero). This means that when the race occurs, we end up calling the hc_upate more than once. Also alternate stacks can keep there srtt/rttvar in different formats (example rack keeps its values in microseconds). Since we call the hc_update before the stack fini() then the values will be in the wrong format. Rack on the other hand, needs to convert items pulled from the hostcache into its internal format else it may end up with very much incorrect values from the hostcache. In the process lets commonize the update mechanism for srtt/rttvar since we now have more than one place that needs to call it. Reviewed by: Michael Tuexen Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30172	2021-05-10 11:25:51 -04:00
Randall Stewart	5a4333a537	This takes Warners suggested approach to making it so that platforms that for whatever reason cannot include the RATELIMIT option can still work with rack. It adds two dummy functions that rack will call and find out that the highest hw supported b/w is 0 (which kinda makes sense and rack is already prepared to handle). Reviewed by: Michael Tuexen, Warner Losh Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30163	2021-05-07 17:32:32 -04:00
Mark Johnston	a1fadf7de2	divert: Fix mbuf ownership confusion in div_output() div_output_outbound() and div_output_inbound() relied on the caller to free the mbuf if an error occurred. However, this is contrary to the semantics of their callees, ip_output(), ip6_output() and netisr_queue_src(), which always consume the mbuf. So, if one of these functions returned an error, that would get propagated up to div_output(), resulting in a double free. Fix the problem by making div_output_outbound() and div_output_inbound() responsible for freeing the mbuf in all cases. Reported by: Michael Schmiedgen <schmiedgen@gmx.net> Tested by: Michael Schmiedgen Reviewed by: donner MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D30129	2021-05-07 14:31:08 -04:00
Randall Stewart	a16cee0218	Fix a UDP tunneling issue with rack. Basically there are two issues. A) Not enough hdrlen was being calculated when a UDP tunnel is in place. and B) Not enough memory is allocated in racks fsb. We need to overbook the fsb to include a udphdr just in case. Submitted by: Peter Lei Reviewed by: Michael Tuexen Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30157	2021-05-07 14:06:43 -04:00
Gleb Smirnoff	be578b67b5	tcp_twcheck(): use correct unlock macro. This crippled in due to conflict between two last commits `1db08fbe3f` and `9e644c2300`. Submitted by: Peter Lei	2021-05-06 10:19:21 -07:00
Randall Stewart	5d8fd932e4	This brings into sync FreeBSD with the netflix versions of rack and bbr. This fixes several breakages (panics) since the tcp_lro code was committed that have been reported. Quite a few new features are now in rack (prefecting of DGP -- Dynamic Goodput Pacing among the largest). There is also support for ack-war prevention. Documents comming soon on rack.. Sponsored by: Netflix Reviewed by: rscheff, mtuexen Differential Revision: https://reviews.freebsd.org/D30036	2021-05-06 11:22:26 -04:00
Michael Tuexen	d1cb8d11b0	sctp: improve consistency when handling chunks of wrong size MFC after: 3 days	2021-05-06 01:02:41 +02:00
Mark Johnston	6c34dde83e	igmp: Avoid an out-of-bounds access when zeroing counters When verifying, byte-by-byte, that the user-supplied counters are zero-filled, sysctl_igmp_stat() would check for zero before checking the loop bound. Perform the checks in the correct order. Reported by: KASAN MFC after: 1 week Sponsored by: The FreeBSD Foundation	2021-05-05 17:12:51 -04:00
Marko Zec	2aca58e16f	Introduce DXR as an IPv4 longest prefix matching / FIB module DXR maintains compressed lookup structures with a trivial search procedure. A two-stage trie is indexed by the more significant bits of the search key (IPv4 address), while the remaining bits are used for finding the next hop in a sorted array. The tradeoff between memory footprint and search speed depends on the split between the trie and the remaining binary search. The default of 20 bits of the key being used for trie indexing yields good performance (see below) with footprints of around 2.5 Bytes per prefix with current BGP snapshots. Rebuilding lookup structures takes some time, which is compensated for by batching several RIB change requests into a single FIB update, i.e. FIB synchronization with the RIB may be delayed for a fraction of a second. RIB to FIB synchronization, next-hop table housekeeping, and lockless lookup capability is provided by the FIB_ALGO infrastructure. DXR works well on modern CPUs with several MBytes of caches, especially in VMs, where is outperforms other currently available IPv4 FIB algorithms by a large margin. Synthetic single-thread LPM throughput test method: kldload test_lookup; kldload dpdk_lpm4; kldload fib_dxr sysctl net.route.test.run_lps_rnd=N sysctl net.route.test.run_lps_seq=N where N is the number of randomly generated keys (IPv4 addresses) which should be chosen so that each test iteration runs for several seconds. Each reported score represents the best of three runs, in million lookups per second (MLPS), for two bechmarks (RND & SEQ) with two FIBs: host: single interface address, local subnet route + default route BGP: snapshot from linx.routeviews.org, 887957 prefixes, 496 next hops Bhyve VM on an Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60 GHz: inet.algo host, RND host, SEQ BGP, RND BGP, SEQ bsearch4 40.6 20.2 N/A N/A radix4 7.8 3.8 1.2 0.6 radix4_lockless 18.0 9.0 1.6 0.8 dpdk_lpm4 14.4 5.0 14.6 5.0 dxr 70.3 34.7 43.0 19.5 Intel(R) Core(TM) i5-5300U CPU @ 2.30 GHz: inet.algo host, RND host, SEQ BGP, RND BGP, SEQ bsearch4 47.0 23.1 N/A N/A radix4 8.5 4.2 1.9 1.0 radix4_lockless 19.2 9.5 2.5 1.2 dpdk_lpm4 31.2 9.4 31.6 9.3 dxr 84.9 41.4 51.7 23.6 Intel(R) Core(TM) i7-4771 CPU @ 3.50 GHz: inet.algo host, RND host, SEQ BGP, RND BGP, SEQ bsearch4 59.5 29.4 N/A N/A radix4 10.8 5.5 2.5 1.3 radix4_lockless 24.7 12.0 3.1 1.6 dpdk_lpm4 29.1 9.0 30.2 9.1 dxr 101.3 49.9 69.8 32.5 AMD Ryzen 7 3700X 8-Core Processor @ 3.60 GHz: inet.algo host, RND host, SEQ BGP, RND BGP, SEQ bsearch4 70.8 35.4 N/A N/A radix4 14.4 7.2 2.8 1.4 radix4_lockless 30.2 15.1 3.7 1.8 dpdk_lpm4 29.9 9.0 30.0 8.9 dxr 163.3 81.5 99.5 44.4 AMD Ryzen 5 5600X 6-Core Processor @ 3.70 GHz: inet.algo host, RND host, SEQ BGP, RND BGP, SEQ bsearch4 93.6 46.7 N/A N/A radix4 18.9 9.3 4.3 2.1 radix4_lockless 37.2 18.6 5.3 2.7 dpdk_lpm4 51.8 15.1 51.6 14.9 dxr 218.2 103.3 114.0 49.0 Reviewed by: melifaro MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D29821	2021-05-05 13:45:52 +02:00
Michael Tuexen	b621fbb1bf	sctp: drop packet with SHUTDOWN-ACK chunks with wrong vtags MFC after: 3 days	2021-05-04 18:43:31 +02:00
Mark Johnston	f161d294b9	Add missing sockaddr length and family validation to various protocols Several protocol methods take a sockaddr as input. In some cases the sockaddr lengths were not being validated, or were validated after some out-of-bounds accesses could occur. Add requisite checking to various protocol entry points, and convert some existing checks to assertions where appropriate. Reported by: syzkaller+KASAN Reviewed by: tuexen, melifaro MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D29519	2021-05-03 13:35:19 -04:00
Michael Tuexen	8b3d0f6439	sctp: improve address list scanning If the alternate address has to be removed, force the stack to find a new one, if it is still needed. MFC after: 3 days	2021-05-03 02:50:05 +02:00
Michael Tuexen	a89481d328	sctp: improve restart handling This fixes in particular a possible use after free bug reported Anatoly Korniltsev and Taylor Brandstetter for the userland stack. MFC after: 3 days	2021-05-03 02:20:24 +02:00
Alexander Motin	655c200cc8	Fix build after `5f2e183505`.	2021-05-02 20:07:38 -04:00
Michael Tuexen	5f2e183505	sctp: improve error handling in INIT/INIT-ACK processing When processing INIT and INIT-ACK information, also during COOKIE processing, delete the current association, when it would end up in an inconsistent state. MFC after: 3 days	2021-05-02 22:41:35 +02:00
Michael Tuexen	e010d20032	sctp: update the vtag for INIT and INIT-ACK chunks This is needed in case of responding with an ABORT to an INIT-ACK.	2021-04-30 13:33:16 +02:00
Michael Tuexen	eb79855920	sctp: fix SCTP_PEER_ADDR_PARAMS socket option Ignore spp_pathmtu if it is 0, when setting the IPPROTO_SCTP level socket option SCTP_PEER_ADDR_PARAMS as required by RFC 6458. MFC after: 1 week	2021-04-30 12:31:09 +02:00
Michael Tuexen	eecdf5220b	sctp: use RTO.Initial of 1 second as specified in RFC 4960bis	2021-04-30 00:45:56 +02:00
Michael Tuexen	9de7354bb8	sctp: improve consistency in handling chunks with wrong size Just skip the chunk, if no other handling is required by the specification.	2021-04-28 18:11:06 +02:00
Richard Scheffenegger	48be5b976e	tcp: stop spurious rescue retransmissions and potential asserts Reported by: pho@ MFC after: 3 days Reviewed By: tuexen, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D29970	2021-04-28 15:01:10 +02:00
Michael Tuexen	059ec2225c	sctp: cleanup verification of INIT and INIT-ACK chunks	2021-04-27 12:45:43 +02:00
Michael Tuexen	c70d1ef15d	sctp: improve handling of illegal packets containing INIT chunks Stop further processing of a packet when detecting that it contains an INIT chunk, which is too small or is not the only chunk in the packet. Still allow to finish the processing of chunks before the INIT chunk. Thanks to Antoly Korniltsev and Taylor Brandstetter for reporting an issue with the userland stack, which made me aware of this issue. MFC after: 3 days	2021-04-26 10:43:58 +02:00
Michael Tuexen	163153c2a0	sctp: small cleanup, no functional change MFC: 3 days	2021-04-26 02:56:48 +02:00
Hans Petter Selasky	a9b66dbd91	Allow the tcp_lro_flush_all() function to be called when the control structure is zeroed, by setting the VNET after checking the mbuf count for zero. It appears there are some cases with early interrupts on some network devices which still trigger page-faults on accessing a NULL "ifp" pointer before the TCP LRO control structure has been initialized. This basically preserves the old behaviour, prior to `9ca874cf74` . No functional change. Reported by: rscheff@ Differential Revision: https://reviews.freebsd.org/D29564 MFC after: 2 weeks Sponsored by: Mellanox Technologies // NVIDIA Networking	2021-04-24 12:23:42 +02:00
Mark Johnston	8e8f1cc9bb	Re-enable network ioctls in capability mode This reverts a portion of `274579831b` ("capsicum: Limit socket operations in capability mode") as at least rtsol and dhcpcd rely on being able to configure network interfaces while in capability mode. Reported by: bapt, Greg V Sponsored by: The FreeBSD Foundation	2021-04-23 09:22:49 -04:00
Navdeep Parhar	01d74fe1ff	Path MTU discovery hooks for offloaded TCP connections. Notify the TOE driver when when an ICMP type 3 code 4 (Fragmentation needed and DF set) message is received for an offloaded connection. This gives the driver an opportunity to lower the path MTU for the connection and resume transmission, much like what the kernel does for the connections that it handles. Reviewed by: glebius@ Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D29755	2021-04-21 13:00:16 -07:00
Mark Johnston	652908599b	Add required checks for unmapped mbufs in ipdivert and ipfw Also add an M_ASSERTMAPPED() macro to verify that all mbufs in the chain are mapped. Use it in ipfw_nat, which operates on a chain returned by m_megapullup(). PR: 255164 Reviewed by: ae, gallatin MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D29838	2021-04-21 15:47:05 -04:00
Gleb Smirnoff	d554522f6e	tcp_hostcache: use SMR for lookups, mutex(9) for updates. In certain cases, e.g. a SYN-flood from a limited set of hosts, the TCP hostcache becomes the main contention point. To solve that, this change introduces lockless lookups on the hostcache. The cache remains a hash, however buckets are now CK_SLIST. For updates a bucket mutex is obtained, for read an SMR section is entered. Reviewed by: markj, rscheff Differential revision: https://reviews.freebsd.org/D29729	2021-04-20 10:02:20 -07:00
Gleb Smirnoff	1db08fbe3f	tcp_input: always request read-locking of PCB for any pure SYN segment. This is further rework of `08d9c92027`. Now we carry the knowledge of lock type all the way through tcp_input() and also into tcp_twcheck(). Ideally the rlocking for pure SYNs should propagate all the way into the alternative TCP stacks, but not yet today. This should close a race when socket is bind(2)-ed but not yet listen(2)-ed and a SYN-packet arrives racing with listen(2), discovered recently by pho@.	2021-04-20 10:02:20 -07:00
Gleb Smirnoff	7b5053ce22	tcp_input: remove comments and assertions about tcpbinfo locking They aren't valid since `d40c0d47cd`.	2021-04-20 10:02:20 -07:00
Richard Scheffenegger	a649f1f6fd	tcp: Deal with DSACKs, and adjust rescue hole on success. When a rescue retransmission is successful, rather than inserting new holes to the left of it, adjust the old rescue entry to cover the missed sequence space. Also, as snd_fack may be stale by that point, pull it forward in order to never create a hole left of snd_una/th_ack. Finally, with DSACKs, tcp_sack_doack() may be called with new full ACKs but a DSACK block. Account for this eventuality properly to keep sacked_bytes >= 0. MFC after: 3 days Reviewed By: kbowling, tuexen, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D29835	2021-04-20 14:54:28 +02:00
Hans Petter Selasky	9ca874cf74	Add TCP LRO support for VLAN and VxLAN. This change makes the TCP LRO code more generic and flexible with regards to supporting multiple different TCP encapsulation protocols and in general lays the ground for broader TCP LRO support. The main job of the TCP LRO code is to merge TCP packets for the same flow, to reduce the number of calls to upper layers. This reduces CPU and increases performance, due to being able to send larger TSO offloaded data chunks at a time. Basically the TCP LRO makes it possible to avoid per-packet interaction by the host CPU. Because the current TCP LRO code was tightly bound and optimized for TCP/IP over ethernet only, several larger changes were needed. Also a minor bug was fixed in the flushing mechanism for inactive entries, where the expire time, "le->mtime" was not always properly set. To avoid having to re-run time consuming regression tests for every change, it was chosen to squash the following list of changes into a single commit: - Refactor parsing of all address information into the "lro_parser" structure. This easily allows to reuse parsing code for inner headers. - Speedup header data comparison. Don't compare field by field, but instead use an unsigned long array, where the fields get packed. - Refactor the IPv4/TCP/UDP checksum computations, so that they may be computed recursivly, only applying deltas as the result of updating payload data. - Make smaller inline functions doing one operation at a time instead of big functions having repeated code. - Refactor the TCP ACK compression code to only execute once per TCP LRO flush. This gives a minor performance improvement and keeps the code simple. - Use sbintime() for all time-keeping. This change also fixes flushing of inactive entries. - Try to shrink the size of the LRO entry, because it is frequently zeroed. - Removed unused TCP LRO macros. - Cleanup unused TCP LRO statistics counters while at it. - Try to use __predict_true() and predict_false() to optimise CPU branch predictions. Bump the __FreeBSD_version due to changing the "lro_ctrl" structure. Tested by: Netflix Reviewed by: rrs (transport) Differential Revision: https://reviews.freebsd.org/D29564 MFC after: 2 week Sponsored by: Mellanox Technologies // NVIDIA Networking	2021-04-20 13:36:22 +02:00
Gleb Smirnoff	faa9ad8a90	Fix off-by-one error in KASSERT from `02f26e98c7`.	2021-04-19 17:20:19 -07:00
Richard Scheffenegger	b87cf2bc84	tcp: keep SACK scoreboard sorted when doing rescue retransmission Reviewed By: tuexen, kbowling, #transport MFC after: 3 days Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D29825	2021-04-18 23:11:10 +02:00
Michael Tuexen	9e644c2300	tcp: add support for TCP over UDP Adding support for TCP over UDP allows communication with TCP stacks which can be implemented in userspace without requiring special priviledges or specific support by the OS. This is joint work with rrs. Reviewed by: rrs Sponsored by: Netflix, Inc. MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D29469	2021-04-18 16:16:42 +02:00
Richard Scheffenegger	2e97826052	rack: Fix ECN on finalizing session. Maintain code similarity between RACK and base stack for ECN. This may not strictly be necessary, depending when a state transition to FIN_WAIT_1 is done in RACK after a shutdown() or close() syscall. MFC after: 3 days Reviewed By: tuexen, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D29658	2021-04-17 20:16:42 +02:00
Richard Scheffenegger	d1de2b05a0	tcp: Rename rfc6675_pipe to sack.revised, and enable by default As full support of RFC6675 is in place, deprecating net.inet.tcp.rfc6675_pipe and enabling by default net.inet.tcp.sack.revised. Reviewed By: #transport, kbowling, rrs Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D28702	2021-04-17 14:59:45 +02:00
Gleb Smirnoff	86046cf55f	tcp_respond(): fix assertion, should have been done in `08d9c92027`.	2021-04-16 15:39:51 -07:00
Gleb Smirnoff	cb8d7c44d6	tcp_syncache: add net.inet.tcp.syncache.see_other sysctl A security feature from `c06f087ccb` appeared to be a huge bottleneck under SYN flood. To mitigate that add a sysctl that would make syncache(4) globally visible, ignoring UID/GID, jail(2) and mac(4) checks. When turned on, we won't need to call crhold() on the listening socket credential for every incoming SYN packet. Reviewed by: bz	2021-04-15 15:26:48 -07:00
John Baldwin	774c4c82ff	TOE: Use a read lock on the PCB for syncache_add(). Reviewed by: np, glebius Fixes: `08d9c92027` Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D29739	2021-04-13 16:31:04 -07:00
Gleb Smirnoff	8d5719aa74	syncache: simplify syncache_add() KPI to return struct socket pointer directly, not overwriting the listen socket pointer argument. Not a functional change.	2021-04-12 08:27:40 -07:00
Gleb Smirnoff	08d9c92027	tcp_input/syncache: acquire only read lock on PCB for SYN,!ACK packets When packet is a SYN packet, we don't need to modify any existing PCB. Normally SYN arrives on a listening socket, we either create a syncache entry or generate syncookie, but we don't modify anything with the listening socket or associated PCB. Thus create a new PCB lookup mode - rlock if listening. This removes the primary contention point under SYN flood - the listening socket PCB. Sidenote: when SYN arrives on a synchronized connection, we still don't need write access to PCB to send a challenge ACK or just to drop. There is only one exclusion - tcptw recycling. However, existing entanglement of tcp_input + stacks doesn't allow to make this change small. Consider this patch as first approach to the problem. Reviewed by: rrs Differential revision: https://reviews.freebsd.org/D29576	2021-04-12 08:25:31 -07:00
Alexander V. Chernikov	c3a456defa	Always use inp fib in the inp_lookup_mcast_ifp(). inp_lookup_mcast_ifp() is static and is only used in the inp_join_group(). The latter function is also static, and is only used in the inp_setmoptions(), which relies on inp being non-NULL. As a result, in the current code, inp_lookup_mcast_ifp() is always called with non-NULL inp. Eliminate unused RT_DEFAULT_FIB condition and always use inp fib instead. Differential Revision: https://reviews.freebsd.org/D29594 Reviewed by: kp MFC after: 2 weeks	2021-04-10 13:47:49 +00:00
Gleb Smirnoff	1a7fe55ab8	tcp_hostcache: make THC_LOCK/UNLOCK macros to work with hash head pointer. Not a functional change.	2021-04-09 14:07:35 -07:00
Gleb Smirnoff	4f49e3382f	tcp_hostcache: style(9) Reviewed by: rscheff	2021-04-09 14:07:27 -07:00
Gleb Smirnoff	7c71f3bd6a	tcp_hostcache: remove extraneous check. All paths leading here already checked this setting. Reviewed by: rscheff	2021-04-09 14:07:19 -07:00
Gleb Smirnoff	0c25bf7e7c	tcp_hostcache: implement tcp_hc_updatemtu() via tcp_hc_update. Locking changes are planned here, and without this change too much copy-and-paste would be between these two functions. Reviewed by: rscheff	2021-04-09 14:06:44 -07:00
Richard Scheffenegger	b878ec024b	tcp: Use jenkins_hash32() in hostcache As other parts of the base tcp stack (eg. tcp fastopen) already use jenkins_hash32, and the properties appear reasonably good, switching to use that. Reviewed By: tuexen, #transport, ae MFC after: 2 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D29515	2021-04-08 20:29:19 +02:00
Gleb Smirnoff	373ffc62c1	tcp_hostcache.c: remove unneeded includes. Reviewed by: rscheff	2021-04-08 10:58:44 -07:00
Gleb Smirnoff	29acb54393	tcp_hostcache: add bool argument for tcp_hc_lookup() to tell are we looking to only read from the result, or to update it as well. For now doesn't affect locking, but allows to push stats and expire update into single place. Reviewed by: rscheff	2021-04-08 10:58:44 -07:00
Gleb Smirnoff	489bde5753	tcp_hostcache: hide rmx_hits/rmx_updates under ifdef. They have little value unless you do some profiling investigations, but they are performance bottleneck. Reviewed by: rscheff	2021-04-08 10:58:44 -07:00
Gleb Smirnoff	2cca4c0ee0	Remove tcp_hostcache.h. Everything is private. Reviewed by: rscheff	2021-04-08 10:58:44 -07:00
Richard Scheffenegger	90cca08e91	tcp: Prepare PRR to work with NewReno LossRecovery Add proper PRR vnet declarations for consistency. Also add pointer to tcpopt struct to tcp_do_prr_ack, in preparation for it to deal with non-SACK window reduction (after loss). No functional change. MFC after: 2 weeks Reviewed By: tuexen, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D29440	2021-04-08 19:16:31 +02:00
Richard Scheffenegger	9f2eeb0262	[tcp] Fix ECN on finalizing sessions. A subtle oversight would subtly change new data packets sent after a shutdown() or close() call, while the send buffer is still draining. MFC after: 3 days Reviewed By: #transport, tuexen Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D29616	2021-04-08 15:26:09 +02:00
Mark Johnston	274579831b	capsicum: Limit socket operations in capability mode Capsicum did not prevent certain privileged networking operations, specifically creation of raw sockets and network configuration ioctls. However, these facilities can be used to circumvent some of the restrictions that capability mode is supposed to enforce. Add capability mode checks to disallow network configuration ioctls and creation of sockets other than PF_LOCAL and SOCK_DGRAM/STREAM/SEQPACKET internet sockets. Reviewed by: oshogbo Discussed with: emaste Reported by: manu Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D29423	2021-04-07 14:32:56 -04:00
Richard Scheffenegger	a04906f027	fix typo in `38ea2bd069`	2021-04-02 20:34:33 +02:00
Richard Scheffenegger	38ea2bd069	Use sbuf_drain unconditionally After making sbuf_drain safe for external use, there is no need to protect the call. MFC after: 2 weeks Reviewed By: tuexen, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D29545	2021-04-02 20:27:46 +02:00
Richard Scheffenegger	9aef4e7c2b	tcp: Shouldn't drain empty sbuf MFC after: 2 weeks Reviewed By: tuexen, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D29524	2021-04-01 17:18:38 +02:00
Richard Scheffenegger	02f26e98c7	tcp: Add hash histogram output and validate bucket length accounting Provide a histogram output to check, if the hashsize or bucketlimit could be optimized. Also add some basic sanity checks around the accounting of the hash utilization. MFC after: 2 weeks Reviewed By: tuexen, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D29506	2021-04-01 14:44:14 +02:00
Richard Scheffenegger	529a2a0f27	tcp: For hostcache performance, use atomics instead of counters As accessing the tcp hostcache happens frequently on some classes of servers, it was recommended to use atomic_add/subtract rather than (per-CPU distributed) counters, which have to be summed up at high cost to cache efficiency. PR: 254333 MFC after: 2 weeks Sponsored by: NetApp, Inc. Reviewed By: #transport, tuexen, jtl Differential Revision: https://reviews.freebsd.org/D29522	2021-04-01 10:03:30 +02:00
Richard Scheffenegger	95e56d31e3	tcp: Make hostcache.cache_count MPSAFE by using a counter_u64_t Addressing the underlying root cause for cache_count to show unexpectedly high values, by protecting all arithmetic on that global variable by using counter(9). PR: 254333 Reviewed By: tuexen, #transport MFC after: 2 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D29510	2021-03-31 20:24:13 +02:00
Richard Scheffenegger	869880463c	tcp: drain tcp_hostcache_list in between per-bucket locks Explicitly drain the sbuf after completing each hash bucket to minimize the work performed while holding the hash bucket lock. PR: 254333 MFC after: 2 weeks Reviewed By: tuexen, jhb, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D29483	2021-03-31 19:24:21 +02:00
Andrey V. Elsukov	c80a4b76ce	ipdivert: check that PCB is still valid after taking INPCB_RLOCK. We are inspecting PCBs of divert sockets under NET_EPOCH section, but PCB could be already detached and we should check INP_FREED flag when we took INP_RLOCK. PR: 254478 MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D29420	2021-03-30 12:31:09 +03:00
Richard Scheffenegger	cb0dd7e122	tcp: reduce memory footprint when listing tcp hostcache In tcp_hostcache_list, the sbuf used would need a large (~2MB) blocking allocation of memory (M_WAITOK), when listing a full hostcache. This may stall the requestor for an indeterminate time. A further optimization is to return the expected userspace buffersize right away, rather than preparing the output of each current entry of the hostcase, provided by: @tuexen. This makes use of the ready-made functions of sbuf to work with sysctl, and repeatedly drain the much smaller buffer. PR: 254333 MFC after: 2 weeks Reviewed By: #transport, tuexen Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D29471	2021-03-28 23:50:23 +02:00
Richard Scheffenegger	b9f803b7d4	tcp: Use PRR for ECN congestion recovery MFC after: 2 weeks Reviewed By: #transport, rrs Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D28972	2021-03-26 02:06:15 +01:00
Richard Scheffenegger	eb3a59a831	tcp: Refactor PRR code No functional change intended. MFC after: 2 weeks Reviewed By: #transport, rrs Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D29411	2021-03-26 00:01:34 +01:00
Richard Scheffenegger	0533fab89e	tcp: Perform simple fast retransmit when SACK Blocks are missing on SACK session MFC after: 2 weeks Reviewed By: #transport, rrs Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D28634	2021-03-25 23:23:48 +01:00
Michael Tuexen	d995cc7e54	sctp: fix handling of RTO.initial of 1 ms MFC after: 3 days Reported by: syzbot+5eb0e009147050056ce9@syzkaller.appspotmail.com	2021-03-22 16:44:18 +01:00
Michael Tuexen	40f41ece76	tcp: improve handling of SYN segments in SYN-SENT state Ensure that the stack does not generate a DSACK block for user data received on a SYN segment in SYN-SENT state. Reviewed by: rscheff MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D29376 Sponsored by: Netflix, Inc.	2021-03-22 15:58:49 +01:00
Richard Scheffenegger	e9f029831f	fix panic when rescue retransmission and FIN overlap PR: 254244 PR: 254309 Reviewed By: #transport, hselasky, tuexen MFC after: 3 days Sponsored By: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D29315	2021-03-17 17:12:04 +01:00
Gordon Bergling	5666643a95	Fix some common typos in comments - occured -> occurred - normaly -> normally - controling -> controlling - fileds -> fields - insterted -> inserted - outputing -> outputting MFC after: 1 week	2021-03-13 18:26:15 +01:00
Gordon Bergling	183502d162	Fix a few typos in comments - trough -> through MFC after: 1 week	2021-03-13 16:37:28 +01:00
John Baldwin	5a50eb6585	Don't pass RFPROC to kproc_create(), it is redundant. Reviewed by: tuexen, kib MFC after: 1 week Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D29206	2021-03-12 09:48:10 -08:00
Alexander V. Chernikov	b1d63265ac	Flush remaining routes from the routing table during VNET shutdown. Summary: This fixes rtentry leak for the cloned interfaces created inside the VNET. PR: 253998 Reported by: rashey at superbox.pl MFC after: 3 days Loopback teardown order is `SI_SUB_INIT_IF`, which happens after `SI_SUB_PROTO_DOMAIN` (route table teardown). Thus, any route table operations are too late to schedule. As the intent of the vnet teardown procedures to minimise the amount of effort by doing global cleanups instead of per-interface ones, address this by adding a relatively light-weight routing table cleanup function, `rib_flush_routes()`. It removes all remaining routes from the routing table and schedules the deletion, which will happen later, when `rtables_destroy()` waits for the current epoch to finish. Test Plan: ``` set_skip:set_skip_group_lo -> passed [0.053s] tail -n 200 /var/log/messages \| grep rtentry ``` Reviewers: #network, kp, bz Reviewed By: kp Subscribers: imp, ae Differential Revision: https://reviews.freebsd.org/D29116	2021-03-10 21:10:14 +00:00
Richard Scheffenegger	e53138694a	tcp: Add prr_out in preparation for PRR/nonSACK and LRD Reviewed By: #transport, kbowling MFC after: 3 days Sponsored By: Netapp, Inc. Differential Revision: https://reviews.freebsd.org/D29058	2021-03-06 00:38:22 +01:00
Richard Scheffenegger	9a13d9dcee	tcp: remove a superfluous local var in tcp_sack_partialack() No functional change. Reviewed By: #transport, tuexen MFC after: 3 days Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D29088	2021-03-05 18:20:23 +01:00
Richard Scheffenegger	4a8f3aad37	tcp: remove incorrect reset of SACK variable in PRR Reviewed By: #transport, rrs, tuexen PR: 253848 MFC after: 3 days Sponsored By: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D29083	2021-03-05 17:45:54 +01:00
Michael Tuexen	705d06b289	rack: unbreak TCP fast open for the client side Allow sending user data on the SYN segment. Reviewed by: rrs MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D29082 Sponsored by: Netflix, Inc.	2021-03-05 16:03:03 +01:00
Kristof Provost	bb4a7d94b9	net: Introduce IPV6_DSCP(), IPV6_ECN() and IPV6_TRAFFIC_CLASS() macros Introduce convenience macros to retrieve the DSCP, ECN or traffic class bits from an IPv6 header. Use them where appropriate. Reviewed by: ae (previous version), rscheff, tuexen, rgrimes MFC after: 2 weeks Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D29056	2021-03-04 20:56:48 +01:00
Michael Tuexen	99adf23006	RACK: fix an issue triggered by using the CDG CC module Obtained from: rrs@ MFC after: 3 days PR: 238741 Sponsored by: Netlix, Inc.	2021-03-02 12:32:16 +01:00
Richard Scheffenegger	0b0f8b359d	calculate prr_out correctly when pipe < ssthresh Reviewed By: #transport, tuexen MFC after: 3 days Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D28998	2021-03-01 16:26:05 +01:00
Richard Scheffenegger	e9071000c9	Improve PRR initial transmission timing Reviewed By: tuexen, #transport MFC after: 3 days Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D28953	2021-02-28 15:46:54 +01:00
Michael Tuexen	70e95f0b69	sctp: avoid integer overflow when starting the HB timer MFC after: 3 days Reported by: syzbot+14b9d7c3c64208fae62f@syzkaller.appspotmail.com	2021-02-27 23:27:30 +01:00
Richard Scheffenegger	9e83a6a556	Include new data sent in PRR calculation Reviewed By: #transport, kbowling MFC after: 3 days Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D28941	2021-02-26 22:31:58 +01:00
Richard Scheffenegger	2593f858d7	A TCP server has to take into consideration, if TCP_NOOPT is preventing the negotiation of TCP features. This affects most TCP options but adherance to RFC7323 with the timestamp option will prevent a session from getting established. PR: 253576 Reviewed By: tuexen, #transport MFC after: 3 days Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D28652	2021-02-25 19:12:20 +01:00
Richard Scheffenegger	31d7a27c6e	PRR: Avoid accounting left-edge twice in partial ACK. Reviewed By: #transport, kbowling MFC after: 3 days Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D28819	2021-02-25 18:37:47 +01:00
Richard Scheffenegger	48396dc779	Address two incorrect calculations and enhance readability of PRR code - address second instance of cwnd potentially becoming zero - fix sublte bug due to implicit int to uint typecase in max() - fix bug due to typo in hand-coded CEILING() function by using howmany() macro - use int instead of long, and add a missing long typecast - replace if conditionals with easier to read imax/imin (as in pseudocode) Reviewed By: #transport, kbowling MFC after: 3 days Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D28813	2021-02-25 18:32:04 +01:00
Kristof Provost	f3245be349	net: remove legacy in_addmulti() Despite the comment to the contrary neither pf nor carp use in_addmulti(). Nothing does, so get rid of it. Carp stopped using it in `08b68b0e4c` (2011). It's unclear when pf stopped using it, but before `d6d3f01e0a` (2012). Reviewed by: bz@, melifaro@ Sponsored by: Rubicon Communications, LLC ("Netgate") Differential Revision: https://reviews.freebsd.org/D28918	2021-02-25 10:13:52 +01:00
Kristof Provost	c139b3c19b	arp/nd: Cope with late calls to iflladdr_event When tearing down vnet jails we can move an if_bridge out (as part of the normal vnet_if_return()). This can, when it's clearing out its list of member interfaces, change its link layer address. That sends an iflladdr_event, but at that point we've already freed the AF_INET/AF_INET6 if_afdata pointers. In other words: when the iflladdr_event callbacks fire we can't assume that ifp->if_afdata[AF_INET] will be set. Reviewed by: donner@, melifaro@ MFC after: 1 week Sponsored by: Orange Business Services Differential Revision: https://reviews.freebsd.org/D28860	2021-02-23 13:54:07 +01:00
Hans Petter Selasky	9febbc4541	Fix for natd(8) sending wrong sequence number after TCP retransmission, terminating a TCP connection. If a TCP packet must be retransmitted and the data length has changed in the retransmitted packet, due to the internal workings of TCP, typically when ACK packets are lost, then there is a 30% chance that the logic in GetDeltaSeqOut() will find the correct length, which is the last length received. This can be explained as follows: If a "227 Entering Passive Mode" packet must be retransmittet and the length changes from 51 to 50 bytes, for example, then we have three cases for the list scan in GetDeltaSeqOut(), depending on how many prior packets were received modulus N_LINK_TCP_DATA=3: case 1: index 0: original packet 51 index 1: retransmitted packet 50 index 2: not relevant case 2: index 0: not relevant index 1: original packet 51 index 2: retransmitted packet 50 case 3: index 0: retransmitted packet 50 index 1: not relevant index 2: original packet 51 This patch simply changes the searching order for TCP packets, always starting at the last received packet instead of any received packet, in GetDeltaAckIn() and GetDeltaSeqOut(). Else no functional changes. Discussed with: rscheff@ Submitted by: Andreas Longwitz <longwitz@incore.de> PR: 230755 MFC after: 1 week Sponsored by: Mellanox Technologies // NVIDIA Networking	2021-02-22 17:13:58 +01:00
Michael Tuexen	b963ce4588	sctp: improve computation of an alternate net Espeially handle the case where the net passed in is about to be deleted and therefore not in the list of nets anymore. MFC after: 3 days Reported by: syzbot+9756917a7c8381adf5e8@syzkaller.appspotmail.com	2021-02-21 17:13:06 +01:00
Michael Tuexen	5ac839029d	sctp: clear a pointer to a net which will be removed MFC after: 3 days	2021-02-21 13:06:05 +01:00
Richard Scheffenegger	a8e431e153	PRR: use accurate rfc6675_pipe when enabled Reviewed By: #transport, tuexen MFC after: 2 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D28816	2021-02-20 20:11:48 +01:00
Richard Scheffenegger	853fd7a2e3	Ensure cwnd doesn't shrink to zero with PRR Under some circumstances, PRR may end up with a fully collapsed cwnd when finalizing the loss recovery. Reviewed By: #transport, kbowling Reported by: Liang Tian MFC after: 1 week Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D28780	2021-02-19 13:55:32 +01:00
Kyle Evans	4c0bef07be	kern: net: remove TCP_LINGERTIME TCP_LINGERTIME can be traced back to BSD 4.4 Lite and perhaps beyond, in exactly the same form that it appears here modulo slightly different context. It used to be the case that there was a single pr_usrreq method with requests dispatched to it; these exact two lines appeared in tcp_usrreq's PRU_ATTACH handling. The only purpose of this that I can find is to cause surprising behavior on accepted connections. Newly-created sockets will never hit these paths as one cannot set SO_LINGER prior to socket(2). If SO_LINGER is set on a listening socket and inherited, one would expect the timeout to be inherited rather than changed arbitrarily like this -- noting that SO_LINGER is nonsense on a listening socket beyond inheritance, since they cannot be 'connected' by definition. Neither Illumos nor Linux reset the timer like this based on testing and inspection of Illumos, and testing of Linux. Reviewed by: rscheff, tuexen Differential Revision: https://reviews.freebsd.org/D28265	2021-02-18 22:36:01 -06:00
Randall Stewart	e13e4fa6c4	fix Navdeeps LINT_NOINET error.	2021-02-18 07:29:12 -05:00
Randall Stewart	0a4f851074	Fix another pesky missing #ifdef TCPHPTS	2021-02-18 01:27:30 -05:00
Randall Stewart	ab4fad4be1	Add ifdef TCPHPTS around build_ack_entry and do_bpf_and_csum to avoid warnings when HPTS is not included Thanks to Gary Jennejohn for pointing this out.	2021-02-17 12:49:42 -05:00
Randall Stewart	69a34e8d02	Update the LRO processing code so that we can support a further CPU enhancements for compressed acks. These are acks that are compressed into an mbuf. The transport has to be aware of how to process these, and an upcoming update to rack will do so. You need the rack changes to actually test and validate these since if the transport does not support mbuf compression, then the old code paths stay in place. We do in this commit take out the concept of logging if you don't have a lock (which was quite dangerous and was only for some early debugging but has been left in the code). Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D28374	2021-02-17 10:41:01 -05:00
Alexander V. Chernikov	9fdbf7eef5	Make in_localip_more() fib-aware. It fixes loopback route installation for the interfaces in the different fibs using the same prefix. Reviewed By: donner PR: 189088 Differential Revision: https://reviews.freebsd.org/D28673 MFC after: 1 week	2021-02-16 20:00:46 +00:00
Richard Scheffenegger	3c40e1d52c	update the SACK loss recovery to RFC6675, with the following new features: - improved pipe calculation which does not degrade under heavy loss - engaging in Loss Recovery earlier under adverse conditions - Rescue Retransmission in case some of the trailing packets of a request got lost All above changes are toggled with the sysctl "rfc6675_pipe" (disabled by default). Reviewers: #transport, tuexen, lstewart, slavash, jtl, hselasky, kib, rgrimes, chengc_netapp.com, thj, #manpages, kbowling, #netapp, rscheff Reviewed By: #transport Subscribers: imp, melifaro MFC after: 2 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D18985	2021-02-16 13:08:37 +01:00
Alexander V. Chernikov	8268d82cff	Remove per-packet ifa refcounting from IPv6 fast path. Currently ip6_input() calls in6ifa_ifwithaddr() for every local packet, in order to check if the target ip belongs to the local ifa in proper state and increase its counters. in6ifa_ifwithaddr() references found ifa. With epoch changes, both `ip6_input()` and all other current callers of `in6ifa_ifwithaddr()` do not need this reference anymore, as epoch provides stability guarantee. Given that, update `in6ifa_ifwithaddr()` to allow it to return ifa without referencing it, while preserving option for getting referenced ifa if so desired. MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D28648	2021-02-15 22:33:12 +00:00
Michael Tuexen	ed782b9f5a	tcp: improve behaviour when using TCP_NOOPT Use ISS for SEG.SEQ when sending a SYN-ACK segment in response to an SYN segment received in the SYN-SENT state on a socket having the IPPROTO_TCP level socket option TCP_NOOPT enabled. Reviewed by: rscheff Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D28656	2021-02-14 12:16:57 +01:00
Andrey V. Elsukov	c6ded47d0b	[udp] fix possible mbuf and lock leak in udp_input(). In error case we can leave `inp' locked, also we need to free mbuf chain `m' in the same case. Release the lock and use `badunlocked' label to exit with freed mbuf. Also modify UDP error statistic to match the IPv6 code. Remove redundant INP_RUNLOCK() from the `if (last == NULL)' block, there are no ways to reach this point with locked `inp'. Obtained from: Yandex LLC MFC after: 3 days Sponsored by: Yandex LLC	2021-02-11 12:08:41 +03:00
Alexander V. Chernikov	924d1c9a05	Revert "SO_RERROR indicates that receive buffer overflows should be handled as errors." Wrong version of the change was pushed inadvertenly. This reverts commit `4a01b854ca`.	2021-02-08 22:32:32 +00:00
Alexander V. Chernikov	4a01b854ca	SO_RERROR indicates that receive buffer overflows should be handled as errors. Historically receive buffer overflows have been ignored and programs could not tell if they missed messages or messages had been truncated because of overflows. Since programs historically do not expect to get receive overflow errors, this behavior is not the default. This is really really important for programs that use route(4) to keep in sync with the system. If we loose a message then we need to reload the full system state, otherwise the behaviour from that point is undefined and can lead to chasing bogus bug reports.	2021-02-08 21:42:20 +00:00
Neel Chauhan	a08cdb6cfb	Allow setting alias port ranges in libalias and ipfw. This will allow a system to be a true RFC 6598 NAT444 setup, where each network segment (e.g. user, subnet) can have their own dedicated port aliasing ranges. Reviewed by: donner, kp Approved by: 0mp (mentor), donner, kp Differential Revision: https://reviews.freebsd.org/D23450	2021-02-02 13:24:17 -08:00
Hans Petter Selasky	db46c0d0cb	Fix LINT kernel builds after `1a714ff204` . MFC after: 1 week Discussed with: rrs@ Differential Revision: https://reviews.freebsd.org/D28357 Sponsored by: Mellanox Technologies // NVIDIA Networking	2021-02-01 14:24:15 +01:00
Michael Tuexen	bdd4630c9a	sctp: small cleanup, no functional change intended. MFC after: 3 days	2021-02-01 14:04:57 +01:00
Michael Tuexen	af885c57d6	sctp: improve input validation Improve the handling of INIT chunks in specific szenarios and report and appropriate error cause. Thanks to Anatoly Korniltsev for reporting the issue for the userland stack. MFC after: 3 days	2021-01-31 23:46:53 +01:00
Michael Tuexen	8dc6a1edca	sctp: fix a locking issue for old unordered data Thanks to Anatoly Korniltsev for reporting the issue for the userland stack. MFC after: 3 days	2021-01-31 10:46:23 +01:00
Gleb Smirnoff	3f43ada98c	Catch up with `6edfd179c8`: mechanically rename IFCAP_NOMAP to IFCAP_MEXTPG. Originally IFCAP_NOMAP meant that the mbuf has external storage pointer that points to unmapped address. Then, this was extended to array of such pointers. Then, such mbufs were augmented with header/trailer. Basically, extended mbufs are extended, and set of features is subject to change. The new name should be generic enough to avoid further renaming.	2021-01-29 11:46:24 -08:00
Randall Stewart	1a714ff204	This pulls over all the changes that are in the netflix tree that fix the ratelimit code. There were several bugs in tcp_ratelimit itself and we needed further work to support the multiple tag format coming for the joint TLS and Ratelimit dances. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D28357	2021-01-28 11:53:05 -05:00
Hans Petter Selasky	093e723190	Add missing decrement of active ratelimit connections. Reviewed by: rrs@ MFC after: 1 week Sponsored by: Mellanox Technologies // NVIDIA Networking	2021-01-26 18:00:21 +01:00
Hans Petter Selasky	85d8d30f9f	Don't allow allocating a new send tag on an INP which is being torn down. This fixes a potential send tag leak. Reviewed by: rrs@ MFC after: 1 week Sponsored by: Mellanox Technologies // NVIDIA Networking	2021-01-26 18:00:02 +01:00
Richard Scheffenegger	6a376af0cd	TCP PRR: Patch div/0 in tcp_prr_partialack With clearing of recover_fs in `bc7ee8e5bc`, div/0 was observed while processing partial_acks. Suspect that rewind of an erraneous RTO may be causing this - with the above change, recover_fs would no longer retained at the last calculated value, and reset. But CC_RTO_ERR can reenable IN_RECOVERY(), without setting this again. Adding a safety net prior to the division in that function, which I missed in D28114.	2021-01-26 16:06:32 +01:00
Richard Scheffenegger	84761f3df5	Adjust line length in tcp_prr_partialack Summary: Wrap lines before column 80 in new prr code checked in recently. No functional changes. Reviewers: tuexen, rrs, jtl, mm, kbowling, #transport Reviewed By: tuexen, mm, #transport Subscribers: imp, melifaro Differential Revision: https://reviews.freebsd.org/D28329	2021-01-26 14:47:19 +01:00
Michael Tuexen	0f7573ffd6	sctp: fix PR-SCTP stats when adding addtional streams MFC after: 1 week	2021-01-24 00:50:33 +01:00
Michael Tuexen	7a051c0a78	sctp: improve consistency No functional change intended. MFC: 1 week	2021-01-24 00:07:41 +01:00
Alexander V. Chernikov	130aebbab0	Further refactor IPv4 interface route creation. * Fix bug with /32 aliases introduced in `81728a538d`. * Explicitly document business logic for IPv4 ifa routes. * Remove remnants of rtinit() * Deduplicate ifa->route prefix code by moving it into ia_getrtprefix() * Deduplicate conditional check for ifa_maintain_loopback_route() by moving into ia_need_loopback_route() * Remove now-unused flags argument from in_addprefix(). Reviewed by: donner PR: 252883 Differential Revision: https://reviews.freebsd.org/D28246	2021-01-21 21:48:49 +00:00
Richard Scheffenegger	bc7ee8e5bc	Address panic with PRR due to missed initialization of recover_fs Summary: When using the base stack in conjunction with RACK, it appears that infrequently, ++tp->t_dupacks is instantly larger than tcprexmtthresh. This leaves the recover flightsize (sackhint.recover_fs) uninitialized, leading to a div/0 panic. Address this by properly initializing the variable just prior to first use, if it is not properly initialized. In order to prevent stale information from a prior recovery to negatively impact the PRR calculations in this event, also clear recover_fs once loss recovery is finished. Finally, improve the readability of the initialization of recover_fs when t_dupacks == tcprexmtthresh by adjusting the indentation and using the max(1, snd_nxt - snd_una) macro. Reviewers: rrs, kbowling, tuexen, jtl, #transport, gnn!, jmg, manu, #manpages Reviewed By: rrs, kbowling, #transport Subscribers: bdrewery, andrew, rpokala, ae, emaste, bz, bcran, #linuxkpi, imp, melifaro Differential Revision: https://reviews.freebsd.org/D28114	2021-01-20 12:06:34 +01:00
Alex Richardson	a81c165bce	Require uint32_t alignment for ipfw_insn There are many casts of this struct to uint32_t, so we also need to ensure that it is sufficiently aligned to safely perform this cast on architectures that don't allow unaligned accesses. This fixes lots of -Wcast-align warnings. Reviewed By: ae Differential Revision: https://reviews.freebsd.org/D27879	2021-01-19 21:23:25 +00:00
Alex Richardson	be5972695f	libalias: Fix remaining compiler warnings This fixes some sign-compare warnings and adds a missing static to a variable declaration. Differential Revision: https://reviews.freebsd.org/D27883	2021-01-19 21:23:24 +00:00
Alex Richardson	bc596e5632	libalias: Fix -Wcast-align compiler warnings This fixes -Wcast-align warnings caused by the underaligned `struct ip`. This also silences them in the public functions by changing the function signature from char * to void *. This is source and binary compatible and avoids the -Wcast-align warning. Reviewed By: ae, gbe (manpages) Differential Revision: https://reviews.freebsd.org/D27882	2021-01-19 21:23:24 +00:00
Alexander V. Chernikov	f879876721	Fix IPv4 fib bsearch4() lookup array construction. Current code didn't properly handle the case with nested prefixes like 10.0.0.0/24 && 10.0.0.0/25.	2021-01-17 20:32:26 +00:00
Alexander V. Chernikov	81728a538d	Split rtinit() into multiple functions. rtinit[1]() is a function used to add or remove interface address prefix routes, similar to ifa_maintain_loopback_route(). It was intended to be family-agnostic. There is a problem with this approach in reality. 1) IPv6 code does not use it for the ifa routes. There is a separate layer, nd6_prelist_(), providing interface for maintaining interface routes. Its part, responsible for the actual route table interaction, mimics rtenty() code. 2) rtinit tries to combine multiple actions in the same function: constructing proper route attributes and handling iterations over multiple fibs, for the non-zero net.add_addr_allfibs use case. It notably increases the code complexity. 3) dstaddr handling. flags parameter re-uses RTF_ flags. As there is no special flag for p2p connections, host routes and p2p routes are handled in the same way. Additionally, mapping IFA flags to RTF flags makes the interface pretty messy. It make rtinit() to clash with ifa_mainain_loopback_route() for IPV4 interface aliases. 4) rtinit() is the last customer passing non-masked prefixes to rib_action(), complicating rib_action() implementation. 5) rtinit() coupled ifa announce/withdrawal notifications, producing "false positive" ifa messages in certain corner cases. To address all these points, the following has been done: * rtinit() has been split into multiple functions: - Route attribute construction were moved to the per-address-family functions, dealing with (2), (3) and (4). - funnction providing net.add_addr_allfibs handling and route rtsock notificaions is the new routing table inteface. - rtsock ifa notificaion has been moved out as well. resulting set of funcion are only responsible for the actual route notifications. Side effects: * /32 alias does not result in interface routes (/32 route and "host" route) * RTF_PINNED is now set for IPv6 prefixes corresponding to the interface addresses Differential revision: https://reviews.freebsd.org/D28186	2021-01-16 22:42:41 +00:00
Michael Tuexen	d2b3ceddcc	tcp: add sysctl to tolerate TCP segments missing timestamps When timestamp support has been negotiated, TCP segements received without a timestamp should be discarded. However, there are broken TCP implementations (for example, stacks used by Omniswitch 63xx and 64xx models), which send TCP segments without timestamps although they negotiated timestamp support. This patch adds a sysctl variable which tolerates such TCP segments and allows to interoperate with broken stacks. Reviewed by: jtl@, rscheff@ Differential Revision: https://reviews.freebsd.org/D28142 Sponsored by: Netflix, Inc. PR: 252449 MFC after: 1 week	2021-01-14 19:28:25 +01:00
Michael Tuexen	cc3c34859e	tcp: fix handling of TCP RST segments missing timestamps A TCP RST segment should be processed even it is missing TCP timestamps. Reported by: dmgk@, kevans@ Reviewed by: rscheff@, dmgk@ Sponsored by: Netflix, Inc. MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D28143	2021-01-14 14:39:35 +01:00
Mateusz Guzik	6b3a9a0f3d	Convert remaining cap_rights_init users to cap_rights_init_one semantic patch: @@ expression rights, r; @@ - cap_rights_init(&rights, r) + cap_rights_init_one(&rights, r)	2021-01-12 13:16:10 +00:00
Alexander V. Chernikov	2defbe9f0e	Use rn_match instead of doing indirect calls in fib_algo. Relevant inet/inet6 code has the control over deciding what the RIB lookup function currently is. With that in mind, explicitly set it to the current value (rn_match) in the datapath lookups. This avoids cost on indirect call. Differential Revision: https://reviews.freebsd.org/D28066	2021-01-11 23:30:35 +00:00
Alexander V. Chernikov	0da3f8c98d	Bump amount of queued packets in for unresolved ARP/NDP entries to 16. Currently default behaviour is to keep only 1 packet per unresolved entry. Ability to queue more than one packet was added 10 years ago, in r215207, though the default value was kep intact. Things have changed since that time. Systems tend to initiate multiple connections at once for a variety of reasons. For example, recent kern/252278 bug report describe happy-eyeball DNS behaviour sending multiple requests to the DNS server. The primary driver for upper value for the queue length determination is memory consumption. Remote actors should not be able to easily exhaust local memory by sending packets to unresolved arp/ND entries. For now, bump value to 16 packets, to match Darwin implementation. The proper approach would be to switch the limit to calculate memory consumption instead of packet count and limit based on memory. We should MFC this with a variation of D22447. Reviewers: #manpages, #network, bz, emaste Reviewed By: emaste, gbe(doc), jilles(doc) MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D28068	2021-01-11 19:51:11 +00:00
Mark Johnston	501159696c	igmp: Avoid leaking mbuf when source validation fails PR: 252504 Submitted by: Panagiotis Tsolakos <panagiotis.tsolakos@gmail.com> MFC after: 3 days	2021-01-08 13:32:04 -05:00
Alexander V. Chernikov	d68cf57b7f	Refactor rt_addrmsg() and rt_routemsg(). Summary: * Refactor rt_addrmsg(): make V_rt_add_addr_allfibs decision locally. * Fix rt_routemsg() and multipath by accepting nexthop instead of interface pointer. * Refactor rtsock_routemsg(): avoid accessing rtentry fields directly. * Simplify in_addprefix() by moving prefix search to a separate function. Reviewers: #network Subscribers: imp, ae, bz Differential Revision: https://reviews.freebsd.org/D28011	2021-01-07 19:38:19 +00:00
Michael Tuexen	a7aa5eea4f	sctp: improve handling of aborted associations Don't clear a flag, when the structure already has been freed. Reported by: syzbot+07667d16c96779c737b4@syzkaller.appspotmail.com	2021-01-01 15:59:10 +01:00
Alexander V. Chernikov	f733d9701b	Fix default route handling in radix4_lockless algo. Improve nexthop debugging. Reported by: Florian Smeets <flo at smeets.xyz>	2020-12-26 22:51:02 +00:00
Alexander V. Chernikov	f5baf8bb12	Add modular fib lookup framework. This change introduces framework that allows to dynamically attach or detach longest prefix match (lpm) lookup algorithms to speed up datapath route tables lookups. Framework takes care of handling initial synchronisation, route subscription, nhop/nhop groups reference and indexing, dataplane attachments and fib instance algorithm setup/teardown. Framework features automatic algorithm selection, allowing for picking the best matching algorithm on-the-fly based on the amount of routes in the routing table. Currently framework code is guarded under FIB_ALGO config option. An idea is to enable it by default in the next couple of weeks. The following algorithms are provided by default: IPv4: * bsearch4 (lockless binary search in a special IP array), tailored for small-fib (<16 routes) * radix4_lockless (lockless immutable radix, re-created on every rtable change), tailored for small-fib (<1000 routes) * radix4 (base system radix backend) * dpdk_lpm4 (DPDK DIR24-8-based lookups), lockless datastrucure, optimized for large-fib (D27412) IPv6: * radix6_lockless (lockless immutable radix, re-created on every rtable change), tailed for small-fib (<1000 routes) * radix6 (base system radix backend) * dpdk_lpm6 (DPDK DIR24-8-based lookups), lockless datastrucure, optimized for large-fib (D27412) Performance changes: Micro benchmarks (I7-7660U, single-core lookups, 2048k dst, code in D27604): IPv4: 8 routes: radix4: ~20mpps radix4_lockless: ~24.8mpps bsearch4: ~69mpps dpdk_lpm4: ~67 mpps 700k routes: radix4_lockless: 3.3mpps dpdk_lpm4: 46mpps IPv6: 8 routes: radix6_lockless: ~20mpps dpdk_lpm6: ~70mpps 100k routes: radix6_lockless: 13.9mpps dpdk_lpm6: 57mpps Forwarding benchmarks: + 10-15% IPv4 forwarding performance (small-fib, bsearch4) + 25% IPv4 forwarding performance (full-view, dpdk_lpm4) + 20% IPv6 forwarding performance (full-view, dpdk_lpm6) Control: Framwork adds the following runtime sysctls: List algos * net.route.algo.inet.algo_list: bsearch4, radix4_lockless, radix4 * net.route.algo.inet6.algo_list: radix6_lockless, radix6, dpdk_lpm6 Debug level (7=LOG_DEBUG, per-route) net.route.algo.debug_level: 5 Algo selection (currently only for fib 0): net.route.algo.inet.algo: bsearch4 net.route.algo.inet6.algo: radix6_lockless Support for manually changing algos in non-default fib will be added soon. Some sysctl names will be changed in the near future. Differential Revision: https://reviews.freebsd.org/D27401	2020-12-25 11:33:17 +00:00
Michael Tuexen	0ec2ce0d32	Improve input validation for parameters in ASCONF and ASCONF-ACK chunks Thanks to Tolya Korniltsev for drawing my attention to this part of the code by reporting an issue for the userland stack.	2020-12-23 18:03:47 +01:00
Andrew Gallatin	a034518ac8	Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain In order to efficiently serve web traffic on a NUMA machine, one must avoid as many NUMA domain crossings as possible. With SO_REUSEPORT_LB, a number of workers can share a listen socket. However, even if a worker sets affinity to a core or set of cores on a NUMA domain, it will receive connections associated with all NUMA domains in the system. This will lead to cross-domain traffic when the server writes to the socket or calls sendfile(), and memory is allocated on the server's local NUMA node, but transmitted on the NUMA node associated with the TCP connection. Similarly, when the server reads from the socket, he will likely be reading memory allocated on the NUMA domain associated with the TCP connection. This change provides a new socket ioctl, TCP_REUSPORT_LB_NUMA. A server can now tell the kernel to filter traffic so that only incoming connections associated with the desired NUMA domain are given to the server. (Of course, in the case where there are no servers sharing the listen socket on some domain, then as a fallback, traffic will be hashed as normal to all servers sharing the listen socket regardless of domain). This allows a server to deal only with traffic that is local to its NUMA domain, and avoids cross-domain traffic in most cases. This patch, and a corresponding small patch to nginx to use TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted https media content from dual-socket Xeons with only 13% (as measured by pcm.x) cross domain traffic on the memory controller. Reviewed by: jhb, bz (earlier version), bcr (man page) Tested by: gonzo Sponsored by: Netfix Differential Revision: https://reviews.freebsd.org/D21636	2020-12-19 22:04:46 +00:00
Michael Tuexen	0066de1c4b	Harden the handling of outgoing streams in case of an restart or INIT collision. This avouds an out-of-bounce access in case the peer can break the cookie signature. Thanks to Felix Wilhelm from Google for reporting the issue. MFC after: 1 week	2020-12-13 23:51:51 +00:00
Michael Tuexen	aa6db9a045	Clean up more resouces of an existing SCTP association in case of a restart. This fixes a use-after-free scenario, which was reported by Felix Wilhelm from Google in case a peer is able to modify the cookie. However, this can also be triggered by an assciation restart under some specific conditions. MFC after: 1 week	2020-12-12 22:23:45 +00:00
Richard Scheffenegger	0e1d7c25c5	Add TCP feature Proportional Rate Reduction (PRR) - RFC6937 PRR improves loss recovery and avoids RTOs in a wide range of scenarios (ACK thinning) over regular SACK loss recovery. PRR is disabled by default, enable by net.inet.tcp.do_prr = 1. Performance may be impeded by token bucket rate policers at the bottleneck, where net.inet.tcp.do_prr_conservate = 1 should be enabled in addition. Submitted by: Aris Angelogiannopoulos Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D18892	2020-12-04 11:29:27 +00:00
Alexander V. Chernikov	d1d941c5b9	Remove RADIX_MPATH config option. ROUTE_MPATH is the new config option controlling new multipath routing implementation. Remove the last pieces of RADIX_MPATH-related code and the config option. Reviewed by: glebius Differential Revision: https://reviews.freebsd.org/D27244	2020-11-29 19:43:33 +00:00
Alexander V. Chernikov	b712e3e343	Refactor fib4/fib6 functions. No functional changes. * Make lookup path of fib<4\|6>_lookup_debugnet() separate functions (fib<46>_lookup_rt()). These will be used in the control plane code requiring unlocked radix operations and actual prefix pointer. * Make lookup part of fib<4\|6>_check_urpf() separate functions. This change simplifies the switch to alternative lookup implementations, which helps algorithmic lookups introduction. * While here, use static initializers for IPv4/IPv6 keys Differential Revision: https://reviews.freebsd.org/D27405	2020-11-29 13:41:49 +00:00
Michael Tuexen	75fcd27ac2	Fix two occurences of a typo in a comment introduced in r367530. Reported by: lstewart@ MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D27148	2020-11-23 10:13:56 +00:00
Alexander V. Chernikov	7511a63825	Refactor rib iterator functions. * Make rib_walk() order of arguments consistent with the rest of RIB api * Add rib_walk_ext() allowing to exec callback before/after iteration. * Rename rt_foreach_fib_walk_del -> rib_foreach_table_walk_del * Rename rt_forach_fib_walk -> rib_foreach_table_walk * Move rib_foreach_table_walk{_del} to route/route_helpers.c * Slightly refactor rib_foreach_table_walk{_del} to make the implementation consistent and prepare for upcoming iterator optimizations. Differential Revision: https://reviews.freebsd.org/D27219	2020-11-22 20:21:10 +00:00
Michael Tuexen	47384244f9	Fix an issue I introuced in r367530: tcp_twcheck() can be called with to == NULL for SYN segments. So don't assume tp != NULL. Thanks to jhb@ for reporting and suggesting a fix. PR: 250499 MFC after: 1 week XMFC-with: r367530 Sponsored by: Netflix, Inc.	2020-11-20 13:00:28 +00:00
Ed Maste	360d1232ab	ip_fastfwd: style(9) tidy for r367628 Discussed with: gnn MFC with: r367628	2020-11-13 18:25:07 +00:00
George V. Neville-Neil	d65d6d5aa9	Followup pointed out by ae@	2020-11-13 13:07:44 +00:00
George V. Neville-Neil	8ad114c082	An earlier commit effectively turned out the fast forwading path due to its lack of support for ICMP redirects. The following commit adds redirects to the fastforward path, again allowing for decent forwarding performance in the kernel. Reviewed by: ae, melifaro Sponsored by: Rubicon Communications, LLC (d/b/a "Netgate")	2020-11-12 21:58:47 +00:00
Michael Tuexen	283c76c7c3	RFC 7323 specifies that: * TCP segments without timestamps should be dropped when support for the timestamp option has been negotiated. * TCP segments with timestamps should be processed normally if support for the timestamp option has not been negotiated. This patch enforces the above. PR: 250499 Reviewed by: gnn, rrs MFC after: 1 week Sponsored by: Netflix, Inc Differential Revision: https://reviews.freebsd.org/D27148	2020-11-09 21:49:40 +00:00
Michael Tuexen	e597bae4ee	Fix a potential use-after-free bug introduced in https://svnweb.freebsd.org/changeset/base/363046 Thanks to Taylor Brandstetter for finding this issue using fuzz testing and reporting it in https://github.com/sctplab/usrsctp/issues/547	2020-11-09 13:12:07 +00:00
Mitchell Horne	b02c4e5c78	igmp: convert igmpstat to use PCPU counters Currently there is no locking done to protect this structure. It is likely okay due to the low-volume nature of IGMP, but allows for the possibility of underflow. This appears to be one of the only holdouts of the conversion to counter(9) which was done for most protocol stat structures around 2013. This also updates the visibility of this stats structure so that it can be consumed from elsewhere in the kernel, consistent with the vast majority of VNET_PCPUSTAT structures. Reviewed by: kp Sponsored by: NetApp, Inc. Sponsored by: Klara, Inc. Differential Revision: https://reviews.freebsd.org/D27023	2020-11-08 18:49:23 +00:00
Richard Scheffenegger	4d0770f172	Prevent premature SACK block transmission during loss recovery Under specific conditions, a window update can be sent with outdated SACK information. Some clients react to this by subsequently delaying loss recovery, making TCP perform very poorly. Reported by: chengc_netapp.com Reviewed by: rrs, jtl MFC after: 2 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D24237	2020-11-08 18:47:05 +00:00

... 3 4 5 6 7 ...

7202 Commits