freebsd-nq

Author	SHA1	Message	Date
Michael Tuexen	fc0eb7637c	Fix division by zero issue. Thanks to Stas Denisov for reporting the issue for the userland stack and providing a fix. MFC after: 3 days	2020-01-12 15:45:27 +00:00
Mateusz Guzik	879e0604ee	Add KERNEL_PANICKED macro for use in place of direct panicstr tests	2020-01-12 06:07:54 +00:00
Alexander V. Chernikov	ead85fe415	Add fibnum, family and vnet pointer to each rib head. Having metadata such as fibnum or vnet in the struct rib_head is handy as it eases building functionality in the routing space. This change is required to properly bring back route redirect support. Reviewed by: bz MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D23047	2020-01-09 17:21:00 +00:00
Bjoern A. Zeeb	334fc5822b	vnet: virtualise more network stack sysctls. Virtualise tcp_always_keepalive, TCP and UDP log_in_vain. All three are set in the netoptions startup script, which we would love to run for VNETs as well [1]. While virtualising the log_in_vain sysctls seems pointles at first for as long as the kernel message buffer is not virtualised, it at least allows an administrator to debug the base system or an individual jail if needed without turning the logging on for all jails running on a system. PR: 243193 [1] MFC after: 2 weeks	2020-01-08 23:30:26 +00:00
Ed Maste	ee92463aca	Do not define TCPOUTFLAGS in rack_bbr_common tcp_outflags isn't used in this source file and compilation failed with external GCC on sparc64. I'm not sure why only that case failed (perhaps inconsistent -Werror config) but it is a legitimate issue to fix. Reviewed by: tuexen Differential Revision: https://reviews.freebsd.org/D23068	2020-01-07 17:57:08 +00:00
Randall Stewart	4ad2473790	This catches rack up in the recent changes to ECN and also commonizes the functions that both the freebsd and rack stack uses. Sponsored by:Netflix Inc Differential Revision: https://reviews.freebsd.org/D23052	2020-01-06 15:29:14 +00:00
Randall Stewart	a9a08eced6	This change adds a small feature to the tcp logging code. Basically a connection can now have a separate tag added to the id. Obtained from: Lawrence Stewart Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D22866	2020-01-06 12:48:06 +00:00
Michael Tuexen	97a8ab398e	Don't make the sendall iterator as being up if it could not be started. MFC after: 1 week	2020-01-05 14:08:01 +00:00
Michael Tuexen	4b66d476b3	Return -1 consistently if an error occurs. MFC after: 1 week	2020-01-05 14:06:40 +00:00
Michael Tuexen	397b1c945f	Ensure that we don't miss a trigger for kicking off the SCTP iterator. Reported by: nwhitehorn@ MFC after: 1 week	2020-01-05 13:56:32 +00:00
Michael Tuexen	ae7cc6c9f8	Make the message size limit used for SCTP_SENDALL configurable via a sysctl variable instead of a compiled in constant. This is based on a patch provided by nwhitehorn@.	2020-01-04 20:33:12 +00:00
Mark Johnston	31069f383a	Take the ifnet's address lock in igmp_v3_cancel_link_timers(). inm_rele_locked() may remove the multicast address associated with inm. Reported by: syzbot+871c5d1fd5fac6c28f52@syzkaller.appspotmail.com Reviewed by: hselasky MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D23009	2020-01-03 17:03:10 +00:00
Michael Tuexen	0eeb0d180e	Remove empty line which was added in r356270 by accident. MFC after: 1 week	2020-01-02 14:04:16 +00:00
Michael Tuexen	ac1d75d23a	Improve input validation of the spp_pathmtu field in the SCTP_PEER_ADDR_PARAMS socket option. The code in the stack assumes sane values for the MTU. This issue was found by running an instance of syzkaller. MFC after: 1 week	2020-01-02 13:55:10 +00:00
Gleb Smirnoff	8d5c56dab1	In r343631 error code for a packet blocked by a firewall was changed from EACCES to EPERM. This change was not intentional, so fix that. Return EACCESS if a firewall forbids sending. Noticed by: ae	2020-01-01 17:31:43 +00:00
Alexander Motin	8c3fbf3c20	Relax locking of carp_forus(). This fixes deadlock between CARP and bridge. Bridge calls this function taking CARP lock while holding bridge lock. Same time CARP tries to send its announcements via the bridge while holding CARP lock. Use of CARP_LOCK() here does not solve anything, since sc_addr is constant while race on sc_state is harmless and use of the lock does not close it. Reviewed by: glebius MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2019-12-31 18:58:29 +00:00
Michael Tuexen	e11c9783e1	Fix delayed ACK generation for DCTCP. Submitted by: Richard Scheffenegger Reviewed by: chengc@netapp.com, rgrimes@, tuexen@ Differential Revision: https://reviews.freebsd.org/D22644	2019-12-31 16:15:47 +00:00
Michael Tuexen	493c98c6d2	Add flags for upcoming patches related to improved ECN handling. No functional change. Submitted by: Richard Scheffenegger Reviewed by: rgrimes@, tuexen@ Differential Revision: https://reviews.freebsd.org/D22429	2019-12-31 14:32:48 +00:00
Michael Tuexen	83a2839fb9	Clear the flag indicating that the last received packet was marked CE also in the case where a packet not marked was received. Submitted by: Richard Scheffenegger Reviewed by: rgrimes@, tuexen@ MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D19143	2019-12-31 14:23:52 +00:00
Michael Tuexen	7d87664a04	Add curly braces missed in https://svnweb.freebsd.org/changeset/base/354773 Sponsored by: Netflix, Inc. CID: 1407649	2019-12-31 12:29:01 +00:00
Michael Tuexen	6088175a18	Improve input validation for some parameters having a too small reported length. Thanks to Natalie Silvanovich from Google for finding one of these issues in the SCTP userland stack and reporting it. MFC after: 1 week	2019-12-20 15:25:08 +00:00
Alexander V. Chernikov	bdb214a4a4	Remove useless code from in6_rmx.c The code in questions walks IPv6 tree every 60 seconds and looks into the routes with non-zero expiration time (typically, redirected routes). For each such route it sets RTF_PROBEMTU flag at the expiration time. No other part of the kernel checks for RTF_PROBEMTU flag. RTF_PROBEMTU was defined 21 years ago, 30 Jun 1999, as RTF_PROTO1. RTF_PROTO1 is a de-facto standard indication of a route installed by a routing daemon for a last decade. Reviewed by: bz, ae MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D22865	2019-12-18 22:10:56 +00:00
Hans Petter Selasky	a4c5668d12	Leave multicast group before reaping and committing state for both IPv4 and IPv6. This fixes a regression issue after r349369. When trying to exit a multicast group before closing the socket, a multicast leave packet should be sent. Differential Revision: https://reviews.freebsd.org/D22848 PR: 242677 Reviewed by: bz (network) Tested by: Aleksandr Fedorov <aleksandr.fedorov@itglobal.com> MFC after: 1 week Sponsored by: Mellanox Technologies	2019-12-18 12:06:34 +00:00
Randall Stewart	1cf55767b8	This commit is a bit of a re-arrange of deck chairs. It gets both rack and bbr ready for the completion of the STATs framework in FreeBSD. For now if you don't have both NF_stats and stats on it disables them. As soon as the rest of the stats framework lands we can remove that restriction and then just uses stats when defined. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D22479	2019-12-17 16:08:07 +00:00
John Baldwin	5773ac113c	Use callout_func_t instead of the deprecated timeout_t. Reviewed by: kib, imp Differential Revision: https://reviews.freebsd.org/D22752	2019-12-10 22:06:53 +00:00
Bjoern A. Zeeb	646e34881c	Remove the extra epoch tracker change sneaked into r355449 and was not part of the originally reviewed or described change. Pointyhat to: bz Reported by: glebius	2019-12-06 22:20:26 +00:00
Bjoern A. Zeeb	dad68fc301	carp: replace caddr_t with char * Change the remaining caddr_t usages to char * following the removal of the KAME macros No functional change. Requested by: glebius Reviewed by: glebius MFC after: 2 weeks Sponsored by: Netflix (originally) Differential Revision: https://reviews.freebsd.org/D22399	2019-12-06 16:35:48 +00:00
Gleb Smirnoff	e5a084d020	Fix regression from r354484. Don't leak pcb lock if cr_canseeinpcb() returns non-zero. PR: 242415	2019-12-04 22:41:52 +00:00
Bjoern A. Zeeb	0700d2c3f0	Make icmp6_reflect() static. icmp6_reflect() is not used anywhere outside icmp6.c, no reason to export it. Sponsored by: Netflix	2019-12-03 14:46:38 +00:00
Hans Petter Selasky	5b64b824b9	Use refcount from "in_joingroup_locked()" when joining multicast groups. Do not acquire additional references. This makes the IPv4 IGMP code in line with the IPv6 MLD code. Background: The IPv4 multicast code puts an extra reference on the in_multi struct when joining groups. This becomes visible when using daemons like igmpproxy from ports, that multicast entries do not disappear from the output of ifmcstat(8) when multicast streams are disconnected. This fixes a regression issue after r349762. While at it factor the ip_mfilter_insert() and ip6_mfilter_insert() calls to avoid repeated "is_new" check. Differential Revision: https://reviews.freebsd.org/D22595 Tested by: Guido van Rooij <guido@gvr.org> Reviewed by: rgrimes (network) MFC after: 1 week Sponsored by: Mellanox Technologies	2019-12-03 08:46:59 +00:00
Edward Tomasz Napierala	adc56f5a38	Make use of the stats(3) framework in the TCP stack. This makes it possible to retrieve per-connection statistical information such as the receive window size, RTT, or goodput, using a newly added TCP_STATS getsockopt(3) option, and extract them using the stats_voistat_fetch(3) API. See the net/tcprtt port for an example consumer of this API. Compared to the existing TCP_INFO system, the main differences are that this mechanism is easy to extend without breaking ABI, and provides statistical information instead of raw "snapshots" of values at a given point in time. stats(3) is more generic and can be used in both userland and the kernel. Reviewed by: thj Tested by: thj Obtained from: Netflix Relnotes: yes Sponsored by: Klara Inc, Netflix Differential Revision: https://reviews.freebsd.org/D20655	2019-12-02 20:58:04 +00:00
Michael Tuexen	3cf38784e2	Move all ECN related flags from the flags to the flags2 field. This allows adding more ECN related flags in the future. No functional change intended. Submitted by: Richard Scheffenegger Reviewed by: rrs@, tuexen@ Differential Revision: https://reviews.freebsd.org/D22497	2019-12-01 21:01:33 +00:00
Michael Tuexen	77aabfd94f	Make the TF_* flags easier readable by humans by adding leading zeroes to make them aligned. Submitted by: Richard Scheffenegger Reviewed by: rgrimes@, rrs@, tuexen@ Differential Revision: https://reviews.freebsd.org/D22428	2019-12-01 20:45:48 +00:00
Michael Tuexen	b72e56e758	This is an initial step in implementing the new congestion window validation as specified in RFC 7661. Submitted by: Richard Scheffenegger Reviewed by: rrs@, tuexen@ Differential Revision: https://reviews.freebsd.org/D21798	2019-12-01 20:35:41 +00:00
Michael Tuexen	8df12ffcc2	Make the IPTOS value available to all substate handlers. This will allow to add support for L4S or SCE, which require processing of the IP TOS field. Submitted by: Richard Scheffenegger Reviewed by: rgrimes@, rrs@, tuexen@ Differential Revision: https://reviews.freebsd.org/D22426	2019-12-01 18:47:53 +00:00
Michael Tuexen	fa49a96419	In order for the TCP Handshake to support ECN++, and further ECN-related improvements, the ECN bits need to be exposed to the TCP SYNcache. This change is a minimal modification to the function headers, without any functional change intended. Submitted by: Richard Scheffenegger Reviewed by: rgrimes@, rrs@, tuexen@ Differential Revision: https://reviews.freebsd.org/D22436	2019-12-01 18:05:02 +00:00
Michael Tuexen	669a285ffb	When changing the MTU of an SCTP path, not only cancel all ongoing RTT measurements, but also scheldule new ones for the future. Submitted by: Julius Flohr MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D22547	2019-12-01 17:35:36 +00:00
Bjoern A. Zeeb	a4adf6cc65	Fix m_pullup() problem after removing PULLDOWN_TESTs and KAME EXT_*macros. r354748-354750 replaced the KAME macros with m_pulldown() calls. Contrary to the rest of the network stack m_len checks before m_pulldown() were not put in placed (see r354748). Put these m_len checks in place for now (to go along with the style of the network stack since the initial commits). These are not put in for performance but to avoid an error scenario (even though it also will help performance at the moment as it avoid allocating an extra mbuf; not because of the unconditional function call). The observed error case went like this: (1) an mbuf with M_EXT arrives and we call m_pullup() unconditionally on it. (2) m_pullup() will call m_get() unless the requested length is larger than MHLEN (in which case it'll m_freem() the perfectly fine mbuf) and migrate the requested length of data and pkthdr into the new mbuf. (3) If m_get() succeeds, a further m_pullup() call going over MHLEN will fail. This was observed with failing auto-configuration as an RA packet of 200 bytes exceeded MHLEN and the m_pullup() called from nd6_ra_input() dropped the mbuf. (Re-)adding the m_len checks before m_pullup() calls avoids this problems with mbufs using external storage for now. MFC after: 3 weeks Sponsored by: Netflix	2019-12-01 00:22:04 +00:00
Michael Tuexen	f727fee546	Really ignore the SCTP association identifier on 1-to-1 style sockets as requiresd by the socket API specification. Thanks to Inaki Baz Castillo, who found this bug running the userland stack with valgrind and reported the issue in https://github.com/sctplab/usrsctp/issues/408 MFC after: 1 week	2019-11-28 12:50:25 +00:00
Michael Tuexen	645f3a1cd1	Plug two mbuf leaks during INIT-ACK handling. One leak happens when there is not enough memory to allocate the the resources for streams. The other leak happens if the are unknown parameters in the received INIT-ACK chunk which require reporting and the INIT-ACK requires sending an ABORT due to illegal parameter combinations. Hopefully this fixes https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=19083 MFC after: 1 week	2019-11-27 19:32:29 +00:00
Ryan Libby	200f3ac6f7	in_mcast.c: need if_addr_lock around inm_release_deferred Apply a similar fix as for in6_mcast.c. Reviewed by: hselasky Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D20740	2019-11-25 22:25:34 +00:00
Bjoern A. Zeeb	1a117215c7	Reduce the vnet_set module size of ip_mroute to allow loading as a module. With VIMAGE kernels modules get special treatment as they need to also keep the original values and make copies for each instance. For that a few pages of vnet modspace are provided and the kernel-linker and the VNET framework know how to deal with things. When the modspace is (almost) full, other modules which would overflow the modspace cannot be loaded and kldload will fail. ip_mroute uses a lot of variable space, mostly be four big arrays: set_vnet 0000000000000510 vnet_entry_multicast_register_if set_vnet 0000000000000700 vnet_entry_viftable set_vnet 0000000000002000 vnet_entry_bw_meter_timers set_vnet 0000000000002800 vnet_entry_bw_upcalls Dynamically malloc the three big ones for each instance we need and free them again on vnet teardown (the 4th is an ifnet). That way they only need module space for a single pointer and allow a lot more modules using virtualized variables to be loaded on a VNET kernel. PR: 206583 Reviewed by: hselasky, kp MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D22443	2019-11-19 15:38:55 +00:00
Michael Tuexen	c968c769af	Add boundary and overflow checks to the formulas used in the TCP CUBIC congestion control module. Submitted by: Richard Scheffenegger Reviewed by: rgrimes@ Differential Revision: https://reviews.freebsd.org/D19118	2019-11-16 12:00:22 +00:00
Michael Tuexen	b0c1a13e4e	Improve TCP CUBIC specific after idle reaction. The adjustments are inspired by the Linux stack, which has had a functionally equivalent implementation for more than a decade now. Submitted by: Richard Scheffenegger Reviewed by: Cheng Cui Differential Revision: https://reviews.freebsd.org/D18982	2019-11-16 11:57:12 +00:00
Michael Tuexen	35cd141b4b	Implement a tCP CUBIC-specific after idle reaction. This patch addresses a very common case of frequent application stalls, where TCP runs idle and looses the state of the network. Submitted by: Richard Scheffenegger Reviewed by: Cheng Cui Differential Revision: https://reviews.freebsd.org/D18954	2019-11-16 11:37:26 +00:00
Michael Tuexen	453e633384	Revert https://svnweb.freebsd.org/changeset/base/354708 I used the wrong Differential Revision, so back it out and do it right in a follow-up commit.	2019-11-16 11:10:09 +00:00
Bjoern A. Zeeb	b141dd5ddf	Remove now unused IPv6 macros and update docs. After r354748-354750 all uses of the IP6_EXTHDR_CHECK() and IP6_EXTHDR_GET() macros are gone from the kernel. IP6_EXTHDR_GET0() was unused. Remove the macros and update the documentation. Sponsored by: Netflix	2019-11-15 21:55:41 +00:00
Bjoern A. Zeeb	4e619b17c5	IP6_EXTHDR_CHECK(): remove the last instances While r354748 removed almost all IP6_EXTHDR_CHECK() calls, these are not part of the PULLDOWN_TESTS. Equally convert these IP6_EXTHDR_CHECK()s here to m_pullup() and remove the extra check and m_pullup() in tcp_input() under isipv6 given tcp6_input() has done exactly that pullup already. MFC after: 8 weeks Sponsored by: Netflix	2019-11-15 21:51:43 +00:00
Bjoern A. Zeeb	63abacc204	netinet*: replace IP6_EXTHDR_GET() In a few places we have IP6_EXTHDR_GET() left in upper layer protocols. The IP6_EXTHDR_GET() macro might perform an m_pulldown() in case the data fragment is not contiguous. Convert these last remaining instances into m_pullup()s instead. In CARP, for example, we will a few lines later call m_pullup() anyway, the IPsec code coming from OpenBSD would otherwise have done the m_pullup() and are copying the data a bit later anyway, so pulling it in seems no better or worse. Note: this leaves very few m_pulldown() cases behind in the tree and we might want to consider removing them as well to make mbuf management easier again on a path to variable size mbufs, especially given m_pulldown() still has an issue not re-checking M_WRITEABLE(). Reviewed by: gallatin MFC after: 8 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D22335	2019-11-15 21:44:17 +00:00
Michael Tuexen	730cbbc10d	For idle TCP sessions using the CUBIC congestio control, reset ssthresh to the higher of the previous ssthresh or 3/4 of the prior cwnd. Submitted by: Richard Scheffenegger Reviewed by: Cheng Cui Differential Revision: https://reviews.freebsd.org/D18982	2019-11-14 16:28:02 +00:00
Bjoern A. Zeeb	a8fe77d877	netinet: update mp to pass the proper value back In ip6_[direct_]input() we are looping over the extension headers to deal with the next header. We pass a pointer to an mbuf pointer to the handling functions. In certain cases the mbuf can be updated there and we need to pass the new one back. That missing in dest6_input() and route6_input(). In tcp6_input() we should also update it before we call tcp_input(). In addition to that mark the mbuf NULL all the times when we return that we are done with handling the packet and no next header should be checked (IPPROTO_DONE). This will eventually allow us to assert proper behaviour and catch the above kind of errors more easily, expecting *mp to always be set. This change is extracted from a larger patch and not an exhaustive change across the entire stack yet. PR: 240135 Reported by: prabhakar.lakhera gmail.com MFC after: 3 weeks Sponsored by: Netflix	2019-11-12 15:46:28 +00:00
Gleb Smirnoff	273b2e4c55	Remove now unused INP_INFO_RLOCK macros.	2019-11-07 22:26:54 +00:00
Gleb Smirnoff	43e8b279b8	In TCP HPTS enter the epoch in tcp_hpts_thread() and assert it in the leaf functions.	2019-11-07 21:30:27 +00:00
Gleb Smirnoff	aed553598d	Mechanically convert INP_INFO_RLOCK() to NET_EPOCH_ENTER() in TCP timewait manipulation leaf functions.	2019-11-07 21:29:38 +00:00
Gleb Smirnoff	a81e3ecf46	Since pfslowtimo() runs in the network epoch, tcp_slowtimo() also does. This allows to simplify tcp_tw_2msl_scan() and always require the network epoch in it.	2019-11-07 21:28:46 +00:00
Gleb Smirnoff	032677ceb5	Now that there is no R/W lock on PCB list the pcblist sysctls handlers can be greatly simplified. All the previous double cycling and complex locking was added to avoid these functions holding global PCB locks for extended period of time, preventing addition of new entries.	2019-11-07 21:27:32 +00:00
Gleb Smirnoff	d40c0d47cd	Now that all of the tcp_input() and all its branches are executed in the network epoch, we can greatly simplify synchronization. Remove all unneccesary epoch enters hidden under INP_INFO_RLOCK macro. Remove some unneccesary assertions and convert necessary ones into the NET_EPOCH_ASSERT macro.	2019-11-07 21:23:07 +00:00
Gleb Smirnoff	80577e5583	Remove unnecessary recursive epoch enter via INP_INFO_RLOCK macro in udp_input(). It shall always run in the network epoch.	2019-11-07 21:08:49 +00:00
Gleb Smirnoff	5015a05f0b	Remove now unused INP_HASH_RLOCK() macros.	2019-11-07 21:03:15 +00:00
Gleb Smirnoff	2435e507de	Now with epoch synchronized PCB lookup tables we can greatly simplify locking in udp_output() and udp6_output(). First, we select if we need read or write lock in PCB itself, we take the lock and enter network epoch. Then, we proceed for the rest of the function. In case if we need to modify PCB hash, we would take write lock on it for a short piece of code. We could exit the epoch before allocating an mbuf, but with this patch we are keeping it all the way into ip_output()/ip6_output(). Today this creates an epoch recursion, since ip_output() enters epoch itself. However, once all protocols are reviewed, ip_output() and ip6_output() would require epoch instead of entering it. Note: I'm not 100% sure that in udp6_output() the epoch is required. We don't do PCB hash lookup for a bound socket. And all branches of in6_select_src() don't require epoch, at least they lack assertions. Today inet6 address list is protected by rmlock, although it is CKLIST. AFAIU, the future plan is to protect it by network epoch. That would require epoch in in6_select_src(). Anyway, in future ip6_output() would require epoch, udp6_output() would need to enter it.	2019-11-07 21:01:36 +00:00
Gleb Smirnoff	5a1264335d	Add INP_UNLOCK() which will do whatever R/W unlock is required.	2019-11-07 20:57:51 +00:00
Gleb Smirnoff	d797164a86	Since r353292 on input path we are always in network epoch, when we lookup PCBs. Thus, do not enter epoch recursively in in_pcblookup_hash() and in6_pcblookup_hash(). Same applies to tcp_ctlinput() and tcp6_ctlinput(). This leaves several sysctl(9) handlers that return PCB credentials unprotected. Add epoch enter/exit to all of them. Differential Revision: https://reviews.freebsd.org/D22197	2019-11-07 20:49:56 +00:00
Gleb Smirnoff	de537d63c2	Remove unnecessary recursive epoch enter via INP_INFO_RLOCK macro in divert_packet(). This function is called only from pfil(9) filters, which in their place always run in the network epoch.	2019-11-07 20:44:34 +00:00
Gleb Smirnoff	f42347c39a	Remove unnecessary recursive epoch enter via INP_INFO_RLOCK macro in raw input functions for IPv4 and IPv6. They shall always run in the network epoch.	2019-11-07 20:40:44 +00:00
Bjoern A. Zeeb	503f4e4736	netinet*: variable cleanup In preparation for another change factor out various variable cleanups. These mainly include: (1) do not assign values to variables during declaration: this makes the code more readable and does allow for better grouping of variable declarations, (2) do not assign values to variables before need; e.g., if a variable is only used in the 2nd half of a function and we have multiple return paths before that, then do not set it before it is needed, and (3) try to avoid assigning the same value multiple times. MFC after: 3 weeks Sponsored by: Netflix	2019-11-07 18:29:51 +00:00
Gleb Smirnoff	58d94bd0d9	TCP timers are executed in callout context, so they need to enter network epoch to look into PCB lists. Mechanically convert INP_INFO_RLOCK() to NET_EPOCH_ENTER(). No functional change here.	2019-11-07 00:27:23 +00:00
Gleb Smirnoff	97a95ee134	Mechanically convert INP_INFO_RLOCK() to NET_EPOCH_ENTER() in TCP functions that are executed in syscall context. No functional change here.	2019-11-07 00:10:14 +00:00
Gleb Smirnoff	1a49612526	Mechanically convert INP_INFO_RLOCK() to NET_EPOCH_ENTER(). Remove few outdated comments and extraneous assertions. No functional change here.	2019-11-07 00:08:34 +00:00
Bjoern A. Zeeb	6e6b5143f5	Properly set VNET when nuking recvif from fragment queues. In theory the eventhandler invoke should be in the same VNET as the the current interface. We however cannot guarantee that for all cases in the future. So before checking if the fragmentation handling for this VNET is active, switch the VNET to the VNET of the interface to always get the one we want. Reviewed by: hselasky MFC after: 3 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D22153	2019-10-25 18:54:06 +00:00
Michael Tuexen	4a91aa8fc9	Ensure that the flags indicating IPv4/IPv6 are not changed by failing bind() calls. This would lead to inconsistent state resulting in a panic. A fix for stable/11 was committed in https://svnweb.freebsd.org/base?view=revision&revision=338986 An accelerated MFC is planned as discussed with emaste@. Reported by: syzbot+2609a378d89264ff5a42@syzkaller.appspotmail.com Obtained from: jtl@ MFC after: 1 day Sponsored by: Netflix, Inc.	2019-10-24 20:05:10 +00:00
Michael Tuexen	9f36ec8bba	Store a handle for the event handler. This will be used when unloading the SCTP as a module. Obtained from: markj@	2019-10-24 09:22:23 +00:00
Randall Stewart	9992c365b6	Fix a small bug in bbr when running under a VM. Basically what happens is we are more delayed in the pacer calling in so we remove the stack from the pacer and recalculate how much time is left after all data has been acknowledged. However the comparision was backwards so we end up with a negative value in the last_pacing_delay time which causes us to add in a huge value to the next pacing time thus stalling the connection. Reported by: vm2.finance@gmail.com	2019-10-24 05:54:30 +00:00
Michael Tuexen	4ad8cb6813	Fix compile issues when building a kernel without the VIMAGE option. Thanks to cem@ for discussing the issue which resulted in this patch. Reviewed by: cem@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D22089	2019-10-19 20:48:53 +00:00
Conrad Meyer	e9c6962599	debugnet(4): Add optional full-duplex mode It remains unattached to any client protocol. Netdump is unaffected (remaining half-duplex). The intended consumer is NetGDB. Submitted by: John Reimer <john.reimer AT emc.com> (earlier version) Discussed with: markj Differential Revision: https://reviews.freebsd.org/D21541	2019-10-17 20:25:15 +00:00
Conrad Meyer	fde2cf65ce	debugnet(4): Infer non-server connection parameters Loosen requirements for connecting to debugnet-type servers. Only require a destination address; the rest can theoretically be inferred from the routing table. Relax corresponding constraints in netdump(4) and move ifp validation to debugnet connection time. Submitted by: John Reimer <john.reimer AT emc.com> (earlier version) Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D21482	2019-10-17 20:10:32 +00:00
Conrad Meyer	8270d35eca	Add ddb(4) 'netdump' command to netdump a core without preconfiguration Add a 'X -s <server> -c <client> [-g <gateway>] -i <interface>' subroutine to the generic debugnet code. The imagined use is both netdump, shown here, and NetGDB (vaporware). It uses the ddb(4) lexer, with some new extensions, to parse out IPv4 addresses. 'Netdump' uses the generic debugnet routine to load a configuration and start a dump, without any netdump configuration prior to panic. Loosely derived from work by: John Reimer <john.reimer AT emc.com> Reviewed by: markj Differential Revision: https://reviews.freebsd.org/D21460	2019-10-17 19:49:20 +00:00
Gleb Smirnoff	c0ebee428e	Quickly fix up r353683: enter the epoch before calling into netisr_dispatch().	2019-10-17 17:02:50 +00:00
Conrad Meyer	7790c8c199	Split out a more generic debugnet(4) from netdump(4) Debugnet is a simplistic and specialized panic- or debug-time reliable datagram transport. It can drive a single connection at a time and is currently unidirectional (debug/panic machine transmit to remote server only). It is mostly a verbatim code lift from netdump(4). Netdump(4) remains the only consumer (until the rest of this patch series lands). The INET-specific logic has been extracted somewhat more thoroughly than previously in netdump(4), into debugnet_inet.c. UDP-layer logic and up, as much as possible as is protocol-independent, remains in debugnet.c. The separation is not perfect and future improvement is welcome. Supporting INET6 is a long-term goal. Much of the diff is "gratuitous" renaming from 'netdump_' or 'nd_' to 'debugnet_' or 'dn_' -- sorry. I thought keeping the netdump name on the generic module would be more confusing than the refactoring. The only functional change here is the mbuf allocation / tracking. Instead of initiating solely on netdump-configured interface(s) at dumpon(8) configuration time, we watch for any debugnet-enabled NIC for link activation and query it for mbuf parameters at that time. If they exceed the existing high-water mark allocation, we re-allocate and track the new high-water mark. Otherwise, we leave the pre-panic mbuf allocation alone. In a future patch in this series, this will allow initiating netdump from panic ddb(4) without pre-panic configuration. No other functional change intended. Reviewed by: markj (earlier version) Some discussion with: emaste, jhb Objection from: marius Differential Revision: https://reviews.freebsd.org/D21421	2019-10-17 16:23:03 +00:00
Gleb Smirnoff	756368b68b	igmp_v1v2_queue_report() doesn't require epoch.	2019-10-17 16:02:34 +00:00
Hans Petter Selasky	a55383e720	Fix panic in network stack due to use after free when receiving partial fragmented packets before a network interface is detached. When sending IPv4 or IPv6 fragmented packets and a fragment is lost before the network device is freed, the mbuf making up the fragment will remain in the temporary hashed fragment list and cause a panic when it times out due to accessing a freed network interface structure. 1) Make sure the m_pkthdr.rcvif always points to a valid network interface. Else the rcvif field should be set to NULL. 2) Use the rcvif of the last received fragment as m_pkthdr.rcvif for the fully defragged packet, instead of the first received fragment. Panic backtrace for IPv6: panic() icmp6_reflect() # tries to access rcvif->if_afdata[AF_INET6]->xxx icmp6_error() frag6_freef() frag6_slowtimo() pfslowtimo() softclock_call_cc() softclock() ithread_loop() Reviewed by: bz Differential Revision: https://reviews.freebsd.org/D19622 MFC after: 1 week Sponsored by: Mellanox Technologies	2019-10-16 09:11:49 +00:00
Michael Tuexen	776cd558f0	Separate out SCTP related dtrace code. This is based on work done by markj@. Discussed with: markj@ MFC after: 3 days	2019-10-14 20:32:11 +00:00
Randall Stewart	8ee1cf039e	if_hw_tsomaxsegsize needs to be initialized to zero, just like in bbr.c and tcp_output.c	2019-10-14 13:10:29 +00:00
Michael Tuexen	fcfd8ad537	Rename sctp_dtrace_declare.h to sctp_kdtrace.h for consistentcy. MFC after: 3 days	2019-10-14 13:02:49 +00:00
Gleb Smirnoff	5df91cbe02	Revert r353313. It is not needed with r353357 and is actually incorrect.	2019-10-14 04:10:00 +00:00
Michael Tuexen	d6e23cf0cf	Use an event handler to notify the SCTP about IP address changes instead of calling an SCTP specific function from the IP code. This is a requirement of supporting SCTP as a kernel loadable module. This patch was developed by markj@, I tweaked a bit the SCTP related code. Submitted by: markj@ MFC after: 3 days	2019-10-13 18:17:08 +00:00
Mark Johnston	671d68fad9	Move SCTP DTrace probe definitions into a .c file. Previously they were defined in a header which was included exactly once. Change this to follow the usual practice of putting definitions in C files. No functional change intended. Discussed with: tuexen MFC after: 1 week Sponsored by: The FreeBSD Foundation	2019-10-13 16:14:04 +00:00
Michael Tuexen	1b6ddd9404	Ensure that local variables are reset to their initial value when dealing with error cases in a loop over all remote addresses. This issue was found and reported by OSS_Fuzz in: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=18080 https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=18086 https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=18121 https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=18163 MFC after: 3 days	2019-10-12 17:57:03 +00:00
Gleb Smirnoff	1e5db73d07	The divert(4) module must always be running in network epoch, thus call to if_addr_rlock() isn't needed.	2019-10-10 23:48:42 +00:00
Warner Losh	b23b156e2e	Fix casting error from newer gcc Cast the pointers to (uintptr_t) before assigning to type uint64_t. This eliminates an error from gcc when we cast the pointer to a larger integer type.	2019-10-09 21:02:06 +00:00
Hans Petter Selasky	eabddb25a3	Factor out TCP rateset destruction code. Ensure the epoch_call() function is not called more than one time before the callback has been executed, by always checking the RS_FUNERAL_SCHD flag before invoking epoch_call(). The "rs_number_dead" is balanced again after r353353. Discussed with: rrs@ Sponsored by: Mellanox Technologies	2019-10-09 17:08:40 +00:00
Gleb Smirnoff	0732ac0eff	Revert most of the multicast changes from r353292. This needs a more accurate approach.	2019-10-09 17:03:20 +00:00
Hans Petter Selasky	24be13533b	Fix locking order reversal in the TCP ratelimit code by moving destructors outside the rsmtx mutex. Witness message: lock order reversal: (sleepable after non-sleepable) 1st tcp_rs_mtx (rsmtx) @ sys/netinet/tcp_ratelimit.c:242 2nd sysctl lock (sysctl lock) @ sys/kern/kern_sysctl.c:607 Backtrace: witness_debugger witness_checkorder _rm_wlock_debug sysctl_ctx_free rs_destroy epoch_call_task gtaskqueue_run_locked gtaskqueue_thread_loop Discussed with: rrs@ Sponsored by: Mellanox Technologies	2019-10-09 16:48:48 +00:00
John Baldwin	9e14430d46	Add a TOE KTLS mode and a TOE hook for allocating TLS sessions. This adds the glue to allocate TLS sessions and invokes it from the TLS enable socket option handler. This also adds some counters for active TOE sessions. The TOE KTLS mode is returned by getsockopt(TLSTX_TLS_MODE) when TOE KTLS is in use on a socket, but cannot be set via setsockopt(). To simplify various checks, a TLS session now includes an explicit 'mode' member set to the value returned by TLSTX_TLS_MODE. Various places that used to check 'sw_encrypt' against NULL to determine software vs ifnet (NIC) TLS now check 'mode' instead. Reviewed by: np, gallatin Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D21891	2019-10-08 21:34:06 +00:00
Gleb Smirnoff	e4c40a8a71	Quickly plug another regression from r353292. Again, multicast locking needs lots of work... Reported by: pho	2019-10-08 16:59:17 +00:00
Michael Tuexen	953b78bed9	Validate length before use it, not vice versa. r353060 should have contained this... This fixes https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=18070 MFC after: 3 days	2019-10-08 11:07:16 +00:00
Gleb Smirnoff	b8a6e03fac	Widen NET_EPOCH coverage. When epoch(9) was introduced to network stack, it was basically dropped in place of existing locking, which was mutexes and rwlocks. For the sake of performance mutex covered areas were as small as possible, so became epoch covered areas. However, epoch doesn't introduce any contention, it just delays memory reclaim. So, there is no point to minimise epoch covered areas in sense of performance. Meanwhile entering/exiting epoch also has non-zero CPU usage, so doing this less often is a win. Not the least is also code maintainability. In the new paradigm we can assume that at any stage of processing a packet, we are inside network epoch. This makes coding both input and output path way easier. On output path we already enter epoch quite early - in the ip_output(), in the ip6_output(). This patch does the same for the input path. All ISR processing, network related callouts, other ways of packet injection to the network stack shall be performed in net_epoch. Any leaf function that walks network configuration now asserts epoch. Tricky part is configuration code paths - ioctls, sysctls. They also call into leaf functions, so some need to be changed. This patch would introduce more epoch recursions (see EPOCH_TRACE) than we had before. They will be cleaned up separately, as several of them aren't trivial. Note, that unlike a lock recursion the epoch recursion is safe and just wastes a bit of resources. Reviewed by: gallatin, hselasky, cy, adrian, kristof Differential Revision: https://reviews.freebsd.org/D19111	2019-10-07 22:40:05 +00:00
Michael Tuexen	746c7ae563	In r343587 a simple port filter as sysctl tunable was added to siftr. The new sysctl was not added to the siftr.4 man page at the time. This updates the man page, and removes one left over trailing whitespace. Submitted by: Richard Scheffenegger Reviewed by: bcr@ MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D21619	2019-10-07 20:35:04 +00:00
Randall Stewart	5b63b22075	Brad Davis identified a problem with the new LRO code, VLAN's no longer worked. The problem was that the defines used the same space as the VLAN id. This commit does three things. 1) Move the LRO used fields to the PH_per fields. This is safe since the entire PH_per is used for IP reassembly which LRO code will not hit. 2) Remove old unused pace fields that are not used in mbuf.h 3) The VLAN processing is not in the mbuf queueing code. Consequently if a VLAN submits to Rack or BBR we need to bypass the mbuf queueing for now until rack_bbr_common is updated to handle the VLAN properly. Reported by: Brad Davis	2019-10-06 22:29:02 +00:00
Michael Tuexen	63fb39ba7b	Plumb an mbuf leak in a code path that should not be taken. Also avoid that this path is taken by setting the tail pointer correctly. There is still bug related to handling unordered unfragmented messages which were delayed in deferred handling. This issue was found by OSS-Fuzz testing the usrsctp stack and reported in https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=17794 MFC after: 3 days	2019-10-06 08:47:10 +00:00
Michael Tuexen	2560cb1eb8	Fix a use after free bug when removing remote addresses. This bug was found by OSS-Fuzz and reported in https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=18004 MFC after: 3 days	2019-10-05 13:28:01 +00:00
Michael Tuexen	0941b9dc37	Plumb an mbuf leak found by Mark Wodrich from Google by fuzz testing the userland stack and reporting it in: https://github.com/sctplab/usrsctp/issues/396 MFC after: 3 days	2019-10-05 12:34:50 +00:00
Michael Tuexen	44f788d793	Fix the adding of padding to COOKIE-ECHO chunks. Thanks to Mark Wodrich who found this issue while fuzz testing the usrsctp stack and reported the issue in https://github.com/sctplab/usrsctp/issues/382 MFC after: 3 days	2019-10-05 09:46:11 +00:00
Michael Tuexen	aac50dab6d	When skipping the address parameter, take the padding into account. MFC after: 3 days	2019-10-03 20:47:57 +00:00
Michael Tuexen	5989470c37	Cleanup sctp_asconf_error_response() and ensure that the parameter is padded as required. This fixes the followig bug reported by OSS-Fuzz for the usersctp stack: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=17790 MFC after: 3 days	2019-10-03 20:39:17 +00:00
Michael Tuexen	967e1a5333	Add missing input validation. This could result in reading from uninitialized memory. The issue was found by OSS-Fuzz for usrsctp and reported in https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=17780 MFC after: 3 days	2019-10-03 18:36:54 +00:00
Michael Tuexen	2974e263c3	Don't use stack memory which is not initialized. Thanks to Mark Wodrich for reporting this issue for the userland stack in https://github.com/sctplab/usrsctp/issues/380 This issue was also found for usrsctp by OSS-fuzz in https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=17778 MFC after: 3 days	2019-09-30 12:06:57 +00:00
Michael Tuexen	12a43d0d5d	RFC 7112 requires a host to put the complete IP header chain including the TCP header in the first IP packet. Enforce this in tcp_output(). In addition make sure that at least one byte payload fits in the TCP segement to allow making progress. Without this check, a kernel with INVARIANTS will panic. This issue was found by running an instance of syzkaller. Reviewed by: jtl@ MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D21665	2019-09-29 10:45:13 +00:00
Michael Tuexen	71e85612a9	Replacing MD5 by SipHash improves the performance of the TCP time stamp initialisation, which is important when the host is dealing with a SYN flood. This affects the computation of the initial TCP sequence number for the client side. This has been discussed with secteam@. Reviewed by: gallatin@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D21616	2019-09-28 13:13:23 +00:00
Michael Tuexen	79c2a2a07b	Ensure that the INP lock is released before leaving [gs]etsockopt() for RACK specific socket options. These issues were found by a syzkaller instance. Reviewed by: rrs@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D21825	2019-09-28 13:05:37 +00:00
Jonathan T. Looney	0b18fb0798	Add new functionality to switch to using cookies exclusively when we the syn cache overflows. Whether this is due to an attack or due to the system having more legitimate connections than the syn cache can hold, this situation can quickly impact performance. To make the system perform better during these periods, the code will now switch to exclusively using cookies until the syn cache stops overflowing. In order for this to occur, the system must be configured to use the syn cache with syn cookie fallback. If syn cookies are completely disabled, this change should have no functional impact. When the system is exclusively using syn cookies (either due to configuration or the overflow detection enabled by this change), the code will now skip acquiring a lock on the syn cache bucket. Additionally, the code will now skip lookups in several places (such as when the system receives a RST in response to a SYN\|ACK frame). Reviewed by: rrs, gallatin (previous version) Discussed with: tuexen Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D21644	2019-09-26 15:18:57 +00:00
Jonathan T. Looney	0bee4d631a	Access the syncache secret directly from the V_tcp_syncache variable, rather than indirectly through the backpointer to the tcp_syncache structure stored in the hashtable bucket. This also allows us to remove the requirement in syncookie_generate() and syncookie_lookup() that the syncache hashtable bucket must be locked. Reviewed by: gallatin, rrs Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D21644	2019-09-26 15:06:46 +00:00
Jonathan T. Looney	867e98f8ee	Remove the unused sch parameter to the syncache_respond() function. The use of this parameter was removed in r313330. This commit now removes passing this now-unused parameter. Reviewed by: gallatin, rrs Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D21644	2019-09-26 15:02:34 +00:00
Randall Stewart	ac7bd23a7a	lets put (void) in a couple of functions to keep older platforms that are stuck with gcc happy (ppc). The changes are needed in both bbr and rack. Obtained from: Michael Tuexen (mtuexen@)	2019-09-24 20:36:43 +00:00
Randall Stewart	da99b33b17	don't call in_ratelmit detach when RATELIMIT is not compiled in the kernel.	2019-09-24 20:11:55 +00:00
Randall Stewart	2f1cc984db	Fix the ifdefs in tcp_ratelimit.h. They were reversed so that instead of functions only being inside the _KERNEL and the absence of RATELIMIT causing us to have NULL/error returning interfaces we ended up with non-kernel getting the error path. opps..	2019-09-24 20:04:31 +00:00
Randall Stewart	35c7bb3407	This commit adds BBR (Bottleneck Bandwidth and RTT) congestion control. This is a completely separate TCP stack (tcp_bbr.ko) that will be built only if you add the make options WITH_EXTRA_TCP_STACKS=1 and also include the option TCPHPTS. You can also include the RATELIMIT option if you have a NIC interface that supports hardware pacing, BBR understands how to use such a feature. Note that this commit also adds in a general purpose time-filter which allows you to have a min-filter or max-filter. A filter allows you to have a low (or high) value for some period of time and degrade slowly to another value has time passes. You can find out the details of BBR by looking at the original paper at: https://queue.acm.org/detail.cfm?id=3022184 or consult many other web resources you can find on the web referenced by "BBR congestion control". It should be noted that BBRv1 (which this is) does tend to unfairness in cases of small buffered paths, and it will usually get less bandwidth in the case of large BDP paths(when competing with new-reno or cubic flows). BBR is still an active research area and we do plan on implementing V2 of BBR to see if it is an improvement over V1. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D21582	2019-09-24 18:18:11 +00:00
Michael Tuexen	2b861c1538	Plumb a memory leak. Thnanks to Felix Weinrank for finding this issue using fuzz testing and reporting it for the userland stack: https://github.com/sctplab/usrsctp/issues/378 MFC after: 3 days	2019-09-24 13:15:24 +00:00
Michael Tuexen	1325a0de13	Don't hold the info lock when calling sctp_select_a_tag(). This avoids a double lock bug in the NAT colliding state processing of SCTP. Thanks to Felix Weinrank for finding and reporting this issue in https://github.com/sctplab/usrsctp/issues/374 He found this bug using fuzz testing. MFC after: 3 days	2019-09-22 11:11:01 +00:00
Michael Tuexen	44f2a3272e	Cleanup the RTO calculation and perform some consistency checks before computing the RTO. This should fix an overflow issue reported by Felix Weinrank in https://github.com/sctplab/usrsctp/issues/375 for the userland stack and found by running a fuzz tester. MFC after: 3 days	2019-09-22 10:40:15 +00:00
Michael Tuexen	e6b3bd22d8	Fix the handling of invalid parameters in ASCONF chunks. Thanks to Mark Wodrich from Google for reproting the issue in https://github.com/sctplab/usrsctp/issues/376 for the userland stack. MFC after: 3 days	2019-09-20 08:20:20 +00:00
Michael Tuexen	dd3121a895	When the RACK stack computes the space for user data in a TCP segment, it wasn't taking the IP level options into account. This patch fixes this. In addition, it also corrects a KASSERT and adds protection code to assure that the IP header chain and the TCP head fit in the first fragment as required by RFC 7112. Reviewed by: rrs@ MFC after: 3 days Sponsored by: Nertflix, Inc. Differential Revision: https://reviews.freebsd.org/D21666	2019-09-19 10:27:47 +00:00
Michael Tuexen	3c19311544	Only allow a SCTP-AUTH shared key to be updated by the application if it is not deactivated and not used. This avoids a use-after-free problem. Reported by: da_cheng_shao@yeah.net MFC after: 3 days	2019-09-17 09:46:42 +00:00
Michael Tuexen	5b66b7f11b	Don't write to memory outside of the allocated array for SACK blocks. Obtained from: rrs@ MFC after: 3 days Sponsored by: Netflix, Inc.	2019-09-16 08:18:05 +00:00
Michael Tuexen	52f6a8090e	When the IP layer calls back into the SCTP layer to perform the SCTP checksum computation, do not assume that the IP header chain and the SCTP common header are in contiguous memory although the SCTP lays out the mbuf chains that way. If there are IP-level options inserted by the IP layer, the constraint is not fulfilled anymore. This issues was found by running syzkaller. Thanks to markj@ who is running an instance which also provides kernel dumps. This allowed me to find this issue. MFC after: 3 days	2019-09-15 18:29:45 +00:00
Andrew Gallatin	d2e6258258	Avoid unneeded call to arc4random() in syncache_add() Don't call arc4random() unconditionally to initialize sc_iss, and then when syncookies are enabled, just overwrite it with the return value from from syncookie_generate(). Instead, only call arc4random() to initialize sc_iss when syncookies are not enabled. Note that on a system under a syn flood attack, arc4random() becomes quite expensive, and the chacha_poly crypto that it calls is one of the more expensive things happening on the system. Removing this unneeded arc4random() call reduces CPU from about 40% to about 35% in my test scenario (Broadwell Xeon, 6Mpps syn flood attack). Reviewed by: rrs, tuxen, bz Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21591	2019-09-11 18:48:26 +00:00
Randall Stewart	6f32ca1936	With the recent commit of ktls, we no longer have a sb_tls_flags, its just the sb_flags. Also the ratelimit code, now that the defintion is in sockbuf.h, does not need the ktls.h file (or its predecessor). Sponsored by: Netflix Inc	2019-09-11 15:41:36 +00:00
Michael Tuexen	6d261981ee	Only update SACK/DSACK lists when a non-empty segment was received. This fixes hitting a KASSERT with a valid packet exchange. Reviewed by: rrs@, Richard Scheffenegger MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D21567	2019-09-09 16:07:47 +00:00
Conrad Meyer	373013b05f	Fix build after r351934 tcp_queue_pkts() is only used if TCPHPTS is defined (and it is not by default). Reported by: gcc	2019-09-06 18:33:39 +00:00
Randall Stewart	af9b9e0d9f	This adds in the missing counter initialization which I had forgotten to bring over.. opps. Differential Revision: https://reviews.freebsd.org/D21127	2019-09-06 18:29:48 +00:00
Warner Losh	82e837f803	Initialize if_hw_tsomaxsegsize to 0 to appease gcc's flow analysis as a fail-safe.	2019-09-06 18:25:42 +00:00
Randall Stewart	e57b2d0e51	This adds the final tweaks to LRO that will now allow me to add BBR. These changes make it so you can get an array of timestamps instead of a compressed ack/data segment. BBR uses this to aid with its delivery estimates. We also now (via Drew's suggestions) will not go to the expense of the tcb lookup if no stack registers to want this feature. If HPTS is not present the feature is not present either and you just get the compressed behavior. Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D21127	2019-09-06 14:25:41 +00:00
Michael Tuexen	ecc5b1d156	Fix the SACK block generation in the base TCP stack by bringing it in sync with the RACK stack. Reviewed by: rrs@ MFC after: 5 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D21513	2019-09-04 04:38:31 +00:00
Michael Tuexen	191ae5bfa9	Fix two TCP RACK issues: * Convert the TCP delayed ACK timer from ms to ticks as required. This fixes the timer on platforms with hz != 1000. * Don't delay acknowledgements which report duplicate data using DSACKs. Reviewed by: rrs@ MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D21512	2019-09-03 19:48:02 +00:00
Michael Tuexen	fe5dee73f7	This patch improves the DSACK handling to conform with RFC 2883. The lowest SACK block is used when multiple Blocks would be elegible as DSACK blocks ACK blocks get reordered - while maintaining the ordering of SACK blocks not relevant in the DSACK context is maintained. Reviewed by: rrs@, tuexen@ Obtained from: Richard Scheffenegger MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D21038	2019-09-02 19:04:02 +00:00
Michael Tuexen	b21097894e	Fix initialization of top_fsn. MFC after: 3 days	2019-09-01 10:39:16 +00:00
Michael Tuexen	6182677fb3	Improve the handling of state cookie parameters in INIT-ACK chunks. This fixes problem with parameters indicating a zero length or partial parameters after an unknown parameter indicating to stop processing. It also fixes a problem with state cookie parameters after unknown parametes indicating to stop porcessing. Thanks to Mark Wodrich from Google for finding two of these issues by fuzz testing the userland stack and reporting them in https://github.com/sctplab/usrsctp/issues/355 and https://github.com/sctplab/usrsctp/issues/352 MFC after: 3 days	2019-09-01 10:09:53 +00:00
Michael Tuexen	e30a1788fa	Improve function definition. MFC after: 3 days	2019-08-31 13:13:40 +00:00
Michael Tuexen	ec24a1b67c	Improve the handling of illegal sequence number combinations in received data chunks. Abort the association if there are data chunks with larger fragement sequence numbers than the fragement sequence of the last fragment. Thanks to Mark Wodrich from Google who found this issue by fuzz testing the userland stack and reporting this issue in https://github.com/sctplab/usrsctp/issues/355 MFC after: 3 days	2019-08-31 08:18:49 +00:00
John Baldwin	b2e60773c6	Add kernel-side support for in-kernel TLS. KTLS adds support for in-kernel framing and encryption of Transport Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports offload of TLS for transmitted data. Key negotation must still be performed in userland. Once completed, transmit session keys for a connection are provided to the kernel via a new TCP_TXTLS_ENABLE socket option. All subsequent data transmitted on the socket is placed into TLS frames and encrypted using the supplied keys. Any data written to a KTLS-enabled socket via write(2), aio_write(2), or sendfile(2) is assumed to be application data and is encoded in TLS frames with an application data type. Individual records can be sent with a custom type (e.g. handshake messages) via sendmsg(2) with a new control message (TLS_SET_RECORD_TYPE) specifying the record type. At present, rekeying is not supported though the in-kernel framework should support rekeying. KTLS makes use of the recently added unmapped mbufs to store TLS frames in the socket buffer. Each TLS frame is described by a single ext_pgs mbuf. The ext_pgs structure contains the header of the TLS record (and trailer for encrypted records) as well as references to the associated TLS session. KTLS supports two primary methods of encrypting TLS frames: software TLS and ifnet TLS. Software TLS marks mbufs holding socket data as not ready via M_NOTREADY similar to sendfile(2) when TLS framing information is added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then called to schedule TLS frames for encryption. In the case of sendfile_iodone() calls ktls_enqueue() instead of pru_ready() leaving the mbufs marked M_NOTREADY until encryption is completed. For other writes (vn_sendfile when pages are available, write(2), etc.), the PRUS_NOTREADY is set when invoking pru_send() along with invoking ktls_enqueue(). A pool of worker threads (the "KTLS" kernel process) encrypts TLS frames queued via ktls_enqueue(). Each TLS frame is temporarily mapped using the direct map and passed to a software encryption backend to perform the actual encryption. (Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if someone wished to make this work on architectures without a direct map.) KTLS supports pluggable software encryption backends. Internally, Netflix uses proprietary pure-software backends. This commit includes a simple backend in a new ktls_ocf.ko module that uses the kernel's OpenCrypto framework to provide AES-GCM encryption of TLS frames. As a result, software TLS is now a bit of a misnomer as it can make use of hardware crypto accelerators. Once software encryption has finished, the TLS frame mbufs are marked ready via pru_ready(). At this point, the encrypted data appears as regular payload to the TCP stack stored in unmapped mbufs. ifnet TLS permits a NIC to offload the TLS encryption and TCP segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS) is allocated on the interface a socket is routed over and associated with a TLS session. TLS records for a TLS session using ifnet TLS are not marked M_NOTREADY but are passed down the stack unencrypted. The ip_output_send() and ip6_output_send() helper functions that apply send tags to outbound IP packets verify that the send tag of the TLS record matches the outbound interface. If so, the packet is tagged with the TLS send tag and sent to the interface. The NIC device driver must recognize packets with the TLS send tag and schedule them for TLS encryption and TCP segmentation. If the the outbound interface does not match the interface in the TLS send tag, the packet is dropped. In addition, a task is scheduled to refresh the TLS send tag for the TLS session. If a new TLS send tag cannot be allocated, the connection is dropped. If a new TLS send tag is allocated, however, subsequent packets will be tagged with the correct TLS send tag. (This latter case has been tested by configuring both ports of a Chelsio T6 in a lagg and failing over from one port to another. As the connections migrated to the new port, new TLS send tags were allocated for the new port and connections resumed without being dropped.) ifnet TLS can be enabled and disabled on supported network interfaces via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported across both vlan devices and lagg interfaces using failover, lacp with flowid enabled, or lacp with flowid enabled. Applications may request the current KTLS mode of a connection via a new TCP_TXTLS_MODE socket option. They can also use this socket option to toggle between software and ifnet TLS modes. In addition, a testing tool is available in tools/tools/switch_tls. This is modeled on tcpdrop and uses similar syntax. However, instead of dropping connections, -s is used to force KTLS connections to switch to software TLS and -i is used to switch to ifnet TLS. Various sysctls and counters are available under the kern.ipc.tls sysctl node. The kern.ipc.tls.enable node must be set to true to enable KTLS (it is off by default). The use of unmapped mbufs must also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS. KTLS is enabled via the KERN_TLS kernel option. This patch is the culmination of years of work by several folks including Scott Long and Randall Stewart for the original design and implementation; Drew Gallatin for several optimizations including the use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records awaiting software encryption, and pluggable software crypto backends; and John Baldwin for modifications to support hardware TLS offload. Reviewed by: gallatin, hselasky, rrs Obtained from: Netflix Sponsored by: Netflix, Chelsio Communications Differential Revision: https://reviews.freebsd.org/D21277	2019-08-27 00:01:56 +00:00
Michael Tuexen	15ddc5e43f	Don't hold the rs_mtx lock while calling malloc(). Reviewed by: rrs@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D21416	2019-08-26 16:23:47 +00:00
Randall Stewart	e13ad86c67	Fix an issue when TSO and Rack play together. Basically an retransmission of the initial SYN (with data) would cause us to strip the SYN and decrement/increase offset/len which then caused us a -1 offset and a panic. Reported by: Larry Rosenman (Michael Tuexen helped me debug this at the IETF)	2019-08-21 10:45:28 +00:00
Mark Johnston	ccdc986607	Fix netdump buffering after r348473. nd_buf is used to buffer headers (for both the kernel dump itself and for EKCD) before the final call to netdump_dumper(), which flushes residual data in nd_buf. As a result, a small portion of the residual data would be corrupted. This manifests when kernel dump compression is enabled since both zstd and zlib detect the corruption during decompression. Reviewed by: cem Differential Revision: https://reviews.freebsd.org/D21294	2019-08-19 16:29:51 +00:00
Andrey V. Elsukov	c4556b2fbe	Save ip_ttl value and restore it after checksum calculation. Since ipvoly is used for checksum calculation, part of original IP header is zeroed. This part includes ip_ttl field, that can be used later in IP_MINTTL socket option handling. PR: 239799 MFC after: 1 week	2019-08-13 12:47:53 +00:00
Randall Stewart	23fa2dbc06	Place back in the dependency on HPTS via module depends versus a fatal error in compiling. This was taken out by mistake when I mis-merged from the 18q22p2 sources of rack in NF. Opps. Reported by: sbruno	2019-08-13 12:41:15 +00:00
Tom Jones	deb2aead4a	Rename IPPROTO 33 from SEP to DCCP IPPROTO 33 is DCCP in the IANA Registry: https://www.iana.org/assignments/protocol-numbers/protocol-numbers.xhtml IPPROTO_SEP was added about 20 years ago in r33804. The entries were added straight from RFC1700, without regard to whether they were used. The reference in RFC1700 for SEP is '[JC120] <mystery contact>', this is an indication that the protocol number was probably in use in a private network. As RFC1700 is no longer the authoritative list of internet numbers and that IANA assinged 33 to DCCP in RFC4340, change the header to the actual authoritative source. Reviewed by: Richard Scheffenegger, bz Approved by: bz (mentor) MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D21178	2019-08-08 11:43:09 +00:00
Michael Tuexen	47d16bf9c1	Fix a typo. Submitted by: Thomas Dreibholz MFC after: 1 week	2019-08-08 08:23:27 +00:00
Michael Tuexen	cd2de8b735	Fix a locking issue in sctp_accept. PR: 238520 Reported by: pho@ MFC after: 1 week	2019-08-06 10:29:19 +00:00
Michael Tuexen	43ecbff2dc	Fix build issues for the userland stack on Raspbian.	2019-08-06 08:33:21 +00:00
Michael Tuexen	94962f6ba0	Improve consistency. No functional change. MFC after: 3 days	2019-08-05 13:22:15 +00:00
Xin LI	903c4ee6ec	Fix !INET build.	2019-08-02 22:43:09 +00:00
Randall Stewart	99c311c4d1	Fix one more atomic for i86 Obtained from: mtuexen@freebsd.org	2019-08-02 11:17:07 +00:00
Bjoern A. Zeeb	0ecd976e80	IPv6 cleanup: kernel Finish what was started a few years ago and harmonize IPv6 and IPv4 kernel names. We are down to very few places now that it is feasible to do the change for everything remaining with causing too much disturbance. Remove "aliases" for IPv6 names which confusingly could indicate that we are talking about a different data structure or field or have two fields, one for each address family. Try to follow common conventions used in FreeBSD. * Rename sin6p to sin6 as that is how it is spelt in most places. * Remove "aliases" (#defines) for: - in6pcb which really is an inpcb and nothing separate - sotoin6pcb which is sotoinpcb (as per above) - in6p_sp which is inp_sp - in6p_flowinfo which is inp_flow * Try to use ia6 for in6_addr rather than in6p. * With all these gone also rename the in6p variables to inp as that is what we call it in most of the network stack including parts of netinet6. The reasons behind this cleanup are that we try to further unify netinet and netinet6 code where possible and that people will less ignore one or the other protocol family when doing code changes as they may not have spotted places due to different names for the same thing. No functional changes. Discussed with: tuexen (SCTP changes) MFC after: 3 months Sponsored by: Netflix	2019-08-02 07:41:36 +00:00
Randall Stewart	a1589eb835	Opps use fetchadd_u64 not long to keep old 32 bit platforms happy.	2019-08-01 20:26:27 +00:00
Michael Tuexen	bedf9eb987	Fix the reporting of multiple unknown parameters in an received INIT chunk. This also plugs an potential mbuf leak. Thanks to Felix Weinrank for reporting this issue found by fuzz-testing the userland stack. MFC after: 3 days	2019-08-01 19:45:34 +00:00
Michael Tuexen	fdf15fd04e	When responding with an ABORT to an INIT chunk containing a HOSTNAME parameter or a parameter with an illegal length, only include an error cause indicating why the ABORT was sent. This also fixes an mbuf leak which could occur. MFC after: 3 days	2019-08-01 17:36:15 +00:00
Randall Stewart	20abea6663	This adds the third step in getting BBR into the tree. BBR and an updated rack depend on having access to the new ratelimit api in this commit. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D20953	2019-08-01 14:17:31 +00:00
Michael Tuexen	0a36d8cc81	Small cleanup, no functional change intended. MFC after: 3 days	2019-07-31 21:39:03 +00:00
Michael Tuexen	30735183aa	Consistently cleanup mbufs in case of other memory errors. MFC after: 3 days	2019-07-31 21:29:17 +00:00
Michael Tuexen	bb63f59b22	When performing after_idle() or post_recovery(), don't disable the DCTCP specific methods. Also fallthrough NewReno for non ECN capable TCP connections and improve the integer arithmetic. Obtained from: Richard Scheffenegger MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D20550	2019-07-29 09:19:48 +00:00
Michael Tuexen	333ba164d6	* Improve input validation of sysctl parameters for DCTPC. * Initialize the alpha parameter to a conservative value (like Linux) * Improve handling of arithmetic. * Improve man-page Obtained from: Richard Scheffenegger MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D20549	2019-07-29 08:50:35 +00:00
Michael Tuexen	d21036e033	Add a sysctl variable ts_offset_per_conn to change the computation of the TCP TS offset from taking the IP addresses and the TCP port numbers into account to a version just taking only the IP addresses into account. This works around broken middleboxes or endpoints. The default is to keep the behaviour, which is also the behaviour recommended in RFC 7323. Reported by: devgs@ukr.net Reviewed by: rrs@ MFC after: 2 weeks Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D20980	2019-07-23 21:28:20 +00:00
Michael Tuexen	9a4f1a2492	Don't hold a mutex while calling sbwait. This was found by syzkaller. Submitted by: rrs@ Reported by: markj@ MFC after: 1 week	2019-07-23 18:31:07 +00:00
Michael Tuexen	1a91d6987c	Fix a LOR in SCTP which was found by running syzkaller. Submitted by: rrs@ Reported by: markj@ MFC after: 1 week	2019-07-23 18:07:36 +00:00
Michael Tuexen	f1903dc055	Wakeup the application when doing PD-API for unordered DATA chunks. Work done with rrs@. MFC after: 1 week	2019-07-22 18:11:35 +00:00
Michael Tuexen	e4a5561e01	Fix compilation on platforms using gcc. When compiling RACK on platforms using gcc, a warning that tcp_outflags is defined but not used is issued and terminates compilation on PPC64, for example. So don't indicate that tcp_outflags is used. Reviewed by: rrs@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D20971	2019-07-16 17:54:20 +00:00
Michael Tuexen	8a3cfbff92	Don't free read control entries, which are still on the stream queue when adding them the the read queue fails MFC after: 1 week	2019-07-15 20:45:01 +00:00
Michael Tuexen	248bd1b80f	Add support for MSG_EOR and MSG_EOF in sendmsg() for SCTP. This is an FreeBSD extension, not covered by Posix. This issue was found by running syzkaller. MFC after: 1 week	2019-07-15 14:54:04 +00:00
Michael Tuexen	25fa310a5f	Fix socket state handling when freeing an SCTP endpoint. This issue was found by runing syzkaller. MFC after: 1 week	2019-07-15 14:52:52 +00:00
Randall Stewart	e5926fd368	This is the second in a number of patches needed to get BBRv1 into the tree. This fixes the DSACK bug but is also needed by BBR. We have yet to go two more one will be for the pacing code (tcp_ratelimit.c) and the second will be for the new updated LRO code that allows a transport to know the arrival times of packets and (tcp_lro.c). After that we should finally be able to get BBRv1 into head. Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D20908	2019-07-14 16:05:47 +00:00
Michael Tuexen	8a956abe12	When calling sctp_initialize_auth_params(), the inp must have at least a read lock. To avoid more complex locking dances, just call it in sctp_aloc_assoc() when the write lock is still held. Reported by: syzbot+08a486f7e6966f1c3cfb@syzkaller.appspotmail.com MFC after: 1 week	2019-07-14 12:04:39 +00:00
Randall Stewart	55f795883f	add back the comment around the pending DSACK fixes.	2019-07-12 11:45:42 +00:00
Randall Stewart	1cf999a5f3	Update to jhb's other suggestion, use #error when we are missing HPTS.	2019-07-11 04:40:58 +00:00
Randall Stewart	9cf3c235c0	Update copyright per JBH's suggestions.. thanks.	2019-07-11 04:38:33 +00:00
Randall Stewart	3b0b41e613	This commit updates rack to what is basically being used at NF as well as sets in some of the groundwork for committing BBR. The hpts system is updated as well as some other needed utilities for the entrance of BBR. This is actually part 1 of 3 more needed commits which will finally complete with BBRv1 being added as a new tcp stack. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D20834	2019-07-10 20:40:39 +00:00
John Baldwin	82334850ea	Add an external mbuf buffer type that holds multiple unmapped pages. Unmapped mbufs allow sendfile to carry multiple pages of data in a single mbuf, without mapping those pages. It is a requirement for Netflix's in-kernel TLS, and provides a 5-10% CPU savings on heavy web serving workloads when used by sendfile, due to effectively compressing socket buffers by an order of magnitude, and hence reducing cache misses. For this new external mbuf buffer type (EXT_PGS), the ext_buf pointer now points to a struct mbuf_ext_pgs structure instead of a data buffer. This structure contains an array of physical addresses (this reduces cache misses compared to an earlier version that stored an array of vm_page_t pointers). It also stores additional fields needed for in-kernel TLS such as the TLS header and trailer data that are currently unused. To more easily detect these mbufs, the M_NOMAP flag is set in m_flags in addition to M_EXT. Various functions like m_copydata() have been updated to safely access packet contents (using uiomove_fromphys()), to make things like BPF safe. NIC drivers advertise support for unmapped mbufs on transmit via a new IFCAP_NOMAP capability. This capability can be toggled via the new 'nomap' and '-nomap' ifconfig(8) commands. For NIC drivers that only transmit packet contents via DMA and use bus_dma, adding the capability to if_capabilities and if_capenable should be all that is required. If a NIC does not support unmapped mbufs, they are converted to a chain of mapped mbufs (using sf_bufs to provide the mapping) in ip_output or ip6_output. If an unmapped mbuf requires software checksums, it is also converted to a chain of mapped mbufs before computing the checksum. Submitted by: gallatin (earlier version) Reviewed by: gallatin, hselasky, rrs Discussed with: ae, kp (firewalls) Relnotes: yes Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20616	2019-06-29 00:48:33 +00:00
John Baldwin	6b69072acc	Reject attempts to register a TCP stack being unloaded. Reviewed by: gallatin MFC after: 2 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20617	2019-06-27 22:34:05 +00:00
Hans Petter Selasky	59854ecf55	Convert all IPv4 and IPv6 multicast memberships into using a STAILQ instead of a linear array. The multicast memberships for the inpcb structure are protected by a non-sleepable lock, INP_WLOCK(), which needs to be dropped when calling the underlying possibly sleeping if_ioctl() method. When using a linear array to keep track of multicast memberships, the computed memory location of the multicast filter may suddenly change, due to concurrent insertion or removal of elements in the linear array. This in turn leads to various invalid memory access issues and kernel panics. To avoid this problem, put all multicast memberships on a STAILQ based list. Then the memory location of the IPv4 and IPv6 multicast filters become fixed during their lifetime and use after free and memory leak issues are easier to track, for example by: vmstat -m \| grep multi All list manipulation has been factored into inline functions including some macros, to easily allow for a future hash-list implementation, if needed. This patch has been tested by pho@ . Differential Revision: https://reviews.freebsd.org/D20080 Reviewed by: markj @ MFC after: 1 week Sponsored by: Mellanox Technologies	2019-06-25 11:54:41 +00:00
Andrey V. Elsukov	978f2d1728	Add "tcpmss" opcode to match the TCP MSS value. With this opcode it is possible to match TCP packets with specified MSS option, whose value corresponds to configured in opcode value. It is allowed to specify single value, range of values, or array of specific values or ranges. E.g. # ipfw add deny log tcp from any to any tcpmss 0-500 Reviewed by: melifaro,bcr Obtained from: Yandex LLC MFC after: 1 week Sponsored by: Yandex LLC	2019-06-21 10:54:51 +00:00
Kristof Provost	05fc9d78d7	ip_output: pass PFIL_FWD in the slow path If we take the slow path for forwarding we should still tell our firewalls (hooked through pfil(9)) that we're forwarding. Pass the ip_output() flags to ip_output_pfil() so it can set the PFIL_FWD flag when we're forwarding. MFC after: 1 week Sponsored by: Axiado	2019-06-21 07:58:08 +00:00
Jonathan T. Looney	5e02b277a4	Add the ability to limit how much the code will fragment the RACK send map in response to SACKs. The default behavior is unchanged; however, the limit can be activated by changing the new net.inet.tcp.rack.split_limit sysctl. Submitted by: Peter Lei <peterlei@netflix.com> Reported by: jtl Reviewed by: lstewart (earlier version) Security: CVE-2019-5599	2019-06-19 13:55:00 +00:00
Xin LI	f89d207279	Separate kernel crc32() implementation to its own header (gsb_crc32.h) and rename the source to gsb_crc32.c. This is a prerequisite of unifying kernel zlib instances. PR: 229763 Submitted by: Yoshihiro Ota <ota at j.email.ne.jp> Differential Revision: https://reviews.freebsd.org/D20193	2019-06-17 19:49:08 +00:00
John Baldwin	77a0144145	Sort opt_foo.h #includes and add a missing blank line in ip_output().	2019-06-11 22:07:39 +00:00
Bjoern A. Zeeb	4c62bffef5	Fix dpcpu and vnet panics with complex types at the end of the section. Apply a linker script when linking i386 kernel modules to apply padding to a set_pcpu or set_vnet section. The padding value is kind-of random and is used to catch modules not compiled with the linker-script, so possibly still having problems leading to kernel panics. This is needed as the code generated on certain architectures for non-simple-types, e.g., an array can generate an absolute relocation on the edge (just outside) the section and thus will not be properly relocated. Adding the padding to the end of the section will ensure that even absolute relocations of complex types will be inside the section, if they are the last object in there and hence relocation will work properly and avoid panics such as observed with carp.ko or ipsec.ko. There is a rather lengthy discussion of various options to apply in the mentioned PRs and their depends/blocks, and the review. There seems no best solution working across multiple toolchains and multiple version of them, so I took the liberty of taking one, as currently our users (and our CI system) are hitting this on just i386 and we need some solution. I wish we would have a proper fix rather than another "hack". Also backout r340009 which manually, temporarily fixed CARP before 12.0-R "by chance" after a lead-up of various other link-elf.c and related fixes. PR: 230857,238012 With suggestions from: arichardson (originally last year) Tested by: lwhsu Event: Waterloo Hackathon 2019 Reported by: lwhsu, olivier MFC after: 6 weeks Differential Revision: https://reviews.freebsd.org/D17512	2019-06-08 17:44:42 +00:00
Michael Tuexen	d1156b0505	r347382 added receiver side DSACK support for the TCP base stack. The corresponding changes for the RACK stack where missed and are added by this commit. Reviewed by: Richard Scheffenegger, rrs@ MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D20372	2019-06-06 07:49:03 +00:00
Bjoern A. Zeeb	eafaa1bc35	After parts of the locking fixes in r346595, syzkaller found another one in udp_output(). This one is a race condition. We do check on the laddr and lport without holding a lock in order to determine whether we want a read or a write lock (this is in the "sendto/sendmsg" cases where addr (sin) is given). Instrumenting the kernel showed that after taking the lock, we had bound to a local port from a parallel thread on the same socket. If we find that case, unlock, and retry again. Taking the write lock would not be a problem in first place (apart from killing some parallelism). However the retry is needed as later on based on similar condition checks we do acquire the pcbinfo lock and if the conditions have changed, we might find ourselves with a lock inconsistency, hence at the end of the function when trying to unlock, hitting the KASSERT. Reported by: syzbot+bdf4caa36f3ceeac198f@syzkaller.appspotmail.com Reviewed by: markj MFC after: 6 weeks Event: Waterloo Hackathon 2019	2019-06-01 14:57:42 +00:00
Mark Johnston	8726929d67	netdump: Buffer pages to avoid calling netdump_send() on each 4KB write. netdump waits for acknowledgement from the server for each write. When dumping page table pages, we perform many small writes, limiting throughput. Use the netdump client's buffer to buffer small contiguous writes before calling netdump_send() to flush the MAXDUMPPGS-sized buffer. This results in a significant reduction in the time taken to complete a netdump. Submitted by: Sam Gwydir <sam@samgwydir.com> Reviewed by: cem MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D20317	2019-05-31 18:29:12 +00:00
Michael Tuexen	bc35229fad	When an ACK segment as the third message of the three way handshake is received and support for time stamps was negotiated in the SYN/SYNACK exchange, perform the PAWS check and only expand the syn cache entry if the check is passed. Without this check, endpoints may get stuck on the incomplete queue. Reviewed by: jtl@ MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D20374	2019-05-26 17:18:14 +00:00
John Baldwin	fb3bc59600	Restructure mbuf send tags to provide stronger guarantees. - Perform ifp mismatch checks (to determine if a send tag is allocated for a different ifp than the one the packet is being output on), in ip_output() and ip6_output(). This avoids sending packets with send tags to ifnet drivers that don't support send tags. Since we are now checking for ifp mismatches before invoking if_output, we can now try to allocate a new tag before invoking if_output sending the original packet on the new tag if allocation succeeds. To avoid code duplication for the fragment and unfragmented cases, add ip_output_send() and ip6_output_send() as wrappers around if_output and nd6_output_ifp, respectively. All of the logic for setting send tags and dealing with send tag-related errors is done in these wrapper functions. For pseudo interfaces that wrap other network interfaces (vlan and lagg), wrapper send tags are now allocated so that ip*_output see the wrapper ifp as the ifp in the send tag. The if_transmit routines rewrite the send tags after performing an ifp mismatch check. If an ifp mismatch is detected, the transmit routines fail with EAGAIN. - To provide clearer life cycle management of send tags, especially in the presence of vlan and lagg wrapper tags, add a reference count to send tags managed via m_snd_tag_ref() and m_snd_tag_rele(). Provide a helper function (m_snd_tag_init()) for use by drivers supporting send tags. m_snd_tag_init() takes care of the if_ref on the ifp meaning that code alloating send tags via if_snd_tag_alloc no longer has to manage that manually. Similarly, m_snd_tag_rele drops the refcount on the ifp after invoking if_snd_tag_free when the last reference to a send tag is dropped. This also closes use after free races if there are pending packets in driver tx rings after the socket is closed (e.g. from tcpdrop). In order for m_free to work reliably, add a new CSUM_SND_TAG flag in csum_flags to indicate 'snd_tag' is set (rather than 'rcvif'). Drivers now also check this flag instead of checking snd_tag against NULL. This avoids false positive matches when a forwarded packet has a non-NULL rcvif that was treated as a send tag. - cxgbe was relying on snd_tag_free being called when the inp was detached so that it could kick the firmware to flush any pending work on the flow. This is because the driver doesn't require ACK messages from the firmware for every request, but instead does a kind of manual interrupt coalescing by only setting a flag to request a completion on a subset of requests. If all of the in-flight requests don't have the flag when the tag is detached from the inp, the flow might never return the credits. The current snd_tag_free command issues a flush command to force the credits to return. However, the credit return is what also frees the mbufs, and since those mbufs now hold references on the tag, this meant that snd_tag_free would never be called. To fix, explicitly drop the mbuf's reference on the snd tag when the mbuf is queued in the firmware work queue. This means that once the inp's reference on the tag goes away and all in-flight mbufs have been queued to the firmware, tag's refcount will drop to zero and snd_tag_free will kick in and send the flush request. Note that we need to avoid doing this in the middle of ethofld_tx(), so the driver grabs a temporary reference on the tag around that loop to defer the free to the end of the function in case it sends the last mbuf to the queue after the inp has dropped its reference on the tag. - mlx5 preallocates send tags and was using the ifp pointer even when the send tag wasn't in use. Explicitly use the ifp from other data structures instead. - Sprinkle some assertions in various places to assert that received packets don't have a send tag, and that other places that overwrite rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer. Reviewed by: gallatin, hselasky, rgrimes, ae Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20117	2019-05-24 22:30:40 +00:00
Bjoern A. Zeeb	bd4549f467	Massively blow up the locking-related KASSERTs used to make sure that we end up in a consistent locking state at the end of udp_output() in order to be able to see what the values are based on which we once took a decision (note: some values may have changed). This helped to debug a syzkaller report. MFC after: 2 months Event: Waterloo Hackathon 2019	2019-05-21 19:23:56 +00:00
Bjoern A. Zeeb	f9a6e8d72f	Similarly to r338257,338306 try to fold the two consecutive #ifdef RSS section in udp_output() into one by moving a '}' outside of the conditional block. MFC after: 2 months Event: Waterloo Hackathon 2019	2019-05-21 19:18:55 +00:00
Conrad Meyer	04e0c883c5	Add two missing eventhandler.h headers These are obviously missing from the .c files, but don't show up in any tinderbox configuration (due to latent header pollution of some kind). It seems some configurations don't have this pollution, and the includes are obviously missing, so go ahead and add them. Reported by: Peter Jeremy <peter AT rulingia.com> X-MFC-With: r347984	2019-05-21 00:04:19 +00:00
Conrad Meyer	e2e050c8ef	Extract eventfilter declarations to sys/_eventfilter.h This allows replacing "sys/eventfilter.h" includes with "sys/_eventfilter.h" in other header files (e.g., sys/{bus,conf,cpu}.h) and reduces header pollution substantially. EVENTHANDLER_DECLARE and EVENTHANDLER_LIST_DECLAREs were moved out of .c files into appropriate headers (e.g., sys/proc.h, powernv/opal.h). As a side effect of reduced header pollution, many .c files and headers no longer contain needed definitions. The remainder of the patch addresses adding appropriate includes to fix those files. LOCK_DEBUG and LOCK_FILE_LINE_ARG are moved to sys/_lock.h, as required by sys/mutex.h since r326106 (but silently protected by header pollution prior to this change). No functional change (intended). Of course, any out of tree modules that relied on header pollution for sys/eventhandler.h, sys/lock.h, or sys/mutex.h inclusion need to be fixed. __FreeBSD_version has been bumped.	2019-05-20 00:38:23 +00:00
Michael Tuexen	3a4f12e3b1	Allow sending on demand SCTP HEARTBEATS only in the ESTABLISHED state. This issue was found by running syzkaller. MFC after: 3 days	2019-05-19 17:53:36 +00:00
Michael Tuexen	fc26bf717c	Improve input validation for the IPPROTO_SCTP level socket options SCTP_CONNECT_X and SCTP_CONNECT_X_DELAYED. Some issues where found by running syzkaller. MFC after: 3 days	2019-05-19 17:28:00 +00:00
Mark Johnston	f00876fb60	Revert r347582 for now. The inp lock still needs to be dropped when calling into the driver ioctl handler, as some drivers expect to be able to sleep. Reported by: kib	2019-05-16 13:04:26 +00:00
Mark Johnston	5a1e222bfd	Close some races in multicast socket option handling. r333175 converted the global multicast lock to a sleepable sx lock, so the lock order with respect to the (non-sleepable) inp lock changed. To handle this, r333175 and r333505 added code to drop the inp lock, but this opened races that could leave multicast group description structures in an inconsistent state. This change fixes the problem by simply acquiring the global lock sooner. Along the way, this fixes some LORs and bogus error handling introduced in r333175, and commits some related cleanup. Reported by: syzbot+ba7c4943547e0604faca@syzkaller.appspotmail.com Reported by: syzbot+1b803796ab94d11a46f9@syzkaller.appspotmail.com Reviewed by: ae MFC after: 3 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D20070	2019-05-14 21:30:55 +00:00
Conrad Meyer	64e7d18f34	netdump: Ref the interface we're attached to Serialize netdump configuration / deconfiguration, and discard our configuration when the affiliated interface goes away by monitoring ifnet_departure_event. Reviewed by: markj, with input from vangyzen@ (earlier version) Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D20206	2019-05-10 23:12:59 +00:00
Conrad Meyer	070e7bf95e	netdump: Fix boot-time configuration typo Boot-time netdump configuration is much more useful if one can configure the client and gateway addresses. Fix trivial typo. (Long-standing bug, I believe it dates to the original netdump commit.) Spotted by: one of vangyzen@ or markj@ Sponsored by: Dell EMC Isilon	2019-05-10 23:10:22 +00:00
Conrad Meyer	6144b50f8b	netdump: Don't store sensitive key data we don't need Prior to this revision, struct diocskerneldump_arg (and struct netdump_conf with embedded diocskerneldump_arg before r347192), were copied in their entirety to the global 'nd_conf' variable. Also prior to this revision, de-configuring netdump would not remove the the key material from global nd_conf. As part of Encrypted Kernel Crash Dumps (EKCD), which was developed contemporaneously with netdump but happened to land first, the diocskerneldump_arg structure will contain sensitive key material (kda_key[]) when encrypted dumps are configured. Netdump doesn't have any use for the key data -- encryption is handled in the core dumper code -- so in this revision, we no longer store it. Unfortunately, I think this leak dates to the initial import of netdump in r333283; so it's present in FreeBSD 12.0. Fortunately, the impact seems relatively minor. Any new netdump configuration would overwrite the key material; for active encrypted netdump configurations, the key data stored was just a duplicate of the key material already in the core dumper code; and no user interface (other than /dev/kmem) actually exposed the leaked material to userspace. Reviewed by: markj, rpokala (earlier commit message) MFC after: 2 weeks Security: yes (minor) Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D20233	2019-05-10 21:55:11 +00:00
Gleb Smirnoff	54bb7ac0c4	Fix regression from r347375: do not panic when sending an IP multicast packet from an interface that doesn't have IPv4 address. Reported by: Michael Butler <imb protected-networks.net>	2019-05-10 21:51:17 +00:00

... 2 3 4 5 6 ...

6580 Commits