freebsd-dev

Author	SHA1	Message	Date
Michael Tuexen	9d2e3f14c4	After allocating chunks set the fields in a consistent way. This removes two assignments for the flags field being done twice and adds one, which was missing. Thanks to Felix Weinrank for reporting the issue he found by using fuzz testing of the userland stack. Approved by: re (kib@) MFC after: 1 week	2018-10-01 13:09:18 +00:00
Andrey V. Elsukov	384a5c3c28	Add INP_INFO_WUNLOCK_ASSERT() macro and use it instead of INP_INFO_UNLOCK_ASSERT() in TCP-related code. For encapsulated traffic it is possible, that the code is running in net_epoch_preempt section, and INP_INFO_UNLOCK_ASSERT() is very strict assertion for such case. PR: 231428 Reviewed by: mmacy, tuexen Approved by: re (kib) Differential Revision: https://reviews.freebsd.org/D17335	2018-10-01 10:46:00 +00:00
Michael Tuexen	1b084a5e5e	Plug mbuf leak in the SCTP input path in an error case. Approved by: re (kib@) MFC after: 1 week CID: 749312	2018-09-30 21:54:02 +00:00
Michael Tuexen	66bcf0b333	Plug mbuf leaks in the SCTP output path in error cases. Approved by: re (kib@) MFC after: 1 week CID: 1395307	2018-09-30 21:31:33 +00:00
Michael Tuexen	8184648425	Fix the handling of ancillary data for SCTP socket. Implement sctp_process_cmsgs_for_init() and sctp_findassociation_cmsgs() similar to sctp_find_cmsg() to improve consistency and avoid the signed/unsigned issues in sctp_process_cmsgs_for_init() and sctp_findassociation_cmsgs(). Thanks to andrew@ for reporting the problem he found using syzcaller. Approved by: re (kib@) MFC after: 1 week	2018-09-30 16:21:31 +00:00
Michael Tuexen	ae0a9a8850	Increment the corresponding UDP stats counter (udps_opackets) when sending UDP encapsulated SCTP packets. This is consistent with the behaviour that when such packets are received, the corresponding UDP stats counter (udps_ipackets) is incremented. Thanks to Peter Lei for making me aware of this inconsistency. Approved by: re (kib@) MFC after: 1 week	2018-09-30 12:16:06 +00:00
Michael Tuexen	3552f16d82	Fix typo in comment. Reported by: @danfe Approved by: re (kib@) MFC after: 1 week X-MFC: r338941	2018-09-28 19:47:32 +00:00
Michael Tuexen	0277ec9c43	Whitespace changes and fixing a typo. No functional change. Approved by: re (kib@) MFC after: 1 week	2018-09-26 10:24:50 +00:00
Michael Tuexen	078a49a077	Remove the unused parameter 'locked' from the function syncache_respond(). There is no functional change. The parameter became unused in r313330, but wasn't removed. Approved by: re (kib@) MFC after: 1 month Sponsored by: Netflix, Inc.	2018-09-23 16:37:32 +00:00
Andrey V. Elsukov	76b09d1823	Add new field max_hdrsize to struct encap_config. It is currently unused and reserved for future use to keep KBI/KPI. Also add several spare pointers to be able extend structure if it will be needed. Approved by: re (gjb)	2018-09-20 19:45:27 +00:00
Michael Tuexen	ba4704a278	Remove unused code. Approved by: re (kib@) MFC after: 1 week	2018-09-18 10:53:07 +00:00
Michael Tuexen	a8a8a8a808	Fix TCP Fast Open for the TCP RACK stack. * Fix a bug where the SYN handling during established state was applied to a front state. * Move a check for retransmission after the timer handling. This was suppressing timer based retransmissions. * Fix an off-by one byte in the sequence number of retransmissions. * Apply fixes corresponding to https://svnweb.freebsd.org/changeset/base/336934 Reviewed by: rrs@ Approved by: re (kib@) MFC after: 1 month Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D16912	2018-09-12 10:27:58 +00:00
Mark Johnston	54af3d0dac	Fix synchronization of LB group access. Lookups are protected by an epoch section, so the LB group linkage must be a CK_LIST rather than a plain LIST. Furthermore, we were not deferring LB group frees, so in_pcbremlbgrouphash() could race with readers and cause a use-after-free. Reviewed by: sbruno, Johannes Lundberg <johalun0@gmail.com> Tested by: gallatin Approved by: re (gjb) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17031	2018-09-10 19:00:29 +00:00
Mark Johnston	a7026c7fd9	Use ratecheck(9) in in_pcbinslbgrouphash(). Reviewed by: bz, Johannes Lundberg <johalun0@gmail.com> Approved by: re (kib) Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D17065	2018-09-07 21:11:41 +00:00
Bjoern A. Zeeb	113c4fad55	The inp_lle field to struct inpcb, along with two "valid" flags for the rt and lle cache were added in r191129 (2009). To my best knowledge they have never been used and route caching has converted the inp_rt field from that commit to inp_route rendering this field and these flags obsolete. Convert the pointer into a spare pointer to not change the size of the structure anymore (and to have a spare pointer) and mark the two fields as unused. Reviewed by: markj, karels Approved by: re (gjb) Differential Revision: https://reviews.freebsd.org/D17062	2018-09-06 19:55:40 +00:00
Bjoern A. Zeeb	6d2b0c0166	Make tcp_hpts.c compile a LINT kernel with options RSS and PCBGROUPS added by adding the missing include files and changing a the type of cpuid which would otherwise cause a false comparison with NETISR_CPUID_NONE. Reviewed by: rrs Approved by: re (marius) Differential Revision: https://reviews.freebsd.org/D16891	2018-09-06 16:11:24 +00:00
Mark Johnston	49365eb433	Define sctp probes only when SCTP is configured. Otherwise the "depends_on provider" guard in sctp.d does not work as intended. Reported by: mjg Reviewed by: tuexen Approved by: re (gjb) Differential Revision: https://reviews.freebsd.org/D17057	2018-09-06 14:15:03 +00:00
Mark Johnston	8be02ee4da	Fix style bugs in in_pcblookup_lbgroup(). No functional change intended. Reviewed by: bz, Johannes Lundberg <johalun0@gmail.com> Approved by: re (rgrimes) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17030	2018-09-05 15:04:11 +00:00
Eugene Grosbein	d5d21ad932	Fix "ipfw fwd" to work for incoming IPv4 packets when ip_tryforward() chooses fast forwarding path, as it already works for IPv6 and for both of them on old slow path. PR: 231143 Reviewed by: ae Approved by: re (gjb) MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D17039	2018-09-05 13:59:36 +00:00
Mark Johnston	73ad0b6abf	Use the correct malloc type in in_pcblbgroup_free(). Approved by: re (kib) Sponsored by: The FreeBSD Foundation	2018-09-03 17:39:09 +00:00
Michael Tuexen	c6c0be2765	Fix a shadowed variable warning. Thanks to Peter Lei for reporting the issue. Approved by: re(kib@) MFH: 1 month Sponsored by: Netflix, Inc.	2018-08-24 10:50:19 +00:00
Michael Tuexen	90ab3571d8	Use arc4rand() instead of read_random() in the SCTP and TCP code. This was suggested by jmg@. Reviewed by: delphij@, jmg@, jtl@ MFC after: 1 month Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D16860	2018-08-23 19:10:45 +00:00
Michael Tuexen	4ba1513d1a	Don't use the explicit number 32 for the length of the secrets, use sizeof() or explicit #definesi instead. No functional change. This was suggested by jmg@. MFC after: 1 month XMFC with: r338053 Sponsored by: Netflix, Inc.	2018-08-23 06:03:59 +00:00
Michael Tuexen	1e88cc8b59	Add support for send, receive and state-change DTrace providers for SCTP. They are based on what is specified in the Solaris DTrace manual for Solaris 11.4. Reviewed by: 0mp, dteske, markj Relnotes: yes Differential Revision: https://reviews.freebsd.org/D16839	2018-08-22 21:23:32 +00:00
Matt Macy	d3878608d7	in_mcast: fix copy paste error when clearing flag	2018-08-22 04:09:55 +00:00
Michael Tuexen	5dff1c3845	Enabling the IPPROTO_IPV6 level socket option IPV6_USE_MIN_MTU on a TCP socket resulted in sending fragmented IPV6 packets. This is fixes by reducing the MSS to the appropriate value. In addtion, if the socket option is set before the handshake happens, announce this MSS to the peer. This is not stricly required, but done since TCP is conservative. PR: 173444 Reviewed by: bz@, rrs@ MFC after: 1 month Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D16796	2018-08-21 14:12:30 +00:00
Michael Tuexen	7d4dcc36a8	Fix the inheritance of IPv6 level socket options on TCP sockets. This was broken for IPv6 listening socket, which are not IPV6_ONLY, and the accepted TCP connection was using IPv4. Reviewed by: bz@, rrs@ MFC after: 1 month Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D16792	2018-08-21 14:07:36 +00:00
Michael Tuexen	6ef849e601	Whitespace change.	2018-08-21 13:37:06 +00:00
Michael Tuexen	1a0b021677	Refactor the SHUTDOWN_PENDING state handling. This is not a functional change but a preperation for the upcoming DTrace support. It is necessary to change the state in one logical operation, even if it involves clearing the sub state SHUTDOWN_PENDING. MFC after: 1 month	2018-08-21 13:25:32 +00:00
Bjoern A. Zeeb	10b070c166	GC inc_isipv6; it was added for "temp" compatibility in 2001, r86764 and does not seem to be used.	2018-08-20 20:06:36 +00:00
Randall Stewart	c28440db29	This change represents a substantial restructure of the way we reassembly inbound tcp segments. The old algorithm just blindly dropped in segments without coalescing. This meant that every segment could take up greater and greater room on the linked list of segments. This of course is now subject to a tighter limit (100) of segments which in a high BDP situation will cause us to be a lot more in-efficent as we drop segments beyond 100 entries that we receive. What this restructure does is cause the reassembly buffer to coalesce segments putting an emphasis on the two common cases (which avoid walking the list of segments) i.e. where we add to the back of the queue of segments and where we add to the front. We also have the reassembly buffer supporting a couple of debug options (black box logging as well as counters for code coverage). These are compiled out by default but can be added by uncommenting the defines. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D16626	2018-08-20 12:43:18 +00:00
Michael Tuexen	8e02b4e00c	Don't expose the uptime via the TCP timestamps. The TCP client side or the TCP server side when not using SYN-cookies used the uptime as the TCP timestamp value. This patch uses in all cases an offset, which is the result of a keyed hash function taking the source and destination addresses and port numbers into account. The keyed hash function is the same a used for the initial TSN. Reviewed by: rrs@ MFC after: 1 month Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D16636	2018-08-19 14:56:10 +00:00
Navdeep Parhar	32d2623ae2	Add the ability to look up the 3b PCP of a VLAN interface. Use it in toe_l2_resolve to fill up the complete vtag and not just the vid. Reviewed by: kib@ MFC after: 1 week Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D16752	2018-08-16 23:46:38 +00:00
Matt Macy	f9be038601	Fix in6_multi double free This is actually several different bugs: - The code is not designed to handle inpcb deletion after interface deletion - add reference for inpcb membership - The multicast address has to be removed from interface lists when the refcount goes to zero OR when the interface goes away - decouple list disconnect from refcount (v6 only for now) - ifmultiaddr can exist past being on interface lists - add flag for tracking whether or not it's enqueued - deferring freeing moptions makes the incpb cleanup code simpler but opens the door wider still to races - call inp_gcmoptions synchronously after dropping the the inpcb lock Fundamentally multicast needs a rewrite - but keep applying band-aids for now. Tested by: kp Reported by: novel, kp, lwhsu	2018-08-15 20:23:08 +00:00
Luiz Otavio O Souza	59b2022f94	Late style follow up on r312770. Submitted by: glebius X-MFC with: r312770 MFC after: 3 days	2018-08-15 15:44:30 +00:00
Jonathan T. Looney	a967df1c8f	Lower the default limits on the IPv4 reassembly queue. In particular, try to ensure that no bucket will have a reassembly queue larger than approximately 100 items. This limits the cost to find the correct reassembly queue when processing an incoming fragment. Due to the low limits on each bucket's length, increase the size of the hash table from 64 to 1024. Reviewed by: jhb Security: FreeBSD-SA-18:10.ip Security: CVE-2018-6923	2018-08-14 17:30:46 +00:00
Jonathan T. Looney	ff790bbad0	Implement a limit on on the number of IPv4 reassembly queues per bucket. There is a hashing algorithm which should distribute IPv4 reassembly queues across the available buckets in a relatively even way. However, if there is a flaw in the hashing algorithm which allows a large number of IPv4 fragment reassembly queues to end up in a single bucket, a per- bucket limit could help mitigate the performance impact of this flaw. Implement such a limit, with a default of twice the maximum number of reassembly queues divided by the number of buckets. Recalculate the limit any time the maximum number of reassembly queues changes. However, allow the user to override the value using a sysctl (net.inet.ip.maxfragbucketsize). Reviewed by: jhb Security: FreeBSD-SA-18:10.ip Security: CVE-2018-6923	2018-08-14 17:23:05 +00:00
Jonathan T. Looney	7b9c5eb0a5	Add a global limit on the number of IPv4 fragments. The IP reassembly fragment limit is based on the number of mbuf clusters, which are a global resource. However, the limit is currently applied on a per-VNET basis. Given enough VNETs (or given sufficient customization of enough VNETs), it is possible that the sum of all the VNET limits will exceed the number of mbuf clusters available in the system. Given the fact that the fragment limit is intended (at least in part) to regulate access to a global resource, the fragment limit should be applied on a global basis. VNET-specific limits can be adjusted by modifying the net.inet.ip.maxfragpackets and net.inet.ip.maxfragsperpacket sysctls. To disable fragment reassembly globally, set net.inet.ip.maxfrags to 0. To disable fragment reassembly for a particular VNET, set net.inet.ip.maxfragpackets to 0. Reviewed by: jhb Security: FreeBSD-SA-18:10.ip Security: CVE-2018-6923	2018-08-14 17:19:49 +00:00
Jonathan T. Looney	5d9bd45518	Improve hashing of IPv4 fragments. Currently, IPv4 fragments are hashed into buckets based on a 32-bit key which is calculated by (src_ip ^ ip_id) and combined with a random seed. However, because an attacker can control the values of src_ip and ip_id, it is possible to construct an attack which causes very deep chains to form in a given bucket. To ensure more uniform distribution (and lower predictability for an attacker), calculate the hash based on a key which includes all the fields we use to identify a reassembly queue (dst_ip, src_ip, ip_id, and the ip protocol) as well as a random seed. Reviewed by: jhb Security: FreeBSD-SA-18:10.ip Security: CVE-2018-6923	2018-08-14 17:15:47 +00:00
Michael Tuexen	0f1346f7f4	Remove a set but not used warning showing up in usrsctp.	2018-08-14 08:32:33 +00:00
Andrey V. Elsukov	62484790e0	Restore ability to send ICMP and ICMPv6 redirects. It was lost when tryforward appeared. Now ip[6]_tryforward will be enabled only when sending redirects for corresponding IP version is disabled via sysctl. Otherwise will be used default forwarding function. PR: 221137 Submitted by: mckay@ MFC after: 2 weeks	2018-08-14 07:54:14 +00:00
Michael Tuexen	839d21d62e	Use the stacb instead of the asoc in state macros. This is not a functional change. Just a preparation for upcoming dtrace state change provider support.	2018-08-13 13:58:45 +00:00
Michael Tuexen	61a2188021	Use consistently the macors to modify the assoc state. No functional change.	2018-08-13 11:56:21 +00:00
Michael Tuexen	812649d86f	Add explicit cast to silence a warning for the userland stack. Thanks to Felix Weinrank for providing the patch.	2018-08-12 14:05:15 +00:00
Devin Teske	ab9ed8a1bd	Fix misspellings of transmitter/transmitted Reviewed by: emaste, bcr Sponsored by: Smule, Inc. Differential Revision: https://reviews.freebsd.org/D16025	2018-08-10 20:37:32 +00:00
Andrey V. Elsukov	16bbf600d9	Remove unneeded ipsec-related includes. Reviewed by: rrs Differential Revision: https://reviews.freebsd.org/D16637	2018-08-10 07:24:01 +00:00
Leandro Lupori	c8e2123b6a	[ppc] Fix kernel panic when using BOOTP_NFSROOT On PowerPC (and possibly other architectures), that doesn't use EARLY_AP_STARTUP, the config task queue may be used initialized. This was observed while trying to mount the root fs from NFS, as reported here: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=230168. This patch has 2 main changes: 1- Perform a basic initialization of qgroup_config, similar to what is done in taskqgroup_adjust, but simpler. This makes qgroup_config ready to be used during NFS root mount. 2- When EARLY_AP_STARTUP is not used, call inm_init() and in6m_init() right before SI_SUB_ROOT_CONF, because bootp needs to send multicast packages to request an IP. PR: Bug 230168 Reported by: sbruno Reviewed by: jhibbits, mmacy, sbruno Approved by: jhibbits Differential Revision: D16633	2018-08-09 14:04:51 +00:00
Randall Stewart	d18ea344e6	Fix a small bug in rack where it will end up sending the FIN twice. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D16604	2018-08-08 13:36:49 +00:00
Jonathan T. Looney	95a914f631	Address concerns about CPU usage while doing TCP reassembly. Currently, the per-queue limit is a function of the receive buffer size and the MSS. In certain cases (such as connections with large receive buffers), the per-queue segment limit can be quite large. Because we process segments as a linked list, large queues may not perform acceptably. The better long-term solution is to make the queue more efficient. But, in the short-term, we can provide a way for a system administrator to set the maximum queue size. We set the default queue limit to 100. This is an effort to balance performance with a sane resource limit. Depending on their environment, goals, etc., an administrator may choose to modify this limit in either direction. Reviewed by: jhb Approved by: so Security: FreeBSD-SA-18:08.tcp Security: CVE-2018-6922	2018-08-06 17:36:57 +00:00
Randall Stewart	936b2b64ae	This fixes a bug in Rack where we were not properly using the correct value for Delayed Ack. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D16579	2018-08-06 09:22:07 +00:00
Gleb Smirnoff	cc7963191d	Now that after r335979 the kernel addresses in API structures are fixed size, there is no reason left for the unions. Discussed with: brooks	2018-08-04 00:03:21 +00:00
Michael Tuexen	7bda966394	Add a dtrace provider for UDP-Lite. The dtrace provider for UDP-Lite is modeled after the UDP provider. This fixes the bug that UDP-Lite packets were triggering the UDP provider. Thanks to dteske@ for providing the dwatch module. Reviewed by: dteske@, markj@, rrs@ Relnotes: yes Differential Revision: https://reviews.freebsd.org/D16377	2018-07-31 22:56:03 +00:00
Michael Tuexen	51e08d53ae	Fix INET only builds. r336940 introduced an "unused variable" warning on platforms which support INET, but not INET6, like MALTA and MALTA64 as reported by Mark Millard. Improve the #ifdefs to address this issue. Sponsored by: Netflix, Inc.	2018-07-31 06:27:05 +00:00
Michael Tuexen	888973f5ae	Allow implicit TCP connection setup for TCP/IPv6. TCP/IPv4 allows an implicit connection setup using sendto(), which is used for TTCP and TCP fast open. This patch adds support for TCP/IPv6. While there, improve some tests for detecting multicast addresses, which are mapped. Reviewed by: bz@, kbowling@, rrs@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D16458	2018-07-30 21:27:26 +00:00
Michael Tuexen	e2662978b8	Send consistent SEG.WIN when using timewait codepath for TCP. When sending TCP segments from the timewait code path, a stored value of the last sent window is used. Use the same code for computing this in the timewait code path as in the main code path used in tcp_output() to avoiv inconsistencies. Reviewed by: rrs@ MFC after: 1 month Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D16503	2018-07-30 21:13:42 +00:00
Michael Tuexen	8db239dc6b	Fix some TCP fast open issues. The following issues are fixed: * Whenever a TCP server with TCP fast open enabled, calls accept(), recv(), send(), and close() before the TCP-ACK segment has been received, the TCP connection is just dropped and the reception of the TCP-ACK segment triggers the sending of a TCP-RST segment. * Whenever a TCP server with TCP fast open enabled, calls accept(), recv(), send(), send(), and close() before the TCP-ACK segment has been received, the first byte provided in the second send call is not transferred. * Whenever a TCP client with TCP fast open enabled calls sendto() followed by close() the TCP connection is just dropped. Reviewed by: jtl@, kbowling@, rrs@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D16485	2018-07-30 20:35:50 +00:00
Michael Tuexen	6138da62a9	Add missing send/recv dtrace probes for TCP. These missing probe are mostly in the syncache and timewait code. Reviewed by: markj@, rrs@ MFC after: 1 month Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D16369	2018-07-30 20:13:38 +00:00
Alan Somers	6040822c4e	Make timespecadd(3) and friends public The timespecadd(3) family of macros were imported from NetBSD back in r35029. However, they were initially guarded by #ifdef _KERNEL. In the meantime, we have grown at least 28 syscalls that use timespecs in some way, leading many programs both inside and outside of the base system to redefine those macros. It's better just to make the definitions public. Our kernel currently defines two-argument versions of timespecadd and timespecsub. NetBSD, OpenBSD, and FreeDesktop.org's libbsd, however, define three-argument versions. Solaris also defines a three-argument version, but only in its kernel. This revision changes our definition to match the common three-argument version. Bump _FreeBSD_version due to the breaking KPI change. Discussed with: cem, jilles, ian, bde Differential Revision: https://reviews.freebsd.org/D14725	2018-07-30 15:46:40 +00:00
Randall Stewart	4ad5b7a0ac	This fixes a hole where rack could end up sending an invalid segment into the reassembly queue. This would happen if you enabled the data after close option. Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D16453	2018-07-30 10:23:29 +00:00
Andrew Turner	1e0582fd55	icmp_quotelen was accidentially changes in r336676, undo this. Sponsored by: DARPA, AFRL	2018-07-24 16:45:01 +00:00
Andrew Turner	5f901c92a8	Use the new VNET_DEFINE_STATIC macro when we are defining static VNET variables. Reviewed by: bz Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D16147	2018-07-24 16:35:52 +00:00
Randall Stewart	399973c33d	Delete the example tcp stack "fastpath" which was only put in has an example. Sponsored by: Netflix inc. Differential Revision: https://reviews.freebsd.org/D16420	2018-07-24 14:55:47 +00:00
Matt Macy	e5e3e746fe	Fix a potential use after free in getsockopt() access to inp_options Discussed with: jhb Reviewed by: sbruno, transport MFC after: 2 weeks Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D14621	2018-07-22 20:02:14 +00:00
Matt Macy	2269988749	NULL out cc_data in pluggable TCP {cc}_cb_destroy When ABE was added (rS331214) to NewReno and leak fixed (rS333699) , it now has a destructor (newreno_cb_destroy) for per connection state. Other congestion controls may allocate and free cc_data on entry and exit, but the field is never explicitly NULLed if moving back to NewReno which only internally allocates stateful data (no entry contstructor) resulting in a situation where newreno_cb_destory might be called on a junk pointer. - NULL out cc_data in the framework after calling {cc}_cb_destroy - free(9) checks for NULL so there is no need to perform not NULL checks before calling free. - Improve a comment about NewReno in tcp_ccalgounload This is the result of a debugging session from Jason Wolfe, Jason Eggleston, and mmacy@ and very helpful insight from lstewart@. Submitted by: Kevin Bowling Reviewed by: lstewart Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D16282	2018-07-22 05:37:58 +00:00
Michael Tuexen	34fc9072ce	Set the IPv4 version in the IP header for UDP and UDPLite.	2018-07-21 02:14:13 +00:00
Michael Tuexen	e1526d5a5b	Add missing dtrace probes for received UDP packets. Fire UDP receive probes when a packet is received and there is no endpoint consuming it. Fire the probe also if the TTL of the received packet is smaller than the minimum required by the endpoint. Clarify also in the man page, when the probe fires. Reviewed by: dteske@, markj@, rrs@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D16046	2018-07-20 15:32:20 +00:00
Michael Tuexen	0053ed28ff	Whitespace changes due to changes in ident.	2018-07-19 20:16:33 +00:00
Michael Tuexen	b0471b4b95	Revert https://svnweb.freebsd.org/changeset/base/336503 since I also ran the export script with different parameters.	2018-07-19 20:11:14 +00:00
Michael Tuexen	7679e49dd4	Whitespace changes due to change if ident.	2018-07-19 19:33:42 +00:00
Randall Stewart	8de9ac5eec	Bump the ICMP echo limits to match the RFC Reviewed by: tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D16333	2018-07-18 22:49:53 +00:00
Andrey V. Elsukov	acf673edf0	Move invoking of callout_stop(&lle->lle_timer) into llentry_free(). This deduplicates the code a bit, and also implicitly adds missing callout_stop() to in[6]_lltable_delete_entry() functions. PR: 209682, 225927 Submitted by: hselasky (previous version) MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D4605	2018-07-17 11:33:23 +00:00
Sean Bruno	c8b1bdc31c	There was quite a bit of feedback on r336282 that has led to the submitter to want to revert it.	2018-07-14 23:53:51 +00:00
Sean Bruno	179a28b098	Fixup memory management for fetching options in ip_ctloutput() Submitted by: Jason Eggleston <jason@eggnet.com> Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D14621	2018-07-14 16:19:46 +00:00
Mark Johnston	aaf268f9f6	Remove a duplicate check. PR: 229663 Submitted by: David Binderman <dcb314@hotmail.com> MFC after: 3 days	2018-07-11 14:54:56 +00:00
Brooks Davis	3a20f06a1c	Use uintptr_t alone when assigning to kvaddr_t variables. Suggested by: jhb	2018-07-10 13:03:06 +00:00
Michael Tuexen	c9da58534d	Add support for printing the TCP FO client-side cookie cache via the sysctl interface. This is similar to the TCP host cache. Reviewed by: pkelsey@, kbowling@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D14554	2018-07-10 10:50:43 +00:00
Michael Tuexen	a026a53a76	Use appropriate MSS value when populating the TCP FO client cookie cache When a client receives a SYN-ACK segment with a TFP fast open cookie, but without an MSS option, an MSS value from uninitialised stack memory is used. This patch ensures that in case no MSS option is included in the SYN-ACK, the appropriate value as given in RFC 7413 is used. Reviewed by: kbowling@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D16175	2018-07-10 10:42:48 +00:00
Steven Hartland	65c3a353e6	Removed pointless NULL check Removed pointless NULL check after malloc with M_WAITOK which can never return NULL. Sponsored by: Multiplay	2018-07-10 08:05:32 +00:00
Andrey V. Elsukov	f7c4fdee1a	Add "record-state", "set-limit" and "defer-action" rule options to ipfw. "record-state" is similar to "keep-state", but it doesn't produce implicit O_PROBE_STATE opcode in a rule. "set-limit" is like "limit", but it has the same feature as "record-state", it is single opcode without implicit O_PROBE_STATE opcode. "defer-action" is targeted to be used with dynamic states. When rule with this opcode is matched, the rule's action will not be executed, instead dynamic state will be created. And when this state will be matched by "check-state", then rule action will be executed. This allows create a more complicated rulesets. Submitted by: lev MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D1776	2018-07-09 11:35:18 +00:00
Michael Tuexen	5f1347d7c9	Allow alternate TCP stack to populate the TCP FO client cookie cache. Without this patch, TCP FO could be used when using alternate TCP stack, but only existing entires in the TCP client cookie cache could be used. This cache was not populated by connections using alternate TCP stacks. Sponsored by: Netflix, Inc.	2018-07-07 12:28:16 +00:00
Michael Tuexen	c556884f8e	When initializing the TCP FO client cookie cache, take into account whether the TCP FO support is enabled or not for the client side. The code in tcp_fastopen_init() implicitly assumed that the sysctl variable V_tcp_fastopen_client_enable was initialized to 0. This was initially true, but was changed in r335610, which unmasked this bug. Thanks to Pieter de Goeje for reporting the issue on freebsd-net@	2018-07-07 11:18:26 +00:00
Brooks Davis	5c5e39e3d5	One more 32-bit fix for r335979. Reported by: tuexen	2018-07-06 13:34:45 +00:00
Brooks Davis	7524b4c14b	Correct breakage on 32-bit platforms from r335979.	2018-07-06 10:03:33 +00:00
Andrew Turner	2bf9501287	Create a new macro for static DPCPU data. On arm64 (and possible other architectures) we are unable to use static DPCPU data in kernel modules. This is because the compiler will generate PC-relative accesses, however the runtime-linker expects to be able to relocate these. In preparation to fix this create two macros depending on if the data is global or static. Reviewed by: bz, emaste, markj Sponsored by: ABT Systems Ltd Differential Revision: https://reviews.freebsd.org/D16140	2018-07-05 17:13:37 +00:00
Brooks Davis	f38b68ae8a	Make struct xinpcb and friends word-size independent. Replace size_t members with ksize_t (uint64_t) and pointer members (never used as pointers in userspace, but instead as unique idenitifiers) with kvaddr_t (uint64_t). This makes the structs identical between 32-bit and 64-bit ABIs. On 64-bit bit systems, the ABI is maintained. On 32-bit systems, this is an ABI breaking change. The ABI of most of these structs was previously broken in r315662. This also imposes a small API change on userspace consumers who must handle kernel pointers becoming virtual addresses. PR: 228301 (exp-run by antoine) Reviewed by: jtl, kib, rwatson (various versions) Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D15386	2018-07-05 13:13:48 +00:00
Hiroki Sato	5ba05d3d0e	- Fix a double unlock in inp_block_unblock_source() and lock leakage in inp_leave_group() which caused a panic. - Make order of CTR1() and IN_MULTI_LIST_LOCK() consistent around inm_merge().	2018-07-04 06:47:34 +00:00
Matt Macy	6573d7580b	epoch(9): allow preemptible epochs to compose - Add tracker argument to preemptible epochs - Inline epoch read path in kernel and tied modules - Change in_epoch to take an epoch as argument - Simplify tfb_tcp_do_segment to not take a ti_locked argument, there's no longer any benefit to dropping the pcbinfo lock and trying to do so just adds an error prone branchfest to these functions - Remove cases of same function recursion on the epoch as recursing is no longer free. - Remove the the TAILQ_ENTRY and epoch_section from struct thread as the tracker field is now stack or heap allocated as appropriate. Tested by: pho and Limelight Networks Reviewed by: kbowling at llnw dot com Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D16066	2018-07-04 02:47:16 +00:00
Matt Macy	99208b820f	inpcb: don't gratuitously defer frees Don't defer frees in sysctl handlers. It isn't necessary and it just confuses things. revert: r333911, r334104, and r334125 Requested by: jtl	2018-07-02 05:19:44 +00:00
Kristof Provost	0d3d234cd1	carp: Set DSCP value CS7 Update carp to set DSCP value CS7(Network Traffic) in the flowlabel field of packets by default. Currently carp only sets TOS_LOWDELAY in IPv4 which was deprecated in 1998. This also implements sysctl that can revert carp back to it's old behavior if desired. This will allow implementation of QOS on modern network devices to make sure carp packets aren't dropped during interface contention. Submitted by: Nick Wolff <darkfiberiru AT gmail.com> Reviewed by: kp, mav (earlier version) Differential Revision: https://reviews.freebsd.org/D14536	2018-07-01 08:37:07 +00:00
Andrey V. Elsukov	6e081509db	Add NULL pointer check. encap_lookup_t method can be invoked by IP encap subsytem even if none of gif/gre/me interfaces are exist. Hash tables are allocated on demand, when first interface is created. So, make NULL pointer check before doing access to hash table. PR: 229378	2018-06-28 11:39:27 +00:00
Gleb Smirnoff	b8ab659396	Check the inp_flags under inp lock. Looks like the race was hidden before, the conversion of tcbinfo to CK_LIST have uncovered it.	2018-06-27 22:01:59 +00:00
Sean Bruno	af4da58655	Enable TCP_FASTOPEN by default for FreeBSD 12. Submitted by: kbowling Reviewed by: tuexen Differential Revision: https://reviews.freebsd.org/D15959	2018-06-24 21:46:29 +00:00
Sean Bruno	45fc0718d8	Reap unused variable and assignment that had no effect. Noted by cross compiling with gcc on mips. Reviewed by: mmacy	2018-06-24 21:36:37 +00:00
Gleb Smirnoff	a00f4ac22f	Revert r334843, and partially revert r335180. tcp_outflags[] were defined since 4BSD and are defined nowadays in all its descendants. Removing them breaks third party application.	2018-06-23 06:53:53 +00:00
Randall Stewart	581a046a8b	This adds in an optimization so that we only walk one time through the mbuf chain during copy and TSO limiting. It is used by both Rack and now the FreeBSD stack. Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D15937	2018-06-21 21:03:58 +00:00
Matt Macy	e93fdbe212	raw_ip: validate inp in both loops Continuation of r335497. Also move the lock acquisition up to validate before referencing inp_cred. Reported by: pho	2018-06-21 20:18:23 +00:00
Matt Macy	3d348772e7	in_pcblookup_hash: validate inp before return Post r335356 it is possible to have an inpcb on the hash lists that is partially torn down. Validate before using. Also as a side effect of this change the lock ordering issue between hash lock and inpcb no longer exists allowing some simplification. Reported by: pho@	2018-06-21 18:40:15 +00:00
Matt Macy	e5c331cf78	raw_ip: validate inp Post r335356 it is possible to have an inpcb on the hash lists that is partially torn down. Validate before using. Reported by: pho	2018-06-21 17:24:10 +00:00
Matt Macy	46374cbf54	udp_ctlinput: don't refer to unpcb after we drop the lock Reported by: pho@	2018-06-21 06:10:52 +00:00
Randall Stewart	c6f76759ca	Make sure that the t_peakrate_thr is not compiled in by default until NF can upstream it. Reviewed by: and suggested lstewart Sponsored by: Netflix Inc.	2018-06-19 11:20:28 +00:00
Randall Stewart	f923a734b3	Move the tp set back to where it was before we started playing with the VNET sets. This way we have verified the INP settings before we go to the trouble of de-referencing it. Reviewed by: and suggested by lstewart Sponsored by: Netflix Inc.	2018-06-19 05:28:14 +00:00
Matt Macy	9e58ff6ff9	convert inpcbinfo hash and info rwlocks to epoch + mutex - Convert inpcbinfo info & hash locks to epoch for read and mutex for write - Garbage collect code that handled INP_INFO_TRY_RLOCK failures as INP_INFO_RLOCK which can no longer fail When running 64 netperfs sending minimal sized packets on a 2x8x2 reduces unhalted core cycles samples in rwlock rlock/runlock in udp_send from 51% to 3%. Overall packet throughput rate limited by CPU affinity and NIC driver design choices. On the receiver unhalted core cycles samples in in_pcblookup_hash went from 13% to to 1.6% Tested by LLNW and pho@ Reviewed by: jtl Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D15686	2018-06-19 01:54:00 +00:00
Randall Stewart	f994ead330	Move to using the inp->vnet pointer has suggested by lstewart. This is far better since the hpts system is using the inp as its basis anyway. Unfortunately his comments came late. Sponsored by: Netflix Inc.	2018-06-18 14:10:12 +00:00
Andrey V. Elsukov	20efcfc602	Switch RIB and RADIX_NODE_HEAD lock from rwlock(9) to rmlock(9). Using of rwlock with multiqueue NICs for IP forwarding on high pps produces high lock contention and inefficient. Rmlock fits better for such workloads. Reviewed by: melifaro, olivier Obtained from: Yandex LLC Sponsored by: Yandex LLC Differential Revision: https://reviews.freebsd.org/D15789	2018-06-16 08:26:23 +00:00
Michael Tuexen	43b223f42e	When retransmitting TCP SYN-ACK segments with the TCP timestamp option enabled use an updated timestamp instead of reusing the one used in the initial TCP SYN-ACK segment. This patch ensures that an updated timestamp is used when sending the SYN-ACK from the syncache code. It was already done if the SYN-ACK was retransmitted from the generic code. This makes the behaviour consistent and also conformant with the TCP specification. Reviewed by: jtl@, Jason Eggleston MFC after: 1 month Sponsored by: Neflix, Inc. Differential Revision: https://reviews.freebsd.org/D15634	2018-06-15 12:28:43 +00:00
Gleb Smirnoff	9293873e83	TCPOUTFLAGS no longer exists since r334843.	2018-06-14 22:25:10 +00:00
Michael Tuexen	33ef123090	Provide the ip6_plen in network byte order when calling ip6_output(). This is not strictly required by ip6_output(), since it overrides it, but it is needed for upcoming dtrace support.	2018-06-14 21:30:52 +00:00
Michael Tuexen	8d86bd564f	Whitespace changes.	2018-06-14 21:22:14 +00:00
Andrey V. Elsukov	eb548a1a5c	In m_megapullup() use m_getjcl() to allocate 9k or 16k mbuf when requested. It is better to try allocate a big mbuf, than just silently drop a big packet. A better solution could be reworking of libalias modules to be able use m_copydata()/m_copyback() instead of requiring the single contiguous buffer. PR: 229006 MFC after: 1 week	2018-06-14 11:15:39 +00:00
Randall Stewart	4aec110f70	This fixes several bugs that Larry Rosenman helped me find in Rack with respect to its handling of TCP Fast Open. Several fixes all related to TFO are included in this commit: 1) Handling of non-TFO retransmissions 2) Building the proper send-map when we are doing TFO 3) Dealing with the ack that comes back that includes the SYN and data. It appears that with this commit TFO now works :-) Thanks Larry for all your help!! Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D15758	2018-06-14 03:27:42 +00:00
Matt Macy	feeef8509b	Fix PCBGROUPS build post CK conversion of pcbinfo	2018-06-13 23:19:54 +00:00
Andrey V. Elsukov	a5185adeb6	Rework if_gre(4) to use encap_lookup_t method to speedup lookup of needed interface when many gre interfaces are present. Remove rmlock from gre_softc, use epoch(9) and CK_LIST instead. Move more AF-related code into AF-related locations. Use hash table to speedup lookup of needed softc.	2018-06-13 11:11:33 +00:00
Matt Macy	483305b99c	Handle INP_FREED when looking up an inpcb When hash table lookups are not serialized with in_pcbfree it will be possible for callers to find an inpcb that has been marked free. We need to check for this and return NULL.	2018-06-13 04:23:49 +00:00
Randall Stewart	c9b4ac7587	This fixes missing VNET sets in the hpts system. Basically without this and running vnets with a TCP stack that uses some of the features is a recipe for panic (without this commit). Reported by: Larry Rosenman Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D15757	2018-06-12 23:54:08 +00:00
Matt Macy	700e893c34	Defer inpcbport free in in_pcbremlists as well	2018-06-12 23:26:25 +00:00
Matt Macy	f09ee4fc01	Defer inpcbport free until after a grace period has elapsed This is a dependency for inpcbinfo rlock conversion to epoch	2018-06-12 22:18:27 +00:00
Matt Macy	b872626dbe	mechanical CK macro conversion of inpcbinfo lists This is a dependency for converting the inpcbinfo hash and info rlocks to epoch.	2018-06-12 22:18:20 +00:00
Matt Macy	addf2b2009	Defer inpcb deletion until after a grace period has elapsed Deferring the actual free of the inpcb until after a grace period has elapsed will allow us to convert the inpcbinfo info and hash read locks to epoch. Reviewed by: gallatin, jtl Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D15510	2018-06-12 22:18:15 +00:00
Jonathan T. Looney	cff21e484b	Change RACK dependency on TCPHPTS from a build-time dependency to a load- time dependency. At present, RACK requires the TCPHPTS option to run. However, because modules can be moved from machine to machine, this dependency is really best assessed at load time rather than at build time. Reviewed by: rrs Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D15756	2018-06-11 14:27:19 +00:00
Matt Macy	3db28e6656	avoid 'tcp_outflags defined but not used'	2018-06-08 17:37:49 +00:00
Matt Macy	afbd6cfa72	hpts: remove redundant decl breaking gcc build	2018-06-08 17:37:43 +00:00
Randall Stewart	89e560f441	This commit brings in a new refactored TCP stack called Rack. Rack includes the following features: - A different SACK processing scheme (the old sack structures are not used). - RACK (Recent acknowledgment) where counting dup-acks is no longer done instead time is used to knwo when to retransmit. (see the I-D) - TLP (Tail Loss Probe) where we will probe for tail-losses to attempt to try not to take a retransmit time-out. (see the I-D) - Burst mitigation using TCPHTPS - PRR (partial rate reduction) see the RFC. Once built into your kernel, you can select this stack by either socket option with the name of the stack is "rack" or by setting the global sysctl so the default is rack. Note that any connection that does not support SACK will be kicked back to the "default" base FreeBSD stack (currently known as "default"). To build this into your kernel you will need to enable in your kernel: makeoptions WITH_EXTRA_TCP_STACKS=1 options TCPHPTS Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D15525	2018-06-07 18:18:13 +00:00
Michael Tuexen	ff34bbe9c2	Improve compliance with RFC 4895 and RFC 6458. Silently dicard SCTP chunks which have been requested to be authenticated but are received unauthenticated no matter if support for SCTP authentication has been negotiated. This improves compliance with RFC 4895. When the application uses the SCTP_AUTH_CHUNK socket option to request a chunk to be received in an authenticated way, enable the SCTP authentication extension for the end-point. This improves compliance with RFC 6458. Discussed with: Peter Lei MFC after: 3 days	2018-06-06 19:27:06 +00:00
Sean Bruno	1a43cff92a	Load balance sockets with new SO_REUSEPORT_LB option. This patch adds a new socket option, SO_REUSEPORT_LB, which allow multiple programs or threads to bind to the same port and incoming connections will be load balanced using a hash function. Most of the code was copied from a similar patch for DragonflyBSD. However, in DragonflyBSD, load balancing is a global on/off setting and can not be set per socket. This patch allows for simultaneous use of both the current SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system. Required changes to structures: Globally change so_options from 16 to 32 bit value to allow for more options. Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets. Limitations: As DragonflyBSD, a load balance group is limited to 256 pcbs (256 programs or threads sharing the same socket). This is a substantially different contribution as compared to its original incarnation at svn r332894 and reverted at svn r332967. Thanks to rwatson@ for the substantive feedback that is included in this commit. Submitted by: Johannes Lundberg <johalun0@gmail.com> Obtained from: DragonflyBSD Relnotes: Yes Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D11003	2018-06-06 15:45:57 +00:00
Andrey V. Elsukov	590d0a43b6	Make in_delayed_cksum() be similar to IPv6 implementation. Use m_copyback() function to write checksum when it isn't located in the first mbuf of the chain. Handmade analog doesn't handle the case when parts of checksum are located in different mbufs. Also in case when mbuf is too short, m_copyback() will allocate new mbuf in the chain instead of making out of bounds write. Also wrap long line and remove now useless KASSERTs. X-MFC after: r334705	2018-06-06 13:01:53 +00:00
Tom Jones	1fdbfb909f	Use UDP len when calculating UDP checksums The length of the IP payload is normally equal to the UDP length, UDP Options (draft-ietf-tsvwg-udp-options-02) suggests using the difference between IP length and UDP length to create space for trailing data. Correct checksum length calculation to use the UDP length rather than the IP length when not offloading UDP checksums. Approved by: jtl (mentor) Differential Revision: https://reviews.freebsd.org/D15222	2018-06-06 07:04:40 +00:00
Andrey V. Elsukov	b941bc1d6e	Rework if_gif(4) to use new encap_lookup_t method to speedup lookup of needed interface when many gif interfaces are present. Remove rmlock from gif_softc, use epoch(9) and CK_LIST instead. Move more AF-related code into AF-related locations. Use hash table to speedup lookup of needed softc. Interfaces with GIF_IGNORE_SOURCE flag are stored in plain CK_LIST. Sysctl net.link.gif.parallel_tunnels is removed. The removal was planed 16 years ago, and actually it could work only for outbound direction. Each protocol, that can be handled by if_gif(4) interface is registered by separate encap handler, this helps avoid invoking the handler for unrelated protocols (GRE, PIM, etc.). This change allows dramatically improve performance when many gif(4) interfaces are used. Sponsored by: Yandex LLC	2018-06-05 21:24:59 +00:00
Andrey V. Elsukov	6d8fdfa9d5	Rework IP encapsulation handling code. Currently it has several disadvantages: - it uses single mutex to protect internal structures. It is used by data- and control- path, thus there are no parallelism at all. - it uses single list to keep encap handlers for both INET and INET6 families. - struct encaptab keeps unneeded information (src, dst, masks, protosw), that isn't used by code in the source tree. - matches are prioritized and when many tunneling interfaces are registered, encapcheck handler of each interface is invoked for each packet. The search takes O(n) for n interfaces. All this work is done with exclusive lock held. What this patch includes: - the datapath is converted to be lockless using epoch(9) KPI. - struct encaptab now linked using CK_LIST. - all unused fields removed from struct encaptab. Several new fields addedr: min_length is the minimum packet length, that encapsulation handler expects to see; exact_match is maximum number of bits, that can return an encapsulation handler, when it wants to consume a packet. - IPv6 and IPv4 handlers are stored in separate lists; - added new "encap_lookup_t" method, that will be used later. It is targeted to speedup lookup of needed interface, when gif(4)/gre(4) have many interfaces. - the need to use protosw structure is eliminated. The only pr_input method was used from this structure, so I don't see the need to keep using it. - encap_input_t method changed to avoid using mbuf tags to store softc pointer. Now it is passed directly trough encap_input_t method. encap_getarg() funtions is removed. - all sockaddr structures and code that uses them removed. We don't have any code in the tree that uses them. All consumers use encap_attach_func() method, that relies on invoking of encapcheck() to determine the needed handler. - introduced struct encap_config, it contains parameters of encap handler that is going to be registered by encap_attach() function. - encap handlers are stored in lists ordered by exact_match value, thus handlers that need more bits to match will be checked first, and if encapcheck method returns exact_match value, the search will be stopped. - all current consumers changed to use new KPI. Reviewed by: mmacy Sponsored by: Yandex LLC Differential Revision: https://reviews.freebsd.org/D15617	2018-06-05 20:51:01 +00:00
Mateusz Guzik	34c538c356	malloc: try to use builtins for zeroing at the callsite Plenty of allocation sites pass M_ZERO and sizes which are small and known at compilation time. Handling them internally in malloc loses this information and results in avoidable calls to memset. Instead, let the compiler take the advantage of it whenever possible. Discussed with: jeff	2018-06-02 22:20:09 +00:00
Michael Tuexen	13500cbb61	Don't overflow a buffer if we receive an INIT or INIT-ACK chunk without a RANDOM parameter but with a CHUNKS or HMAC-ALGO parameter. Please note that sending this combination violates the specification. Thnanks to Ronald E. Crane for reporting the issue for the userland stack. MFC after: 3 days	2018-06-02 16:28:10 +00:00
Michael Tuexen	c14f9fe5ef	Limit the retransmission timer for SYN-ACKs by TCPTV_REXMTMAX. Use the same logic to handle the SYN-ACK retransmission when sent from the syn cache code as when sent from the main code. MFC after: 3 days Sponsored by: Netflix, Inc.	2018-06-01 21:24:27 +00:00
Michael Tuexen	badef00d58	Ensure net.inet.tcp.syncache.rexmtlimit is limited by TCP_MAXRXTSHIFT. If the sysctl variable is set to a value larger than TCP_MAXRXTSHIFT+1, the array tcp_syn_backoff[] is accessed out of bounds. Discussed with: jtl@ MFC after: 3 days Sponsored by: Netflix, Inc.	2018-06-01 19:58:19 +00:00
Andrey V. Elsukov	12e7376216	Remove empty encap_init() function. MFC after: 2 weeks	2018-05-29 12:32:08 +00:00
Michael Tuexen	eef8d4a973	Use correct mask. Introduced in https://svnweb.freebsd.org/changeset/base/333603. Thanks to Irene Ruengler for testing and reporting the issue. MFC after: 1 week X-MFC-with: 333603	2018-05-28 13:31:47 +00:00
Matt Macy	62d733a11d	in_pcbladdr: remove debug code that snuck in with ifa epoch conversion r334118	2018-05-27 06:47:09 +00:00
Matt Macy	0f8d79d977	CK: update consumers to use CK macros across the board r334189 changed the fields to have names distinct from those in queue.h in order to expose the oversights as compile time errors	2018-05-24 23:21:23 +00:00
Matt Macy	fe524329e4	convert allocations to INVARIANTS M_ZERO	2018-05-24 01:04:56 +00:00
Matt Macy	4f6c66cc9c	UDP: further performance improvements on tx Cumulative throughput while running 64 netperf -H $DUT -t UDP_STREAM -- -m 1 on a 2x8x2 SKL went from 1.1Mpps to 2.5Mpps Single stream throughput increases from 910kpps to 1.18Mpps Baseline: https://people.freebsd.org/~mmacy/2018.05.11/udpsender2.svg - Protect read access to global ifnet list with epoch https://people.freebsd.org/~mmacy/2018.05.11/udpsender3.svg - Protect short lived ifaddr references with epoch https://people.freebsd.org/~mmacy/2018.05.11/udpsender4.svg - Convert if_afdata read lock path to epoch https://people.freebsd.org/~mmacy/2018.05.11/udpsender5.svg A fix for the inpcbhash contention is pending sufficient time on a canary at LLNW. Reviewed by: gallatin Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D15409	2018-05-23 21:02:14 +00:00
Matt Macy	630ba2c514	udp: assign flowid to udp sockets round-robin On a 2x8x2 SKL this increases measured throughput with 64 netperf -H $DUT -t UDP_STREAM -- -m 1 from 590kpps to 1.1Mpps before: https://people.freebsd.org/~mmacy/2018.05.11/udpsender.svg after: https://people.freebsd.org/~mmacy/2018.05.11/udpsender2.svg	2018-05-23 20:50:09 +00:00
Matt Macy	246a619924	epoch: allow for conditionally asserting that the epoch context fields are unused by zeroing on INVARIANTS builds	2018-05-23 17:00:05 +00:00
Mark Johnston	9f78e2b83d	Initialize the dumper struct before calling set_dumper(). Fields owned by the generic code were being left uninitialized, causing problems in clear_dumper() if an error occurred. Coverity CID: 1391200 X-MFC with: r333283	2018-05-22 16:01:56 +00:00
Matt Macy	f42a83f2a6	inpcb: revert deferred inpcb free pending further review	2018-05-21 16:13:43 +00:00
Michael Tuexen	c692df45fc	Only fillin data srucuture when actually stored.	2018-05-21 14:53:22 +00:00
Michael Tuexen	d3132db2b5	Do the appropriate accounting when ip_output() fails.	2018-05-21 14:52:18 +00:00
Michael Tuexen	95844fce7d	Make clear why there is an assignment, which is not necessary.	2018-05-21 14:51:20 +00:00
Ed Maste	3b9b6b1704	Pair CURVNET_SET and CURVNET_RESTORE in a block Per vnet(9), CURVNET_SET and CURVNET_RESTORE cannot be used as a single statement for a conditional and CURVNET_RESTORE must be in the same block as CURVNET_SET (or a subblock). Reviewed by: andrew Sponsored by: The FreeBSD Foundation	2018-05-21 13:08:44 +00:00
Ed Maste	15f8acc53f	Revert r333968, it broke all archs but i386 and amd64	2018-05-21 11:56:07 +00:00
Matt Macy	ed6bb714b2	in(6)_mcast: Expand out vnet set / restore macro so that they work in a conditional block Reported by: zec at fer.hr	2018-05-21 08:34:10 +00:00
Matt Macy	06b15160e1	ensure that vnet is set when doing in_leavegroup	2018-05-21 07:12:06 +00:00
Matt Macy	1a3d880c26	in(s)_moptions: free before tearing down inpcb	2018-05-20 20:08:21 +00:00
Matt Macy	47d2a58560	inpcb: defer destruction of inpcb until after a grace period has elapsed in_pcbfree will remove the incpb from the list and release the rtentry while the vnet is set, but the actual destruction will be deferred until any threads in a (not yet used) epoch section, no longer potentially have references.	2018-05-20 04:38:04 +00:00
Matt Macy	056b40e29c	inpcb: consolidate possible deletion in pcblist functions in to epoch deferred context.	2018-05-20 02:27:58 +00:00
Matt Macy	ddece765f3	in_pcb: add helper for deferring inpcb rele calls from list functions	2018-05-20 02:17:30 +00:00
Matt Macy	cb6bb2303e	ip(6)_freemoptions: defer imo destruction to epoch callback task Avoid the ugly unlock / lock of the inpcbinfo where we need to figure out what kind of lock we hold by simply deferring the operation to another context. (Also a small dependency for converting the pcbinfo read lock to epoch)	2018-05-20 00:22:28 +00:00
Matt Macy	f6960e207e	netinet silence warnings	2018-05-19 05:56:21 +00:00
Matt Macy	7a3c5b05e4	tcp sysctl fix may be uninitialized	2018-05-19 05:55:31 +00:00
Matt Macy	21f6fe2d7c	tcp fastopen: fix may be uninitialized	2018-05-19 05:55:00 +00:00
Matt Macy	d7c5a620e2	ifnet: Replace if_addr_lock rwlock with epoch + mutex Run on LLNW canaries and tested by pho@ gallatin: Using a 14-core, 28-HTT single socket E5-2697 v3 with a 40GbE MLX5 based ConnectX 4-LX NIC, I see an almost 12% improvement in received packet rate, and a larger improvement in bytes delivered all the way to userspace. When the host receiving 64 streams of netperf -H $DUT -t UDP_STREAM -- -m 1, I see, using nstat -I mce0 1 before the patch: InMpps OMpps InGbs OGbs err TCP Est %CPU syscalls csw irq GBfree 4.98 0.00 4.42 0.00 4235592 33 83.80 4720653 2149771 1235 247.32 4.73 0.00 4.20 0.00 4025260 33 82.99 4724900 2139833 1204 247.32 4.72 0.00 4.20 0.00 4035252 33 82.14 4719162 2132023 1264 247.32 4.71 0.00 4.21 0.00 4073206 33 83.68 4744973 2123317 1347 247.32 4.72 0.00 4.21 0.00 4061118 33 80.82 4713615 2188091 1490 247.32 4.72 0.00 4.21 0.00 4051675 33 85.29 4727399 2109011 1205 247.32 4.73 0.00 4.21 0.00 4039056 33 84.65 4724735 2102603 1053 247.32 After the patch InMpps OMpps InGbs OGbs err TCP Est %CPU syscalls csw irq GBfree 5.43 0.00 4.20 0.00 3313143 33 84.96 5434214 1900162 2656 245.51 5.43 0.00 4.20 0.00 3308527 33 85.24 5439695 1809382 2521 245.51 5.42 0.00 4.19 0.00 3316778 33 87.54 5416028 1805835 2256 245.51 5.42 0.00 4.19 0.00 3317673 33 90.44 5426044 1763056 2332 245.51 5.42 0.00 4.19 0.00 3314839 33 88.11 5435732 1792218 2499 245.52 5.44 0.00 4.19 0.00 3293228 33 91.84 5426301 1668597 2121 245.52 Similarly, netperf reports 230Mb/s before the patch, and 270Mb/s after the patch Reviewed by: gallatin Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D15366	2018-05-18 20:13:34 +00:00
Mark Johnston	b35822d9d0	Fix netdump configuration when VIMAGE is enabled. We need to set the current vnet before iterating over the global interface list. Because the dump device may only be set from the host, only proceed with configuration if the thread belongs to the default vnet. [1] Also fix a resource leak that occurs if the priv_check() in set_dumper() fails. Reported by: mmacy, sbruno [1] Reviewed by: sbruno X-MFC with: r333283 Differential Revision: https://reviews.freebsd.org/D15449	2018-05-17 04:08:57 +00:00
Lawrence Stewart	9891578a40	Plug a memory leak and potential NULL-pointer dereference introduced in r331214. Each TCP connection that uses the system default cc_newreno(4) congestion control algorithm module leaks a "struct newreno" (8 bytes of memory) at connection initialisation time. The NULL-pointer dereference is only germane when using the ABE feature, which is disabled by default. While at it: - Defer the allocation of memory until it is actually needed given that ABE is optional and disabled by default. - Document the ENOMEM errno in getsockopt(2)/setsockopt(2). - Document ENOMEM and ENOBUFS in tcp(4) as being synonymous given that they are used interchangeably throughout the code. - Fix a few other nits also accidentally omitted from the original patch. Reported by: Harsh Jain on freebsd-net@ Tested by: tjh@ Differential Revision: https://reviews.freebsd.org/D15358	2018-05-17 02:46:27 +00:00
Brooks Davis	514ef08caf	Unwrap a line that no longer requires wrapping.	2018-05-15 20:14:38 +00:00
Brooks Davis	423349f696	Remove stray tabs from in_lltable_dump_entry().	2018-05-15 20:13:00 +00:00
Stephen Hurd	f2cf90e264	Check that ifma_protospec != NULL in inm_lookup If ifma_protospec is NULL when inm_lookup() is called, there is a dereference in a NULL struct pointer. This ensures that struct is not NULL before comparing the address. Reported by: dumbbell Reviewed by: sbruno Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D15440	2018-05-15 16:54:41 +00:00
Michael Tuexen	482d420926	sctp_get_mbuf_for_msg() should honor the allinone parameter. When it is not required that the buffer is not a chain, return a chain. This is based on a patch provided by Irene Ruengeler.	2018-05-14 15:16:51 +00:00
Michael Tuexen	589c42c2c8	Ensure that the MTU's used are multiple of 4. The length of SCTP packets is always a multiple of 4. Therefore, ensure that the MTUs used are also a multiple of 4. Thanks to Irene Ruengeler for providing an earlier version of this patch. MFC after: 1 week	2018-05-14 13:50:17 +00:00
Stephen Hurd	b69888c28f	Fix LORs in in6?_leave_group() r333175 updated the join_group functions, but not the leave_group ones. Reviewed by: sbruno Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D15393	2018-05-11 21:42:27 +00:00
Warner Losh	33123867af	Use the full year, for real this time.	2018-05-09 20:26:37 +00:00
Warner Losh	603bbd0631	Minor style nits Use full copyright year. Remove 'All Rights Reserved' from new file (rights holder OK'd) Minor #ifdef motion and #endif tagging Remove __FBSDID macro from comments Sponsored by: Netflix OK'd by: rrs@	2018-05-09 14:11:35 +00:00
Michael Tuexen	45d41de5e6	Fix two typos reported by N. J. Mann, which were introduced in https://svnweb.freebsd.org/changeset/base/333382 by me. MFC after: 3 days	2018-05-08 20:39:35 +00:00
Michael Tuexen	9669e724d1	When reporting ERROR or ABORT chunks, don't use more data that is guaranteed to be contigous. Thanks to Felix Weinrank for finding and reporting this bug by fuzzing the usrsctp stack. MFC after: 3 days	2018-05-08 18:48:51 +00:00
Matt Macy	10d20c84ed	Fix spurious retransmit recovery on low latency networks TCP's smoothed RTT (SRTT) can be much larger than an actual observed RTT. This can be either because of hz restricting the calculable RTT to 10ms in VMs or 1ms using the default 1000hz or simply because SRTT recently incorporated a larger value. If an ACK arrives before the calculated badrxtwin (now + SRTT): tp->t_badrxtwin = ticks + (tp->t_srtt >> (TCP_RTT_SHIFT + 1)); We'll erroneously reset snd_una to snd_max. If multiple segments were dropped and this happens repeatedly the transmit rate will be limited to 1MSS per RTO until we've retransmitted all drops. Reported by: rstone Reviewed by: hiren, transport Approved by: sbruno MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D8556	2018-05-08 02:22:34 +00:00
Alexander Motin	167a34407c	Keep CARP state as INIT when net.inet.carp.allow=0. Currently when net.inet.carp.allow=0 CARP state remains as MASTER, which is not very useful (if there are other masters -- it can lead to split brain, if there are none -- it makes no sense). Having it as INIT makes it clear that carp packets are disabled. Submitted by: wg MFC after: 1 month Relnotes: yes Sponsored by: iXsystems, Inc. Differential Revision: https://reviews.freebsd.org/D14477	2018-05-07 14:44:55 +00:00
Matt Macy	b6f6f88018	r333175 introduced deferred deletion of multicast addresses in order to permit the driver ioctl to sleep on commands to the NIC when updating multicast filters. More generally this permitted driver's to use an sx as a softc lock. Unfortunately this change introduced a race whereby a a multicast update would still be queued for deletion when ifconfig deleted the interface thus calling down in to _purgemaddrs and synchronously deleting _all_ of the multicast addresses on the interface. Synchronously remove all external references to a multicast address before enqueueing for delete. Reported by: lwhsu Approved by: sbruno	2018-05-06 20:34:13 +00:00
Michael Tuexen	67e8b08bbe	Ensure we are not dereferencing a NULL pointer. This was found by Coverity scanning the usrsctp stack (CID 203808). MFC after: 3 days	2018-05-06 14:19:50 +00:00
Mark Johnston	e505460228	Import the netdump client code. This is a component of a system which lets the kernel dump core to a remote host after a panic, rather than to a local storage device. The server component is available in the ports tree. netdump is particularly useful on diskless systems. The netdump(4) man page contains some details describing the protocol. Support for configuring netdump will be added to dumpon(8) in a future commit. To use netdump, the kernel must have been compiled with the NETDUMP option. The initial revision of netdump was written by Darrell Anderson and was integrated into Sandvine's OS, from which this version was derived. Reviewed by: bdrewery, cem (earlier versions), julian, sbruno MFC after: 1 month X-MFC note: use a spare field in struct ifnet Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D15253	2018-05-06 00:38:29 +00:00
Matt Macy	28c001002a	Currently in_pcbfree will unconditionally wunlock the pcbinfo lock to avoid a LOR on the multicast list lock in the freemoptions routines. As it turns out, tcp_usr_detach can acquire the tcbinfo lock readonly. Trying to wunlock the pcbinfo lock in that context has caused a number of reported crashes. This change unclutters in_pcbfree and moves the handling of wunlock vs runlock of pcbinfo to the freemoptions routine. Reported by: mjg@, bde@, o.hartmann at walstatt.org Approved by: sbruno	2018-05-05 22:40:40 +00:00
Andrey V. Elsukov	5ada542398	Immediately propagate EACCES error code to application from tcp_output. In r309610 and r315514 the behavior of handling EACCES was changed, and tcp_output() now returns zero when EACCES happens. The reason of this change was a hesitation that applications that use TCP-MD5 will be affected by changes in project/ipsec. TCP-MD5 code returns EACCES when security assocition for given connection is not configured. But the same error code can return pfil(9), and this change has affected connections blocked by pfil(9). E.g. application doesn't return immediately when SYN segment is blocked, instead it waits when several tries will be failed. Actually, for TCP-MD5 application it doesn't matter will it get EACCES after first SYN, or after several tries. Security associtions must be configured before initiating TCP connection. I left the EACCES in the switch() to show that it has special handling. Reported by: Andreas Longwitz <longwitz at incore dot de> MFC after: 10 days	2018-05-04 09:28:12 +00:00
Sean Bruno	68ff29affa	cc_cubic: - Update cubic parameters to draft-ietf-tcpm-cubic-04 Submitted by: Matt Macy <mmacy@mattmacy.io> Reviewed by: lstewart Differential Revision: https://reviews.freebsd.org/D10556	2018-05-03 15:01:27 +00:00
Michael Tuexen	4c6a10903f	SImplify the call to tcp_drop(), since the handling of soft error is also done in tcp_drop(). No functional change. Sponsored by: Netflix, Inc.	2018-05-02 20:04:31 +00:00
Stephen Hurd	f3e1324b41	Separate list manipulation locking from state change in multicast Multicast incorrectly calls in to drivers with a mutex held causing drivers to have to go through all manner of contortions to use a non sleepable lock. Serialize multicast updates instead. Submitted by: mmacy <mmacy@mattmacy.io> Reviewed by: shurd, sbruno Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D14969	2018-05-02 19:36:29 +00:00
Randall Stewart	803a230592	This change re-arranges the fields within the tcp-pcb so that they are more in order of cache line use as one passes through the tcp_input/output paths (non-errors most likely path). This helps speed up cache line optimization so that the tcp stack runs a bit more efficently. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D15136	2018-04-26 21:41:16 +00:00
Sean Bruno	7875017ca9	Revert r332894 at the request of the submitter. Submitted by: Johannes Lundberg <johalun0_gmail.com> Sponsored by: Limelight Networks	2018-04-24 19:55:12 +00:00
Sean Bruno	7b7796eea5	Load balance sockets with new SO_REUSEPORT_LB option This patch adds a new socket option, SO_REUSEPORT_LB, which allow multiple programs or threads to bind to the same port and incoming connections will be load balanced using a hash function. Most of the code was copied from a similar patch for DragonflyBSD. However, in DragonflyBSD, load balancing is a global on/off setting and can not be set per socket. This patch allows for simultaneous use of both the current SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system. Required changes to structures Globally change so_options from 16 to 32 bit value to allow for more options. Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets. Limitations As DragonflyBSD, a load balance group is limited to 256 pcbs (256 programs or threads sharing the same socket). Submitted by: Johannes Lundberg <johanlun0@gmail.com> Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D11003	2018-04-23 19:51:00 +00:00
Randall Stewart	fd389e7cd5	These two modules need the tcp_hpts.h file for when the option is enabled (not sure how LINT/build-universe missed this) opps. Sponsored by: Netflix Inc	2018-04-19 15:03:48 +00:00
Randall Stewart	3ee9c3c4eb	This commit brings in the TCP high precision timer system (tcp_hpts). It is the forerunner/foundational work of bringing in both Rack and BBR which use hpts for pacing out packets. The feature is optional and requires the TCPHPTS option to be enabled before the feature will be active. TCP modules that use it must assure that the base component is compile in the kernel in which they are loaded. MFC after: Never Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D15020	2018-04-19 13:37:59 +00:00
Brooks Davis	3a4fc8a8a1	Remove support for the Arcnet protocol. While Arcnet has some continued deployment in industrial controls, the lack of drivers for any of the PCI, USB, or PCIe NICs on the market suggests such users aren't running FreeBSD. Evidence in the PR database suggests that the cm(4) driver (our sole Arcnet NIC) was broken in 5.0 and has not worked since. PR: 182297 Reviewed by: jhibbits, vangyzen Relnotes: yes Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D15057	2018-04-13 21:18:04 +00:00
Brooks Davis	0437c8e3b1	Remove support for FDDI networks. Defines in net/if_media.h remain in case code copied from ifconfig is in use elsewere (supporting non-existant media type is harmless). Reviewed by: kib, jhb Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D15017	2018-04-11 17:28:24 +00:00
Jonathan T. Looney	f897951965	Add missing header change from r332381. Sponsored by: Netflix, Inc. Pointy hat: jtl	2018-04-10 17:00:37 +00:00
Jonathan T. Looney	8b8718b5b0	Modify the net.inet.tcp.function_ids sysctl introduced in r331347. Export additional information which may be helpful to userspace consumers and rename the sysctl to net.inet.tcp.function_info. Sponsored by: Netflix, Inc.	2018-04-10 16:59:36 +00:00
Jonathan T. Looney	dcaffbd6fb	Move the TCP Blackbox Recorder probe in tcp_output.c to be with the other tracing/debugging code. Sponsored by: Netflix, Inc.	2018-04-10 15:54:29 +00:00
Jonathan T. Looney	4889b58ce8	Clean up some debugging code left in tcp_log_buf.c from r331347. Sponsored by: Netflix, Inc.	2018-04-10 15:51:37 +00:00
Michael Tuexen	efcf28ef77	Fix a logical inversion bug. Thanks to Irene Ruengeler for finding and reporting this bug. MFC after: 3 days	2018-04-08 12:08:20 +00:00
Michael Tuexen	3bcfd10fe1	Small cleanup, no functional change. MFC after: 3 days	2018-04-08 11:50:06 +00:00
Michael Tuexen	c3115feb7e	Fix a signed/unsigned warning showing up for the userland stack on some platforms. Thanks to Felix Weinrank for reporting the issue. MFC after:i 3 days	2018-04-08 11:37:00 +00:00
Brooks Davis	6469bdcdb6	Move most of the contents of opt_compat.h to opt_global.h. opt_compat.h is mentioned in nearly 180 files. In-progress network driver compabibility improvements may add over 100 more so this is closer to "just about everywhere" than "only some files" per the guidance in sys/conf/options. Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of sys/compat/linux/*.c. A fake _COMPAT_LINUX option ensure opt_compat.h is created on all architectures. Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the set of compiled files. Reviewed by: kib, cem, jhb, jtl Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D14941	2018-04-06 17:35:35 +00:00
Jonathan T. Looney	8fa799bd74	If a user closes the socket before we call tcp_usr_abort(), then tcp_drop() may unlock the INP. Currently, tcp_usr_abort() does not check for this case, which results in a panic while trying to unlock the already-unlocked INP (not to mention, a use-after-free violation). Make tcp_usr_abort() check the return value of tcp_drop(). In the case where tcp_drop() returns NULL, tcp_usr_abort() can skip further steps to abort the connection and simply unlock the INP_INFO lock prior to returning. Reviewed by: glebius MFC after: 2 weeks Sponsored by: Netflix, Inc.	2018-04-06 17:20:37 +00:00
Jonathan T. Looney	45a48d07c3	Check that in_pcbfree() is only called once for each PCB. If that assumption is violated, "bad things" could follow. I believe such an assert would have detected some of the problems jch@ was chasing in PR 203175 (see r307551). We also use it in our internal TCP development efforts. And, in case a bug does slip through to released code, this change silently ignores subsequent calls to in_pcbfree(). Reviewed by: rrs Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D14990	2018-04-06 16:48:11 +00:00
Ed Maste	c73b6f4da9	Fix kernel memory disclosure in tcp_ctloutput strcpy was used to copy a string into a buffer copied to userland, which left uninitialized data after the terminating 0-byte. Use the same approach as in tcp_subr.c: strncpy and explicit '\0'. admbugs: 765, 822 MFC after: 1 day Reported by: Ilja Van Sprundel <ivansprundel@ioactive.com> Reported by: Vlad Tsyrklevich Security: Kernel memory disclosure Sponsored by: The FreeBSD Foundation	2018-04-04 21:12:35 +00:00
Jonathan T. Looney	4b9dc36454	r330675 introduced an extra window check in the LRO code to ensure it captured and reported the highest window advertisement with the same SEQ/ACK. However, the window comparison uses modulo 216 math, rather than directly comparing the absolute values. Because windows use absolute values and not modulo 216 math (i.e. they don't wrap), we need to compare the absolute values. Reviewed by: gallatin MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D14937	2018-04-03 13:54:38 +00:00
Navdeep Parhar	a64564109a	Add a hook to allow the toedev handling an offloaded connection to provide accurate TCP_INFO. Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D14816	2018-04-03 01:08:54 +00:00
Brooks Davis	541d96aaaf	Use an accessor function to access ifr_data. This fixes 32-bit compat (no ioctl command defintions are required as struct ifreq is the same size). This is believed to be sufficent to fully support ifconfig on 32-bit systems. Reviewed by: kib Obtained from: CheriBSD MFC after: 1 week Relnotes: yes Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D14900	2018-03-30 18:50:13 +00:00
Navdeep Parhar	b9a9e8e9bd	Fix RSS build (broken in r331309). Sponsored by: Chelsio Communications	2018-03-29 19:48:17 +00:00
Brooks Davis	69f0fecbd6	Remove infrastructure for token-ring networks. Reviewed by: cem, imp, jhb, jmallett Relnotes: yes Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D14875	2018-03-28 23:33:26 +00:00
Sean Bruno	e2041bfa7c	CC Cubic: fix underflow for cubic_cwnd() Singed calculations in cubic_cwnd() can result in negative cwnd value which is then cast to an unsigned value. Values less than 1 mss are generally bad for other parts of the code, also fixed. Submitted by: Jason Eggleston <jason@eggnet.com> Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D14141	2018-03-26 19:53:36 +00:00
Jonathan T. Looney	e24e568336	Make the TCP blackbox code committed in r331347 be an optional feature controlled by the TCP_BLACKBOX option. Enable this as part of amd64 GENERIC. For now, leave it disabled on other platforms. Sponsored by: Netflix, Inc.	2018-03-24 12:48:10 +00:00
Jonathan T. Looney	9959c8b9e6	Fix compilation for platforms that don't support atomic_fetchadd_64() after r331347. Reported by: avg, br, jhibbits Sponsored by: Netflix, Inc.	2018-03-24 12:40:45 +00:00
Sean Bruno	72bfa0bf63	Revert r331379 as the "simple" lock changes have revealed a deeper problem and need for a rethink. Submitted by: Jason Eggleston <jason@eggnet.com> Sponsored by: Limelight Networks	2018-03-23 18:34:38 +00:00
Kristof Provost	effaab8861	netpfil: Introduce PFIL_FWD flag Forwarded packets passed through PFIL_OUT, which made it difficult for firewalls to figure out if they were forwarding or producing packets. This in turn is an issue for pf for IPv6 fragment handling: it needs to call ip6_output() or ip6_forward() to handle the fragments. Figuring out which was difficult (and until now, incorrect). Having pfil distinguish the two removes an ugly piece of code from pf. Introduce a new variant of the netpfil callbacks with a flags variable, which has PFIL_FWD set for forwarded packets. This allows pf to reliably work out if a packet is forwarded. Reviewed by: ae, kevans Differential Revision: https://reviews.freebsd.org/D13715	2018-03-23 16:56:44 +00:00
Sean Bruno	2a499acf59	Simple locking fixes in ip_ctloutput, ip6_ctloutput, rip_ctloutput. Submitted by: Jason Eggleston <jason@eggnet.com> Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D14624	2018-03-22 22:29:32 +00:00
Jonathan T. Looney	2529f56ed3	Add the "TCP Blackbox Recorder" which we discussed at the developer summits at BSDCan and BSDCam in 2017. The TCP Blackbox Recorder allows you to capture events on a TCP connection in a ring buffer. It stores metadata with the event. It optionally stores the TCP header associated with an event (if the event is associated with a packet) and also optionally stores information on the sockets. It supports setting a log ID on a TCP connection and using this to correlate multiple connections that share a common log ID. You can log connections in different modes. If you are doing a coordinated test with a particular connection, you may tell the system to put it in mode 4 (continuous dump). Or, if you just want to monitor for errors, you can put it in mode 1 (ring buffer) and dump all the ring buffers associated with the connection ID when we receive an error signal for that connection ID. You can set a default mode that will be applied to a particular ratio of incoming connections. You can also manually set a mode using a socket option. This commit includes only basic probes. rrs@ has added quite an abundance of probes in his TCP development work. He plans to commit those soon. There are user-space programs which we plan to commit as ports. These read the data from the log device and output pcapng files, and then let you analyze the data (and metadata) in the pcapng files. Reviewed by: gnn (previous version) Obtained from: Netflix, Inc. Relnotes: yes Differential Revision: https://reviews.freebsd.org/D11085	2018-03-22 09:40:08 +00:00
Gleb Smirnoff	3108a71a34	Fix LINT-NOINET build initializing local to false. This is a dead code, since for NOINET build isipv6 is always true, but this dead code makes it compilable. Reported by: rpokala	2018-03-22 05:07:57 +00:00
Gleb Smirnoff	dd388cfd9b	The net.inet.tcp.nolocaltimewait=1 optimization prevents local TCP connections from entering the TIME_WAIT state. However, it omits sending the ACK for the FIN, which results in RST. This becomes a bigger deal if the sysctl net.inet.tcp.blackhole is 2. In this case RST isn't send, so the other side of the connection (also local) keeps retransmitting FINs. To fix that in tcp_twstart() we will not call tcp_close() immediately. Instead we will allocate a tcptw on stack and proceed to the end of the function all the way to tcp_twrespond(), to generate the correct ACK, then we will drop the last PCB reference. While here, make a few tiny improvements: - use bools for boolean variable - staticize nolocaltimewait - remove pointless acquisiton of socket lock Reported by: jtl Reviewed by: jtl Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D14697	2018-03-21 20:59:30 +00:00
Jonathan T. Looney	7fb2986ff6	If the INP lock is uncontested, avoid taking a reference and jumping through the lock-switching hoops. A few of the INP lookup operations that lock INPs after the lookup do so using this mechanism (to maintain lock ordering): 1. Lock lookup structure. 2. Find INP. 3. Acquire reference on INP. 4. Drop lock on lookup structure. 5. Acquire INP lock. 6. Drop reference on INP. This change provides a slightly shorter path for cases where the INP lock is uncontested: 1. Lock lookup structure. 2. Find INP. 3. Try to acquire the INP lock. 4. If successful, drop lock on lookup structure. Of course, if the INP lock is contested, the functions will need to revert to the previous way of switching locks safely. This saves a few atomic operations when the INP lock is uncontested. Discussed with: gallatin, rrs, rwatson MFC after: 2 weeks Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D12911	2018-03-21 15:54:46 +00:00
Lawrence Stewart	370efe5ac8	Add support for the experimental Internet-Draft "TCP Alternative Backoff with ECN (ABE)" proposal to the New Reno congestion control algorithm module. ABE reduces the amount of congestion window reduction in response to ECN-signalled congestion relative to the loss-inferred congestion response. More details about ABE can be found in the Internet-Draft: https://tools.ietf.org/html/draft-ietf-tcpm-alternativebackoff-ecn The implementation introduces four new sysctls: - net.inet.tcp.cc.abe defaults to 0 (disabled) and can be set to non-zero to enable ABE for ECN-enabled TCP connections. - net.inet.tcp.cc.newreno.beta and net.inet.tcp.cc.newreno.beta_ecn set the multiplicative window decrease factor, specified as a percentage, applied to the congestion window in response to a loss-based or ECN-based congestion signal respectively. They default to the values specified in the draft i.e. beta=50 and beta_ecn=80. - net.inet.tcp.cc.abe_frlossreduce defaults to 0 (disabled) and can be set to non-zero to enable the use of standard beta (50% by default) when repairing loss during an ECN-signalled congestion recovery episode. It enables a more conservative congestion response and is provided for the purposes of experimentation as a result of some discussion at IETF 100 in Singapore. The values of beta and beta_ecn can also be set per-connection by way of the TCP_CCALGOOPT TCP-level socket option and the new CC_NEWRENO_BETA or CC_NEWRENO_BETA_ECN CC algo sub-options. Submitted by: Tom Jones <tj@enoti.me> Tested by: Tom Jones <tj@enoti.me>, Grenville Armitage <garmitage@swin.edu.au> Relnotes: Yes Differential Revision: https://reviews.freebsd.org/D11616	2018-03-19 16:37:47 +00:00
Alexander V. Chernikov	1435dcd94f	Fix outgoing TCP/UDP packet drop on arp/ndp entry expiration. Current arp/nd code relies on the feedback from the datapath indicating that the entry is still used. This mechanism is incorporated into the arpresolve()/nd6_resolve() routines. After the inpcb route cache introduction, the packet path for the locally-originated packets changed, passing cached lle pointer to the ether_output() directly. This resulted in the arp/ndp entry expire each time exactly after the configured max_age interval. During the small window between the ARP/NDP request and reply from the router, most of the packets got lost. Fix this behaviour by plugging datapath notification code to the packet path used by route cache. Unify the notification code by using single inlined function with the per-AF callbacks. Reported by: sthaug at nethelp.no Reviewed by: ae MFC after: 2 weeks	2018-03-17 17:05:48 +00:00
Michael Tuexen	1574b1e41e	Set the inp_vflag consistently for accepted TCP/IPv6 connections when net.inet6.ip6.v6only=0. Without this patch, the inp_vflag would have INP_IPV4 and the INP_IPV6 flags for accepted TCP/IPv6 connections if the sysctl variable net.inet6.ip6.v6only is 0. This resulted in netstat to report the source and destination addresses as IPv4 addresses, even they are IPv6 addresses. PR: 226421 Reviewed by: bz, hiren, kib MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D13514	2018-03-16 15:26:07 +00:00
Sean Bruno	d7fb35d13a	Update tcp_lro with tested bugfixes from Netflix and LLNW: rrs - Lets make the LRO code look for true dup-acks and window update acks fly on through and combine. rrs - Make the LRO engine a bit more aware of ack-only seq space. Lets not have it incorrectly wipe out newer acks for older acks when we have out-of-order acks (common in wifi environments). jeggleston - LRO eating window updates Based on all of the above I think we are RFC compliant doing it this way: https://tools.ietf.org/html/rfc1122 section 4.2.2.16 "Note that TCP has a heuristic to select the latest window update despite possible datagram reordering; as a result, it may ignore a window update with a smaller window than previously offered if neither the sequence number nor the acknowledgment number is increased." Submitted by: Kevin Bowling <kevin.bowling@kev009.com> Reviewed by: rstone gallatin Sponsored by: NetFlix and Limelight Networks Differential Revision: https://reviews.freebsd.org/D14540	2018-03-09 00:08:43 +00:00
Michael Tuexen	1c714531e8	When checking the TCP fast cookie length, conststently also check for the minimum length. This fixes a bug where cookies of length 2 bytes (which is smaller than the minimum length of 4) is provided by the server. Sponsored by: Netflix, Inc.	2018-02-27 22:12:38 +00:00
Patrick Kelsey	1f13c23f3d	Ensure signed comparison to avoid false trip of assert during VNET teardown. Reported by: lwhsu MFC after: 1 month	2018-02-26 20:31:16 +00:00
Patrick Kelsey	18a7530938	Greatly reduce the number of #ifdefs supporting the TCP_RFC7413 kernel option. The conditional compilation support is now centralized in tcp_fastopen.h and tcp_var.h. This doesn't provide the minimum theoretical code/data footprint when TCP_RFC7413 is disabled, but nearly all the TFO code should wind up being removed by the optimizer, the additional footprint in the syncache entries is a single pointer, and the additional overhead in the tcpcb is at the end of the structure. This enables the TCP_RFC7413 kernel option by default in amd64 and arm64 GENERIC. Reviewed by: hiren MFC after: 1 month Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D14048	2018-02-26 03:03:41 +00:00
Patrick Kelsey	c560df6f12	This is an implementation of the client side of TCP Fast Open (TFO) [RFC7413]. It also includes a pre-shared key mode of operation in which the server requires the client to be in possession of a shared secret in order to successfully open TFO connections with that server. The names of some existing fastopen sysctls have changed (e.g., net.inet.tcp.fastopen.enabled -> net.inet.tcp.fastopen.server_enable). Reviewed by: tuexen MFC after: 1 month Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D14047	2018-02-26 02:53:22 +00:00
Patrick Kelsey	798caa2ee5	Fix harmless locking bug in tfp_fastopen_check_cookie(). The keylist lock was not being acquired early enough. The only side effect of this bug is that the effective add time of a new key could be slightly later than it would have been otherwise, as seen by a TFO client. Reviewed by: tuexen MFC after: 1 month Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D14046	2018-02-26 02:43:26 +00:00
Andrey V. Elsukov	b2fe54bc8b	Reinitialize IP header length after checksum calculation. It is used later by TCP-MD5 code. This fixes the problem with broken TCP-MD5 over IPv4 when NIC has disabled TCP checksum offloading. PR: 223835 MFC after: 1 week	2018-02-10 10:13:17 +00:00
Andrey V. Elsukov	b99a682320	Rework ipfw dynamic states implementation to be lockless on fast path. o added struct ipfw_dyn_info that keeps all needed for ipfw_chk and for dynamic states implementation information; o added DYN_LOOKUP_NEEDED() macro that can be used to determine the need of new lookup of dynamic states; o ipfw_dyn_rule now becomes obsolete. Currently it used to pass information from kernel to userland only. o IPv4 and IPv6 states now described by different structures dyn_ipv4_state and dyn_ipv6_state; o IPv6 scope zones support is added; o ipfw(4) now depends from Concurrency Kit; o states are linked with "entry" field using CK_SLIST. This allows lockless lookup and protected by mutex modifications. o the "expired" SLIST field is used for states expiring. o struct dyn_data is used to keep generic information for both IPv4 and IPv6; o struct dyn_parent is used to keep O_LIMIT_PARENT information; o IPv4 and IPv6 states are stored in different hash tables; o O_LIMIT_PARENT states now are kept separately from O_LIMIT and O_KEEP_STATE states; o per-cpu dyn_hp pointers are used to implement hazard pointers and they prevent freeing states that are locklessly used by lookup threads; o mutexes to protect modification of lists in hash tables now kept in separate arrays. 65535 limit to maximum number of hash buckets now removed. o Separate lookup and install functions added for IPv4 and IPv6 states and for parent states. o By default now is used Jenkinks hash function. Obtained from: Yandex LLC MFC after: 42 days Sponsored by: Yandex LLC Differential Revision: https://reviews.freebsd.org/D12685	2018-02-07 18:59:54 +00:00
John Baldwin	f17985319d	Export tcp_always_keepalive for use by the Chelsio TOM module. This used to work by accident with ld.bfd even though always_keepalive was marked as static. LLD honors static more correctly, so export this variable properly (including moving it into the tcp_* namespace). Reviewed by: bz, emaste MFC after: 2 weeks Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D14129	2018-01-30 23:01:37 +00:00
Michael Tuexen	cf6339882e	Add constant for the PAD chunk as defined in RFC 4820. This will be used by traceroute and traceroute6 soon. MFC after: 1 week	2018-01-27 13:46:55 +00:00
Michael Tuexen	07e75d0a37	Update references in comments, since the IDs have become an RFC long time ago. Also cleanup whitespaces. No functional change. MFC after: 1 week	2018-01-27 13:43:03 +00:00
Conrad Meyer	222daa421f	style: Remove remaining deprecated MALLOC/FREE macros Mechanically replace uses of MALLOC/FREE with appropriate invocations of malloc(9) / free(9) (a series of sed expressions). Something like: * MALLOC(a, b, ... -> a = malloc(... * FREE( -> free( * free((caddr_t) -> free( No functional change. For now, punt on modifying contrib ipfilter code, leaving a definition of the macro in its KMALLOC(). Reported by: jhb Reviewed by: cy, imp, markj, rmacklem Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D14035	2018-01-25 22:25:13 +00:00
Mark Johnston	5557762b72	Use tcpinfoh_t for TCP headers in the tcp:::debug-{drop,input} probes. The header passed to these probes has some fields converted to host order by tcp_fields_to_host(), so the tcpinfo_t translator doesn't do what we want. Submitted by: Hannes Mehnert MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D12647	2018-01-25 15:35:34 +00:00
Navdeep Parhar	09b0b8c058	Do not generate illegal mbuf chains during IP fragment reassembly. Only the first mbuf of the reassembled datagram should have a pkthdr. This was discovered with cxgbe(4) + IPSEC + ping with payload more than interface MTU. cxgbe can generate !M_WRITEABLE mbufs and this results in m_unshare being called on the reassembled datagram, and it complains: panic: m_unshare: m0 0xfffff80020f82600, m 0xfffff8005d054100 has M_PKTHDR PR: 224922 Reviewed by: ae@ MFC after: 1 week Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D14009	2018-01-24 05:09:21 +00:00
Ryan Stone	fc21c53f63	Reduce code duplication for inpcb route caching Add a new macro to clear both the L3 and L2 route caches, to hopefully prevent future instances where only the L3 cache was cleared when both should have been. MFC after: 1 week Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D13989 Reviewed by: karels	2018-01-23 03:15:39 +00:00
Michael Tuexen	e9a3a1b13c	Fix a bug related to fast retransmissions. When processing a SACK advancing the cumtsn-ack in fast recovery, increment the miss-indications for all TSN's reported as missing. Thanks to Fabian Ising for finding the bug and to Timo Voelker for provinding a fix. This fix moves also CMT related initialisation of some variables to a more appropriate place. MFC after: 1 week	2018-01-16 21:58:38 +00:00
Michael Tuexen	46bf534caf	Don't provide a (meaningless) cmsg when proving a notification in a recvmsg() call. MFC after: 1 week	2018-01-15 21:59:20 +00:00
Pedro F. Giffuni	84f210952c	libalias: small memory allocation cleanups. Make the calloc wrappers behave as expected by using mallocarray. It is rather weird that the malloc wrappers also zeroes the memory: update a comment to reflect at least two cases where it is expected. Reviewed by: tuexen	2018-01-12 23:12:30 +00:00
Cy Schubert	f813e520c2	Correct the comment describing badrs which is bad router solicitiation, not bad router advertisement. MFC after: 3 days	2017-12-29 07:23:18 +00:00
Michael Tuexen	fa5867cbd6	White cleanups.	2017-12-26 16:33:55 +00:00
Michael Tuexen	f34a628e7e	Clearify CID 1008197. MFC after: 3 days	2017-12-26 16:12:04 +00:00
Michael Tuexen	0460135495	Clearify issue reported in CID 1008198. MFC after: 3 days	2017-12-26 16:06:11 +00:00
Michael Tuexen	f6ea123171	Fix CID 1008428. MFC after: 1 week	2017-12-26 15:29:11 +00:00
Michael Tuexen	4830aee72f	Fix CID 1008936.	2017-12-26 15:24:42 +00:00
Michael Tuexen	c9256941d0	Allow the first (and second) argument of sn_calloc() be a sum. This fixes a bug reported in https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=224103 PR: 224103	2017-12-26 14:37:47 +00:00
Michael Tuexen	cd90150413	When adding support for sending SCTP packets containing an ABORT chunk to ipfw in https://svnweb.freebsd.org/changeset/base/326233, a dependency on the SCTP stack was added to ipfw by accident. This was noted by Kevel Bowling in https://reviews.freebsd.org/D13594 where also a solution was suggested. This patch is based on Kevin's suggestion, but implements the required SCTP checksum computation without any dependency on other SCTP sources. While there, do some cleanups and improve comments. Thanks to Kevin Kevin Browling for reporting the issue and suggesting a fix.	2017-12-26 12:35:02 +00:00
Alexander Kabaev	151ba7933a	Do pass removing some write-only variables from the kernel. This reduces noise when kernel is compiled by newer GCC versions, such as one used by external toolchain ports. Reviewed by: kib, andrew(sys/arm and sys/arm64), emaste(partial), erj(partial) Reviewed by: jhb (sys/dev/pci/* sys/kern/vfs_aio.c and sys/kern/kern_synch.c) Differential Revision: https://reviews.freebsd.org/D10385	2017-12-25 04:48:39 +00:00
Andrey V. Elsukov	2aad62408b	Fix mbuf leak when TCPMD5_OUTPUT() method returns error. PR: 223817 MFC after: 1 week	2017-12-14 12:54:20 +00:00
Michael Tuexen	cd6340caf7	Cleaup, no functional change.	2017-12-13 17:11:57 +00:00
Gleb Smirnoff	66492fea49	Separate out send buffer autoscaling code into function, so that alternative TCP stacks may reuse it instead of pasting. Obtained from: Netflix	2017-12-07 22:36:58 +00:00
Michael Tuexen	9f0abda051	Retire SCTP_WITH_NO_CSUM option. This option was used in the early days to allow performance measurements extrapolating the use of SCTP checksum offloading. Since this feature is now available, get rid of this option. This also un-breaks the LINT kernel. Thanks to markj@ for making me aware of the problem.	2017-12-07 22:19:08 +00:00
Pedro F. Giffuni	fe267a5590	sys: general adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 2-Clause license, however the tool I was using misidentified many licenses so this was mostly a manual - error prone - task. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, superceed or replace the license texts. No functional change intended.	2017-11-27 15:23:17 +00:00
Michael Tuexen	665c8a2ee5	Add to ipfw support for sending an SCTP packet containing an ABORT chunk. This is similar to the TCP case. where a TCP RST segment can be sent. There is one limitation: When sending an ABORT in response to an incoming packet, it should be tested if there is no ABORT chunk in the received packet. Currently, it is only checked if the first chunk is an ABORT chunk to avoid parsing the whole packet, which could result in a DOS attack. Thanks to Timo Voelker for helping me to test this patch. Reviewed by: bcr@ (man page part), ae@ (generic, non-SCTP part) Differential Revision: https://reviews.freebsd.org/D13239	2017-11-26 18:19:01 +00:00
Michael Tuexen	18442f0a5b	Fix SPDX line as suggested by pfg	2017-11-24 19:38:59 +00:00
Michael Tuexen	ad15e1548f	Unbreak compilation when using SCTP_DETAILED_STR_STATS option. MFC after: 1 week	2017-11-24 12:18:48 +00:00
Michael Tuexen	b7d2b5d5b1	Add SPDX line.	2017-11-24 11:25:53 +00:00
Mark Johnston	7a5c730561	Use the right variable for the IP header parameter to tcp:::send. This addresses a regression from r311225. MFC after: 1 week	2017-11-22 14:13:40 +00:00
Pedro F. Giffuni	51369649b0	sys: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 3-Clause license. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, superceed or replace the license texts. Special thanks to Wind River for providing access to "The Duke of Highlander" tool: an older (2014) run over FreeBSD tree was useful as a starting point.	2017-11-20 19:43:44 +00:00
Michael Tuexen	3e87bccde3	Fix the handling of ERROR chunks which a lot of error causes. While there, clean up the code. Thanks to Felix Weinrank who found the bug by using fuzz-testing the SCTP userland stack. MFC after: 1 week	2017-11-15 22:13:10 +00:00
Michael Tuexen	d0f6ab7920	Simply the code and use the full buffer for contigous chunk representation. MFC after: 1 week	2017-11-14 02:30:21 +00:00
Gleb Smirnoff	3e21cbc802	Style r320614: don't initialize at declaration, new line after declarations, shorten variable name to avoid extra long lines. No functional changes.	2017-11-13 22:16:47 +00:00
Michael Tuexen	469a65d1f8	Cleanup the handling of control chunks. While there fix some minor bug related to clearing the assoc retransmit counter and the dup TSN handling of NR-SACK chunks. MFC after: 3 days	2017-11-12 21:43:33 +00:00
Konstantin Belousov	06193f0be0	Use hardware timestamps to report packet timestamps for SO_TIMESTAMP and other similar socket options. Provide new control message SCM_TIME_INFO to supply information about timestamp. Currently it indicates that the timestamp was hardware-assisted and high-precision, for software timestamps the message is not returned. Reserved fields are added to ABI to report additional info about it, it is expected that raw hardware clock value might be useful for some applications. Reviewed by: gallatin (previous version), hselasky Sponsored by: Mellanox Technologies MFC after: 2 weeks X-Differential revision: https://reviews.freebsd.org/D12638	2017-11-07 09:46:26 +00:00
Michael Tuexen	253a63b817	Fix an accounting bug where data was counted twice if on the read queue and on the ordered or unordered queue. While there, improve the checking in INVARIANTs when computing the a_rwnd. MFC after: 3 days	2017-11-05 11:59:33 +00:00
Michael Tuexen	28a6adde1d	Allow the setting of the MTU for future paths using an SCTP socket option. This functionality was missing. MFC after: 1 week	2017-11-03 20:46:12 +00:00
Michael Tuexen	ba5fc4cf78	Fix the reporting of the MTU for SCTP sockets when using IPv6. MFC after: 1 week	2017-11-01 16:32:11 +00:00
Michael Tuexen	966dfbf910	Fix parsing error when processing cmsg in SCTP send calls. Thei bug is related to a signed/unsigned mismatch. This should most likely fix the issue in sctp_sosend reported by Dmitry Vyukov on the freebsd-hackers mailing list and found by running syzkaller.	2017-10-27 19:27:05 +00:00
Michael Tuexen	8d9b040dd4	Fix a bug reported by Felix Weinrank using the libfuzzer on the userland stack. MFC after: 3 days	2017-10-25 09:12:22 +00:00
Michael Tuexen	701492a5f6	Fix a bug in handling special ABORT chunks. Thanks to Felix Weinrank for finding this issue using libfuzzer with the userland stack. MFC after: 3 days	2017-10-24 16:24:12 +00:00
Michael Tuexen	adc59f7f46	Fix a locking issue found by running AFL on the userland stack. Thanks to Felix Weinrank for reporting the issue. MFC after: 3 days	2017-10-24 14:28:56 +00:00
Alexander Motin	81098a018e	Relax per-ifnet cif_vrs list double locking in carp(4). In all cases where cif_vrs list is modified, two locks are held: per-ifnet CIF_LOCK and global carp_sx. It means to read that list only one of them is enough to be held, so we can skip CIF_LOCK when we already have carp_sx. This fixes kernel panic, caused by attempts of copyout() to sleep while holding non-sleepable CIF_LOCK mutex. Discussed with: glebius MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2017-10-19 09:01:15 +00:00
Michael Tuexen	af03054c8a	Fix a signed/unsigned warning. MFC after: 1 week	2017-10-18 21:08:35 +00:00
Michael Tuexen	7f75695a3e	Abort an SCTP association, when a DATA chunk is followed by an unknown chunk with a length smaller than the minimum length. Thanks to Felix Weinrank for making me aware of the problem. MFC after: 3 days	2017-10-18 20:17:44 +00:00
Michael Tuexen	0d5af38ceb	Revert change which got in accidently.	2017-10-18 18:59:35 +00:00
Michael Tuexen	3ed8d364a7	Fix a bug introduced in r324638. Thanks to Felix Weinrank for making me aware of this. MFC after: 3 days	2017-10-18 18:56:56 +00:00
Michael Tuexen	80a2d1406f	Fix the handling of parital and too short chunks. Ensure that the current behaviour is consistent: stop processing of the chunk, but finish the processing of the previous chunks. This behaviour might be changed in a later commit to ABORT the assoication due to a protocol violation, but changing this is a separate issue. MFC after: 3 days	2017-10-15 19:33:30 +00:00
Michael Tuexen	8c8e10b763	Code cleanup, not functional change. This avoids taking a pointer of a packed structure which allows simpler compilation of the userland stack. MFC after: 1 week	2017-10-14 10:02:59 +00:00
Gleb Smirnoff	3bdf4c4274	Declare more TCP globals in tcp_var.h, so that alternative TCP stacks can use them. Gather all TCP tunables in tcp_var.h in one place and alphabetically sort them, to ease maintainance of the list. Don't copy and paste declarations in tcp_stacks/fastpath.c.	2017-10-11 20:36:09 +00:00
Gleb Smirnoff	e29c55e4bb	Declare pmtud_blackhole global variables in tcp_timer.h, so that alternative TCP stacks can legally use them.	2017-10-06 20:33:40 +00:00
Michael Tuexen	ff76c8c9fd	Ensure that the accept ABORT chunks with the T-bit set only the a non-zero matching peer tag is provided. MFC after: 1 week	2017-10-05 13:29:54 +00:00
Julien Charbon	5fcd2d9bfc	Forgotten bits in r324179: Include sys/syslog.h if INVARIANTS is not defined MFC after: 1 week X-MFC with: r324179 Pointy hat to: jch	2017-10-02 09:45:17 +00:00
Patrick Kelsey	3f43239f21	The soisconnected() call removed from syncache_socket() in r307966 was not extraneous in the TCP Fast Open (TFO) passive-open case. In the TFO passive-open case, syncache_socket() is being called during processing of a TFO SYN bearing a valid cookie, and a call to soisconnected() is required in order to allow the application to immediately consume any data delivered in the SYN and to have a chance to generate response data to accompany the SYN-ACK. The removal of this call to soisconnected() effectively converted all TFO passive opens to having the same RTT cost as a standard 3WHS. This commit adds a call to soisconnected() to syncache_tfo_expand() so that it is only in the TFO passive-open path, thereby restoring TFO passve-open RTT performance and preserving the non-TFO connection-rate performance gains realized by r307966. MFC after: 1 week Sponsored by: Limelight Networks	2017-10-01 23:37:17 +00:00
Julien Charbon	dfa1f80ce9	Fix an infinite loop in tcp_tw_2msl_scan() when an INP_TIMEWAIT inp has been destroyed before its tcptw with INVARIANTS undefined. This is a symmetric change of r307551: A INP_TIMEWAIT inp should not be destroyed before its tcptw, and INVARIANTS will catch this case. If INVARIANTS is undefined it will emit a log(LOG_ERR) and avoid a hard to debug infinite loop in tcp_tw_2msl_scan(). Reported by: Ben Rubson, hselasky Submitted by: hselasky Tested by: Ben Rubson, jch MFC after: 1 week Sponsored by: Verisign, inc Differential Revision: https://reviews.freebsd.org/D12267	2017-10-01 21:20:28 +00:00
Andrey V. Elsukov	f415d666c3	Some mbuf related fixes in icmp_error() * check mbuf length before doing mtod() and accessing to IP header; * update oip pointer and all depending pointers after m_pullup(); * remove extra checks and extra parentheses, wrap long lines; PR: 222670 Reported by: Prabhakar Lakhera MFC after: 1 week	2017-09-29 06:24:45 +00:00
Michael Tuexen	09c53cb6cc	Remove unused function. MFC after: 1 week	2017-09-27 13:05:23 +00:00
Sepherosa Ziehau	fc572e261f	tcp: Don't "negotiate" MSS. _NO_ OSes actually "negotiate" MSS. RFC 879: "... This Maximum Segment Size (MSS) announcement (often mistakenly called a negotiation) ..." This negotiation behaviour was introduced 11 years ago by r159955 without any explaination about why FreeBSD had to "negotiate" MSS: In syncache_respond() do not reply with a MSS that is larger than what the peer announced to us but make it at least tcp_minmss in size. Sponsored by: TCP/IP Optimization Fundraise 2005 The tcp_minmss behaviour is still kept. Syncookie fix was prodded by tuexen, who also helped to test this patch w/ packetdrill. Reviewed by: tuexen, karels, bz (previous version) MFC after: 2 week Sponsored by: Microsoft Differential Revision: https://reviews.freebsd.org/D12430	2017-09-27 05:52:37 +00:00
Michael Tuexen	d28a3a393b	Add missing locking. Found by Coverity while scanning the usrsctp library. MFC after: 1 week	2017-09-22 06:33:01 +00:00
Michael Tuexen	afb908dada	Add missing socket lock. MFC after: 1 week	2017-09-22 06:07:47 +00:00
Michael Tuexen	cdd2d7d4a5	Code cleanup, no functional change. MFC after: 1 week	2017-09-21 11:56:31 +00:00
Michael Tuexen	53999485e0	Free the control structure after using is, not before. Found by Coverity while scanning the usrsctp library. MFC after: 1 week	2017-09-21 09:47:56 +00:00
Michael Tuexen	d0d8c7de19	No need to wakeup, since sctp_add_to_readq() does it. MFC after: 1 week	2017-09-21 09:18:05 +00:00
Michael Tuexen	2c62ba7377	Protect the address workqueue timer by a mutex. MFC after: 1 week	2017-09-20 21:29:54 +00:00
Michael Tuexen	3ec509bcd3	Fix a warning. MFC after: 1 week	2017-09-19 20:24:13 +00:00
Michael Tuexen	564a95f485	Avoid an overflow when computing the staleness. This issue was found by running libfuzz on the userland stack. MFC after: 1 week	2017-09-19 20:09:58 +00:00
Michael Tuexen	ad608f06ed	Remove a no longer used variable. Reported by: Felix Weinrank MFC after: 1 week	2017-09-19 15:00:19 +00:00
Michael Tuexen	72e23aba22	Fix an accounting bug and use sctp_timer_start to start a timer. MFC after: 1 week	2017-09-17 09:27:27 +00:00
Michael Tuexen	fe40f49bb3	Remove code not used on any platform currently supported. MFC after: 1 week	2017-09-16 21:26:06 +00:00
Michael Tuexen	292efb1bc0	Export the UDP encapsualation port and the path state.	2017-09-12 21:08:50 +00:00
Michael Tuexen	e5cccc35c3	Add support to print the TCP stack being used. Sponsored by: Netflix, Inc.	2017-09-12 13:34:43 +00:00
Michael Tuexen	7b5f06fbcc	Fix MTU computation. Coverity scanning usrsctp pointed to this code... MFC after: 3 days	2017-09-09 21:03:40 +00:00
Michael Tuexen	f55c326691	Fix locking issues found by Coverity scanning the usrsctp library. MFC after: 3 days	2017-09-09 20:44:56 +00:00
Michael Tuexen	0c4622dab2	Silence a Coverity warning from scanning the usrsctp library. MFC after: 3 days	2017-09-09 20:08:26 +00:00
Michael Tuexen	6c2cfc0419	Savely remove a chunk from the control queue. This bug was found by Coverity scanning the usrsctp library. MFC after: 3 days	2017-09-09 19:49:50 +00:00
Hans Petter Selasky	95ed5015ec	Add support for generic backpressure indicator for ratelimited transmit queues aswell as non-ratelimited ones. Add the required structure bits in order to support a backpressure indication with ratelimited connections aswell as non-ratelimited ones. The backpressure indicator is a value between zero and 65535 inclusivly, indicating if the destination transmit queue is empty or full respectivly. Applications can use this value as a decision point for when to stop transmitting data to avoid endless ENOBUFS error codes upon transmitting an mbuf. This indicator is also useful to reduce the latency for ratelimited queues. Reviewed by: gallatin, kib, gnn Differential Revision: https://reviews.freebsd.org/D11518 Sponsored by: Mellanox Technologies	2017-09-06 13:56:18 +00:00
Michael Tuexen	3d5af7a127	Fix blackhole detection. There were two bugs related to the blackhole detection: * The smalles size was tried more than two times. * The restored MSS was not the original one, but the second candidate. MFC after: 1 week Sponsored by: Netflix, Inc.	2017-08-28 11:41:18 +00:00
Sean Bruno	32a04bb81d	Use counter(9) for PLPMTUD counters. Remove unused PLPMTUD sysctl counters. Bump UPDATING and FreeBSD Version to indicate a rebuild is required. Submitted by: kevin.bowling@kev009.com Reviewed by: jtl Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D12003	2017-08-25 19:41:38 +00:00
Michael Tuexen	e8aba2eb21	Avoid TCP log messages which are false positives. This is https://svnweb.freebsd.org/changeset/base/322812, just for alternate TCP stacks. XMFC with: 322812	2017-08-23 15:08:51 +00:00
Michael Tuexen	4da9052a05	Avoid TCP log messages which are false positives. The check for timestamps are too early to handle SYN-ACK correctly. So move it down after the corresponing processing has been done. PR: 216832 Obtained from: antonfb@hesiod.org MFC after: 1 week	2017-08-23 14:50:08 +00:00
Michael Tuexen	63ec505a4f	Ensure inp_vflag is consistently set for TCP endpoints. Make sure that the flags INP_IPV4 and INP_IPV6 are consistently set for inpcbs used for TCP sockets, no matter if the setting is derived from the net.inet6.ip6.v6only sysctl or the IPV6_V6ONLY socket option. For UDP this was already done right. PR: 221385 MFC after: 1 week	2017-08-18 07:27:15 +00:00
Oleg Bulyzhin	ff21796d25	Fix comment typo.	2017-08-09 10:46:34 +00:00
Dag-Erling Smørgrav	d80fbbee5f	Correct sysctl names.	2017-08-09 07:24:58 +00:00
Bjoern A. Zeeb	ae69ad884d	After inpcb route caching was put back in place there is no need for flowtable anymore (as flowtable was never considered to be useful in the forwarding path). Reviewed by: np Differential Revision: https://reviews.freebsd.org/D11448	2017-07-27 13:03:36 +00:00
Ed Maste	1edbb54fe9	cc_cubic: restore braces around if-condition block r307901 was reverted in r321480, restoring an incorrect block delimitation bug present in the original cc_cubic commit. Restore only the bugfix (brace addition) from r307901. CID: 1090182 Approved by: sbruno	2017-07-26 21:23:09 +00:00
Sean Bruno	43053c125a	Revert r307901 - Inform CC modules about loss events. This was discussed between various transport@ members and it was requested to be reverted and discussed. Submitted by: Kevin Bowling <kevin.bowling@kev009.com> Reported by: lawrence Reviewed by: hiren Sponsored by: Limelight Networks	2017-07-25 15:08:52 +00:00
Sean Bruno	5d53981a18	Revert r308180 - Set slow start threshold more accurrately on loss ... This was discussed between various transport@ members and it was requested to be reverted and discussed. Submitted by: kevin Reported by: lawerence Reviewed by: hiren	2017-07-25 15:03:05 +00:00
Michael Tuexen	e5a9c519bc	Remove duplicate statement.	2017-07-25 11:05:53 +00:00
Michael Tuexen	9dd6ca9602	Deal with listening socket correctly.	2017-07-20 14:50:13 +00:00
Michael Tuexen	bbc9dfbc08	Fix the explicit EOR mode. If the final messages is not complete, send an ABORT. Joint work with rrs@ MFC after: 1 week	2017-07-20 11:09:33 +00:00
Michael Tuexen	1f76872c36	Avoid shadowed variables. MFC after: 1 week	2017-07-19 15:12:23 +00:00
Michael Tuexen	5ba7f91f9d	Use memset/memcpy instead of bzero/bcopy. Just use one variant instead of both. Use the memset/memcpy ones since they cause less problems in crossplatform deployment. MFC after: 1 week	2017-07-19 14:28:58 +00:00
Michael Tuexen	28cd0699b6	Fix the accounting and add code to detect errors in accounting. Joint work with rrs@ MFC after: 1 week	2017-07-19 12:27:40 +00:00
Michael Tuexen	d32ed2c735	Fix the handling of Explicit EOR mode. While there, appropriately handle the overhead depending on the usage of DATA or I-DATA chunks. Take the overhead only into account, when required. Joint work with rrs@ MFC after: 1 week	2017-07-15 19:54:03 +00:00
Konstantin Belousov	5cead59181	Correct sysent flags for dynamically loaded syscalls. Using the https://github.com/google/capsicum-test/ suite, the PosixMqueue.CapModeForked test was failing due to an ECAPMODE after calling kmq_notify(). On further inspection, the dynamically loaded syscall entry was initialized with sy_flags zeroed out, since SYSCALL_INIT_HELPER() left sysent.sy_flags with the default value. Add a new helper SYSCALL{,32}_INIT_HELPER_F() which takes an additional argument to specify the sy_flags value. Submitted by: Siva Mahadevan <smahadevan@freebsdfoundation.org> Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D11576	2017-07-14 09:34:44 +00:00
Jonathan T. Looney	cb503ae22d	Don't overpromote values when calculating len in tcp_output(). sbavail() returns u_int and sendwin is a uint32_t. Therefore, min() (which operates on two u_int values) is able to correctly calculate the minimum of these two arguments. Reported by: rrs MFC after: 1 week Sponsored by: Netflix	2017-07-05 16:10:30 +00:00
Michael Tuexen	1698cbd919	Move to open state after plausibility checks. When doing this too early, the MIB counters go wrong. MFC after: 1 week	2017-07-04 18:24:50 +00:00
Michael Tuexen	afffa1a9ad	Don't hold if refcount on an stcb when it is not needed. This improves the consistency with other parts of the code.	2017-07-04 18:04:44 +00:00
Sean Bruno	ac952dd274	Add a sysctl to toggle the use of the sockets LOWAT when calculating auto window growth Submitted by: j@nitrology.com (Jason Wolfe) Reviewed by: gnn hiren Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D11016	2017-07-03 19:39:58 +00:00
Michael Tuexen	f4358911bf	Handle sctp_get_next_param() in a consistent way. This addresses an issue found by Felix Weinrank using libfuzz. While there, use also consistent nameing. MFC after: 3 days	2017-06-23 21:01:57 +00:00
Michael Tuexen	d44b45df2c	Check the length of a COOKIE chunk before accessing fields in it. Thanks to Felix Weinrank for reporting the issue he found by using libFuzzer. MFC after: 3 days	2017-06-23 10:09:49 +00:00
Michael Tuexen	1a7abbb3be	Use a longer buffer for messages in ERROR chunks. This allows them to be sent in a non truncated way and addresses a warning given by newver versions of gcc. Thanks to Anselm Jonas Scholl for reporting it and providing a patch.	2017-06-23 09:27:31 +00:00
Michael Tuexen	94f66d603a	Honor the backlog field.	2017-06-23 08:35:54 +00:00
Michael Tuexen	3017b21bb6	Improve compilation on platforms different from FreeBSD.	2017-06-23 08:34:01 +00:00
Gleb Smirnoff	779f106aa1	Listening sockets improvements. o Separate fields of struct socket that belong to listening from fields that belong to normal dataflow, and unionize them. This shrinks the structure a bit. - Take out selinfo's from the socket buffers into the socket. The first reason is to support braindamaged scenario when a socket is added to kevent(2) and then listen(2) is cast on it. The second reason is that there is future plan to make socket buffers pluggable, so that for a dataflow socket a socket buffer can be changed, and in this case we also want to keep same selinfos through the lifetime of a socket. - Remove struct struct so_accf. Since now listening stuff no longer affects struct socket size, just move its fields into listening part of the union. - Provide sol_upcall field and enforce that so_upcall_set() may be called only on a dataflow socket, which has buffers, and for listening sockets provide solisten_upcall_set(). o Remove ACCEPT_LOCK() global. - Add a mutex to socket, to be used instead of socket buffer lock to lock fields of struct socket that don't belong to a socket buffer. - Allow to acquire two socket locks, but the first one must belong to a listening socket. - Make soref()/sorele() to use atomic(9). This allows in some situations to do soref() without owning socket lock. There is place for improvement here, it is possible to make sorele() also to lock optionally. - Most protocols aren't touched by this change, except UNIX local sockets. See below for more information. o Reduce copy-and-paste in kernel modules that accept connections from listening sockets: provide function solisten_dequeue(), and use it in the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4), infiniband, rpc. o UNIX local sockets. - Removal of ACCEPT_LOCK() global uncovered several races in the UNIX local sockets. Most races exist around spawning a new socket, when we are connecting to a local listening socket. To cover them, we need to hold locks on both PCBs when spawning a third one. This means holding them across sonewconn(). This creates a LOR between pcb locks and unp_list_lock. - To fix the new LOR, abandon the global unp_list_lock in favor of global unp_link_lock. Indeed, separating these two locks didn't provide us any extra parralelism in the UNIX sockets. - Now call into uipc_attach() may happen with unp_link_lock hold if, we are accepting, or without unp_link_lock in case if we are just creating a socket. - Another problem in UNIX sockets is that uipc_close() basicly did nothing for a listening socket. The vnode remained opened for connections. This is fixed by removing vnode in uipc_close(). Maybe the right way would be to do it for all sockets (not only listening), simply move the vnode teardown from uipc_detach() to uipc_close()? Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D9770	2017-06-08 21:30:34 +00:00
Jonathan T. Looney	dc6a41b936	Add the infrastructure to support loading multiple versions of TCP stack modules. It adds support for mangling symbols exported by a module by prepending a string to them. (This avoids overlapping symbols in the kernel linker.) It allows the use of a macro as the module name in the DECLARE_MACRO() and MACRO_VERSION() macros. It allows the code to register stack aliases (e.g. both a generic name ["default"] and version-specific name ["default_10_3p1"]). With these changes, it is trivial to compile TCP stack modules with the name defined in the Makefile and to load multiple versions of the same stack simultaneously. This functionality can be used to enable side-by-side testing of an old and new version of the same TCP stack. It also could support upgrading the TCP stack without a reboot. Reviewed by: gnn, sjg (makefiles only) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D11086	2017-06-08 20:41:28 +00:00
Gleb Smirnoff	3acfe1e1b0	This code was missing socket unlock and socket buffer lock, but it worked since right now these two locks are the same.	2017-06-08 06:37:11 +00:00
Gleb Smirnoff	12d8a8e7a3	The desired lock here is socket buffer, not socket. Right now they match, but won't in future.	2017-06-08 06:34:09 +00:00
Michael Tuexen	8cb5a8e90a	Fix the ICMP6 handling for TCP. The ICMP6 packets might not be contained in a single mbuf. So don't assume this. Keep the IPv4 and IPv6 code in sync and make explicit that the syncache code only need the TCP sequence number, not the complete TCP header. MFC after: 3 days Sponsored by: Netflix, Inc.	2017-06-03 21:53:58 +00:00
Michael Tuexen	98732609d5	Improve comments to describe what the code does. Reported by: jtl Sponsored by: Netflix, Inc.	2017-06-01 15:11:18 +00:00
Jonathan T. Looney	382a6bbcf1	Enforce the limit on ICMP messages before doing work to formulate the response. Delete an unneeded rate limit for UDP under IPv6. Because ICMP6 messages have their own rate limit, it is unnecessary to apply a second rate limit to UDP messages. Reviewed by: glebius MFC after: 2 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D10387	2017-05-30 14:32:44 +00:00
Michael Tuexen	5d08768a2b	Use the SCTP_PCB_FLAGS_ACCEPTING flags to check for listeners. While there, use a macro for checking the listen state to allow for easier changes if required. This done to help glebius@ with his listen changes.	2017-05-26 16:29:00 +00:00
Gleb Smirnoff	6a6cefac3d	o Rearrange struct inpcb fields to optimize the TCP output code path considering cache line hits and misses. Put the lock and hash list glue into the first cache line, put inp_refcount inp_flags inp_socket into the second cache line. o On allocation zero out entire structure except the lock and list entries, including inp_route inp_lle inp_gencnt. When inp_route and inp_lle were introduced, they were added below inp_zero_size, resulting on not being cleared after free/alloc. This definitely was a source of bugs with route caching. Could be that r315956 has just fixed one of them. The inp_gencnt is reinitialized on every alloc, so it is safe to clear it. This has been proved to improve TCP performance at Netflix. Obtained from: rrs Differential Revision: D10686	2017-05-24 17:47:16 +00:00
Michael Tuexen	5dba6ada91	The connect() system call should return -1 and set errno to EAFNOSUPPORT if it is called on a TCP socket * with an IPv6 address and the socket is bound to an IPv4-mapped IPv6 address. * with an IPv4-mapped IPv6 address and the socket is bound to an IPv6 address. Thanks to Jonathan T. Leighton for reporting this issue. Reviewed by: bz gnn MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D9163	2017-05-22 15:29:10 +00:00
Andrey V. Elsukov	38cc96a887	Set M_BCAST and M_MCAST flags on mbuf sent via divert socket. r290383 has changed how mbufs sent by divert socket are handled. Previously they are always handled by slow path processing in ip_input(). Now ip_tryforward() is invoked from ip_input() before in_broadcast() check. Since diverted packet lost all mbuf flags, it passes the broadcast check in ip_tryforward() due to missing M_BCAST flag. In the result the broadcast packet is forwarded to the wire instead of be consumed by network stack. Add in_broadcast() check to the div_output() function. And restore the M_BCAST flag if destination address is broadcast for the given network interface. PR: 209491 MFC after: 1 week	2017-05-17 09:04:09 +00:00
Ed Maste	3e85b721d6	Remove register keyword from sys/ and ANSIfy prototypes A long long time ago the register keyword told the compiler to store the corresponding variable in a CPU register, but it is not relevant for any compiler used in the FreeBSD world today. ANSIfy related prototypes while here. Reviewed by: cem, jhb Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D10193	2017-05-17 00:34:34 +00:00
Gleb Smirnoff	cc487c1697	Reduce in_pcbinfo_init() by two params. No users supply any flags to this function (they used to say UMA_ZONE_NOFREE), so flag parameter goes away. The zone_fini parameter also goes away. Previously no protocols (except divert) supplied zone_fini function, so inpcb locks were leaked with slabs. This was okay while zones were allocated with UMA_ZONE_NOFREE flag, but now this is a leak. Fix that by suppling inpcb_fini() function as fini method for all inpcb zones.	2017-05-15 21:58:36 +00:00
Enji Cooper	bd7459366e	Add missing braces around MCAST_EXCLUDE check when KTR support is compiled into the kernel This ensures that .iss_asm (the number of ASM listeners) isn't incorrectly decremented for MLD-layer source datagrams when inspecting im*s_st[1] (the second state in the structure). MFC after: 2 months PR: 217509 [1] Reported by: Coverity (Isilon) Reviewed by: ae ("This patch looks correct to me." [1]) Submitted by: Miles Ohlrich <miles.ohlrich@isilon.com> Sponsored by: Dell EMC Isilon	2017-05-13 18:41:24 +00:00
Gleb Smirnoff	7637c57ee1	There is no good reason for TCP reassembly zone to be UMA_ZONE_NOFREE. It has strong locking model, doesn't have any timers associated with entries. The entries theirselves are referenced only from the tcpcb zone, which itself is a normal zone, without the UMA_ZONE_NOFREE flag.	2017-05-10 23:32:31 +00:00
Eugene Grosbein	1a356b8b90	ipfw nat and natd support multiple aliasing instances with "nat global" feature that chooses right alias_address for outgoing packets that already have corresponding state in one of aliasing instances. This feature works just fine for ICMP, UDP, TCP and SCTP packes but not for others. For example, outgoing PPtP/GRE packets always get alias_address of latest configured instance no matter whether such packets have corresponding state or not. This change unbreaks translation of transit PPtP/GRE connections for "nat global" case fixing a bug in static ProtoAliasOut() function that ignores its "create" argument and performs translation regardless of its value. This static function is called only by LibAliasOutLocked() function and only for packers other than ICMP, UDP, TCP and SCTP. LibAliasOutLocked() passes its "create" argument unmodified. We have only two consumers of LibAliasOutLocked() in the source tree calling it with "create" unequal to 1: "ipfw nat global" code and similar natd code having same problem. All other consumers of LibAliasOutLocked() call it with create = 1 and the patch is "no-op" for such cases. PR: 218968 Approved by: ae, vsevolod (mentor) MFC after: 1 week	2017-05-10 19:41:52 +00:00
Michael Tuexen	10e0318afa	Allow SCTP to use the hostcache. This patch allows the MTU stored in the hostcache to be used as an initial value for SCTP paths. When an ICMP PTB message is received, store the MTU in the hostcache. MFC after: 1 week	2017-04-29 19:20:50 +00:00
Michael Tuexen	4f43a14a85	Don't set the DF-bit on timer based retransmissions. MFC after: 1 week	2017-04-29 09:57:27 +00:00
Michael Tuexen	b6ecf43450	Set the DF bit for responses to out-of-the-blue packets. MFC after: 1 week	2017-04-28 15:38:34 +00:00
Michael Tuexen	d274bcc661	Fix an issue with MTU calculation if an ICMP messaeg is received for an SCTP/UDP packet. MFC after: 1 week	2017-04-26 20:21:05 +00:00
Michael Tuexen	6ebfa5ee14	Use consistently uint32_t for mtu values. This does not change functionality, but this cleanup is need for further improvements of ICMP handling. MFC after: 1 week	2017-04-26 19:26:40 +00:00
Michael Tuexen	ebfd753408	When a SYN-ACK is received in SYN-SENT state, RFC 793 requires the validation of SEG.ACK as the first step. If the ACK is not acceptable, a RST segment should be sent and the segment should be dropped. Up to now, the segment was partially processed. This patch moves the check for the SEG.ACK validation up to the front as required. Reviewed by: hiren, gnn MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D10424	2017-04-26 06:20:58 +00:00

... 5 6 7 8 9 ...

6389 Commits