freebsd-dev

Author	SHA1	Message	Date
Gleb Smirnoff	cef9f220cd	Remove 'dir' argument in ng_ipfw_input, since ip_fw_args now has this info. While here make 'tee' boolean.	2019-03-14 22:30:05 +00:00
Gleb Smirnoff	1830dae3d3	Make second argument of ip_divert(), that specifies packet direction a bool. This allows pf(4) to avoid including ipfw(4) private files.	2019-03-14 22:23:09 +00:00
Bjoern A. Zeeb	b25d74e06c	Improve ARP logging. r344504 added an extra ARP_LOG() call in case of an if_output() failure. It turns out IPv4 can be noisy. In order to not spam the console by default: (a) add a counter for these events so people can keep better track of how often it happens, and (b) add a sysctl to select the default ARP_LOG log level and set it to INFO avoiding the one (the new) DEBUG level by default. Claim a spare (1st one after 10 years since the stats were added) in order to not break netstat from FreeBSD 12->13 updates in the future. Reviewed by: karels Differential Revision: https://reviews.freebsd.org/D19490	2019-03-09 01:12:59 +00:00
Michael Tuexen	3a35ad54a8	Fix locking bug. MFC after: 3 days	2019-03-08 18:17:57 +00:00
Michael Tuexen	a458a6e620	Some cleanup and consistency improvements. MFC after: 3 days	2019-03-08 18:16:19 +00:00
Michael Tuexen	e6dcce69ca	After removing an entry from the stream scheduler list, set the pointers to NULL, since we are checking for it in case the element gets inserted again. This issue was found by running syzkaller. MFC after: 3 days	2019-03-07 08:43:20 +00:00
Michael Tuexen	be62c88b80	Allocate an association id and register the stcb while holding the lock. This avoids a race where stcbs can be found, which are not completely initialized. This was found by running syzkaller. MFC after: 3 days	2019-03-03 19:55:06 +00:00
Michael Tuexen	5f98c80550	Remove debug output. MFC after: 3 days	2019-03-02 16:10:11 +00:00
Michael Tuexen	bab9988af5	Allow SCTP stream reconfiguration operations only in ESTABLISHED state. This issue was found by running syzkaller. MFC after: 3 days	2019-03-02 14:30:27 +00:00
Michael Tuexen	49f1449309	Handle the case when calling the IPPROTO_SCTP level socket option SCTP_STATUS on an association with no primary path (early state). This issue was found by running syzkaller. MFC after: 3 days	2019-03-02 14:15:33 +00:00
Michael Tuexen	e57d481c5e	Report the correct length when using the IPPROTO_SCTP level socket options SCTP_GET_PEER_ADDRESSES and SCTP_GET_LOCAL_ADDRESSES.	2019-03-02 13:12:37 +00:00
Michael Tuexen	20ab225b61	Honor the memory limits provided when processing the IPPROTO_SCTP level socket option SCTP_GET_LOCAL_ADDRESSES in a getsockopt() call. Thanks to Thomas Barabosch for reporting the issue which was found by running syzkaller. MFC after: 3 days	2019-03-01 18:47:41 +00:00
Michael Tuexen	3aee58ca76	Improve consistency, no functional change. MFC after: 3 days	2019-03-01 15:57:55 +00:00
John Baldwin	dbcc200058	Various cleanups to the management of multiple TCP stacks. - Use strlcpy() with sizeof() instead of strncpy(). - Simplify initialization of TCP functions structures. init_tcp_functions() was already called before the first call to register a stack. Just inline the work in the SYSINIT and remove the racy helper variable. Instead, KASSERT that the rw lock is initialized when registering a stack. - Protect the default stack via a direct pointer comparison. The default stack uses the name "freebsd" instead of "default" so this protection wasn't working for the default stack anyway. Reviewed by: rrs Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D19152	2019-02-27 20:24:23 +00:00
Bjoern A. Zeeb	a4c69b8bc0	Make arp code return (more) errors. arprequest() is a void function and in case of error we simply return without any feedback. In case of any local operation or *if_output() failing no feedback is sent up the stack for the packet which triggered the arp request to be sent. arpresolve_full() has three pre-canned possible errors returned (if we have not yet sent enough arp requests or if we tried often enough without success) otherwise "no error" is returned. Make arprequest() an "internal" function arprequest_internal() which does return a possible error to the caller. Preserve arprequest() as a void wrapper function for external consumers. In arpresolve_full() add an extra error check. Use the arprequest_internal() function and only return an error if none of the three (mentioned above) is already set. This will return possible errors all the way up the stack and allows functions and programs to react to the send errors rather than leaving them in the dark. Also they might get more detailed feedback on why packets cannot be sent and they will receive it quicker. Reviewed by: karels, hselasky Differential Revision: https://reviews.freebsd.org/D18904	2019-02-24 22:49:56 +00:00
Gleb Smirnoff	0dfc145abe	Support struct ip_mreqn as argument for IP_ADD_MEMBERSHIP. Legacy support for struct ip_mreq remains in place. The struct ip_mreqn is a Linux extension to the classic BSD multicast API. It has an extra field that allows the interface index to be specified explicitly. In Linux it is used as the argument for IP_MULTICAST_IF and IP_ADD_MEMBERSHIP. The FreeBSD kernel also declares this structure and has supported it as the argument to IP_MULTICAST_IF since r170613. So, we had the structure declared but not fully supported, which confused third party application configure scripts. Code handling IP_ADD_MEMBERSHIP was mixed together with code for IP_ADD_SOURCE_MEMBERSHIP. Bringing legacy and new structure support into the mess would have made the "argument switcharoo" intolerable, so the code was separated into its own switch case clause. MFC after: 3 months Differential Revision: https://reviews.freebsd.org/D19276	2019-02-23 06:03:18 +00:00
Michael Tuexen	560c058683	The receive buffer autoscaling for TCP is based on a linear growth, which is acceptable in the congestion avoidance phase, but not during slow start. The MTU is also not taken into account. Use a method instead, which is based on exponential growth working also in slow start and being independent from the MTU. This is joint work with rrs@. Reviewed by: rrs@, Richard Scheffenegger Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D18375	2019-02-21 10:35:32 +00:00
Michael Tuexen	a1f0e13475	This patch addresses an issue brought up by bz@ in D18968: When TCP_REASS_LOGGING is defined, a NULL pointer dereference would happen, if user data was received during the TCP handshake and BB logging is used. A KASSERT is also added to detect tcp_reass() calls with illegal parameter combinations. Reported by: bz@ Reviewed by: rrs@ MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D19254	2019-02-21 09:34:47 +00:00
Michael Tuexen	3b853844d7	Reduce the TCP initial retransmission timeout from 3 seconds to 1 second as allowed by RFC 6298. Reviewed by: kbowling@, Richard Scheffenegger Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D18941	2019-02-20 18:03:43 +00:00
Michael Tuexen	c6dcb64b18	Use exponential backoff for retransmitting SYN segments as specified in the TCP RFCs. Reviewed by: rrs@, Richard Scheffenegger Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D18974	2019-02-20 17:56:38 +00:00
Michael Tuexen	e82fdca156	Fix a byte ordering issue for the advertised receiver window in ACK segments sent in TIMEWAIT state, which I introduced in r336937. MFC after: 3 days Sponsored by: Netflix, Inc.	2019-02-15 09:45:17 +00:00
Andrey V. Elsukov	c7ee62fcd5	In r335015 PCB destroying was made deferred using epoch_call(). But ipsec_delete_pcbpolicy() uses some VNET-virtualized variables, and thus it needs VNET context, that is missing during gtaskqueue executing. Use inp_vnet context to set curvnet in in_pcbfree_deferred(). PR: 235684 MFC after: 1 week	2019-02-13 15:46:05 +00:00
Kristof Provost	3838c6a3e6	garp: Fix vnet related panic for gratuitous arp Gratuitous ARP packets are sent from a timer, which means we don't have a vnet context set. As a result we panic trying to send the packet. Set the vnet context based on the interface associated with the interface address. To reproduce: sysctl net.link.ether.inet.garp_rexmit_count=2 ifconfig vtnet1 10.0.0.1/24 up PR: 235699 Reviewed by: vangyzen@ MFC after: 1 week	2019-02-12 21:22:57 +00:00
Michael Tuexen	aef0641755	Improve input validation for raw IPv4 socket using the IP_HDRINCL option. This issue was found by running syzkaller on OpenBSD. Greg Steuck made me aware that the problem might also exist on FreeBSD. Reported by: Greg Steuck MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D18834	2019-02-12 10:17:21 +00:00
Michael Tuexen	d9707e43df	Fix a locking issue when reporting outbound messages. MFC after: 3 days	2019-02-10 14:02:14 +00:00
Michael Tuexen	507bb10421	Fix a locking issue in the IPPROTO_SCTP level SCTP_PEER_ADDR_THLDS socket option. The problem affects only setsockopt with invalid parameters. This issue was found by syzkaller. MFC after: 3 days	2019-02-10 13:55:32 +00:00
Michael Tuexen	6cf360772f	Fix a locking bug in the IPPROTO_SCTP level SCTP_EVENT socket option. This occurs when calling setsockopt() with invalid parameters. This issue was found by syzkaller. MFC after: 3 days	2019-02-10 10:42:16 +00:00
Michael Tuexen	333669e016	Fix locking for IPPROTO_SCTP level SCTP_DEFAULT_PRINFO socket option. This problem occurred when calling setsockopt() with invalid parameters. This issue was found by running syzkaller. MFC after: 3 days	2019-02-10 08:28:56 +00:00
Michael Tuexen	aa36fbd6fa	Ensure that when using the TCP CDG congestion control and setting the sysctl variable net.inet.tcp.cc.cdg.smoothing_factor to 0, the smoothing is disabled. Without this patch, a division by zero occurs. PR: 193762 Reviewed by: lstewart@, rrs@ MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D19071	2019-02-08 20:42:49 +00:00
Michael Tuexen	baed5270e1	Only reduce the PMTU after the send call. The only way to increase it, is via PMTUD. This fixes an MTU issue reported by Timo Voelker. MFC after: 3 days	2019-02-05 10:29:31 +00:00
Michael Tuexen	e4c42fa266	Fix an off-by-one error in the input validation of the SCTP_RESET_STREAMS socket option. This was found by running syzkaller. MFC after: 3 days	2019-02-05 10:13:51 +00:00
Warner Losh	52467047aa	Regularize the Netflix copyright Use recent best practices for Copyright form at the top of the license: 1. Remove all the All Rights Reserved clauses on our stuff. Where we piggybacked others, use a separate line to make things clear. 2. Use "Netflix, Inc." everywhere. 3. Use a single line for the copyright for grep friendliness. 4. Use date ranges in all places for our stuff. Approved by: Netflix Legal (who gave me the form), adrian@ (pmc files)	2019-02-04 21:28:25 +00:00
Michael Tuexen	116ef4d6e7	When handling SYN-ACK segments in the SYN-RCVD state, set tp->snd_wnd consistently. This inconsistency was observed when working on the bug reported in PR 235256, although it does not fix the reported issue. The fix for the PR will be a separate commit. PR: 235256 Reviewed by: rrs@, Richard Scheffenegger MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D19033	2019-02-01 12:33:00 +00:00
Gleb Smirnoff	547392731f	Repair siftr(4): PFIL_IN and PFIL_OUT are defines of some value, relying on them having particular values can break things.	2019-02-01 08:10:26 +00:00
Gleb Smirnoff	b252313f0b	New pfil(9) KPI together with newborn pfil API and control utility. The KPI has been reviewed and cleansed of features that were planned back 20 years ago and never implemented. The pfil(9) internals have been made opaque to protocols with only returned types and function declarations exposed. The KPI is made more strict, but at the same time more extensible, as the kernel uses the same command structures that the userland ioctl uses. In a nutshell, the [KA]PI is about declaring filtering points, declaring filters, and linking and unlinking them together. The new [KA]PI makes it possible to reconfigure pfil(9) configuration: change the order of hooks, rehook a filter from one filtering point to a different one, disconnect a hook on output leaving it on input only, prepend/append a filter to an existing list of filters. Now it is possible for a single packet filter to provide multiple rulesets that may be linked to different points. Think of per-interface ACLs in Cisco or Juniper. None of the existing packet filters support that yet, however limited usage is already possible, e.g. the default ruleset can be moved to a single interface, as soon as interfaces provide their filtering points. Another future feature is the possibility to create pfil heads that provide not an mbuf pointer but just a memory pointer with length. That would allow filtering at very early stages of a packet lifecycle, e.g. when a packet has just been received by a NIC and no mbuf was yet allocated. Differential Revision: https://reviews.freebsd.org/D18951	2019-01-31 23:01:03 +00:00
Brooks Davis	435a8c1560	Add a simple port filter to SIFTR. SIFTR does not allow any kind of filtering, but captures every packet processed by the TCP stack. Often, only a specific session or service is of interest, and doing the filtering in post-processing of the log adds to the overhead of SIFTR. This adds a new sysctl net.inet.siftr.port_filter. When set to zero, all packets get captured as previously. If set to any other value, only packets where either the source or the destination ports match, are captured in the log file. Submitted by: Richard Scheffenegger Reviewed by: Cheng Cui Differential Revision: https://reviews.freebsd.org/D18897	2019-01-30 17:44:30 +00:00
Michael Tuexen	bf7fcdb18a	Fix the detection of ECN-setup SYN-ACK packets. RFC 3168 defines an ECN-setup SYN-ACK packet as one with the ECE flag set and the CWR flag not set. The code was only checking if the ECE flag is set. This patch adds the check to verify that the CWR flag is not set. Submitted by: Richard Scheffenegger Reviewed by: tuexen@ MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D18996	2019-01-28 12:45:31 +00:00
Michael Tuexen	f635b1c264	Don't include two header files when not needed. This allows the part of the rewrite of TCP reassembly in this file to be MFCed to stable/11 with manual change. MFC after: 3 days Sponsored by: Netflix, Inc.	2019-01-25 17:08:28 +00:00
Michael Tuexen	7dc90a1de0	Fix a bug in the restart window computation of TCP New Reno When implementing support for IW10, an update in the computation of the restart window used after an idle phase was missed. To minimize code duplication, implement the logic in tcp_compute_initwnd() and call it. This fixes a bug in NewReno, which was not aware of IW10. Submitted by: Richard Scheffenegger Reviewed by: tuexen@ MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D18940	2019-01-25 13:57:09 +00:00
Michael Tuexen	989321df11	Get the arithmetic right... MFC after: 3 days Sponsored by: Netflix, Inc.	2019-01-24 16:47:18 +00:00
Michael Tuexen	42395cbe31	Kill a trailing whitespace character... MFC after: 3 days Sponsored by: Netflix, Inc.	2019-01-24 16:43:13 +00:00
Michael Tuexen	34bb795ba1	Update a comment to reflect the current reality. SYN-cache entries live for about 12 seconds, not 45, when default settings are used. MFC after: 1 week Sponsored by: Netflix, Inc.	2019-01-24 16:40:14 +00:00
Mark Johnston	49cf58e559	Style. Reviewed by: bz MFC after: 3 days Sponsored by: The FreeBSD Foundation	2019-01-23 22:19:49 +00:00
Mark Johnston	c06cc56e39	Fix an LLE lookup race. After the afdata read lock was converted to epoch(9), readers could observe a linked LLE and block on the LLE while a thread was unlinking the LLE. The writer would then release the lock and schedule the LLE for deferred free, allowing readers to continue and potentially schedule the LLE timer. By the point the timer fires, the structure is freed, typically resulting in a crash in the callout subsystem. Fix the problem by modifying the lookup path to check for the LLE_LINKED flag upon acquiring the LLE lock. If it's not set, the lookup fails. PR: 234296 Reviewed by: bz Tested by: sbruno, Victor <chernov_victor@list.ru>, Mike Andrews <mandrews@bit0.com> MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D18906	2019-01-23 22:18:23 +00:00
Brooks Davis	c53d6b90ba	Make SIFTR work again after r342125 (D18443). Correct a logic error. Only disable when already enabled or enable when disabled. Submitted by: Richard Scheffenegger Reviewed by: Cheng Cui Obtained from: Cheng Cui MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D18885	2019-01-18 21:46:38 +00:00
Michael Tuexen	d9ba240c1c	Limit the user-controllable amount of memory the kernel allocates via IPPROTO_SCTP level socket options. This issue was found by running syzkaller. MFC after: 1 week	2019-01-16 11:33:47 +00:00
Stephen Hurd	500759395a	Fix window update issue when scaling disabled When the TCP window scale option is not used, and the window opens up enough in one soreceive, a window update will not be sent. For example, if recwin == 65535, so->so_rcv.sb_hiwat >= 262144, and so->so_rcv.sb_hiwat <= 524272, the window update will never be sent. This is because recwin and adv are clamped to TCP_MAXWIN << tp->rcv_scale, and so will never be >= so->so_rcv.sb_hiwat / 4 or <= so->so_rcv.sb_hiwat / 8. This patch ensures a window update is sent if the window opens by TCP_MAXWIN << tp->rcv_scale, which should only happen when the window size goes from zero to the max expressible. This issue looks like it was introduced in r306769 when recwin was clamped to TCP_MAXWIN << tp->rcv_scale. MFC after: 1 week Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D18821	2019-01-15 17:40:19 +00:00
Michael Tuexen	10731c54b6	Fix getsockopt() for IP_OPTIONS/IP_RETOPTS. r336616 copies inp->inp_options using the m_dup() function. However, this function expects an mbuf packet header at the beginning, which is not true in this case. Therefore, use m_copym() instead of m_dup(). This issue was found by syzkaller. Reviewed by: mmacy@ MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D18753	2019-01-09 06:36:57 +00:00
Gleb Smirnoff	a68cc38879	Mechanical cleanup of epoch(9) usage in network stack. - Remove macros that covertly create epoch_tracker on thread stack. Such macros are quite unsafe, e.g. they will produce buggy code if the same macro is used in embedded scopes. Explicitly declare epoch_tracker always. - Unmask interface list IFNET_RLOCK_NOSLEEP(), interface address list IF_ADDR_RLOCK() and interface AF specific data IF_AFDATA_RLOCK() read locking macros to what they actually are - the net_epoch. Keeping them as is is very misleading. They all are named FOO_RLOCK(), while they no longer have lock semantics. Now they allow recursion and what's more important they now no longer guarantee protection against their companion WLOCK macros. Note: INP_HASH_RLOCK() has the same problems, but is not touched by this commit. This is a non functional mechanical change. The only functionally changed functions are ni6_addrs() and ni6_store_addrs(), where we no longer enter epoch recursively. Discussed with: jtl, gallatin	2019-01-09 01:11:19 +00:00
Mark Johnston	2f2ddd68a5	Support MSG_DONTWAIT in send(2). As it does for recv(2), MSG_DONTWAIT indicates that the call should not block, returning EAGAIN instead. Linux and OpenBSD both implement this, so the change makes porting easier, especially since we do not return EINVAL or so when unrecognized flags are specified. Submitted by: Greg V <greg@unrelenting.technology> Reviewed by: tuexen MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D18728	2019-01-04 17:31:50 +00:00
Michael Tuexen	09423f72fd	Fix a regression in the TCP handling of received segments. When receiving TCP segments the stack protects itself by limiting the resources allocated for a TCP connection. This patch adds an exception to these limitations for the TCP segment which is the next expected in-sequence segment. Without this patch, TCP connections may stall and finally fail in some cases of packet loss. Reported by: jhb@ Reviewed by: jtl@, rrs@ MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D18580	2018-12-20 16:05:30 +00:00
Hiren Panchasara	51e712f865	Revert r331567 CC Cubic: fix underflow for cubic_cwnd() This change is causing TCP connections using cubic to hang. Need to dig more to find exact cause and fix it. Reported by: tj at mrsk dot me, Matt Garber (via twitter) Discussed with: sbruno (previously), allanjude, cperciva MFC after: 3 days	2018-12-15 17:01:16 +00:00
Brooks Davis	855acb84ca	Fix bugs in pluggable CC algorithm and siftr sysctls. Use the sysctl_handle_int() handler to write out the old value and read the new value into a temporary variable. Use the temporary variable for any checks of values rather than using the CAST_PTR_INT() macro on req->newptr. The prior usage read directly from userspace memory if the sysctl() was called correctly. This is unsafe and doesn't work at all on some architectures (at least i386.) In some cases, the code could also be tricked into reading from kernel memory and leaking limited information about the contents or crashing the system. This was true for CDG, newreno, and siftr on all platforms and true for i386 in all cases. The impact of this bug is largest in VIMAGE jails which have been configured to allow writing to these sysctls. Per discussion with the security officer, we will not be issuing an advisory for this issue as root access and a non-default config are required to be impacted. Reviewed by: markj, bz Discussed with: gordon (security officer) MFC after: 3 days Security: kernel information leak, local DoS (both require root) Differential Revision: https://reviews.freebsd.org/D18443	2018-12-15 15:06:22 +00:00
Mateusz Guzik	cc426dd319	Remove unused argument to priv_check_cred. Patch mostly generated with coccinelle: @@ expression E1,E2; @@ - priv_check_cred(E1,E2,0) + priv_check_cred(E1,E2) Sponsored by: The FreeBSD Foundation	2018-12-11 19:32:16 +00:00
Mark Johnston	9d2877fc3d	Clamp the INPCB port hash tables to IPPORT_MAX + 1 chains. Memory beyond that limit was previously unused, wasting roughly 1MB per 8GB of RAM. Also retire INP_PCBLBGROUP_PORTHASH, which was identical to INP_PCBPORTHASH. Reviewed by: glebius MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D17803	2018-12-05 17:06:00 +00:00
Andrey V. Elsukov	d66f9c86fa	Add ability to request listing and deleting only for dynamic states. This can be useful, when net.inet.ip.fw.dyn_keep_states is enabled, but after rules reloading some state must be deleted. Added new flag '-D' for such purpose. Retire '-e' flag, since there can not be expired states in the meaning that this flag historically had. Also add "verbose" mode for listing of dynamic states, it can be enabled with '-v' flag and adds additional information to states list. This can be useful for debugging. Obtained from: Yandex LLC MFC after: 2 months Sponsored by: Yandex LLC	2018-12-04 16:12:43 +00:00
Michael Tuexen	c8b53ced95	Limit option_len for the TCP_CCALGOOPT. Limiting the length to 2048 bytes seems to be acceptable, since the values used right now are using 8 bytes. Reviewed by: glebius, bz, rrs MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D18366	2018-11-30 10:50:07 +00:00
Mark Johnston	79db6fe7aa	Plug some networking sysctl leaks. Various network protocol sysctl handlers were not zero-filling their output buffers and thus would export uninitialized stack memory to userland. Fix a number of such handlers. Reported by: Thomas Barabosch, Fraunhofer FKIE Reviewed by: tuexen MFC after: 3 days Security: kernel memory disclosure Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D18301	2018-11-22 20:49:41 +00:00
Michael Tuexen	ad2be38941	A TCP stack is required to check SEG.ACK first, when processing a segment in the SYN-SENT state as stated in Section 3.9 of RFC 793, page 66. Ensure this is also done by the TCP RACK stack. Reviewed by: rrs@ MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D18034	2018-11-22 20:05:57 +00:00
Michael Tuexen	fef56019e9	Ensure that the TCP RACK stack honours the setting of the net.inet.tcp.drop_synfin sysctl-variable. Reviewed by: rrs@ MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D18033	2018-11-22 20:02:39 +00:00
Michael Tuexen	7e729f0787	Ensure that the default RTT stack can make an RTT measurement if the TCP connection was initiated using the RACK stack, but the peer does not support the TCP RACK extension. This ensures that the TCP behaviour on the wire is the same if the TCP connection is initiated using the RACK stack or the default stack. Reviewed by: rrs@ MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D18032	2018-11-22 19:56:52 +00:00
Michael Tuexen	794107181a	Ensure that TCP RST-segments announce consistently a receiver window of zero. This was already done when sending them via tcp_respond(). Reviewed by: rrs@ MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D17949	2018-11-22 19:49:52 +00:00
Michael Tuexen	3bea9a2664	Improve two KASSERTs in the TCP RACK stack. There are two locations where an always true comparison was made in a KASSERT. Replace this by an appropriate check and use a consistent panic message. Also use this code when checking a similar condition. PR: 229664 Reviewed by: rrs@ MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D18021	2018-11-21 18:19:15 +00:00
Andrey V. Elsukov	5786c6b9f9	Make multiline APPLY_MASK() macro to be function-like. Reported by: cem MFC after: 1 week	2018-11-20 18:38:28 +00:00
Bjoern A. Zeeb	945aad9c62	Improve the comment for arpresolve_full() in if_ether.c. No functional changes. MFC after: 6 weeks	2018-11-17 16:13:09 +00:00
Bjoern A. Zeeb	90d99b6587	Retire arpresolve_addr(), which is not used anywhere, from if_ether.c.	2018-11-17 16:08:36 +00:00
Jonathan T. Looney	2157f3c36a	Add some additional length checks to the IPv4 fragmentation code. Specifically, block 0-length fragments, even when the MF bit is clear. Also, ensure that every fragment with the MF bit clear ends at the same offset and that no subsequently-received fragments exceed that offset. Reviewed by: glebius, markj MFC after: 3 days Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D17922	2018-11-16 18:32:48 +00:00
Mark Johnston	86af1d0241	Ensure that IP fragments do not extend beyond IP_MAXPACKET. Such fragments are obviously invalid, and when processed may end up violating the sort order (by offset) of fragments of a given packet. This doesn't appear to be exploitable, however. Reviewed by: emaste Discussed with: jtl MFC after: 3 days Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17914	2018-11-10 03:00:36 +00:00
Ed Maste	2bfaf585ca	Avoid buffer underwrite in icmp_error icmp_error allocates either an mbuf (with pkthdr) or a cluster depending on the size of data to be quoted in the ICMP reply, but the calculation failed to account for the additional padding that m_align may apply. Include the ip header in the size passed to m_align. On 64-bit archs this will have the net effect of moving everything 4 bytes later in the mbuf or cluster. This will result in slightly pessimal alignment for the ICMP data copy. Also add an assertion that we do not move m_data before the beginning of the mbuf or cluster. Reported by: A reddit user Reviewed by: bz, jtl MFC after: 3 days Security: CVE-2018-17156 Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17909	2018-11-08 20:17:36 +00:00
Michael Tuexen	8553b984a5	Don't use a function when neither INET nor INET6 are defined. This is a valid case for the userland stack, where this fixes two set-but-not-used warnings in this case. Thanks to Christian Wright for reporting the issue.	2018-11-06 12:55:03 +00:00
Jonathan T. Looney	54e675342b	m_pulldown() may reallocate n. Update the oip pointer after the m_pulldown() call. MFC after: 2 weeks Sponsored by: Netflix	2018-11-02 19:14:15 +00:00
Bjoern A. Zeeb	e2c532f156	carpstats are the last virtualised variable in the file and end up at the end of the vnet_set. The generated code uses an absolute relocation at one byte beyond the end of the carpstats array. This means the relocation for the vnet does not happen for carpstats initialisation and as a result the kernel panics on module load. This problem has only been observed with carp and only on i386. We considered various possible solutions including using linker scripts to add padding to all kernel modules for pcpu and vnet sections. While the symbols (by chance) stay in the order of appearance in the file adding an unused non-file-local variable at the end of the file will extend the size of set_vnet and hence make the absolute relocation for carpstats work (think of this as a single-module set_vnet padding). This is a (temporary) hack. It is the least intrusive one as we need a timely solution for the upcoming release. We will revisit the problem in HEAD. For a lot more information and the possible alternate solutions please see the PR and the references therein. PR: 230857 MFC after: 3 days	2018-11-01 17:26:18 +00:00
Mark Johnston	d9ff5789be	Remove redundant checks for a NULL lbgroup table. No functional change intended. MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17108	2018-11-01 15:52:49 +00:00
Mark Johnston	79ee680b65	Improve style in in_pcbinslbgrouphash() and related subroutines. No functional change intended. MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17107	2018-11-01 15:51:49 +00:00
Michael Tuexen	6999f6975c	Remove debug code which slipped in accidentally. MFC after: 4 weeks X-MFC with: r339989 Sponsored by: Netflix, Inc.	2018-11-01 11:41:40 +00:00
Michael Tuexen	099ab39f44	Improve a comment to refer to the actual sections in the TCP specification for the comparisons made. Thanks to lstewart@ for the suggestion. MFC after: 4 weeks Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D17595	2018-11-01 11:35:28 +00:00
Bjoern A. Zeeb	201100c58b	Initial implementation of draft-ietf-6man-ipv6only-flag. This change defines the RA "6" (IPv6-Only) flag which routers may advertise, kernel logic to check if all routers on a link have the flag set and accordingly update a per-interface flag. If all routers agree that it is an IPv6-only link, ether_output_frame(), based on the interface flag, will filter out all ETHERTYPE_IP/ARP frames, drop them, and return EAFNOSUPPORT to upper layers. The change also updates ndp to show the "6" flag, ifconfig to display the IPV6_ONLY nd6 flag if set, and rtadvd to allow announcing the flag. Further changes to tcpdump (contrib code) are available and will be upstreamed. Tested the code (slightly earlier version) with 2 FreeBSD IPv6 routers, a FreeBSD laptop on ethernet as well as wifi, and with Win10 and OSX clients (which did not fall over with the "6" flag set but not understood). We may also want to (a) implement an RX filter, and (b) over time enhance user space to, say, stop dhclient from running when the interface flag is set. Also we might want to start IPv6 before IPv4 in the future. All the code is hidden under the EXPERIMENTAL option and not compiled by default as the draft is a work-in-progress and we cannot rely on the fact that IANA will assign the bits as requested by the draft and hence they may change. Dear 6man, you have running code. Discussed with: Bob Hinden, Brian E Carpenter	2018-10-30 20:08:48 +00:00
Mark Johnston	da7d7778b0	Expose some netdump configuration parameters through sysctl. Reviewed by: cem MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D17755	2018-10-29 21:16:26 +00:00
Eugene Grosbein	1a5995cc88	Prevent ip_input() from panicking due to unprotected access to INADDR_HASH. PR: 220078 MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D12457 Tested-by: Cassiano Peixoto and others	2018-10-27 04:59:35 +00:00
Eugene Grosbein	4f1e3122ac	Prevent multicast code from panicking due to unprotected access to INADDR_HASH. PR: 220078 MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D12457 Tested-by: Cassiano Peixoto and others	2018-10-27 04:53:25 +00:00
Michael Tuexen	de00ad05e6	Add initial descriptions for SCTP related MIB variables. This work was mostly done by Marie-Helene Kvello-Aune. MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D3583	2018-10-26 21:04:17 +00:00
Andrey V. Elsukov	8796e291f8	Add the check that current VNET is ready and access to srchash is allowed. This change is similar to r339646. The callback that checks for appearing and disappearing of tunnel ingress address can be called during VNET teardown. To prevent access to already freed memory, add check to the callback and epoch_wait() call to be sure that callback has finished its work. MFC after: 20 days	2018-10-23 13:11:45 +00:00
John Baldwin	74e10fb613	A couple of style fixes in recent TCP changes. - Add a blank line before a block comment to match other block comments in the same function. - Sort the prototype for sbsndptr_adv and fix whitespace between return type and function name. Reviewed by: gallatin, bz Differential Revision: https://reviews.freebsd.org/D17474	2018-10-22 21:17:36 +00:00
Eugene Grosbein	410634efd1	New sysctl: net.inet.icmp.error_keeptags Currently, icmp_error() function copies FIB number from original packet into generated ICMP response but not mbuf_tags(9) chain. This prevents us from easily matching ICMP responses corresponding to tagged original packets by means of packet filter such as ipfw(8). For example, ICMP "time-exceeded in-transit" packets usually generated in response to traceroute probes lose tags attached to original packets. This change adds new sysctl net.inet.icmp.error_keeptags that defaults to 0 to avoid extra overhead when this feature is not needed. Set net.inet.icmp.error_keeptags=1 to make icmp_error() copy mbuf_tags from original packet to generated ICMP response. PR: 215874 MFC after: 1 month	2018-10-21 21:29:19 +00:00
Andrey V. Elsukov	f252e3f2f2	Include <sys/eventhandler.h> to fix the build. MFC after: 1 month	2018-10-21 18:39:34 +00:00
Andrey V. Elsukov	19873f4780	Add handling for appearing/disappearing of ingress addresses to if_gre(4). * register handler for ingress address appearing/disappearing; * add new srcaddr hash table for fast softc lookup by srcaddr; * when srcaddr disappears, clear IFF_DRV_RUNNING flag from interface, and set it otherwise; MFC after: 1 month Sponsored by: Yandex LLC Differential Revision: https://reviews.freebsd.org/D17214	2018-10-21 18:13:45 +00:00
Andrey V. Elsukov	009d82ee0f	Add handling for appearing/disappearing of ingress addresses to if_gif(4). * register handler for ingress address appearing/disappearing; * add new srcaddr hash table for fast softc lookup by srcaddr; * when srcaddr disappears, clear IFF_DRV_RUNNING flag from interface, and set it otherwise; * remove the note about ingress address from BUGS section. MFC after: 1 month Sponsored by: Yandex LLC Differential Revision: https://reviews.freebsd.org/D17134	2018-10-21 18:06:15 +00:00
Andrey V. Elsukov	8251c68d5c	Add KPI that can be used by tunneling interfaces to handle IP addresses appearing and disappearing on the host system. Such handling is needed, because tunneling interfaces must use addresses, that are configured on the host as ingress addresses for tunnels. Otherwise the system can send spoofed packets with source address, that belongs to foreign host. The KPI uses ifaddr_event_ext event to implement addresses tracking. Tunneling interfaces register event handlers and then they are notified by the kernel, when an address disappears or appears. ifaddr_event_compat() handler from if.c is replaced by srcaddr_change_event() in ip_encap.c MFC after: 1 month Sponsored by: Yandex LLC Differential Revision: https://reviews.freebsd.org/D17134	2018-10-21 17:55:26 +00:00
Andrey V. Elsukov	094d6f8d75	Add IPFW_RULE_JUSTOPTS flag, that is used by ipfw(8) to mark rule, that was added using "new rule format". And then, when the kernel returns rule with this flag, ipfw(8) can correctly show it. Reported by: lev MFC after: 3 weeks Sponsored by: Yandex LLC Differential Revision: https://reviews.freebsd.org/D17373	2018-10-21 15:10:59 +00:00
Andrey V. Elsukov	64d63b1e03	Add ifaddr_event_ext event. It is similar to ifaddr_event, but the handler receives the type of event IFADDR_EVENT_ADD/IFADDR_EVENT_DEL, and the pointer to ifaddr. Also ifaddr_event now is implemented using ifaddr_event_ext handler. MFC after: 3 weeks Sponsored by: Yandex LLC Differential Revision: https://reviews.freebsd.org/D17100	2018-10-21 15:02:06 +00:00
Michael Tuexen	93899d10b4	The handling of RST segments in the SYN-RCVD state exists in two code paths. Both are not consistent and the one on the syn cache code does not conform to the relevant specifications (Page 69 of RFC 793 and Section 4.2 of RFC 5961). This patch fixes this: * The sequence numbers checks are fixed as specified on Page 69 of RFC 793. * The sysctl variable net.inet.tcp.insecure_rst is now honoured and the behaviour is as specified in Section 4.2 of RFC 5961. Approved by: re (gjb@) Reviewed by: bz@, glebius@, rrs@ Differential Revision: https://reviews.freebsd.org/D17595 Sponsored by: Netflix, Inc.	2018-10-18 19:21:18 +00:00
Jonathan T. Looney	ac75e35d85	In r338102, the TCP reassembly code was substantially restructured. Prior to this change, the code sometimes used a temporary stack variable to hold details of a TCP segment. r338102 stopped using the variable to hold segments, but did not actually remove the variable. Because the variable is no longer used, we can safely remove it. Approved by: re (gjb)	2018-10-16 14:41:09 +00:00
Bjoern A. Zeeb	4ba16a92c7	In udp_input() when walking the pcblist we can come across an inp marked FREED after the epoch(9) changes. Check once we hold the lock and skip the inp if it is the case. Contrary to IPv6 the locking of the inp is outside the multicast section and hence a single check seems to suffice. PR: 232192 Reviewed by: mmacy, markj Approved by: re (kib) Differential Revision: https://reviews.freebsd.org/D17540	2018-10-12 22:51:45 +00:00
Bjoern A. Zeeb	3afdfcaf33	r217592 moved the check for imo in udp_input() into the conditional block but leaving the variable assignment outside the block, where it is no longer used. Move both the variable and the assignment one block further in. This should result in no functional changes. It will however make upcoming changes slightly easier to apply. Reviewed by: markj, jtl, tuexen Approved by: re (kib) Differential Revision: https://reviews.freebsd.org/D17525	2018-10-12 11:30:46 +00:00
Jonathan T. Looney	13c6ba6d94	There are three places where we return from a function which entered an epoch section without exiting that epoch section. This is bad for two reasons: the epoch section won't exit, and we will leave the epoch tracker from the stack on the epoch list. Fix the epoch leak by making sure we exit epoch sections before returning. Reviewed by: ae, gallatin, mmacy Approved by: re (gjb, kib) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D17450	2018-10-09 13:26:06 +00:00
Michael Tuexen	3535cdc43e	Avoid truncating unrecognised parameters when reporting them. This resulted in sending malformed packets. Approved by: re (kib@) MFC after: 1 week	2018-10-07 15:13:47 +00:00
Michael Tuexen	3924dfa721	Ensure that the ips_localout counter is incremented for locally generated SCTP packets sent over IPv4. This makes the behaviour consistent with IPv6. Reviewed by: ae@, bz@, jtl@ Approved by: re (kib@) MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D17406	2018-10-07 11:26:15 +00:00
Tom Jones	b6e870116f	Convert UDP length to host byte order When getting the number of bytes to checksum make sure to convert the UDP length to host byte order when the entire header is not in the first mbuf. Reviewed by: jtl, tuexen, ae Approved by: re (gjb), jtl (mentor) Differential Revision: https://reviews.freebsd.org/D17357	2018-10-05 12:51:30 +00:00
Ryan Stone	083a010c62	Hold a write lock across udp_notify() With the new route cache feature udp_notify() will modify the inp when it needs to invalidate the route cache. Ensure that we hold a write lock on the inp before calling the function to ensure that multiple threads don't race while trying to invalidate the cache (which previously led to a page fault). Differential Revision: https://reviews.freebsd.org/D17246 Reviewed by: sbruno, bz, karels Sponsored by: Dell EMC Isilon Approved by: re (gjb)	2018-10-04 22:03:58 +00:00
Michael Tuexen	15a087e551	Mitigate providing a timing signal if the COOKIE or AUTH validation fails. Thanks to jmg@ for reporting the issue, which was discussed in https://admbugs.freebsd.org/show_bug.cgi?id=878 Approved by: re (TBD@) MFC after: 1 week	2018-10-01 14:05:31 +00:00
Michael Tuexen	9d2e3f14c4	After allocating chunks set the fields in a consistent way. This removes two assignments for the flags field being done twice and adds one, which was missing. Thanks to Felix Weinrank for reporting the issue he found by using fuzz testing of the userland stack. Approved by: re (kib@) MFC after: 1 week	2018-10-01 13:09:18 +00:00
Andrey V. Elsukov	384a5c3c28	Add INP_INFO_WUNLOCK_ASSERT() macro and use it instead of INP_INFO_UNLOCK_ASSERT() in TCP-related code. For encapsulated traffic it is possible, that the code is running in net_epoch_preempt section, and INP_INFO_UNLOCK_ASSERT() is very strict assertion for such case. PR: 231428 Reviewed by: mmacy, tuexen Approved by: re (kib) Differential Revision: https://reviews.freebsd.org/D17335	2018-10-01 10:46:00 +00:00
Michael Tuexen	1b084a5e5e	Plug mbuf leak in the SCTP input path in an error case. Approved by: re (kib@) MFC after: 1 week CID: 749312	2018-09-30 21:54:02 +00:00
Michael Tuexen	66bcf0b333	Plug mbuf leaks in the SCTP output path in error cases. Approved by: re (kib@) MFC after: 1 week CID: 1395307	2018-09-30 21:31:33 +00:00
Michael Tuexen	8184648425	Fix the handling of ancillary data for SCTP socket. Implement sctp_process_cmsgs_for_init() and sctp_findassociation_cmsgs() similar to sctp_find_cmsg() to improve consistency and avoid the signed/unsigned issues in sctp_process_cmsgs_for_init() and sctp_findassociation_cmsgs(). Thanks to andrew@ for reporting the problem he found using syzkaller. Approved by: re (kib@) MFC after: 1 week	2018-09-30 16:21:31 +00:00
Michael Tuexen	ae0a9a8850	Increment the corresponding UDP stats counter (udps_opackets) when sending UDP encapsulated SCTP packets. This is consistent with the behaviour that when such packets are received, the corresponding UDP stats counter (udps_ipackets) is incremented. Thanks to Peter Lei for making me aware of this inconsistency. Approved by: re (kib@) MFC after: 1 week	2018-09-30 12:16:06 +00:00
Michael Tuexen	3552f16d82	Fix typo in comment. Reported by: @danfe Approved by: re (kib@) MFC after: 1 week X-MFC: r338941	2018-09-28 19:47:32 +00:00
Michael Tuexen	0277ec9c43	Whitespace changes and fixing a typo. No functional change. Approved by: re (kib@) MFC after: 1 week	2018-09-26 10:24:50 +00:00
Michael Tuexen	078a49a077	Remove the unused parameter 'locked' from the function syncache_respond(). There is no functional change. The parameter became unused in r313330, but wasn't removed. Approved by: re (kib@) MFC after: 1 month Sponsored by: Netflix, Inc.	2018-09-23 16:37:32 +00:00
Andrey V. Elsukov	76b09d1823	Add new field max_hdrsize to struct encap_config. It is currently unused and reserved for future use to keep KBI/KPI. Also add several spare pointers to be able to extend the structure if needed. Approved by: re (gjb)	2018-09-20 19:45:27 +00:00
Michael Tuexen	ba4704a278	Remove unused code. Approved by: re (kib@) MFC after: 1 week	2018-09-18 10:53:07 +00:00
Michael Tuexen	a8a8a8a808	Fix TCP Fast Open for the TCP RACK stack. * Fix a bug where the SYN handling during established state was applied to a front state. * Move a check for retransmission after the timer handling. This was suppressing timer based retransmissions. * Fix an off-by one byte in the sequence number of retransmissions. * Apply fixes corresponding to https://svnweb.freebsd.org/changeset/base/336934 Reviewed by: rrs@ Approved by: re (kib@) MFC after: 1 month Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D16912	2018-09-12 10:27:58 +00:00
Mark Johnston	54af3d0dac	Fix synchronization of LB group access. Lookups are protected by an epoch section, so the LB group linkage must be a CK_LIST rather than a plain LIST. Furthermore, we were not deferring LB group frees, so in_pcbremlbgrouphash() could race with readers and cause a use-after-free. Reviewed by: sbruno, Johannes Lundberg <johalun0@gmail.com> Tested by: gallatin Approved by: re (gjb) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17031	2018-09-10 19:00:29 +00:00
Mark Johnston	a7026c7fd9	Use ratecheck(9) in in_pcbinslbgrouphash(). Reviewed by: bz, Johannes Lundberg <johalun0@gmail.com> Approved by: re (kib) Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D17065	2018-09-07 21:11:41 +00:00
Bjoern A. Zeeb	113c4fad55	The inp_lle field to struct inpcb, along with two "valid" flags for the rt and lle cache were added in r191129 (2009). To my best knowledge they have never been used and route caching has converted the inp_rt field from that commit to inp_route rendering this field and these flags obsolete. Convert the pointer into a spare pointer to not change the size of the structure anymore (and to have a spare pointer) and mark the two fields as unused. Reviewed by: markj, karels Approved by: re (gjb) Differential Revision: https://reviews.freebsd.org/D17062	2018-09-06 19:55:40 +00:00
Bjoern A. Zeeb	6d2b0c0166	Make tcp_hpts.c compile a LINT kernel with options RSS and PCBGROUPS added by adding the missing include files and changing the type of cpuid which would otherwise cause a false comparison with NETISR_CPUID_NONE. Reviewed by: rrs Approved by: re (marius) Differential Revision: https://reviews.freebsd.org/D16891	2018-09-06 16:11:24 +00:00
Mark Johnston	49365eb433	Define sctp probes only when SCTP is configured. Otherwise the "depends_on provider" guard in sctp.d does not work as intended. Reported by: mjg Reviewed by: tuexen Approved by: re (gjb) Differential Revision: https://reviews.freebsd.org/D17057	2018-09-06 14:15:03 +00:00
Mark Johnston	8be02ee4da	Fix style bugs in in_pcblookup_lbgroup(). No functional change intended. Reviewed by: bz, Johannes Lundberg <johalun0@gmail.com> Approved by: re (rgrimes) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17030	2018-09-05 15:04:11 +00:00
Eugene Grosbein	d5d21ad932	Fix "ipfw fwd" to work for incoming IPv4 packets when ip_tryforward() chooses fast forwarding path, as it already works for IPv6 and for both of them on old slow path. PR: 231143 Reviewed by: ae Approved by: re (gjb) MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D17039	2018-09-05 13:59:36 +00:00
Mark Johnston	73ad0b6abf	Use the correct malloc type in in_pcblbgroup_free(). Approved by: re (kib) Sponsored by: The FreeBSD Foundation	2018-09-03 17:39:09 +00:00
Michael Tuexen	c6c0be2765	Fix a shadowed variable warning. Thanks to Peter Lei for reporting the issue. Approved by: re(kib@) MFH: 1 month Sponsored by: Netflix, Inc.	2018-08-24 10:50:19 +00:00
Michael Tuexen	90ab3571d8	Use arc4rand() instead of read_random() in the SCTP and TCP code. This was suggested by jmg@. Reviewed by: delphij@, jmg@, jtl@ MFC after: 1 month Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D16860	2018-08-23 19:10:45 +00:00
Michael Tuexen	4ba1513d1a	Don't use the explicit number 32 for the length of the secrets, use sizeof() or explicit #defines instead. No functional change. This was suggested by jmg@. MFC after: 1 month XMFC with: r338053 Sponsored by: Netflix, Inc.	2018-08-23 06:03:59 +00:00
Michael Tuexen	1e88cc8b59	Add support for send, receive and state-change DTrace providers for SCTP. They are based on what is specified in the Solaris DTrace manual for Solaris 11.4. Reviewed by: 0mp, dteske, markj Relnotes: yes Differential Revision: https://reviews.freebsd.org/D16839	2018-08-22 21:23:32 +00:00
Matt Macy	d3878608d7	in_mcast: fix copy paste error when clearing flag	2018-08-22 04:09:55 +00:00
Michael Tuexen	5dff1c3845	Enabling the IPPROTO_IPV6 level socket option IPV6_USE_MIN_MTU on a TCP socket resulted in sending fragmented IPV6 packets. This is fixed by reducing the MSS to the appropriate value. In addition, if the socket option is set before the handshake happens, announce this MSS to the peer. This is not strictly required, but done since TCP is conservative. PR: 173444 Reviewed by: bz@, rrs@ MFC after: 1 month Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D16796	2018-08-21 14:12:30 +00:00
Michael Tuexen	7d4dcc36a8	Fix the inheritance of IPv6 level socket options on TCP sockets. This was broken for IPv6 listening sockets, which are not IPV6_ONLY, and the accepted TCP connection was using IPv4. Reviewed by: bz@, rrs@ MFC after: 1 month Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D16792	2018-08-21 14:07:36 +00:00
Michael Tuexen	6ef849e601	Whitespace change.	2018-08-21 13:37:06 +00:00
Michael Tuexen	1a0b021677	Refactor the SHUTDOWN_PENDING state handling. This is not a functional change but a preparation for the upcoming DTrace support. It is necessary to change the state in one logical operation, even if it involves clearing the sub state SHUTDOWN_PENDING. MFC after: 1 month	2018-08-21 13:25:32 +00:00
Bjoern A. Zeeb	10b070c166	GC inc_isipv6; it was added for "temp" compatibility in 2001, r86764 and does not seem to be used.	2018-08-20 20:06:36 +00:00
Randall Stewart	c28440db29	This change represents a substantial restructure of the way we reassemble inbound tcp segments. The old algorithm just blindly dropped in segments without coalescing. This meant that every segment could take up greater and greater room on the linked list of segments. This of course is now subject to a tighter limit (100) of segments which in a high BDP situation will cause us to be a lot more inefficient as we drop segments beyond 100 entries that we receive. What this restructure does is cause the reassembly buffer to coalesce segments putting an emphasis on the two common cases (which avoid walking the list of segments) i.e. where we add to the back of the queue of segments and where we add to the front. We also have the reassembly buffer supporting a couple of debug options (black box logging as well as counters for code coverage). These are compiled out by default but can be added by uncommenting the defines. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D16626	2018-08-20 12:43:18 +00:00
Michael Tuexen	8e02b4e00c	Don't expose the uptime via the TCP timestamps. The TCP client side or the TCP server side when not using SYN-cookies used the uptime as the TCP timestamp value. This patch uses in all cases an offset, which is the result of a keyed hash function taking the source and destination addresses and port numbers into account. The keyed hash function is the same as used for the initial TSN. Reviewed by: rrs@ MFC after: 1 month Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D16636	2018-08-19 14:56:10 +00:00
Navdeep Parhar	32d2623ae2	Add the ability to look up the 3b PCP of a VLAN interface. Use it in toe_l2_resolve to fill up the complete vtag and not just the vid. Reviewed by: kib@ MFC after: 1 week Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D16752	2018-08-16 23:46:38 +00:00
Matt Macy	f9be038601	Fix in6_multi double free This is actually several different bugs: - The code is not designed to handle inpcb deletion after interface deletion - add reference for inpcb membership - The multicast address has to be removed from interface lists when the refcount goes to zero OR when the interface goes away - decouple list disconnect from refcount (v6 only for now) - ifmultiaddr can exist past being on interface lists - add flag for tracking whether or not it's enqueued - deferring freeing moptions makes the inpcb cleanup code simpler but opens the door wider still to races - call inp_gcmoptions synchronously after dropping the inpcb lock Fundamentally multicast needs a rewrite - but keep applying band-aids for now. Tested by: kp Reported by: novel, kp, lwhsu	2018-08-15 20:23:08 +00:00
Luiz Otavio O Souza	59b2022f94	Late style follow up on r312770. Submitted by: glebius X-MFC with: r312770 MFC after: 3 days	2018-08-15 15:44:30 +00:00
Jonathan T. Looney	a967df1c8f	Lower the default limits on the IPv4 reassembly queue. In particular, try to ensure that no bucket will have a reassembly queue larger than approximately 100 items. This limits the cost to find the correct reassembly queue when processing an incoming fragment. Due to the low limits on each bucket's length, increase the size of the hash table from 64 to 1024. Reviewed by: jhb Security: FreeBSD-SA-18:10.ip Security: CVE-2018-6923	2018-08-14 17:30:46 +00:00
Jonathan T. Looney	ff790bbad0	Implement a limit on the number of IPv4 reassembly queues per bucket. There is a hashing algorithm which should distribute IPv4 reassembly queues across the available buckets in a relatively even way. However, if there is a flaw in the hashing algorithm which allows a large number of IPv4 fragment reassembly queues to end up in a single bucket, a per- bucket limit could help mitigate the performance impact of this flaw. Implement such a limit, with a default of twice the maximum number of reassembly queues divided by the number of buckets. Recalculate the limit any time the maximum number of reassembly queues changes. However, allow the user to override the value using a sysctl (net.inet.ip.maxfragbucketsize). Reviewed by: jhb Security: FreeBSD-SA-18:10.ip Security: CVE-2018-6923	2018-08-14 17:23:05 +00:00
Jonathan T. Looney	7b9c5eb0a5	Add a global limit on the number of IPv4 fragments. The IP reassembly fragment limit is based on the number of mbuf clusters, which are a global resource. However, the limit is currently applied on a per-VNET basis. Given enough VNETs (or given sufficient customization of enough VNETs), it is possible that the sum of all the VNET limits will exceed the number of mbuf clusters available in the system. Given the fact that the fragment limit is intended (at least in part) to regulate access to a global resource, the fragment limit should be applied on a global basis. VNET-specific limits can be adjusted by modifying the net.inet.ip.maxfragpackets and net.inet.ip.maxfragsperpacket sysctls. To disable fragment reassembly globally, set net.inet.ip.maxfrags to 0. To disable fragment reassembly for a particular VNET, set net.inet.ip.maxfragpackets to 0. Reviewed by: jhb Security: FreeBSD-SA-18:10.ip Security: CVE-2018-6923	2018-08-14 17:19:49 +00:00
Jonathan T. Looney	5d9bd45518	Improve hashing of IPv4 fragments. Currently, IPv4 fragments are hashed into buckets based on a 32-bit key which is calculated by (src_ip ^ ip_id) and combined with a random seed. However, because an attacker can control the values of src_ip and ip_id, it is possible to construct an attack which causes very deep chains to form in a given bucket. To ensure more uniform distribution (and lower predictability for an attacker), calculate the hash based on a key which includes all the fields we use to identify a reassembly queue (dst_ip, src_ip, ip_id, and the ip protocol) as well as a random seed. Reviewed by: jhb Security: FreeBSD-SA-18:10.ip Security: CVE-2018-6923	2018-08-14 17:15:47 +00:00
Michael Tuexen	0f1346f7f4	Remove a set but not used warning showing up in usrsctp.	2018-08-14 08:32:33 +00:00
Andrey V. Elsukov	62484790e0	Restore ability to send ICMP and ICMPv6 redirects. It was lost when tryforward appeared. Now ip[6]_tryforward will be enabled only when sending redirects for corresponding IP version is disabled via sysctl. Otherwise will be used default forwarding function. PR: 221137 Submitted by: mckay@ MFC after: 2 weeks	2018-08-14 07:54:14 +00:00
Michael Tuexen	839d21d62e	Use the stcb instead of the asoc in state macros. This is not a functional change. Just a preparation for upcoming dtrace state change provider support.	2018-08-13 13:58:45 +00:00
Michael Tuexen	61a2188021	Consistently use the macros to modify the assoc state. No functional change.	2018-08-13 11:56:21 +00:00
Michael Tuexen	812649d86f	Add explicit cast to silence a warning for the userland stack. Thanks to Felix Weinrank for providing the patch.	2018-08-12 14:05:15 +00:00
Devin Teske	ab9ed8a1bd	Fix misspellings of transmitter/transmitted Reviewed by: emaste, bcr Sponsored by: Smule, Inc. Differential Revision: https://reviews.freebsd.org/D16025	2018-08-10 20:37:32 +00:00
Andrey V. Elsukov	16bbf600d9	Remove unneeded ipsec-related includes. Reviewed by: rrs Differential Revision: https://reviews.freebsd.org/D16637	2018-08-10 07:24:01 +00:00
Leandro Lupori	c8e2123b6a	[ppc] Fix kernel panic when using BOOTP_NFSROOT On PowerPC (and possibly other architectures), that don't use EARLY_AP_STARTUP, the config task queue may be used uninitialized. This was observed while trying to mount the root fs from NFS, as reported here: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=230168. This patch has 2 main changes: 1- Perform a basic initialization of qgroup_config, similar to what is done in taskqgroup_adjust, but simpler. This makes qgroup_config ready to be used during NFS root mount. 2- When EARLY_AP_STARTUP is not used, call inm_init() and in6m_init() right before SI_SUB_ROOT_CONF, because bootp needs to send multicast packets to request an IP. PR: Bug 230168 Reported by: sbruno Reviewed by: jhibbits, mmacy, sbruno Approved by: jhibbits Differential Revision: D16633	2018-08-09 14:04:51 +00:00
Randall Stewart	d18ea344e6	Fix a small bug in rack where it will end up sending the FIN twice. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D16604	2018-08-08 13:36:49 +00:00
Jonathan T. Looney	95a914f631	Address concerns about CPU usage while doing TCP reassembly. Currently, the per-queue limit is a function of the receive buffer size and the MSS. In certain cases (such as connections with large receive buffers), the per-queue segment limit can be quite large. Because we process segments as a linked list, large queues may not perform acceptably. The better long-term solution is to make the queue more efficient. But, in the short-term, we can provide a way for a system administrator to set the maximum queue size. We set the default queue limit to 100. This is an effort to balance performance with a sane resource limit. Depending on their environment, goals, etc., an administrator may choose to modify this limit in either direction. Reviewed by: jhb Approved by: so Security: FreeBSD-SA-18:08.tcp Security: CVE-2018-6922	2018-08-06 17:36:57 +00:00
Randall Stewart	936b2b64ae	This fixes a bug in Rack where we were not properly using the correct value for Delayed Ack. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D16579	2018-08-06 09:22:07 +00:00
Gleb Smirnoff	cc7963191d	Now that after r335979 the kernel addresses in API structures are fixed size, there is no reason left for the unions. Discussed with: brooks	2018-08-04 00:03:21 +00:00
Michael Tuexen	7bda966394	Add a dtrace provider for UDP-Lite. The dtrace provider for UDP-Lite is modeled after the UDP provider. This fixes the bug that UDP-Lite packets were triggering the UDP provider. Thanks to dteske@ for providing the dwatch module. Reviewed by: dteske@, markj@, rrs@ Relnotes: yes Differential Revision: https://reviews.freebsd.org/D16377	2018-07-31 22:56:03 +00:00
Michael Tuexen	51e08d53ae	Fix INET only builds. r336940 introduced an "unused variable" warning on platforms which support INET, but not INET6, like MALTA and MALTA64 as reported by Mark Millard. Improve the #ifdefs to address this issue. Sponsored by: Netflix, Inc.	2018-07-31 06:27:05 +00:00
Michael Tuexen	888973f5ae	Allow implicit TCP connection setup for TCP/IPv6. TCP/IPv4 allows an implicit connection setup using sendto(), which is used for TTCP and TCP fast open. This patch adds support for TCP/IPv6. While there, improve some tests for detecting multicast addresses, which are mapped. Reviewed by: bz@, kbowling@, rrs@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D16458	2018-07-30 21:27:26 +00:00
Michael Tuexen	e2662978b8	Send consistent SEG.WIN when using timewait codepath for TCP. When sending TCP segments from the timewait code path, a stored value of the last sent window is used. Use the same code for computing this in the timewait code path as in the main code path used in tcp_output() to avoid inconsistencies. Reviewed by: rrs@ MFC after: 1 month Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D16503	2018-07-30 21:13:42 +00:00
Michael Tuexen	8db239dc6b	Fix some TCP fast open issues. The following issues are fixed: * Whenever a TCP server with TCP fast open enabled, calls accept(), recv(), send(), and close() before the TCP-ACK segment has been received, the TCP connection is just dropped and the reception of the TCP-ACK segment triggers the sending of a TCP-RST segment. * Whenever a TCP server with TCP fast open enabled, calls accept(), recv(), send(), send(), and close() before the TCP-ACK segment has been received, the first byte provided in the second send call is not transferred. * Whenever a TCP client with TCP fast open enabled calls sendto() followed by close() the TCP connection is just dropped. Reviewed by: jtl@, kbowling@, rrs@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D16485	2018-07-30 20:35:50 +00:00
Michael Tuexen	6138da62a9	Add missing send/recv dtrace probes for TCP. These missing probes are mostly in the syncache and timewait code. Reviewed by: markj@, rrs@ MFC after: 1 month Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D16369	2018-07-30 20:13:38 +00:00
Alan Somers	6040822c4e	Make timespecadd(3) and friends public The timespecadd(3) family of macros were imported from NetBSD back in r35029. However, they were initially guarded by #ifdef _KERNEL. In the meantime, we have grown at least 28 syscalls that use timespecs in some way, leading many programs both inside and outside of the base system to redefine those macros. It's better just to make the definitions public. Our kernel currently defines two-argument versions of timespecadd and timespecsub. NetBSD, OpenBSD, and FreeDesktop.org's libbsd, however, define three-argument versions. Solaris also defines a three-argument version, but only in its kernel. This revision changes our definition to match the common three-argument version. Bump _FreeBSD_version due to the breaking KPI change. Discussed with: cem, jilles, ian, bde Differential Revision: https://reviews.freebsd.org/D14725	2018-07-30 15:46:40 +00:00
Randall Stewart	4ad5b7a0ac	This fixes a hole where rack could end up sending an invalid segment into the reassembly queue. This would happen if you enabled the data after close option. Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D16453	2018-07-30 10:23:29 +00:00
Andrew Turner	1e0582fd55	icmp_quotelen was accidentally changed in r336676, undo this. Sponsored by: DARPA, AFRL	2018-07-24 16:45:01 +00:00
Andrew Turner	5f901c92a8	Use the new VNET_DEFINE_STATIC macro when we are defining static VNET variables. Reviewed by: bz Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D16147	2018-07-24 16:35:52 +00:00
Randall Stewart	399973c33d	Delete the example tcp stack "fastpath" which was only put in as an example. Sponsored by: Netflix inc. Differential Revision: https://reviews.freebsd.org/D16420	2018-07-24 14:55:47 +00:00
Matt Macy	e5e3e746fe	Fix a potential use after free in getsockopt() access to inp_options Discussed with: jhb Reviewed by: sbruno, transport MFC after: 2 weeks Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D14621	2018-07-22 20:02:14 +00:00
Matt Macy	2269988749	NULL out cc_data in pluggable TCP {cc}_cb_destroy When ABE was added (rS331214) to NewReno and leak fixed (rS333699), it now has a destructor (newreno_cb_destroy) for per connection state. Other congestion controls may allocate and free cc_data on entry and exit, but the field is never explicitly NULLed if moving back to NewReno which only internally allocates stateful data (no entry constructor) resulting in a situation where newreno_cb_destroy might be called on a junk pointer. - NULL out cc_data in the framework after calling {cc}_cb_destroy - free(9) checks for NULL so there is no need to perform not NULL checks before calling free. - Improve a comment about NewReno in tcp_ccalgounload This is the result of a debugging session from Jason Wolfe, Jason Eggleston, and mmacy@ and very helpful insight from lstewart@. Submitted by: Kevin Bowling Reviewed by: lstewart Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D16282	2018-07-22 05:37:58 +00:00
Michael Tuexen	34fc9072ce	Set the IPv4 version in the IP header for UDP and UDPLite.	2018-07-21 02:14:13 +00:00
Michael Tuexen	e1526d5a5b	Add missing dtrace probes for received UDP packets. Fire UDP receive probes when a packet is received and there is no endpoint consuming it. Fire the probe also if the TTL of the received packet is smaller than the minimum required by the endpoint. Clarify also in the man page, when the probe fires. Reviewed by: dteske@, markj@, rrs@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D16046	2018-07-20 15:32:20 +00:00
Michael Tuexen	0053ed28ff	Whitespace changes due to changes in ident.	2018-07-19 20:16:33 +00:00
Michael Tuexen	b0471b4b95	Revert https://svnweb.freebsd.org/changeset/base/336503 since I also ran the export script with different parameters.	2018-07-19 20:11:14 +00:00
Michael Tuexen	7679e49dd4	Whitespace changes due to change in ident.	2018-07-19 19:33:42 +00:00
Randall Stewart	8de9ac5eec	Bump the ICMP echo limits to match the RFC Reviewed by: tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D16333	2018-07-18 22:49:53 +00:00
Andrey V. Elsukov	acf673edf0	Move invoking of callout_stop(&lle->lle_timer) into llentry_free(). This deduplicates the code a bit, and also implicitly adds missing callout_stop() to in[6]_lltable_delete_entry() functions. PR: 209682, 225927 Submitted by: hselasky (previous version) MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D4605	2018-07-17 11:33:23 +00:00
Sean Bruno	c8b1bdc31c	There was quite a bit of feedback on r336282 that has led the submitter to want to revert it.	2018-07-14 23:53:51 +00:00
Sean Bruno	179a28b098	Fixup memory management for fetching options in ip_ctloutput() Submitted by: Jason Eggleston <jason@eggnet.com> Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D14621	2018-07-14 16:19:46 +00:00
Mark Johnston	aaf268f9f6	Remove a duplicate check. PR: 229663 Submitted by: David Binderman <dcb314@hotmail.com> MFC after: 3 days	2018-07-11 14:54:56 +00:00
Brooks Davis	3a20f06a1c	Use uintptr_t alone when assigning to kvaddr_t variables. Suggested by: jhb	2018-07-10 13:03:06 +00:00
Michael Tuexen	c9da58534d	Add support for printing the TCP FO client-side cookie cache via the sysctl interface. This is similar to the TCP host cache. Reviewed by: pkelsey@, kbowling@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D14554	2018-07-10 10:50:43 +00:00
Michael Tuexen	a026a53a76	Use appropriate MSS value when populating the TCP FO client cookie cache When a client receives a SYN-ACK segment with a TCP fast open cookie, but without an MSS option, an MSS value from uninitialised stack memory is used. This patch ensures that in case no MSS option is included in the SYN-ACK, the appropriate value as given in RFC 7413 is used. Reviewed by: kbowling@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D16175	2018-07-10 10:42:48 +00:00
Steven Hartland	65c3a353e6	Removed pointless NULL check Removed pointless NULL check after malloc with M_WAITOK which can never return NULL. Sponsored by: Multiplay	2018-07-10 08:05:32 +00:00
Andrey V. Elsukov	f7c4fdee1a	Add "record-state", "set-limit" and "defer-action" rule options to ipfw. "record-state" is similar to "keep-state", but it doesn't produce implicit O_PROBE_STATE opcode in a rule. "set-limit" is like "limit", but it has the same feature as "record-state", it is single opcode without implicit O_PROBE_STATE opcode. "defer-action" is targeted to be used with dynamic states. When rule with this opcode is matched, the rule's action will not be executed, instead dynamic state will be created. And when this state will be matched by "check-state", then rule action will be executed. This allows creating more complicated rulesets. Submitted by: lev MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D1776	2018-07-09 11:35:18 +00:00
Michael Tuexen	5f1347d7c9	Allow alternate TCP stack to populate the TCP FO client cookie cache. Without this patch, TCP FO could be used when using alternate TCP stack, but only existing entries in the TCP client cookie cache could be used. This cache was not populated by connections using alternate TCP stacks. Sponsored by: Netflix, Inc.	2018-07-07 12:28:16 +00:00
Michael Tuexen	c556884f8e	When initializing the TCP FO client cookie cache, take into account whether the TCP FO support is enabled or not for the client side. The code in tcp_fastopen_init() implicitly assumed that the sysctl variable V_tcp_fastopen_client_enable was initialized to 0. This was initially true, but was changed in r335610, which unmasked this bug. Thanks to Pieter de Goeje for reporting the issue on freebsd-net@	2018-07-07 11:18:26 +00:00
Brooks Davis	5c5e39e3d5	One more 32-bit fix for r335979. Reported by: tuexen	2018-07-06 13:34:45 +00:00
Brooks Davis	7524b4c14b	Correct breakage on 32-bit platforms from r335979.	2018-07-06 10:03:33 +00:00
Andrew Turner	2bf9501287	Create a new macro for static DPCPU data. On arm64 (and possible other architectures) we are unable to use static DPCPU data in kernel modules. This is because the compiler will generate PC-relative accesses, however the runtime-linker expects to be able to relocate these. In preparation to fix this create two macros depending on if the data is global or static. Reviewed by: bz, emaste, markj Sponsored by: ABT Systems Ltd Differential Revision: https://reviews.freebsd.org/D16140	2018-07-05 17:13:37 +00:00
Brooks Davis	f38b68ae8a	Make struct xinpcb and friends word-size independent. Replace size_t members with ksize_t (uint64_t) and pointer members (never used as pointers in userspace, but instead as unique identifiers) with kvaddr_t (uint64_t). This makes the structs identical between 32-bit and 64-bit ABIs. On 64-bit systems, the ABI is maintained. On 32-bit systems, this is an ABI breaking change. The ABI of most of these structs was previously broken in r315662. This also imposes a small API change on userspace consumers who must handle kernel pointers becoming virtual addresses. PR: 228301 (exp-run by antoine) Reviewed by: jtl, kib, rwatson (various versions) Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D15386	2018-07-05 13:13:48 +00:00
Hiroki Sato	5ba05d3d0e	- Fix a double unlock in inp_block_unblock_source() and lock leakage in inp_leave_group() which caused a panic. - Make order of CTR1() and IN_MULTI_LIST_LOCK() consistent around inm_merge().	2018-07-04 06:47:34 +00:00
Matt Macy	6573d7580b	epoch(9): allow preemptible epochs to compose - Add tracker argument to preemptible epochs - Inline epoch read path in kernel and tied modules - Change in_epoch to take an epoch as argument - Simplify tfb_tcp_do_segment to not take a ti_locked argument, there's no longer any benefit to dropping the pcbinfo lock and trying to do so just adds an error prone branchfest to these functions - Remove cases of same function recursion on the epoch as recursing is no longer free. - Remove the TAILQ_ENTRY and epoch_section from struct thread as the tracker field is now stack or heap allocated as appropriate. Tested by: pho and Limelight Networks Reviewed by: kbowling at llnw dot com Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D16066	2018-07-04 02:47:16 +00:00
Matt Macy	99208b820f	inpcb: don't gratuitously defer frees Don't defer frees in sysctl handlers. It isn't necessary and it just confuses things. revert: r333911, r334104, and r334125 Requested by: jtl	2018-07-02 05:19:44 +00:00
Kristof Provost	0d3d234cd1	carp: Set DSCP value CS7 Update carp to set DSCP value CS7(Network Traffic) in the flowlabel field of packets by default. Currently carp only sets TOS_LOWDELAY in IPv4 which was deprecated in 1998. This also implements sysctl that can revert carp back to its old behavior if desired. This will allow implementation of QOS on modern network devices to make sure carp packets aren't dropped during interface contention. Submitted by: Nick Wolff <darkfiberiru AT gmail.com> Reviewed by: kp, mav (earlier version) Differential Revision: https://reviews.freebsd.org/D14536	2018-07-01 08:37:07 +00:00
Andrey V. Elsukov	6e081509db	Add NULL pointer check. encap_lookup_t method can be invoked by IP encap subsystem even if none of gif/gre/me interfaces exist. Hash tables are allocated on demand, when first interface is created. So, make NULL pointer check before doing access to hash table. PR: 229378	2018-06-28 11:39:27 +00:00
Gleb Smirnoff	b8ab659396	Check the inp_flags under inp lock. Looks like the race was hidden before, the conversion of tcbinfo to CK_LIST has uncovered it.	2018-06-27 22:01:59 +00:00
Sean Bruno	af4da58655	Enable TCP_FASTOPEN by default for FreeBSD 12. Submitted by: kbowling Reviewed by: tuexen Differential Revision: https://reviews.freebsd.org/D15959	2018-06-24 21:46:29 +00:00
Sean Bruno	45fc0718d8	Reap unused variable and assignment that had no effect. Noted by cross compiling with gcc on mips. Reviewed by: mmacy	2018-06-24 21:36:37 +00:00
Gleb Smirnoff	a00f4ac22f	Revert r334843, and partially revert r335180. tcp_outflags[] were defined since 4BSD and are defined nowadays in all its descendants. Removing them breaks third party application.	2018-06-23 06:53:53 +00:00
Randall Stewart	581a046a8b	This adds in an optimization so that we only walk one time through the mbuf chain during copy and TSO limiting. It is used by both Rack and now the FreeBSD stack. Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D15937	2018-06-21 21:03:58 +00:00
Matt Macy	e93fdbe212	raw_ip: validate inp in both loops Continuation of r335497. Also move the lock acquisition up to validate before referencing inp_cred. Reported by: pho	2018-06-21 20:18:23 +00:00
Matt Macy	3d348772e7	in_pcblookup_hash: validate inp before return Post r335356 it is possible to have an inpcb on the hash lists that is partially torn down. Validate before using. Also as a side effect of this change the lock ordering issue between hash lock and inpcb no longer exists allowing some simplification. Reported by: pho@	2018-06-21 18:40:15 +00:00
Matt Macy	e5c331cf78	raw_ip: validate inp Post r335356 it is possible to have an inpcb on the hash lists that is partially torn down. Validate before using. Reported by: pho	2018-06-21 17:24:10 +00:00
Matt Macy	46374cbf54	udp_ctlinput: don't refer to unpcb after we drop the lock Reported by: pho@	2018-06-21 06:10:52 +00:00
Randall Stewart	c6f76759ca	Make sure that the t_peakrate_thr is not compiled in by default until NF can upstream it. Reviewed by: and suggested lstewart Sponsored by: Netflix Inc.	2018-06-19 11:20:28 +00:00
Randall Stewart	f923a734b3	Move the tp set back to where it was before we started playing with the VNET sets. This way we have verified the INP settings before we go to the trouble of de-referencing it. Reviewed by: and suggested by lstewart Sponsored by: Netflix Inc.	2018-06-19 05:28:14 +00:00
Matt Macy	9e58ff6ff9	convert inpcbinfo hash and info rwlocks to epoch + mutex - Convert inpcbinfo info & hash locks to epoch for read and mutex for write - Garbage collect code that handled INP_INFO_TRY_RLOCK failures as INP_INFO_RLOCK which can no longer fail When running 64 netperfs sending minimal sized packets on a 2x8x2 reduces unhalted core cycles samples in rwlock rlock/runlock in udp_send from 51% to 3%. Overall packet throughput rate limited by CPU affinity and NIC driver design choices. On the receiver unhalted core cycles samples in in_pcblookup_hash went from 13% to 1.6% Tested by LLNW and pho@ Reviewed by: jtl Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D15686	2018-06-19 01:54:00 +00:00
Randall Stewart	f994ead330	Move to using the inp->vnet pointer as suggested by lstewart. This is far better since the hpts system is using the inp as its basis anyway. Unfortunately his comments came late. Sponsored by: Netflix Inc.	2018-06-18 14:10:12 +00:00
Andrey V. Elsukov	20efcfc602	Switch RIB and RADIX_NODE_HEAD lock from rwlock(9) to rmlock(9). Using of rwlock with multiqueue NICs for IP forwarding on high pps produces high lock contention and is inefficient. Rmlock fits better for such workloads. Reviewed by: melifaro, olivier Obtained from: Yandex LLC Sponsored by: Yandex LLC Differential Revision: https://reviews.freebsd.org/D15789	2018-06-16 08:26:23 +00:00
Michael Tuexen	43b223f42e	When retransmitting TCP SYN-ACK segments with the TCP timestamp option enabled use an updated timestamp instead of reusing the one used in the initial TCP SYN-ACK segment. This patch ensures that an updated timestamp is used when sending the SYN-ACK from the syncache code. It was already done if the SYN-ACK was retransmitted from the generic code. This makes the behaviour consistent and also conformant with the TCP specification. Reviewed by: jtl@, Jason Eggleston MFC after: 1 month Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D15634	2018-06-15 12:28:43 +00:00
Gleb Smirnoff	9293873e83	TCPOUTFLAGS no longer exists since r334843.	2018-06-14 22:25:10 +00:00
Michael Tuexen	33ef123090	Provide the ip6_plen in network byte order when calling ip6_output(). This is not strictly required by ip6_output(), since it overrides it, but it is needed for upcoming dtrace support.	2018-06-14 21:30:52 +00:00
Michael Tuexen	8d86bd564f	Whitespace changes.	2018-06-14 21:22:14 +00:00
Andrey V. Elsukov	eb548a1a5c	In m_megapullup() use m_getjcl() to allocate 9k or 16k mbuf when requested. It is better to try allocate a big mbuf, than just silently drop a big packet. A better solution could be reworking of libalias modules to be able use m_copydata()/m_copyback() instead of requiring the single contiguous buffer. PR: 229006 MFC after: 1 week	2018-06-14 11:15:39 +00:00
Randall Stewart	4aec110f70	This fixes several bugs that Larry Rosenman helped me find in Rack with respect to its handling of TCP Fast Open. Several fixes all related to TFO are included in this commit: 1) Handling of non-TFO retransmissions 2) Building the proper send-map when we are doing TFO 3) Dealing with the ack that comes back that includes the SYN and data. It appears that with this commit TFO now works :-) Thanks Larry for all your help!! Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D15758	2018-06-14 03:27:42 +00:00
Matt Macy	feeef8509b	Fix PCBGROUPS build post CK conversion of pcbinfo	2018-06-13 23:19:54 +00:00
Andrey V. Elsukov	a5185adeb6	Rework if_gre(4) to use encap_lookup_t method to speedup lookup of needed interface when many gre interfaces are present. Remove rmlock from gre_softc, use epoch(9) and CK_LIST instead. Move more AF-related code into AF-related locations. Use hash table to speedup lookup of needed softc.	2018-06-13 11:11:33 +00:00
Matt Macy	483305b99c	Handle INP_FREED when looking up an inpcb When hash table lookups are not serialized with in_pcbfree it will be possible for callers to find an inpcb that has been marked free. We need to check for this and return NULL.	2018-06-13 04:23:49 +00:00
Randall Stewart	c9b4ac7587	This fixes missing VNET sets in the hpts system. Basically without this and running vnets with a TCP stack that uses some of the features is a recipe for panic (without this commit). Reported by: Larry Rosenman Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D15757	2018-06-12 23:54:08 +00:00
Matt Macy	700e893c34	Defer inpcbport free in in_pcbremlists as well	2018-06-12 23:26:25 +00:00
Matt Macy	f09ee4fc01	Defer inpcbport free until after a grace period has elapsed This is a dependency for inpcbinfo rlock conversion to epoch	2018-06-12 22:18:27 +00:00
Matt Macy	b872626dbe	mechanical CK macro conversion of inpcbinfo lists This is a dependency for converting the inpcbinfo hash and info rlocks to epoch.	2018-06-12 22:18:20 +00:00
Matt Macy	addf2b2009	Defer inpcb deletion until after a grace period has elapsed Deferring the actual free of the inpcb until after a grace period has elapsed will allow us to convert the inpcbinfo info and hash read locks to epoch. Reviewed by: gallatin, jtl Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D15510	2018-06-12 22:18:15 +00:00
Jonathan T. Looney	cff21e484b	Change RACK dependency on TCPHPTS from a build-time dependency to a load- time dependency. At present, RACK requires the TCPHPTS option to run. However, because modules can be moved from machine to machine, this dependency is really best assessed at load time rather than at build time. Reviewed by: rrs Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D15756	2018-06-11 14:27:19 +00:00
Matt Macy	3db28e6656	avoid 'tcp_outflags defined but not used'	2018-06-08 17:37:49 +00:00
Matt Macy	afbd6cfa72	hpts: remove redundant decl breaking gcc build	2018-06-08 17:37:43 +00:00
Randall Stewart	89e560f441	This commit brings in a new refactored TCP stack called Rack. Rack includes the following features: - A different SACK processing scheme (the old sack structures are not used). - RACK (Recent acknowledgment) where counting dup-acks is no longer done instead time is used to know when to retransmit. (see the I-D) - TLP (Tail Loss Probe) where we will probe for tail-losses to attempt to try not to take a retransmit time-out. (see the I-D) - Burst mitigation using TCPHPTS - PRR (Proportional Rate Reduction) see the RFC. Once built into your kernel, you can select this stack by either socket option with the name of the stack is "rack" or by setting the global sysctl so the default is rack. Note that any connection that does not support SACK will be kicked back to the "default" base FreeBSD stack (currently known as "default"). To build this into your kernel you will need to enable in your kernel: makeoptions WITH_EXTRA_TCP_STACKS=1 options TCPHPTS Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D15525	2018-06-07 18:18:13 +00:00
Michael Tuexen	ff34bbe9c2	Improve compliance with RFC 4895 and RFC 6458. Silently discard SCTP chunks which have been requested to be authenticated but are received unauthenticated no matter if support for SCTP authentication has been negotiated. This improves compliance with RFC 4895. When the application uses the SCTP_AUTH_CHUNK socket option to request a chunk to be received in an authenticated way, enable the SCTP authentication extension for the end-point. This improves compliance with RFC 6458. Discussed with: Peter Lei MFC after: 3 days	2018-06-06 19:27:06 +00:00
Sean Bruno	1a43cff92a	Load balance sockets with new SO_REUSEPORT_LB option. This patch adds a new socket option, SO_REUSEPORT_LB, which allow multiple programs or threads to bind to the same port and incoming connections will be load balanced using a hash function. Most of the code was copied from a similar patch for DragonflyBSD. However, in DragonflyBSD, load balancing is a global on/off setting and can not be set per socket. This patch allows for simultaneous use of both the current SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system. Required changes to structures: Globally change so_options from 16 to 32 bit value to allow for more options. Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets. Limitations: As DragonflyBSD, a load balance group is limited to 256 pcbs (256 programs or threads sharing the same socket). This is a substantially different contribution as compared to its original incarnation at svn r332894 and reverted at svn r332967. Thanks to rwatson@ for the substantive feedback that is included in this commit. Submitted by: Johannes Lundberg <johalun0@gmail.com> Obtained from: DragonflyBSD Relnotes: Yes Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D11003	2018-06-06 15:45:57 +00:00
Andrey V. Elsukov	590d0a43b6	Make in_delayed_cksum() be similar to IPv6 implementation. Use m_copyback() function to write checksum when it isn't located in the first mbuf of the chain. Handmade analog doesn't handle the case when parts of checksum are located in different mbufs. Also in case when mbuf is too short, m_copyback() will allocate new mbuf in the chain instead of making out of bounds write. Also wrap long line and remove now useless KASSERTs. X-MFC after: r334705	2018-06-06 13:01:53 +00:00
Tom Jones	1fdbfb909f	Use UDP len when calculating UDP checksums The length of the IP payload is normally equal to the UDP length, UDP Options (draft-ietf-tsvwg-udp-options-02) suggests using the difference between IP length and UDP length to create space for trailing data. Correct checksum length calculation to use the UDP length rather than the IP length when not offloading UDP checksums. Approved by: jtl (mentor) Differential Revision: https://reviews.freebsd.org/D15222	2018-06-06 07:04:40 +00:00
Andrey V. Elsukov	b941bc1d6e	Rework if_gif(4) to use new encap_lookup_t method to speedup lookup of needed interface when many gif interfaces are present. Remove rmlock from gif_softc, use epoch(9) and CK_LIST instead. Move more AF-related code into AF-related locations. Use hash table to speedup lookup of needed softc. Interfaces with GIF_IGNORE_SOURCE flag are stored in plain CK_LIST. Sysctl net.link.gif.parallel_tunnels is removed. The removal was planned 16 years ago, and actually it could work only for outbound direction. Each protocol, that can be handled by if_gif(4) interface is registered by separate encap handler, this helps avoid invoking the handler for unrelated protocols (GRE, PIM, etc.). This change allows dramatically improved performance when many gif(4) interfaces are used. Sponsored by: Yandex LLC	2018-06-05 21:24:59 +00:00
Andrey V. Elsukov	6d8fdfa9d5	Rework IP encapsulation handling code. Currently it has several disadvantages: - it uses single mutex to protect internal structures. It is used by data- and control- path, thus there is no parallelism at all. - it uses single list to keep encap handlers for both INET and INET6 families. - struct encaptab keeps unneeded information (src, dst, masks, protosw), that isn't used by code in the source tree. - matches are prioritized and when many tunneling interfaces are registered, encapcheck handler of each interface is invoked for each packet. The search takes O(n) for n interfaces. All this work is done with exclusive lock held. What this patch includes: - the datapath is converted to be lockless using epoch(9) KPI. - struct encaptab now linked using CK_LIST. - all unused fields removed from struct encaptab. Several new fields added: min_length is the minimum packet length, that encapsulation handler expects to see; exact_match is maximum number of bits, that can return an encapsulation handler, when it wants to consume a packet. - IPv6 and IPv4 handlers are stored in separate lists; - added new "encap_lookup_t" method, that will be used later. It is targeted to speedup lookup of needed interface, when gif(4)/gre(4) have many interfaces. - the need to use protosw structure is eliminated. The only pr_input method was used from this structure, so I don't see the need to keep using it. - encap_input_t method changed to avoid using mbuf tags to store softc pointer. Now it is passed directly through encap_input_t method. encap_getarg() function is removed. - all sockaddr structures and code that uses them removed. We don't have any code in the tree that uses them. All consumers use encap_attach_func() method, that relies on invoking of encapcheck() to determine the needed handler. - introduced struct encap_config, it contains parameters of encap handler that is going to be registered by encap_attach() function. 
- encap handlers are stored in lists ordered by exact_match value, thus handlers that need more bits to match will be checked first, and if encapcheck method returns exact_match value, the search will be stopped. - all current consumers changed to use new KPI. Reviewed by: mmacy Sponsored by: Yandex LLC Differential Revision: https://reviews.freebsd.org/D15617	2018-06-05 20:51:01 +00:00
Mateusz Guzik	34c538c356	malloc: try to use builtins for zeroing at the callsite Plenty of allocation sites pass M_ZERO and sizes which are small and known at compilation time. Handling them internally in malloc loses this information and results in avoidable calls to memset. Instead, let the compiler take the advantage of it whenever possible. Discussed with: jeff	2018-06-02 22:20:09 +00:00
Michael Tuexen	13500cbb61	Don't overflow a buffer if we receive an INIT or INIT-ACK chunk without a RANDOM parameter but with a CHUNKS or HMAC-ALGO parameter. Please note that sending this combination violates the specification. Thanks to Ronald E. Crane for reporting the issue for the userland stack. MFC after: 3 days	2018-06-02 16:28:10 +00:00
Michael Tuexen	c14f9fe5ef	Limit the retransmission timer for SYN-ACKs by TCPTV_REXMTMAX. Use the same logic to handle the SYN-ACK retransmission when sent from the syn cache code as when sent from the main code. MFC after: 3 days Sponsored by: Netflix, Inc.	2018-06-01 21:24:27 +00:00
Michael Tuexen	badef00d58	Ensure net.inet.tcp.syncache.rexmtlimit is limited by TCP_MAXRXTSHIFT. If the sysctl variable is set to a value larger than TCP_MAXRXTSHIFT+1, the array tcp_syn_backoff[] is accessed out of bounds. Discussed with: jtl@ MFC after: 3 days Sponsored by: Netflix, Inc.	2018-06-01 19:58:19 +00:00
Andrey V. Elsukov	12e7376216	Remove empty encap_init() function. MFC after: 2 weeks	2018-05-29 12:32:08 +00:00
Michael Tuexen	eef8d4a973	Use correct mask. Introduced in https://svnweb.freebsd.org/changeset/base/333603. Thanks to Irene Ruengler for testing and reporting the issue. MFC after: 1 week X-MFC-with: 333603	2018-05-28 13:31:47 +00:00
Matt Macy	62d733a11d	in_pcbladdr: remove debug code that snuck in with ifa epoch conversion r334118	2018-05-27 06:47:09 +00:00
Matt Macy	0f8d79d977	CK: update consumers to use CK macros across the board r334189 changed the fields to have names distinct from those in queue.h in order to expose the oversights as compile time errors	2018-05-24 23:21:23 +00:00
Matt Macy	fe524329e4	convert allocations to INVARIANTS M_ZERO	2018-05-24 01:04:56 +00:00
Matt Macy	4f6c66cc9c	UDP: further performance improvements on tx Cumulative throughput while running 64 netperf -H $DUT -t UDP_STREAM -- -m 1 on a 2x8x2 SKL went from 1.1Mpps to 2.5Mpps Single stream throughput increases from 910kpps to 1.18Mpps Baseline: https://people.freebsd.org/~mmacy/2018.05.11/udpsender2.svg - Protect read access to global ifnet list with epoch https://people.freebsd.org/~mmacy/2018.05.11/udpsender3.svg - Protect short lived ifaddr references with epoch https://people.freebsd.org/~mmacy/2018.05.11/udpsender4.svg - Convert if_afdata read lock path to epoch https://people.freebsd.org/~mmacy/2018.05.11/udpsender5.svg A fix for the inpcbhash contention is pending sufficient time on a canary at LLNW. Reviewed by: gallatin Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D15409	2018-05-23 21:02:14 +00:00
Matt Macy	630ba2c514	udp: assign flowid to udp sockets round-robin On a 2x8x2 SKL this increases measured throughput with 64 netperf -H $DUT -t UDP_STREAM -- -m 1 from 590kpps to 1.1Mpps before: https://people.freebsd.org/~mmacy/2018.05.11/udpsender.svg after: https://people.freebsd.org/~mmacy/2018.05.11/udpsender2.svg	2018-05-23 20:50:09 +00:00
Matt Macy	246a619924	epoch: allow for conditionally asserting that the epoch context fields are unused by zeroing on INVARIANTS builds	2018-05-23 17:00:05 +00:00
Mark Johnston	9f78e2b83d	Initialize the dumper struct before calling set_dumper(). Fields owned by the generic code were being left uninitialized, causing problems in clear_dumper() if an error occurred. Coverity CID: 1391200 X-MFC with: r333283	2018-05-22 16:01:56 +00:00
Matt Macy	f42a83f2a6	inpcb: revert deferred inpcb free pending further review	2018-05-21 16:13:43 +00:00
Michael Tuexen	c692df45fc	Only fill in data structure when actually stored.	2018-05-21 14:53:22 +00:00
Michael Tuexen	d3132db2b5	Do the appropriate accounting when ip_output() fails.	2018-05-21 14:52:18 +00:00
Michael Tuexen	95844fce7d	Make clear why there is an assignment, which is not necessary.	2018-05-21 14:51:20 +00:00
Ed Maste	3b9b6b1704	Pair CURVNET_SET and CURVNET_RESTORE in a block Per vnet(9), CURVNET_SET and CURVNET_RESTORE cannot be used as a single statement for a conditional and CURVNET_RESTORE must be in the same block as CURVNET_SET (or a subblock). Reviewed by: andrew Sponsored by: The FreeBSD Foundation	2018-05-21 13:08:44 +00:00
Ed Maste	15f8acc53f	Revert r333968, it broke all archs but i386 and amd64	2018-05-21 11:56:07 +00:00
Matt Macy	ed6bb714b2	in(6)_mcast: Expand out vnet set / restore macro so that they work in a conditional block Reported by: zec at fer.hr	2018-05-21 08:34:10 +00:00
Matt Macy	06b15160e1	ensure that vnet is set when doing in_leavegroup	2018-05-21 07:12:06 +00:00
Matt Macy	1a3d880c26	in(s)_moptions: free before tearing down inpcb	2018-05-20 20:08:21 +00:00

6389 Commits