freebsd-skq

Author	SHA1	Message	Date
Michael Tuexen	35dfb8cb68	Ensure that TCP state changes to state-closing are reported via dtrace. This does not cover state changes from TIME-WAIT. Reviewed by: gnn MFC after: 3 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D8443	2016-11-19 14:45:08 +00:00
Andrey V. Elsukov	8432fa5fd9	Initialize ip6 pointer before use. PR: 214169 MFC after: 1 week	2016-11-06 02:33:04 +00:00
Michael Tuexen	3e1465754f	Make ICMPv6 hard error handling for TCP consistent with the ICMPv4 handling. Ensure that: * Protocol unreachable errors are handled by indicating ECONNREFUSED to the TCP user for both IPv4 and IPv6. These were ignored for IPv6. * Communication prohibited errors are handled by indicating ECONNREFUSED to the TCP user for both IPv4 and IPv6. These were ignored for IPv6. * Hop Limited exceeded errors are handled by indicating EHOSTUNREACH to the TCP user for both IPv4 and IPv6. For IPv6 the TCP connected was dropped but errno wasn't set. Reviewed by: gallatin, rrs MFC after: 1 month Sponsored by: Netflix Differential Revision: 7904	2016-10-21 10:32:57 +00:00
Jonathan T. Looney	82676a28eb	r307082 added the TCP_HHOOK kernel option and made some existing code only compile when that option is configured. In tcp_destroy(), the error variable is now only used in code enclosed in an '#ifdef TCP_HHOOK' block. This broke the build for VNET images. Enclose the error variable itself in an #ifdef block. Submitted by: Shawn Webb <shawn.webb at hardenedbsd.org> Reported by: Shawn Webb <shawn.webb at hardenedbsd.org> PointyHat to: jtl	2016-10-15 00:29:15 +00:00
Jonathan T. Looney	bd79708dbf	In the TCP stack, the hhook(9) framework provides hooks for kernel modules to add actions that run when a TCP frame is sent or received on a TCP session in the ESTABLISHED state. In the base tree, this functionality is only used for the h_ertt module, which is used by the cc_cdg, cc_chd, cc_hd, and cc_vegas congestion control modules. Presently, we incur overhead to check for hooks each time a TCP frame is sent or received on an ESTABLISHED TCP session. This change adds a new compile-time option (TCP_HHOOK) to determine whether to include the hhook(9) framework for TCP. To retain backwards compatibility, I added the TCP_HHOOK option to every configuration file that already defined "options INET". (Therefore, this patch introduces no functional change. In order to see a functional difference, you need to compile a custom kernel without the TCP_HHOOK option.) This change will allow users to easily exclude this functionality from their kernel, should they wish to do so. Note that any users who use a custom kernel configuration and use one of the congestion control modules listed above will need to add the TCP_HHOOK option to their kernel configuration. Reviewed by: rrs, lstewart, hiren (previous version), sjg (makefiles only) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D8185	2016-10-12 02:16:42 +00:00
Jonathan T. Looney	3ac125068a	Remove "long" variables from the TCP stack (not including the modular congestion control framework). Reviewed by: gnn, lstewart (partial) Sponsored by: Juniper Networks, Netflix Differential Revision: (multiple) Tested by: Limelight, Netflix	2016-10-06 16:28:34 +00:00
Randall Stewart	587d67c008	Here we update the modular tcp to be able to switch to an alternate TCP stack in other then the closed state (pre-listen/connect). The idea is that if that is supported by the alternate stack, it is asked if its ok to switch. If it approves the "handoff" then we allow the switch to happen. Also the fini() function now gets a flag to tell if you are switching away or the tcb is destroyed. The init() call into the alternate stack is moved to the end so the tcb is more fully formed before the init transpires. Sponsored by: Netflix Inc. Differential Revision: D6790	2016-08-16 15:11:46 +00:00
Andrew Gallatin	d4c22202e6	Rework IPV6 TCP path MTU discovery to match IPv4 - Re-write tcp_ctlinput6() to closely mimic the IPv4 tcp_ctlinput() - Now that tcp_ctlinput6() updates t_maxseg, we can allow ip6_output() to send TCP packets without looking at the tcp host cache for every single transmit. - Make the icmp6 code mimic the IPv4 code & avoid returning PRC_HOSTDEAD because it is so expensive. Without these changes in place, every TCP6 pmtu discovery or host unreachable ICMP resulted in a call to in6_pcbnotify() which walks the tcbinfo table with the write lock held. Because the tcbinfo table is shared between IPv4 and IPv6, this causes huge scalabilty issues on servers with lots of (~100K) TCP connections, to the point where even a small percent of IPv6 traffic had a disproportionate impact on overall throughput. Reviewed by: bz, rrs, ae (all earlier versions), lstewart (in Netflix's tree) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D7272	2016-08-01 17:02:21 +00:00
Andrew Gallatin	0e3b891988	Call tcp_notify() directly to shoot down routes, rather than calling in_pcbnotifyall(). This avoids lock contention on tcbinfo due to in_pcbnotifyall() holding the tcbinfo write lock while walking all connections. Reviewed by: rrs, karels MFC after: 2 weeks Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D7251	2016-07-28 19:32:25 +00:00
Jonathan T. Looney	24b9bb5614	The TCPPCAP debugging feature caches recently-used mbufs for use in debugging TCP connections. This commit provides a mechanism to free those mbufs when the system is under memory pressure. Because this will result in lost debugging information, the behavior is controllable by a sysctl. The default setting is to free the mbufs. Reviewed by: gnn Approved by: re (gjb) Differential Revision: https://reviews.freebsd.org/D6931 Input from: novice_techie.com	2016-07-06 16:17:13 +00:00
Bjoern A. Zeeb	42644eb88e	Try to avoid a 2nd conditional by re-writing the loop, pause, and escape clause another time. Submitted by: jhb Approved by: re (gjb) MFC after: 12 days	2016-06-23 21:32:52 +00:00
Bjoern A. Zeeb	361279a4fb	In VNET TCP teardown Do not sleep unconditionally but only if we have any TCP connections left. Submitted by: zec Approved by: re (hrs) MFC after: 13 days	2016-06-23 11:55:15 +00:00
Bjoern A. Zeeb	b54e08e11a	Check the V_tcbinfo.ipi_count to hit 0 before doing the full TCP cleanup. That way timers can finish cleanly and we do not gamble with a DELAY(). Reviewed by: gnn, jtl Approved by: re (gjb) Obtained from: projects/vnet MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D6923	2016-06-23 00:34:03 +00:00
Bjoern A. Zeeb	3f58662dd9	The pr_destroy field does not allow us to run the teardown code in a specific order. VNET_SYSUNINITs however are doing exactly that. Thus remove the VIMAGE conditional field from the domain(9) protosw structure and replace it with VNET_SYSUNINITs. This also allows us to change some order and to make the teardown functions file local static. Also convert divert(4) as it uses the same mechanism ip(4) and ip6(4) use internally. Slightly reshuffle the SI_SUB_* fields in kernel.h and add a new ones, e.g., for pfil consumers (firewalls), partially for this commit and for others to come. Reviewed by: gnn, tuexen (sctp), jhb (kernel.h) Obtained from: projects/vnet MFC after: 2 weeks X-MFC: do not remove pr_destroy Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D6652	2016-06-01 10:14:04 +00:00
John Baldwin	052a5418e8	Don't reuse the source mbuf in tcp_respond() if it is not writable. Not all mbufs passed up from device drivers are M_WRITABLE(). In particular, the Chelsio T4/T5 driver uses a feature called "buffer packing" to receive multiple frames in a single receive buffer. The mbufs for these frames all share the same external storage so are treated as read-only by the rest of the stack when multiple frames are in flight. Previously tcp_respond() would blindly overwrite read-only mbufs when INVARIANTS was disabled or panic with an assertion failure if INVARIANTS was enabled. Note that the new case is a bit of a mix of the two other cases in tcp_respond(). The TCP and IP headers must be copied explicitly into the new mbuf instead of being inherited (similar to the m == NULL case), but the addresses and ports must be swapped in the reply (similar to the m != NULL case). Reviewed by: glebius	2016-05-26 18:35:37 +00:00
Gleb Smirnoff	f59d975e10	Tiny refactor of r294869/r296881: use defines to mask the VNET() macro. Suggested by: bz	2016-05-17 23:14:17 +00:00
Pedro F. Giffuni	a4641f4eaa	sys/net*: minor spelling fixes. No functional change.	2016-05-03 18:05:43 +00:00
Randall Stewart	e5ad64562a	This cleans up the timers code in TCP to start using the new async_drain functionality. This as been tested in NF as well as by Verisign. Still to do in here is to remove all the old flags. They are currently left being maintained but probably are no longer needed. Sponsored by: Netflix Inc. Differential Revision: http://reviews.freebsd.org/D5924	2016-04-28 13:27:12 +00:00
Bjoern A. Zeeb	806929d514	Mfp: r296310,r296343 It looks like as with the safety belt of DELAY() fastened () we can completely tear down and free all memory for TCP (after r281599). () in theory a few ticks should be good enough to make sure the timers are all really gone. Could we use a better matric here and check a tcbcb count as an optimization? PR: 164763 Reviewed by: gnn, emaste MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D5734	2016-04-09 12:05:23 +00:00
Bjoern A. Zeeb	8586a9635f	Mfp: r296260 The tcp_inpcb (pcbinfo) zone should be safe to destroy. PR: 164763 Reviewed by: gnn MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D5732	2016-04-09 11:27:47 +00:00
Bjoern A. Zeeb	f254aeda60	Mfp: r296259 We attach the "counter" to the tcpcbs. Thus don't free the TCP Fastopen zone before the tcpcbs are gone, as otherwise the zone won't be empty. With that it should be safe to destroy the "tfo" zone without leaking the memory. PR: 164763 Reviewed by: gnn MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D5731	2016-04-09 10:58:08 +00:00
Edward Tomasz Napierala	35030a5dd4	Remove some NULL checks for M_WAITOK allocations. MFC after: 1 month Sponsored by: The FreeBSD Foundation	2016-03-29 13:56:59 +00:00
Bjoern A. Zeeb	4f321dbd1c	Fix compile errors after r297225: - properly V_irtualise variable access unbreaking VIMAGE kernels. - remove the volatile from the function return type to make architecture using gcc happy [-Wreturn-type] "type qualifiers ignored on function return type" I am not entirely happy with this solution putting the u_int there but it will do for now.	2016-03-24 11:40:10 +00:00
George V. Neville-Neil	84cc0778d0	FreeBSD previously provided route caching for TCP (and UDP). Re-add route caching for TCP, with some improvements. In particular, invalidate the route cache if a new route is added, which might be a better match. The cache is automatically invalidated if the old route is deleted. Submitted by: Mike Karels Reviewed by: gnn Differential Revision: https://reviews.freebsd.org/D4306	2016-03-24 07:54:56 +00:00
Gleb Smirnoff	bf840a1707	Redo r294869. The array of counters for TCP states doesn't belong to struct tcpstat, because the structure can be zeroed out by netstat(1) -z, and of course running connection counts shouldn't be touched. Place running connection counts into separate array, and provide separate read-only sysctl oid for it.	2016-03-15 00:15:10 +00:00
Jonathan T. Looney	737d4f6c93	As reported on the transport@ and current@ mailing lists, the FreeBSD TCP stack is not compliant with RFC 7323, which requires that TCP stacks send a timestamp option on all packets (except, optionally, RSTs) after the session is established. This patch adds that support. It also adds a TCP signature option to the packet, if appropriate. PR: 206047 Differential Revision: https://reviews.freebsd.org/D4808 Reviewed by: hiren MFC after: 2 weeks Sponsored by: Juniper Networks	2016-03-07 15:00:34 +00:00
Jonathan T. Looney	9cbade8feb	Some cleanup in tcp_respond() in preparation for another change: - Reorder variables by size - Move initializer closer to where it is used - Remove unneeded variable Differential Revision: https://reviews.freebsd.org/D4808 Reviewed by: hiren MFC after: 2 weeks Sponsored by: Juniper Networks	2016-03-07 14:59:49 +00:00
George V. Neville-Neil	e79cb051d5	Fix dtrace probes (introduced in 287759): debug__input was used for output and drop; connect didn't always fire a user probe some probes were missing in fastpath Submitted by: Hannes Mehnert Sponsored by: REMS, EPSRC Differential Revision: https://reviews.freebsd.org/D5525	2016-03-03 17:46:38 +00:00
Bryan Drewery	6971a63795	Fix build after r29592.	2016-02-23 21:21:47 +00:00
Randall Stewart	6e0efc6a39	This fixes the fastpath code to have a better module initialization sequence when included in loader.conf. It also fixes it so that no matter if some one incorrectly specifies a load order, the lists and such will be initialized on demand at that time so no one can make that mistake. Reviewed by: hiren Differential Revision: D5189	2016-02-23 17:53:39 +00:00
Gleb Smirnoff	4644fda3f7	Rename netinet/tcp_cc.h to netinet/cc/cc.h. Discussed with: lstewart	2016-01-27 17:59:39 +00:00
Gleb Smirnoff	75dd79d937	Grab a snap amount of TCP connections in syncache from tcpstat.	2016-01-27 00:48:05 +00:00
Gleb Smirnoff	57a78e3bae	Augment struct tcpstat with tcps_states[], which is used for book-keeping the amount of TCP connections by state. Provides a cheap way to get connection count without traversing the whole pcb list. Sponsored by: Netflix	2016-01-27 00:45:46 +00:00
Hiren Panchasara	0645c6049d	Persist timers TCPTV_PERSMIN and TCPTV_PERSMAX are hardcoded with 5 seconds and 60 seconds, respectively. Turn them into sysctls that can be tuned live. The default values of 5 seconds and 60 seconds have been retained. Submitted by: Jason Wolfe (j at nitrology dot com) Reviewed by: gnn, rrs, hiren, bz MFC after: 1 week Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D5024	2016-01-26 16:33:38 +00:00
Alexander V. Chernikov	0d6a516eb8	Convert TCP mtu checks to the new routing KPI.	2016-01-25 10:06:49 +00:00
Gleb Smirnoff	2de3e790f5	- Rename cc.h to more meaningful tcp_cc.h. - Declare it a kernel only include, which it already is. - Don't include tcp.h implicitly from tcp_cc.h	2016-01-21 22:34:51 +00:00
Alexander V. Chernikov	ea8d14925c	Remove sys/eventhandler.h from net/route.h Reviewed by: ae	2016-01-09 09:34:39 +00:00
Gleb Smirnoff	0c39d38d21	Historically we have two fields in tcpcb to describe sender MSS: t_maxopd, and t_maxseg. This dualism emerged with T/TCP, but was not properly cleaned up after T/TCP removal. After all permutations over the years the result is that t_maxopd stores a minimum of peer offered MSS and MTU reduced by minimum protocol header. And t_maxseg stores (t_maxopd - TCPOLEN_TSTAMP_APPA) if timestamps are in action, or is equal to t_maxopd otherwise. That's a very rough estimate of MSS reduced by options length. Throughout the code it was used in places, where preciseness was not important, like cwnd or ssthresh calculations. With this change: - t_maxopd goes away. - t_maxseg now stores MSS not adjusted by options. - new function tcp_maxseg() is provided, that calculates MSS reduced by options length. The functions gives a better estimate, since it takes into account SACK state as well. Reviewed by: jtl Differential Revision: https://reviews.freebsd.org/D3593	2016-01-07 00:14:42 +00:00
Patrick Kelsey	281a0fd4f9	Implementation of server-side TCP Fast Open (TFO) [RFC7413]. TFO is disabled by default in the kernel build. See the top comment in sys/netinet/tcp_fastopen.c for implementation particulars. Reviewed by: gnn, jch, stas MFC after: 3 days Sponsored by: Verisign, Inc. Differential Revision: https://reviews.freebsd.org/D4350	2015-12-24 19:09:48 +00:00
Bjoern A. Zeeb	616bc4f476	If bootverbose is enabled every vnet startup and virtual interface creation will print extra lines on the console. We are generally not interested in this (repeated) information for each VNET. Thus only print it for the default VNET. Virtual interfaces on the base system will remain printing information, but e.g. each loopback in each vnet will no longer cause a "bpf attached" line. Sponsored by: The FreeBSD Foundation MFC after: 2 weeks Reviewed by: gnn Differential Revision: https://reviews.freebsd.org/D4531	2015-12-22 15:00:04 +00:00
Jonathan T. Looney	c54a41def7	Fix a panic when launching VNETs after the commit of r292309. Differential Revision: https://reviews.freebsd.org/D4645 Reviewed by: rrs Reported by: kp Tested by: kp Sponsored by: Juniper Networks	2015-12-22 13:41:50 +00:00
Randall Stewart	55bceb1e2b	First cut of the modularization of our TCP stack. Still to do is to clean up the timer handling using the async-drain. Other optimizations may be coming to go with this. Whats here will allow differnet tcp implementations (one included). Reviewed by: jtl, hiren, transports Sponsored by: Netflix Inc. Differential Revision: D4055	2015-12-16 00:56:45 +00:00
George V. Neville-Neil	26882b4239	Turning on IPSEC used to introduce a slight amount of performance degradation (7%) for host host TCP connections over 10Gbps links, even when there were no secuirty policies in place. There is no change in performance on 1Gbps network links. Testing GENERIC vs. GENERIC-NOIPSEC vs. GENERIC with this change shows that the new code removes any overhead introduced by having IPSEC always in the kernel. Differential Revision: D3993 MFC after: 1 month Sponsored by: Rubicon Communications (Netgate)	2015-10-27 00:42:15 +00:00
Hiren Panchasara	86a996e6bd	There are times when it would be really nice to have a record of the last few packets and/or state transitions from each TCP socket. That would help with narrowing down certain problems we see in the field that are hard to reproduce without understanding the history of how we got into a certain state. This change provides just that. It saves copies of the last N packets in a list in the tcpcb. When the tcpcb is destroyed, the list is freed. I thought this was likely to be more performance-friendly than saving copies of the tcpcb. Plus, with the packets, you should be able to reverse-engineer what happened to the tcpcb. To enable the feature, you will need to compile a kernel with the TCPPCAP option. Even then, the feature defaults to being deactivated. You can activate it by setting a positive value for the number of captured packets. You can do that on either a global basis or on a per-socket basis (via a setsockopt call). There is no way to get the packets out of the kernel other than using kmem or getting a coredump. I thought that would help some of the legal/privacy concerns regarding such a feature. However, it should be possible to add a future effort to export them in PCAP format. I tested this at low scale, and found that there were no mbuf leaks and the peak mbuf usage appeared to be unchanged with and without the feature. The main performance concern I can envision is the number of mbufs that would be used on systems with a large number of sockets. If you save five packets per direction per socket and have 3,000 sockets, that will consume at least 30,000 mbufs just to keep these packets. I tried to reduce the concerns associated with this by limiting the number of clusters (not mbufs) that could be used for this feature. Again, in my testing, that appears to work correctly. Differential Revision: D3100 Submitted by: Jonathan Looney <jlooney at juniper dot net> Reviewed by: gnn, hiren	2015-10-14 00:35:37 +00:00
Gleb Smirnoff	794ac42374	When processing ICMP need frag message, ignore the suggested MTU unless it is smaller than the current one for this connection. This is behavior specified by RFC 1191, and this is how original BSD stack behaved, but this was unintentionally regressed in r182851. Reported & tested by: Richard Russo <russor whatsapp.com> Differential Revision: D3567 Sponsored by: Nginx, Inc.	2015-09-30 03:37:37 +00:00
Gleb Smirnoff	399fbd0ec0	Use proper byteswap macro. This isn't a functional change.	2015-09-17 17:27:49 +00:00
Gleb Smirnoff	db642c8e6e	In tcp_ctlinput() separate the (ip == NULL) block from the rest of the function to reduce so many levels of indentation. Style the lines that got now indentation reduced. No functional change. Checked with: md5	2015-09-16 21:42:33 +00:00
George V. Neville-Neil	5d06879adb	dd DTrace probe points, translators and a corresponding script to provide the TCPDEBUG functionality with pure DTrace. Reviewed by: rwatson MFC after: 2 weeks Sponsored by: Limelight Networks Differential Revision: D3530	2015-09-13 15:50:55 +00:00
Gleb Smirnoff	24067db8ca	Make tcp_mtudisc() static and void. No functional changes. Sponsored by: Nginx, Inc.	2015-09-04 12:02:12 +00:00
Julien Charbon	079672cb07	Fix a kernel assertion issue introduced with r286227: Avoid too strict INP_INFO_RLOCK_ASSERT checks due to tcp_notify() being called from in6_pcbnotify(). Reported by: Larry Rosenman <ler@lerctr.org> Submitted by: markj, jch	2015-08-08 08:40:36 +00:00

1 2 3 4 5 ...

471 Commits