freebsd-dev

Author	SHA1	Message	Date
Richard Scheffenegger	d7ca3f780d	Reduce default TCP delayed ACK timeout to 40ms. Reviewed by: kbowling, tuexen Approved by: tuexen (mentor) MFC after: 2 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D23281	2020-04-16 15:59:23 +00:00
Randall Stewart	481be5de9d	White space cleanup -- remove trailing tab's or spaces from any line. Sponsored by: Netflix Inc.	2020-02-12 13:31:36 +00:00
Bjoern A. Zeeb	334fc5822b	vnet: virtualise more network stack sysctls. Virtualise tcp_always_keepalive, TCP and UDP log_in_vain. All three are set in the netoptions startup script, which we would love to run for VNETs as well [1]. While virtualising the log_in_vain sysctls seems pointles at first for as long as the kernel message buffer is not virtualised, it at least allows an administrator to debug the base system or an individual jail if needed without turning the logging on for all jails running on a system. PR: 243193 [1] MFC after: 2 weeks	2020-01-08 23:30:26 +00:00
Michael Tuexen	0999766ddf	Add sysctl variable net.inet.tcp.rexmit_initial for setting RTO.Initial used by TCP. Reviewed by: rrs@, 0mp@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D19355	2019-03-23 21:36:59 +00:00
Michael Tuexen	3b853844d7	Reduce the TCP initial retransmission timeout from 3 seconds to 1 second as allowed by RFC 6298. Reviewed by: kbowling@, Richard Scheffenegger Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D18941	2019-02-20 18:03:43 +00:00
Michael Tuexen	c6dcb64b18	Use exponential backoff for retransmitting SYN segments as specified in the TCP RFCs. Reviewed by: rrs@, Richard Scheffenegger Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D18974	2019-02-20 17:56:38 +00:00
Matt Macy	6573d7580b	epoch(9): allow preemptible epochs to compose - Add tracker argument to preemptible epochs - Inline epoch read path in kernel and tied modules - Change in_epoch to take an epoch as argument - Simplify tfb_tcp_do_segment to not take a ti_locked argument, there's no longer any benefit to dropping the pcbinfo lock and trying to do so just adds an error prone branchfest to these functions - Remove cases of same function recursion on the epoch as recursing is no longer free. - Remove the the TAILQ_ENTRY and epoch_section from struct thread as the tracker field is now stack or heap allocated as appropriate. Tested by: pho and Limelight Networks Reviewed by: kbowling at llnw dot com Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D16066	2018-07-04 02:47:16 +00:00
Randall Stewart	89e560f441	This commit brings in a new refactored TCP stack called Rack. Rack includes the following features: - A different SACK processing scheme (the old sack structures are not used). - RACK (Recent acknowledgment) where counting dup-acks is no longer done instead time is used to knwo when to retransmit. (see the I-D) - TLP (Tail Loss Probe) where we will probe for tail-losses to attempt to try not to take a retransmit time-out. (see the I-D) - Burst mitigation using TCPHTPS - PRR (partial rate reduction) see the RFC. Once built into your kernel, you can select this stack by either socket option with the name of the stack is "rack" or by setting the global sysctl so the default is rack. Note that any connection that does not support SACK will be kicked back to the "default" base FreeBSD stack (currently known as "default"). To build this into your kernel you will need to enable in your kernel: makeoptions WITH_EXTRA_TCP_STACKS=1 options TCPHPTS Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D15525	2018-06-07 18:18:13 +00:00
John Baldwin	f17985319d	Export tcp_always_keepalive for use by the Chelsio TOM module. This used to work by accident with ld.bfd even though always_keepalive was marked as static. LLD honors static more correctly, so export this variable properly (including moving it into the tcp_* namespace). Reviewed by: bz, emaste MFC after: 2 weeks Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D14129	2018-01-30 23:01:37 +00:00
Pedro F. Giffuni	51369649b0	sys: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 3-Clause license. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, superceed or replace the license texts. Special thanks to Wind River for providing access to "The Duke of Highlander" tool: an older (2014) run over FreeBSD tree was useful as a starting point.	2017-11-20 19:43:44 +00:00
Gleb Smirnoff	e29c55e4bb	Declare pmtud_blackhole global variables in tcp_timer.h, so that alternative TCP stacks can legally use them.	2017-10-06 20:33:40 +00:00
Gleb Smirnoff	cc65eb4e79	Hide struct inpcb, struct tcpcb from the userland. This is a painful change, but it is needed. On the one hand, we avoid modifying them, and this slows down some ideas, on the other hand we still eventually modify them and tools like netstat(1) never work on next version of FreeBSD. We maintain a ton of spares in them, and we already got some ifdef hell at the end of tcpcb. Details: - Hide struct inpcb, struct tcpcb under _KERNEL \|\| _WANT_FOO. - Make struct xinpcb, struct xtcpcb pure API structures, not including kernel structures inpcb and tcpcb inside. Export into these structures the fields from inpcb and tcpcb that are known to be used, and put there a ton of spare space. - Make kernel and userland utilities compilable after these changes. - Bump __FreeBSD_version. Reviewed by: rrs, gnn Differential Revision: D10018	2017-03-21 06:39:49 +00:00
Warner Losh	fbbd9655e5	Renumber copyright clause 4 Renumber cluase 4 to 3, per what everybody else did when BSD granted them permission to remove clause 3. My insistance on keeping the same numbering for legal reasons is too pedantic, so give up on that point. Submitted by: Jan Schaumann <jschauma@stevens.edu> Pull Request: https://github.com/freebsd/freebsd/pull/96	2017-02-28 23:42:47 +00:00
Ryan Stone	5ede40dcf2	Don't zero out srtt after excess retransmits If the TCP stack has retransmitted more than 1/4 of the total number of retransmits before a connection drop, it decides that its current RTT estimate is hopelessly out of date and decides to recalculate it from scratch starting with the next ACK. Unfortunately, it implements this by zeroing out the current RTT estimate. Drop this hack entirely, as it makes it significantly more difficult to debug connection issues. Instead check for excessive retransmits at the point where srtt is updated from an ACK being received. If we've exceeded 1/4 of the maximum retransmits, discard the previous srtt estimate and replace it with the latest rtt measurement. Differential Revision: https://reviews.freebsd.org/D9519 Reviewed by: gnn Sponsored by: Dell EMC Isilon	2017-02-11 17:05:08 +00:00
Randall Stewart	b07fef500b	This cleans up the timer code in TCP and also makes it so we do not take the INFO lock unless we are really going to delete the TCB. Differential Revision: D7136	2016-08-16 12:40:56 +00:00
Randall Stewart	e5ad64562a	This cleans up the timers code in TCP to start using the new async_drain functionality. This as been tested in NF as well as by Verisign. Still to do in here is to remove all the old flags. They are currently left being maintained but probably are no longer needed. Sponsored by: Netflix Inc. Differential Revision: http://reviews.freebsd.org/D5924	2016-04-28 13:27:12 +00:00
Hiren Panchasara	0645c6049d	Persist timers TCPTV_PERSMIN and TCPTV_PERSMAX are hardcoded with 5 seconds and 60 seconds, respectively. Turn them into sysctls that can be tuned live. The default values of 5 seconds and 60 seconds have been retained. Submitted by: Jason Wolfe (j at nitrology dot com) Reviewed by: gnn, rrs, hiren, bz MFC after: 1 week Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D5024	2016-01-26 16:33:38 +00:00
Julien Charbon	d6de19ac2f	Put r284245 back in place: If at first this fix was seen as a temporary workaround for a callout(9) issue, it turns out it is instead the right way to use callout in mpsafe mode without using callout_drain(). r284245 commit message: Fix a callout race condition introduced in TCP timers callouts with r281599. In TCP timer context, it is not enough to check callout_stop() return value to decide if a callout is still running or not, previous callout_reset() return values have also to be checked. Differential Revision: https://reviews.freebsd.org/D2763	2015-08-30 13:44:39 +00:00
Julien Charbon	bcf9b91395	Revert r284245: "Fix a callout race condition introduced in TCP timers callouts with r281599." r281599 fixed a TCP timer race condition, but due a callout(9) bug it also introduced another race condition workaround-ed with r284245. The callout(9) bug being fixed with r286880, we can now revert the workaround (r284245). Differential Revision: https://reviews.freebsd.org/D2079 (Initial change) Differential Revision: https://reviews.freebsd.org/D2763 (Workaround) Differential Revision: https://reviews.freebsd.org/D3078 (Fix) Sponsored by: Verisign, Inc. MFC after: 2 weeks	2015-08-24 09:30:27 +00:00
Hiren Panchasara	ad389a8c3b	Remove unused TCPTV_SRTTDFLT. We initialize srtt with TCPTV_SRTTBASE when we don't have any rtt estimate. Differential Revision: D3334 Sponsored by: Limelight Networks	2015-08-12 16:08:37 +00:00
Julien Charbon	cad814eedf	Fix a callout race condition introduced in TCP timers callouts with r281599. In TCP timer context, it is not enough to check callout_stop() return value to decide if a callout is still running or not, previous callout_reset() return values have also to be checked. Differential Revision: https://reviews.freebsd.org/D2763 Reviewed by: hiren Approved by: hiren MFC after: 1 day Sponsored by: Verisign, Inc.	2015-06-10 20:43:07 +00:00
Julien Charbon	5571f9cf81	Fix an old and well-documented use-after-free race condition in TCP timers: - Add a reference from tcpcb to its inpcb - Defer tcpcb deletion until TCP timers have finished Differential Revision: https://reviews.freebsd.org/D2079 Submitted by: jch, Marc De La Gueronniere <mdelagueronniere@verisign.com> Reviewed by: imp, rrs, adrian, jhb, bz Approved by: jhb Sponsored by: Verisign, Inc.	2015-04-16 10:00:06 +00:00
Julien Charbon	cea40c4888	Fix a race condition in TCP timewait between tcp_tw_2msl_reuse() and tcp_tw_2msl_scan(). This race condition drives unplanned timewait timeout cancellation. Also simplify implementation by holding inpcb reference and removing tcptw reference counting. Differential Revision: https://reviews.freebsd.org/D826 Submitted by: Marc De la Gueronniere <mdelagueronniere@verisign.com> Submitted by: jch Reviewed By: jhb (mentor), adrian, rwatson Sponsored by: Verisign, Inc. MFC after: 2 weeks X-MFC-With: r264321	2014-10-30 08:53:56 +00:00
John Baldwin	66eefb1eae	Currently, the TCP slow timer can starve TCP input processing while it walks the list of connections in TIME_WAIT closing expired connections due to contention on the global TCP pcbinfo lock. To remediate, introduce a new global lock to protect the list of connections in TIME_WAIT. Only acquire the TCP pcbinfo lock when closing an expired connection. This limits the window of time when TCP input processing is stopped to the amount of time needed to close a single connection. Submitted by: Julien Charbon <jcharbon@verisign.com> Reviewed by: rwatson, rrs, adrian MFC after: 2 months	2014-04-10 18:15:35 +00:00
Andre Oppermann	13feab8286	Add DELACK to list of timers. MFC after: 1 week	2012-11-27 19:07:28 +00:00
Andre Oppermann	8d045dbdf3	Define the delayed ACK timeout value directly as hz/10 instead of obfuscating it by going through PR_FASTHZ. No functional change. MFC after: 2 weeks	2012-10-29 12:17:02 +00:00
Andre Oppermann	024fd5b6bb	For retransmits of SYN\|ACK from the syncache use the slightly more aggressive special tcp_syn_backoff[] retransmit schedule instead of the normal tcp_backoff[] schedule for established connections. MFC after: 2 weeks	2012-10-28 19:02:07 +00:00
Gleb Smirnoff	9077f38738	Add new socket options: TCP_KEEPINIT, TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT, that allow to control initial timeout, idle time, idle re-send interval and idle send count on a per-socket basis. Reviewed by: andre, bz, lstewart	2012-02-05 16:53:02 +00:00
Andre Oppermann	1c18314d17	Remove the TCP inflight bandwidth limiter as announced in r211315 to give way for the pluggable congestion control framework. It is the task of the congestion control algorithm to set the congestion window and amount of inflight data without external interference. In 'struct tcpcb' the variables previously used by the inflight limiter are renamed to spares to keep the ABI intact and to have some more space for future extensions. In 'struct tcp_info' the variable 'tcpi_snd_bwnd' is not removed to preserve the ABI. It is always set to 0. In siftr.c in 'struct pkt_node' the variable 'snd_bwnd' is not removed to preserve the ABI. It is always set to 0. These unused variable in the various structures may be reused in the future or garbage collected before the next release or at some other point when an ABI change happens anyway for other reasons. No MFC is planned. The inflight bandwidth limiter stays disabled by default in the other branches but remains available.	2010-09-16 21:06:45 +00:00
Mike Silbersack	b8614722ff	Add the ability to see TCP timers via netstat -x. This can be a useful feature when you have a seemingly stuck socket and want to figure out why it has not been closed yet. No plans to MFC this, as it changes the netstat sysctl ABI. Reviewed by: andre, rwatson, Eric Van Gyzen	2009-09-16 05:33:15 +00:00
Mike Silbersack	e2f2059f68	Two changes: - Reintegrate the ANSI C function declaration change from tcp_timer.c rev 1.92 - Reorganize the tcpcb structure so that it has a single pointer to the "tcp_timer" structure which contains all of the tcp timer callouts. This change means that when the single tcp timer change is reintegrated, tcpcb will not change in size, and therefore the ABI between netstat and the kernel will not change. Neither of these changes should have any functional impact. Reviewed by: bmah, rrs Approved by: re (bmah)	2007-09-24 05:26:24 +00:00
Robert Watson	85d9437250	Back out tcp_timer.c:1.93 and associated changes that reimplemented the many TCP timers as a single timer, but retain the API changes necessary to reintroduce this change. This will back out the source of at least two reported problems: lock leaks in certain timer edge cases, and TCP timers continuing to fire after a connection has closed (a bug previously fixed and then reintroduced with the timer rewrite). In a follow-up commit, some minor restylings and comment changes performed after the TCP timer rewrite will be reapplied, and a further change to allow the TCP timer rewrite to be added back without disturbing the ABI. The new design is believed to be a good thing, but the outstanding issues are leading to significant stability/correctness problems that are holding up 7.0. This patch was generated by silby, but is being committed by proxy due to poor network connectivity for silby this week. Approved by: re (kensmith) Submitted by: silby Tested by: rwatson, kris Problems reported by: peter, kris, others	2007-09-07 09:19:22 +00:00
Peter Wemm	c4a184bdc4	Change TCPTV_MIN to be independent of HZ. While it was documented to be in ticks "for algorithm stability" when originally committed, it turns out that it has a significant impact in timing out connections. When we changed HZ from 100 to 1000, this had a big effect on reducing the time before dropping connections. To demonstrate, boot with kern.hz=100. ssh to a box on local ethernet and establish a reliable round-trip-time (ie: type a few commands). Then unplug the ethernet and press a key. Time how long it takes to drop the connection. The old behavior (with hz=100) caused the connection to typically drop between 90 and 110 seconds of getting no response. Now boot with kern.hz=1000 (default). The same test causes the ssh session to drop after just 9-10 seconds. This is a big deal on a wifi connection. With kern.hz=1000, change sysctl net.inet.tcp.rexmit_min from 3 to 30. Note how it behaves the same as when HZ was 100. Also, note that when booting with hz=100, net.inet.tcp.rexmit_min used to be 30. This commit changes TCPTV_MIN to be scaled with hz. rexmit_min should always be about 30. If you set hz to Really Slow(TM), there is a safety feature to prevent a value of 0 being used. This may be revised in the future, but for the time being, it restores the old, pre-hz=1000 behavior, which is significantly less annoying. As a workaround, to avoid rebooting or rebuilding a kernel, you can run "sysctl net.inet.tcp.rexmit_min=30" and add "net.inet.tcp.rexmit_min=30" to /etc/sysctl.conf. This is safe to run from 6.0 onwards. Approved by: re (rwatson) Reviewed by: andre, silby	2007-07-31 22:11:55 +00:00
Andre Oppermann	abb91d889a	Remove now unused stuff forgotten in the previous commit.	2007-05-16 17:55:22 +00:00
Andre Oppermann	2104448fe7	Move TIME_WAIT related functions and timer handling from files other than repo copied tcp_subr.c into tcp_timewait.c#1.284: tcp_input.c#1.350 tcp_timewait() -> tcp_twcheck() tcp_timer.c#1.92 tcp_timer_2msl_reset() -> tcp_tw_2msl_reset() tcp_timer.c#1.92 tcp_timer_2msl_stop() -> tcp_tw_2msl_stop() tcp_timer.c#1.92 tcp_timer_2msl_tw() -> tcp_tw_2msl_scan() This is a mechanical move with appropriate renames and making them static if used only locally. The tcp_tw_2msl_scan() cleanup function is still run from the tcp_slowtimo() in tcp_timer.c.	2007-05-16 17:14:25 +00:00
Ruslan Ermilov	7480de4305	Make "struct tcp_timer" visible only to the kernel, and unbreak world.	2007-04-11 14:08:42 +00:00
Andre Oppermann	b8152ba793	Change the TCP timer system from using the callout system five times directly to a merged model where only one callout, the next to fire, is registered. Instead of callout_reset(9) and callout_stop(9) the new function tcp_timer_activate() is used which then internally manages the callout. The single new callout is a mutex callout on inpcb simplifying the locking a bit. tcp_timer() is the called function which handles all race conditions in one place and then dispatches the individual timer functions. Reviewed by: rwatson (earlier version)	2007-04-11 09:45:16 +00:00
Mohan Srinivasan	7c72af8770	Reap FIN_WAIT_2 connections marked SOCANTRCVMORE faster. This mitigate potential issues where the peer does not close, potentially leaving thousands of connections in FIN_WAIT_2. This is controlled by a new sysctl fast_finwait2_recycle, which is disabled by default. Reviewed by: gnn, silby.	2007-02-26 22:25:21 +00:00
John-Mark Gurney	4dc630cdd2	if min is greater than max, prefer max over min... I managed to get a retransmit timer that was going to take 19 days to trigger... Reviewed by: silby	2006-09-25 07:22:39 +00:00
Ruslan Ermilov	751dea2935	Back when we had T/TCP support, we used to apply different timeouts for TCP and T/TCP connections in the TIME_WAIT state, and we had two separate timed wait queues for them. Now that is has gone, the timeout is always 2*MSL again, and there is no reason to keep two queues (the first was unused anyway!). Also, reimplement the remaining queue using a TAILQ (it was technically impossible before, with two queues).	2006-09-07 13:06:00 +00:00
Mohan Srinivasan	464469c713	Fixes an edge case bug in timewait handling where ticks rolling over causing the timewait expiry to be exactly 0 corrupts the timewait queues (and that entry). Reviewed by: silby	2006-08-11 21:15:23 +00:00
Andre Oppermann	eaf80179e2	Have TCP Inflight disable itself if the RTT is below a certain threshold. Inflight doesn't make sense on a LAN as it has trouble figuring out the maximal bandwidth because of the coarse tick granularity. The sysctl net.inet.tcp.inflight.rttthresh specifies the threshold in milliseconds below which inflight will disengage. It defaults to 10ms. Tested by: Joao Barros <joao.barros-at-gmail.com>, Rich Murphey <rich-at-whiteoaklabs.com> Sponsored by: TCP/IP Optimization Fundraise 2005	2006-02-16 19:38:07 +00:00
Warner Losh	c398230b64	/* -> /*- for license, minor formatting changes	2005-01-07 01:45:51 +00:00
Robert Watson	a4f757cd5d	White space cleanup for netinet before branch: - Trailing tab/space cleanup - Remove spurious spaces between or before tabs This change avoids touching files that Andre likely has in his working set for PFIL hooks changes for IPFW/DUMMYNET. Approved by: re (scottl) Submitted by: Xin LI <delphij@frontfree.net>	2004-08-16 18:32:07 +00:00
Warner Losh	f36cfd49ad	Remove advertising clause from University of California Regent's license, per letter dated July 22, 1999 and email from Peter Wemm, Alan Cox and Robert Watson. Approved by: core, peter, alc, rwatson	2004-04-07 20:46:16 +00:00
Jonathan Lemon	607b0b0cc9	Remove a panic(); if the zone allocator can't provide more timewait structures, reuse the oldest one. Also move the expiry timer from a per-structure callout to the tcp slow timer. Sponsored by: DARPA, NAI Labs	2003-03-08 22:06:20 +00:00
Jonathan Lemon	340c35de6a	Add a TCP TIMEWAIT state which uses less space than a fullblown TCP control block. Allow the socket and tcpcb structures to be freed earlier than inpcb. Update code to understand an inp w/o a socket. Reviewed by: hsu, silby, jayanth Sponsored by: DARPA, NAI Labs	2003-02-19 22:32:43 +00:00
Alfred Perlstein	e88894d39a	make the strings for tcptimers, tanames and prurequests const to silence warnings.	2002-08-16 09:07:59 +00:00
Matthew Dillon	701bec5a38	Introduce two new sysctl's: net.inet.tcp.rexmit_min (default 3 ticks equiv) This sysctl is the retransmit timer RTO minimum, specified in milliseconds. This value is designed for algorithmic stability only. net.inet.tcp.rexmit_slop (default 200ms) This sysctl is the retransmit timer RTO slop which is added to every retransmit timeout and is designed to handle protocol stack overheads and delayed ack issues. Note that the original code applied a 1-second RTO minimum but never applied real slop to the RTO calculation, so any RTO calculation over one second would have no slop and thus not account for protocol stack overheads (TCP timestamps are not a measure of protocol turnaround!). Essentially, the original code made the RTO calculation almost completely irrelevant. Please note that the 200ms slop is debateable. This commit is not meant to be a line in the sand, and if the community winds up deciding that increasing it is the correct solution then it's easy to do. Note that larger values will destroy performance on lossy networks while smaller values may result in a greater number of unnecessary retransmits.	2002-07-18 19:06:12 +00:00
Matthew Dillon	22fd54d461	I don't know how the minimum retransmit timeout managed to get set to one second but it badly breaks throughput on networks with minor packet loss. Complaints by: at least two people tracked down to this. MFC after: 3 days	2002-07-17 23:32:03 +00:00

1 2

69 Commits