freebsd-dev

Author	SHA1	Message	Date
Randall Stewart	ac7bd23a7a	lets put (void) in a couple of functions to keep older platforms that are stuck with gcc happy (ppc). The changes are needed in both bbr and rack. Obtained from: Michael Tuexen (mtuexen@)	2019-09-24 20:36:43 +00:00
Randall Stewart	da99b33b17	don't call in_ratelmit detach when RATELIMIT is not compiled in the kernel.	2019-09-24 20:11:55 +00:00
Randall Stewart	35c7bb3407	This commit adds BBR (Bottleneck Bandwidth and RTT) congestion control. This is a completely separate TCP stack (tcp_bbr.ko) that will be built only if you add the make options WITH_EXTRA_TCP_STACKS=1 and also include the option TCPHPTS. You can also include the RATELIMIT option if you have a NIC interface that supports hardware pacing, BBR understands how to use such a feature. Note that this commit also adds in a general purpose time-filter which allows you to have a min-filter or max-filter. A filter allows you to have a low (or high) value for some period of time and degrade slowly to another value has time passes. You can find out the details of BBR by looking at the original paper at: https://queue.acm.org/detail.cfm?id=3022184 or consult many other web resources you can find on the web referenced by "BBR congestion control". It should be noted that BBRv1 (which this is) does tend to unfairness in cases of small buffered paths, and it will usually get less bandwidth in the case of large BDP paths(when competing with new-reno or cubic flows). BBR is still an active research area and we do plan on implementing V2 of BBR to see if it is an improvement over V1. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D21582	2019-09-24 18:18:11 +00:00
Michael Tuexen	dd3121a895	When the RACK stack computes the space for user data in a TCP segment, it wasn't taking the IP level options into account. This patch fixes this. In addition, it also corrects a KASSERT and adds protection code to assure that the IP header chain and the TCP head fit in the first fragment as required by RFC 7112. Reviewed by: rrs@ MFC after: 3 days Sponsored by: Nertflix, Inc. Differential Revision: https://reviews.freebsd.org/D21666	2019-09-19 10:27:47 +00:00
Michael Tuexen	6d261981ee	Only update SACK/DSACK lists when a non-empty segment was received. This fixes hitting a KASSERT with a valid packet exchange. Reviewed by: rrs@, Richard Scheffenegger MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D21567	2019-09-09 16:07:47 +00:00
Randall Stewart	e57b2d0e51	This adds the final tweaks to LRO that will now allow me to add BBR. These changes make it so you can get an array of timestamps instead of a compressed ack/data segment. BBR uses this to aid with its delivery estimates. We also now (via Drew's suggestions) will not go to the expense of the tcb lookup if no stack registers to want this feature. If HPTS is not present the feature is not present either and you just get the compressed behavior. Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D21127	2019-09-06 14:25:41 +00:00
Michael Tuexen	191ae5bfa9	Fix two TCP RACK issues: * Convert the TCP delayed ACK timer from ms to ticks as required. This fixes the timer on platforms with hz != 1000. * Don't delay acknowledgements which report duplicate data using DSACKs. Reviewed by: rrs@ MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D21512	2019-09-03 19:48:02 +00:00
Michael Tuexen	fe5dee73f7	This patch improves the DSACK handling to conform with RFC 2883. The lowest SACK block is used when multiple Blocks would be elegible as DSACK blocks ACK blocks get reordered - while maintaining the ordering of SACK blocks not relevant in the DSACK context is maintained. Reviewed by: rrs@, tuexen@ Obtained from: Richard Scheffenegger MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D21038	2019-09-02 19:04:02 +00:00
John Baldwin	b2e60773c6	Add kernel-side support for in-kernel TLS. KTLS adds support for in-kernel framing and encryption of Transport Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports offload of TLS for transmitted data. Key negotation must still be performed in userland. Once completed, transmit session keys for a connection are provided to the kernel via a new TCP_TXTLS_ENABLE socket option. All subsequent data transmitted on the socket is placed into TLS frames and encrypted using the supplied keys. Any data written to a KTLS-enabled socket via write(2), aio_write(2), or sendfile(2) is assumed to be application data and is encoded in TLS frames with an application data type. Individual records can be sent with a custom type (e.g. handshake messages) via sendmsg(2) with a new control message (TLS_SET_RECORD_TYPE) specifying the record type. At present, rekeying is not supported though the in-kernel framework should support rekeying. KTLS makes use of the recently added unmapped mbufs to store TLS frames in the socket buffer. Each TLS frame is described by a single ext_pgs mbuf. The ext_pgs structure contains the header of the TLS record (and trailer for encrypted records) as well as references to the associated TLS session. KTLS supports two primary methods of encrypting TLS frames: software TLS and ifnet TLS. Software TLS marks mbufs holding socket data as not ready via M_NOTREADY similar to sendfile(2) when TLS framing information is added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then called to schedule TLS frames for encryption. In the case of sendfile_iodone() calls ktls_enqueue() instead of pru_ready() leaving the mbufs marked M_NOTREADY until encryption is completed. For other writes (vn_sendfile when pages are available, write(2), etc.), the PRUS_NOTREADY is set when invoking pru_send() along with invoking ktls_enqueue(). A pool of worker threads (the "KTLS" kernel process) encrypts TLS frames queued via ktls_enqueue(). Each TLS frame is temporarily mapped using the direct map and passed to a software encryption backend to perform the actual encryption. (Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if someone wished to make this work on architectures without a direct map.) KTLS supports pluggable software encryption backends. Internally, Netflix uses proprietary pure-software backends. This commit includes a simple backend in a new ktls_ocf.ko module that uses the kernel's OpenCrypto framework to provide AES-GCM encryption of TLS frames. As a result, software TLS is now a bit of a misnomer as it can make use of hardware crypto accelerators. Once software encryption has finished, the TLS frame mbufs are marked ready via pru_ready(). At this point, the encrypted data appears as regular payload to the TCP stack stored in unmapped mbufs. ifnet TLS permits a NIC to offload the TLS encryption and TCP segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS) is allocated on the interface a socket is routed over and associated with a TLS session. TLS records for a TLS session using ifnet TLS are not marked M_NOTREADY but are passed down the stack unencrypted. The ip_output_send() and ip6_output_send() helper functions that apply send tags to outbound IP packets verify that the send tag of the TLS record matches the outbound interface. If so, the packet is tagged with the TLS send tag and sent to the interface. The NIC device driver must recognize packets with the TLS send tag and schedule them for TLS encryption and TCP segmentation. If the the outbound interface does not match the interface in the TLS send tag, the packet is dropped. In addition, a task is scheduled to refresh the TLS send tag for the TLS session. If a new TLS send tag cannot be allocated, the connection is dropped. If a new TLS send tag is allocated, however, subsequent packets will be tagged with the correct TLS send tag. (This latter case has been tested by configuring both ports of a Chelsio T6 in a lagg and failing over from one port to another. As the connections migrated to the new port, new TLS send tags were allocated for the new port and connections resumed without being dropped.) ifnet TLS can be enabled and disabled on supported network interfaces via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported across both vlan devices and lagg interfaces using failover, lacp with flowid enabled, or lacp with flowid enabled. Applications may request the current KTLS mode of a connection via a new TCP_TXTLS_MODE socket option. They can also use this socket option to toggle between software and ifnet TLS modes. In addition, a testing tool is available in tools/tools/switch_tls. This is modeled on tcpdrop and uses similar syntax. However, instead of dropping connections, -s is used to force KTLS connections to switch to software TLS and -i is used to switch to ifnet TLS. Various sysctls and counters are available under the kern.ipc.tls sysctl node. The kern.ipc.tls.enable node must be set to true to enable KTLS (it is off by default). The use of unmapped mbufs must also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS. KTLS is enabled via the KERN_TLS kernel option. This patch is the culmination of years of work by several folks including Scott Long and Randall Stewart for the original design and implementation; Drew Gallatin for several optimizations including the use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records awaiting software encryption, and pluggable software crypto backends; and John Baldwin for modifications to support hardware TLS offload. Reviewed by: gallatin, hselasky, rrs Obtained from: Netflix Sponsored by: Netflix, Chelsio Communications Differential Revision: https://reviews.freebsd.org/D21277	2019-08-27 00:01:56 +00:00
Randall Stewart	e13ad86c67	Fix an issue when TSO and Rack play together. Basically an retransmission of the initial SYN (with data) would cause us to strip the SYN and decrement/increase offset/len which then caused us a -1 offset and a panic. Reported by: Larry Rosenman (Michael Tuexen helped me debug this at the IETF)	2019-08-21 10:45:28 +00:00
Randall Stewart	23fa2dbc06	Place back in the dependency on HPTS via module depends versus a fatal error in compiling. This was taken out by mistake when I mis-merged from the 18q22p2 sources of rack in NF. Opps. Reported by: sbruno	2019-08-13 12:41:15 +00:00
Michael Tuexen	e4a5561e01	Fix compilation on platforms using gcc. When compiling RACK on platforms using gcc, a warning that tcp_outflags is defined but not used is issued and terminates compilation on PPC64, for example. So don't indicate that tcp_outflags is used. Reviewed by: rrs@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D20971	2019-07-16 17:54:20 +00:00
Randall Stewart	e5926fd368	This is the second in a number of patches needed to get BBRv1 into the tree. This fixes the DSACK bug but is also needed by BBR. We have yet to go two more one will be for the pacing code (tcp_ratelimit.c) and the second will be for the new updated LRO code that allows a transport to know the arrival times of packets and (tcp_lro.c). After that we should finally be able to get BBRv1 into head. Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D20908	2019-07-14 16:05:47 +00:00
Randall Stewart	55f795883f	add back the comment around the pending DSACK fixes.	2019-07-12 11:45:42 +00:00
Randall Stewart	1cf999a5f3	Update to jhb's other suggestion, use #error when we are missing HPTS.	2019-07-11 04:40:58 +00:00
Randall Stewart	9cf3c235c0	Update copyright per JBH's suggestions.. thanks.	2019-07-11 04:38:33 +00:00
Randall Stewart	3b0b41e613	This commit updates rack to what is basically being used at NF as well as sets in some of the groundwork for committing BBR. The hpts system is updated as well as some other needed utilities for the entrance of BBR. This is actually part 1 of 3 more needed commits which will finally complete with BBRv1 being added as a new tcp stack. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D20834	2019-07-10 20:40:39 +00:00
Jonathan T. Looney	5e02b277a4	Add the ability to limit how much the code will fragment the RACK send map in response to SACKs. The default behavior is unchanged; however, the limit can be activated by changing the new net.inet.tcp.rack.split_limit sysctl. Submitted by: Peter Lei <peterlei@netflix.com> Reported by: jtl Reviewed by: lstewart (earlier version) Security: CVE-2019-5599	2019-06-19 13:55:00 +00:00
Michael Tuexen	d1156b0505	r347382 added receiver side DSACK support for the TCP base stack. The corresponding changes for the RACK stack where missed and are added by this commit. Reviewed by: Richard Scheffenegger, rrs@ MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D20372	2019-06-06 07:49:03 +00:00
Michael Tuexen	c6dcb64b18	Use exponential backoff for retransmitting SYN segments as specified in the TCP RFCs. Reviewed by: rrs@, Richard Scheffenegger Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D18974	2019-02-20 17:56:38 +00:00
Warner Losh	52467047aa	Regularize the Netflix copyright Use recent best practices for Copyright form at the top of the license: 1. Remove all the All Rights Reserved clauses on our stuff. Where we piggybacked others, use a separate line to make things clear. 2. Use "Netflix, Inc." everywhere. 3. Use a single line for the copyright for grep friendliness. 4. Use date ranges in all places for our stuff. Approved by: Netflix Legal (who gave me the form), adrian@ (pmc files)	2019-02-04 21:28:25 +00:00
Michael Tuexen	116ef4d6e7	When handling SYN-ACK segments in the SYN-RCVD state, set tp->snd_wnd consistently. This inconsistency was observed when working on the bug reported in PR 235256, although it does not fix the reported issue. The fix for the PR will be a separate commit. PR: 235256 Reviewed by: rrs@, Richard Scheffenegger MFC after: 3 days Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D19033	2019-02-01 12:33:00 +00:00
Michael Tuexen	bf7fcdb18a	Fix the detection of ECN-setup SYN-ACK packets. RFC 3168 defines an ECN-setup SYN-ACK packet as on with the ECE flags set and the CWR flags not set. The code was only checking if ECE flag is set. This patch adds the check to verify that the CWR flags is not set. Submitted by: Richard Scheffenegger Reviewed by: tuexen@ MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D18996	2019-01-28 12:45:31 +00:00
Michael Tuexen	7dc90a1de0	Fix a bug in the restart window computation of TCP New Reno When implementing support for IW10, an update in the computation of the restart window used after an idle phase was missed. To minimize code duplication, implement the logic in tcp_compute_initwnd() and call it. This fixes a bug in NewReno, which was not aware of IW10. Submitted by: Richard Scheffenegger Reviewed by: tuexen@ MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D18940	2019-01-25 13:57:09 +00:00
Michael Tuexen	ad2be38941	A TCP stack is required to check SEG.ACK first, when processing a segment in the SYN-SENT state as stated in Section 3.9 of RFC 793, page 66. Ensure this is also done by the TCP RACK stack. Reviewed by: rrs@ MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D18034	2018-11-22 20:05:57 +00:00
Michael Tuexen	fef56019e9	Ensure that the TCP RACK stack honours the setting of the net.inet.tcp.drop_synfin sysctl-variable. Reviewed by: rrs@ MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D18033	2018-11-22 20:02:39 +00:00
Michael Tuexen	7e729f0787	Ensure that the default RTT stack can make an RTT measurement if the TCP connection was initiated using the RACK stack, but the peer does not support the TCP RACK extension. This ensures that the TCP behaviour on the wire is the same if the TCP connection is initated using the RACK stack or the default stack. Reviewed by: rrs@ MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D18032	2018-11-22 19:56:52 +00:00
Michael Tuexen	794107181a	Ensure that TCP RST-segments announce consistently a receiver window of zero. This was already done when sending them via tcp_respond(). Reviewed by: rrs@ MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D17949	2018-11-22 19:49:52 +00:00
Michael Tuexen	3bea9a2664	Improve two KASSERTs in the TCP RACK stack. There are two locations where an always true comparison was made in a KASSERT. Replace this by an appropriate check and use a consistent panic message. Also use this code when checking a similar condition. PR: 229664 Reviewed by: rrs@ MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D18021	2018-11-21 18:19:15 +00:00
Michael Tuexen	a8a8a8a808	Fix TCP Fast Open for the TCP RACK stack. * Fix a bug where the SYN handling during established state was applied to a front state. * Move a check for retransmission after the timer handling. This was suppressing timer based retransmissions. * Fix an off-by one byte in the sequence number of retransmissions. * Apply fixes corresponding to https://svnweb.freebsd.org/changeset/base/336934 Reviewed by: rrs@ Approved by: re (kib@) MFC after: 1 month Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D16912	2018-09-12 10:27:58 +00:00
Randall Stewart	c28440db29	This change represents a substantial restructure of the way we reassembly inbound tcp segments. The old algorithm just blindly dropped in segments without coalescing. This meant that every segment could take up greater and greater room on the linked list of segments. This of course is now subject to a tighter limit (100) of segments which in a high BDP situation will cause us to be a lot more in-efficent as we drop segments beyond 100 entries that we receive. What this restructure does is cause the reassembly buffer to coalesce segments putting an emphasis on the two common cases (which avoid walking the list of segments) i.e. where we add to the back of the queue of segments and where we add to the front. We also have the reassembly buffer supporting a couple of debug options (black box logging as well as counters for code coverage). These are compiled out by default but can be added by uncommenting the defines. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D16626	2018-08-20 12:43:18 +00:00
Randall Stewart	d18ea344e6	Fix a small bug in rack where it will end up sending the FIN twice. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D16604	2018-08-08 13:36:49 +00:00
Randall Stewart	936b2b64ae	This fixes a bug in Rack where we were not properly using the correct value for Delayed Ack. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D16579	2018-08-06 09:22:07 +00:00
Randall Stewart	4ad5b7a0ac	This fixes a hole where rack could end up sending an invalid segment into the reassembly queue. This would happen if you enabled the data after close option. Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D16453	2018-07-30 10:23:29 +00:00
Randall Stewart	399973c33d	Delete the example tcp stack "fastpath" which was only put in has an example. Sponsored by: Netflix inc. Differential Revision: https://reviews.freebsd.org/D16420	2018-07-24 14:55:47 +00:00
Randall Stewart	8de9ac5eec	Bump the ICMP echo limits to match the RFC Reviewed by: tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D16333	2018-07-18 22:49:53 +00:00
Michael Tuexen	a026a53a76	Use appropriate MSS value when populating the TCP FO client cookie cache When a client receives a SYN-ACK segment with a TFP fast open cookie, but without an MSS option, an MSS value from uninitialised stack memory is used. This patch ensures that in case no MSS option is included in the SYN-ACK, the appropriate value as given in RFC 7413 is used. Reviewed by: kbowling@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D16175	2018-07-10 10:42:48 +00:00
Michael Tuexen	5f1347d7c9	Allow alternate TCP stack to populate the TCP FO client cookie cache. Without this patch, TCP FO could be used when using alternate TCP stack, but only existing entires in the TCP client cookie cache could be used. This cache was not populated by connections using alternate TCP stacks. Sponsored by: Netflix, Inc.	2018-07-07 12:28:16 +00:00
Matt Macy	6573d7580b	epoch(9): allow preemptible epochs to compose - Add tracker argument to preemptible epochs - Inline epoch read path in kernel and tied modules - Change in_epoch to take an epoch as argument - Simplify tfb_tcp_do_segment to not take a ti_locked argument, there's no longer any benefit to dropping the pcbinfo lock and trying to do so just adds an error prone branchfest to these functions - Remove cases of same function recursion on the epoch as recursing is no longer free. - Remove the the TAILQ_ENTRY and epoch_section from struct thread as the tracker field is now stack or heap allocated as appropriate. Tested by: pho and Limelight Networks Reviewed by: kbowling at llnw dot com Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D16066	2018-07-04 02:47:16 +00:00
Gleb Smirnoff	a00f4ac22f	Revert r334843, and partially revert r335180. tcp_outflags[] were defined since 4BSD and are defined nowadays in all its descendants. Removing them breaks third party application.	2018-06-23 06:53:53 +00:00
Randall Stewart	c6f76759ca	Make sure that the t_peakrate_thr is not compiled in by default until NF can upstream it. Reviewed by: and suggested lstewart Sponsored by: Netflix Inc.	2018-06-19 11:20:28 +00:00
Matt Macy	9e58ff6ff9	convert inpcbinfo hash and info rwlocks to epoch + mutex - Convert inpcbinfo info & hash locks to epoch for read and mutex for write - Garbage collect code that handled INP_INFO_TRY_RLOCK failures as INP_INFO_RLOCK which can no longer fail When running 64 netperfs sending minimal sized packets on a 2x8x2 reduces unhalted core cycles samples in rwlock rlock/runlock in udp_send from 51% to 3%. Overall packet throughput rate limited by CPU affinity and NIC driver design choices. On the receiver unhalted core cycles samples in in_pcblookup_hash went from 13% to to 1.6% Tested by LLNW and pho@ Reviewed by: jtl Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D15686	2018-06-19 01:54:00 +00:00
Gleb Smirnoff	9293873e83	TCPOUTFLAGS no longer exists since r334843.	2018-06-14 22:25:10 +00:00
Randall Stewart	4aec110f70	This fixes several bugs that Larry Rosenman helped me find in Rack with respect to its handling of TCP Fast Open. Several fixes all related to TFO are included in this commit: 1) Handling of non-TFO retransmissions 2) Building the proper send-map when we are doing TFO 3) Dealing with the ack that comes back that includes the SYN and data. It appears that with this commit TFO now works :-) Thanks Larry for all your help!! Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D15758	2018-06-14 03:27:42 +00:00
Jonathan T. Looney	cff21e484b	Change RACK dependency on TCPHPTS from a build-time dependency to a load- time dependency. At present, RACK requires the TCPHPTS option to run. However, because modules can be moved from machine to machine, this dependency is really best assessed at load time rather than at build time. Reviewed by: rrs Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D15756	2018-06-11 14:27:19 +00:00
Randall Stewart	89e560f441	This commit brings in a new refactored TCP stack called Rack. Rack includes the following features: - A different SACK processing scheme (the old sack structures are not used). - RACK (Recent acknowledgment) where counting dup-acks is no longer done instead time is used to knwo when to retransmit. (see the I-D) - TLP (Tail Loss Probe) where we will probe for tail-losses to attempt to try not to take a retransmit time-out. (see the I-D) - Burst mitigation using TCPHTPS - PRR (partial rate reduction) see the RFC. Once built into your kernel, you can select this stack by either socket option with the name of the stack is "rack" or by setting the global sysctl so the default is rack. Note that any connection that does not support SACK will be kicked back to the "default" base FreeBSD stack (currently known as "default"). To build this into your kernel you will need to enable in your kernel: makeoptions WITH_EXTRA_TCP_STACKS=1 options TCPHPTS Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D15525	2018-06-07 18:18:13 +00:00
Randall Stewart	3ee9c3c4eb	This commit brings in the TCP high precision timer system (tcp_hpts). It is the forerunner/foundational work of bringing in both Rack and BBR which use hpts for pacing out packets. The feature is optional and requires the TCPHPTS option to be enabled before the feature will be active. TCP modules that use it must assure that the base component is compile in the kernel in which they are loaded. MFC after: Never Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D15020	2018-04-19 13:37:59 +00:00
Gleb Smirnoff	3bdf4c4274	Declare more TCP globals in tcp_var.h, so that alternative TCP stacks can use them. Gather all TCP tunables in tcp_var.h in one place and alphabetically sort them, to ease maintainance of the list. Don't copy and paste declarations in tcp_stacks/fastpath.c.	2017-10-11 20:36:09 +00:00
Michael Tuexen	e8aba2eb21	Avoid TCP log messages which are false positives. This is https://svnweb.freebsd.org/changeset/base/322812, just for alternate TCP stacks. XMFC with: 322812	2017-08-23 15:08:51 +00:00
Sean Bruno	43053c125a	Revert r307901 - Inform CC modules about loss events. This was discussed between various transport@ members and it was requested to be reverted and discussed. Submitted by: Kevin Bowling <kevin.bowling@kev009.com> Reported by: lawrence Reviewed by: hiren Sponsored by: Limelight Networks	2017-07-25 15:08:52 +00:00
Michael Tuexen	98732609d5	Improve comments to describe what the code does. Reported by: jtl Sponsored by: Netflix, Inc.	2017-06-01 15:11:18 +00:00
Michael Tuexen	ebfd753408	When a SYN-ACK is received in SYN-SENT state, RFC 793 requires the validation of SEG.ACK as the first step. If the ACK is not acceptable, a RST segment should be sent and the segment should be dropped. Up to now, the segment was partially processed. This patch moves the check for the SEG.ACK validation up to the front as required. Reviewed by: hiren, gnn MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D10424	2017-04-26 06:20:58 +00:00
Michael Tuexen	013f4df643	The sysctl variable net.inet.tcp.drop_synfin is not honored in all states, for example not in SYN-SENT. This patch adds code to check the sysctl variable in other states than LISTEN. Thanks to ae and gnn for providing comments. Reviewed by: gnn MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D9894	2017-04-12 20:27:15 +00:00
Steven Hartland	e44c1887fd	Use estimated RTT for receive buffer auto resizing instead of timestamps Switched from using timestamps to RTT estimates when performing TCP receive buffer auto resizing, as not all hosts support / enable TCP timestamps. Disabled reset of receive buffer auto scaling when not in bulk receive mode, which gives an extra 20% performance increase. Also extracted auto resizing to a common method shared between standard and fastpath modules. With this AWS S3 downloads at ~17ms latency on a 1Gbps connection jump from ~3MB/s to ~100MB/s using the default settings. Reviewed by: lstewart, gnn MFC after: 2 weeks Relnotes: Yes Sponsored by: Multiplay Differential Revision: https://reviews.freebsd.org/D9668	2017-04-10 08:19:35 +00:00
Warner Losh	fbbd9655e5	Renumber copyright clause 4 Renumber cluase 4 to 3, per what everybody else did when BSD granted them permission to remove clause 3. My insistance on keeping the same numbering for legal reasons is too pedantic, so give up on that point. Submitted by: Jan Schaumann <jschauma@stevens.edu> Pull Request: https://github.com/freebsd/freebsd/pull/96	2017-02-28 23:42:47 +00:00
Andrey V. Elsukov	fcf596178b	Merge projects/ipsec into head/. Small summary ------------- o Almost all IPsec releated code was moved into sys/netipsec. o New kernel modules added: ipsec.ko and tcpmd5.ko. New kernel option IPSEC_SUPPORT added. It enables support for loading and unloading of ipsec.ko and tcpmd5.ko kernel modules. o IPSEC_NAT_T option was removed. Now NAT-T support is enabled by default. The UDP_ENCAP_ESPINUDP_NON_IKE encapsulation type support was removed. Added TCP/UDP checksum handling for inbound packets that were decapsulated by transport mode SAs. setkey(8) modified to show run-time NAT-T configuration of SA. o New network pseudo interface if_ipsec(4) added. For now it is build as part of ipsec.ko module (or with IPSEC kernel). It implements IPsec virtual tunnels to create route-based VPNs. o The network stack now invokes IPsec functions using special methods. The only one header file <netipsec/ipsec_support.h> should be included to declare all the needed things to work with IPsec. o All IPsec protocols handlers (ESP/AH/IPCOMP protosw) were removed. Now these protocols are handled directly via IPsec methods. o TCP_SIGNATURE support was reworked to be more close to RFC. o PF_KEY SADB was reworked: - now all security associations stored in the single SPI namespace, and all SAs MUST have unique SPI. - several hash tables added to speed up lookups in SADB. - SADB now uses rmlock to protect access, and concurrent threads can do SA lookups in the same time. - many PF_KEY message handlers were reworked to reflect changes in SADB. - SADB_UPDATE message was extended to support new PF_KEY headers: SADB_X_EXT_NEW_ADDRESS_SRC and SADB_X_EXT_NEW_ADDRESS_DST. They can be used by IKE daemon to change SA addresses. o ipsecrequest and secpolicy structures were cardinally changed to avoid locking protection for ipsecrequest. Now we support only limited number (4) of bundled SAs, but they are supported for both INET and INET6. o INPCB security policy cache was introduced. Each PCB now caches used security policies to avoid SP lookup for each packet. o For inbound security policies added the mode, when the kernel does check for full history of applied IPsec transforms. o References counting rules for security policies and security associations were changed. The proper SA locking added into xform code. o xform code was also changed. Now it is possible to unregister xforms. tdb_xxx structures were changed and renamed to reflect changes in SADB/SPDB, and changed rules for locking and refcounting. Reviewed by: gnn, wblock Obtained from: Yandex LLC Relnotes: yes Sponsored by: Yandex LLC Differential Revision: https://reviews.freebsd.org/D9352	2017-02-06 08:49:57 +00:00
George V. Neville-Neil	fad073dd44	Followup to mtod removal in main stack (r311225). Continued removal of mtod() calls from TCP_PROBE macros. MFC after: 1 week Sponsored by: Limelight Networks	2017-01-04 04:00:28 +00:00
Michael Tuexen	35dfb8cb68	Ensure that TCP state changes to state-closing are reported via dtrace. This does not cover state changes from TIME-WAIT. Reviewed by: gnn MFC after: 3 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D8443	2016-11-19 14:45:08 +00:00
Michael Tuexen	6779a1a101	Notify the use via setting errno when a TCP RST segment is received either in the CLOSING or LAST-ACK state. Reviewed by: hiren MFC after: 3 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D8371	2016-11-17 08:15:02 +00:00
Hiren Panchasara	4e7f755377	FreeBSD tcp stack used to inform respective congestion control module about the loss event but not use or obay the recommendations i.e. values set by it in some cases. Here is an attempt to solve that confusion by following relevant RFCs/drafts. Stack only sets congestion window/slow start threshold values when there is no CC module availalbe to take that action. All CC modules are inspected and updated when needed to take appropriate action on loss. tcp_stacks/fastpath module has been updated to adapt these changes. Note: Probably, the most significant change would be to not bring congestion window down to 1MSS on a loss signaled by 3-duplicate acks and letting respective CC decide that value. In collaboration with: Matt Macy <mmacy at nextbsd dot org> Discussed on: transport@ mailing list Reviewed by: jtl MFC after: 1 month Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D8225	2016-10-25 05:45:47 +00:00
Jonathan T. Looney	6d172f58a2	The code currently resets the keepalive timer each time a packet is received on a TCP session that has entered the ESTABLISHED state. This results in a lot of calls to reset the keepalive timer. This patch changes the behavior so we set the keepalive timer for the keepalive idle time (TP_KEEPIDLE). When the keepalive timer fires, it will first check to see if the session has been idle for TP_KEEPIDLE ticks. If not, it will reschedule the keepalive timer for the time the session will have been idle for TP_KEEPIDLE ticks. For a session with regular communication, the keepalive timer should fire approximately once every TP_KEEPIDLE ticks. For sessions with irregular communication, the keepalive timer might fire more often. But, the disruption from a periodic keepalive timer should be less than the regular cost of resetting the keepalive timer on every packet. (FWIW, this change saved approximately 1.73% of the busy CPU cycles on a particular test system with a heavy TCP output load. Of course, the actual impact is very specific to the particular hardware and workload.) Reviewed by: gallatin, rrs MFC after: 2 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D8243	2016-10-14 14:57:43 +00:00
Gleb Smirnoff	ec7bbf1f79	With build without TCP_HHOOK and with INVARIANTS. Before mutex.h came via sys/hhook.h -> sys/rmlock.h -> sys/mutex.h.	2016-10-13 18:02:29 +00:00
Jonathan T. Looney	bd79708dbf	In the TCP stack, the hhook(9) framework provides hooks for kernel modules to add actions that run when a TCP frame is sent or received on a TCP session in the ESTABLISHED state. In the base tree, this functionality is only used for the h_ertt module, which is used by the cc_cdg, cc_chd, cc_hd, and cc_vegas congestion control modules. Presently, we incur overhead to check for hooks each time a TCP frame is sent or received on an ESTABLISHED TCP session. This change adds a new compile-time option (TCP_HHOOK) to determine whether to include the hhook(9) framework for TCP. To retain backwards compatibility, I added the TCP_HHOOK option to every configuration file that already defined "options INET". (Therefore, this patch introduces no functional change. In order to see a functional difference, you need to compile a custom kernel without the TCP_HHOOK option.) This change will allow users to easily exclude this functionality from their kernel, should they wish to do so. Note that any users who use a custom kernel configuration and use one of the congestion control modules listed above will need to add the TCP_HHOOK option to their kernel configuration. Reviewed by: rrs, lstewart, hiren (previous version), sjg (makefiles only) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D8185	2016-10-12 02:16:42 +00:00
Jonathan T. Looney	3ac125068a	Remove "long" variables from the TCP stack (not including the modular congestion control framework). Reviewed by: gnn, lstewart (partial) Sponsored by: Juniper Networks, Netflix Differential Revision: (multiple) Tested by: Limelight, Netflix	2016-10-06 16:28:34 +00:00
Hiren Panchasara	06b99bd826	Adjust TCP module fastpath after r304803's cc_ack_received() changes. Reported by: hiren, bz, np Reviewed by: rrs Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D7664	2016-08-26 19:23:17 +00:00
Andrey V. Elsukov	4c10540274	Cleanup unneded include "opt_ipfw.h". It was used for conditional build IPFIREWALL_FORWARD support. But IPFIREWALL_FORWARD option was removed a long time ago.	2016-06-09 05:48:34 +00:00
Randall Stewart	5105a92c49	This small change adopts the excellent suggestion for using named structures in the add of a new tcp-stack that came in late to me via email after the last commit. It also makes it so that a new stack may optionally get a callback during a retransmit timeout. This allows the new stack to clear specific state (think sack scoreboards or other such structures). Sponsored by: Netflix Inc. Differential Revision: http://reviews.freebsd.org/D6303	2016-05-17 09:53:22 +00:00
Mark Johnston	565e7fd3bc	opt_kdtrace.h is not needed for SDT probes as of r258541.	2016-05-15 20:04:43 +00:00
Pedro F. Giffuni	a4641f4eaa	sys/net*: minor spelling fixes. No functional change.	2016-05-03 18:05:43 +00:00
Randall Stewart	e5ad64562a	This cleans up the timers code in TCP to start using the new async_drain functionality. This as been tested in NF as well as by Verisign. Still to do in here is to remove all the old flags. They are currently left being maintained but probably are no longer needed. Sponsored by: Netflix Inc. Differential Revision: http://reviews.freebsd.org/D5924	2016-04-28 13:27:12 +00:00
Bjoern A. Zeeb	19edab1711	Remove duplicate external declaration of tcprexmtthresh making gcc compiles barf.	2016-03-13 21:26:18 +00:00
Randall Stewart	ec64c84ddc	Fix a sneaky bug where we were missing an extern to get the rxt threshold.. and thus created our own defaulted to 0 :-( Sponsored by: Netflix Inc	2016-03-08 00:16:34 +00:00
George V. Neville-Neil	e79cb051d5	Fix dtrace probes (introduced in 287759): debug__input was used for output and drop; connect didn't always fire a user probe some probes were missing in fastpath Submitted by: Hannes Mehnert Sponsored by: REMS, EPSRC Differential Revision: https://reviews.freebsd.org/D5525	2016-03-03 17:46:38 +00:00
Randall Stewart	6e0efc6a39	This fixes the fastpath code to have a better module initialization sequence when included in loader.conf. It also fixes it so that no matter if some one incorrectly specifies a load order, the lists and such will be initialized on demand at that time so no one can make that mistake. Reviewed by: hiren Differential Revision: D5189	2016-02-23 17:53:39 +00:00
Gleb Smirnoff	4644fda3f7	Rename netinet/tcp_cc.h to netinet/cc/cc.h. Discussed with: lstewart	2016-01-27 17:59:39 +00:00
Gleb Smirnoff	2de3e790f5	- Rename cc.h to more meaningful tcp_cc.h. - Declare it a kernel only include, which it already is. - Don't include tcp.h implicitly from tcp_cc.h	2016-01-21 22:34:51 +00:00
Gleb Smirnoff	b66d74c138	Cleanup TCP files from unnecessary interface related includes.	2016-01-21 22:24:20 +00:00
Jonathan T. Looney	49b375e74b	Apply the changes from r293284 to one additional file. Discussed with: glebius	2016-01-07 11:54:20 +00:00
Randall Stewart	f4e476c893	Remove redundant extern's that make the ppc compile fail. Thanks Ed Maste for the heads up.	2015-12-16 15:16:44 +00:00
Randall Stewart	55bceb1e2b	First cut of the modularization of our TCP stack. Still to do is to clean up the timer handling using the async-drain. Other optimizations may be coming to go with this. Whats here will allow differnet tcp implementations (one included). Reviewed by: jtl, hiren, transports Sponsored by: Netflix Inc. Differential Revision: D4055	2015-12-16 00:56:45 +00:00

1 2 3

130 Commits