freebsd-dev

Author	SHA1	Message	Date
Richard Scheffenegger	93e28d6e89	tcp: LRO code to deal with all 12 TCP header flags TCP per RFC793 has 4 reserved flag bits for future use. One of those bits may be used for Accurate ECN. This patch is to include these bits in the LRO code to ease the extensibility if/when these bits are used. Reviewed By: hselasky, rrs, #transport Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D34127	2022-02-01 18:41:36 +01:00
Peter Lei	e28330832b	tcp: socket option to get stack alias name TCP stack sysctl nodes are currently inserted using the stack name alias. Allow the user to get the current stack's alias to allow for programatic sysctl access. Obtained from: Netflix	2021-10-27 08:21:59 -07:00
Randall Stewart	4e4c84f8d1	tcp: Add hystart-plus to cc_newreno and rack. TCP Hystart draft version -03: https://datatracker.ietf.org/doc/html/draft-ietf-tcpm-hystartplusplus Is a new version of hystart that allows one to carefully exit slow start if the RTT spikes too much. The newer version has a slower-slow-start so to speak that then kicks in for five round trips. To see if you exited too early, if not into congestion avoidance. This commit will add that feature to our newreno CC and add the needed bits in rack to be able to enable it. Reviewed by: tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D32373	2021-10-22 07:10:28 -04:00
Randall Stewart	5baf32c97a	tcp: Add support for DSACK based reordering window to rack. The rack stack, with respect to the rack bits in it, was originally built based on an early I-D of rack. In fact at that time the TLP bits were in a separate I-D. The dynamic reordering window based on DSACK events was not present in rack at that time. It is now part of the RFC and we need to update our stack to include these features. However we want to have a way to control the feature so that we can, if the admin decides, make it stay the same way system wide as well as via socket option. The new sysctl and socket option has the following meaning for setting: 00 (0) - Keep the old way, i.e. reordering window is 1 and do not use DSACK bytes to add to reorder window 01 (1) - Change the Reordering window to 1/4 of an RTT but do not use DSACK bytes to add to reorder window 10 (2) - Keep the reordering window as 1, but do use SACK bytes to add additional 1/4 RTT delay to the reorder window 11 (3) - reordering window is 1/4 of an RTT and add additional DSACK bytes to increase the reordering window (RFC behavior) The default currently in the sysctl is 3 so we get standards based behavior. Reviewed by: tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D31506	2021-08-17 16:29:22 -04:00
Kristof Provost	8e1864ed07	pf: syncookie support Import OpenBSD's syncookie support for pf. This feature help pf resist TCP SYN floods by only creating states once the remote host completes the TCP handshake rather than when the initial SYN packet is received. This is accomplished by using the initial sequence numbers to encode a cookie (hence the name) in the SYN+ACK response and verifying this on receipt of the client ACK. Reviewed by: kbowling Obtained from: OpenBSD MFC after: 1 week Sponsored by: Modirum MDPay Differential Revision: https://reviews.freebsd.org/D31138	2021-07-20 10:36:13 +02:00
Randall Stewart	4f3addd94b	tcp: Add a socket option to rack so we can test various changes to the slop value in timers. Timer_slop, in TCP, has been 200ms for a long time. This value dates back a long time when delayed ack timers were longer and links were slower. A 200ms timer slop allows 1 MSS to be sent over a 60kbps link. Its possible that lowering this value to something more in line with todays delayed ack values (40ms) might improve TCP. This bit of code makes it so rack can, via a socket option, adjust the timer slop. Reviewed by: mtuexen Sponsered by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30249	2021-05-26 06:43:30 -04:00
Richard Scheffenegger	0471a8c734	tcp: SACK Lost Retransmission Detection (LRD) Recover from excessive losses without reverting to a retransmission timeout (RTO). Disabled by default, enable with sysctl net.inet.tcp.do_lrd=1 Reviewed By: #transport, rrs, tuexen, #manpages Sponsored by: Netapp, Inc. Differential Revision: https://reviews.freebsd.org/D28931	2021-05-10 19:06:20 +02:00
Randall Stewart	5d8fd932e4	This brings into sync FreeBSD with the netflix versions of rack and bbr. This fixes several breakages (panics) since the tcp_lro code was committed that have been reported. Quite a few new features are now in rack (prefecting of DGP -- Dynamic Goodput Pacing among the largest). There is also support for ack-war prevention. Documents comming soon on rack.. Sponsored by: Netflix Reviewed by: rscheff, mtuexen Differential Revision: https://reviews.freebsd.org/D30036	2021-05-06 11:22:26 -04:00
Michael Tuexen	9e644c2300	tcp: add support for TCP over UDP Adding support for TCP over UDP allows communication with TCP stacks which can be implemented in userspace without requiring special priviledges or specific support by the OS. This is joint work with rrs. Reviewed by: rrs Sponsored by: Netflix, Inc. MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D29469	2021-04-18 16:16:42 +02:00
Andrew Gallatin	a034518ac8	Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain In order to efficiently serve web traffic on a NUMA machine, one must avoid as many NUMA domain crossings as possible. With SO_REUSEPORT_LB, a number of workers can share a listen socket. However, even if a worker sets affinity to a core or set of cores on a NUMA domain, it will receive connections associated with all NUMA domains in the system. This will lead to cross-domain traffic when the server writes to the socket or calls sendfile(), and memory is allocated on the server's local NUMA node, but transmitted on the NUMA node associated with the TCP connection. Similarly, when the server reads from the socket, he will likely be reading memory allocated on the NUMA domain associated with the TCP connection. This change provides a new socket ioctl, TCP_REUSPORT_LB_NUMA. A server can now tell the kernel to filter traffic so that only incoming connections associated with the desired NUMA domain are given to the server. (Of course, in the case where there are no servers sharing the listen socket on some domain, then as a fallback, traffic will be hashed as normal to all servers sharing the listen socket regardless of domain). This allows a server to deal only with traffic that is local to its NUMA domain, and avoids cross-domain traffic in most cases. This patch, and a corresponding small patch to nginx to use TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted https media content from dual-socket Xeons with only 13% (as measured by pcm.x) cross domain traffic on the memory controller. Reviewed by: jhb, bz (earlier version), bcr (man page) Tested by: gonzo Sponsored by: Netfix Differential Revision: https://reviews.freebsd.org/D21636	2020-12-19 22:04:46 +00:00
Richard Scheffenegger	e399566123	TCP: send full initial window when timestamps are in use The fastpath in tcp_output tries to send out full segments, and avoid sending partial segments by comparing against the static t_maxseg variable. That value does not consider tcp options like timestamps, while the initial window calculation is using the correct dynamic tcp_maxseg() function. Due to this interaction, the last, full size segment is considered too short and not sent out immediately. Reviewed by: tuexen MFC after: 2 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D26478	2020-09-25 10:38:19 +00:00
Mateusz Guzik	662c13053f	net: clean up empty lines in .c and .h files	2020-09-01 21:19:14 +00:00
John Baldwin	f1f9347546	Initial support for kernel offload of TLS receive. - Add a new TCP_RXTLS_ENABLE socket option to set the encryption and authentication algorithms and keys as well as the initial sequence number. - When reading from a socket using KTLS receive, applications must use recvmsg(). Each successful call to recvmsg() will return a single TLS record. A new TCP control message, TLS_GET_RECORD, will contain the TLS record header of the decrypted record. The regular message buffer passed to recvmsg() will receive the decrypted payload. This is similar to the interface used by Linux's KTLS RX except that Linux does not return the full TLS header in the control message. - Add plumbing to the TOE KTLS interface to request either transmit or receive KTLS sessions. - When a socket is using receive KTLS, redirect reads from soreceive_stream() into soreceive_generic(). - Note that this interface is currently only defined for TLS 1.1 and 1.2, though I believe we will be able to reuse the same interface and structures for 1.3.	2020-04-27 23:17:19 +00:00
Randall Stewart	e570d231f4	This change does a small prepratory step in getting the latest rack and bbr in from the NF repo. When those come in the OOB data handling will be fixed where Skyzaller crashes. Differential Revision: https://reviews.freebsd.org/D24575	2020-04-27 16:30:29 +00:00
Randall Stewart	481be5de9d	White space cleanup -- remove trailing tab's or spaces from any line. Sponsored by: Netflix Inc.	2020-02-12 13:31:36 +00:00
Michael Tuexen	493c98c6d2	Add flags for upcoming patches related to improved ECN handling. No functional change. Submitted by: Richard Scheffenegger Reviewed by: rgrimes@, tuexen@ Differential Revision: https://reviews.freebsd.org/D22429	2019-12-31 14:32:48 +00:00
Edward Tomasz Napierala	adc56f5a38	Make use of the stats(3) framework in the TCP stack. This makes it possible to retrieve per-connection statistical information such as the receive window size, RTT, or goodput, using a newly added TCP_STATS getsockopt(3) option, and extract them using the stats_voistat_fetch(3) API. See the net/tcprtt port for an example consumer of this API. Compared to the existing TCP_INFO system, the main differences are that this mechanism is easy to extend without breaking ABI, and provides statistical information instead of raw "snapshots" of values at a given point in time. stats(3) is more generic and can be used in both userland and the kernel. Reviewed by: thj Tested by: thj Obtained from: Netflix Relnotes: yes Sponsored by: Klara Inc, Netflix Differential Revision: https://reviews.freebsd.org/D20655	2019-12-02 20:58:04 +00:00
John Baldwin	9e14430d46	Add a TOE KTLS mode and a TOE hook for allocating TLS sessions. This adds the glue to allocate TLS sessions and invokes it from the TLS enable socket option handler. This also adds some counters for active TOE sessions. The TOE KTLS mode is returned by getsockopt(TLSTX_TLS_MODE) when TOE KTLS is in use on a socket, but cannot be set via setsockopt(). To simplify various checks, a TLS session now includes an explicit 'mode' member set to the value returned by TLSTX_TLS_MODE. Various places that used to check 'sw_encrypt' against NULL to determine software vs ifnet (NIC) TLS now check 'mode' instead. Reviewed by: np, gallatin Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D21891	2019-10-08 21:34:06 +00:00
Randall Stewart	35c7bb3407	This commit adds BBR (Bottleneck Bandwidth and RTT) congestion control. This is a completely separate TCP stack (tcp_bbr.ko) that will be built only if you add the make options WITH_EXTRA_TCP_STACKS=1 and also include the option TCPHPTS. You can also include the RATELIMIT option if you have a NIC interface that supports hardware pacing, BBR understands how to use such a feature. Note that this commit also adds in a general purpose time-filter which allows you to have a min-filter or max-filter. A filter allows you to have a low (or high) value for some period of time and degrade slowly to another value has time passes. You can find out the details of BBR by looking at the original paper at: https://queue.acm.org/detail.cfm?id=3022184 or consult many other web resources you can find on the web referenced by "BBR congestion control". It should be noted that BBRv1 (which this is) does tend to unfairness in cases of small buffered paths, and it will usually get less bandwidth in the case of large BDP paths(when competing with new-reno or cubic flows). BBR is still an active research area and we do plan on implementing V2 of BBR to see if it is an improvement over V1. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D21582	2019-09-24 18:18:11 +00:00
John Baldwin	b2e60773c6	Add kernel-side support for in-kernel TLS. KTLS adds support for in-kernel framing and encryption of Transport Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports offload of TLS for transmitted data. Key negotation must still be performed in userland. Once completed, transmit session keys for a connection are provided to the kernel via a new TCP_TXTLS_ENABLE socket option. All subsequent data transmitted on the socket is placed into TLS frames and encrypted using the supplied keys. Any data written to a KTLS-enabled socket via write(2), aio_write(2), or sendfile(2) is assumed to be application data and is encoded in TLS frames with an application data type. Individual records can be sent with a custom type (e.g. handshake messages) via sendmsg(2) with a new control message (TLS_SET_RECORD_TYPE) specifying the record type. At present, rekeying is not supported though the in-kernel framework should support rekeying. KTLS makes use of the recently added unmapped mbufs to store TLS frames in the socket buffer. Each TLS frame is described by a single ext_pgs mbuf. The ext_pgs structure contains the header of the TLS record (and trailer for encrypted records) as well as references to the associated TLS session. KTLS supports two primary methods of encrypting TLS frames: software TLS and ifnet TLS. Software TLS marks mbufs holding socket data as not ready via M_NOTREADY similar to sendfile(2) when TLS framing information is added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then called to schedule TLS frames for encryption. In the case of sendfile_iodone() calls ktls_enqueue() instead of pru_ready() leaving the mbufs marked M_NOTREADY until encryption is completed. For other writes (vn_sendfile when pages are available, write(2), etc.), the PRUS_NOTREADY is set when invoking pru_send() along with invoking ktls_enqueue(). A pool of worker threads (the "KTLS" kernel process) encrypts TLS frames queued via ktls_enqueue(). Each TLS frame is temporarily mapped using the direct map and passed to a software encryption backend to perform the actual encryption. (Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if someone wished to make this work on architectures without a direct map.) KTLS supports pluggable software encryption backends. Internally, Netflix uses proprietary pure-software backends. This commit includes a simple backend in a new ktls_ocf.ko module that uses the kernel's OpenCrypto framework to provide AES-GCM encryption of TLS frames. As a result, software TLS is now a bit of a misnomer as it can make use of hardware crypto accelerators. Once software encryption has finished, the TLS frame mbufs are marked ready via pru_ready(). At this point, the encrypted data appears as regular payload to the TCP stack stored in unmapped mbufs. ifnet TLS permits a NIC to offload the TLS encryption and TCP segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS) is allocated on the interface a socket is routed over and associated with a TLS session. TLS records for a TLS session using ifnet TLS are not marked M_NOTREADY but are passed down the stack unencrypted. The ip_output_send() and ip6_output_send() helper functions that apply send tags to outbound IP packets verify that the send tag of the TLS record matches the outbound interface. If so, the packet is tagged with the TLS send tag and sent to the interface. The NIC device driver must recognize packets with the TLS send tag and schedule them for TLS encryption and TCP segmentation. If the the outbound interface does not match the interface in the TLS send tag, the packet is dropped. In addition, a task is scheduled to refresh the TLS send tag for the TLS session. If a new TLS send tag cannot be allocated, the connection is dropped. If a new TLS send tag is allocated, however, subsequent packets will be tagged with the correct TLS send tag. (This latter case has been tested by configuring both ports of a Chelsio T6 in a lagg and failing over from one port to another. As the connections migrated to the new port, new TLS send tags were allocated for the new port and connections resumed without being dropped.) ifnet TLS can be enabled and disabled on supported network interfaces via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported across both vlan devices and lagg interfaces using failover, lacp with flowid enabled, or lacp with flowid enabled. Applications may request the current KTLS mode of a connection via a new TCP_TXTLS_MODE socket option. They can also use this socket option to toggle between software and ifnet TLS modes. In addition, a testing tool is available in tools/tools/switch_tls. This is modeled on tcpdrop and uses similar syntax. However, instead of dropping connections, -s is used to force KTLS connections to switch to software TLS and -i is used to switch to ifnet TLS. Various sysctls and counters are available under the kern.ipc.tls sysctl node. The kern.ipc.tls.enable node must be set to true to enable KTLS (it is off by default). The use of unmapped mbufs must also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS. KTLS is enabled via the KERN_TLS kernel option. This patch is the culmination of years of work by several folks including Scott Long and Randall Stewart for the original design and implementation; Drew Gallatin for several optimizations including the use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records awaiting software encryption, and pluggable software crypto backends; and John Baldwin for modifications to support hardware TLS offload. Reviewed by: gallatin, hselasky, rrs Obtained from: Netflix Sponsored by: Netflix, Chelsio Communications Differential Revision: https://reviews.freebsd.org/D21277	2019-08-27 00:01:56 +00:00
Randall Stewart	3b0b41e613	This commit updates rack to what is basically being used at NF as well as sets in some of the groundwork for committing BBR. The hpts system is updated as well as some other needed utilities for the entrance of BBR. This is actually part 1 of 3 more needed commits which will finally complete with BBRv1 being added as a new tcp stack. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D20834	2019-07-10 20:40:39 +00:00
Randall Stewart	89e560f441	This commit brings in a new refactored TCP stack called Rack. Rack includes the following features: - A different SACK processing scheme (the old sack structures are not used). - RACK (Recent acknowledgment) where counting dup-acks is no longer done instead time is used to knwo when to retransmit. (see the I-D) - TLP (Tail Loss Probe) where we will probe for tail-losses to attempt to try not to take a retransmit time-out. (see the I-D) - Burst mitigation using TCPHTPS - PRR (partial rate reduction) see the RFC. Once built into your kernel, you can select this stack by either socket option with the name of the stack is "rack" or by setting the global sysctl so the default is rack. Note that any connection that does not support SACK will be kicked back to the "default" base FreeBSD stack (currently known as "default"). To build this into your kernel you will need to enable in your kernel: makeoptions WITH_EXTRA_TCP_STACKS=1 options TCPHPTS Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D15525	2018-06-07 18:18:13 +00:00
Jonathan T. Looney	2529f56ed3	Add the "TCP Blackbox Recorder" which we discussed at the developer summits at BSDCan and BSDCam in 2017. The TCP Blackbox Recorder allows you to capture events on a TCP connection in a ring buffer. It stores metadata with the event. It optionally stores the TCP header associated with an event (if the event is associated with a packet) and also optionally stores information on the sockets. It supports setting a log ID on a TCP connection and using this to correlate multiple connections that share a common log ID. You can log connections in different modes. If you are doing a coordinated test with a particular connection, you may tell the system to put it in mode 4 (continuous dump). Or, if you just want to monitor for errors, you can put it in mode 1 (ring buffer) and dump all the ring buffers associated with the connection ID when we receive an error signal for that connection ID. You can set a default mode that will be applied to a particular ratio of incoming connections. You can also manually set a mode using a socket option. This commit includes only basic probes. rrs@ has added quite an abundance of probes in his TCP development work. He plans to commit those soon. There are user-space programs which we plan to commit as ports. These read the data from the log device and output pcapng files, and then let you analyze the data (and metadata) in the pcapng files. Reviewed by: gnn (previous version) Obtained from: Netflix, Inc. Relnotes: yes Differential Revision: https://reviews.freebsd.org/D11085	2018-03-22 09:40:08 +00:00
Patrick Kelsey	c560df6f12	This is an implementation of the client side of TCP Fast Open (TFO) [RFC7413]. It also includes a pre-shared key mode of operation in which the server requires the client to be in possession of a shared secret in order to successfully open TFO connections with that server. The names of some existing fastopen sysctls have changed (e.g., net.inet.tcp.fastopen.enabled -> net.inet.tcp.fastopen.server_enable). Reviewed by: tuexen MFC after: 1 month Sponsored by: Limelight Networks Differential Revision: https://reviews.freebsd.org/D14047	2018-02-26 02:53:22 +00:00
Pedro F. Giffuni	51369649b0	sys: further adoption of SPDX licensing ID tags. Mainly focus on files that use BSD 3-Clause license. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, superceed or replace the license texts. Special thanks to Wind River for providing access to "The Duke of Highlander" tool: an older (2014) run over FreeBSD tree was useful as a starting point.	2017-11-20 19:43:44 +00:00
Warner Losh	fbbd9655e5	Renumber copyright clause 4 Renumber cluase 4 to 3, per what everybody else did when BSD granted them permission to remove clause 3. My insistance on keeping the same numbering for legal reasons is too pedantic, so give up on that point. Submitted by: Jan Schaumann <jschauma@stevens.edu> Pull Request: https://github.com/freebsd/freebsd/pull/96	2017-02-28 23:42:47 +00:00
Gleb Smirnoff	d519cedbad	Provide new socket option TCP_CCALGOOPT, which stands for TCP congestion control algorithm options. The argument is variable length and is opaque to TCP, forwarded directly to the algorithm's ctl_output method. Provide new includes directory netinet/cc, where algorithm specific headers can be installed. The new API doesn't yet have any in tree consumers. The original code written by lstewart. Reviewed by: rrs, emax Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D711	2016-01-22 02:07:48 +00:00
Patrick Kelsey	281a0fd4f9	Implementation of server-side TCP Fast Open (TFO) [RFC7413]. TFO is disabled by default in the kernel build. See the top comment in sys/netinet/tcp_fastopen.c for implementation particulars. Reviewed by: gnn, jch, stas MFC after: 3 days Sponsored by: Verisign, Inc. Differential Revision: https://reviews.freebsd.org/D4350	2015-12-24 19:09:48 +00:00
Randall Stewart	55bceb1e2b	First cut of the modularization of our TCP stack. Still to do is to clean up the timer handling using the async-drain. Other optimizations may be coming to go with this. Whats here will allow differnet tcp implementations (one included). Reviewed by: jtl, hiren, transports Sponsored by: Netflix Inc. Differential Revision: D4055	2015-12-16 00:56:45 +00:00
Hiren Panchasara	86a996e6bd	There are times when it would be really nice to have a record of the last few packets and/or state transitions from each TCP socket. That would help with narrowing down certain problems we see in the field that are hard to reproduce without understanding the history of how we got into a certain state. This change provides just that. It saves copies of the last N packets in a list in the tcpcb. When the tcpcb is destroyed, the list is freed. I thought this was likely to be more performance-friendly than saving copies of the tcpcb. Plus, with the packets, you should be able to reverse-engineer what happened to the tcpcb. To enable the feature, you will need to compile a kernel with the TCPPCAP option. Even then, the feature defaults to being deactivated. You can activate it by setting a positive value for the number of captured packets. You can do that on either a global basis or on a per-socket basis (via a setsockopt call). There is no way to get the packets out of the kernel other than using kmem or getting a coredump. I thought that would help some of the legal/privacy concerns regarding such a feature. However, it should be possible to add a future effort to export them in PCAP format. I tested this at low scale, and found that there were no mbuf leaks and the peak mbuf usage appeared to be unchanged with and without the feature. The main performance concern I can envision is the number of mbufs that would be used on systems with a large number of sockets. If you save five packets per direction per socket and have 3,000 sockets, that will consume at least 30,000 mbufs just to keep these packets. I tried to reduce the concerns associated with this by limiting the number of clusters (not mbufs) that could be used for this feature. Again, in my testing, that appears to work correctly. Differential Revision: D3100 Submitted by: Jonathan Looney <jlooney at juniper dot net> Reviewed by: gnn, hiren	2015-10-14 00:35:37 +00:00
John Baldwin	0d25fab44d	Add placeholder constants to reserve a portion of the socket option name space for use by downstream vendors to add custom options. MFC after: 2 weeks	2013-02-01 15:32:20 +00:00
John Baldwin	1d77fa5a26	Use decimal values for UDP and TCP socket options rather than hex to avoid implying that these constants should be treated as bit masks. Reviewed by: net MFC after: 1 week	2013-01-22 19:45:04 +00:00
Gleb Smirnoff	9077f38738	Add new socket options: TCP_KEEPINIT, TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT, that allow to control initial timeout, idle time, idle re-send interval and idle send count on a per-socket basis. Reviewed by: andre, bz, lstewart	2012-02-05 16:53:02 +00:00
Ed Schouten	cf05e311ea	Add missing #includes. According to POSIX, these two header files should be able to be included by themselves, not depending on other headers. The <net/if.h> header uses struct sockaddr when __BSD_VISIBLE=1, while <netinet/tcp.h> uses integer datatypes (u_int32_t, u_short, etc). MFC after: 2 months	2011-10-21 12:58:34 +00:00
George V. Neville-Neil	f5d34df525	Add new, per connection, statistics for TCP, including: Retransmitted Packets Zero Window Advertisements Out of Order Receives These statistics are available via the -T argument to netstat(1). MFC after: 2 weeks	2010-11-17 18:55:12 +00:00
Andre Oppermann	1c18314d17	Remove the TCP inflight bandwidth limiter as announced in r211315 to give way for the pluggable congestion control framework. It is the task of the congestion control algorithm to set the congestion window and amount of inflight data without external interference. In 'struct tcpcb' the variables previously used by the inflight limiter are renamed to spares to keep the ABI intact and to have some more space for future extensions. In 'struct tcp_info' the variable 'tcpi_snd_bwnd' is not removed to preserve the ABI. It is always set to 0. In siftr.c in 'struct pkt_node' the variable 'snd_bwnd' is not removed to preserve the ABI. It is always set to 0. These unused variable in the various structures may be reused in the future or garbage collected before the next release or at some other point when an ABI change happens anyway for other reasons. No MFC is planned. The inflight bandwidth limiter stays disabled by default in the other branches but remains available.	2010-09-16 21:06:45 +00:00
Andre Oppermann	2c9879e8d3	Improve comment to TCP_MINMSS by taking the wording from lstewart (with a small difference in the last paragraph though) as suggested by jhb. Clarify that the 'reviewed by' in r212653 by lstewart was for the functional change, not the comments in the committed version.	2010-09-16 12:13:06 +00:00
Andre Oppermann	c183b9c683	Change the default MSS for IPv4 and IPv6 TCP connections from an artificial power-of-2 rounded number to their real values specified in RFC879 and RFC2460. From the history and existing comments it appears that the rounded numbers were intended to be advantageous for the kernel and mbuf system. However this hasn't been the case at for at least a long time. The mbuf clusters used in tcp_output() have enough space to hold the larger real value for the default MSS for both IPv4 and IPv6. Note that the default MSS is only used when path MTU discovery is disabled. Update and expand related comments. Reviewed by: lsteward (including some word-smithing) MFC after: 2 weeks	2010-09-15 10:39:30 +00:00
Luigi Rizzo	dc5fd2595c	use u_char instead of u_int for short bitfields. For our compiler the two constructs are completely equivalent, but some compilers (including MSC and tcc) use the base type for alignment, which in the cases touched here result in aligning the bitfields to 32 bit instead of the 8 bit that is meant here. Note that almost all other headers where small bitfields are used have u_int8_t instead of u_int. MFC after: 3 days	2010-02-01 14:13:44 +00:00
John Baldwin	43d9473499	- Rename the __tcpi_(snd\|rcv)_mss fields of the tcp_info structure to remove the leading underscores since they are now implemented. - Implement the tcpi_rto and tcpi_last_data_recv fields in the tcp_info structure. Reviewed by: rwatson MFC after: 2 weeks	2009-12-22 15:47:40 +00:00
Kip Macy	535fbad68f	add rcv_nxt, snd_nxt, and toe offload id to FreeBSD-specific extension fields for tcp_info	2008-05-05 20:13:31 +00:00
Andre Oppermann	c343c524e1	Use #defines for TCP options padding after EOL to be consistent. Reviewed by: bz	2008-04-07 18:43:59 +00:00
Kip Macy	ee939bbf7e	Add socket option for setting and retrieving the congestion control algorithm. The name used is to allow compatibility with Linux.	2007-12-16 03:30:07 +00:00
Andre Oppermann	faedb66c2a	The printf %b list in PRINT_TH_FLAGS has to be in octal numbering. Thus convert \8 to \10 and the warnings go away. Pointed out by: sam, ru, thompsa	2007-05-25 21:28:49 +00:00
Andre Oppermann	a250f3820c	Add CWR back into the PRINT_TH_FLAGS list as gcc42 doesn't complain about \8 in a string anymore.	2007-05-23 19:16:21 +00:00
Andre Oppermann	df541e5fc1	Add tcp_log_addrs() function to generate and standardized TCP log line for use thoughout the tcp subsystem. It is IPv4 and IPv6 aware creates a line in the following format: "TCP: [1.2.3.4]:50332 to [1.2.3.4]:80 tcpflags <RST>" A "\n" is not included at the end. The caller is supposed to add further information after the standard tcp log header. The function returns a NUL terminated string which the caller has to free(s, M_TCPLOG) after use. All memory allocation is done with M_NOWAIT and the return value may be NULL in memory shortage situations. Either struct in_conninfo \|\| (struct tcphdr && (struct ip \|\| struct ip6_hdr) have to be supplied. Due to ip[6].h header inclusion limitations and ordering issues the struct ip and struct ip6_hdr parameters have to be casted and passed as void * pointers. tcp_log_addrs(struct in_conninfo inc, struct tcphdr th, void ip4hdr, void ip6hdr) Usage example: struct ip ip; char tcplog; if (tcplog = tcp_log_addrs(NULL, th, (void *)ip, NULL)) { log(LOG_DEBUG, "%s; %s: Connection attempt to closed port\n", tcplog, __func__); free(s, M_TCPLOG); }	2007-05-18 19:58:37 +00:00
Andre Oppermann	0d957bba48	o Remove unused and redundant TCP option definitions o Replace usage of MAX_TCPOPTLEN with the correctly constructed and derived MAX_TCPOPTLEN	2007-04-20 15:08:09 +00:00
Andre Oppermann	e406f5a1c9	Remove tcp_minmssoverload DoS detection logic. The problem it tried to protect us from wasn't really there and it only bloats the code. Should the problem surface in the future we can simply resurrect it from cvs history.	2007-03-21 18:05:54 +00:00
Andre Oppermann	02a1a64357	Consolidate insertion of TCP options into a segment from within tcp_output() and syncache_respond() into its own generic function tcp_addoptions(). tcp_addoptions() is alignment agnostic and does optimal packing in all cases. In struct tcpopt rename to_requested_s_scale to just to_wscale. Add a comment with quote from RFC1323: "The Window field in a SYN (i.e., a <SYN> or <SYN,ACK>) segment itself is never scaled." Reviewed by: silby, mohans, julian Sponsored by: TCP/IP Optimization Fundraise 2005	2007-03-15 15:59:28 +00:00
Bruce M Simpson	1baaf8347c	Expose smoothed RTT and RTT variance measurements to userland via socket option TCP_INFO. Note that the units used in the original Linux API are in microseconds, so use a 64-bit mantissa to convert FreeBSD's internal measurements from struct tcpcb from ticks.	2007-02-02 18:34:18 +00:00

1 2

83 Commits