freebsd-nq

Author	SHA1	Message	Date
Andre Oppermann	02a1a64357	Consolidate insertion of TCP options into a segment from within tcp_output() and syncache_respond() into its own generic function tcp_addoptions(). tcp_addoptions() is alignment agnostic and does optimal packing in all cases. In struct tcpopt rename to_requested_s_scale to just to_wscale. Add a comment with quote from RFC1323: "The Window field in a SYN (i.e., a <SYN> or <SYN,ACK>) segment itself is never scaled." Reviewed by: silby, mohans, julian Sponsored by: TCP/IP Optimization Fundraise 2005	2007-03-15 15:59:28 +00:00
Mohan Srinivasan	7c72af8770	Reap FIN_WAIT_2 connections marked SOCANTRCVMORE faster. This mitigates potential issues where the peer does not close, potentially leaving thousands of connections in FIN_WAIT_2. This is controlled by a new sysctl fast_finwait2_recycle, which is disabled by default. Reviewed by: gnn, silby.	2007-02-26 22:25:21 +00:00
Andre Oppermann	6741ecf595	Auto sizing TCP socket buffers. Normally the socket buffers are static (either derived from global defaults or set with setsockopt) and do not adapt to real network conditions. Two things happen: a) your socket buffers are too small and you can't reach the full potential of the network between both hosts; b) your socket buffers are too big and you waste a lot of kernel memory for data just sitting around. With automatic TCP send and receive socket buffers we can start with a small buffer and quickly grow it in parallel with the TCP congestion window to match real network conditions. FreeBSD has a default 32K send socket buffer. This supports a maximal transfer rate of only slightly more than 2Mbit/s on a 100ms RTT trans-continental link. Or at 200ms just above 1Mbit/s. With TCP send buffer auto scaling and the default values below it supports 20Mbit/s at 100ms and 10Mbit/s at 200ms. That's an improvement of factor 10, or 1000%. For the receive side it looks slightly better with a default of 64K buffer size. New sysctls are: net.inet.tcp.sendbuf_auto=1 (enabled) net.inet.tcp.sendbuf_inc=8192 (8K, step size) net.inet.tcp.sendbuf_max=262144 (256K, growth limit) net.inet.tcp.recvbuf_auto=1 (enabled) net.inet.tcp.recvbuf_inc=16384 (16K, step size) net.inet.tcp.recvbuf_max=262144 (256K, growth limit) Tested by: many (on HEAD and RELENG_6) Approved by: re MFC after: 1 month	2007-02-01 18:32:13 +00:00
Andre Oppermann	bf6d304ab2	Rewrite of TCP syncookies to remove locking requirements and to enhance functionality: - Remove a rwlock acquisition/release per generated syncookie. Locking is now integrated with the bucket row locking of syncache itself and syncookies no longer add any additional lock overhead. - Syncookie secrets are different for, and stored per, syncache bucket row. Secrets expire after 16 seconds and are reseeded on-demand. - The computational overhead for syncookie generation and verification is one MD5 hash computation as before. - Syncache can be turned off and run with syncookies only by setting the sysctl net.inet.tcp.syncookies_only=1. This implementation extends the original idea and first implementation of FreeBSD by using not only the initial sequence number field to store information but also the timestamp field if present. This way we can keep track of the entire state we need to know to recreate the session in its original form. Almost all TCP speakers implement RFC1323 timestamps these days. For those that do not we still have to live with the known shortcomings of the ISN only SYN cookies. The use of the timestamp field causes the timestamps to be randomized if syncookies are enabled. The idea of SYN cookies is to encode and include all necessary information about the connection setup state within the SYN-ACK we send back and thus to get along without keeping any local state until the ACK to the SYN-ACK arrives (if ever). Everything we need to know should be available from the information we encoded in the SYN-ACK. A detailed description of the inner working of the syncookies mechanism is included in the comments in tcp_syncache.c. Reviewed by: silby (slightly earlier version) Sponsored by: TCP/IP Optimization Fundraise 2005	2006-09-13 13:08:27 +00:00
Ruslan Ermilov	751dea2935	Back when we had T/TCP support, we used to apply different timeouts for TCP and T/TCP connections in the TIME_WAIT state, and we had two separate timed wait queues for them. Now that it has gone, the timeout is always 2*MSL again, and there is no reason to keep two queues (the first was unused anyway!). Also, reimplement the remaining queue using a TAILQ (it was technically impossible before, with two queues).	2006-09-07 13:06:00 +00:00
Andre Oppermann	233dcce118	First step of TSO (TCP segmentation offload) support in our network stack. o add IFCAP_TSO[46] for drivers to announce this capability for IPv4 and IPv6 o add CSUM_TSO flag to mbuf pkthdr csum_flags field o add tso_segsz field to mbuf pkthdr o enhance ip_output() packet length check to allow for large TSO packets o extend tcp_maxmtu[46]() with a flag pointer to pass interface capabilities o adjust all callers of tcp_maxmtu[46]() accordingly Discussed on: -current, -net Sponsored by: TCP/IP Optimization Fundraise 2005	2006-09-06 21:51:59 +00:00
Gleb Smirnoff	2c857a9be9	o Backout rev. 1.125 of in_pcb.c. It appeared to behave extremely badly under high load. For example, with 40k sockets and 25k tcptw entries, the connect() syscall can run for seconds. Debugging showed that it iterates the cycle millions of times and purges thousands of tcptw entries at a time. Besides practical unusability, this change is architecturally wrong. First, in_pcblookup_local() is used in connect() and bind() syscalls. No purging of stale entries should be done here. Second, it is a layering violation. o Return the tcptw purging cycle to tcp_timer_2msl_tw(), which was removed in rev. 1.78 by rwatson. The commit log of this revision tells nothing about the reason the cycle was removed. Now we need this cycle, since the major cleaner of stale tcptw structures is removed. o Disable the probably necessary, but now unused, tcp_twrecycleable() function. Reviewed by: ru	2006-09-06 13:56:35 +00:00
Andre Oppermann	f72167f4d1	Some cleanups and janitorial work to tcp_dooptions(): o redefine the parameter 'is_syn' to 'flags', add TO_SYN flag and adjust its usage accordingly o update the comments to the tcp_dooptions() invocation in tcp_input():after_listen to reflect reality o move the logic checking the echoed timestamp out of tcp_dooptions() to the only place that uses it next to the invocation described in the previous item o adjust parsing of TCPOPT_SACK_PERMITTED to use the same style as the others o add comments in to struct tcpopt.to_flags #defines No functional changes. Sponsored by: TCP/IP Optimization Fundraise 2005	2006-06-26 15:35:25 +00:00
Andre Oppermann	8411d000a1	Move all syncache related structures to tcp_syncache.c. They are only used there. This unbreaks userland programs that include tcp_var.h. Discussed with: rwatson	2006-06-18 12:26:11 +00:00
Andre Oppermann	93f0d0c5bf	Rearrange fields in struct syncache and syncache_head to make them more cache line friendly. Sponsored by: TCP/IP Optimization Fundraise 2005	2006-06-17 17:57:36 +00:00
Andre Oppermann	351630c40d	Add locking to TCP syncache and drop the global tcpinfo lock as early as possible for the syncache_add() case. The syncache timer no longer acquires the tcpinfo lock and timeout/retransmit runs can happen in parallel with bucket granularity. On a P4 the additional locks cause a slight degradation of 0.7% in tcp connections per second. When IP and TCP input are deserialized and can run in parallel this little overhead can be neglected. The syncookie handling still leaves room for improvement and its random salts may be moved to the syncache bucket head structures to remove the second lock operation currently required for it. However this would be a more involved change from the way syncookies work at the moment. Reviewed by: rwatson Tested by: rwatson, ps (earlier version) Sponsored by: TCP/IP Optimization Fundraise 2005	2006-06-17 17:32:38 +00:00
Robert Watson	623dce13c6	Update TCP for infrastructural changes to the socket/pcb refcount model, pru_abort(), pru_detach(), and in_pcbdetach(): - Universally support and enforce the invariant that so_pcb is never NULL, converting dozens of unnecessary NULL checks into assertions, and eliminating dozens of unnecessary error handling cases in protocol code. - In some cases, eliminate unnecessary pcbinfo locking, as it is no longer required to ensure so_pcb != NULL. For example, the receive code no longer requires the pcbinfo lock, and the send code only requires it if building a new connection on an otherwise unconnected socket triggered via sendto() with an address. This should significantly reduce tcbinfo lock contention in the receive and send cases. - In order to support the invariant that so_pcb != NULL, it is now necessary for the TCP code to not discard the tcpcb any time a connection is dropped, but instead leave the tcpcb until the socket is shut down. This case is handled by setting INP_DROPPED, to substitute for using a NULL so_pcb to indicate that the connection has been dropped. This requires the inpcb lock, but not the pcbinfo lock. - Unlike all other protocols in the tree, TCP may need to retain access to the socket after the file descriptor has been closed. Set SS_PROTOREF in tcp_detach() in order to prevent the socket from being freed, and add a flag, INP_SOCKREF, so that the TCP code knows whether or not it needs to free the socket when the connection finally does close. The typical case where this occurs is if close() is called on a TCP socket before all sent data in the send socket buffer has been transmitted or acknowledged. If INP_SOCKREF is found when the connection is dropped, we release the inpcb, tcpcb, and socket instead of flagging INP_DROPPED. - Abort and detach protocol switch methods no longer return failures, nor attempt to free sockets, as the socket layer does this. - Annotate the existence of a long-standing race in the TCP timer code, in which timers are stopped but not drained when the socket is freed, as waiting for drain may lead to deadlocks, or have to occur in a context where waiting is not permitted. This race has been handled by testing to see if the tcpcb pointer in the inpcb is NULL (and vice versa), which is not normally permitted, but may be true if an inpcb and tcpcb have been freed. Add a counter to test how often this race has actually occurred, and a large comment for each instance where we compare potentially freed memory with NULL. This will have to be fixed in the near future, but requires us to further address how to handle the timer shutdown issue. - Several TCP calls no longer potentially free the passed inpcb/tcpcb, so no longer need to return a pointer to indicate whether the argument passed in is still valid. - Un-macroize debugging and locking setup for various protocol switch methods for TCP, as it led to more obscurity, and as locking becomes more customized to the methods, offers less benefit. - Assert copyright on tcp_usrreq.c due to significant modifications that have been made as part of this work. These changes significantly modify the memory management and connection logic of our TCP implementation, and are (as such) High Risk Changes, and likely to contain serious bugs. Please report problems to the current@ mailing list ASAP, ideally with simple test cases, and optionally, packet traces. MFC after: 3 months	2006-04-01 16:36:36 +00:00
Andre Oppermann	464fcfbc5c	Rework TCP window scaling (RFC1323) to properly scale the send window right from the beginning and partly clean up the differences in handling between SYN_SENT and SYN_RCVD (syncache). Further changes to this code to come. This is a first incremental step to a general overhaul and streamlining of the TCP code. PR: kern/15095 PR: kern/92690 (partly) Reviewed by: qingli (and tested with ANVL) Sponsored by: TCP/IP Optimization Fundraise 2005	2006-02-28 23:05:59 +00:00
Andre Oppermann	eaf80179e2	Have TCP Inflight disable itself if the RTT is below a certain threshold. Inflight doesn't make sense on a LAN as it has trouble figuring out the maximal bandwidth because of the coarse tick granularity. The sysctl net.inet.tcp.inflight.rttthresh specifies the threshold in milliseconds below which inflight will disengage. It defaults to 10ms. Tested by: Joao Barros <joao.barros-at-gmail.com>, Rich Murphey <rich-at-whiteoaklabs.com> Sponsored by: TCP/IP Optimization Fundraise 2005	2006-02-16 19:38:07 +00:00
Paul Saab	5a53ca1627	- Postpone SACK option processing until after PAWS checks. SACK option processing is now done in the ACK processing case. - Merge tcp_sack_option() and tcp_del_sackholes() into a new function called tcp_sack_doack(). - Test (SEG.ACK < SND.MAX) before processing the ACK. Submitted by: Noritoshi Demizu Reviewed by: Mohan Srinivasan, Raja Mukerji Approved by: re	2005-06-27 22:27:42 +00:00
Paul Saab	9d17a7a64a	Changes to tcp_sack_option() that - Walks the scoreboard backwards from the tail to reduce the number of comparisons for each sack option received. - Introduce functions to add/remove sack scoreboard elements, making the code more readable. Submitted by: Noritoshi Demizu Reviewed by: Raja Mukerji, Mohan Srinivasan	2005-06-04 08:03:28 +00:00
Paul Saab	808f11b768	This conforms to the terminology in M. Mathis and J. Mahdavi, "Forward Acknowledgement: Refining TCP Congestion Control" SIGCOMM'96, August 1996. Submitted by: Noritoshi Demizu, Raja Mukerji	2005-05-25 17:55:27 +00:00
Paul Saab	2cdbfa66ee	Replace t_force with a t_flag (TF_FORCEDATA). Submitted by: Raja Mukerji. Reviewed by: Mohan, Silby, Andre Oppermann.	2005-05-21 00:38:29 +00:00
Paul Saab	0077b0163f	When looking for the next hole to retransmit from the scoreboard, or to compute the total retransmitted bytes in this sack recovery episode, the scoreboard is traversed. While in sack recovery, this traversal occurs on every call to tcp_output(), every dupack and every partial ack. The scoreboard could potentially get quite large, making this traversal expensive. This change optimizes this by storing hints (for the next hole to retransmit and the total retransmitted bytes in this sack recovery episode) reducing the complexity to find these values from O(n) to constant time. The debug code that sanity checks the hints against the computed value will be removed eventually. Submitted by: Mohan Srinivasan, Noritoshi Demizu, Raja Mukerji.	2005-05-11 21:37:42 +00:00
Paul Saab	a6235da61e	- Make the sack scoreboard logic use the TAILQ macros. This improves code readability and facilitates some anticipated optimizations in tcp_sack_option(). - Remove tcp_print_holes() and TCP_SACK_DEBUG. Submitted by: Raja Mukerji. Reviewed by: Mohan Srinivasan, Noritoshi Demizu.	2005-04-21 20:11:01 +00:00
Andre Oppermann	1600372b6b	Ignore ICMP Source Quench messages for TCP sessions. Source Quench is ineffective, deprecated and can be abused to degrade the performance of active TCP sessions if spoofed. Replace a bogus call to tcp_quench() in tcp_output() with the direct equivalent tcpcb variable assignment. Security: draft-gont-tcpm-icmp-attacks-03.txt Section 7.1 MFC after: 3 days	2005-04-21 12:37:12 +00:00
Paul Saab	25e6f9ed4b	Fix for a TCP SACK bug where more than (win/2) bytes could have been in flight in SACK recovery. Found by: Noritoshi Demizu Submitted by: Mohan Srinivasan <mohans at yahoo-inc dot com> Noritoshi Demizu <demizu at dd dot ij4u dot or dot jp> Raja Mukerji <raja at moselle dot com>	2005-04-14 20:09:52 +00:00
Paul Saab	e891d82b56	Add limits on the number of elements in the sack scoreboard both per-connection and globally. This eliminates potential DoS attacks where SACK scoreboard elements tie up too much memory. Submitted by: Raja Mukerji (raja at moselle dot com). Reviewed by: Mohan Srinivasan (mohans at yahoo-inc dot com).	2005-03-09 23:14:10 +00:00
Paul Saab	7643c37cf2	Remove 2 (SACK) fields from the tcpcb. These are only used by a function that is called from tcp_input(), so they oughta be passed on the stack instead of stuck in the tcpcb. Submitted by: Mohan Srinivasan	2005-02-17 23:04:56 +00:00
Maxim Konovalov	9945c0e21f	o Add handling of an IPv4-mapped IPv6 address. o Use SYSCTL_IN() macro instead of direct call of copyin(9). Submitted by: ume o Move sysctl_drop() implementation to sys/netinet/tcp_subr.c where most of tcp sysctls live. o There are net.inet[6].tcp[6].getcred sysctls already, so there is no need for a separate struct tcp_ident_mapping. Suggested by: ume	2005-02-14 07:37:51 +00:00
Maxim Konovalov	212a79b010	o Implement net.inet.tcp.drop sysctl and userland part, tcpdrop(8) utility: The tcpdrop command drops the TCP connection specified by the local address laddr, port lport and the foreign address faddr, port fport. Obtained from: OpenBSD Reviewed by: rwatson (locking), ru (man page), -current MFC after: 1 month	2005-02-06 10:47:12 +00:00
Warner Losh	c398230b64	/* -> /*- for license, minor formatting changes	2005-01-07 01:45:51 +00:00
Robert Watson	db0aae38b6	Remove the now unused tcp_canceltimers() function. tcpcb timers are now stopped as part of tcp_discardcb(). MFC after: 2 weeks	2004-12-23 01:25:59 +00:00
Andre Oppermann	c94c54e4df	Remove RFC1644 T/TCP support from the TCP side of the network stack. A complete rationale and discussion is given in this message and the resulting discussion: http://docs.freebsd.org/cgi/mid.cgi?4177C8AD.6060706 Note that this commit removes only the functional part of T/TCP from the tcp_* related functions in the kernel. Other features introduced with RFC1644 are left intact (socket layer changes, sendmsg(2) on connection oriented protocols) and are meant to be reused by a simpler and less intrusive reimplementation of the previous T/TCP functionality. Discussed on: -arch	2004-11-02 22:22:22 +00:00
Andre Oppermann	cd109b0d82	Shave 40 unused bytes from struct tcpcb.	2004-10-22 19:55:04 +00:00
Paul Saab	a55db2b6e6	- Estimate the amount of data in flight in sack recovery and use it to control the packets injected while in sack recovery (for both retransmissions and new data). - Cleanups to the sack codepaths in tcp_output.c and tcp_sack.c. - Add a new sysctl (net.inet.tcp.sack.initburst) that controls the number of sack retransmissions done upon initiation of sack recovery. Submitted by: Mohan Srinivasan <mohans@yahoo-inc.com>	2004-10-05 18:36:24 +00:00
Robert Watson	a4f757cd5d	White space cleanup for netinet before branch: - Trailing tab/space cleanup - Remove spurious spaces between or before tabs This change avoids touching files that Andre likely has in his working set for PFIL hooks changes for IPFW/DUMMYNET. Approved by: re (scottl) Submitted by: Xin LI <delphij@frontfree.net>	2004-08-16 18:32:07 +00:00
David Malone	969860f3ed	The tcp syncache code was leaving the IPv6 flowlabel uninitialised for the SYN|ACK packet and then letting in6_pcbconnect set the flowlabel later. Arrange for the syncache/syncookie code to set and recall the flow label so that the flowlabel used for the SYN|ACK is consistent. This is done by using some of the cookie (when tcp cookies are enabled) and by stashing the flowlabel in syncache. Tested and Discovered by: Orla McGann <orly@cnri.dit.ie> Approved by: ume, silby MFC after: 1 month	2004-07-17 19:44:13 +00:00
Bruce M Simpson	37332f049f	Whitespace.	2004-06-25 02:29:58 +00:00
Paul Saab	6d90faf3d8	Add support for TCP Selective Acknowledgements. The work for this originated on RELENG_4 and was ported to -CURRENT. The scoreboarding code was obtained from OpenBSD, and many of the remaining changes were inspired by OpenBSD, but not taken directly from there. You can enable/disable sack using net.inet.tcp.do_sack. You can also limit the number of sack holes that all senders can have in the scoreboard with net.inet.tcp.sackhole_limit. Reviewed by: gnn Obtained from: Yahoo! (Mohan Srinivasan, Jayanth Vijayaraghavan)	2004-06-23 21:04:37 +00:00
Mike Silbersack	80dd2a81fb	Tighten up reset handling in order to make reset attacks as difficult as possible while maintaining compatibility with the widest range of TCP stacks. The algorithm is as follows: --- For connections in the ESTABLISHED state, only resets with sequence numbers exactly matching last_ack_sent will cause a reset, all other segments will be silently dropped. For connections in all other states, a reset anywhere in the window will cause the connection to be reset. All other segments will be silently dropped. --- The necessity of accepting all in-window resets was discovered by jayanth and jlemon, both of whom have seen TCP stacks that will respond to FIN-ACK packets with resets not meeting the strict last_ack_sent check. Idea by: Darren Reed Reviewed by: truckman, jlemon, others(?)	2004-04-26 02:56:31 +00:00
Bruce M Simpson	de9f59f850	Fix a typo in a comment.	2004-04-20 19:04:24 +00:00
Mike Silbersack	c1537ef063	Enhance our RFC1948 implementation to perform better in some pathological TIME_WAIT recycling cases I was able to generate with http testing tools. In short, as the old algorithm relied on ticks to create the time offset component of an ISN, two connections with the exact same host, port pair that were generated between timer ticks would have the exact same sequence number. As a result, the second connection would fail to pass the TIME_WAIT check on the server side, and the SYN would never be acknowledged. I've "fixed" this by adding random positive increments to the time component between clock ticks so that ISNs will always be increasing, no matter how quickly the port is recycled. Except in such contrived benchmarking situations, this problem should never come up in normal usage... until networks get faster. No MFC planned, 4.x is missing other optimizations that are needed to even create the situation in which such quick port recycling will occur.	2004-04-20 06:33:39 +00:00
Warner Losh	f36cfd49ad	Remove advertising clause from University of California Regent's license, per letter dated July 22, 1999 and email from Peter Wemm, Alan Cox and Robert Watson. Approved by: core, peter, alc, rwatson	2004-04-07 20:46:16 +00:00
Robert Watson	a7b6a14aee	Remove now unneeded arguments to tcp_twrespond() -- so and msrc. These were needed by the MAC Framework until inpcbs gained labels. Submitted by: sam	2004-02-28 15:12:20 +00:00
Bruce Evans	0613995bd0	Fixed namespace pollution in rev.1.74. Implementation of the syncache increased <netinet/tcp_var>'s already large set of prerequisites, and this was handled badly. Just don't declare the complete syncache struct unless <netinet/pcb.h> is included before <netinet/tcp_var.h>. Approved by: jlemon (years ago, for a more invasive fix)	2004-02-25 13:03:01 +00:00
Bruce Evans	a545b1dc4d	Don't use the negatively-opaque type uma_zone_t or be chummy with <vm/uma.h>'s idempotency identifier or its misspelling.	2004-02-25 11:53:19 +00:00
Andre Oppermann	12e2e97051	Convert the tcp segment reassembly queue to UMA and limit the maximum amount of segments it will hold. The following tunables and sysctls control the behaviour of the tcp segment reassembly queue: net.inet.tcp.reass.maxsegments (loader tunable) specifies the maximum number of segments all tcp reassembly queues can hold (defaults to 1/16 of nmbclusters). net.inet.tcp.reass.maxqlen specifies the maximum number of segments any individual tcp session queue can hold (defaults to 48). net.inet.tcp.reass.cursegments (readonly) counts the number of segments currently in all reassembly queues. net.inet.tcp.reass.overflows (readonly) counts how often either the global or local queue limit has been reached. Tested by: bms, silby Reviewed by: bms, silby	2004-02-24 15:27:41 +00:00
Bruce M Simpson	265ed01285	Brucification. Submitted by: bde	2004-02-13 18:21:45 +00:00
Bruce M Simpson	b30190b542	Update the prototype for tcpsignature_apply() to reflect the spelling of the types used by m_apply()'s callback function, f, as documented in mbuf(9). Noticed by: njl	2004-02-12 20:16:09 +00:00
Bruce M Simpson	1cfd4b5326	Initial import of RFC 2385 (TCP-MD5) digest support. This is the first of two commits; bringing in the kernel support first. This can be enabled by compiling a kernel with options TCP_SIGNATURE and FAST_IPSEC. For the uninitiated, this is a TCP option which provides for a means of authenticating TCP sessions which came into being before IPSEC. It is still relevant today, however, as it is used by many commercial router vendors, particularly with BGP, and as such has become a requirement for interconnect at many major Internet points of presence. Several parts of the TCP and IP headers, including the segment payload, are digested with MD5, including a shared secret. The PF_KEY interface is used to manage the secrets using security associations in the SADB. There is a limitation here in that as there is no way to map a TCP flow per-port back to an SPI without polluting tcpcb or using the SPD; the code to do the latter is unstable at this time. Therefore this code only supports per-host keying granularity. Whilst FAST_IPSEC is mutually exclusive with KAME IPSEC (and thus IPv6), TCP_SIGNATURE applies only to IPv4. For the vast majority of prospective users of this feature, this will not pose any problem. This implementation is output-only; that is, the option is honoured when responding to a host initiating a TCP session, but no effort is made [yet] to authenticate inbound traffic. This is, however, sufficient to interwork with Cisco equipment. Tested with a Cisco 2501 running IOS 12.0(27), and Quagga 0.96.4 with local patches. Patches for tcpdump to validate TCP-MD5 sessions are also available from me upon request. Sponsored by: sentex.net	2004-02-11 04:26:04 +00:00
Andre Oppermann	53369ac9bb	Limiters and sanity checks for TCP MSS (maximum segment size) resource exhaustion attacks. For network link optimization TCP can adjust its MSS and thus packet size according to the observed path MTU. This is done dynamically based on feedback from the remote host and network components along the packet path. This information can be abused to pretend an extremely low path MTU. The resource exhaustion works in two ways: o during tcp connection setup the advertised local MSS is exchanged between the endpoints. The remote endpoint can set this arbitrarily low (except for a minimum MTU of 64 octets enforced in the BSD code). When the local host is sending data it is forced to send many small IP packets instead of a large one. For example instead of the normal TCP payload size of 1448 it forces TCP payload size of 12 (MTU 64) and thus we have a 120 times increase in workload and packets. On fast links this quickly saturates the local CPU and may also hit pps processing limits of network components along the path. This type of attack is particularly effective for servers where the attacker can download large files (WWW and FTP). We mitigate it by enforcing a minimum MTU settable by sysctl net.inet.tcp.minmss defaulting to 256 octets. o the local host is receiving data on a TCP connection from the remote host. The local host has no control over the packet size the remote host is sending. The remote host may choose to do what is described in the first attack and send the data in packets with a TCP payload of at least one byte. For each packet the tcp_input() function will be entered, the packet is processed and a sowakeup() is signalled to the connected process. For example an attack with 2 Mbit/s gives 4716 packets per second and the same amount of sowakeup()s to the process (and context switches). This type of attack is particularly effective for servers where the attacker can upload large amounts of data. Normally this is the case with WWW servers where large POSTs can be made. We mitigate this by calculating the average MSS payload per second. If it goes below 'net.inet.tcp.minmss' and the pps rate is above 'net.inet.tcp.minmssoverload' defaulting to 1000 this particular TCP connection is reset and dropped. MITRE CVE: CAN-2004-0002 Reviewed by: sam (mentor) MFC after: 1 day	2004-01-08 17:40:07 +00:00
Andre Oppermann	97d8d152c2	Introduce tcp_hostcache and remove the tcp specific metrics from the routing table. Move all usage and references in the tcp stack from the routing table metrics to the tcp hostcache. It caches measured parameters of past tcp sessions to provide better initial start values for following connections from or to the same source or destination. Depending on the network parameters to/from the remote host this can lead to significant speedups for new tcp connections after the first one because they inherit and shortcut the learning curve. tcp_hostcache is designed for multiple concurrent access in SMP environments with high contention and is hash indexed by remote ip address. It removes significant locking requirements from the tcp stack with regard to the routing table. Reviewed by: sam (mentor), bms Reviewed by: -net, -current, core@kame.net (IPv6 parts) Approved by: re (scottl)	2003-11-20 20:07:39 +00:00
Mike Silbersack	4bd4fa3fe6	Add an additional check to the tcp_twrecycleable function; I had previously only considered the send sequence space. Unfortunately, some OSes (windows) still use a random positive increments scheme for their syn-ack ISNs, so I must consider receive sequence space as well. The value of 250000 bytes / second for Microsoft's ISN rate of increase was determined by testing with an XP machine.	2003-11-02 07:47:03 +00:00
Mike Silbersack	96af9ea52b	- Add a new function tcp_twrecycleable, which tells us if the ISN which we will generate for a given ip/port tuple has advanced far enough for the time_wait socket in question to be safely recycled. - Have in_pcblookup_local use tcp_twrecycleable to determine if time_wait sockets which are hogging local ports can be safely freed. This change preserves proper TIME_WAIT behavior under normal circumstances while allowing for safe and fast recycling whenever ephemeral port space is scarce.	2003-11-01 07:30:08 +00:00
Jeffrey Hsu	9d11646de7	Unify the "send high" and "recover" variables as specified in the latest rev of the spec. Use an explicit flag for Fast Recovery. [1] Fix bug with exiting Fast Recovery on a retransmit timeout diagnosed by Lu Guohan. [2] Reviewed by: Thomas Henderson <thomas.r.henderson@boeing.com> Reported and tested by: Lu Guohan <lguohan00@mails.tsinghua.edu.cn> [2] Approved by: Thomas Henderson <thomas.r.henderson@boeing.com>, Sally Floyd <floyd@acm.org> [1]	2003-07-15 21:49:53 +00:00
Robert Watson	430c635447	Correct a bug introduced with reduced TCP state handling; make sure that the MAC label on TCP responses during TIMEWAIT is properly set from either the socket (if available), or the mbuf that it's responding to. Unfortunately, this is made somewhat difficult by the TCP code, as tcp_twstart() calls tcp_twrespond() after discarding the socket but without a reference to the mbuf that causes the "response". Passing both the socket and the mbuf works around this. Eventually it might be good to make sure the mbuf always gets passed in in "response" scenarios, but working through this proved to complicate things too much. Approved by: re (scottl) Reviewed by: hsu Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories	2003-05-07 05:26:27 +00:00
Jeffrey Hsu	48d2549c3e	Observe conservation of packets when entering Fast Recovery while doing Limited Transmit. Only artificially inflate the congestion window by 1 segment instead of the usual 3 to take into account the 2 already sent by Limited Transmit. Approved in principle by: Mark Allman <mallman@grc.nasa.gov>, Hari Balakrishnan <hari@nms.lcs.mit.edu>, Sally Floyd <floyd@icir.org>	2003-04-01 21:16:46 +00:00
Jonathan Lemon	607b0b0cc9	Remove a panic(); if the zone allocator can't provide more timewait structures, reuse the oldest one. Also move the expiry timer from a per-structure callout to the tcp slow timer. Sponsored by: DARPA, NAI Labs	2003-03-08 22:06:20 +00:00
Jonathan Lemon	340c35de6a	Add a TCP TIMEWAIT state which uses less space than a fullblown TCP control block. Allow the socket and tcpcb structures to be freed earlier than inpcb. Update code to understand an inp w/o a socket. Reviewed by: hsu, silby, jayanth Sponsored by: DARPA, NAI Labs	2003-02-19 22:32:43 +00:00
Jonathan Lemon	7990938421	Convert tcp_fillheaders(tp, ...) -> tcpip_fillheaders(inp, ...) so the routine does not require a tcpcb to operate. Since we no longer keep template mbufs around, move pseudo checksum out of this routine, and merge it with the length update. Sponsored by: DARPA, NAI Labs	2003-02-19 22:18:06 +00:00
Jeffrey Hsu	cb942153c8	Fix NewReno. Reviewed by: Tom Henderson <thomas.r.henderson@boeing.com>	2003-01-13 11:01:20 +00:00
Matthew Dillon	1fcc99b5de	Implement TCP bandwidth delay product window limiting, similar to (but not meant to duplicate) TCP/Vegas. Add four sysctls and default the implementation to 'off'. net.inet.tcp.inflight_enable enable algorithm (defaults to 0=off) net.inet.tcp.inflight_debug debugging (defaults to 1=on) net.inet.tcp.inflight_min minimum window limit net.inet.tcp.inflight_max maximum window limit MFC after: 1 week	2002-08-17 18:26:02 +00:00
Matthew Dillon	d65bf08af3	Add the tcps_sndrexmitbad statistic, keep track of late acks that caused unnecessary retransmissions.	2002-07-19 18:29:38 +00:00
Jeffrey Hsu	3ce144ea88	Notify functions can destroy the pcb, so they have to return an indication of whether this happened so the calling function knows whether or not to unlock the pcb. Submitted by: Jennifer Yang (yangjihui@yahoo.com) Bug reported by: Sid Carter (sidcarter@symonds.net)	2002-06-14 08:35:21 +00:00
Mike Silbersack	eb5afeba22	Re-commit w/fix: Ensure that the syn cache's syn-ack packets contain the same ip_tos, ip_ttl, and DF bits as all other tcp packets. PR: 39141 MFC after: 2 weeks This time, make sure that ipv4 specific code (aka all of the above) is only run in the ipv4 case.	2002-06-14 03:08:05 +00:00
Mike Silbersack	70d2b17029	Back out ip_tos/ip_ttl/DF "fix", it just panic'd my box. :) Pointy-hat to: silby	2002-06-14 02:43:20 +00:00
Mike Silbersack	21c3b2fc69	Ensure that the syn cache's syn-ack packets contain the same ip_tos, ip_ttl, and DF bits as all other tcp packets. PR: 39141 MFC after: 2 weeks	2002-06-14 02:36:34 +00:00
Jeffrey Hsu	f76fcf6d4c	Lock up inpcb. Submitted by: Jennifer Yang <yangjihui@yahoo.com>	2002-06-10 20:05:46 +00:00
Alfred Perlstein	4d77a549fe	Remove __P.	2002-03-19 21:25:46 +00:00
Matthew Dillon	262c1c1a4e	Fix a bug with transmitter restart after receiving a 0 window. The receiver was not sending an immediate ack with delayed acks turned on when the input buffer is drained, preventing the transmitter from restarting immediately. Propagate the TCP_NODELAY option to accept()ed sockets. (Helps tbench and is a good idea anyway). Some cleanup. Identify additional issues in comments. MFC after: 1 day	2001-12-02 08:49:29 +00:00
Jonathan Lemon	be2ac88c59	Introduce a syncache, which enables FreeBSD to withstand a SYN flood DoS in an improved fashion over the existing code. Reviewed by: silby (in a previous iteration) Sponsored by: DARPA, NAI Labs	2001-11-22 04:50:44 +00:00
Jayanth Vijayaraghavan	c24d5dae7a	Add a flag TF_LASTIDLE, that forces a previously idle connection to send all its data, especially when the data is less than one MSS. This fixes an issue where the stack was delaying the sending of data, even though there was enough window to send all the data and the sending of data was emptying the socket buffer. Problem found by Yoshihiro Tsuchiya (tsuchiya@flab.fujitsu.co.jp) Submitted by: Jayanth Vijayaraghavan	2001-10-05 21:33:38 +00:00
Julian Elischer	f0ffb944d2	Patches from Keiichi SHIMA <keiichi@iij.ad.jp> to make ip use the standard protosw structure again. Obtained from: Well, KAME I guess.	2001-09-03 20:03:55 +00:00
Mike Silbersack	b0e3ad758b	Much delayed but now present: RFC 1948 style sequence numbers. In order to ensure security and functionality, RFC 1948 style initial sequence number generation has been implemented. Barring any major cryptographic breakthroughs, this algorithm should be unbreakable. In addition, the problems with TIME_WAIT recycling which affect our currently used algorithm are not present. Reviewed by: jesper	2001-08-22 00:58:16 +00:00
Mike Silbersack	2d610a5028	Temporary feature: Runtime tuneable tcp initial sequence number generation scheme. Users may now select between the currently used OpenBSD algorithm and the older random positive increment method. While the OpenBSD algorithm is more secure, it also breaks TIME_WAIT handling; this is causing trouble for an increasing number of folks. To switch between generation schemes, one sets the sysctl net.inet.tcp.tcp_seq_genscheme. 0 = random positive increments, 1 = the OpenBSD algorithm. 1 is still the default. Once a secure _and_ compatible algorithm is implemented, this sysctl will be removed. Reviewed by: jlemon Tested by: numerous subscribers of -net	2001-07-08 02:20:47 +00:00
Mike Silbersack	08517d530e	Eliminate the allocation of a tcp template structure for each connection. The information contained in a tcptemp can be reconstructed from a tcpcb when needed. Previously, tcp templates required the allocation of one mbuf per connection. On large systems, this change should free up a large number of mbufs. Reviewed by: bmilekic, jlemon, ru MFC after: 2 weeks	2001-06-23 03:21:46 +00:00
Kris Kennaway	f0a04f3f51	Randomize the TCP initial sequence numbers more thoroughly. Obtained from: OpenBSD Reviewed by: jesper, peter, -developers	2001-04-17 18:08:01 +00:00
Jonathan Lemon	c693a045de	Remove in_pcbnotify and use in_pcblookup_hash to find the cb directly. For TCP, verify that the sequence number in the ICMP packet falls within the tcp receive window before performing any actions indicated by the icmp packet. Clean up some layering violations (access to tcp internals from in_pcb)	2001-02-26 21:19:47 +00:00
Jesper Skriver	694a9ff95b	Remove tcp_drop_all_states, which is unneeded after jlemon removed it from tcp_subr.c in rev 1.92	2001-02-25 17:20:19 +00:00
Poul-Henning Kamp	90fcbbd635	Remove unneeded loop increment in src/sys/netinet/in_pcb.c:in_pcbnotify Add new PRC_UNREACH_ADMIN_PROHIB in sys/sys/protosw.h Remove condition on TCP in src/sys/netinet/ip_icmp.c:icmp_input In src/sys/netinet/ip_icmp.c:icmp_input set code = PRC_UNREACH_ADMIN_PROHIB or PRC_UNREACH_HOST for all unreachables except ICMP_UNREACH_NEEDFRAG Rename sysctl icmp_admin_prohib_like_rst to icmp_unreach_like_rst to reflect the fact that we also react on ICMP unreachables that are not administrative prohibited. Also update the comments to reflect this. In sys/netinet/tcp_subr.c:tcp_ctlinput add code to treat PRC_UNREACH_ADMIN_PROHIB and PRC_UNREACH_HOST different. PR: 23986 Submitted by: Jesper Skriver <jesper@skriver.dk>	2001-02-18 09:34:55 +00:00
Poul-Henning Kamp	442fad6798	Update the "icmp_admin_prohib_like_rst" code to check the tcp-window and to be configurable with respect to acting only in SYN or in all TCP states. PR: 23665 Submitted by: Jesper Skriver <jesper@skriver.dk>	2000-12-24 10:57:21 +00:00
Poul-Henning Kamp	b11d7a4a2f	We currently do not react to ICMP administratively prohibited messages sent by routers when they deny our traffic; this causes a timeout when trying to connect to TCP ports/services on a remote host which is blocked by routers or firewalls. rfc1122 (Requirements for Internet Hosts) section 3.2.2.1 actually requires that we treat such a message for a TCP session as if we had received a RST. quote begin. A Destination Unreachable message that is received MUST be reported to the transport layer. The transport layer SHOULD use the information appropriately; for example, see Sections 4.1.3.3, 4.2.3.9, and 4.2.4 below. A transport protocol that has its own mechanism for notifying the sender that a port is unreachable (e.g., TCP, which sends RST segments) MUST nevertheless accept an ICMP Port Unreachable for the same purpose. quote end. I've written a small extension that implements this; it also creates a sysctl "net.inet.tcp.icmp_admin_prohib_like_rst" to control if this new behaviour is activated. When it's activated (set to 1) we'll treat an ICMP administratively prohibited message (icmp type 3 code 9, 10 and 13) for a TCP session as if we received a TCP RST, but only if the TCP session is in SYN_SENT state. The reason for only reacting when in SYN_SENT state is that this will solve the problem and at the same time minimize the risk of this being abused. I suggest that we enable this new behaviour by default, but it would be a change of current behaviour, so if people prefer to leave it disabled by default, at least for now, this would be ok for me; the attached diff actually has the sysctl set to 0 by default. PR: 23086 Submitted by: Jesper Skriver <jesper@skriver.dk>	2000-12-16 19:42:06 +00:00
Jayanth Vijayaraghavan	e7f3269307	When a connection is being dropped due to a listen queue overflow, delete the cloned route that is associated with the connection. This does not exhaust the routing table memory when the system is under a SYN flood attack. The route entry is not deleted if there is any prior information cached in it. Reviewed by: Peter Wemm,asmodai	2000-07-21 23:26:37 +00:00
Sheldon Hearn	571214d4fe	Fix a comment which was broken in rev 1.36. PR: 19947 Submitted by: Tetsuya Isaki <isaki@net.ipc.hiroshima-u.ac.jp>	2000-07-18 16:43:29 +00:00
Jake Burkholder	e39756439c	Back out the previous change to the queue(3) interface. It was not discussed and should probably not happen. Requested by: msmith and others	2000-05-26 02:09:24 +00:00
Jake Burkholder	740a1973a6	Change the way that the queue(3) structures are declared; don't assume that the type argument to _HEAD and _ENTRY is a struct. Suggested by: phk Reviewed by: phk Approved by: mdodd	2000-05-23 20:41:01 +00:00
Jonathan Lemon	46f5848237	Implement TCP NewReno, as documented in RFC 2582. This allows better recovery for multiple packet losses in a single window. The algorithm can be toggled via the sysctl net.inet.tcp.newreno, which defaults to "on". Submitted by: Jayanth Vijayaraghavan <jayanth@yahoo-inc.com>	2000-05-06 03:31:09 +00:00
Yoshinobu Inoue	fb59c426ff	tcp updates to support IPv6. also a small patch to sys/nfs/nfs_socket.c, as max_hdr size change. Reviewed by: freebsd-arch, cvs-committers Obtained from: KAME project	2000-01-09 19:17:30 +00:00
Peter Wemm	664a31e496	Change #ifdef KERNEL to #ifdef _KERNEL in the public headers. "KERNEL" is an application space macro and the applications are supposed to be free to use it as they please (but cannot). This is consistent with the other BSD's who made this change quite some time ago. More commits to come.	1999-12-29 04:46:21 +00:00
Yoshinobu Inoue	6a800098cc	IPSEC support in the kernel. pr_input() routines prototype is also changed to support IPSEC and IPV6 chained protocol headers. Reviewed by: freebsd-arch, cvs-committers Obtained from: KAME project	1999-12-22 19:13:38 +00:00
Yoshinobu Inoue	76429de41a	KAME related header files additions and merges. (only those which don't affect c source files so much) Reviewed by: cvs-committers Obtained from: KAME project	1999-11-05 14:41:39 +00:00
Jonathan Lemon	9b8b58e033	Restructure TCP timeout handling: - eliminate the fast/slow timeout lists for TCP and instead use a callout entry for each timer. - increase the TCP timer granularity to HZ - implement "bad retransmit" recovery, as presented in "On Estimating End-to-End Network Path Properties", by Allman and Paxson. Submitted by: jlemon, wollmann	1999-08-30 21:17:07 +00:00
Peter Wemm	c3aac50f28	$Id$ -> $FreeBSD$	1999-08-28 01:08:13 +00:00
Doug Rabson	ce02431ffa	* Change sysctl from using linker_set to construct its tree using SLISTs. This makes it possible to change the sysctl tree at runtime. * Change KLD to find and register any sysctl nodes contained in the loaded file and to unregister them when the file is unloaded. Reviewed by: Archie Cobbs <archie@whistle.com>, Peter Wemm <peter@netplex.com.au> (well they looked at it anyway)	1999-02-16 10:49:55 +00:00
Bill Fenner	b0acefa8d4	Add a flag, passed to pru_send routines, PRUS_MORETOCOME. This flag means that there is more data to be put into the socket buffer. Use it in TCP to reduce the interaction between mbuf sizes and the Nagle algorithm. Based on: "Justin C. Walker" <justin@apple.com>'s description of Apple's fix for this problem.	1999-01-20 17:32:01 +00:00
Doug Rabson	6effc71332	Re-implement tcp and ip fragment reassembly to not store pointers in the ip header which can't work on alpha since pointers are too big. Reviewed by: Garrett Wollman <wollman@khavrinen.lcs.mit.edu>	1998-08-24 07:47:39 +00:00
Garrett Wollman	cfe8b629f1	Yow! Completely change the way socket options are handled, eliminating another specialized mbuf type in the process. Also clean up some of the cruft surrounding IPFW, multicast routing, RSVP, and other ill-explored corners.	1998-08-23 03:07:17 +00:00
Bruce Evans	07a4df4fee	Declare tcp_seq and tcp_cc as fixed-size types. Half fixed type mismatches exposed by this (the prototype for tcp_respond() didn't match the function definition lexically, and still depends on a gcc feature to match if ints have more than 32 bits).	1998-07-13 11:09:52 +00:00
John Hay	a910fdcb88	Only make struct xtcpcb visible if _NETINET_IN_PCB_H_ and _SYS_SOCKETVAR_H_ are defined. Reviewed by: bde	1998-06-27 07:30:45 +00:00
Garrett Wollman	98271db4d5	Convert socket structures to be type-stable and add a version number. Define a parameter which indicates the maximum number of sockets in a system, and use this to size the zone allocators used for sockets and for certain PCBs. Convert PF_LOCAL PCB structures to be type-stable and add a version number. Define an external format for information about socket structures and use it in several places. Define a mechanism to get all PF_LOCAL and PF_INET PCB lists through sysctl(3) without blocking network interrupts for an unreasonable length of time. This probably still has some bugs and/or race conditions, but it seems to work well enough on my machines. It is now possible for `netstat' to get almost all of its information via the sysctl(3) interface rather than reading kmem (changes to follow).	1998-05-15 20:11:40 +00:00
David Greenman	552b7df4c1	Ensure that TCP_REXMTVAL doesn't return a value less than t_rttmin. This is believed to have been broken with the Brakmo/Peterson srtt calculation changes. The result of this bug is that TCP connections could time out extremely quickly (in 12 seconds). Also backed out jdp's partial fix for this problem in rev 1.17 of tcp_timer.c as it is obsoleted by this commit. Bug was pointed out by Kevin Lehey <kml@roller.nas.nasa.gov>. PR: 6068	1998-04-24 09:25:39 +00:00
Poul-Henning Kamp	8e5db87cdb	Remove the last traces of TUBA. Inspired by: PR kern/3317	1998-04-06 06:52:47 +00:00
David Greenman	f498eeeead	Changes to support the addition of a new sysctl variable: net.inet.tcp.delack_enabled Which defaults to 1 and can be set to 0 to disable TCP delayed-ack processing (i.e. all acks are immediate).	1998-02-26 05:25:39 +00:00
David Greenman	c3229e05a3	Improved connection establishment performance by doing local port lookups via a hashed port list. In the new scheme, in_pcblookup() goes away and is replaced by a new routine, in_pcblookup_local() for doing the local port check. Note that this implementation is space inefficient in that the PCB struct is now too large to fit into 128 bytes. I might deal with this in the future by using the new zone allocator, but I wanted these changes to be extensively tested in their current form first. Also: 1) Fixed off-by-one errors in the port lookup loops in in_pcbbind(). 2) Got rid of some unneeded rehashing. Adding a new routine, in_pcbinshash() to do the initial hash insertion. 3) Renamed in_pcblookuphash() to in_pcblookup_hash() for easier readability. 4) Added a new routine, in_pcbremlists() to remove the PCB from the various hash lists. 5) Added/deleted comments where appropriate. 6) Removed unnecessary splnet() locking. In general, the PCB functions should be called at splnet()...there are unfortunately a few exceptions, however. 7) Reorganized a few structs for better cache line behavior. 8) Killed my TCP_ACK_HACK kludge. It may come back in a different form in the future, however. These changes have been tested on wcarchive for more than a month. In tests done here, connection establishment overhead is reduced by more than 50 times, thus getting rid of one of the major networking scalability problems. Still to do: make tcp_fastimo/tcp_slowtimo scale well for systems with a large number of connections. tcp_fastimo is easy; tcp_slowtimo is difficult. WARNING: Anything that knows about inpcb and tcpcb structs will have to be recompiled; at the very least, this includes netstat(1).	1998-01-27 09:15:13 +00:00


189 Commits