freebsd-nq

Author	SHA1	Message	Date
Mike Silbersack	80dd2a81fb	Tighten up reset handling in order to make reset attacks as difficult as possible while maintaining compatibility with the widest range of TCP stacks. The algorithm is as follows: --- For connections in the ESTABLISHED state, only resets with sequence numbers exactly matching last_ack_sent will cause a reset, all other segments will be silently dropped. For connections in all other states, a reset anywhere in the window will cause the connection to be reset. All other segments will be silently dropped. --- The necessity of accepting all in-window resets was discovered by jayanth and jlemon, both of whom have seen TCP stacks that will respond to FIN-ACK packets with resets not meeting the strict last_ack_sent check. Idea by: Darren Reed Reviewed by: truckman, jlemon, others(?)	2004-04-26 02:56:31 +00:00
Bruce M Simpson	de9f59f850	Fix a typo in a comment.	2004-04-20 19:04:24 +00:00
Mike Silbersack	c1537ef063	Enhance our RFC1948 implementation to perform better in some pathlogical TIME_WAIT recycling cases I was able to generate with http testing tools. In short, as the old algorithm relied on ticks to create the time offset component of an ISN, two connections with the exact same host, port pair that were generated between timer ticks would have the exact same sequence number. As a result, the second connection would fail to pass the TIME_WAIT check on the server side, and the SYN would never be acknowledged. I've "fixed" this by adding random positive increments to the time component between clock ticks so that ISNs will always be increasing, no matter how quickly the port is recycled. Except in such contrived benchmarking situations, this problem should never come up in normal usage... until networks get faster. No MFC planned, 4.x is missing other optimizations that are needed to even create the situation in which such quick port recycling will occur.	2004-04-20 06:33:39 +00:00
Warner Losh	f36cfd49ad	Remove advertising clause from University of California Regent's license, per letter dated July 22, 1999 and email from Peter Wemm, Alan Cox and Robert Watson. Approved by: core, peter, alc, rwatson	2004-04-07 20:46:16 +00:00
Robert Watson	a7b6a14aee	Remove now unneeded arguments to tcp_twrespond() -- so and msrc. These were needed by the MAC Framework until inpcbs gained labels. Submitted by: sam	2004-02-28 15:12:20 +00:00
Bruce Evans	0613995bd0	Fixed namespace pollution in rev.1.74. Implementation of the syncache increased <netinet/tcp_var>'s already large set of prerequisites, and this was handled badly. Just don't declare the complete syncache struct unless <netinet/pcb.h> is included before <netinet/tcp_var.h>. Approved by: jlemon (years ago, for a more invasive fix)	2004-02-25 13:03:01 +00:00
Bruce Evans	a545b1dc4d	Don't use the negatively-opaque type uma_zone_t or be chummy with <vm/uma.h>'s idempotency indentifier or its misspelling.	2004-02-25 11:53:19 +00:00
Andre Oppermann	12e2e97051	Convert the tcp segment reassembly queue to UMA and limit the maximum amount of segments it will hold. The following tuneables and sysctls control the behaviour of the tcp segment reassembly queue: net.inet.tcp.reass.maxsegments (loader tuneable) specifies the maximum number of segments all tcp reassemly queues can hold (defaults to 1/16 of nmbclusters). net.inet.tcp.reass.maxqlen specifies the maximum number of segments any individual tcp session queue can hold (defaults to 48). net.inet.tcp.reass.cursegments (readonly) counts the number of segments currently in all reassembly queues. net.inet.tcp.reass.overflows (readonly) counts how often either the global or local queue limit has been reached. Tested by: bms, silby Reviewed by: bms, silby	2004-02-24 15:27:41 +00:00
Bruce M Simpson	265ed01285	Brucification. Submitted by: bde	2004-02-13 18:21:45 +00:00
Bruce M Simpson	b30190b542	Update the prototype for tcpsignature_apply() to reflect the spelling of the types used by m_apply()'s callback function, f, as documented in mbuf(9). Noticed by: njl	2004-02-12 20:16:09 +00:00
Bruce M Simpson	1cfd4b5326	Initial import of RFC 2385 (TCP-MD5) digest support. This is the first of two commits; bringing in the kernel support first. This can be enabled by compiling a kernel with options TCP_SIGNATURE and FAST_IPSEC. For the uninitiated, this is a TCP option which provides for a means of authenticating TCP sessions which came into being before IPSEC. It is still relevant today, however, as it is used by many commercial router vendors, particularly with BGP, and as such has become a requirement for interconnect at many major Internet points of presence. Several parts of the TCP and IP headers, including the segment payload, are digested with MD5, including a shared secret. The PF_KEY interface is used to manage the secrets using security associations in the SADB. There is a limitation here in that as there is no way to map a TCP flow per-port back to an SPI without polluting tcpcb or using the SPD; the code to do the latter is unstable at this time. Therefore this code only supports per-host keying granularity. Whilst FAST_IPSEC is mutually exclusive with KAME IPSEC (and thus IPv6), TCP_SIGNATURE applies only to IPv4. For the vast majority of prospective users of this feature, this will not pose any problem. This implementation is output-only; that is, the option is honoured when responding to a host initiating a TCP session, but no effort is made [yet] to authenticate inbound traffic. This is, however, sufficient to interwork with Cisco equipment. Tested with a Cisco 2501 running IOS 12.0(27), and Quagga 0.96.4 with local patches. Patches for tcpdump to validate TCP-MD5 sessions are also available from me upon request. Sponsored by: sentex.net	2004-02-11 04:26:04 +00:00
Andre Oppermann	53369ac9bb	Limiters and sanity checks for TCP MSS (maximum segement size) resource exhaustion attacks. For network link optimization TCP can adjust its MSS and thus packet size according to the observed path MTU. This is done dynamically based on feedback from the remote host and network components along the packet path. This information can be abused to pretend an extremely low path MTU. The resource exhaustion works in two ways: o during tcp connection setup the advertized local MSS is exchanged between the endpoints. The remote endpoint can set this arbitrarily low (except for a minimum MTU of 64 octets enforced in the BSD code). When the local host is sending data it is forced to send many small IP packets instead of a large one. For example instead of the normal TCP payload size of 1448 it forces TCP payload size of 12 (MTU 64) and thus we have a 120 times increase in workload and packets. On fast links this quickly saturates the local CPU and may also hit pps processing limites of network components along the path. This type of attack is particularly effective for servers where the attacker can download large files (WWW and FTP). We mitigate it by enforcing a minimum MTU settable by sysctl net.inet.tcp.minmss defaulting to 256 octets. o the local host is reveiving data on a TCP connection from the remote host. The local host has no control over the packet size the remote host is sending. The remote host may chose to do what is described in the first attack and send the data in packets with an TCP payload of at least one byte. For each packet the tcp_input() function will be entered, the packet is processed and a sowakeup() is signalled to the connected process. For example an attack with 2 Mbit/s gives 4716 packets per second and the same amount of sowakeup()s to the process (and context switches). This type of attack is particularly effective for servers where the attacker can upload large amounts of data. Normally this is the case with WWW server where large POSTs can be made. We mitigate this by calculating the average MSS payload per second. If it goes below 'net.inet.tcp.minmss' and the pps rate is above 'net.inet.tcp.minmssoverload' defaulting to 1000 this particular TCP connection is resetted and dropped. MITRE CVE: CAN-2004-0002 Reviewed by: sam (mentor) MFC after: 1 day	2004-01-08 17:40:07 +00:00
Andre Oppermann	97d8d152c2	Introduce tcp_hostcache and remove the tcp specific metrics from the routing table. Move all usage and references in the tcp stack from the routing table metrics to the tcp hostcache. It caches measured parameters of past tcp sessions to provide better initial start values for following connections from or to the same source or destination. Depending on the network parameters to/from the remote host this can lead to significant speedups for new tcp connections after the first one because they inherit and shortcut the learning curve. tcp_hostcache is designed for multiple concurrent access in SMP environments with high contention and is hash indexed by remote ip address. It removes significant locking requirements from the tcp stack with regard to the routing table. Reviewed by: sam (mentor), bms Reviewed by: -net, -current, core@kame.net (IPv6 parts) Approved by: re (scottl)	2003-11-20 20:07:39 +00:00
Mike Silbersack	4bd4fa3fe6	Add an additional check to the tcp_twrecycleable function; I had previously only considered the send sequence space. Unfortunately, some OSes (windows) still use a random positive increments scheme for their syn-ack ISNs, so I must consider receive sequence space as well. The value of 250000 bytes / second for Microsoft's ISN rate of increase was determined by testing with an XP machine.	2003-11-02 07:47:03 +00:00
Mike Silbersack	96af9ea52b	- Add a new function tcp_twrecycleable, which tells us if the ISN which we will generate for a given ip/port tuple has advanced far enough for the time_wait socket in question to be safely recycled. - Have in_pcblookup_local use tcp_twrecycleable to determine if time_Wait sockets which are hogging local ports can be safely freed. This change preserves proper TIME_WAIT behavior under normal circumstances while allowing for safe and fast recycling whenever ephemeral port space is scarce.	2003-11-01 07:30:08 +00:00
Jeffrey Hsu	9d11646de7	Unify the "send high" and "recover" variables as specified in the lastest rev of the spec. Use an explicit flag for Fast Recovery. [1] Fix bug with exiting Fast Recovery on a retransmit timeout diagnosed by Lu Guohan. [2] Reviewed by: Thomas Henderson <thomas.r.henderson@boeing.com> Reported and tested by: Lu Guohan <lguohan00@mails.tsinghua.edu.cn> [2] Approved by: Thomas Henderson <thomas.r.henderson@boeing.com>, Sally Floyd <floyd@acm.org> [1]	2003-07-15 21:49:53 +00:00
Robert Watson	430c635447	Correct a bug introduced with reduced TCP state handling; make sure that the MAC label on TCP responses during TIMEWAIT is properly set from either the socket (if available), or the mbuf that it's responding to. Unfortunately, this is made somewhat difficult by the TCP code, as tcp_twstart() calls tcp_twrespond() after discarding the socket but without a reference to the mbuf that causes the "response". Passing both the socket and the mbuf works arounds this--eventually it might be good to make sure the mbuf always gets passed in in "response" scenarios but working through this provided to complicate things too much. Approved by: re (scottl) Reviewed by: hsu Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories	2003-05-07 05:26:27 +00:00
Jeffrey Hsu	48d2549c3e	Observe conservation of packets when entering Fast Recovery while doing Limited Transmit. Only artificially inflate the congestion window by 1 segment instead of the usual 3 to take into account the 2 already sent by Limited Transmit. Approved in principle by: Mark Allman <mallman@grc.nasa.gov>, Hari Balakrishnan <hari@nms.lcs.mit.edu>, Sally Floyd <floyd@icir.org>	2003-04-01 21:16:46 +00:00
Jonathan Lemon	607b0b0cc9	Remove a panic(); if the zone allocator can't provide more timewait structures, reuse the oldest one. Also move the expiry timer from a per-structure callout to the tcp slow timer. Sponsored by: DARPA, NAI Labs	2003-03-08 22:06:20 +00:00
Jonathan Lemon	340c35de6a	Add a TCP TIMEWAIT state which uses less space than a fullblown TCP control block. Allow the socket and tcpcb structures to be freed earlier than inpcb. Update code to understand an inp w/o a socket. Reviewed by: hsu, silby, jayanth Sponsored by: DARPA, NAI Labs	2003-02-19 22:32:43 +00:00
Jonathan Lemon	7990938421	Convert tcp_fillheaders(tp, ...) -> tcpip_fillheaders(inp, ...) so the routine does not require a tcpcb to operate. Since we no longer keep template mbufs around, move pseudo checksum out of this routine, and merge it with the length update. Sponsored by: DARPA, NAI Labs	2003-02-19 22:18:06 +00:00
Jeffrey Hsu	cb942153c8	Fix NewReno. Reviewed by: Tom Henderson <thomas.r.henderson@boeing.com>	2003-01-13 11:01:20 +00:00
Matthew Dillon	1fcc99b5de	Implement TCP bandwidth delay product window limiting, similar to (but not meant to duplicate) TCP/Vegas. Add four sysctls and default the implementation to 'off'. net.inet.tcp.inflight_enable enable algorithm (defaults to 0=off) net.inet.tcp.inflight_debug debugging (defaults to 1=on) net.inet.tcp.inflight_min minimum window limit net.inet.tcp.inflight_max maximum window limit MFC after: 1 week	2002-08-17 18:26:02 +00:00
Matthew Dillon	d65bf08af3	Add the tcps_sndrexmitbad statistic, keep track of late acks that caused unnecessary retransmissions.	2002-07-19 18:29:38 +00:00
Jeffrey Hsu	3ce144ea88	Notify functions can destroy the pcb, so they have to return an indication of whether this happenned so the calling function knows whether or not to unlock the pcb. Submitted by: Jennifer Yang (yangjihui@yahoo.com) Bug reported by: Sid Carter (sidcarter@symonds.net)	2002-06-14 08:35:21 +00:00
Mike Silbersack	eb5afeba22	Re-commit w/fix: Ensure that the syn cache's syn-ack packets contain the same ip_tos, ip_ttl, and DF bits as all other tcp packets. PR: 39141 MFC after: 2 weeks This time, make sure that ipv4 specific code (aka all of the above) is only run in the ipv4 case.	2002-06-14 03:08:05 +00:00
Mike Silbersack	70d2b17029	Back out ip_tos/ip_ttl/DF "fix", it just panic'd my box. :) Pointy-hat to: silby	2002-06-14 02:43:20 +00:00
Mike Silbersack	21c3b2fc69	Ensure that the syn cache's syn-ack packets contain the same ip_tos, ip_ttl, and DF bits as all other tcp packets. PR: 39141 MFC after: 2 weeks	2002-06-14 02:36:34 +00:00
Jeffrey Hsu	f76fcf6d4c	Lock up inpcb. Submitted by: Jennifer Yang <yangjihui@yahoo.com>	2002-06-10 20:05:46 +00:00
Alfred Perlstein	4d77a549fe	Remove __P.	2002-03-19 21:25:46 +00:00
Matthew Dillon	262c1c1a4e	Fix a bug with transmitter restart after receiving a 0 window. The receiver was not sending an immediate ack with delayed acks turned on when the input buffer is drained, preventing the transmitter from restarting immediately. Propogate the TCP_NODELAY option to accept()ed sockets. (Helps tbench and is a good idea anyway). Some cleanup. Identify additonal issues in comments. MFC after: 1 day	2001-12-02 08:49:29 +00:00
Jonathan Lemon	be2ac88c59	Introduce a syncache, which enables FreeBSD to withstand a SYN flood DoS in an improved fashion over the existing code. Reviewed by: silby (in a previous iteration) Sponsored by: DARPA, NAI Labs	2001-11-22 04:50:44 +00:00
Jayanth Vijayaraghavan	c24d5dae7a	Add a flag TF_LASTIDLE, that forces a previously idle connection to send all its data, especially when the data is less than one MSS. This fixes an issue where the stack was delaying the sending of data, eventhough there was enough window to send all the data and the sending of data was emptying the socket buffer. Problem found by Yoshihiro Tsuchiya (tsuchiya@flab.fujitsu.co.jp) Submitted by: Jayanth Vijayaraghavan	2001-10-05 21:33:38 +00:00
Julian Elischer	f0ffb944d2	Patches from Keiichi SHIMA <keiichi@iij.ad.jp> to make ip use the standard protosw structure again. Obtained from: Well, KAME I guess.	2001-09-03 20:03:55 +00:00
Mike Silbersack	b0e3ad758b	Much delayed but now present: RFC 1948 style sequence numbers In order to ensure security and functionality, RFC 1948 style initial sequence number generation has been implemented. Barring any major crypographic breakthroughs, this algorithm should be unbreakable. In addition, the problems with TIME_WAIT recycling which affect our currently used algorithm are not present. Reviewed by: jesper	2001-08-22 00:58:16 +00:00
Mike Silbersack	2d610a5028	Temporary feature: Runtime tuneable tcp initial sequence number generation scheme. Users may now select between the currently used OpenBSD algorithm and the older random positive increment method. While the OpenBSD algorithm is more secure, it also breaks TIME_WAIT handling; this is causing trouble for an increasing number of folks. To switch between generation schemes, one sets the sysctl net.inet.tcp.tcp_seq_genscheme. 0 = random positive increments, 1 = the OpenBSD algorithm. 1 is still the default. Once a secure _and_ compatible algorithm is implemented, this sysctl will be removed. Reviewed by: jlemon Tested by: numerous subscribers of -net	2001-07-08 02:20:47 +00:00
Mike Silbersack	08517d530e	Eliminate the allocation of a tcp template structure for each connection. The information contained in a tcptemp can be reconstructed from a tcpcb when needed. Previously, tcp templates required the allocation of one mbuf per connection. On large systems, this change should free up a large number of mbufs. Reviewed by: bmilekic, jlemon, ru MFC after: 2 weeks	2001-06-23 03:21:46 +00:00
Kris Kennaway	f0a04f3f51	Randomize the TCP initial sequence numbers more thoroughly. Obtained from: OpenBSD Reviewed by: jesper, peter, -developers	2001-04-17 18:08:01 +00:00
Jonathan Lemon	c693a045de	Remove in_pcbnotify and use in_pcblookup_hash to find the cb directly. For TCP, verify that the sequence number in the ICMP packet falls within the tcp receive window before performing any actions indicated by the icmp packet. Clean up some layering violations (access to tcp internals from in_pcb)	2001-02-26 21:19:47 +00:00
Jesper Skriver	694a9ff95b	Remove tcp_drop_all_states, which is unneeded after jlemon removed it from tcp_subr.c in rev 1.92	2001-02-25 17:20:19 +00:00
Poul-Henning Kamp	90fcbbd635	Remove unneeded loop increment in src/sys/netinet/in_pcb.c:in_pcbnotify Add new PRC_UNREACH_ADMIN_PROHIB in sys/sys/protosw.h Remove condition on TCP in src/sys/netinet/ip_icmp.c:icmp_input In src/sys/netinet/ip_icmp.c:icmp_input set code = PRC_UNREACH_ADMIN_PROHIB or PRC_UNREACH_HOST for all unreachables except ICMP_UNREACH_NEEDFRAG Rename sysctl icmp_admin_prohib_like_rst to icmp_unreach_like_rst to reflect the fact that we also react on ICMP unreachables that are not administrative prohibited. Also update the comments to reflect this. In sys/netinet/tcp_subr.c:tcp_ctlinput add code to treat PRC_UNREACH_ADMIN_PROHIB and PRC_UNREACH_HOST different. PR: 23986 Submitted by: Jesper Skriver <jesper@skriver.dk>	2001-02-18 09:34:55 +00:00
Poul-Henning Kamp	442fad6798	Update the "icmp_admin_prohib_like_rst" code to check the tcp-window and to be configurable with respect to acting only in SYN or in all TCP states. PR: 23665 Submitted by: Jesper Skriver <jesper@skriver.dk>	2000-12-24 10:57:21 +00:00
Poul-Henning Kamp	b11d7a4a2f	We currently does not react to ICMP administratively prohibited messages send by routers when they deny our traffic, this causes a timeout when trying to connect to TCP ports/services on a remote host, which is blocked by routers or firewalls. rfc1122 (Requirements for Internet Hosts) section 3.2.2.1 actually requi re that we treat such a message for a TCP session, that we treat it like if we had recieved a RST. quote begin. A Destination Unreachable message that is received MUST be reported to the transport layer. The transport layer SHOULD use the information appropriately; for example, see Sections 4.1.3.3, 4.2.3.9, and 4.2.4 below. A transport protocol that has its own mechanism for notifying the sender that a port is unreachable (e.g., TCP, which sends RST segments) MUST nevertheless accept an ICMP Port Unreachable for the same purpose. quote end. I've written a small extension that implement this, it also create a sysctl "net.inet.tcp.icmp_admin_prohib_like_rst" to control if this new behaviour is activated. When it's activated (set to 1) we'll treat a ICMP administratively prohibited message (icmp type 3 code 9, 10 and 13) for a TCP sessions, as if we recived a TCP RST, but only if the TCP session is in SYN_SENT state. The reason for only reacting when in SYN_SENT state, is that this will solve the problem, and at the same time minimize the risk of this being abused. I suggest that we enable this new behaviour by default, but it would be a change of current behaviour, so if people prefer to leave it disabled by default, at least for now, this would be ok for me, the attached diff actually have the sysctl set to 0 by default. PR: 23086 Submitted by: Jesper Skriver <jesper@skriver.dk>	2000-12-16 19:42:06 +00:00
Jayanth Vijayaraghavan	e7f3269307	When a connection is being dropped due to a listen queue overflow, delete the cloned route that is associated with the connection. This does not exhaust the routing table memory when the system is under a SYN flood attack. The route entry is not deleted if there is any prior information cached in it. Reviewed by: Peter Wemm,asmodai	2000-07-21 23:26:37 +00:00
Sheldon Hearn	571214d4fe	Fix a comment which was broken in rev 1.36. PR: 19947 Submitted by: Tetsuya Isaki <isaki@net.ipc.hiroshima-u.ac.jp>	2000-07-18 16:43:29 +00:00
Jake Burkholder	e39756439c	Back out the previous change to the queue(3) interface. It was not discussed and should probably not happen. Requested by: msmith and others	2000-05-26 02:09:24 +00:00
Jake Burkholder	740a1973a6	Change the way that the queue(3) structures are declared; don't assume that the type argument to _HEAD and _ENTRY is a struct. Suggested by: phk Reviewed by: phk Approved by: mdodd	2000-05-23 20:41:01 +00:00
Jonathan Lemon	46f5848237	Implement TCP NewReno, as documented in RFC 2582. This allows better recovery for multiple packet losses in a single window. The algorithm can be toggled via the sysctl net.inet.tcp.newreno, which defaults to "on". Submitted by: Jayanth Vijayaraghavan <jayanth@yahoo-inc.com>	2000-05-06 03:31:09 +00:00
Yoshinobu Inoue	fb59c426ff	tcp updates to support IPv6. also a small patch to sys/nfs/nfs_socket.c, as max_hdr size change. Reviewed by: freebsd-arch, cvs-committers Obtained from: KAME project	2000-01-09 19:17:30 +00:00
Peter Wemm	664a31e496	Change #ifdef KERNEL to #ifdef _KERNEL in the public headers. "KERNEL" is an application space macro and the applications are supposed to be free to use it as they please (but cannot). This is consistant with the other BSD's who made this change quite some time ago. More commits to come.	1999-12-29 04:46:21 +00:00

1 2 3

104 Commits