and syncache_respond() into its own generic function tcp_addoptions().
tcp_addoptions() is alignment agnostic and does optimal packing in all cases.
In struct tcpopt rename to_requested_s_scale to just to_wscale.
Add a comment with quote from RFC1323: "The Window field in a SYN (i.e.,
a <SYN> or <SYN,ACK>) segment itself is never scaled."
Reviewed by: silby, mohans, julian
Sponsored by: TCP/IP Optimization Fundraise 2005
potential issues where the peer does not close, potentially leaving
thousands of connections in FIN_WAIT_2. This is controlled by a new sysctl
fast_finwait2_recycle, which is disabled by default.
Reviewed by: gnn, silby.
Normally the socket buffers are static (either derived from global
defaults or set with setsockopt) and do not adapt to real network
conditions. Two things happen: a) your socket buffers are too small
and you can't reach the full potential of the network between both
hosts; b) your socket buffers are too big and you waste a lot of
kernel memory for data just sitting around.
With automatic TCP send and receive socket buffers we can start with a
small buffer and quickly grow it in parallel with the TCP congestion
window to match real network conditions.
FreeBSD has a default 32K send socket buffer. This supports a maximal
transfer rate of only slightly more than 2Mbit/s on a 100ms RTT
trans-continental link. Or at 200ms just above 1Mbit/s. With TCP send
buffer auto scaling and the default values below it supports 20Mbit/s
at 100ms and 10Mbit/s at 200ms. That's an improvement of factor 10, or
1000%. For the receive side it looks slightly better with a default of
64K buffer size.
New sysctls are:
net.inet.tcp.sendbuf_auto=1 (enabled)
net.inet.tcp.sendbuf_inc=8192 (8K, step size)
net.inet.tcp.sendbuf_max=262144 (256K, growth limit)
net.inet.tcp.recvbuf_auto=1 (enabled)
net.inet.tcp.recvbuf_inc=16384 (16K, step size)
net.inet.tcp.recvbuf_max=262144 (256K, growth limit)
Tested by: many (on HEAD and RELENG_6)
Approved by: re
MFC after: 1 month
functionality:
- Remove a rwlock aquisition/release per generated syncookie. Locking
is now integrated with the bucket row locking of syncache itself and
syncookies no longer add any additional lock overhead.
- Syncookie secrets are different for and stored per syncache buck row.
Secrets expire after 16 seconds and are reseeded on-demand.
- The computational overhead for syncookie generation and verification
is one MD5 hash computation as before.
- Syncache can be turned off and run with syncookies only by setting the
sysctl net.inet.tcp.syncookies_only=1.
This implementation extends the orginal idea and first implementation
of FreeBSD by using not only the initial sequence number field to store
information but also the timestamp field if present. This way we can
keep track of the entire state we need to know to recreate the session in
its original form. Almost all TCP speakers implement RFC1323 timestamps
these days. For those that do not we still have to live with the known
shortcomings of the ISN only SYN cookies. The use of the timestamp field
causes the timestamps to be randomized if syncookies are enabled.
The idea of SYN cookies is to encode and include all necessary information
about the connection setup state within the SYN-ACK we send back and thus
to get along without keeping any local state until the ACK to the SYN-ACK
arrives (if ever). Everything we need to know should be available from
the information we encoded in the SYN-ACK.
A detailed description of the inner working of the syncookies mechanism
is included in the comments in tcp_syncache.c.
Reviewed by: silby (slightly earlier version)
Sponsored by: TCP/IP Optimization Fundraise 2005
timeouts for TCP and T/TCP connections in the TIME_WAIT
state, and we had two separate timed wait queues for them.
Now that is has gone, the timeout is always 2*MSL again,
and there is no reason to keep two queues (the first was
unused anyway!).
Also, reimplement the remaining queue using a TAILQ (it
was technically impossible before, with two queues).
o add IFCAP_TSO[46] for drivers to announce this capability for IPv4 and IPv6
o add CSUM_TSO flag to mbuf pkthdr csum_flags field
o add tso_segsz field to mbuf pkthdr
o enhance ip_output() packet length check to allow for large TSO packets
o extend tcp_maxmtu[46]() with a flag pointer to pass interface capabilities
o adjust all callers of tcp_maxmtu[46]() accordingly
Discussed on: -current, -net
Sponsored by: TCP/IP Optimization Fundraise 2005
bad under high load. For example with 40k sockets and 25k tcptw
entries, connect() syscall can run for seconds. Debugging showed
that it iterates the cycle millions times and purges thousands of
tcptw entries at a time.
Besides practical unusability this change is architecturally
wrong. First, in_pcblookup_local() is used in connect() and bind()
syscalls. No stale entries purging shouldn't be done here. Second,
it is a layering violation.
o Return back the tcptw purging cycle to tcp_timer_2msl_tw(),
that was removed in rev. 1.78 by rwatson. The commit log of this
revision tells nothing about the reason cycle was removed. Now
we need this cycle, since major cleaner of stale tcptw structures
is removed.
o Disable probably necessary, but now unused
tcp_twrecycleable() function.
Reviewed by: ru
o redefine the parameter 'is_syn' to 'flags', add TO_SYN flag and adjust its
usage accordingly
o update the comments to the tcp_dooptions() invocation in
tcp_input():after_listen to reflect reality
o move the logic checking the echoed timestamp out of tcp_dooptions() to the
only place that uses it next to the invocation described in the previous
item
o adjust parsing of TCPOPT_SACK_PERMITTED to use the same style as the others
o add comments in to struct tcpopt.to_flags #defines
No functional changes.
Sponsored by: TCP/IP Optimization Fundraise 2005
as possible for the syncache_add() case. The syncache timer no longer
aquires the tcpinfo lock and timeout/retransmit runs can happen in
parallel with bucket granularity.
On a P4 the additional locks cause a slight degression of 0.7% in tcp
connections per second. When IP and TCP input are deserialized and
can run in parallel this little overhead can be neglected. The syncookie
handling still leaves room for improvement and its random salts may be
moved to the syncache bucket head structures to remove the second lock
operation currently required for it. However this would be a more
involved change from the way syncookies work at the moment.
Reviewed by: rwatson
Tested by: rwatson, ps (earlier version)
Sponsored by: TCP/IP Optimization Fundraise 2005
pru_abort(), pru_detach(), and in_pcbdetach():
- Universally support and enforce the invariant that so_pcb is
never NULL, converting dozens of unnecessary NULL checks into
assertions, and eliminating dozens of unnecessary error handling
cases in protocol code.
- In some cases, eliminate unnecessary pcbinfo locking, as it is no
longer required to ensure so_pcb != NULL. For example, the receive
code no longer requires the pcbinfo lock, and the send code only
requires it if building a new connection on an otherwise unconnected
socket triggered via sendto() with an address. This should
significnatly reduce tcbinfo lock contention in the receive and send
cases.
- In order to support the invariant that so_pcb != NULL, it is now
necessary for the TCP code to not discard the tcpcb any time a
connection is dropped, but instead leave the tcpcb until the socket
is shutdown. This case is handled by setting INP_DROPPED, to
substitute for using a NULL so_pcb to indicate that the connection
has been dropped. This requires the inpcb lock, but not the pcbinfo
lock.
- Unlike all other protocols in the tree, TCP may need to retain access
to the socket after the file descriptor has been closed. Set
SS_PROTOREF in tcp_detach() in order to prevent the socket from being
freed, and add a flag, INP_SOCKREF, so that the TCP code knows whether
or not it needs to free the socket when the connection finally does
close. The typical case where this occurs is if close() is called on
a TCP socket before all sent data in the send socket buffer has been
transmitted or acknowledged. If INP_SOCKREF is found when the
connection is dropped, we release the inpcb, tcpcb, and socket instead
of flagging INP_DROPPED.
- Abort and detach protocol switch methods no longer return failures,
nor attempt to free sockets, as the socket layer does this.
- Annotate the existence of a long-standing race in the TCP timer code,
in which timers are stopped but not drained when the socket is freed,
as waiting for drain may lead to deadlocks, or have to occur in a
context where waiting is not permitted. This race has been handled
by testing to see if the tcpcb pointer in the inpcb is NULL (and vice
versa), which is not normally permitted, but may be true of a inpcb
and tcpcb have been freed. Add a counter to test how often this race
has actually occurred, and a large comment for each instance where
we compare potentially freed memory with NULL. This will have to be
fixed in the near future, but requires is to further address how to
handle the timer shutdown shutdown issue.
- Several TCP calls no longer potentially free the passed inpcb/tcpcb,
so no longer need to return a pointer to indicate whether the argument
passed in is still valid.
- Un-macroize debugging and locking setup for various protocol switch
methods for TCP, as it lead to more obscurity, and as locking becomes
more customized to the methods, offers less benefit.
- Assert copyright on tcp_usrreq.c due to significant modifications that
have been made as part of this work.
These changes significantly modify the memory management and connection
logic of our TCP implementation, and are (as such) High Risk Changes,
and likely to contain serious bugs. Please report problems to the
current@ mailing list ASAP, ideally with simple test cases, and
optionally, packet traces.
MFC after: 3 months
right from the beginning and partly clean up the differences in handling
between SYN_SENT and SYN_RCVD (syncache).
Further changes to this code to come. This is a first incremental step
to a general overhaul and streamlining of the TCP code.
PR: kern/15095
PR: kern/92690 (partly)
Reviewed by: qingli (and tested with ANVL)
Sponsored by: TCP/IP Optimization Fundraise 2005
threshold. Inflight doesn't make sense on a LAN as it has
trouble figuring out the maximal bandwidth because of the coarse
tick granularity.
The sysctl net.inet.tcp.inflight.rttthresh specifies the threshold
in milliseconds below which inflight will disengage. It defaults
to 10ms.
Tested by: Joao Barros <joao.barros-at-gmail.com>,
Rich Murphey <rich-at-whiteoaklabs.com>
Sponsored by: TCP/IP Optimization Fundraise 2005
processing is now done in the ACK processing case.
- Merge tcp_sack_option() and tcp_del_sackholes() into a new function
called tcp_sack_doack().
- Test (SEG.ACK < SND.MAX) before processing the ACK.
Submitted by: Noritoshi Demizu
Reveiewed by: Mohan Srinivasan, Raja Mukerji
Approved by: re
- Walks the scoreboard backwards from the tail to reduce the number of
comparisons for each sack option received.
- Introduce functions to add/remove sack scoreboard elements, making
the code more readable.
Submitted by: Noritoshi Demizu
Reviewed by: Raja Mukerji, Mohan Srinivasan
or to compute the total retransmitted bytes in this sack recovery
episode, the scoreboard is traversed. While in sack recovery, this
traversal occurs on every call to tcp_output(), every dupack and
every partial ack. The scoreboard could potentially get quite large,
making this traversal expensive.
This change optimizes this by storing hints (for the next hole to
retransmit and the total retransmitted bytes in this sack recovery
episode) reducing the complexity to find these values from O(n) to
constant time.
The debug code that sanity checks the hints against the computed
value will be removed eventually.
Submitted by: Mohan Srinivasan, Noritoshi Demizu, Raja Mukerji.
code readability and facilitates some anticipated optimizations in
tcp_sack_option().
- Remove tcp_print_holes() and TCP_SACK_DEBUG.
Submitted by: Raja Mukerji.
Reviewed by: Mohan Srinivasan, Noritoshi Demizu.
ineffective, depreciated and can be abused to degrade the performance
of active TCP sessions if spoofed.
Replace a bogus call to tcp_quench() in tcp_output() with the direct
equivalent tcpcb variable assignment.
Security: draft-gont-tcpm-icmp-attacks-03.txt Section 7.1
MFC after: 3 days
in flight in SACK recovery.
Found by: Noritoshi Demizu
Submitted by: Mohan Srinivasan <mohans at yahoo-inc dot com>
Noritoshi Demizu <demizu at dd dot ij4u dot or dot jp>
Raja Mukerji <raja at moselle dot com>
per-connection and globally. This eliminates potential DoS attacks
where SACK scoreboard elements tie up too much memory.
Submitted by: Raja Mukerji (raja at moselle dot com).
Reviewed by: Mohan Srinivasan (mohans at yahoo-inc dot com).
o Use SYSCTL_IN() macro instead of direct call of copyin(9).
Submitted by: ume
o Move sysctl_drop() implementation to sys/netinet/tcp_subr.c where
most of tcp sysctls live.
o There are net.inet[6].tcp[6].getcred sysctls already, no needs in
a separate struct tcp_ident_mapping.
Suggested by: ume
utility:
The tcpdrop command drops the TCP connection specified by the
local address laddr, port lport and the foreign address faddr,
port fport.
Obtained from: OpenBSD
Reviewed by: rwatson (locking), ru (man page), -current
MFC after: 1 month
A complete rationale and discussion is given in this message
and the resulting discussion:
http://docs.freebsd.org/cgi/mid.cgi?4177C8AD.6060706
Note that this commit removes only the functional part of T/TCP
from the tcp_* related functions in the kernel. Other features
introduced with RFC1644 are left intact (socket layer changes,
sendmsg(2) on connection oriented protocols) and are meant to
be reused by a simpler and less intrusive reimplemention of the
previous T/TCP functionality.
Discussed on: -arch
to control the packets injected while in sack recovery (for both
retransmissions and new data).
- Cleanups to the sack codepaths in tcp_output.c and tcp_sack.c.
- Add a new sysctl (net.inet.tcp.sack.initburst) that controls the
number of sack retransmissions done upon initiation of sack recovery.
Submitted by: Mohan Srinivasan <mohans@yahoo-inc.com>
- Trailing tab/space cleanup
- Remove spurious spaces between or before tabs
This change avoids touching files that Andre likely has in his working
set for PFIL hooks changes for IPFW/DUMMYNET.
Approved by: re (scottl)
Submitted by: Xin LI <delphij@frontfree.net>
for the SYN|ACK packet and then letting in6_pcbconnect set the
flowlabel later. Arange for the syncache/syncookie code to set and
recall the flow label so that the flowlabel used for the SYN|ACK
is consistent. This is done by using some of the cookie (when tcp
cookies are enabeled) and by stashing the flowlabel in syncache.
Tested and Discovered by: Orla McGann <orly@cnri.dit.ie>
Approved by: ume, silby
MFC after: 1 month
originated on RELENG_4 and was ported to -CURRENT.
The scoreboarding code was obtained from OpenBSD, and many
of the remaining changes were inspired by OpenBSD, but not
taken directly from there.
You can enable/disable sack using net.inet.tcp.do_sack. You can
also limit the number of sack holes that all senders can have in
the scoreboard with net.inet.tcp.sackhole_limit.
Reviewed by: gnn
Obtained from: Yahoo! (Mohan Srinivasan, Jayanth Vijayaraghavan)
possible while maintaining compatibility with the widest range of TCP stacks.
The algorithm is as follows:
---
For connections in the ESTABLISHED state, only resets with
sequence numbers exactly matching last_ack_sent will cause a reset,
all other segments will be silently dropped.
For connections in all other states, a reset anywhere in the window
will cause the connection to be reset. All other segments will be
silently dropped.
---
The necessity of accepting all in-window resets was discovered
by jayanth and jlemon, both of whom have seen TCP stacks that
will respond to FIN-ACK packets with resets not meeting the
strict last_ack_sent check.
Idea by: Darren Reed
Reviewed by: truckman, jlemon, others(?)
TIME_WAIT recycling cases I was able to generate with http testing tools.
In short, as the old algorithm relied on ticks to create the time offset
component of an ISN, two connections with the exact same host, port pair
that were generated between timer ticks would have the exact same sequence
number. As a result, the second connection would fail to pass the TIME_WAIT
check on the server side, and the SYN would never be acknowledged.
I've "fixed" this by adding random positive increments to the time component
between clock ticks so that ISNs will *always* be increasing, no matter how
quickly the port is recycled.
Except in such contrived benchmarking situations, this problem should never
come up in normal usage... until networks get faster.
No MFC planned, 4.x is missing other optimizations that are needed to even
create the situation in which such quick port recycling will occur.
increased <netinet/tcp_var>'s already large set of prerequisites, and
this was handled badly. Just don't declare the complete syncache struct
unless <netinet/pcb.h> is included before <netinet/tcp_var.h>.
Approved by: jlemon (years ago, for a more invasive fix)
amount of segments it will hold.
The following tuneables and sysctls control the behaviour of the tcp
segment reassembly queue:
net.inet.tcp.reass.maxsegments (loader tuneable)
specifies the maximum number of segments all tcp reassemly queues can
hold (defaults to 1/16 of nmbclusters).
net.inet.tcp.reass.maxqlen
specifies the maximum number of segments any individual tcp session queue
can hold (defaults to 48).
net.inet.tcp.reass.cursegments (readonly)
counts the number of segments currently in all reassembly queues.
net.inet.tcp.reass.overflows (readonly)
counts how often either the global or local queue limit has been reached.
Tested by: bms, silby
Reviewed by: bms, silby
This is the first of two commits; bringing in the kernel support first.
This can be enabled by compiling a kernel with options TCP_SIGNATURE
and FAST_IPSEC.
For the uninitiated, this is a TCP option which provides for a means of
authenticating TCP sessions which came into being before IPSEC. It is
still relevant today, however, as it is used by many commercial router
vendors, particularly with BGP, and as such has become a requirement for
interconnect at many major Internet points of presence.
Several parts of the TCP and IP headers, including the segment payload,
are digested with MD5, including a shared secret. The PF_KEY interface
is used to manage the secrets using security associations in the SADB.
There is a limitation here in that as there is no way to map a TCP flow
per-port back to an SPI without polluting tcpcb or using the SPD; the
code to do the latter is unstable at this time. Therefore this code only
supports per-host keying granularity.
Whilst FAST_IPSEC is mutually exclusive with KAME IPSEC (and thus IPv6),
TCP_SIGNATURE applies only to IPv4. For the vast majority of prospective
users of this feature, this will not pose any problem.
This implementation is output-only; that is, the option is honoured when
responding to a host initiating a TCP session, but no effort is made
[yet] to authenticate inbound traffic. This is, however, sufficient to
interwork with Cisco equipment.
Tested with a Cisco 2501 running IOS 12.0(27), and Quagga 0.96.4 with
local patches. Patches for tcpdump to validate TCP-MD5 sessions are also
available from me upon request.
Sponsored by: sentex.net
resource exhaustion attacks.
For network link optimization TCP can adjust its MSS and thus
packet size according to the observed path MTU. This is done
dynamically based on feedback from the remote host and network
components along the packet path. This information can be
abused to pretend an extremely low path MTU.
The resource exhaustion works in two ways:
o during tcp connection setup the advertized local MSS is
exchanged between the endpoints. The remote endpoint can
set this arbitrarily low (except for a minimum MTU of 64
octets enforced in the BSD code). When the local host is
sending data it is forced to send many small IP packets
instead of a large one.
For example instead of the normal TCP payload size of 1448
it forces TCP payload size of 12 (MTU 64) and thus we have
a 120 times increase in workload and packets. On fast links
this quickly saturates the local CPU and may also hit pps
processing limites of network components along the path.
This type of attack is particularly effective for servers
where the attacker can download large files (WWW and FTP).
We mitigate it by enforcing a minimum MTU settable by sysctl
net.inet.tcp.minmss defaulting to 256 octets.
o the local host is reveiving data on a TCP connection from
the remote host. The local host has no control over the
packet size the remote host is sending. The remote host
may chose to do what is described in the first attack and
send the data in packets with an TCP payload of at least
one byte. For each packet the tcp_input() function will
be entered, the packet is processed and a sowakeup() is
signalled to the connected process.
For example an attack with 2 Mbit/s gives 4716 packets per
second and the same amount of sowakeup()s to the process
(and context switches).
This type of attack is particularly effective for servers
where the attacker can upload large amounts of data.
Normally this is the case with WWW server where large POSTs
can be made.
We mitigate this by calculating the average MSS payload per
second. If it goes below 'net.inet.tcp.minmss' and the pps
rate is above 'net.inet.tcp.minmssoverload' defaulting to
1000 this particular TCP connection is resetted and dropped.
MITRE CVE: CAN-2004-0002
Reviewed by: sam (mentor)
MFC after: 1 day
the routing table. Move all usage and references in the tcp stack
from the routing table metrics to the tcp hostcache.
It caches measured parameters of past tcp sessions to provide better
initial start values for following connections from or to the same
source or destination. Depending on the network parameters to/from
the remote host this can lead to significant speedups for new tcp
connections after the first one because they inherit and shortcut
the learning curve.
tcp_hostcache is designed for multiple concurrent access in SMP
environments with high contention and is hash indexed by remote
ip address.
It removes significant locking requirements from the tcp stack with
regard to the routing table.
Reviewed by: sam (mentor), bms
Reviewed by: -net, -current, core@kame.net (IPv6 parts)
Approved by: re (scottl)
previously only considered the send sequence space. Unfortunately,
some OSes (windows) still use a random positive increments scheme for
their syn-ack ISNs, so I must consider receive sequence space as well.
The value of 250000 bytes / second for Microsoft's ISN rate of increase
was determined by testing with an XP machine.
we will generate for a given ip/port tuple has advanced far enough
for the time_wait socket in question to be safely recycled.
- Have in_pcblookup_local use tcp_twrecycleable to determine if
time_Wait sockets which are hogging local ports can be safely
freed.
This change preserves proper TIME_WAIT behavior under normal
circumstances while allowing for safe and fast recycling whenever
ephemeral port space is scarce.
lastest rev of the spec. Use an explicit flag for Fast Recovery. [1]
Fix bug with exiting Fast Recovery on a retransmit timeout
diagnosed by Lu Guohan. [2]
Reviewed by: Thomas Henderson <thomas.r.henderson@boeing.com>
Reported and tested by: Lu Guohan <lguohan00@mails.tsinghua.edu.cn> [2]
Approved by: Thomas Henderson <thomas.r.henderson@boeing.com>,
Sally Floyd <floyd@acm.org> [1]
sure that the MAC label on TCP responses during TIMEWAIT is
properly set from either the socket (if available), or the mbuf
that it's responding to.
Unfortunately, this is made somewhat difficult by the TCP code,
as tcp_twstart() calls tcp_twrespond() after discarding the socket
but without a reference to the mbuf that causes the "response".
Passing both the socket and the mbuf works arounds this--eventually
it might be good to make sure the mbuf always gets passed in in
"response" scenarios but working through this provided to
complicate things too much.
Approved by: re (scottl)
Reviewed by: hsu
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
doing Limited Transmit. Only artificially inflate the congestion
window by 1 segment instead of the usual 3 to take into account
the 2 already sent by Limited Transmit.
Approved in principle by: Mark Allman <mallman@grc.nasa.gov>,
Hari Balakrishnan <hari@nms.lcs.mit.edu>, Sally Floyd <floyd@icir.org>
control block. Allow the socket and tcpcb structures to be freed
earlier than inpcb. Update code to understand an inp w/o a socket.
Reviewed by: hsu, silby, jayanth
Sponsored by: DARPA, NAI Labs
routine does not require a tcpcb to operate. Since we no longer keep
template mbufs around, move pseudo checksum out of this routine, and
merge it with the length update.
Sponsored by: DARPA, NAI Labs
not meant to duplicate) TCP/Vegas. Add four sysctls and default the
implementation to 'off'.
net.inet.tcp.inflight_enable enable algorithm (defaults to 0=off)
net.inet.tcp.inflight_debug debugging (defaults to 1=on)
net.inet.tcp.inflight_min minimum window limit
net.inet.tcp.inflight_max maximum window limit
MFC after: 1 week
indication of whether this happenned so the calling function
knows whether or not to unlock the pcb.
Submitted by: Jennifer Yang (yangjihui@yahoo.com)
Bug reported by: Sid Carter (sidcarter@symonds.net)
Ensure that the syn cache's syn-ack packets contain the same
ip_tos, ip_ttl, and DF bits as all other tcp packets.
PR: 39141
MFC after: 2 weeks
This time, make sure that ipv4 specific code (aka all of the above)
is only run in the ipv4 case.
receiver was not sending an immediate ack with delayed acks turned on
when the input buffer is drained, preventing the transmitter from
restarting immediately.
Propogate the TCP_NODELAY option to accept()ed sockets. (Helps tbench and
is a good idea anyway).
Some cleanup. Identify additonal issues in comments.
MFC after: 1 day
to send all its data, especially when the data is less than one MSS.
This fixes an issue where the stack was delaying the sending
of data, eventhough there was enough window to send all the data and
the sending of data was emptying the socket buffer.
Problem found by Yoshihiro Tsuchiya (tsuchiya@flab.fujitsu.co.jp)
Submitted by: Jayanth Vijayaraghavan
In order to ensure security and functionality, RFC 1948 style
initial sequence number generation has been implemented. Barring
any major crypographic breakthroughs, this algorithm should be
unbreakable. In addition, the problems with TIME_WAIT recycling
which affect our currently used algorithm are not present.
Reviewed by: jesper
generation scheme. Users may now select between the currently used
OpenBSD algorithm and the older random positive increment method.
While the OpenBSD algorithm is more secure, it also breaks TIME_WAIT
handling; this is causing trouble for an increasing number of folks.
To switch between generation schemes, one sets the sysctl
net.inet.tcp.tcp_seq_genscheme. 0 = random positive increments,
1 = the OpenBSD algorithm. 1 is still the default.
Once a secure _and_ compatible algorithm is implemented, this sysctl
will be removed.
Reviewed by: jlemon
Tested by: numerous subscribers of -net
connection. The information contained in a tcptemp can be
reconstructed from a tcpcb when needed.
Previously, tcp templates required the allocation of one
mbuf per connection. On large systems, this change should
free up a large number of mbufs.
Reviewed by: bmilekic, jlemon, ru
MFC after: 2 weeks
For TCP, verify that the sequence number in the ICMP packet falls within
the tcp receive window before performing any actions indicated by the
icmp packet.
Clean up some layering violations (access to tcp internals from in_pcb)
Add new PRC_UNREACH_ADMIN_PROHIB in sys/sys/protosw.h
Remove condition on TCP in src/sys/netinet/ip_icmp.c:icmp_input
In src/sys/netinet/ip_icmp.c:icmp_input set code = PRC_UNREACH_ADMIN_PROHIB
or PRC_UNREACH_HOST for all unreachables except ICMP_UNREACH_NEEDFRAG
Rename sysctl icmp_admin_prohib_like_rst to icmp_unreach_like_rst
to reflect the fact that we also react on ICMP unreachables that
are not administrative prohibited. Also update the comments to
reflect this.
In sys/netinet/tcp_subr.c:tcp_ctlinput add code to treat
PRC_UNREACH_ADMIN_PROHIB and PRC_UNREACH_HOST different.
PR: 23986
Submitted by: Jesper Skriver <jesper@skriver.dk>
messages send by routers when they deny our traffic, this causes
a timeout when trying to connect to TCP ports/services on a remote
host, which is blocked by routers or firewalls.
rfc1122 (Requirements for Internet Hosts) section 3.2.2.1 actually
requi re that we treat such a message for a TCP session, that we
treat it like if we had recieved a RST.
quote begin.
A Destination Unreachable message that is received MUST be
reported to the transport layer. The transport layer SHOULD
use the information appropriately; for example, see Sections
4.1.3.3, 4.2.3.9, and 4.2.4 below. A transport protocol
that has its own mechanism for notifying the sender that a
port is unreachable (e.g., TCP, which sends RST segments)
MUST nevertheless accept an ICMP Port Unreachable for the
same purpose.
quote end.
I've written a small extension that implement this, it also create
a sysctl "net.inet.tcp.icmp_admin_prohib_like_rst" to control if
this new behaviour is activated.
When it's activated (set to 1) we'll treat a ICMP administratively
prohibited message (icmp type 3 code 9, 10 and 13) for a TCP
sessions, as if we recived a TCP RST, but only if the TCP session
is in SYN_SENT state.
The reason for only reacting when in SYN_SENT state, is that this
will solve the problem, and at the same time minimize the risk of
this being abused.
I suggest that we enable this new behaviour by default, but it
would be a change of current behaviour, so if people prefer to
leave it disabled by default, at least for now, this would be ok
for me, the attached diff actually have the sysctl set to 0 by
default.
PR: 23086
Submitted by: Jesper Skriver <jesper@skriver.dk>
delete the cloned route that is associated with the connection.
This does not exhaust the routing table memory when the system
is under a SYN flood attack. The route entry is not deleted if there
is any prior information cached in it.
Reviewed by: Peter Wemm,asmodai
better recovery for multiple packet losses in a single window.
The algorithm can be toggled via the sysctl net.inet.tcp.newreno,
which defaults to "on".
Submitted by: Jayanth Vijayaraghavan <jayanth@yahoo-inc.com>
is an application space macro and the applications are supposed to be free
to use it as they please (but cannot). This is consistant with the other
BSD's who made this change quite some time ago. More commits to come.
pr_input() routines prototype is also changed to support IPSEC and IPV6
chained protocol headers.
Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project
- eliminate the fast/slow timeout lists for TCP and instead use a
callout entry for each timer.
- increase the TCP timer granularity to HZ
- implement "bad retransmit" recovery, as presented in
"On Estimating End-to-End Network Path Properties", by Allman and Paxson.
Submitted by: jlemon, wollmann
This makes it possible to change the sysctl tree at runtime.
* Change KLD to find and register any sysctl nodes contained in the loaded
file and to unregister them when the file is unloaded.
Reviewed by: Archie Cobbs <archie@whistle.com>,
Peter Wemm <peter@netplex.com.au> (well they looked at it anyway)
flag means that there is more data to be put into the socket buffer.
Use it in TCP to reduce the interaction between mbuf sizes and the
Nagle algorithm.
Based on: "Justin C. Walker" <justin@apple.com>'s description of Apple's
fix for this problem.
another specialized mbuf type in the process. Also clean up some
of the cruft surrounding IPFW, multicast routing, RSVP, and other
ill-explored corners.
mismatches exposed by this (the prototype for tcp_respond() didn't
match the function definition lexically, and still depends on a
gcc feature to match if ints have more than 32 bits).
Define a parameter which indicates the maximum number of sockets in a
system, and use this to size the zone allocators used for sockets and
for certain PCBs.
Convert PF_LOCAL PCB structures to be type-stable and add a version number.
Define an external format for infomation about socket structures and use
it in several places.
Define a mechanism to get all PF_LOCAL and PF_INET PCB lists through
sysctl(3) without blocking network interrupts for an unreasonable
length of time. This probably still has some bugs and/or race
conditions, but it seems to work well enough on my machines.
It is now possible for `netstat' to get almost all of its information
via the sysctl(3) interface rather than reading kmem (changes to follow).
is believed to have been broken with the Brakmo/Peterson srtt
calculation changes. The result of this bug is that TCP connections
could time out extremely quickly (in 12 seconds).
Also backed out jdp's partial fix for this problem in rev 1.17 of
tcp_timer.c as it is obsoleted by this commit.
Bug was pointed out by Kevin Lehey <kml@roller.nas.nasa.gov>.
PR: 6068
a hashed port list. In the new scheme, in_pcblookup() goes away and is
replaced by a new routine, in_pcblookup_local() for doing the local port
check. Note that this implementation is space inefficient in that the PCB
struct is now too large to fit into 128 bytes. I might deal with this in the
future by using the new zone allocator, but I wanted these changes to be
extensively tested in their current form first.
Also:
1) Fixed off-by-one errors in the port lookup loops in in_pcbbind().
2) Got rid of some unneeded rehashing. Adding a new routine, in_pcbinshash()
to do the initialial hash insertion.
3) Renamed in_pcblookuphash() to in_pcblookup_hash() for easier readability.
4) Added a new routine, in_pcbremlists() to remove the PCB from the various
hash lists.
5) Added/deleted comments where appropriate.
6) Removed unnecessary splnet() locking. In general, the PCB functions should
be called at splnet()...there are unfortunately a few exceptions, however.
7) Reorganized a few structs for better cache line behavior.
8) Killed my TCP_ACK_HACK kludge. It may come back in a different form in
the future, however.
These changes have been tested on wcarchive for more than a month. In tests
done here, connection establishment overhead is reduced by more than 50
times, thus getting rid of one of the major networking scalability problems.
Still to do: make tcp_fastimo/tcp_slowtimo scale well for systems with a
large number of connections. tcp_fastimo is easy; tcp_slowtimo is difficult.
WARNING: Anything that knows about inpcb and tcpcb structs will have to be
recompiled; at the very least, this includes netstat(1).