Commit Graph

271 Commits

Author SHA1 Message Date
George V. Neville-Neil
5d06879adb dd DTrace probe points, translators and a corresponding script
to provide the TCPDEBUG functionality with pure DTrace.

Reviewed by:	rwatson
MFC after:	2 weeks
Sponsored by:	Limelight Networks
Differential Revision:	D3530
2015-09-13 15:50:55 +00:00
Kristof Provost
fc4443a1d5 Remove stale comment.
The IPv6 pseudo header checksum was added by bz in r235961.

Sponsored by:	Essen FreeBSD Hackathon
2015-07-25 16:14:55 +00:00
Xin LI
47a8e86509 Fix resource exhaustion due to sessions stuck in LAST_ACK state.
Submitted by:	Jonathan Looney (Juniper SIRT)
Reviewed by:	lstewart
Security:	CVE-2015-5358
Security:	SA-15:13.tcp
2015-07-21 23:42:15 +00:00
Hiren Panchasara
f85680793b Avoid a situation where we do not set persist timer after a zero window
condition.
If you send a 0-length packet, but there is data is the socket buffer, and
neither the rexmt or persist timer is already set, then activate the persist
timer.

PR:			192599
Differential Revision:	D2946
Submitted by:		jlott at averesystems dot com
Reviewed by:		jhb, jch, gnn, hiren
Tested by:		jlott at averesystems dot com, jch
MFC after:		2 weeks
2015-06-29 21:23:54 +00:00
Robert Watson
ed6a66ca6c To ease changes to underlying mbuf structure and the mbuf allocator, reduce
the knowledge of mbuf layout, and in particular constants such as M_EXT,
MLEN, MHLEN, and so on, in mbuf consumers by unifying various alignment
utility functions (M_ALIGN(), MH_ALIGN(), MEXT_ALIGN() in a single
M_ALIGN() macro, implemented by a now-inlined m_align() function:

- Move m_align() from uipc_mbuf.c to mbuf.h; mark as __inline.
- Reimplement M_ALIGN(), MH_ALIGN(), and MEXT_ALIGN() using m_align().
- Update consumers around the tree to simply use M_ALIGN().

This change eliminates a number of cases where mbuf consumers must be aware
of whether or not mbufs returned by the allocator use external storage, but
also assumptions about the size of the returned mbuf. This will make it
easier to introduce changes in how we use external storage, as well as
features such as variable-size mbufs.

Differential Revision:	https://reviews.freebsd.org/D1436
Reviewed by:	glebius, trasz, gnn, bz
Sponsored by:	EMC / Isilon Storage Division
2015-01-05 09:58:32 +00:00
Gleb Smirnoff
cfa6009e36 In preparation of merging projects/sendfile, transform bare access to
sb_cc member of struct sockbuf to a couple of inline functions:

sbavail() and sbused()

Right now they are equal, but once notion of "not ready socket buffer data",
will be checked in, they are going to be different.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-11-12 09:57:15 +00:00
Hans Petter Selasky
3c7c188c16 Fix some minor TSO issues:
- Improve description of TSO limits.
- Remove a not needed KASSERT()
- Remove some not needed variable casts.

Sponsored by:	Mellanox Technologies
Discussed with:	lstewart @
MFC after:	1 week
2014-11-11 12:05:59 +00:00
Gleb Smirnoff
6df8a71067 Remove SYSCTL_VNET_* macros, and simply put CTLFLAG_VNET where needed.
Sponsored by:	Nginx, Inc.
2014-11-07 09:39:05 +00:00
Sean Bruno
0f3e3bc526 Catch ipv6 case when attempting to do PLPMTUD blackhole detection.
Submitted by:	Mikhail <mp@lenta.ru>
MFC after:	2 weeks
Relnotes:	yes
2014-10-13 21:05:29 +00:00
Sean Bruno
f6f6703f27 Implement PLPMTUD blackhole detection (RFC 4821), inspired by code
from xnu sources.  If we encounter a network where ICMP is blocked
the Needs Frag indicator may not propagate back to us.  Attempt to
downshift the mss once to a preconfigured value.

Default this feature to off for now while we do not have a full PLPMTUD
implementation in our stack.

Adds the following new sysctl's for control:
net.inet.tcp.pmtud_blackhole_detection -- turns on/off this feature
net.inet.tcp.pmtud_blackhole_mss       -- mss to try for ipv4
net.inet.tcp.v6pmtud_blackhole_mss     -- mss to try for ipv6

Adds the following new sysctl's for monitoring:
-- Number of times the code was activated to attempt a mss downshift
net.inet.tcp.pmtud_blackhole_activated
-- Number of times the blackhole mss was used in an attempt to downshift
net.inet.tcp.pmtud_blackhole_min_activated
-- Number of times that we failed to connect after we downshifted the mss
net.inet.tcp.pmtud_blackhole_failed

Phabricator:	https://reviews.freebsd.org/D506
Reviewed by:	rpaulo bz
MFC after:	2 weeks
Relnotes:	yes
Sponsored by:	Limelight Networks
2014-10-07 21:50:28 +00:00
Hans Petter Selasky
b228e6bf57 Minor code styling.
Suggested by:	glebius @
2014-10-06 06:19:54 +00:00
Hans Petter Selasky
9fd573c39d Improve transmit sending offload, TSO, algorithm in general.
The current TSO limitation feature only takes the total number of
bytes in an mbuf chain into account and does not limit by the number
of mbufs in a chain. Some kinds of hardware is limited by two
factors. One is the fragment length and the second is the fragment
count. Both of these limits need to be taken into account when doing
TSO. Else some kinds of hardware might have to drop completely valid
mbuf chains because they cannot loaded into the given hardware's DMA
engine. The new way of doing TSO limitation has been made backwards
compatible as input from other FreeBSD developers and will use
defaults for values not set.

Reviewed by:	adrian, rmacklem
Sponsored by:	Mellanox Technologies
MFC after:	1 week
2014-09-22 08:27:27 +00:00
Hans Petter Selasky
72f3100047 Revert r271504. A new patch to solve this issue will be made.
Suggested by:	adrian @
2014-09-13 20:52:01 +00:00
Hans Petter Selasky
eb93b77ae4 Improve transmit sending offload, TSO, algorithm in general.
The current TSO limitation feature only takes the total number of
bytes in an mbuf chain into account and does not limit by the number
of mbufs in a chain. Some kinds of hardware is limited by two
factors. One is the fragment length and the second is the fragment
count. Both of these limits need to be taken into account when doing
TSO. Else some kinds of hardware might have to drop completely valid
mbuf chains because they cannot loaded into the given hardware's DMA
engine. The new way of doing TSO limitation has been made backwards
compatible as input from other FreeBSD developers and will use
defaults for values not set.

MFC after:	1 week
Sponsored by:	Mellanox Technologies
2014-09-13 08:26:09 +00:00
Hiren Panchasara
43630e625a Fix a typo. 2014-07-03 23:12:43 +00:00
Gleb Smirnoff
e3a7aa6f56 - Remove rt_metrics_lite and simply put its members into rtentry.
- Use counter(9) for rt_pksent (former rt_rmx.rmx_pksent). This
  removes another cache trashing ++ from packet forwarding path.
- Create zini/fini methods for the rtentry UMA zone. Via initialize
  mutex and counter in them.
- Fix reporting of rmx_pksent to routing socket.
- Fix netstat(1) to report "Use" both in kvm(3) and sysctl(3) mode.

The change is mostly targeted for stable/10 merge. For head,
rt_pksent is expected to just disappear.

Discussed with:		melifaro
Sponsored by:		Netflix
Sponsored by:		Nginx, Inc.
2014-03-05 01:17:47 +00:00
Andriy Gapon
d9fae5ab88 dtrace sdt: remove the ugly sname parameter of SDT_PROBE_DEFINE
In its stead use the Solaris / illumos approach of emulating '-' (dash)
in probe names with '__' (two consecutive underscores).

Reviewed by:	markj
MFC after:	3 weeks
2013-11-26 08:46:27 +00:00
Attilio Rao
54366c0bd7 - For kernel compiled only with KDTRACE_HOOKS and not any lock debugging
option, unbreak the lock tracing release semantic by embedding
  calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined
  version of the releasing functions for mutex, rwlock and sxlock.
  Failing to do so skips the lockstat_probe_func invokation for
  unlocking.
- As part of the LOCKSTAT support is inlined in mutex operation, for
  kernel compiled without lock debugging options, potentially every
  consumer must be compiled including opt_kdtrace.h.
  Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the
  dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
  is linked there and it is only used as a compile-time stub [0].

[0] immediately shows some new bug as DTRACE-derived support for debug
in sfxge is broken and it was never really tested.  As it was not
including correctly opt_kdtrace.h before it was never enabled so it
was kept broken for a while.  Fix this by using a protection stub,
leaving sfxge driver authors the responsibility for fixing it
appropriately [1].

Sponsored by:	EMC / Isilon storage division
Discussed with:	rstone
[0] Reported by:	rstone
[1] Discussed with:	philip
2013-11-25 07:38:45 +00:00
Mark Johnston
57f6086735 Implement the ip, tcp, and udp DTrace providers. The probe definitions use
dynamic translation so that their arguments match the definitions for
these providers in Solaris and illumos. Thus, existing scripts for these
providers should work unmodified on FreeBSD.

Tested by:	gnn, hiren
MFC after:	1 month
2013-08-25 21:54:41 +00:00
Andre Oppermann
3c914c547e Allow drivers to specify a maximum TSO length in bytes if they are
limited in the amount of data they can handle at once.

Drivers can set ifp->if_hw_tsomax before calling ether_ifattach() to
change the limit.

The lowest allowable size is IP_MAXPACKET / 8 (8192 bytes) as anything
less wouldn't be very useful anymore.  The upper limit is still at
IP_MAXPACKET (65536 bytes).  Raising it requires further auditing of
the IPv4/v6 code path's as the length field in the IP header would
overflow leading to confusion in firewalls and others packet handler on
the real size of the packet.

The placement into "struct ifnet" is a bit hackish but the best place
that was found.  When the stack/driver boundary is updated it should
be handled in a better way.

Submitted by:	cperciva (earlier version)
Reviewed by:	cperciva
Tested by:	cperciva
MFC after:	1 week (using spare struct members to preserve ABI)
2013-06-03 12:55:13 +00:00
Gleb Smirnoff
0e2bc05c47 Fix tcp_output() so that tcpcb is updated in the same manner when an
mbuf allocation fails, as in a case when ip_output() returns error.

To achieve that, move large block of code that updates tcpcb below
the out: label.

This fixes a panic, that requires the following sequence to happen:

1) The SYN was sent to the network, tp->snd_nxt = iss + 1, tp->snd_una = iss
2) The retransmit timeout happened for the SYN we had sent,
   tcp_timer_rexmt() sets tp->snd_nxt = tp->snd_una, and calls tcp_output().
   In tcp_output m_get() fails.
3) Later on the SYN|ACK for the SYN sent in step 1) came,
   tcp_input sets tp->snd_una += 1, which leads to
   tp->snd_una > tp->snd_nxt inconsistency, that later panics in
   socket buffer code.

For reference, this bug fixed in DragonflyBSD repo:

http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/1ff9b7d322dc5a26f7173aa8c38ecb79da80e419

Reviewed by:	andre
Tested by:	pho
Sponsored by:	Nginx, Inc.
PR:		kern/177456
Submitted by:	HouYeFei&XiBoLiu <lglion718 163.com>
2013-04-11 18:23:56 +00:00
Gleb Smirnoff
aa8bd99d99 - Replace compat macros with function calls. 2013-03-16 08:58:28 +00:00
Gleb Smirnoff
39f6074e2e - Use m_getcl() instead of hand allocating.
Sponsored by:	Nginx, Inc.
2013-03-15 12:53:53 +00:00
Gleb Smirnoff
eb1b1807af Mechanically substitute flags from historic mbuf allocator with
malloc(9) flags within sys.

Exceptions:

- sys/contrib not touched
- sys/mbuf.h edited manually
2012-12-05 08:04:20 +00:00
Kevin Lo
0f5e7edc14 Fix typo; s/ouput/output 2012-11-07 07:00:59 +00:00
Andre Oppermann
78f59b4bfd Forced commit to provide the correct commit message to r242251:
Defer sending an independent window update if a delayed ACK is pending
  saving a packet.  The window update then gets piggy-backed on the next
  already scheduled ACK.

Added grammar fixes as well.

MFC after:	2 weeks
2012-10-29 13:16:33 +00:00
Andre Oppermann
f62563d33c Prevent a flurry of forced window updates when an application is
doing small reads on a (partially) filled receive socket buffer.

Normally one would a send a window update every time the available
space in the socket buffer increases by two times MSS.  This leads
to a flurry of window updates that do not provide any meaningful
new information to the sender.  There still is available space in
the window and the sender can continue sending data.  All window
updates then get carried by the regular ACKs.  Only when the socket
buffer was (almost) full and the window closed accordingly a window
updates delivery new information and allows the sender to start
sending more data again.

Send window updates only every two MSS when the socket buffer
has less than 1/8 space available, or the available space in the
socket buffer increased by 1/4 its full capacity, or the socket
buffer is very small.  The next regular data ACK will carry and
report the exact window size again.

Reported by:	sbruno
Tested by:	darrenr
Tested by:	Darren Baginski
PR:		kern/116335
MFC after:	2 weeks
2012-10-28 17:40:35 +00:00
Andre Oppermann
4249614cb0 When SYN or SYN/ACK had to be retransmitted RFC5681 requires us to
reduce the initial CWND to one segment.  This reduction got lost
some time ago due to a change in initialization ordering.

Additionally in tcp_timer_rexmt() avoid entering fast recovery when
we're still in TCPS_SYN_SENT state.

MFC after:	2 weeks
2012-10-28 17:30:28 +00:00
Gleb Smirnoff
8f134647ca Switch the entire IPv4 stack to keep the IP packet header
in network byte order. Any host byte order processing is
done in local variables and host byte order values are
never[1] written to a packet.

  After this change a packet processed by the stack isn't
modified at all[2] except for TTL.

  After this change a network stack hacker doesn't need to
scratch his head trying to figure out what is the byte order
at the given place in the stack.

[1] One exception still remains. The raw sockets convert host
byte order before pass a packet to an application. Probably
this would remain for ages for compatibility.

[2] The ip_input() still subtructs header len from ip->ip_len,
but this is planned to be fixed soon.

Reviewed by:	luigi, Maxim Dounin <mdounin mdounin.ru>
Tested by:	ray, Olivier Cochard-Labbe <olivier cochard.me>
2012-10-22 21:09:03 +00:00
Gleb Smirnoff
df0633a145 If ip_output() returns EMSGSIZE to tcp_output(), then the latter calls
tcp_mtudisc(), which in its turn may call tcp_output(). Under certain
conditions (must admit they are very special) an infinite recursion can
happen.

To avoid recursion we can pass struct route to ip_output() and obtain
correct mtu. This allows us not to use tcp_mtudisc() but call tcp_mss_update()
directly.

PR:		kern/155585
Submitted by:	Andrey Zonov <andrey zonov.org> (original version of patch)
2012-07-16 07:08:34 +00:00
Navdeep Parhar
09fe63205c - Updated TOE support in the kernel.
- Stateful TCP offload drivers for Terminator 3 and 4 (T3 and T4) ASICs.
  These are available as t3_tom and t4_tom modules that augment cxgb(4)
  and cxgbe(4) respectively.  The cxgb/cxgbe drivers continue to work as
  usual with or without these extra features.

- iWARP driver for Terminator 3 ASIC (kernel verbs).  T4 iWARP in the
  works and will follow soon.

Build-tested with make universe.

30s overview
============
What interfaces support TCP offload?  Look for TOE4 and/or TOE6 in the
capabilities of an interface:
# ifconfig -m | grep TOE

Enable/disable TCP offload on an interface (just like any other ifnet
capability):
# ifconfig cxgbe0 toe
# ifconfig cxgbe0 -toe

Which connections are offloaded?  Look for toe4 and/or toe6 in the
output of netstat and sockstat:
# netstat -np tcp | grep toe
# sockstat -46c | grep toe

Reviewed by:	bz, gnn
Sponsored by:	Chelsio communications.
MFC after:	~3 months (after 9.1, and after ensuring MFC is feasible)
2012-06-19 07:34:13 +00:00
Bjoern A. Zeeb
356ab07e2d It turns out that too many drivers are not only parsing the L2/3/4
headers for TSO but also for generic checksum offloading.  Ideally we
would only have one common function shared amongst all drivers, and
perhaps when updating them for IPv6 we should introduce that.
Eventually we should provide the meta information along with mbufs to
avoid (re-)parsing entirely.

To not break IPv6 (checksums and offload) and to be able to MFC the
changes without risking to hurt 3rd party drivers, duplicate the v4
framework, as other OSes have done as well.

Introduce interface capability flags for TX/RX checksum offload with
IPv6, to allow independent toggling (where possible).  Add CSUM_*_IPV6
flags for UDP/TCP over IPv6, and reserve further for SCTP, and IPv6
fragmentation.  Define CSUM_DELAY_DATA_IPV6 as we do for legacy IP and
add an alias for CSUM_DATA_VALID_IPV6.

This pretty much brings IPv6 handling in line with IPv4.
TSO is still handled in a different way and not via if_hwassist.

Update ifconfig to allow (un)setting of the new capability flags.
Update loopback to announce the new capabilities and if_hwassist flags.

Individual driver updates will have to follow, as will SCTP.

Reported by:	gallatin, dim, ..
Reviewed by:	gallatin (glanced at?)
MFC after:	3 days
X-MFC with:	r235961,235959,235958
2012-05-28 09:30:13 +00:00
Bjoern A. Zeeb
45747ba53c MFp4 bz_ipv6_fast:
Add code to handle pre-checked TCP checksums as indicated by mbuf
  flags to save the entire computation for validation if not needed.

  In the IPv6 TCP output path only compute the pseudo-header checksum,
  set the checksum offset in the mbuf field along the appropriate flag
  as done in IPv4.

  In tcp_respond() just initialize the IPv6 payload length to 0 as
  ip6_output() will properly set it.

  Sponsored by:	The FreeBSD Foundation
  Sponsored by:	iXsystems

Reviewed by:	gnn (as part of the whole)
MFC After:	3 days
2012-05-25 02:23:26 +00:00
Gleb Smirnoff
ef341ee1e3 When we receive an ICMP unreach need fragmentation datagram, we take
proposed MTU value from it and update the TCP host cache. Then
tcp_mss_update() is called on the corresponding tcpcb. It finds the
just allocated entry in the TCP host cache and updates MSS on the
tcpcb. And then we do a fast retransmit of what we have in the tcp
send buffer.

This sequence gets broken if the TCP host cache is exausted. In this
case allocation fails, and later called tcp_mss_update() finds nothing
in cache. The fast retransmit is done with not reduced MSS and is
immidiately replied by remote host with new ICMP datagrams and the
cycle repeats. This ping-pong can go up to wirespeed.

To fix this:
- tcp_mss_update() gets new parameter - mtuoffer, that is like
  offer, but needs to have min_protoh subtracted.
- tcp_mtudisc() as notification method renamed to tcp_mtudisc_notify().
- tcp_mtudisc() now accepts not a useless error argument, but proposed
  MTU value, that is passed to tcp_mss_update() as mtuoffer.

Reported by:	az
Reported by:	Andrey Zonov <andrey zonov.org>
Reviewed by:	andre (previous version of patch)
2012-04-16 13:49:03 +00:00
Bjoern A. Zeeb
d8951c8a2f Fix PAWS (Protect Against Wrapped Sequence numbers) in cases when
hz >> 1000 and thus getting outside the timestamp clock frequenceny of
1ms < x < 1s per tick as mandated by RFC1323, leading to connection
resets on idle connections.

Always use a granularity of 1ms using getmicrouptime() making all but
relevant callouts independent of hz.

Use getmicrouptime(), not getmicrotime() as the latter may make a jump
possibly breaking TCP nfsroot mounts having our timestamps move forward
for more than 24.8 days in a second without having been idle for that
long.

PR:		kern/61404
Reviewed by:	jhb, mav, rrs
Discussed with:	silby, lstewart
Sponsored by:	Sandvine Incorporated (originally in 2011)
MFC after:	6 weeks
2012-02-15 16:09:56 +00:00
Sergey Kandaurov
ddd0c4a969 Restore sysctl names for tcp_sendspace/tcp_recvspace.
They seem to be changed unintentionally in r226437, and there were no
any mentions of renaming in commit log message.

Reported by:	Anton Yuzhaninov <citrin citrin ru>
2011-11-02 20:58:47 +00:00
Andre Oppermann
873789cb0f Move the tcp_sendspace and tcp_recvspace sysctl's from
the middle of tcp_usrreq.c to the top of tcp_output.c
and tcp_input.c respectively next to the socket buffer
autosizing controls.

MFC after:	1 week
2011-10-16 20:18:39 +00:00
Andre Oppermann
9ec4a4cca5 Remove the ss_fltsz and ss_fltsz_local sysctl's which have
long been superseded by the RFC3390 initial CWND sizing.

Also remove the remnants of TCP_METRICS_CWND which used the
TCP hostcache to set the initial CWND in a non-RFC compliant
way.

MFC after:	1 week
2011-10-16 20:06:44 +00:00
Bjoern A. Zeeb
b233773bb9 Increase the defaults for the maximum socket buffer limit,
and the maximum TCP send and receive buffer limits from 256kB
to 2MB.

For sb_max_adj we need to add the cast as already used in the sysctl
handler to not overflow the type doing the maths.

Note that this is just the defaults.  They will allow more memory
to be consumed per socket/connection if needed but not change the
default "idle" memory consumption.   All values are still tunable
by sysctls.

Suggested by:	gnn
Discussed on:	arch (Mar and Aug 2011)
MFC after:	3 weeks
Approved by:	re (kib)
2011-08-25 09:20:13 +00:00
Colin Percival
472ea5befb Remove #ifdef notyet code dating back to 4.3BSD Net/2 (and possibly earlier).
I think the benefit of making the code cleaner and easier to understand
outweighs the humour of leaving this intact (or possibly changing it to
#ifdef not_yet_and_probably_never).

MFC after:	2 weeks
2011-07-05 18:49:55 +00:00
Bjoern A. Zeeb
75497cc5eb Fix a KASSERT from r212803 to check the correct length also in case of
IPsec being compiled in and used.  Improve reporting by adding the length
fields to the panic message, so that we would have some immediate debugging
hints.

Discussed with:	jhb
2011-06-20 07:07:18 +00:00
John Baldwin
6b7c15e580 Advance the advertised window (rcv_adv) to the currently received data
(rcv_nxt) if we advertising a zero window.  This can be true when ACK'ing
a window probe whose one byte payload was accepted rather than dropped
because the socket's receive buffer was not completely full, but the
remaining space was smaller than the window scale.

This ensures that window probe ACKs satisfy the assumption made in r221346
and closes a window where rcv_nxt could be greater than rcv_adv.

Tested by:	trasz, pho, trociny
Reviewed by:	silby
MFC after:	1 week
2011-06-13 15:38:31 +00:00
John Baldwin
f701e30d7f Handle a rare edge case with nearly full TCP receive buffers. If a TCP
buffer fills up causing the remote sender to enter into persist mode, but
there is still room available in the receive buffer when a window probe
arrives (either due to window scaling, or due to the local application
very slowing draining data from the receive buffer), then the single byte
of data in the window probe is accepted.  However, this can cause rcv_nxt
to be greater than rcv_adv.  This condition will only last until the next
ACK packet is pushed out via tcp_output(), and since the previous ACK
advertised a zero window, the ACK should be pushed out while the TCP
pcb is write-locked.

During the window while rcv_nxt is greather than rcv_adv, a few places
would compute the remaining receive window via rcv_adv - rcv_nxt.
However, this value was then (uint32_t)-1.  On a 64 bit machine this
could expand to a positive 2^32 - 1 when cast to a long.  In particular,
when calculating the receive window in tcp_output(), the result would be
that the receive window was computed as 2^32 - 1 resulting in advertising
a far larger window to the remote peer than actually existed.

Fix various places that compute the remaining receive window to either
assert that it is not negative (i.e. rcv_nxt <= rcv_adv), or treat the
window as full if rcv_nxt is greather than rcv_adv.

Reviewed by:	bz
MFC after:	1 month
2011-05-02 21:05:52 +00:00
Bjoern A. Zeeb
b287c6c70c Make the TCP code compile without INET. Sort #includes and add #ifdef INETs.
Add some comments at #endifs given more nestedness.  To make the compiler
happy, some default initializations were added in accordance with the style
on the files.

Reviewed by:	gnn
Sponsored by:	The FreeBSD Foundation
Sponsored by:	iXsystems
MFC after:	4 days
2011-04-30 11:21:29 +00:00
John Baldwin
672dc4aea2 TCP reuses t_rxtshift to determine the backoff timer used for both the
persist state and the retransmit timer.  However, the code that implements
"bad retransmit recovery" only checks t_rxtshift to see if an ACK has been
received in during the first retransmit timeout window.  As a result, if
ticks has wrapped over to a negative value and a socket is in the persist
state, it can incorrectly treat an ACK from the remote peer as a
"bad retransmit recovery" and restore saved values such as snd_ssthresh and
snd_cwnd.  However, if the socket has never had a retransmit timeout, then
these saved values will be zero, so snd_ssthresh and snd_cwnd will be set
to 0.

If the socket is in fast recovery (this can be caused by excessive
duplicate ACKs such as those fixed by 220794), then each ACK that arrives
triggers either NewReno or SACK partial ACK handling which clamps snd_cwnd
to be no larger than snd_ssthresh.  In effect, the socket's send window
is permamently stuck at 0 even though the remote peer is advertising a
much larger window and pending data is only sent via TCP window probes
(so one byte every few seconds).

Fix this by adding a new TCP pcb flag (TF_PREVVALID) that indicates that
the various snd_*_prev fields in the pcb are valid and only perform
"bad retransmit recovery" if this flag is set in the pcb.  The flag is set
on the first retransmit timeout that occurs and is cleared on subsequent
retransmit timeouts or when entering the persist state.

Reviewed by:	bz
MFC after:	2 weeks
2011-04-29 15:40:12 +00:00
John Baldwin
da84b2e6c5 When checking to see if a window update should be sent to the remote peer,
don't force a window update if the window would not actually grow due to
window scaling.  Specifically, if the window scaling factor is larger than
2 * MSS, then after the local reader has drained 2 * MSS bytes from the
socket, a window update can end up advertising the same window.  If this
happens, the supposed window update actually ends up being a duplicate ACK.
This can result in an excessive number of duplicate ACKs when using a
higher maximum socket buffer size.

Reviewed by:	bz
MFC after:	1 month
2011-04-18 17:43:16 +00:00
Lawrence Stewart
39bc9de532 - Add some helper hook points to the TCP stack. The hooks allow Khelp modules to
access inbound/outbound events and associated data for established TCP
  connections. The hooks only run if at least one hook function is registered
  for the hook point, ensuring the impact on the stack is effectively nil when
  no TCP Khelp modules are loaded. struct tcp_hhook_data is passed as contextual
  data to any registered Khelp module hook functions.

- Add an OSD (Object Specific Data) pointer to struct tcpcb to allow Khelp
  modules to associate per-connection data with the TCP control block.

- Bump __FreeBSD_version and add a note to UPDATING regarding to ABI changes
  introduced by this commit and r216753.

In collaboration with:	David Hayes <dahayes at swin edu au> and
				Grenville Armitage <garmitage at swin edu au>
Sponsored by:	FreeBSD Foundation
Reviewed by:	bz, others along the way
MFC after:	3 months
2010-12-28 12:13:30 +00:00
Lawrence Stewart
2ea8da28e9 - Reinstantiate the after_idle hook call in tcp_output(), which got lost
somewhere along the way due to mismerging r211464 in our development tree.

- Capture the essence of r211464 in NewReno's after_idle() hook. We don't
  use V_ss_fltsz/V_ss_fltsz_local yet which needs to be revisited.

Sponsored by:	FreeBSD Foundation
Submitted by:	David Hayes <dahayes at swin edu au>
MFC after:	9 weeks
X-MFC with:	r215166
2010-12-02 01:36:00 +00:00
George V. Neville-Neil
f5d34df525 Add new, per connection, statistics for TCP, including:
Retransmitted Packets
Zero Window Advertisements
Out of Order Receives

These statistics are available via the -T argument to
netstat(1).
MFC after:	2 weeks
2010-11-17 18:55:12 +00:00
Lawrence Stewart
dbc4240942 This commit marks the first formal contribution of the "Five New TCP Congestion
Control Algorithms for FreeBSD" FreeBSD Foundation funded project. More details
about the project are available at: http://caia.swin.edu.au/freebsd/5cc/

- Add a KPI and supporting infrastructure to allow modular congestion control
  algorithms to be used in the net stack. Algorithms can maintain per-connection
  state if required, and connections maintain their own algorithm pointer, which
  allows different connections to concurrently use different algorithms. The
  TCP_CONGESTION socket option can be used with getsockopt()/setsockopt() to
  programmatically query or change the congestion control algorithm respectively
  from within an application at runtime.

- Integrate the framework with the TCP stack in as least intrusive a manner as
  possible. Care was also taken to develop the framework in a way that should
  allow integration with other congestion aware transport protocols (e.g. SCTP)
  in the future. The hope is that we will one day be able to share a single set
  of congestion control algorithm modules between all congestion aware transport
  protocols.

- Introduce a new congestion recovery (TF_CONGRECOVERY) state into the TCP stack
  and use it to decouple the meaning of recovery from a congestion event and
  recovery from packet loss (TF_FASTRECOVERY) a la RFC2581. ECN and delay based
  congestion control protocols don't generally need to recover from packet loss
  and need a different way to note a congestion recovery episode within the
  stack.

- Remove the net.inet.tcp.newreno sysctl, which simplifies some portions of code
  and ensures the stack always uses the appropriate mechanisms for recovering
  from packet loss during a congestion recovery episode.

- Extract the NewReno congestion control algorithm from the TCP stack and
  massage it into module form. NewReno is always built into the kernel and will
  remain the default algorithm for the forseeable future. Implementations of
  additional different algorithms will become available in the near future.

- Bump __FreeBSD_version to 900025 and note in UPDATING that rebuilding code
  that relies on the size of "struct tcpcb" is required.

Many thanks go to the Cisco University Research Program Fund at Community
Foundation Silicon Valley and the FreeBSD Foundation. Their support of our work
at the Centre for Advanced Internet Architectures, Swinburne University of
Technology is greatly appreciated.

In collaboration with:	David Hayes <dahayes at swin edu au> and
			Grenville Armitage <garmitage at swin edu au>
Sponsored by:	Cisco URP, FreeBSD Foundation
Reviewed by:	rpaulo
Tested by:	David Hayes (and many others over the years)
MFC after:	3 months
2010-11-12 06:41:55 +00:00
Andre Oppermann
ed42031102 Rearrange the TSO code to make it more readable and to clearly
separate the decision logic, of whether we can do TSO, and the
calculation of the burst length into two distinct parts.

Change the way the TSO burst length calculation is done. While
TSO could do bursts of 65535 bytes that can't be represented in
ip_len together with the IP and TCP header. Account for that and
use IP_MAXPACKET instead of TCP_MAXWIN as base constant (both
have the same value of 64K). When more data is available prevent
less than MSS sized segments from being sent during the current
TSO burst.

Add two more KASSERTs to ensure the integrity of the packets.

Tested by:	Ben Wilber <ben-at-desync com>
MFC after:	10 days
2010-09-17 22:05:27 +00:00
Andre Oppermann
1c18314d17 Remove the TCP inflight bandwidth limiter as announced in r211315
to give way for the pluggable congestion control framework.  It is
the task of the congestion control algorithm to set the congestion
window and amount of inflight data without external interference.

In 'struct tcpcb' the variables previously used by the inflight
limiter are renamed to spares to keep the ABI intact and to have
some more space for future extensions.

In 'struct tcp_info' the variable 'tcpi_snd_bwnd' is not removed to
preserve the ABI.  It is always set to 0.

In siftr.c in 'struct pkt_node' the variable 'snd_bwnd' is not removed
to preserve the ABI.  It is always set to 0.

These unused variable in the various structures may be reused in the
future or garbage collected before the next release or at some other
point when an ABI change happens anyway for other reasons.

No MFC is planned.  The inflight bandwidth limiter stays disabled by
default in the other branches but remains available.
2010-09-16 21:06:45 +00:00
Andre Oppermann
c3f0bdc66b If a TCP connection has been idle for one retransmit timeout or more
it must reset its congestion window back to the initial window.

RFC3390 has increased the initial window from 1 segment to up to
4 segments.

The initial window increase of RFC3390 wasn't reflected into the
restart window which remained at its original defaults of 4 segments
for local and 1 segment for all other connections.  Both values are
controllable through sysctl net.inet.tcp.local_slowstart_flightsize
and net.inet.tcp.slowstart_flightsize.

The increase helps TCP's slow start algorithm to open up the congestion
window much faster.

Reviewed by:	lstewart
MFC after:	1 week
2010-08-18 18:05:54 +00:00
Andre Oppermann
e4e9266071 Fix the interaction between 'ICMP fragmentation needed' MTU updates,
path MTU discovery and the tcp_minmss limiter for very small MTU's.

When the MTU suggested by the gateway via ICMP, or if there isn't
any the next smaller step from ip_next_mtu(), is lower than the
floor enforced by net.inet.tcp.minmss (default 216) the value is
ignored and the default MSS (512) is used instead.  However the
DF flag in the IP header is still set in tcp_output() preventing
fragmentation by the gateway.

Fix this by using tcp_minmss as the MSS and clear the DF flag if
the suggested MTU is too low.  This turns off path MTU dissovery
for the remainder of the session and allows fragmentation to be
done by the gateway.

Only MTU's smaller than 256 are affected.  The smallest official
MTU specified is for AX.25 packet radio at 256 octets.

PR:		kern/146628
Tested by:	Matthew Luckie <mjl-at-luckie org nz>
MFC after:	1 week
2010-08-15 13:25:18 +00:00
Andre Oppermann
153e5b57af When using TSO and sending more than TCP_MAXWIN sendalot is set
and we loop back to 'again'.  If the remainder is less or equal
to one full segment, the TSO flag was not cleared even though
it isn't necessary anymore.  Enabling the TSO flag on a segment
that doesn't require any offloaded segmentation by the NIC may
cause confusion in the driver or hardware.

Reset the internal tso flag in tcp_output() on every iteration
of sendalot.

PR:		kern/132832
Submitted by:	Renaud Lienhart <renaud-at-vmware com>
MFC after:	1 week
2010-08-14 21:41:33 +00:00
Bjoern A. Zeeb
82cea7e6f3 MFP4: @176978-176982, 176984, 176990-176994, 177441
"Whitspace" churn after the VIMAGE/VNET whirls.

Remove the need for some "init" functions within the network
stack, like pim6_init(), icmp_init() or significantly shorten
others like ip6_init() and nd6_init(), using static initialization
again where possible and formerly missed.

Move (most) variables back to the place they used to be before the
container structs and VIMAGE_GLOABLS (before r185088) and try to
reduce the diff to stable/7 and earlier as good as possible,
to help out-of-tree consumers to update from 6.x or 7.x to 8 or 9.

This also removes some header file pollution for putatively
static global variables.

Revert VIMAGE specific changes in ipfilter::ip_auth.c, that are
no longer needed.

Reviewed by:	jhb
Discussed with:	rwatson
Sponsored by:	The FreeBSD Foundation
Sponsored by:	CK Software GmbH
MFC after:	6 days
2010-04-29 11:52:42 +00:00
Kenneth D. Merry
3579cf4c4f Don't clear other flags (e.g. CSUM_TCP) when setting CSUM_TSO. This was
causing TSO to break for the Xen netfront driver.

Reviewed by:	gibbs, rwatson
MFC after:	7 days
2010-04-19 15:15:36 +00:00
John Baldwin
c6d9480519 Several years ago a feature was added to TCP that casued soreceive() to
send an ACK right away if data was drained from a TCP socket that had
previously advertised a zero-sized window.  The current code requires the
receive window to be exactly zero for this to kick in.  If window scaling is
enabled and the window is smaller than the scale, then the effective window
that is advertised is zero.  However, in that case the zero-sized window
handling is not enabled because the window is not exactly zero.  The fix
changes the code to check the raw window value against zero.

Reviewed by:	bz
MFC after:	1 week
2009-11-06 16:55:05 +00:00
Robert Watson
530c006014 Merge the remainder of kern_vimage.c and vimage.h into vnet.c and
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks.  Minor cleanups are done in the process,
and comments updated to reflect these changes.

Reviewed by:	bz
Approved by:	re (vimage blanket)
2009-08-01 19:26:27 +00:00
Robert Watson
eddfbb763d Build on Jeff Roberson's linker-set based dynamic per-CPU allocator
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator.  Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...).  This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.

Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack.  Virtualized global variables are
tagged at compile-time, placing the in a special linker set, which is
loaded into a contiguous region of kernel memory.  Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of a the kernel linker.

Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy.  Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address.  When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.

This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.

Bump __FreeBSD_version and update UPDATING.

Portions submitted by:  bz
Reviewed by:            bz, zec
Discussed with:         gnn, jamie, jeff, jhb, julian, sam
Suggested by:           peter
Approved by:            re (kensmith)
2009-07-14 22:48:30 +00:00
John Baldwin
6b0c5521b5 Trim extra sets of ()'s.
Requested by:	bde
2009-06-16 19:00:48 +00:00
Robert Watson
bcf11e8d00 Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with:	pjd
2009-06-05 14:55:22 +00:00
Robert Watson
78b5071407 Update stats in struct tcpstat using two new macros, TCPSTAT_ADD() and
TCPSTAT_INC(), rather than directly manipulating the fields across the
kernel.  This will make it easier to change the implementation of
these statistics, such as using per-CPU versions of the data structures.

MFC after:	3 days
2009-04-11 22:07:19 +00:00
Bjoern A. Zeeb
4b79449e2f Rather than using hidden includes (with cicular dependencies),
directly include only the header files needed. This reduces the
unneeded spamming of various headers into lots of files.

For now, this leaves us with very few modules including vnet.h
and thus needing to depend on opt_route.h.

Reviewed by:	brooks, gnn, des, zec, imp
Sponsored by:	The FreeBSD Foundation
2008-12-02 21:37:28 +00:00
Bjoern A. Zeeb
5cd54324ee Replace most INP_CHECK_SOCKAF() uses checking if it is an
IPv6 socket by comparing a constant inp vflag.
This is expected to help to reduce extra locking.

Suggested by:	rwatson
Reviewed by:	rwatson
MFC after:	6 weeks
2008-11-27 13:19:42 +00:00
Marko Zec
44e33a0758 Change the initialization methodology for global variables scheduled
for virtualization.

Instead of initializing the affected global variables at instatiation,
assign initial values to them in initializer functions.  As a rule,
initialization at instatiation for such variables should never be
introduced again from now on.  Furthermore, enclose all instantiations
of such global variables in #ifdef VIMAGE_GLOBALS blocks.

Essentialy, this change should have zero functional impact.  In the next
phase of merging network stack virtualization infrastructure from
p4/vimage branch, the new initialization methology will allow us to
switch between using global variables and their counterparts residing in
virtualization containers with minimum code churn, and in the long run
allow us to intialize multiple instances of such container structures.

Discussed at:	devsummit Strassburg
Reviewed by:	bz, julian
Approved by:	julian (mentor)
Obtained from:	//depot/projects/vimage-commit2/...
X-MFC after:	never
Sponsored by:	NLnet Foundation, The FreeBSD Foundation
2008-11-19 09:39:34 +00:00
Marko Zec
8b615593fc Step 1.5 of importing the network stack virtualization infrastructure
from the vimage project, as per plan established at devsummit 08/08:
http://wiki.freebsd.org/Image/Notes200808DevSummit

Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator
macros, and CURVNET_SET() context setting macros, all currently
resolving to NOPs.

Prepare for virtualization of selected SYSCTL objects by introducing a
family of SYSCTL_V_*() macros, currently resolving to their global
counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT().

Move selected #defines from sys/sys/vimage.h to newly introduced header
files specific to virtualized subsystems (sys/net/vnet.h,
sys/netinet/vinet.h etc.).

All the changes are verified to have zero functional impact at this
point in time by doing MD5 comparision between pre- and post-change
object files(*).

(*) netipsec/keysock.c did not validate depending on compile time options.

Implemented by:	julian, bz, brooks, zec
Reviewed by:	julian, bz, brooks, kris, rwatson, ...
Approved by:	julian (mentor)
Obtained from:	//depot/projects/vimage-commit2/...
X-MFC after:	never
Sponsored by:	NLnet Foundation, The FreeBSD Foundation
2008-10-02 15:37:58 +00:00
Bjoern A. Zeeb
3418daf2f1 Implement IPv6 support for TCP MD5 Signature Option (RFC 2385)
the same way it has been implemented for IPv4.

Reviewed by:	bms (skimmed)
Tested by:	Nick Hilliard (nick netability.ie) (with more changes)
MFC after:	2 months
2008-09-13 17:26:46 +00:00
Bjoern A. Zeeb
c4982fae59 Add a second KASSERT checking for len >= 0 in the tcp output path.
This is different to the first one (as len gets updated between those
two) and would have caught various edge cases (read bugs) at a well
defined place I had been debugging the last months instead of
triggering (random) panics further down the call graph.

MFC after:	2 months
2008-09-07 11:38:30 +00:00
Bjoern A. Zeeb
603724d3ab Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).

This is the first in a series of commits over the course
of the next few weeks.

Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.

We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.

Obtained from:	//depot/projects/vimage-commit2/...
Reviewed by:	brooks, des, ed, mav, julian,
		jamie, kris, rwatson, zec, ...
		(various people I forgot, different versions)
		md5 (with a bit of help)
Sponsored by:	NLnet Foundation, The FreeBSD Foundation
X-MFC after:	never
V_Commit_Message_Reviewed_By:	more people than the patch
2008-08-17 23:27:27 +00:00
Rui Paulo
f2512ba12a MFp4 (//depot/projects/tcpecn/):
TCP ECN support. Merge of my GSoC 2006 work for NetBSD.
  TCP ECN is defined in RFC 3168.

Partly reviewed by:	dwmalone, silby
Obtained from:		NetBSD
2008-07-31 15:10:09 +00:00
Rui Paulo
b27227029b Fix commment in typo.
M    tcp_output.c
2008-07-15 10:32:35 +00:00
Robert Watson
8501a69cc9 Convert pcbinfo and inpcb mutexes to rwlocks, and modify macros to
explicitly select write locking for all use of the inpcb mutex.
Update some pcbinfo lock assertions to assert locked rather than
write-locked, although in practice almost all uses of the pcbinfo
rwlock main exclusive, and all instances of inpcb lock acquisition
are exclusive.

This change should introduce (ideally) little functional change.
However, it lays the groundwork for significantly increased
parallelism in the TCP/IP code.

MFC after:	3 months
Tested by:	kris (superset of committered patch)
2008-04-17 21:38:18 +00:00
Andre Oppermann
3a4018c4e8 Remove TCP options ordering assumptions in tcp_addoptions(). Ordering
was changed in rev. 1.161 of tcp_var.h.  All option now test for sufficient
space in TCP header before getting added.

Reported by:	Mark Atkinson <atkin901-at-yahoo.com>
Tested by:	Mark Atkinson <atkin901-at-yahoo.com>
MFC after:	1 week
2008-04-07 19:09:23 +00:00
Andre Oppermann
5b2e33eab5 Remove now unnecessary comment. 2008-04-07 18:50:05 +00:00
Andre Oppermann
c343c524e1 Use #defines for TCP options padding after EOL to be consistent.
Reviewed by:	bz
2008-04-07 18:43:59 +00:00
Bjoern A. Zeeb
413deb1262 Padding after EOL option must be zeros according to RFC793 but
the NOPs used are 0x01.
While we could simply pad with EOLs (which are 0x00), rather use an
explicit 0x00 constant there to not confuse poeple with 'EOL padding'.
Put in a comment saying just that.

Problem discussed on:	src-committers with andre, silby, dwhite as
			follow up to the rev. 1.161 commit of tcp_var.h.
MFC after:		11 days
2008-03-09 13:26:50 +00:00
Bjoern A. Zeeb
ee763d0d9c Centralize and correct computation of TCP-MD5 signature offset within
the packet (tcp header options field).

Reviewed by:	tools/regression/netinet/tcpconnect
MFC after:	3 days
Tested by:	Nick Hilliard (see net@)
2007-11-30 23:46:51 +00:00
Bjoern A. Zeeb
4a411b9fcc Let opt be an array. Though &opt[0] == opt == &opt, &opt is highly
confusing and hard to understand so change it to just opt and
remove the extra cast no longer/not needed.

Discussed with: rwatson
MFC after:      3 days
2007-11-28 13:33:27 +00:00
Bjoern A. Zeeb
9ad0173df1 Make TSO work with IPSEC compiled into the kernel.
The lookup hurts a bit for connections but had been there anyway
if IPSEC was compiled in. So moving the lookup up a bit gives us
TSO support at not extra cost.

PR:		kern/115586
Tested by:	gallatin
Discussed with:	kmacy
MFC after:	2 months
2007-11-21 22:30:14 +00:00
Robert Watson
30d239bc4c Merge first in a series of TrustedBSD MAC Framework KPI changes
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:

  mac_<object>_<method/action>
  mac_<object>_check_<method/action>

The previous naming scheme was inconsistent and mostly
reversed from the new scheme.  Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier.  Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods.  Also simplify, slightly,
some entry point names.

All MAC policy modules will need to be recompiled, and modules
not updates as part of this commit will need to be modified to
conform to the new KPI.

Sponsored by:	SPARTA (original patches against Mac OS X)
Obtained from:	TrustedBSD Project, Apple Computer
2007-10-24 19:04:04 +00:00
Mike Silbersack
4b421e2daa Add FBSDID to all files in netinet so that people can more
easily include file version information in bug reports.

Approved by:	re (kensmith)
2007-10-07 20:44:24 +00:00
George V. Neville-Neil
b2630c2934 Commit the change from FAST_IPSEC to IPSEC. The FAST_IPSEC
option is now deprecated, as well as the KAME IPsec code.
What was FAST_IPSEC is now IPSEC.

Approved by: re
Sponsored by: Secure Computing
2007-07-03 12:13:45 +00:00
George V. Neville-Neil
2cb64cb272 Commit IPv6 support for FAST_IPSEC to the tree.
This commit includes only the kernel files, the rest of the files
will follow in a second commit.

Reviewed by:    bz
Approved by:    re
Supported by:   Secure Computing
2007-07-01 11:41:27 +00:00
Andre Oppermann
104ebb2a45 Make the handling of the tcp window explicit for the SYN_SENT case
in tcp_outout().  This is currently not strictly necessary but paves
the way to simplify the entire SYN options handling quite a bit.
Clarify comment.  No change in effective behavour with this commit.

RFC1323 requires the window field in a SYN (i.e., a <SYN> or
<SYN,ACK>) segment itself never be scaled.
2007-06-09 21:19:12 +00:00
Andre Oppermann
b7de7d87a0 Don't send pure window updates when the peer has closed the connection
and won't ever send more data.
2007-06-09 19:39:14 +00:00
John Baldwin
0ba5d2eedb Fix statistical accounting for bytes and packets during sack retransmits.
MFC after:	1 week
Submitted by:	mohans
2007-05-18 19:56:24 +00:00
Andre Oppermann
4b8e42baab Fix an incorrect replace of a timer reference made during the TCP timer
rewrite in rev. 1.132.  This unmasked yet another bug that causes certain
connections to get indefinately stuck in LAST_ACK state.
2007-05-10 23:11:29 +00:00
Andre Oppermann
3529149e9a Use existing TF_SACK_PERMIT flag in struct tcpcb t_flags field instead of
a decdicated sack_enable int for this bool.  Change all users accordingly.
2007-05-06 15:56:31 +00:00
Andre Oppermann
0d957bba48 o Remove unused and redundant TCP option definitions
o Replace usage of MAX_TCPOPTLEN with the correctly constructed and
  derived MAX_TCPOPTLEN
2007-04-20 15:08:09 +00:00
Andre Oppermann
b8152ba793 Change the TCP timer system from using the callout system five times
directly to a merged model where only one callout, the next to fire,
is registered.

Instead of callout_reset(9) and callout_stop(9) the new function
tcp_timer_activate() is used which then internally manages the callout.

The single new callout is a mutex callout on inpcb simplifying the
locking a bit.

tcp_timer() is the called function which handles all race conditions
in one place and then dispatches the individual timer functions.

Reviewed by:	rwatson (earlier version)
2007-04-11 09:45:16 +00:00
Andre Oppermann
5dd9dfefd6 Retire unused TCP_SACK_DEBUG. 2007-04-04 14:44:15 +00:00
Andre Oppermann
ad3f9ab320 ANSIfy function declarations and remove register keywords for variables.
Consistently apply style to all function declarations.
2007-03-21 19:37:55 +00:00
Andre Oppermann
eec9d82d8e Subtract optlen in the maximum length check for TSO and finally avoid
slightly oversized TSO mbuf chains.

Submitted by:	kmacy
2007-03-21 19:04:07 +00:00
Andre Oppermann
8b8ed7a78e Match up SYSCTL_INT declarations in style. 2007-03-19 18:42:27 +00:00
Andre Oppermann
4e02375908 Maintain a pointer and offset pair into the socket buffer mbuf chain to
avoid traversal of the entire socket buffer for larger offsets on stream
sockets.

Adjust tcp_output() make use of it.

Tested by:	gallatin
2007-03-19 18:35:13 +00:00
Andre Oppermann
02a1a64357 Consolidate insertion of TCP options into a segment from within tcp_output()
and syncache_respond() into its own generic function tcp_addoptions().

tcp_addoptions() is alignment agnostic and does optimal packing in all cases.

In struct tcpopt rename to_requested_s_scale to just to_wscale.

Add a comment with quote from RFC1323: "The Window field in a SYN (i.e.,
a <SYN> or <SYN,ACK>) segment itself is never scaled."

Reviewed by:	silby, mohans, julian
Sponsored by:	TCP/IP Optimization Fundraise 2005
2007-03-15 15:59:28 +00:00
Andre Oppermann
6aa5b62315 Prevent TSO mbuf chain from overflowing a few bytes by subtracting the
TCP options size before the TSO total length calculation.

Bug found by:	kmacy
2007-03-01 13:12:09 +00:00
Gleb Smirnoff
8bec3467b1 Add EHOSTDOWN and ENETUNREACH to the list of soft errors, that shouldn't
be returned up to the caller.

PR:		100172
Submitted by:	"Andrew - Supernews" <andrew supernews.net>
Reviewed by:	rwatson, bms
2007-02-28 12:47:49 +00:00
Gleb Smirnoff
72757d9a53 Toss the code, that handles errors from ip_output(), to make it more
readable:
- Merge two embedded if() into one.
- Introduce switch() block to handle different kinds of errors.

Reviewed by:	rwatson, bms
2007-02-28 12:41:49 +00:00