Commit Graph

4033 Commits

Author SHA1 Message Date
Marko Zec
0593983963 Remove an apparently redundant CURVNET_SET() / CURVNET_RESTORE() pair.
MFC after:	3 days
2010-11-22 14:16:23 +00:00
Lawrence Stewart
92ea5581dd Fix a minor code redundancy nit.
MFC after:	3 days
2010-11-20 08:40:37 +00:00
Lawrence Stewart
052aec123c When enabling or disabling SIFTR with a VIMAGE kernel, ensure we add or remove
the SIFTR pfil(9) hook functions to or from all network stacks. This patch
allows packets inbound or outbound from a vnet to be "seen" by SIFTR.

Additional work is required to allow SIFTR to actually generate log messages for
all vnet related packets because the siftr_findinpcb() function does not yet
search for inpcbs across all vnets. This issue will be fixed separately.

Reported and tested by:	David Hayes <dahayes at swin edu au>
MFC after:	3 days
2010-11-20 07:36:43 +00:00
George V. Neville-Neil
f5d34df525 Add new, per connection, statistics for TCP, including:
Retransmitted Packets
Zero Window Advertisements
Out of Order Receives

These statistics are available via the -T argument to
netstat(1).
MFC after:	2 weeks
2010-11-17 18:55:12 +00:00
Michael Tuexen
6a67588bbb Add an SCTP socket option to retrieve the number of timeouts
of an association.

MFC after: 3 days.
2010-11-16 22:16:38 +00:00
Lawrence Stewart
78b01840af Make the CC framework more VIMAGE friendly by adding the machinery to allow
vnets to select their own default CC algorithm independent of each other and the
base system. If the base system or a vnet has set a default which gets unloaded,
we reset that netstack's default to NewReno.

Sponsored by:	FreeBSD Foundation
Tested by:	Mikolaj Golub <to.my.trociny at gmail com>
Reviewed by:	bz (briefly)
MFC after:	3 months
2010-11-16 09:34:31 +00:00
Lawrence Stewart
ebf92e869f - Querying the default CC algo is more common than setting it and the function
is small, so there is no good reason not to declare the buffer at the top.

- Fix a whitespace nit.

Sponsored by:	FreeBSD Foundation
MFC after:	11 weeks
X-MFC with:	r215166
2010-11-16 08:43:25 +00:00
Lawrence Stewart
99065ae6a8 Move protocol specific implementation detail out of the core CC framework.
Sponsored by:	FreeBSD Foundation
Tested by:	Mikolaj Golub <to.my.trociny at gmail com>
MFC after:	11 weeks
X-MFC with:	r215166
2010-11-16 08:30:39 +00:00
Lawrence Stewart
4e805854ed On CC algorithm module unload, we walk the list of active TCP control blocks.
Any found to be using the algorithm that is about to go away are switched back
to NewReno to avoid leaving dangling pointers which would trigger a panic. For
VIMAGE kernels, there is a list per vnet to walk, yet the implementation was
only examining one of the vnet lists.

Fix the implementation of the above feature for VIMAGE kernels by looping
through all active TCP control blocks across all vnets.

Sponsored by:	FreeBSD Foundation
Tested by:	Mikolaj Golub <to.my.trociny at gmail com>
Reviewed by:	bz (briefly)
MFC after:	11 weeks
2010-11-16 07:57:56 +00:00
Lawrence Stewart
14f57a8b02 cc_init() should only be run once on system boot, but with VIMAGE kernels it
runs on boot and each time a vnet jail is created. Running cc_init() multiple
times results in a panic when attempting to initialise the cc_list lock again,
and so r215166 effectively broke the use of vnet jails.

Switch to using a SYSINIT to run cc_init() on boot. CC algorithm modules loaded
on boot register in the same SI_SUB_PROTO_IFATTACHDOMAIN category as is used in
this patch, so cc_init() is run at SI_ORDER_FIRST to ensure the framework is
initialised before module registration is attempted.

Sponsored by:	FreeBSD Foundation
Reported and tested by:	Mikolaj Golub <to.my.trociny at gmail com>
MFC after:	11 weeks
X-MFC with:	r215166
2010-11-16 07:09:05 +00:00
Dimitry Andric
31c6a0037e Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout
the tree.
2010-11-14 20:38:11 +00:00
Michael Tuexen
e635c7b881 Take out special code for disable CRC computations on
the loopback interface for IPv6. It will be handled
by the loopback interface.
2010-11-14 16:44:18 +00:00
Michael Tuexen
cafa98a989 Simplify sctp_delayed_cksum() a bit.
MFC after: 3 days.
2010-11-14 14:37:20 +00:00
Michael Tuexen
27387daca6 Fix a locking issue reported by brucec@ affecting
1-to-1 style sockets which have not yet been
accepted.

MFC after: 3 days.
2010-11-13 12:52:44 +00:00
George V. Neville-Neil
e162ea60d4 Add a queue to hold packets while we await an ARP reply.
When a fast machine first brings up some non TCP networking program
it is quite possible that we will drop packets due to the fact that
only one packet can be held per ARP entry.  This leads to packets
being missed when a program starts or restarts if the ARP data is
not currently in the ARP cache.

This code adds a new sysctl, net.link.ether.inet.maxhold, which defines
a system wide maximum number of packets to be held in each ARP entry.
Up to maxhold packets are queued until an ARP reply is received or
the ARP times out.  The default setting is the old value of 1
which has been part of the BSD networking code since time
immemorial.

Expose the time we hold an incomplete ARP entry by adding
the sysctl net.link.ether.inet.wait, which defaults to 20
seconds, the value used when the new ARP code was added..

Reviewed by:	bz, rpaulo
MFC after: 3 weeks
2010-11-12 22:03:02 +00:00
Michael Tuexen
448a42a61e Don't print an empty line when printing mapping arrays.
MFC after: 3 days.
2010-11-12 20:46:33 +00:00
Michael Tuexen
4ce091cda9 Fix more issues with the SACK/NR-SACK generation code.
MFC after: 3 days.
2010-11-12 20:45:21 +00:00
Luigi Rizzo
ae99fd0e07 The first customer of the SO_USER_COOKIE option:
the "sockarg" ipfw option matches packets associated to
a local socket and with a non-zero so_user_cookie value.
The value is made available as tablearg, so it can be used
as a skipto target or pipe number in ipfw/dummynet rules.

Code by Paul Joe, manpage by me.

Submitted by:	Paul Joe
MFC after:	1 week
2010-11-12 13:05:17 +00:00
Lawrence Stewart
dbc4240942 This commit marks the first formal contribution of the "Five New TCP Congestion
Control Algorithms for FreeBSD" FreeBSD Foundation funded project. More details
about the project are available at: http://caia.swin.edu.au/freebsd/5cc/

- Add a KPI and supporting infrastructure to allow modular congestion control
  algorithms to be used in the net stack. Algorithms can maintain per-connection
  state if required, and connections maintain their own algorithm pointer, which
  allows different connections to concurrently use different algorithms. The
  TCP_CONGESTION socket option can be used with getsockopt()/setsockopt() to
  programmatically query or change the congestion control algorithm respectively
  from within an application at runtime.

- Integrate the framework with the TCP stack in as least intrusive a manner as
  possible. Care was also taken to develop the framework in a way that should
  allow integration with other congestion aware transport protocols (e.g. SCTP)
  in the future. The hope is that we will one day be able to share a single set
  of congestion control algorithm modules between all congestion aware transport
  protocols.

- Introduce a new congestion recovery (TF_CONGRECOVERY) state into the TCP stack
  and use it to decouple the meaning of recovery from a congestion event and
  recovery from packet loss (TF_FASTRECOVERY) a la RFC2581. ECN and delay based
  congestion control protocols don't generally need to recover from packet loss
  and need a different way to note a congestion recovery episode within the
  stack.

- Remove the net.inet.tcp.newreno sysctl, which simplifies some portions of code
  and ensures the stack always uses the appropriate mechanisms for recovering
  from packet loss during a congestion recovery episode.

- Extract the NewReno congestion control algorithm from the TCP stack and
  massage it into module form. NewReno is always built into the kernel and will
  remain the default algorithm for the forseeable future. Implementations of
  additional different algorithms will become available in the near future.

- Bump __FreeBSD_version to 900025 and note in UPDATING that rebuilding code
  that relies on the size of "struct tcpcb" is required.

Many thanks go to the Cisco University Research Program Fund at Community
Foundation Silicon Valley and the FreeBSD Foundation. Their support of our work
at the Centre for Advanced Internet Architectures, Swinburne University of
Technology is greatly appreciated.

In collaboration with:	David Hayes <dahayes at swin edu au> and
			Grenville Armitage <garmitage at swin edu au>
Sponsored by:	Cisco URP, FreeBSD Foundation
Reviewed by:	rpaulo
Tested by:	David Hayes (and many others over the years)
MFC after:	3 months
2010-11-12 06:41:55 +00:00
Lawrence Stewart
619ad9eb3e Standardise all Swinburne related copyright/licence statements throughout the
tree in preparation for another large code import. Swinburne University is the
legal entity that owns copyright and the 2-clause BSD licence is acceptable.
2010-11-12 00:44:18 +00:00
Lawrence Stewart
67f285a22e The university does not require that its CRICOS number be included in source
code. Remove all references from the tree.

MFC after:	3 days
2010-11-12 00:19:42 +00:00
Michael Tuexen
eefcb5cd2a Fix the SACK/NR-SACK generation code.
MFC after: 3 days.
2010-11-11 18:41:03 +00:00
Randall Stewart
04215ed220 Fix so that a multicast packet can be sent
even if there is no route out to that mcast address. The code in
in_pcb inadvertantly would error (no route) even though
the user may have specified the address with the
proper socket option (to specify the egress interface).
Thanks bz for reminding me I forgot to commit this ;-)

Reviewed by:	bz
MFC after:	1 week
2010-11-11 05:40:39 +00:00
Michael Tuexen
034b88b092 Improve the scalability by using the local and remote port when
putting inps in the tcpephash.

MFC after: 3 days.
2010-11-09 16:18:32 +00:00
Michael Tuexen
8b4da1c3d9 Fix a bug which resulted in kevent() reporting an event twice on
1-to-1 style sockets when an ABORT was received.

MFC after: 3 days.
2010-11-09 12:00:39 +00:00
Rebecca Cran
b1ce21c6ef Fix typos.
PR:	bin/148894
Submitted by:	olgeni
2010-11-09 10:59:09 +00:00
Michael Tuexen
437fc91ae6 Do not have the MTU table twice in the code. Therefore move the
function from the timer code to util, rename it appropriately and
also fix a bug in sctp_get_prev_mtu(), where calling it with a
value existing in the MTU table did not return a smaller one.

MFC after: 3 days.
2010-11-07 18:50:35 +00:00
Michael Tuexen
c7532199ea Remove two functions which are not used.
MFC after: 3 days.
2010-11-07 17:50:56 +00:00
Michael Tuexen
b61c358887 * Use exponential backoff for retransmission of SHUTDOWN and
SHUTDOWN-ACK chunks.
* While there, do some cleanups.

MFC after: 3 days.
2010-11-07 17:44:04 +00:00
Michael Tuexen
12af6654a3 Not only stop all timers when entering the SHUTDOWN_SENT state,
but also when entering the SHUTDOWN_ACK_SEND state.

MFC after: 3 days.
2010-11-07 14:39:40 +00:00
Michael Tuexen
7da23bc820 Do not resend DATA chunks without delay when dropped by the peer and
the CRC was correct.

MFC after: 3 days.
2010-11-06 13:43:18 +00:00
Michael Tuexen
699437a2ba * Fix an accounting bug regarding SACK/NR-SACK chunks.
* Fix the generation of the SACK/NR-SACK gap lists.

MFC after: 3 days.
2010-11-06 13:30:54 +00:00
Nick Hibma
770c6c3310 Don't spam the console with loaded modules during boot and/or during
startup of ppp.

Note: This cannot be hidden behind bootverbose as this file is included
from lib/libalias as well.
2010-11-03 21:10:12 +00:00
John Baldwin
33b31db666 Don't leak the LLE lock if the arptimer callout is pending or inactive.
Reported by:	David Rhodus
MFC after:	1 month
2010-11-02 13:00:56 +00:00
Gleb Smirnoff
27bf126d23 Remove meaningless XXXXX, that is a remain of comment, removed in r186200. 2010-10-29 11:13:42 +00:00
Gleb Smirnoff
28e1f17c81 Revert a small part of the r198301, that is entirely unrelated to the
r198301 itself. It also broke the logic of not sending more than one
ARP request per second, that consequently lead to a potential problem
of flooding network with broadcast packets.

MFC after:	1 week
2010-10-29 10:57:18 +00:00
Bjoern A. Zeeb
0ef7c8a20b Add initial inet DDB support for show in_ifaddr and show sin commands which
proved to be useful while debugging address list problems.

MFC after:	6 days
2010-10-24 22:02:36 +00:00
Bjoern A. Zeeb
4a85b5e2ea Make the IPsec SADB embedded route cache a union to be able to hold both the
legacy and IPv6 route destination address.
Previously in case of IPv6, there was a memory overwrite due to not enough
space for the IPv6 address.

PR:		kern/122565
MFC After:	2 weeks
2010-10-23 20:35:40 +00:00
Ulrich Spörlein
7cc1fde083 mdoc: drop even more redundant .Pp calls
No change in rendered output, less mandoc lint warnings.

Tool provided by:	Nobuyuki Koganemaru n-kogane at syd.odn.ne.jp
2010-10-19 12:35:40 +00:00
Bjoern A. Zeeb
12112cf676 MfP4 CH182763 (original version):
Make it harder to exploit certain in_control() related races between the
intiial lookup at the beginning and the time we will remove the entry
from the lists by re-checking that entry is still in the list before
trying to remove it.

(*) It is believed that with the current code and locking strategy we
    cannot completely fix all race.

Reported by:	Nima Misaghian (nima_misa hotmail.com) on net@ 20100817
Tested by:	Nima Misaghian (nima_misa hotmail.com) (original version)
PR:		kern/146250
Submitted by:	Mikolaj Golub (to.my.trociny gmail.com) (different version)
MFC after:	1 week
2010-10-16 19:53:22 +00:00
Lawrence Stewart
ca09d7728b Retire the system-wide, per-reassembly queue segment limit. The mechanism is far
too coarse grained to be useful and the default value significantly degrades TCP
performance on moderate to high bandwidth-delay product paths with non-zero loss
(e.g. 5+Mbps connections across the public Internet often suffer).

Replace the outgoing mechanism with an individual per-queue limit based on the
number of MSS segments that fit into the socket's receive buffer. This should
strike a good balance between performance and the potential for resource
exhaustion when FreeBSD is acting as a TCP receiver. With socket buffer
autotuning (which is enabled by default), the reassembly queue tracks the
socket buffer and benefits too.

As the XXX comment suggests, my testing uncovered some unexpected behaviour
which requires further investigation. By using so->so_rcv.sb_hiwat
instead of sbspace(&so->so_rcv), we allow more segments to be held across both
the socket receive buffer and reassembly queue than we probably should. The
tradeoff is better performance in at least one common scenario, versus a devious
sender's ability to consume more resources on a FreeBSD receiver.

Sponsored by:	FreeBSD Foundation
Reviewed by:	andre, gnn, rpaulo
MFC after:	2 weeks
2010-10-16 07:12:39 +00:00
Lawrence Stewart
c8dc0ab886 - Switch the "net.inet.tcp.reass.cursegments" and
"net.inet.tcp.reass.maxsegments" sysctl variables to be based on UMA zone
  stats. The value returned by the cursegments sysctl is approximate owing to
  the way in which uma_zone_get_cur is implemented.

- Discontinue use of V_tcp_reass_qsize as a global reassembly segment count
  variable in the reassembly implementation. The variable was used without
  proper synchronisation and was duplicating accounting done by UMA already. The
  lack of synchronisation was particularly problematic on SMP systems
  terminating many TCP sessions, resulting in poor TCP performance for
  connections with non-zero packet loss.

Sponsored by:	FreeBSD Foundation
Reviewed by:	andre, gnn, rpaulo (as part of a larger patch)
MFC after:	2 weeks
2010-10-16 05:37:45 +00:00
Bjoern A. Zeeb
dc699bac75 Use ifa_ifwithaddr_check() rather than ifa_ifwithaddr() as we are not
interested in the result and would leak a reference otherwise.

PR:		kern/151435
Submitted by:	Andrew Boyer (aboyer averesystems.com)
MFC after:	3 days
2010-10-14 12:32:49 +00:00
Luigi Rizzo
3f18b51c8d put back the assigment to sched_time. It was correct, and
it was necessary.

Submitted by:	Riccardo Panicucci
2010-10-01 15:38:35 +00:00
Bjoern A. Zeeb
544794507a Proper bracketing.
PR:		kern/151100
Submitted by:	SunMinghao (sunminghao hotmail.com)
MFC after:	3 days
2010-10-01 11:48:14 +00:00
Luigi Rizzo
e53a34a766 remove an unnecessary (and wrong) assignment.
It was meant to reset idle_time (and it was not needed),
but i even used the wrong field.

Obtained from:	Oleg
MFC after:	3 days
2010-09-29 21:02:31 +00:00
Luigi Rizzo
38cc301f9f whitespace changes in preparation for future commits 2010-09-29 09:40:20 +00:00
Luigi Rizzo
a47ee22718 fix handling of initial credit for an idle pipe.
This fixes the bug where setting bw > 1 MTU/tick resulted in
infinite bandwidth if io_fast=1

PR:		147245 148429
Obtained from:	Riccardo Panicucci
MFC after:	3 days
2010-09-29 09:22:12 +00:00
Luigi Rizzo
8d74ca8ce9 fix breakage in in-kernel NAT: the code did not honor
net.inet.ip.fw.one_pass and always moved to the next rule
in case of a successful nat.

This should fix several related PR (waiting for feedback
before closing them)

PR:		145167 149572 150141
MFC after:	3 days
2010-09-28 23:23:23 +00:00
Luigi Rizzo
c08e545e99 Whitespace changes to reduce diffs wrt the most recent ipfw/dummynet code:
+ remove an unused macro,
+ adjust the constants in an enum
+ small whitespace changes

MFC after:	3 days
2010-09-28 22:46:13 +00:00
Xin LI
64e0f48e7c Add a bandaid for a long-standing race condition during route entry
un-expiring.

The previous version of code have no locking when testing rt_refcnt.
The result of the lack of locking may result in a condition where
a routing entry have a reference count but at the same time have
RTPRF_OURS bit set and an expiration timer.  These would eventually
lead to a panic:

	panic: rtqkill route really not free

When the system have ICMP redirects accepted from local gateway
in a moderate frequency, for instance.

Commit this workaround for now until we have some better solution.

PR:		kern/149804
Reviewed by:	bz
Tested by:	Zhao Xin, Pete French
MFC after:	2 weeks
2010-09-27 19:26:56 +00:00
Lawrence Stewart
d4d3e21865 Log the number of segments currently in the reassembly queue.
Sponsored by:	FreeBSD Foundation
2010-09-25 09:16:46 +00:00
Lawrence Stewart
0c236c4ebd Internalise reassembly queue related functionality and variables which should
not be used outside of the reassembly queue implementation. Provide a new
function to flush all segments from a reassembly queue and call it from the
appropriate places instead of manipulating the queue directly.

Sponsored by:	FreeBSD Foundation
Reviewed by:	andre, gnn, rpaulo
MFC after:	2 weeks
2010-09-25 04:58:46 +00:00
Attilio Rao
109c1de8ba Make the RPC specific __rpc_inet_ntop() and __rpc_inet_pton() general
in the kernel (just as inet_ntoa() and inet_aton()) are and sync their
prototype accordingly with already mentioned functions.

Sponsored by:	Sandvine Incorporated
Reviewed by:	emaste, rstone
Approved by:	dfr
MFC after:	2 weeks
2010-09-24 15:01:45 +00:00
Attilio Rao
5f6bf4518d IP_BINDANY is not correctly handled in getsockopt() case.
Fix it by specifying the correct bits.

Sponsored by:	Sandvine Incorporated
Reviewed by:	bz, emaste, rstone
Obtained from:	Sandvine Incorporated
MFC after:	10 days
2010-09-24 14:38:54 +00:00
Gleb Smirnoff
6baf7a243a Do not convert some meaningful error value to EINVAL.
Reviewed by:	will
2010-09-20 12:23:10 +00:00
Michael Tuexen
1ea735c802 Fix a locking issue which resulted in aborted associations
due to a corrupted nr-mapping array.

MFC after: 2 weeks.
2010-09-20 12:19:11 +00:00
Michael Tuexen
231b700b17 Allow the initial congestion window to be configure
to one MTU. Improve the description.

MFC after: 2 weeks.
2010-09-19 11:57:21 +00:00
Michael Tuexen
f8faf20cf6 Fix a locking issue which shows up when the code is used
on Mac OS X.

MFC after: 2 weeks.
2010-09-19 11:42:16 +00:00
Andre Oppermann
ed42031102 Rearrange the TSO code to make it more readable and to clearly
separate the decision logic, of whether we can do TSO, and the
calculation of the burst length into two distinct parts.

Change the way the TSO burst length calculation is done. While
TSO could do bursts of 65535 bytes that can't be represented in
ip_len together with the IP and TCP header. Account for that and
use IP_MAXPACKET instead of TCP_MAXWIN as base constant (both
have the same value of 64K). When more data is available prevent
less than MSS sized segments from being sent during the current
TSO burst.

Add two more KASSERTs to ensure the integrity of the packets.

Tested by:	Ben Wilber <ben-at-desync com>
MFC after:	10 days
2010-09-17 22:05:27 +00:00
Michael Tuexen
99ddc825f3 Fix a bug where the wrong PR-SCTP policy was considered.
While there, use always the same code for the check of
TTL expiration.

MFC after: 2 weeks.
2010-09-17 19:20:39 +00:00
Michael Tuexen
dcfc062535 Make the initial congestion window configurable via sysctl.
MFC after: 2 weeks.
2010-09-17 18:53:07 +00:00
Michael Tuexen
25a2a18706 * Implement initial version of send buffer splitting.
* Make send/recv buffer splitting switchable via sysctl.
* While there: Fix some comments.
2010-09-17 16:20:29 +00:00
Andre Oppermann
1c18314d17 Remove the TCP inflight bandwidth limiter as announced in r211315
to give way for the pluggable congestion control framework.  It is
the task of the congestion control algorithm to set the congestion
window and amount of inflight data without external interference.

In 'struct tcpcb' the variables previously used by the inflight
limiter are renamed to spares to keep the ABI intact and to have
some more space for future extensions.

In 'struct tcp_info' the variable 'tcpi_snd_bwnd' is not removed to
preserve the ABI.  It is always set to 0.

In siftr.c in 'struct pkt_node' the variable 'snd_bwnd' is not removed
to preserve the ABI.  It is always set to 0.

These unused variable in the various structures may be reused in the
future or garbage collected before the next release or at some other
point when an ABI change happens anyway for other reasons.

No MFC is planned.  The inflight bandwidth limiter stays disabled by
default in the other branches but remains available.
2010-09-16 21:06:45 +00:00
Andre Oppermann
2c9879e8d3 Improve comment to TCP_MINMSS by taking the wording from lstewart (with
a small difference in the last paragraph though) as suggested by jhb.

Clarify that the 'reviewed by' in r212653 by lstewart was for the
functional change, not the comments in the committed version.
2010-09-16 12:13:06 +00:00
Michael Tuexen
b3f7949dc5 Remove old debug code.
MFC after: 2 weeks.
2010-09-15 23:56:25 +00:00
Michael Tuexen
94b0d96992 Remove unused variable/assignment.
MFC after: 3 weeks.
2010-09-15 23:40:36 +00:00
Michael Tuexen
9eea4a2da7 Delay the assignment of a path for DATA chunk until they hit
the sent_queue. Honor a given path when the SCTP_ADDR_OVER
flag is set.

MFC after: 2 weeks.
2010-09-15 23:10:45 +00:00
Michael Tuexen
24f52bbd9b Use TAILQ_EMPTY() for testing if a tail queue is empty.
Set whoFrom to NULL after freeing whoFrom.
2010-09-15 21:53:10 +00:00
Michael Tuexen
3c8c191bae Remove unused variable/assignment.
MFC after: 2 weeks.
2010-09-15 21:19:54 +00:00
Michael Tuexen
b90b577ff3 Remove assignment without effect.
MFC after: 2 weeks.
2010-09-15 21:08:57 +00:00
Michael Tuexen
107cad7449 * Use !TAILQ_EMPTY() for checking if a tail queue is not empty.
* Remove assignment without any effect.

MFC after: 2 weeks.
2010-09-15 20:53:20 +00:00
Andre Oppermann
c183b9c683 Change the default MSS for IPv4 and IPv6 TCP connections from an
artificial power-of-2 rounded number to their real values specified
in RFC879 and RFC2460.

From the history and existing comments it appears that the rounded
numbers were intended to be advantageous for the kernel and mbuf
system.  However this hasn't been the case at for at least a long
time.  The mbuf clusters used in tcp_output() have enough space
to hold the larger real value for the default MSS for both IPv4 and
IPv6.  Note that the default MSS is only used when path MTU discovery
is disabled.

Update and expand related comments.

Reviewed by:	lsteward (including some word-smithing)
MFC after:	2 weeks
2010-09-15 10:39:30 +00:00
Qing Li
a458eaa039 Adding an address on an interface also requires the loopback route to
that address be installed.

PR:		kern/150481
Submitted by:	Ingo Flaschberger <if at xip.at>
MFC after:	5 days
2010-09-12 18:04:47 +00:00
Michael Tuexen
e95307c5c5 * Remove code which has no effect.
* Clean up the handling in sctp_lower_sosend().

MFC after: 3 weeks.
2010-09-09 20:51:23 +00:00
Will Andrews
15249f73e9 Fix CARP in backup mode by properly registering its hooks for INET and INET6
using ipproto_{un,}register() and the newly created ip6proto_{un,}register()
so that it can again receive IPPROTO_CARP packets allowing its state machine
to work.

Reviewed by:	bz
Approved by:	ken (mentor)
2010-09-06 21:06:06 +00:00
Will Andrews
e24fa11d3e Fix static kernel builds with carp(4) by changing its SYSINIT order so that
it is initialized after basic protocol initialization, which allows it to
register via pf_proto_register().

Reviewed by:	bz
Approved by:	ken (mentor)
2010-09-06 21:03:30 +00:00
Gleb Smirnoff
14a268a073 in_delayed_cksum() requires host byte order.
Reported by:	Alexander Levin <amindomao googlemail.com>
MFC after:	1 week
2010-09-06 13:17:01 +00:00
Michael Tuexen
049640c1f0 Implement correct handling of address parameter and
sendinfo for SCTP send calls.

MFC after: 4 weeks.
2010-09-05 20:13:07 +00:00
Randall Stewart
52129fcd78 Fix some CLANG warnings. One clang warning is left
due to the fact that its bogus.. nam->sa_family will
not change from AF_INET6 to AF_INET (but clang
thinks it does ;-D)
2010-09-05 13:41:45 +00:00
Bjoern A. Zeeb
42db1b87d6 In case of RADIX_MPATH do not leak the IN_IFADDR read lock on
early return.

MFC after:	3 days
2010-09-04 16:06:01 +00:00
Bjoern A. Zeeb
1b48d24533 MFp4 CH=183052 183053 183258:
In protosw we define pr_protocol as short, while on the wire
  it is an uint8_t.  That way we can have "internal" protocols
  like DIVERT, SEND or gaps for modules (PROTO_SPACER).
  Switch ipproto_{un,}register to accept a short protocol number(*)
  and do an upfront check for valid boundries. With this we
  also consistently report EPROTONOSUPPORT for out of bounds
  protocols, as we did for proto == 0.  This allows a caller
  to not error for this case, which is especially important
  if we want to automatically call these from domain handling.

  (*) the functions have been without any in-tree consumer
  since the initial introducation, so this is considered save.

  Implement ip6proto_{un,}register() similarly to their legacy IP
  counter parts to allow modules to hook up dynamically.

Reviewed by:	philip, will
MFC after:	1 week
2010-09-02 17:43:44 +00:00
Michael Tuexen
fc0487080a Fix a bug which results in peer IPv4 addresses a.b.c.d with 224<=d<=239
incorrectly being detected as multicast addresses on little endian systems.

MFC after: 2 weeks
2010-09-01 16:11:26 +00:00
Maxim Konovalov
5a47f206a1 o Some programs could send broadcast/multicast traffic to ipfw
pseudo-interface.  This leads to a panic due to uninitialized
if_broadcastaddr address.  Initialize it and implement ip_output()
method to prevent mbuf leak later.

ipfw pseudo-interface should never send anything therefore call
panic(9) in if_start() method.

PR:		kern/149807
Submitted by:	Dmitrij Tejblum
MFC after:	2 weeks
2010-08-30 09:29:51 +00:00
Michael Tuexen
9c7635e18b Fix the the SCTP_WITH_NO_CSUM option when used in combination with
interface supporting CRC offload. While at it, make use of the
feature that the loopback interface provides CRC offloading.

MFC after: 4 weeks
2010-08-29 18:50:30 +00:00
Michael Tuexen
e24ea413e0 Bugfix: Do not send a packet drop report in response to a received
INIT-ACK with incorrect CRC.
2010-08-28 21:15:00 +00:00
Michael Tuexen
20083c2eb1 Fix the switching on/off of CMT using sysctl and socket option.
Fix the switching on/off of PF and NR-SACKs using sysctl.
Add minor improvement in handling malloc failures.
Improve the address checks when sending.

MFC after: 4 weeks
2010-08-28 17:59:51 +00:00
John Baldwin
98b9eb0db2 Simplify the tcp pcblist estimate logic slightly.
MFC after:	3 days
2010-08-27 18:17:46 +00:00
Andre Oppermann
8502ec25dc Use timestamp modulo comparison macro for automatic receive buffer
scaling to correctly handle wrapping of ticks value.

MFC after:	1 week
2010-08-27 12:34:53 +00:00
Ana Kukec
1db8d1f843 MFp4: anchie_soc2009 branch:
Add kernel side support for Secure Neighbor Discovery (SeND), RFC 3971.

The implementation consists of a kernel module that gets packets from
the nd6 code, sends them to user space on a dedicated socket and reinjects
them back for further processing.

Hooks are used from nd6 code paths to divert relevant packets to the
send implementation for processing in user space.  The hooks are only
triggered if the send module is loaded. In case no user space
application is connected to the send socket, processing continues
normaly as if the module would not be loaded. Unloading the module
is not possible at this time due to missing nd6 locking.

The native SeND socket is similar to a raw IPv6 socket but with its own,
internal pseudo-protocol.

Approved by:	bz (mentor)
2010-08-19 11:31:03 +00:00
Andre Oppermann
c3f0bdc66b If a TCP connection has been idle for one retransmit timeout or more
it must reset its congestion window back to the initial window.

RFC3390 has increased the initial window from 1 segment to up to
4 segments.

The initial window increase of RFC3390 wasn't reflected into the
restart window which remained at its original defaults of 4 segments
for local and 1 segment for all other connections.  Both values are
controllable through sysctl net.inet.tcp.local_slowstart_flightsize
and net.inet.tcp.slowstart_flightsize.

The increase helps TCP's slow start algorithm to open up the congestion
window much faster.

Reviewed by:	lstewart
MFC after:	1 week
2010-08-18 18:05:54 +00:00
Andre Oppermann
b7d747ecec Untangle the net.inet.tcp.log_in_vain and net.inet.tcp.log_debug
sysctl's and remove any side effects.

Both sysctl's share the same backend infrastructure and due to the
way it was implemented enabling net.inet.tcp.log_in_vain would also
cause log_debug output to be generated.  This was surprising and
eventually annoying to the user.

The log output backend is kept the same but a little shim is inserted
to properly separate log_in_vain and log_debug and to remove any side
effects.

PR:		kern/137317
MFC after:	1 week
2010-08-18 17:39:47 +00:00
Bjoern A. Zeeb
2278f9927d When calculating the expected memory size for userspace, also take the
number of syncache entries into account for the surplus we add to account
for a possible increase of records in the re-entry window.

Discussed with:		jhb, silby
MFC after:		1 week
2010-08-18 09:28:12 +00:00
John Baldwin
c007b96a78 Ensure a minimum "slop" of 10 extra pcb structures when providing a
memory size estimate to userland for pcb list sysctls.  The previous
behavior of a "slop" of n/8 does not work well for small values of n
(e.g. no slop at all if you have less than 8 open UDP connections).

Reviewed by:	bz
MFC after:	1 week
2010-08-17 16:41:16 +00:00
Andre Oppermann
e4e9266071 Fix the interaction between 'ICMP fragmentation needed' MTU updates,
path MTU discovery and the tcp_minmss limiter for very small MTU's.

When the MTU suggested by the gateway via ICMP, or if there isn't
any the next smaller step from ip_next_mtu(), is lower than the
floor enforced by net.inet.tcp.minmss (default 216) the value is
ignored and the default MSS (512) is used instead.  However the
DF flag in the IP header is still set in tcp_output() preventing
fragmentation by the gateway.

Fix this by using tcp_minmss as the MSS and clear the DF flag if
the suggested MTU is too low.  This turns off path MTU dissovery
for the remainder of the session and allows fragmentation to be
done by the gateway.

Only MTU's smaller than 256 are affected.  The smallest official
MTU specified is for AX.25 packet radio at 256 octets.

PR:		kern/146628
Tested by:	Matthew Luckie <mjl-at-luckie org nz>
MFC after:	1 week
2010-08-15 13:25:18 +00:00
Andre Oppermann
0e678ed825 Initializing the new error variable to zero in syncache_socket()
is not necessary.

Noticed by:	bz
2010-08-15 13:07:08 +00:00
Andre Oppermann
943044b01f Add more logging points for failures in syncache_socket() to
report when a new socket couldn't be created because one of
in_pcbinshash(), in6_pcbconnect() or in_pcbconnect() failed.

Logging is conditional on net.inet.tcp.log_debug being enabled.

MFC after:	1 week
2010-08-15 09:30:13 +00:00
Andre Oppermann
153e5b57af When using TSO and sending more than TCP_MAXWIN sendalot is set
and we loop back to 'again'.  If the remainder is less or equal
to one full segment, the TSO flag was not cleared even though
it isn't necessary anymore.  Enabling the TSO flag on a segment
that doesn't require any offloaded segmentation by the NIC may
cause confusion in the driver or hardware.

Reset the internal tso flag in tcp_output() on every iteration
of sendalot.

PR:		kern/132832
Submitted by:	Renaud Lienhart <renaud-at-vmware com>
MFC after:	1 week
2010-08-14 21:41:33 +00:00
Andre Oppermann
40fe9eff47 Change the messages of the ICMP bad port bandwidth limiter from
a kernel printf to a log output with the priority of LOG_NOTICE.

This way the messages still show up in /var/log/messages but no
longer spam the console every other second on busy servers that
are port scanned:
 "Limiting open port RST response from 114 to 100 packets/sec"

PR:		kern/147352
Submitted by:	Eugene Grosbein <eugen-at-eg sd rdtc ru>
MFC after:	1 week
2010-08-14 21:04:27 +00:00
Andre Oppermann
bee4e5afa9 Disable TCP inflight limiter by default.
It was experimental and interferes with the normal congestion control
algorithms by instating a separate, possibly lower, ceiling for the
amount of data that is in flight to the remote host.  With high speed
internet connections the inflight limit frequently has been estimated
too low due to the noisy nature of the RTT measurements.

This code gives way for the upcoming pluggable congestion control
framework.  It is the task of the congestion control algorithm to
set the congestion window and amount of inflight data without external
interference.

Reviewed by:	lstewart
MFC after:	1 week
Removal after:	1 month
2010-08-14 20:40:55 +00:00
Will Andrews
9963e8a52c Unbreak LINT by moving all carp hooks to net/if.c / netinet/ip_carp.h, with
the appropriate ifdefs.

Reviewed by:	bz
Approved by:	ken (mentor)
2010-08-11 20:18:19 +00:00
Will Andrews
54bfbd5153 Allow carp(4) to be loaded as a kernel module. Follow precedent set by
bridge(4), lagg(4) etc. and make use of function pointers and
pf_proto_register() to hook carp into the network stack.

Currently, because of the uncertainty about whether the unload path is free
of race condition panics, unloads are disallowed by default.  Compiling with
CARPMOD_CAN_UNLOAD in CFLAGS removes this anti foot shooting measure.

This commit requires IP6PROTOSPACER, introduced in r211115.

Reviewed by:	bz, simon
Approved by:	ken (mentor)
MFC after:	2 weeks
2010-08-11 00:51:50 +00:00
Xin LI
9fe5092de1 Address an edge condition that we found at work, where the carp(4)
interface goes to issue LINK_UP, then LINK_DOWN, then LINK_UP at
cold boot.  This behavior is not observed when carp(4) interface
is created slightly later, when the underlying interface is fully
up.

Before this change what happen at boot is roughly:

 - ifconfig creates em0 interface;
 - ifconfig clones a carp device using em0;
   (em0's link state is DOWN at this point)
 - carp state: INIT -> BACKUP [*]
 - carp state: BACKUP -> MASTER
 - [Some negotiate between em0 and switch]
 - em0 kicks up link state change event
   (em0's link state is now up DOWN at this point)
 - do_link_state_change() -> carp_carpdev_state()
 - carp state: MASTER -> INIT (via carp_set_state(sc, INIT)) [+]
 - carp state: INIT -> BACKUP
 - carp state: BACKUP -> MASTER

At the [*] stage, em0 did not received any broadcast message from other
node, and assume our node is the master, thus carp(4) sets the link
state to "UP" after becoming a master.  At [+], the master status
is forcely set to "INIT", then an election is casted, after which our
node would actually become a master.

We believe that at the [*] stage, the master status should remain as
"INIT" since the underlying parent interface's link state is not up.

Obtained from:	iXsystems, Inc.
Reported by:	jpaetzel
MFC after:	2 months
2010-08-08 07:04:27 +00:00
Ed Schouten
367698346b Don't use struct timezone.
The timezone structure acquired by gettimeofday() is not used at all.
Just remove it.
2010-08-08 02:51:32 +00:00
Michael Tuexen
87a37484eb Fix a bug where endpoints bound to wildcard addresses where
using addresses not announced to the peer due to address
scoping.

MFC after: 3 weeks
2010-08-05 16:52:13 +00:00
Michael Tuexen
d2604d08d0 Cleanup code.
MFC after: 2 weeks
2010-08-01 08:06:59 +00:00
Bjoern A. Zeeb
19291ab3de Document the mandatory argument to the arptimer() and
nd6_llinfo_timer() functions with a KASSERT().
Note: there is no need to return after panic.

In the legacy IP case, only assign the arg after the check,
in the IPv6 case, remove the extra checks for the table and
interface as they have to be there unless we freed and forgot
to cancel the timer.  It doesn't matter anyway as we would
panic on the NULL pointer deref immediately and the bug is
elsewhere.
This unifies the code of both address families to some extend.

Reviewed by:	rwatson
MFC after:	6 days
2010-07-31 21:33:18 +00:00
Bjoern A. Zeeb
4579930d2e MFp4 @181628:
Free the rtentry after we diconnected it from the FIB and are counting
it as rttrash.  There might still be a chance we leak it from a different
code path but there is nothing we can do about this here.

Sponsored by:	ISPsystem (in February)
Reviewed by:	julian (in February)
MFC after:	2 weeks
2010-07-31 15:31:23 +00:00
Andre Oppermann
28a53f037a Fix a bug in syncache where the initial CWND for new incoming connections
was limited to one segment under the faulty assumption of a retransmit.
Due to this the opportunity to initialize the increased congestion window
according to RFC3390 was missed.

Support for RFC3465 introduced in r187289 uncovered the bug as the ACK
to SYN/ACK no longer caused snd_cwnd increase by MSS (actually, this
increase shouldn't happen as it's explicitly forbidden by RFC3390, but
it's another issue).  Snd_cwnd remains really small (1*MSS + 1) and this
causes really bad interaction with delayed acks on other side.

The variable name sc_rxmits is a bit misleading as it counts all transmits,
not just retransmits.

Submitted by:	Maxim Dounin <mdounin-at-mdounin-dot-ru>
MFC after:	10 days
2010-07-30 21:45:53 +00:00
Randall Stewart
753358d725 Fix the comment block that has the nice
table to really have the nice table :-)

MFC after:	1 month
2010-07-29 12:01:59 +00:00
Randall Stewart
44fbe46280 PR SCTP Bugs. Basically a full sized frame of
PR SCTP FWD-TSN's would not be sent and thus
cause a stalled connection. Also the rwnd
Calculation was also off on the receiver side for
PR-SCTP.
MFC after:	1 month
2010-07-29 11:37:04 +00:00
Gleb Smirnoff
b9bff254af Fix operation of "netgraph" action in conjunction with the
net.inet.ip.fw.one_pass sysctl.

The "ngtee" action is still broken.

PR:		kern/148885
Submitted by:	Nickolay Dudorov <nnd mail.nsk.ru>
2010-07-27 14:26:34 +00:00
Michael Tuexen
74e906fa94 Fix a bug where the length of a FORWARD-TSN chunk was set incorrectly in
the chunk. This resulted in malformed frames.
Remove a duplicate assignment.

MFC after: 2 weeks
2010-07-26 09:26:55 +00:00
Randall Stewart
8db924defb Make sure that we report chunks if a socket
still exists that were not sent. In either
case carefully remove the data if it does not
get taken by the reporting routines.

MFC after:	2 weeks
2010-07-26 09:22:52 +00:00
Randall Stewart
6c065bbe06 When counting the number of chunks in the
retransmission queue to validate the retran count, we
need to include the chunks in the control send queue
too. Otherwise the count will not match and you will get
the invarient warning if invarients are on.

MFC after:	2 weeks
2010-07-26 09:20:55 +00:00
Lawrence Stewart
79848522b5 - Move common code from the hook functions that fills in a packet node struct to
a separate inline function. This further reduces duplicate code that didn't
  have a good reason to stay as it was.

- Reorder the malloc of a pkt_node struct in the hook functions such that it
  only occurs if we managed to find a usable tcpcb associated with the packet.

- Make the inp_locally_locked variable's type consistent with the prototype of
  siftr_siftdata().

Sponsored by:	FreeBSD Foundation
2010-07-18 05:09:10 +00:00
Warner Losh
43e05a6523 machine/cpu.h isn't appropriate for this file,so remove it 2010-07-16 06:32:38 +00:00
Luigi Rizzo
71ad35a185 remove some conditional #ifdefs (no-op on FreeBSD);
run the timer routine on cpu 0.
2010-07-15 14:43:12 +00:00
Luigi Rizzo
297151a0f3 whitespace fixes 2010-07-15 14:37:59 +00:00
Luigi Rizzo
e6fef96ef4 fix a comment and final empty line 2010-07-15 14:37:02 +00:00
Lawrence Stewart
adc5f0109d The SIFTR DPCPU statistics struct was not being zeroed between enable/disable
cycles so the values would accumulate rather than reset for each cycle.

Sponsored by:	FreeBSD Foundation
2010-07-13 08:23:46 +00:00
Lawrence Stewart
985147dec6 Catch up with the rename of DPCPU_SUM to DPCPU_VARSUM in r209978.
Sponsored by:	FreeBSD Foundation
2010-07-13 07:00:57 +00:00
Gleb Smirnoff
281b584e8e Improve last commit: use bpf_mtap2() to avoiding stack usage.
Prodded by:	julian
2010-07-09 11:27:33 +00:00
Gleb Smirnoff
a5f9fc17c2 Since r209216 bpf(4) searches for mbuf_tags(9) and thus will not work with
a stub m_hdr instead of a full mbuf.

PR:		kern/148050
2010-07-08 13:07:40 +00:00
Randall Stewart
478fbccb67 This fixes a crash in SCTP. It was possible to have a
large number of packets queued to a crashing process.
In a specific case you may get 2 ABORT's back (from
say two packets in flight). If the aborts happened to
be processed at the same time its possible to have
one free the association while the other is trying
to report all the outbound packets. When this occured
it could lead to a crash.

MFC after:	3 days
2010-07-03 14:03:31 +00:00
Lawrence Stewart
a5548bf685 Import the Statistical Information For TCP Research (SIFTR) kernel module into
FreeBSD. SIFTR logs a range of statistics on active TCP connections to a log
file, providing the ability to make highly granular measurements of TCP
connection state. The tool is aimed at system administrators, developers and
researchers alike. Please take it for a spin and test it out - the man page
should have all the information required to get you going.

Many thanks go to the Cisco University Research Program Fund at Community
Foundation Silicon Valley and the FreeBSD Foundation. Their support of our work
at the Centre for Advanced Internet Architectures, Swinburne University of
Technology is greatly appreciated.

Sponsored by:	Cisco URP, FreeBSD Foundation
Reviewed by:	dwmalone, gnn, rpaulo
Tested by:	Many on freebsd-current@ and elsewhere over the years
MFC after:	1 month
2010-07-03 13:32:39 +00:00
Randall Stewart
606c58db25 Fix a bug that WILL cause a panic. Basically
a read-lock is being called to check the vtag-timewait cache.
Then in two cases (where a vtag is bad i.e. in the time-wait
state) the write-unlock is called NOT the read-unlock. Under
conditions where lots of associations are coming and going
this will cause the system to panic at some point.

MFC after:	3 days
2010-07-02 09:53:26 +00:00
Gleb Smirnoff
24536f92c5 After processing the O_SKIPTO opcode our cmd points to the next rule, and
"match" processing at the end of inner loop would look ahead into the next
rule, which is incorrect. Particularly, in the case when the next rule
started with F_NOT opcode it was skipped blindly.

To fix this, exit the inner loop with the continue operator forcibly and
explicitly.

PR:		kern/147798
2010-06-29 16:57:30 +00:00
Michael Tuexen
370d524f00 Fix a bug I introduced in r209470.
MFC after: 3 days
2010-06-24 07:43:25 +00:00
Michael Tuexen
749c49ac62 * Implement sctp_does_stcb_own_this_addr() correclty. It was taking the
wrong side into account.
* sctp_findassociation_ep_addr() must check the local address if available.
This fixes a bug where ABORT chunks were accepted even in the case where
the local was not owned by the endpoint.
Thanks to brucec for pointing out a bug in my first version of the fix.
MFC after: 3 days
2010-06-23 15:19:07 +00:00
Michael Tuexen
cd1386ab50 Fix a rece condition in the shutdown handling.
The race condition resulted in a panic.

MFC after: 3 days
2010-06-18 09:01:44 +00:00
Michael Tuexen
fc066a6137 * Fix a bug where the length of the ASCONF-ACK was calculated wrong due
to using an uninitialized variable.
* Fix a bug where a NULL pointer was dereferenced when interfaces
  come and go at a high rate.
* Fix a bug where inps where not deregistered from iterators.
* Fix a race condition in freeing an association.
* Fix a refcount problem related to the iterator.
Each of the above bug results in a panic. It shows up when
interfaces come and go at a high rate.

Obtained from: rrs (partly)
MFC after: 3 days
2010-06-14 21:25:07 +00:00
Randall Stewart
ec4c19fcf0 3 Fixes -
a) There was a case where a ICMP message could cause
   us to return leaving a stuck lock on an stcb.
b) The iterator needed some tweaks to fix its lock
   ordering.
c) The ITERATOR_LOCK is no longer needed in the freeing
   of a stcb. Now that the timer based one is gone we don't
   have a multiple resume situation. Add to that that there
   was somewhere a path out of the freeing of an assoc that
   did NOT release the iterator_lock.. it was time to clean
   this old code up and in the process fix the lock bug.

MFC after:	1 week
2010-06-11 03:54:00 +00:00
Randall Stewart
41291ef07f Found by Michael. In cases where we run
out of memory (no more inp space) we don't
propely NULL the INP on return.

Obtained from:	tuexen
MFC after:	3 Days
2010-06-09 22:05:29 +00:00
Randall Stewart
b3a44e469d Fix serveral bugs all having to do with freeing an
sctp_inpcb:
1) Make sure not to remove the flag on the PCB until
   after the close() caller is back in control with the
   lock. Otherwise a quickly freeing assoc could kill the
   inpcb and cause a panic.

2) Make sure all calls to log_closing have not released
   the locks before calling the log function, we don't
   want the logging function to crash us due to a freed
   inpcb.

3) Make sure that when we get to the end, we release all
   locks (after removing them from view) and as long as
   we are NOT the inp-kill timer removing the inp, call
   the callout_drain() function so a racing timer won't
   later call in and cause a racing crash.
MFC after:	1 week
2010-06-09 16:42:42 +00:00
Randall Stewart
8dcde5165e BUG:Turns out we need to use both bit maps
to calculate the cum-ack (we were not doing
it for the NR-Sack case). With this fix
NR-sack should now work correctly.
MFC after:	1 week
2010-06-09 16:39:18 +00:00
Randall Stewart
9b2e0767e2 2 Bugs:
1) Only use both mapping arrays when NR sack is off. This
   way we can hold off moving the cumack (not the best but
   workable) when NR-sack is on.

2) We must make sure to just return on the move of the
   bit to the NR array if the cum-ack as already went
   past the TSN. This prevents marking a bit behind the
   array and hitting the invariant code that panic's us.

MFC after:	1 week
2010-06-08 03:39:31 +00:00
Randall Stewart
66bd30bd4f This fixes a BUG in the handling of the cum-ack calculation.
We were only paying attention to the nr-mapping-array. Which
seems to make sense on the surface, by definition things
up to the cum-ack should be deliverable thus in the nr-mapping-array.
However (there is always a gotcha) thats not true when it
comes to large messages. The stack may hold the message
while re-assembling it not not deliver it based on several
thresholds. If that happens (which it would for smaller
large messages) then the cum-ack is figured wrong. We
now properly use both arrays in the cum-ack calculation.

MFC after:	1 week.
2010-06-07 18:29:10 +00:00
Randall Stewart
b9771f0404 Opps... my bad.. we don't need a SOCK_UNLOCK() after
calling socantrcvmore_locked() since it will unlock
the lock for you.

MFC after:	1 week
2010-06-07 11:33:20 +00:00
Randall Stewart
9ed1e280f6 Fix so we call socantrcvmore_locked so we
don't see a race where we unlock to call
the non-locked version and have the socket
go away.

MFC after:	1 week
2010-06-07 04:01:38 +00:00
Randall Stewart
8ce4a9a255 1) Optimize the cleanup and don't always depend on
the timer. This is done by considering the locks
   we will destroy and if they are contended we consider
   it the same as a reference count being up. Fixing this
   appears to cleanup another crash that was appearing with
   all the timers where the socket buf lock got corrupted.

2) Fix the sysctl code to take a lot more care when looking
   at INP's that are in the GONE or ALLGONE state.

MFC after:	1 week
2010-06-06 20:34:17 +00:00
Randall Stewart
0c7dc84076 Ok, yet another bug in killing off all the hundreds
of apitesters.. Basically we end up with attempting
to destroy a lock thats contended on. A cookie echo
arrives at the same time that the close is happening.
The close gets the lock but the cookie echo has already
passed the check for the gone flag and is then locked
waiting on the create lock.. when we go to destroy it
bam. For now we do the timer destroy for all calls
to close.. We can probably optimize this later so that
we check whats being contended on and if there is contention
then do the timer thing. but this is probably safest since
the inp has been removed from all lists and references and
only the timer can find it.. once the locks are released all
other places will instantly see the GONE flag and bail (thats
what the change in sctp_input is one place that was lacking
the bail code).

MFC after:	1 week
2010-06-06 19:24:32 +00:00
Randall Stewart
faa1e3f4a9 1) Further enhance the INVARIANT lock validation (no locks) are
held by checking the create and inp locks as well.

2) Fix a bug in that when a socket is closed an INIT-ACK
   is returned, we do NOT unlock the locked_tcb unless its
   different (an unlikely scenario). If we blindly unlock as
   we were doing before we can end up unlocking the actual
   stcb thats about to be sent down to the free function which
   requires the lock be held.

MFC after:	1 week
2010-06-06 16:11:16 +00:00
Randall Stewart
7c82e9fa93 Fix a bug in the sctp_inpcb_free. Basically if the socket
was setup to do an abortive close an association that was
in the accept_queue could get stuck and never freed. Now
we properly start the kill timer on the socket and turn
off the flag (same thing we do for the graceful close method).
MFC after:	1 week
2010-06-06 16:09:12 +00:00
Randall Stewart
3d7001cdcb Fix a bug in sctp_abort_assoc(). DON'T call the sctp_inpcb_free
when the gone flag is set. You don't know what locks the
caller has set and there is already a kill timer running.

MFC after:	1 week
2010-06-06 16:07:40 +00:00
Randall Stewart
2c6b25b4cd Hopefully this fixes a LOR by making
so we only hold the iterator lock during
updates to the iterators work.

MFC after:	1 week
2010-06-06 02:33:46 +00:00
Randall Stewart
a67294246e Bruce's fix for some return's in
error legs.

MFC after:	1 week
2010-06-06 02:32:20 +00:00
Randall Stewart
8e57327bbf Purge out a Windows def that somehow slipped
past the scrubber.

MFC after:	1 Week
2010-06-05 21:39:52 +00:00
Randall Stewart
1909799a4c Spacing issues
MFC after:	1 Week
2010-06-05 21:33:16 +00:00
Randall Stewart
aca14c2aa8 This change does the following:
1) Fix the alignment of a comment.
2) Fix a BUG where we were NOT paying attention
   to the RESEND marking on retransmitting control
   chunks.. and worse we were not decrementing the
   retran count that could cause us to loop forever.
3) Add in the valdiate_no_lock function on invariants
   so that we will really check all ways out to be sure
   a lock does not slip out locked.

MFC after:	1 week.
2010-06-05 21:27:43 +00:00