Commit Graph

3934 Commits

Author SHA1 Message Date
Dimitry Andric
3e288e6238 After some off-list discussion, revert a number of changes to the
DPCPU_DEFINE and VNET_DEFINE macros, as these cause problems for various
people working on the affected files.  A better long-term solution is
still being considered.  This reversal may give some modules empty
set_pcpu or set_vnet sections, but these are harmless.

Changes reverted:

------------------------------------------------------------------------
r215318 | dim | 2010-11-14 21:40:55 +0100 (Sun, 14 Nov 2010) | 4 lines

Instead of unconditionally emitting .globl's for the __start_set_xxx and
__stop_set_xxx symbols, only emit them when the set_vnet or set_pcpu
sections are actually defined.

------------------------------------------------------------------------
r215317 | dim | 2010-11-14 21:38:11 +0100 (Sun, 14 Nov 2010) | 3 lines

Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout
the tree.

------------------------------------------------------------------------
r215316 | dim | 2010-11-14 21:23:02 +0100 (Sun, 14 Nov 2010) | 2 lines

Add macros to define static instances of VNET_DEFINE and DPCPU_DEFINE.
2010-11-22 19:32:54 +00:00
Marko Zec
0593983963 Remove an apparently redundant CURVNET_SET() / CURVNET_RESTORE() pair.
MFC after:	3 days
2010-11-22 14:16:23 +00:00
Lawrence Stewart
92ea5581dd Fix a minor code redundancy nit.
MFC after:	3 days
2010-11-20 08:40:37 +00:00
Lawrence Stewart
052aec123c When enabling or disabling SIFTR with a VIMAGE kernel, ensure we add or remove
the SIFTR pfil(9) hook functions to or from all network stacks. This patch
allows packets inbound or outbound from a vnet to be "seen" by SIFTR.

Additional work is required to allow SIFTR to actually generate log messages for
all vnet related packets because the siftr_findinpcb() function does not yet
search for inpcbs across all vnets. This issue will be fixed separately.

Reported and tested by:	David Hayes <dahayes at swin edu au>
MFC after:	3 days
2010-11-20 07:36:43 +00:00
George V. Neville-Neil
f5d34df525 Add new, per connection, statistics for TCP, including:
Retransmitted Packets
Zero Window Advertisements
Out of Order Receives

These statistics are available via the -T argument to
netstat(1).
MFC after:	2 weeks
2010-11-17 18:55:12 +00:00
Michael Tuexen
6a67588bbb Add an SCTP socket option to retrieve the number of timeouts
of an association.

MFC after: 3 days.
2010-11-16 22:16:38 +00:00
Lawrence Stewart
78b01840af Make the CC framework more VIMAGE friendly by adding the machinery to allow
vnets to select their own default CC algorithm independent of each other and the
base system. If the base system or a vnet has set a default which gets unloaded,
we reset that netstack's default to NewReno.

Sponsored by:	FreeBSD Foundation
Tested by:	Mikolaj Golub <to.my.trociny at gmail com>
Reviewed by:	bz (briefly)
MFC after:	3 months
2010-11-16 09:34:31 +00:00
Lawrence Stewart
ebf92e869f - Querying the default CC algo is more common than setting it and the function
is small, so there is no good reason not to declare the buffer at the top.

- Fix a whitespace nit.

Sponsored by:	FreeBSD Foundation
MFC after:	11 weeks
X-MFC with:	r215166
2010-11-16 08:43:25 +00:00
Lawrence Stewart
99065ae6a8 Move protocol specific implementation detail out of the core CC framework.
Sponsored by:	FreeBSD Foundation
Tested by:	Mikolaj Golub <to.my.trociny at gmail com>
MFC after:	11 weeks
X-MFC with:	r215166
2010-11-16 08:30:39 +00:00
Lawrence Stewart
4e805854ed On CC algorithm module unload, we walk the list of active TCP control blocks.
Any found to be using the algorithm that is about to go away are switched back
to NewReno to avoid leaving dangling pointers which would trigger a panic. For
VIMAGE kernels, there is a list per vnet to walk, yet the implementation was
only examining one of the vnet lists.

Fix the implementation of the above feature for VIMAGE kernels by looping
through all active TCP control blocks across all vnets.

Sponsored by:	FreeBSD Foundation
Tested by:	Mikolaj Golub <to.my.trociny at gmail com>
Reviewed by:	bz (briefly)
MFC after:	11 weeks
2010-11-16 07:57:56 +00:00
Lawrence Stewart
14f57a8b02 cc_init() should only be run once on system boot, but with VIMAGE kernels it
runs on boot and each time a vnet jail is created. Running cc_init() multiple
times results in a panic when attempting to initialise the cc_list lock again,
and so r215166 effectively broke the use of vnet jails.

Switch to using a SYSINIT to run cc_init() on boot. CC algorithm modules loaded
on boot register in the same SI_SUB_PROTO_IFATTACHDOMAIN category as is used in
this patch, so cc_init() is run at SI_ORDER_FIRST to ensure the framework is
initialised before module registration is attempted.

Sponsored by:	FreeBSD Foundation
Reported and tested by:	Mikolaj Golub <to.my.trociny at gmail com>
MFC after:	11 weeks
X-MFC with:	r215166
2010-11-16 07:09:05 +00:00
Dimitry Andric
31c6a0037e Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout
the tree.
2010-11-14 20:38:11 +00:00
Michael Tuexen
e635c7b881 Take out special code for disable CRC computations on
the loopback interface for IPv6. It will be handled
by the loopback interface.
2010-11-14 16:44:18 +00:00
Michael Tuexen
cafa98a989 Simplify sctp_delayed_cksum() a bit.
MFC after: 3 days.
2010-11-14 14:37:20 +00:00
Michael Tuexen
27387daca6 Fix a locking issue reported by brucec@ affecting
1-to-1 style sockets which have not yet been
accepted.

MFC after: 3 days.
2010-11-13 12:52:44 +00:00
George V. Neville-Neil
e162ea60d4 Add a queue to hold packets while we await an ARP reply.
When a fast machine first brings up some non TCP networking program
it is quite possible that we will drop packets due to the fact that
only one packet can be held per ARP entry.  This leads to packets
being missed when a program starts or restarts if the ARP data is
not currently in the ARP cache.

This code adds a new sysctl, net.link.ether.inet.maxhold, which defines
a system wide maximum number of packets to be held in each ARP entry.
Up to maxhold packets are queued until an ARP reply is received or
the ARP times out.  The default setting is the old value of 1
which has been part of the BSD networking code since time
immemorial.

Expose the time we hold an incomplete ARP entry by adding
the sysctl net.link.ether.inet.wait, which defaults to 20
seconds, the value used when the new ARP code was added..

Reviewed by:	bz, rpaulo
MFC after: 3 weeks
2010-11-12 22:03:02 +00:00
Michael Tuexen
448a42a61e Don't print an empty line when printing mapping arrays.
MFC after: 3 days.
2010-11-12 20:46:33 +00:00
Michael Tuexen
4ce091cda9 Fix more issues with the SACK/NR-SACK generation code.
MFC after: 3 days.
2010-11-12 20:45:21 +00:00
Luigi Rizzo
ae99fd0e07 The first customer of the SO_USER_COOKIE option:
the "sockarg" ipfw option matches packets associated to
a local socket and with a non-zero so_user_cookie value.
The value is made available as tablearg, so it can be used
as a skipto target or pipe number in ipfw/dummynet rules.

Code by Paul Joe, manpage by me.

Submitted by:	Paul Joe
MFC after:	1 week
2010-11-12 13:05:17 +00:00
Lawrence Stewart
dbc4240942 This commit marks the first formal contribution of the "Five New TCP Congestion
Control Algorithms for FreeBSD" FreeBSD Foundation funded project. More details
about the project are available at: http://caia.swin.edu.au/freebsd/5cc/

- Add a KPI and supporting infrastructure to allow modular congestion control
  algorithms to be used in the net stack. Algorithms can maintain per-connection
  state if required, and connections maintain their own algorithm pointer, which
  allows different connections to concurrently use different algorithms. The
  TCP_CONGESTION socket option can be used with getsockopt()/setsockopt() to
  programmatically query or change the congestion control algorithm respectively
  from within an application at runtime.

- Integrate the framework with the TCP stack in as least intrusive a manner as
  possible. Care was also taken to develop the framework in a way that should
  allow integration with other congestion aware transport protocols (e.g. SCTP)
  in the future. The hope is that we will one day be able to share a single set
  of congestion control algorithm modules between all congestion aware transport
  protocols.

- Introduce a new congestion recovery (TF_CONGRECOVERY) state into the TCP stack
  and use it to decouple the meaning of recovery from a congestion event and
  recovery from packet loss (TF_FASTRECOVERY) a la RFC2581. ECN and delay based
  congestion control protocols don't generally need to recover from packet loss
  and need a different way to note a congestion recovery episode within the
  stack.

- Remove the net.inet.tcp.newreno sysctl, which simplifies some portions of code
  and ensures the stack always uses the appropriate mechanisms for recovering
  from packet loss during a congestion recovery episode.

- Extract the NewReno congestion control algorithm from the TCP stack and
  massage it into module form. NewReno is always built into the kernel and will
  remain the default algorithm for the forseeable future. Implementations of
  additional different algorithms will become available in the near future.

- Bump __FreeBSD_version to 900025 and note in UPDATING that rebuilding code
  that relies on the size of "struct tcpcb" is required.

Many thanks go to the Cisco University Research Program Fund at Community
Foundation Silicon Valley and the FreeBSD Foundation. Their support of our work
at the Centre for Advanced Internet Architectures, Swinburne University of
Technology is greatly appreciated.

In collaboration with:	David Hayes <dahayes at swin edu au> and
			Grenville Armitage <garmitage at swin edu au>
Sponsored by:	Cisco URP, FreeBSD Foundation
Reviewed by:	rpaulo
Tested by:	David Hayes (and many others over the years)
MFC after:	3 months
2010-11-12 06:41:55 +00:00
Lawrence Stewart
619ad9eb3e Standardise all Swinburne related copyright/licence statements throughout the
tree in preparation for another large code import. Swinburne University is the
legal entity that owns copyright and the 2-clause BSD licence is acceptable.
2010-11-12 00:44:18 +00:00
Lawrence Stewart
67f285a22e The university does not require that its CRICOS number be included in source
code. Remove all references from the tree.

MFC after:	3 days
2010-11-12 00:19:42 +00:00
Michael Tuexen
eefcb5cd2a Fix the SACK/NR-SACK generation code.
MFC after: 3 days.
2010-11-11 18:41:03 +00:00
Randall Stewart
04215ed220 Fix so that a multicast packet can be sent
even if there is no route out to that mcast address. The code in
in_pcb inadvertantly would error (no route) even though
the user may have specified the address with the
proper socket option (to specify the egress interface).
Thanks bz for reminding me I forgot to commit this ;-)

Reviewed by:	bz
MFC after:	1 week
2010-11-11 05:40:39 +00:00
Michael Tuexen
034b88b092 Improve the scalability by using the local and remote port when
putting inps in the tcpephash.

MFC after: 3 days.
2010-11-09 16:18:32 +00:00
Michael Tuexen
8b4da1c3d9 Fix a bug which resulted in kevent() reporting an event twice on
1-to-1 style sockets when an ABORT was received.

MFC after: 3 days.
2010-11-09 12:00:39 +00:00
Rebecca Cran
b1ce21c6ef Fix typos.
PR:	bin/148894
Submitted by:	olgeni
2010-11-09 10:59:09 +00:00
Michael Tuexen
437fc91ae6 Do not have the MTU table twice in the code. Therefore move the
function from the timer code to util, rename it appropriately and
also fix a bug in sctp_get_prev_mtu(), where calling it with a
value existing in the MTU table did not return a smaller one.

MFC after: 3 days.
2010-11-07 18:50:35 +00:00
Michael Tuexen
c7532199ea Remove two functions which are not used.
MFC after: 3 days.
2010-11-07 17:50:56 +00:00
Michael Tuexen
b61c358887 * Use exponential backoff for retransmission of SHUTDOWN and
SHUTDOWN-ACK chunks.
* While there, do some cleanups.

MFC after: 3 days.
2010-11-07 17:44:04 +00:00
Michael Tuexen
12af6654a3 Not only stop all timers when entering the SHUTDOWN_SENT state,
but also when entering the SHUTDOWN_ACK_SEND state.

MFC after: 3 days.
2010-11-07 14:39:40 +00:00
Michael Tuexen
7da23bc820 Do not resend DATA chunks without delay when dropped by the peer and
the CRC was correct.

MFC after: 3 days.
2010-11-06 13:43:18 +00:00
Michael Tuexen
699437a2ba * Fix an accounting bug regarding SACK/NR-SACK chunks.
* Fix the generation of the SACK/NR-SACK gap lists.

MFC after: 3 days.
2010-11-06 13:30:54 +00:00
Nick Hibma
770c6c3310 Don't spam the console with loaded modules during boot and/or during
startup of ppp.

Note: This cannot be hidden behind bootverbose as this file is included
from lib/libalias as well.
2010-11-03 21:10:12 +00:00
John Baldwin
33b31db666 Don't leak the LLE lock if the arptimer callout is pending or inactive.
Reported by:	David Rhodus
MFC after:	1 month
2010-11-02 13:00:56 +00:00
Gleb Smirnoff
27bf126d23 Remove meaningless XXXXX, that is a remain of comment, removed in r186200. 2010-10-29 11:13:42 +00:00
Gleb Smirnoff
28e1f17c81 Revert a small part of the r198301, that is entirely unrelated to the
r198301 itself. It also broke the logic of not sending more than one
ARP request per second, that consequently lead to a potential problem
of flooding network with broadcast packets.

MFC after:	1 week
2010-10-29 10:57:18 +00:00
Bjoern A. Zeeb
0ef7c8a20b Add initial inet DDB support for show in_ifaddr and show sin commands which
proved to be useful while debugging address list problems.

MFC after:	6 days
2010-10-24 22:02:36 +00:00
Bjoern A. Zeeb
4a85b5e2ea Make the IPsec SADB embedded route cache a union to be able to hold both the
legacy and IPv6 route destination address.
Previously in case of IPv6, there was a memory overwrite due to not enough
space for the IPv6 address.

PR:		kern/122565
MFC After:	2 weeks
2010-10-23 20:35:40 +00:00
Ulrich Spörlein
7cc1fde083 mdoc: drop even more redundant .Pp calls
No change in rendered output, less mandoc lint warnings.

Tool provided by:	Nobuyuki Koganemaru n-kogane at syd.odn.ne.jp
2010-10-19 12:35:40 +00:00
Bjoern A. Zeeb
12112cf676 MfP4 CH182763 (original version):
Make it harder to exploit certain in_control() related races between the
intiial lookup at the beginning and the time we will remove the entry
from the lists by re-checking that entry is still in the list before
trying to remove it.

(*) It is believed that with the current code and locking strategy we
    cannot completely fix all race.

Reported by:	Nima Misaghian (nima_misa hotmail.com) on net@ 20100817
Tested by:	Nima Misaghian (nima_misa hotmail.com) (original version)
PR:		kern/146250
Submitted by:	Mikolaj Golub (to.my.trociny gmail.com) (different version)
MFC after:	1 week
2010-10-16 19:53:22 +00:00
Lawrence Stewart
ca09d7728b Retire the system-wide, per-reassembly queue segment limit. The mechanism is far
too coarse grained to be useful and the default value significantly degrades TCP
performance on moderate to high bandwidth-delay product paths with non-zero loss
(e.g. 5+Mbps connections across the public Internet often suffer).

Replace the outgoing mechanism with an individual per-queue limit based on the
number of MSS segments that fit into the socket's receive buffer. This should
strike a good balance between performance and the potential for resource
exhaustion when FreeBSD is acting as a TCP receiver. With socket buffer
autotuning (which is enabled by default), the reassembly queue tracks the
socket buffer and benefits too.

As the XXX comment suggests, my testing uncovered some unexpected behaviour
which requires further investigation. By using so->so_rcv.sb_hiwat
instead of sbspace(&so->so_rcv), we allow more segments to be held across both
the socket receive buffer and reassembly queue than we probably should. The
tradeoff is better performance in at least one common scenario, versus a devious
sender's ability to consume more resources on a FreeBSD receiver.

Sponsored by:	FreeBSD Foundation
Reviewed by:	andre, gnn, rpaulo
MFC after:	2 weeks
2010-10-16 07:12:39 +00:00
Lawrence Stewart
c8dc0ab886 - Switch the "net.inet.tcp.reass.cursegments" and
"net.inet.tcp.reass.maxsegments" sysctl variables to be based on UMA zone
  stats. The value returned by the cursegments sysctl is approximate owing to
  the way in which uma_zone_get_cur is implemented.

- Discontinue use of V_tcp_reass_qsize as a global reassembly segment count
  variable in the reassembly implementation. The variable was used without
  proper synchronisation and was duplicating accounting done by UMA already. The
  lack of synchronisation was particularly problematic on SMP systems
  terminating many TCP sessions, resulting in poor TCP performance for
  connections with non-zero packet loss.

Sponsored by:	FreeBSD Foundation
Reviewed by:	andre, gnn, rpaulo (as part of a larger patch)
MFC after:	2 weeks
2010-10-16 05:37:45 +00:00
Bjoern A. Zeeb
dc699bac75 Use ifa_ifwithaddr_check() rather than ifa_ifwithaddr() as we are not
interested in the result and would leak a reference otherwise.

PR:		kern/151435
Submitted by:	Andrew Boyer (aboyer averesystems.com)
MFC after:	3 days
2010-10-14 12:32:49 +00:00
Luigi Rizzo
3f18b51c8d put back the assigment to sched_time. It was correct, and
it was necessary.

Submitted by:	Riccardo Panicucci
2010-10-01 15:38:35 +00:00
Bjoern A. Zeeb
544794507a Proper bracketing.
PR:		kern/151100
Submitted by:	SunMinghao (sunminghao hotmail.com)
MFC after:	3 days
2010-10-01 11:48:14 +00:00
Luigi Rizzo
e53a34a766 remove an unnecessary (and wrong) assignment.
It was meant to reset idle_time (and it was not needed),
but i even used the wrong field.

Obtained from:	Oleg
MFC after:	3 days
2010-09-29 21:02:31 +00:00
Luigi Rizzo
38cc301f9f whitespace changes in preparation for future commits 2010-09-29 09:40:20 +00:00
Luigi Rizzo
a47ee22718 fix handling of initial credit for an idle pipe.
This fixes the bug where setting bw > 1 MTU/tick resulted in
infinite bandwidth if io_fast=1

PR:		147245 148429
Obtained from:	Riccardo Panicucci
MFC after:	3 days
2010-09-29 09:22:12 +00:00
Luigi Rizzo
8d74ca8ce9 fix breakage in in-kernel NAT: the code did not honor
net.inet.ip.fw.one_pass and always moved to the next rule
in case of a successful nat.

This should fix several related PR (waiting for feedback
before closing them)

PR:		145167 149572 150141
MFC after:	3 days
2010-09-28 23:23:23 +00:00