Commit Graph

274 Commits

Author SHA1 Message Date
Julien Charbon
eb96dc3336 In TCP, connect() can return incorrect error code EINVAL
instead of EADDRINUSE or ECONNREFUSED

PR:			https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=196035
Differential Revision:	https://reviews.freebsd.org/D1982
Reported by:		Mark Nunberg <mnunberg@haskalah.org>
Submitted by:		Harrison Grundy <harrison.grundy@astrodoggroup.com>
Reviewed by:		adrian, jch, glebius, gnn
Approved by:		jhb
MFC after:		2 weeks
2015-03-09 20:29:16 +00:00
Gleb Smirnoff
2cbcd3c198 Merge from projects/sendfile:
- Provide pru_ready function for TCP.
- Don't call tcp_output() from tcp_usr_send() if no ready data was put
  into the socket buffer.
- In case of dropped connection don't try to m_freem() not ready data.

Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix
2014-11-30 13:43:52 +00:00
Gleb Smirnoff
651e4e6a30 Merge from projects/sendfile: extend protocols API to support
sending not ready data:
o Add new flag to pru_send() flags - PRUS_NOTREADY.
o Add new protocol method pru_ready().

Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix
2014-11-30 13:24:21 +00:00
Gleb Smirnoff
300fa232ee Missed in r274421: use sbavail() instead of bare access to sb_cc. 2014-11-30 12:11:01 +00:00
Julien Charbon
cea40c4888 Fix a race condition in TCP timewait between tcp_tw_2msl_reuse() and
tcp_tw_2msl_scan().  This race condition drives unplanned timewait
timeout cancellation.  Also simplify implementation by holding inpcb
reference and removing tcptw reference counting.

Differential Revision:	https://reviews.freebsd.org/D826
Submitted by:		Marc De la Gueronniere <mdelagueronniere@verisign.com>
Submitted by:		jch
Reviewed By:		jhb (mentor), adrian, rwatson
Sponsored by:		Verisign, Inc.
MFC after:		2 weeks
X-MFC-With:		r264321
2014-10-30 08:53:56 +00:00
Julien Charbon
489dcc9262 A connection in TIME_WAIT state before calling close() actually did not
received any RST packet.  Do not set error to ECONNRESET in this case.

Differential Revision:	https://reviews.freebsd.org/D879
Reviewed by:		rpaulo, adrian
Approved by:		jhb (mentor)
Sponsored by:		Verisign, Inc.
2014-10-12 23:01:25 +00:00
Andrey V. Elsukov
a7e201bbac Make in6_pcblookup_hash_locked and in6_pcbladdr static.
Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
2014-09-10 13:17:35 +00:00
Gleb Smirnoff
e407b67be4 The FreeBSD-SA-14:08.tcp was a lesson on not doing acrobatics with
mixing on stack memory and UMA memory in one linked list.

Thus, rewrite TCP reassembly code in terms of memory usage. The
algorithm remains unchanged.

We actually do not need extra memory to build a reassembly queue.
Arriving mbufs are always packet header mbufs. So we got the length
of data as pkthdr.len. We got m_nextpkt for linkage. And we need
only one pointer to point at the tcphdr, use PH_loc for that.

In tcpcb the t_segq fields becomes mbuf pointer. The t_segqlen
field now counts not packets, but bytes in the queue. This gives
us more precision when comparing to socket buffer limits.

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2014-05-04 23:25:32 +00:00
George V. Neville-Neil
6f3caa6d81 Decrease lock contention within the TCP accept case by removing
the INP_INFO lock from tcp_usr_accept.  As the PR/patch states
this was following the advice already in the code.
See the PR below for a full disucssion of this change and its
measured effects.

PR:		183659
Submitted by:	Julian Charbon
Reviewed by:	jhb
2014-01-28 20:28:32 +00:00
Gleb Smirnoff
2f3eb7f4d8 Make TCP_KEEP* socket options readable. At least PostgreSQL wants
to read the values.

Reported by:	sobomax
2013-11-08 13:04:14 +00:00
Gleb Smirnoff
76039bc84f The r48589 promised to remove implicit inclusion of if_var.h soon. Prepare
to this event, adding if_var.h to files that do need it. Also, include
all includes that now are included due to implicit pollution via if_var.h

Sponsored by:	Netflix
Sponsored by:	Nginx, Inc.
2013-10-26 17:58:36 +00:00
Mark Johnston
57f6086735 Implement the ip, tcp, and udp DTrace providers. The probe definitions use
dynamic translation so that their arguments match the definitions for
these providers in Solaris and illumos. Thus, existing scripts for these
providers should work unmodified on FreeBSD.

Tested by:	gnn, hiren
MFC after:	1 month
2013-08-25 21:54:41 +00:00
Navdeep Parhar
adfaf8f6ad Add checks for SO_NO_OFFLOAD in a couple of places that I missed earlier
in r245915.
2013-01-26 01:41:42 +00:00
Navdeep Parhar
460cf046c2 There is no need to call into the TOE driver twice in pru_rcvd (tod_rcvd
and then tod_output right after that).

Reviewed by:	bz@
2013-01-25 22:50:52 +00:00
Navdeep Parhar
37cc0ecb1b Heed SO_NO_OFFLOAD.
MFC after:	1 week
2013-01-25 20:23:33 +00:00
Gleb Smirnoff
85c05144f1 Fix bug in TCP_KEEPCNT setting, which slipped in in the last round
of reviewing of r231025.

Unlike other options from this family TCP_KEEPCNT doesn't specify
time interval, but a count, thus parameter supplied doesn't need
to be multiplied by hz.

Reported & tested by:	amdmi3
2012-09-27 07:13:21 +00:00
Navdeep Parhar
09fe63205c - Updated TOE support in the kernel.
- Stateful TCP offload drivers for Terminator 3 and 4 (T3 and T4) ASICs.
  These are available as t3_tom and t4_tom modules that augment cxgb(4)
  and cxgbe(4) respectively.  The cxgb/cxgbe drivers continue to work as
  usual with or without these extra features.

- iWARP driver for Terminator 3 ASIC (kernel verbs).  T4 iWARP in the
  works and will follow soon.

Build-tested with make universe.

30s overview
============
What interfaces support TCP offload?  Look for TOE4 and/or TOE6 in the
capabilities of an interface:
# ifconfig -m | grep TOE

Enable/disable TCP offload on an interface (just like any other ifnet
capability):
# ifconfig cxgbe0 toe
# ifconfig cxgbe0 -toe

Which connections are offloaded?  Look for toe4 and/or toe6 in the
output of netstat and sockstat:
# netstat -np tcp | grep toe
# sockstat -46c | grep toe

Reviewed by:	bz, gnn
Sponsored by:	Chelsio communications.
MFC after:	~3 months (after 9.1, and after ensuring MFC is feasible)
2012-06-19 07:34:13 +00:00
Gleb Smirnoff
9077f38738 Add new socket options: TCP_KEEPINIT, TCP_KEEPIDLE, TCP_KEEPINTVL and
TCP_KEEPCNT, that allow to control initial timeout, idle time, idle
re-send interval and idle send count on a per-socket basis.

Reviewed by:	andre, bz, lstewart
2012-02-05 16:53:02 +00:00
Navdeep Parhar
db3cee5170 Always release the inp lock before returning from tcp_detach.
MFC after:	5 days
2012-01-06 18:29:40 +00:00
Andre Oppermann
873789cb0f Move the tcp_sendspace and tcp_recvspace sysctl's from
the middle of tcp_usrreq.c to the top of tcp_output.c
and tcp_input.c respectively next to the socket buffer
autosizing controls.

MFC after:	1 week
2011-10-16 20:18:39 +00:00
Andre Oppermann
e233e2acb3 VNET virtualize tcp_sendspace/tcp_recvspace and change the
type to INT.  A long is not necessary as the TCP window is
limited to 2**30.  A larger initial window isn't useful.

MFC after:	1 week
2011-10-16 15:08:43 +00:00
Andre Oppermann
c8360ae220 Update the comment and description of tcp_sendspace and tcp_recvspace
to better reflect their purpose.
MFC after:	1 week
2011-10-16 13:54:46 +00:00
Robert Watson
b598155a85 Do not leak the pcbinfohash lock in the case where in6_pcbladdr() returns
an error during TCP connect(2) on an IPv6 socket.

Submitted by:	bz
Sponsored by:	Juniper Networks, Inc.
2011-06-02 10:21:05 +00:00
Robert Watson
fa046d8774 Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
  inpcb counter.  This lock is now relegated to a small number of
  allocation and free operations, and occasional operations that walk
  all connections (including, awkwardly, certain UDP multicast receive
  operations -- something to revisit).

- A new ipi_hash_lock protects the two inpcbinfo hash tables for
  looking up connections and bound sockets, manipulated using new
  INP_HASH_*() macros.  This lock, combined with inpcb locks, protects
  the 4-tuple address space.

Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required.  As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.

A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb.  Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed.  In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup.  New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:

  INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
  INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb

Callers must pass exactly one of these flags (for the time being).

Some notes:

- All protocols are updated to work within the new regime; especially,
  TCP, UDPv4, and UDPv6.  pcbinfo ipi_lock acquisitions are largely
  eliminated, and global hash lock hold times are dramatically reduced
  compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
  may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
  is no longer available -- hash lookup locks are now held only very
  briefly during inpcb lookup, rather than for potentially extended
  periods.  However, the pcbinfo ipi_lock will still be acquired if a
  connection state might change such that a connection is added or
  removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
  due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
  callers to acquire hash locks and perform one or more lookups atomically
  with 4-tuple allocation: this is required only for TCPv6, as there is no
  in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
  locking, which relates to source address selection.  This needs
  attention, as it likely significantly reduces parallelism in this code
  for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
  somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which
  is no longer sufficient.  A second check once the inpcb lock is held
  should do the trick, keeping the general case from requiring the inpcb
  lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
  which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
  undesirable, and probably another argument is required to take care of
  this (or a char array name field in the pcbinfo?).

This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics.  It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.

Reviewed by:    bz
Sponsored by:   Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
Bjoern A. Zeeb
b287c6c70c Make the TCP code compile without INET. Sort #includes and add #ifdef INETs.
Add some comments at #endifs given more nestedness.  To make the compiler
happy, some default initializations were added in accordance with the style
on the files.

Reviewed by:	gnn
Sponsored by:	The FreeBSD Foundation
Sponsored by:	iXsystems
MFC after:	4 days
2011-04-30 11:21:29 +00:00
John Baldwin
d28b9e89a9 When turning off TCP_NOPUSH, only call tcp_output() to immediately flush
any pending data if the connection is established.

Submitted by:	csjp
Reviewed by:	lstewart
MFC after:	1 week
2011-02-04 14:13:15 +00:00
Bjoern A. Zeeb
7f79e7e4db Remove duplicate printing of TF_NOPUSH in db_print_tflags().
MFC after:	10 days
2011-01-29 22:11:13 +00:00
John Baldwin
79e955ed63 Trim extra spaces before tabs. 2011-01-07 21:40:34 +00:00
George V. Neville-Neil
f5d34df525 Add new, per connection, statistics for TCP, including:
Retransmitted Packets
Zero Window Advertisements
Out of Order Receives

These statistics are available via the -T argument to
netstat(1).
MFC after:	2 weeks
2010-11-17 18:55:12 +00:00
Lawrence Stewart
dbc4240942 This commit marks the first formal contribution of the "Five New TCP Congestion
Control Algorithms for FreeBSD" FreeBSD Foundation funded project. More details
about the project are available at: http://caia.swin.edu.au/freebsd/5cc/

- Add a KPI and supporting infrastructure to allow modular congestion control
  algorithms to be used in the net stack. Algorithms can maintain per-connection
  state if required, and connections maintain their own algorithm pointer, which
  allows different connections to concurrently use different algorithms. The
  TCP_CONGESTION socket option can be used with getsockopt()/setsockopt() to
  programmatically query or change the congestion control algorithm respectively
  from within an application at runtime.

- Integrate the framework with the TCP stack in as least intrusive a manner as
  possible. Care was also taken to develop the framework in a way that should
  allow integration with other congestion aware transport protocols (e.g. SCTP)
  in the future. The hope is that we will one day be able to share a single set
  of congestion control algorithm modules between all congestion aware transport
  protocols.

- Introduce a new congestion recovery (TF_CONGRECOVERY) state into the TCP stack
  and use it to decouple the meaning of recovery from a congestion event and
  recovery from packet loss (TF_FASTRECOVERY) a la RFC2581. ECN and delay based
  congestion control protocols don't generally need to recover from packet loss
  and need a different way to note a congestion recovery episode within the
  stack.

- Remove the net.inet.tcp.newreno sysctl, which simplifies some portions of code
  and ensures the stack always uses the appropriate mechanisms for recovering
  from packet loss during a congestion recovery episode.

- Extract the NewReno congestion control algorithm from the TCP stack and
  massage it into module form. NewReno is always built into the kernel and will
  remain the default algorithm for the forseeable future. Implementations of
  additional different algorithms will become available in the near future.

- Bump __FreeBSD_version to 900025 and note in UPDATING that rebuilding code
  that relies on the size of "struct tcpcb" is required.

Many thanks go to the Cisco University Research Program Fund at Community
Foundation Silicon Valley and the FreeBSD Foundation. Their support of our work
at the Centre for Advanced Internet Architectures, Swinburne University of
Technology is greatly appreciated.

In collaboration with:	David Hayes <dahayes at swin edu au> and
			Grenville Armitage <garmitage at swin edu au>
Sponsored by:	Cisco URP, FreeBSD Foundation
Reviewed by:	rpaulo
Tested by:	David Hayes (and many others over the years)
MFC after:	3 months
2010-11-12 06:41:55 +00:00
Andre Oppermann
1c18314d17 Remove the TCP inflight bandwidth limiter as announced in r211315
to give way for the pluggable congestion control framework.  It is
the task of the congestion control algorithm to set the congestion
window and amount of inflight data without external interference.

In 'struct tcpcb' the variables previously used by the inflight
limiter are renamed to spares to keep the ABI intact and to have
some more space for future extensions.

In 'struct tcp_info' the variable 'tcpi_snd_bwnd' is not removed to
preserve the ABI.  It is always set to 0.

In siftr.c in 'struct pkt_node' the variable 'snd_bwnd' is not removed
to preserve the ABI.  It is always set to 0.

These unused variable in the various structures may be reused in the
future or garbage collected before the next release or at some other
point when an ABI change happens anyway for other reasons.

No MFC is planned.  The inflight bandwidth limiter stays disabled by
default in the other branches but remains available.
2010-09-16 21:06:45 +00:00
Robert Watson
8296cddfdd Add a comment to tcp_usr_accept() to indicate why it is we acquire the
tcbinfo lock there: r175612, which re-added it, masked a race between
sonewconn(2) and accept(2) that could allow an incompletely initialized
address on a newly-created socket on a listen queue to be exposed.  Full
details can be found in that commit message.

MFC after:	1 week
Sponsored by:	Juniper Networks
2010-03-06 21:38:31 +00:00
John Baldwin
43d9473499 - Rename the __tcpi_(snd|rcv)_mss fields of the tcp_info structure to remove
the leading underscores since they are now implemented.
- Implement the tcpi_rto and tcpi_last_data_recv fields in the tcp_info
  structure.

Reviewed by:	rwatson
MFC after:	2 weeks
2009-12-22 15:47:40 +00:00
Andre Oppermann
11c99a6d7b -Put the optimized soreceive_stream() under a compile time option called
TCP_SORECEIVE_STREAM for the time being.

Requested by:	brooks

Once compiled in make it easily switchable for testers by using a tuneable
 net.inet.tcp.soreceive_stream
and a corresponding read-only sysctl to report the current state.

Suggested by:	rwatson

MFC after:	2 days
2009-09-15 22:23:45 +00:00
Robert Watson
530c006014 Merge the remainder of kern_vimage.c and vimage.h into vnet.c and
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks.  Minor cleanups are done in the process,
and comments updated to reflect these changes.

Reviewed by:	bz
Approved by:	re (vimage blanket)
2009-08-01 19:26:27 +00:00
Robert Watson
eddfbb763d Build on Jeff Roberson's linker-set based dynamic per-CPU allocator
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator.  Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...).  This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.

Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack.  Virtualized global variables are
tagged at compile-time, placing the in a special linker set, which is
loaded into a contiguous region of kernel memory.  Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of a the kernel linker.

Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy.  Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address.  When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.

This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.

Bump __FreeBSD_version and update UPDATING.

Portions submitted by:  bz
Reviewed by:            bz, zec
Discussed with:         gnn, jamie, jeff, jhb, julian, sam
Suggested by:           peter
Approved by:            re (kensmith)
2009-07-14 22:48:30 +00:00
Bjoern A. Zeeb
88d166bf19 Make callers to in6_selectsrc() and in6_pcbladdr() pass in memory
to save the selected source address rather than returning an
unreferenced copy to a pointer that might long be gone by the
time we use the pointer for anything meaningful.

Asked for by:	rwatson
Reviewed by:	rwatson
2009-06-23 22:08:55 +00:00
Andre Oppermann
ef760e6ad2 Add soreceive_stream(), an optimized version of soreceive() for
stream (TCP) sockets.

It is functionally identical to generic soreceive() but has a
number stream specific optimizations:
o does only one sockbuf unlock/lock per receive independent of
  the length of data to be moved into the uio compared to
  soreceive() which unlocks/locks per *mbuf*.
o uses m_mbuftouio() instead of its own copy(out) variant.
o much more compact code flow as a large number of special
  cases is removed.
o much improved reability.

It offers significantly reduced CPU usage and lock contention
when receiving fast TCP streams.  Additional gains are obtained
when the receiving application is using SO_RCVLOWAT to batch up
some data before a read (and wakeup) is done.

This function was written by "reverse engineering" and is not
just a stripped down variant of soreceive().

It is not yet enabled by default on TCP sockets.  Instead it is
commented out in the protocol initialization in tcp_usrreq.c
until more widespread testing has been done.

Testers, especially with 10GigE gear, are welcome.

MFP4:	r164817 //depot/user/andre/soreceive_stream/
2009-06-22 23:08:05 +00:00
John Baldwin
9f78a87a06 - Change members of tcpcb that cache values of ticks from int to u_int:
t_rcvtime, t_starttime, t_rtttime, t_bw_rtttime, ts_recent_age,
  t_badrxtwin.
- Change t_recent in struct timewait from u_long to u_int32_t to match
  the type of the field it shadows from tcpcb: ts_recent.
- Change t_starttime in struct timewait from u_long to u_int to match
  the t_starttime field in tcpcb.

Requested by:	bde (1, 3)
2009-06-16 18:58:50 +00:00
John Baldwin
a13c655c64 Correct printf format type mismatches. 2009-06-11 14:37:18 +00:00
John Baldwin
0e8cc7e748 Change a few members of tcpcb that store cached copies of ticks to be ints
instead of unsigned longs.  This fixes a few overflow edge cases on 64-bit
platforms.  Specifically, if an idle connection receives a packet shortly
before 2^31 clock ticks of uptime (about 25 days with hz=1000) and the keep
alive timer fires after 2^31 clock ticks, the keep alive timer will think
that the connection has been idle for a very long time and will immediately
drop the connection instead of sending a keep alive probe.

Reviewed by:	silby, gnn, lstewart
MFC after:	1 week
2009-06-10 18:27:15 +00:00
Robert Watson
78b5071407 Update stats in struct tcpstat using two new macros, TCPSTAT_ADD() and
TCPSTAT_INC(), rather than directly manipulating the fields across the
kernel.  This will make it easier to change the implementation of
these statistics, such as using per-CPU versions of the data structures.

MFC after:	3 days
2009-04-11 22:07:19 +00:00
Bjoern A. Zeeb
970caf60dd With the right comparison we get a proper wscale value and thus
more adequate TCP performance with IPv6.

Changes for IPv4, r166403 and r172795, both ignored the
IPv6 counterpart and left it in the state of art of year 2000.

The same logic in syncache already shares code between v4 and v6 so
things do not need to be adapted there.

Reported by:	Steinar Haug (sthaug nethelp.no)
Tested by:	Steinar Haug (sthaug nethelp.no)
MFC after:	3 days
2009-04-07 14:42:40 +00:00
Robert Watson
ad71fe3c35 Correct a number of evolved problems with inp_vflag and inp_flags:
certain flags that should have been in inp_flags ended up in inp_vflag,
meaning that they were inconsistently locked, and in one case,
interpreted.  Move the following flags from inp_vflag to gaps in the
inp_flags space (and clean up the inp_flags constants to make gaps
more obvious to future takers):

  INP_TIMEWAIT
  INP_SOCKREF
  INP_ONESBCAST
  INP_DROPPED

Some aspects of this change have no effect on kernel ABI at all, as these
are UDP/TCP/IP-internal uses; however, netstat and sockstat detect
INP_TIMEWAIT when listing TCP sockets, so any MFC will need to take this
into account.

MFC after:      1 week (or after dependencies are MFC'd)
Reviewed by:    bz
2009-03-15 09:58:31 +00:00
Robert Watson
ce2ae9ab4b In tcp_usr_shutdown() and tcp_usr_send(), I missed converting NULL
checks for the tcpcb, previously used to detect complete disconnection,
with INP_DROPPED checks.  Correct that, preventing shutdown() from
improperly generating a TCP segment with destination IP and port of
0.0.0.0:0.

PR:		kern/132050
Reported by:	david gueluy <david.gueluy at netasq.com>
MFC after:	3 weeks
2009-02-24 11:17:50 +00:00
Jamie Gritton
b89e82dd87 Standardize the various prison_foo_ip[46] functions and prison_if to
return zero on success and an error code otherwise.  The possible errors
are EADDRNOTAVAIL if an address being checked for doesn't match the
prison, and EAFNOSUPPORT if the prison doesn't have any addresses in
that address family.  For most callers of these functions, use the
returned error code instead of e.g. a hard-coded EADDRNOTAVAIL or
EINVAL.

Always include a jailed() check in these functions, where a non-jailed
cred always returns success (and makes no changes).  Remove the explicit
jailed() checks that preceded many of the function calls.

Approved by:	bz (mentor)
2009-02-05 14:06:09 +00:00
Bjoern A. Zeeb
dcdb4371ca Use inc_flags instead of the inc_isipv6 alias which so far
had been the only flag with random usage patterns.
Switch inc_flags to be used as a real bit field by using
INC_ISIPV6 with bitops to check for the 'isipv6' condition.

While here fix a place or two where in case of v4 inc_flags
were not properly initialized before.[1]

Found by:	rwatson during review [1]
Discussed with:	rwatson
Reviewed by:	rwatson
MFC after:	4 weeks
2008-12-17 12:52:34 +00:00
Bjoern A. Zeeb
fc384fa5d6 Another step assimilating IPv[46] PCB code - directly use
the inpcb names rather than the following IPv6 compat macros:
in6pcb,in6p_sp, in6p_ip6_nxt,in6p_flowinfo,in6p_vflag,
in6p_flags,in6p_socket,in6p_lport,in6p_fport,in6p_ppcb and
sotoin6pcb().

Apart from removing duplicate code in netipsec, this is a pure
whitespace, not a functional change.

Discussed with:	rwatson
Reviewed by:	rwatson (version before review requested changes)
MFC after:	4 weeks (set the timer and see then)
2008-12-15 21:50:54 +00:00
Bjoern A. Zeeb
4b79449e2f Rather than using hidden includes (with cicular dependencies),
directly include only the header files needed. This reduces the
unneeded spamming of various headers into lots of files.

For now, this leaves us with very few modules including vnet.h
and thus needing to depend on opt_route.h.

Reviewed by:	brooks, gnn, des, zec, imp
Sponsored by:	The FreeBSD Foundation
2008-12-02 21:37:28 +00:00
Bjoern A. Zeeb
413628a7e3 MFp4:
Bring in updated jail support from bz_jail branch.

This enhances the current jail implementation to permit multiple
addresses per jail. In addtion to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..

SCTP support was updated and supports IPv6 in jails as well.

Cpuset support permits jails to be bound to specific processor
sets after creation.

Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.

DDB 'show jails' command was added to aid debugging.

Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.

Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.

Bump __FreeBSD_version for the afore mentioned and in kernel changes.

Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
  and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
  help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
  suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
  on cluster machines as well as all the testers and people
  who provided feedback the last months on freebsd-jail and
  other channels.
- My employer, CK Software GmbH, for the support so I could work on this.

Reviewed by:	(see above)
MFC after:	3 months (this is just so that I get the mail)
X-MFC Before:   7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
Bjoern A. Zeeb
5cd54324ee Replace most INP_CHECK_SOCKAF() uses checking if it is an
IPv6 socket by comparing a constant inp vflag.
This is expected to help to reduce extra locking.

Suggested by:	rwatson
Reviewed by:	rwatson
MFC after:	6 weeks
2008-11-27 13:19:42 +00:00
Bjoern A. Zeeb
6aee2fc550 Merge in6_pcbfree() into in_pcbfree() which after the previous
IPsec change in r185366 only differed in two additonal IPv6 lines.
Rather than splattering conditional code everywhere add the v6
check centrally at this single place.

Reviewed by:	rwatson (as part of a larger changset)
MFC after:	6 weeks (*)
(*) possibly need to leave a stub wrapper in 7 to keep the symbol.
2008-11-27 12:04:35 +00:00
Bjoern A. Zeeb
0206cdb846 Remove in6_pcbdetach() as it is exactly the same function
as in_pcbdetach() and we don't need the code twice.

Reviewed by:	rwatson
MFC after:	6 weeks (*)
(*) possibly need to leave a stub wrapper in 7 to keep the symbol.
2008-11-26 20:52:26 +00:00
Marko Zec
8b615593fc Step 1.5 of importing the network stack virtualization infrastructure
from the vimage project, as per plan established at devsummit 08/08:
http://wiki.freebsd.org/Image/Notes200808DevSummit

Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator
macros, and CURVNET_SET() context setting macros, all currently
resolving to NOPs.

Prepare for virtualization of selected SYSCTL objects by introducing a
family of SYSCTL_V_*() macros, currently resolving to their global
counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT().

Move selected #defines from sys/sys/vimage.h to newly introduced header
files specific to virtualized subsystems (sys/net/vnet.h,
sys/netinet/vinet.h etc.).

All the changes are verified to have zero functional impact at this
point in time by doing MD5 comparision between pre- and post-change
object files(*).

(*) netipsec/keysock.c did not validate depending on compile time options.

Implemented by:	julian, bz, brooks, zec
Reviewed by:	julian, bz, brooks, kris, rwatson, ...
Approved by:	julian (mentor)
Obtained from:	//depot/projects/vimage-commit2/...
X-MFC after:	never
Sponsored by:	NLnet Foundation, The FreeBSD Foundation
2008-10-02 15:37:58 +00:00
Bjoern A. Zeeb
603724d3ab Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).

This is the first in a series of commits over the course
of the next few weeks.

Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.

We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.

Obtained from:	//depot/projects/vimage-commit2/...
Reviewed by:	brooks, des, ed, mav, julian,
		jamie, kris, rwatson, zec, ...
		(various people I forgot, different versions)
		md5 (with a bit of help)
Sponsored by:	NLnet Foundation, The FreeBSD Foundation
X-MFC after:	never
V_Commit_Message_Reviewed_By:	more people than the patch
2008-08-17 23:27:27 +00:00
Rui Paulo
f2512ba12a MFp4 (//depot/projects/tcpecn/):
TCP ECN support. Merge of my GSoC 2006 work for NetBSD.
  TCP ECN is defined in RFC 3168.

Partly reviewed by:	dwmalone, silby
Obtained from:		NetBSD
2008-07-31 15:10:09 +00:00
Kip Macy
8ab7ce7c61 replace spaces added in last change with tabs 2008-05-05 23:13:27 +00:00
Kip Macy
535fbad68f add rcv_nxt, snd_nxt, and toe offload id to FreeBSD-specific
extension fields for tcp_info
2008-05-05 20:13:31 +00:00
Robert Watson
8501a69cc9 Convert pcbinfo and inpcb mutexes to rwlocks, and modify macros to
explicitly select write locking for all use of the inpcb mutex.
Update some pcbinfo lock assertions to assert locked rather than
write-locked, although in practice almost all uses of the pcbinfo
rwlock main exclusive, and all instances of inpcb lock acquisition
are exclusive.

This change should introduce (ideally) little functional change.
However, it lays the groundwork for significantly increased
parallelism in the TCP/IP code.

MFC after:	3 months
Tested by:	kris (superset of committered patch)
2008-04-17 21:38:18 +00:00
Robert Watson
109058b094 tcp_usrreq.c:1.313 removed tcbinfo locking from tcp_usr_accept(), which
while in principle a good idea, opened us up to a race inherrent to
the syncache's direct insertion of incoming TCP connections into the
"completed connection" listen queue, as it transpires that the socket
is inserted before the inpcb is fully filled in by syncache_expand().
The bug manifested with the occasional returning of 0.0.0.0:0 in the
address returned by the accept() system call, which occurred if accept
managed to execute tcp_usr_accept() before syncache_expand() had copied
the endpoint addresses into inpcb connection state.

Re-add tcbinfo locking around the address copyout, which has the effect
of delaying the copy until syncache_expand() has finished running, as
it is run while the tcbinfo lock is held.  This is undesirable in that
it increases contention on tcbinfo further, but a more significant
change will be required to how the syncache inserts new sockets in
order to fix this and keep more granular locking here.  In particular,
either more state needs to be passed into sonewconn() so that
pru_attach() can fill in the fields *before* the socket is inserted, or
the socket needs to be inserted in the incomplete connection queue
until it is actually ready to be used.

Reported by:	glebius (and kris)
Tested by:	glebius
2008-01-23 21:15:51 +00:00
Robert Watson
1e8f5ffa35 In tcp_ctloutput(), don't hold the inpcb lock over sooptcopyin(), rather,
drop the lock and then re-acquire it, revalidating TCP connection state
assumptions when we do so.  This avoids a potential lock order reversal
(and potential deadlock, although none have been reported) due to the
inpcb lock being held over a page fault.

MFC after:	1 week
PR:		102752
Reviewed by:	bz
Reported by:	Václav Haisman <v dot haisman at sh dot cvut dot cz>
2008-01-18 12:19:50 +00:00
Kip Macy
bc65987ade Incorporate TCP offload hooks in to core TCP code.
- Rename output routines tcp_gen_* -> tcp_output_*.
  - Rename notification routines that turn in to no-ops in the absence of TOE
    from tcp_gen_* -> tcp_offload_*.
  - Fix some minor comment nits.
  - Add a /* FALLTHROUGH */

Reviewed by: Sam Leffler, Robert Watson, and Mike Silbersack
2007-12-18 22:59:07 +00:00
Mike Silbersack
9b3bc6bf83 Pick the smallest possible TCP window scaling factor that will still allow
us to scale up to sb_max, aka kern.ipc.maxsockbuf.

We do this because there are broken firewalls that will corrupt the window
scale option, leading to the other endpoint believing that our advertised
window is unscaled.  At scale factors larger than 5 the unscaled window will
drop below 1500 bytes, leading to serious problems when traversing these
broken firewalls.

With the default maxsockbuf of 256K, a scale factor of 3 will be chosen by
this algorithm.  Those who choose a larger maxsockbuf should watch out
for the compatiblity problems mentioned above.

Reviewed by:	andre
2007-10-19 08:53:14 +00:00
Mike Silbersack
4b421e2daa Add FBSDID to all files in netinet so that people can more
easily include file version information in bug reports.

Approved by:	re (kensmith)
2007-10-07 20:44:24 +00:00
Mike Silbersack
e2f2059f68 Two changes:
- Reintegrate the ANSI C function declaration change
  from tcp_timer.c rev 1.92

- Reorganize the tcpcb structure so that it has a single
  pointer to the "tcp_timer" structure which contains all
  of the tcp timer callouts.  This change means that when
  the single tcp timer change is reintegrated, tcpcb will
  not change in size, and therefore the ABI between
  netstat and the kernel will not change.

Neither of these changes should have any functional
impact.

Reviewed by: bmah, rrs
Approved by: re (bmah)
2007-09-24 05:26:24 +00:00
Robert Watson
85d9437250 Back out tcp_timer.c:1.93 and associated changes that reimplemented the many
TCP timers as a single timer, but retain the API changes necessary to
reintroduce this change.  This will back out the source of at least two
reported problems: lock leaks in certain timer edge cases, and TCP timers
continuing to fire after a connection has closed (a bug previously fixed and
then reintroduced with the timer rewrite).

In a follow-up commit, some minor restylings and comment changes performed
after the TCP timer rewrite will be reapplied, and a further change to allow
the TCP timer rewrite to be added back without disturbing the ABI.  The new
design is believed to be a good thing, but the outstanding issues are
leading to significant stability/correctness problems that are holding
up 7.0.

This patch was generated by silby, but is being committed by proxy due to
poor network connectivity for silby this week.

Approved by:	re (kensmith)
Submitted by:	silby
Tested by:	rwatson, kris
Problems reported by:	peter, kris, others
2007-09-07 09:19:22 +00:00
Dag-Erling Smørgrav
218cbbea9a Make tcpstates[] static, and make sure TCPSTATES is defined before
<netinet/tcp_fsm.h> is included into any compilation unit that needs
tcpstates[].  Also remove incorrect extern declarations and TCPDEBUG
conditionals.  This allows kernels both with and without TCPDEBUG to
build, and unbreaks the tinderbox.

Approved by:	re (rwatson)
2007-07-30 11:06:42 +00:00
Matt Jacob
24face5416 Fix compilation problems- tcpstates is only available if TCPDEBUG
is set.

Approved by:	re (in spirit)
2007-07-29 01:31:33 +00:00
Matt Jacob
3c010a416c Garbage collect some debug code that not only no longer could
work but in fact probably causes a random pointer dereferences.
Garbage collect the tp variable too.
2007-06-15 22:54:11 +00:00
Robert Watson
abc7d91030 (1) In tcp_usrclosed(), tp can never become NULL, so don't test for NULL
before handling the socket disconnection case.

(2) Clean up surrounding comments and formatting.

Found with:	Coverity Prevent(tm) (1)
CID:		2203
2007-05-31 12:06:02 +00:00
Robert Watson
54d642bbe5 Reduce network stack oddness: implement .pru_sockaddr and .pru_peeraddr
protocol entry points using functions named proto_getsockaddr and
proto_getpeeraddr rather than proto_setsockaddr and proto_setpeeraddr.
While it's true that sockaddrs are allocated and set, the net effect is
to retrieve (get) the socket address or peer address from a socket, not
set it, so align names to that intent.
2007-05-11 10:20:51 +00:00
Robert Watson
169db7b25d Remove unneeded wrappers for in_setsockaddr() and in_setpeeraddr(), which
used to exist so pcbinfo locks could be acquired, but are no longer
required as a result of socket/pcb reference model refinements.
2007-05-11 09:54:53 +00:00
Robert Watson
f2565d68a4 Move universally to ANSI C function declarations, with relatively
consistent style(9)-ish layout.
2007-05-10 15:58:48 +00:00
Andre Oppermann
1a5537409f Remove unused requested_s_scale from struct tcpcb. 2007-05-06 16:04:36 +00:00
Andre Oppermann
3529149e9a Use existing TF_SACK_PERMIT flag in struct tcpcb t_flags field instead of
a decdicated sack_enable int for this bool.  Change all users accordingly.
2007-05-06 15:56:31 +00:00
Robert Watson
84ca8aa609 Remove unused pcbinfo arguments to in_setsockaddr() and
in_setpeeraddr().
2007-05-01 16:31:02 +00:00
Andre Oppermann
b8152ba793 Change the TCP timer system from using the callout system five times
directly to a merged model where only one callout, the next to fire,
is registered.

Instead of callout_reset(9) and callout_stop(9) the new function
tcp_timer_activate() is used which then internally manages the callout.

The single new callout is a mutex callout on inpcb simplifying the
locking a bit.

tcp_timer() is the called function which handles all race conditions
in one place and then dispatches the individual timer functions.

Reviewed by:	rwatson (earlier version)
2007-04-11 09:45:16 +00:00
Andre Oppermann
ad3f9ab320 ANSIfy function declarations and remove register keywords for variables.
Consistently apply style to all function declarations.
2007-03-21 19:37:55 +00:00
Andre Oppermann
e406f5a1c9 Remove tcp_minmssoverload DoS detection logic. The problem it tried to
protect us from wasn't really there and it only bloats the code.  Should
the problem surface in the future we can simply resurrect it from cvs
history.
2007-03-21 18:05:54 +00:00
Mohan Srinivasan
7c72af8770 Reap FIN_WAIT_2 connections marked SOCANTRCVMORE faster. This mitigate
potential issues where the peer does not close, potentially leaving
thousands of connections in FIN_WAIT_2. This is controlled by a new sysctl
fast_finwait2_recycle, which is disabled by default.

Reviewed by: gnn, silby.
2007-02-26 22:25:21 +00:00
Robert Watson
497057eeea Add "show inpcb", "show tcpcb" DDB commands, which should come in handy
for debugging sblock and other network panics.
2007-02-17 21:02:38 +00:00
Bruce M Simpson
1baaf8347c Expose smoothed RTT and RTT variance measurements to userland via
socket option TCP_INFO.
Note that the units used in the original Linux API are in microseconds,
so use a 64-bit mantissa to convert FreeBSD's internal measurements
from struct tcpcb from ticks.
2007-02-02 18:34:18 +00:00
Andre Oppermann
6741ecf595 Auto sizing TCP socket buffers.
Normally the socket buffers are static (either derived from global
defaults or set with setsockopt) and do not adapt to real network
conditions. Two things happen: a) your socket buffers are too small
and you can't reach the full potential of the network between both
hosts; b) your socket buffers are too big and you waste a lot of
kernel memory for data just sitting around.

With automatic TCP send and receive socket buffers we can start with a
small buffer and quickly grow it in parallel with the TCP congestion
window to match real network conditions.

FreeBSD has a default 32K send socket buffer. This supports a maximal
transfer rate of only slightly more than 2Mbit/s on a 100ms RTT
trans-continental link. Or at 200ms just above 1Mbit/s. With TCP send
buffer auto scaling and the default values below it supports 20Mbit/s
at 100ms and 10Mbit/s at 200ms. That's an improvement of factor 10, or
1000%. For the receive side it looks slightly better with a default of
64K buffer size.

New sysctls are:
  net.inet.tcp.sendbuf_auto=1 (enabled)
  net.inet.tcp.sendbuf_inc=8192 (8K, step size)
  net.inet.tcp.sendbuf_max=262144 (256K, growth limit)
  net.inet.tcp.recvbuf_auto=1 (enabled)
  net.inet.tcp.recvbuf_inc=16384 (16K, step size)
  net.inet.tcp.recvbuf_max=262144 (256K, growth limit)

Tested by:	many (on HEAD and RELENG_6)
Approved by:	re
MFC after:	1 month
2007-02-01 18:32:13 +00:00
Andre Oppermann
087b55ea59 Change the way the advertized TCP window scaling is computed. Instead of
upper-bounding it to the size of the initial socket buffer lower-bound it
to the smallest MSS we accept.  Ideally we'd use the actual MSS information
here but it is not available yet.

For socket buffer auto sizing to be effective we need room to grow the
receive window.  The window scale shift is determined at connection setup
and can't be changed afterwards.  The previous, original, method effectively
just did a power of two roundup of the socket buffer size at connection
setup severely limiting the headroom for larger socket buffers.

Tested by:	many (as part of the socket buffer auto sizing patch)
MFC after:	1 month
2007-02-01 17:39:18 +00:00
Sam Leffler
21367f630d Change error codes returned by protocol operations when an inpcb is
marked INP_DROPPED or INP_TIMEWAIT:
o return ECONNRESET instead of EINVAL for close, disconnect, shutdown,
  rcvd, rcvoob, and send operations
o return ECONNABORTED instead of EINVAL for accept

These changes should reduce confusion in applications since EINVAL is
normally interpreted to mean an invalid file descriptor.  This change
does not conflict with POSIX or other standards I checked. The return
of EINVAL has always been possible but rare; it's become more common
with recent changes to the socket/inpcb handling and with finer-grained
locking and preemption.

Note: there are other instances of EINVAL for this state that were
      left unchanged; they should be reviewed.

Reviewed by:	rwatson, andre, ru
MFC after:	1 month
2006-11-22 17:16:54 +00:00
Andre Oppermann
7ff0b850a6 Make tcp_usr_send() free the passed mbufs on error in all cases as the
comment to it claims.

Sponsored by:	TCP/IP Optimization Fundraise 2005
2006-09-17 13:39:35 +00:00
Robert Watson
a152f8a361 Change semantics of socket close and detach. Add a new protocol switch
function, pru_close, to notify protocols that the file descriptor or
other consumer of a socket is closing the socket.  pru_abort is now a
notification of close also, and no longer detaches.  pru_detach is no
longer used to notify of close, and will be called during socket
tear-down by sofree() when all references to a socket evaporate after
an earlier call to abort or close the socket.  This means detach is now
an unconditional teardown of a socket, whereas previously sockets could
persist after detach of the protocol retained a reference.

This faciliates sharing mutexes between layers of the network stack as
the mutex is required during the checking and removal of references at
the head of sofree().  With this change, pru_detach can now assume that
the mutex will no longer be required by the socket layer after
completion, whereas before this was not necessarily true.

Reviewed by:	gnn
2006-07-21 17:11:15 +00:00
Stephan Uphoff
d915b28015 Fix race conditions on enumerating pcb lists by moving the initialization
( and where appropriate the destruction) of the pcb mutex to the init/finit
functions of the pcb zones.
This allows locking of the pcb entries and race condition free comparison
of the generation count.
Rearrange locking a bit to avoid extra locking operation to update the generation
count in in_pcballoc(). (in_pcballoc now returns the pcb locked)

I am planning to convert pcb list handling from a type safe to a reference count
model soon. ( As this allows really freeing the PCBs)

Reviewed by:	rwatson@, mohans@
MFC after:	1 week
2006-07-18 22:34:27 +00:00
Robert Watson
b4470c1639 In tcp6_usr_attach(), return immediately if SS_ISDISCONNECTED, to
avoid dereferencing an uninitialized inp variable.

Submitted by:	Michiel Boland <michiel at boland dot org>
MFC after:	1 month
2006-06-26 09:38:08 +00:00
Robert Watson
f2de87fec4 Push acquisition of pcbinfo lock out of tcp_usr_attach() into
tcp_attach() after the call to soreserve(), as it doesn't require
the global lock.  Rearrange inpcb locking here also.

MFC after:	1 month
2006-06-04 09:31:34 +00:00
Robert Watson
c78cbc7b1d Instead of calling tcp_usr_detach() from tcp_usr_abort(), break out
common pcb tear-down logic into tcp_detach(), which is called from
either.  Invoke tcp_drop() from the tcp_usr_abort() path rather than
tcp_disconnect(), as we want to drop it immediately not perform a
FIN sequence.  This is one reason why some people were experiencing
panics in sodealloc(), as the netisr and aborting thread were
simultaneously trying to tear down the socket.  This bug could often
be reproduced using repeated runs of the listenclose regression test.

MFC after:	3 months
PR:		96090
Reported by:	Peter Kostouros <kpeter at melbpc dot org dot au>, kris
Tested by:	Peter Kostouros <kpeter at melbpc dot org dot au>, kris
2006-04-24 08:20:02 +00:00
Robert Watson
e6e65783d6 Clarify comment on handling of non-timewait TCP states in
tcp_usr_detach().

MFC after:	3 months
2006-04-03 12:43:56 +00:00
Robert Watson
3d2d3ef434 After checking for SO_ISDISCONNECTED in tcp_usr_accept(), return
immediately rather than jumping to the normal output handling, which
assumes we've pulled out the inpcb, which hasn't happened at this
point (and isn't necessary).

Return ECONNABORTED instead of EINVAL when the inpcb has entered
INP_TIMEWAIT or INP_DROPPED, as this is the documented error value.

This may correct the panic seen by Ganbold.

MFC after:	1 month
Reported by:	Ganbold <ganbold at micom dot mng dot net>
2006-04-03 09:52:55 +00:00
Robert Watson
953b5606df During reformulation of tcp_usr_detach(), the call to initiate TCP
disconnect for fully connected sockets was dropped, meaning that if
the socket was closed while the connection was alive, it would be
leaked.  Structure tcp_usr_detach() so that there are two clear
parts: initiating disconnect, and reclaiming state, and reintroduce
the tcp_disconnect() call in the first part.

MFC after:	3 months
2006-04-02 16:42:51 +00:00
Robert Watson
34af7bae80 Properly handle an edge case previously not handled correctly: a
socket can have a tcp connection that has entered time wait
attached to it, in the event that shutdown() is called on the
socket and the FINs properly exchange before close().  In this
case we don't detach or free the inpcb, just leave the tcptw
detached and freed, but we must release the inpcb lock (which we
didn't previously).

MFC after:	3 months
2006-04-01 23:53:25 +00:00
Robert Watson
623dce13c6 Update TCP for infrastructural changes to the socket/pcb refcount model,
pru_abort(), pru_detach(), and in_pcbdetach():

- Universally support and enforce the invariant that so_pcb is
  never NULL, converting dozens of unnecessary NULL checks into
  assertions, and eliminating dozens of unnecessary error handling
  cases in protocol code.

- In some cases, eliminate unnecessary pcbinfo locking, as it is no
  longer required to ensure so_pcb != NULL.  For example, the receive
  code no longer requires the pcbinfo lock, and the send code only
  requires it if building a new connection on an otherwise unconnected
  socket triggered via sendto() with an address.  This should
  significnatly reduce tcbinfo lock contention in the receive and send
  cases.

- In order to support the invariant that so_pcb != NULL, it is now
  necessary for the TCP code to not discard the tcpcb any time a
  connection is dropped, but instead leave the tcpcb until the socket
  is shutdown.  This case is handled by setting INP_DROPPED, to
  substitute for using a NULL so_pcb to indicate that the connection
  has been dropped.  This requires the inpcb lock, but not the pcbinfo
  lock.

- Unlike all other protocols in the tree, TCP may need to retain access
  to the socket after the file descriptor has been closed.  Set
  SS_PROTOREF in tcp_detach() in order to prevent the socket from being
  freed, and add a flag, INP_SOCKREF, so that the TCP code knows whether
  or not it needs to free the socket when the connection finally does
  close.  The typical case where this occurs is if close() is called on
  a TCP socket before all sent data in the send socket buffer has been
  transmitted or acknowledged.  If INP_SOCKREF is found when the
  connection is dropped, we release the inpcb, tcpcb, and socket instead
  of flagging INP_DROPPED.

- Abort and detach protocol switch methods no longer return failures,
  nor attempt to free sockets, as the socket layer does this.

- Annotate the existence of a long-standing race in the TCP timer code,
  in which timers are stopped but not drained when the socket is freed,
  as waiting for drain may lead to deadlocks, or have to occur in a
  context where waiting is not permitted.  This race has been handled
  by testing to see if the tcpcb pointer in the inpcb is NULL (and vice
  versa), which is not normally permitted, but may be true of a inpcb
  and tcpcb have been freed.  Add a counter to test how often this race
  has actually occurred, and a large comment for each instance where
  we compare potentially freed memory with NULL.  This will have to be
  fixed in the near future, but requires is to further address how to
  handle the timer shutdown shutdown issue.

- Several TCP calls no longer potentially free the passed inpcb/tcpcb,
  so no longer need to return a pointer to indicate whether the argument
  passed in is still valid.

- Un-macroize debugging and locking setup for various protocol switch
  methods for TCP, as it lead to more obscurity, and as locking becomes
  more customized to the methods, offers less benefit.

- Assert copyright on tcp_usrreq.c due to significant modifications that
  have been made as part of this work.

These changes significantly modify the memory management and connection
logic of our TCP implementation, and are (as such) High Risk Changes,
and likely to contain serious bugs.  Please report problems to the
current@ mailing list ASAP, ideally with simple test cases, and
optionally, packet traces.

MFC after:	3 months
2006-04-01 16:36:36 +00:00
Robert Watson
bc725eafc7 Chance protocol switch method pru_detach() so that it returns void
rather than an error.  Detaches do not "fail", they other occur or
the protocol flags SS_PROTOREF to take ownership of the socket.

soclose() no longer looks at so_pcb to see if it's NULL, relying
entirely on the protocol to decide whether it's time to free the
socket or not using SS_PROTOREF.  so_pcb is now entirely owned and
managed by the protocol code.  Likewise, no longer test so_pcb in
other socket functions, such as soreceive(), which have no business
digging into protocol internals.

Protocol detach routines no longer try to free the socket on detach,
this is performed in the socket code if the protocol permits it.

In rts_detach(), no longer test for rp != NULL in detach, and
likewise in other protocols that don't permit a NULL so_pcb, reduce
the incidence of testing for it during detach.

netinet and netinet6 are not fully updated to this change, which
will be in an upcoming commit.  In their current state they may leak
memory or panic.

MFC after:	3 months
2006-04-01 15:42:02 +00:00
Robert Watson
ac45e92ff2 Change protocol switch pru_abort() API so that it returns void rather
than an int, as an error here is not meaningful.  Modify soabort() to
unconditionally free the socket on the return of pru_abort(), and
modify most protocols to no longer conditionally free the socket,
since the caller will do this.

This commit likely leaves parts of netinet and netinet6 in a situation
where they may panic or leak memory, as they have not are not fully
updated by this commit.  This will be corrected shortly in followup
commits to these components.

MFC after:      3 months
2006-04-01 15:15:05 +00:00
Maxime Henrion
e59898ff36 Fix a bunch of SYSCTL_INT() that should have been SYSCTL_ULONG() to
match the type of the variable they are exporting.

Spotted by:	Thomas Hurst <tom@hur.st>
MFC after:	3 days
2005-12-14 22:27:48 +00:00
Robert Watson
d374e81efd Push the assignment of a new or updated so_qlimit from solisten()
following the protocol pru_listen() call to solisten_proto(), so
that it occurs under the socket lock acquisition that also sets
SO_ACCEPTCONN.  This requires passing the new backlog parameter
to the protocol, which also allows the protocol to be aware of
changes in queue limit should it wish to do something about the
new queue limit.  This continues a move towards the socket layer
acting as a library for the protocol.

Bump __FreeBSD_version due to a change in the in-kernel protocol
interface.  This change has been tested with IPv4 and UNIX domain
sockets, but not other protocols.
2005-10-30 19:44:40 +00:00