Commit Graph

549 Commits

Author SHA1 Message Date
Andre Oppermann
3c914c547e Allow drivers to specify a maximum TSO length in bytes if they are
limited in the amount of data they can handle at once.

Drivers can set ifp->if_hw_tsomax before calling ether_ifattach() to
change the limit.

The lowest allowable size is IP_MAXPACKET / 8 (8192 bytes) as anything
less wouldn't be very useful anymore.  The upper limit is still at
IP_MAXPACKET (65536 bytes).  Raising it requires further auditing of
the IPv4/v6 code path's as the length field in the IP header would
overflow leading to confusion in firewalls and others packet handler on
the real size of the packet.

The placement into "struct ifnet" is a bit hackish but the best place
that was found.  When the stack/driver boundary is updated it should
be handled in a better way.

Submitted by:	cperciva (earlier version)
Reviewed by:	cperciva
Tested by:	cperciva
MFC after:	1 week (using spare struct members to preserve ABI)
2013-06-03 12:55:13 +00:00
Jim Harris
d13fc9954b Fix typo in net.inet.tcp.minmss sysctl description.
MFC after:	3 days
2013-05-13 19:55:27 +00:00
Andre Oppermann
f89d4c3acf Back out r249318, r249320 and r249327 due to a heisenbug most
likely related to a race condition in the ipi_hash_lock with
the exact cause currently unknown but under investigation.
2013-05-06 16:42:18 +00:00
Gabor Kovesdan
8fb3bbe770 - Corrrect mispellings of word useful
Submitted by:	Christoph Mallon <christoph.mallon@gmx.de> (via private mail)
2013-04-17 11:45:15 +00:00
Andre Oppermann
e8b3186b6a Change certain heavily used network related mutexes and rwlocks to
reside on their own cache line to prevent false sharing with other
nearby structures, especially for those in the .bss segment.

NB: Those mutexes and rwlocks with variables next to them that get
changed on every invocation do not benefit from their own cache line.
Actually it may be net negative because two cache misses would be
incurred in those cases.
2013-04-09 21:02:20 +00:00
Gleb Smirnoff
dc4ad05ecd Use m_get/m_gethdr instead of compat macros.
Sponsored by:	Nginx, Inc.
2013-03-15 12:55:30 +00:00
Pawel Jakub Dawidek
6acd596efb More warnings for zones that depend on the kern.ipc.maxsockets limit.
Obtained from:	WHEEL Systems
2012-12-08 12:51:06 +00:00
Gleb Smirnoff
eb1b1807af Mechanically substitute flags from historic mbuf allocator with
malloc(9) flags within sys.

Exceptions:

- sys/contrib not touched
- sys/mbuf.h edited manually
2012-12-05 08:04:20 +00:00
Alfred Perlstein
08373e0bc4 Auto size the tcbhashsize structure based on max sockets.
While here, also make the code that enforces power-of-two more
forgiving, instead of just resetting to 512, graciously round-down
to the next lower power of two.
2012-11-27 03:04:24 +00:00
Bjoern A. Zeeb
ec89d0398b Cleanup some whitspace in this file to get it out of an upcoming patch.
MFC after:	10 days
2012-11-08 03:29:55 +00:00
Gleb Smirnoff
8f134647ca Switch the entire IPv4 stack to keep the IP packet header
in network byte order. Any host byte order processing is
done in local variables and host byte order values are
never[1] written to a packet.

  After this change a packet processed by the stack isn't
modified at all[2] except for TTL.

  After this change a network stack hacker doesn't need to
scratch his head trying to figure out what is the byte order
at the given place in the stack.

[1] One exception still remains. The raw sockets convert host
byte order before pass a packet to an application. Probably
this would remain for ages for compatibility.

[2] The ip_input() still subtructs header len from ip->ip_len,
but this is planned to be fixed soon.

Reviewed by:	luigi, Maxim Dounin <mdounin mdounin.ru>
Tested by:	ray, Olivier Cochard-Labbe <olivier cochard.me>
2012-10-22 21:09:03 +00:00
Gleb Smirnoff
d6d3f01e0a Merge the projects/pf/head branch, that was worked on for last six months,
into head. The most significant achievements in the new code:

 o Fine grained locking, thus much better performance.
 o Fixes to many problems in pf, that were specific to FreeBSD port.

New code doesn't have that many ifdefs and much less OpenBSDisms, thus
is more attractive to our developers.

  Those interested in details, can browse through SVN log of the
projects/pf/head branch. And for reference, here is exact list of
revisions merged:

r232043, r232044, r232062, r232148, r232149, r232150, r232298, r232330,
r232332, r232340, r232386, r232390, r232391, r232605, r232655, r232656,
r232661, r232662, r232663, r232664, r232673, r232691, r233309, r233782,
r233829, r233830, r233834, r233835, r233836, r233865, r233866, r233868,
r233873, r234056, r234096, r234100, r234108, r234175, r234187, r234223,
r234271, r234272, r234282, r234307, r234309, r234382, r234384, r234456,
r234486, r234606, r234640, r234641, r234642, r234644, r234651, r235505,
r235506, r235535, r235605, r235606, r235826, r235991, r235993, r236168,
r236173, r236179, r236180, r236181, r236186, r236223, r236227, r236230,
r236252, r236254, r236298, r236299, r236300, r236301, r236397, r236398,
r236399, r236499, r236512, r236513, r236525, r236526, r236545, r236548,
r236553, r236554, r236556, r236557, r236561, r236570, r236630, r236672,
r236673, r236679, r236706, r236710, r236718, r237154, r237155, r237169,
r237314, r237363, r237364, r237368, r237369, r237376, r237440, r237442,
r237751, r237783, r237784, r237785, r237788, r237791, r238421, r238522,
r238523, r238524, r238525, r239173, r239186, r239644, r239652, r239661,
r239773, r240125, r240130, r240131, r240136, r240186, r240196, r240212.

I'd like to thank people who participated in early testing:

Tested by:	Florian Smeets <flo freebsd.org>
Tested by:	Chekaluk Vitaly <artemrts ukr.net>
Tested by:	Ben Wilber <ben desync.com>
Tested by:	Ian FREISLICH <ianf cloudseed.co.za>
2012-09-08 06:41:54 +00:00
Navdeep Parhar
09fe63205c - Updated TOE support in the kernel.
- Stateful TCP offload drivers for Terminator 3 and 4 (T3 and T4) ASICs.
  These are available as t3_tom and t4_tom modules that augment cxgb(4)
  and cxgbe(4) respectively.  The cxgb/cxgbe drivers continue to work as
  usual with or without these extra features.

- iWARP driver for Terminator 3 ASIC (kernel verbs).  T4 iWARP in the
  works and will follow soon.

Build-tested with make universe.

30s overview
============
What interfaces support TCP offload?  Look for TOE4 and/or TOE6 in the
capabilities of an interface:
# ifconfig -m | grep TOE

Enable/disable TCP offload on an interface (just like any other ifnet
capability):
# ifconfig cxgbe0 toe
# ifconfig cxgbe0 -toe

Which connections are offloaded?  Look for toe4 and/or toe6 in the
output of netstat and sockstat:
# netstat -np tcp | grep toe
# sockstat -46c | grep toe

Reviewed by:	bz, gnn
Sponsored by:	Chelsio communications.
MFC after:	~3 months (after 9.1, and after ensuring MFC is feasible)
2012-06-19 07:34:13 +00:00
Bjoern A. Zeeb
356ab07e2d It turns out that too many drivers are not only parsing the L2/3/4
headers for TSO but also for generic checksum offloading.  Ideally we
would only have one common function shared amongst all drivers, and
perhaps when updating them for IPv6 we should introduce that.
Eventually we should provide the meta information along with mbufs to
avoid (re-)parsing entirely.

To not break IPv6 (checksums and offload) and to be able to MFC the
changes without risking to hurt 3rd party drivers, duplicate the v4
framework, as other OSes have done as well.

Introduce interface capability flags for TX/RX checksum offload with
IPv6, to allow independent toggling (where possible).  Add CSUM_*_IPV6
flags for UDP/TCP over IPv6, and reserve further for SCTP, and IPv6
fragmentation.  Define CSUM_DELAY_DATA_IPV6 as we do for legacy IP and
add an alias for CSUM_DATA_VALID_IPV6.

This pretty much brings IPv6 handling in line with IPv4.
TSO is still handled in a different way and not via if_hwassist.

Update ifconfig to allow (un)setting of the new capability flags.
Update loopback to announce the new capabilities and if_hwassist flags.

Individual driver updates will have to follow, as will SCTP.

Reported by:	gallatin, dim, ..
Reviewed by:	gallatin (glanced at?)
MFC after:	3 days
X-MFC with:	r235961,235959,235958
2012-05-28 09:30:13 +00:00
Bjoern A. Zeeb
45747ba53c MFp4 bz_ipv6_fast:
Add code to handle pre-checked TCP checksums as indicated by mbuf
  flags to save the entire computation for validation if not needed.

  In the IPv6 TCP output path only compute the pseudo-header checksum,
  set the checksum offset in the mbuf field along the appropriate flag
  as done in IPv4.

  In tcp_respond() just initialize the IPv6 payload length to 0 as
  ip6_output() will properly set it.

  Sponsored by:	The FreeBSD Foundation
  Sponsored by:	iXsystems

Reviewed by:	gnn (as part of the whole)
MFC After:	3 days
2012-05-25 02:23:26 +00:00
Gleb Smirnoff
ef341ee1e3 When we receive an ICMP unreach need fragmentation datagram, we take
proposed MTU value from it and update the TCP host cache. Then
tcp_mss_update() is called on the corresponding tcpcb. It finds the
just allocated entry in the TCP host cache and updates MSS on the
tcpcb. And then we do a fast retransmit of what we have in the tcp
send buffer.

This sequence gets broken if the TCP host cache is exausted. In this
case allocation fails, and later called tcp_mss_update() finds nothing
in cache. The fast retransmit is done with not reduced MSS and is
immidiately replied by remote host with new ICMP datagrams and the
cycle repeats. This ping-pong can go up to wirespeed.

To fix this:
- tcp_mss_update() gets new parameter - mtuoffer, that is like
  offer, but needs to have min_protoh subtracted.
- tcp_mtudisc() as notification method renamed to tcp_mtudisc_notify().
- tcp_mtudisc() now accepts not a useless error argument, but proposed
  MTU value, that is passed to tcp_mss_update() as mtuoffer.

Reported by:	az
Reported by:	Andrey Zonov <andrey zonov.org>
Reviewed by:	andre (previous version of patch)
2012-04-16 13:49:03 +00:00
Marko Zec
2454a7ca98 Permit tcpdrop in VNET jails.
Submitted by:	Miljenko Mikuc
MFC after:	3 days
2012-03-28 12:30:16 +00:00
Bjoern A. Zeeb
81d5d46b3c Add multi-FIB IPv6 support to the core network stack supplementing
the original IPv4 implementation from r178888:

- Use RT_DEFAULT_FIB in the IPv4 implementation where noticed.
- Use rt*fib() KPI with explicit RT_DEFAULT_FIB where applicable in
  the NFS code.
- Use the new in6_rt* KPI in TCP, gif(4), and the IPv6 network stack
  where applicable.
- Split in6_rtqtimo() and in6_mtutimo() as done in IPv4 and equally
  prevent multiple initializations of callouts in in6_inithead().
- Use wrapper functions where needed to preserve the current KPI to
  ease MFCs.  Use BURN_BRIDGES to indicate expected future cleanup.
- Fix (related) comments (both technical or style).
- Convert to rtinit() where applicable and only use custom loops where
  currently not possible otherwise.
- Multicast group, most neighbor discovery address actions and faith(4)
  are locked to the default FIB.  Individual IPv6 addresses will only
  appear in the default FIB, however redirect information and prefixes
  of connected subnets are automatically propagated to all FIBs by
  default (mimicking IPv4 behavior as closely as possible).

Sponsored by:	Cisco Systems, Inc.
2012-02-03 13:08:44 +00:00
Bjoern A. Zeeb
dceced71fb Unbreak no-INET kernels after r223839 adding the needed #ifdef INET.
MFC after:	4 weeks
2011-07-14 13:44:48 +00:00
Andre Oppermann
1c6e7fa7f1 Remove the TCP_SORECEIVE_STREAM compile time option. The use of
soreceive_stream() for TCP still has to be enabled with the loader
tuneable net.inet.tcp.soreceive_stream.

Suggested by:	trociny and others
2011-07-07 10:37:14 +00:00
Ermal Luçi
e6c90582c7 pf(4) tags now store the state key but tcp_respond tries to reuse a mbuf as an optimization.
This makes pf find the wrong state and cause errors reported with state mismatches.
Clear the cached state link on the pf(4) tag to avoid the state mismatches.

Approved by:	bz
2011-07-04 17:43:04 +00:00
Robert Watson
52cd27cb58 Implement a CPU-affine TCP and UDP connection lookup data structure,
struct inpcbgroup.  pcbgroups, or "connection groups", supplement the
existing inpcbinfo connection hash table, which when pcbgroups are
enabled, might now be thought of more usefully as a per-protocol
4-tuple reservation table.

Connections are assigned to connection groups base on a hash of their
4-tuple; wildcard sockets require special handling, and are members
of all connection groups.  During a connection lookup, a
per-connection group lock is employed rather than the global pcbinfo
lock.  By aligning connection groups with input path processing,
connection groups take on an effective CPU affinity, especially when
aligned with RSS work placement (see a forthcoming commit for
details).  This eliminates cache line migration associated with
global, protocol-layer data structures in steady state TCP and UDP
processing (with the exception of protocol-layer statistics; further
commit to follow).

Elements of this approach were inspired by Willman, Rixner, and Cox's
2006 USENIX paper, "An Evaluation of Network Stack Parallelization
Strategies in Modern Operating Systems".  However, there are also
significant differences: we maintain the inpcb lock, rather than using
the connection group lock for per-connection state.

Likewise, the focus of this implementation is alignment with NIC
packet distribution strategies such as RSS, rather than pure software
strategies.  Despite that focus, software distribution is supported
through the parallel netisr implementation, and works well in
configurations where the number of hardware threads is greater than
the number of NIC input queues, such as in the RMI XLR threaded MIPS
architecture.

Another important difference is the continued maintenance of existing
hash tables as "reservation tables" -- these are useful both to
distinguish the resource allocation aspect of protocol name management
and the more common-case lookup aspect.  In configurations where
connection tables are aligned with hardware hashes, it is desirable to
use the traditional lookup tables for loopback or encapsulated traffic
rather than take the expense of hardware hashes that are hard to
implement efficiently in software (such as RSS Toeplitz).

Connection group support is enabled by compiling "options PCBGROUP"
into your kernel configuration; for the time being, this is an
experimental feature, and hence is not enabled by default.

Subject to the limited MFCability of change dependencies in inpcb,
and its change to the inpcbinfo init function signature, this change
in principle could be merged to FreeBSD 8.x.

Reviewed by:    bz
Sponsored by:   Juniper Networks, Inc.
2011-06-06 12:55:02 +00:00
Robert Watson
fa046d8774 Decompose the current single inpcbinfo lock into two locks:
- The existing ipi_lock continues to protect the global inpcb list and
  inpcb counter.  This lock is now relegated to a small number of
  allocation and free operations, and occasional operations that walk
  all connections (including, awkwardly, certain UDP multicast receive
  operations -- something to revisit).

- A new ipi_hash_lock protects the two inpcbinfo hash tables for
  looking up connections and bound sockets, manipulated using new
  INP_HASH_*() macros.  This lock, combined with inpcb locks, protects
  the 4-tuple address space.

Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required.  As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.

A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb.  Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed.  In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup.  New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:

  INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
  INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb

Callers must pass exactly one of these flags (for the time being).

Some notes:

- All protocols are updated to work within the new regime; especially,
  TCP, UDPv4, and UDPv6.  pcbinfo ipi_lock acquisitions are largely
  eliminated, and global hash lock hold times are dramatically reduced
  compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
  may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
  is no longer available -- hash lookup locks are now held only very
  briefly during inpcb lookup, rather than for potentially extended
  periods.  However, the pcbinfo ipi_lock will still be acquired if a
  connection state might change such that a connection is added or
  removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
  due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
  callers to acquire hash locks and perform one or more lookups atomically
  with 4-tuple allocation: this is required only for TCPv6, as there is no
  in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
  locking, which relates to source address selection.  This needs
  attention, as it likely significantly reduces parallelism in this code
  for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
  somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which
  is no longer sufficient.  A second check once the inpcb lock is held
  should do the trick, keeping the general case from requiring the inpcb
  lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
  which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
  undesirable, and probably another argument is required to take care of
  this (or a char array name field in the pcbinfo?).

This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics.  It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.

Reviewed by:    bz
Sponsored by:   Juniper Networks, Inc.
2011-05-30 09:43:55 +00:00
Alexander Motin
bc7d18ae72 Refactor TCP ISN increment logic. Instead of firing callout at 100Hz to
keep constant ISN growth rate, do the same directly inside tcp_new_isn(),
taking into account how much time (ticks) passed since the last call.

On my test systems this decreases idle interrupt rate from 140Hz to 70Hz.
2011-05-09 07:37:47 +00:00
Bjoern A. Zeeb
b287c6c70c Make the TCP code compile without INET. Sort #includes and add #ifdef INETs.
Add some comments at #endifs given more nestedness.  To make the compiler
happy, some default initializations were added in accordance with the style
on the files.

Reviewed by:	gnn
Sponsored by:	The FreeBSD Foundation
Sponsored by:	iXsystems
MFC after:	4 days
2011-04-30 11:21:29 +00:00
Attilio Rao
2903309aca Add the possibility to verify MD5 hash of incoming TCP packets.
As long as this is a costy function, even when compiled in (along with
the option TCP_SIGNATURE), it can be disabled via the
net.inet.tcp.signature_verify_input sysctl.

Sponsored by:	Sandvine Incorporated
Reviewed by:	emaste, bz
MFC after:	2 weeks
2011-04-25 17:13:40 +00:00
Rebecca Cran
6bccea7c2b Fix typos - remove duplicate "the".
PR:	bin/154928
Submitted by:	Eitan Adler <lists at eitanadler.com>
MFC after: 	3 days
2011-02-21 09:01:34 +00:00
Matthew D Fleming
79c3d51b86 Specify a CTLTYPE_FOO so that a future sysctl(8) change does not need
to rely on the format string.  For SYSCTL_PROC instances that I
noticed a discrepancy between the CTLTYPE and the format specifier,
fix the CTLTYPE.
2011-01-18 21:14:13 +00:00
Matthew D Fleming
f88910cdf5 sysctl(9) cleanup checkpoint: amd64 GENERIC builds cleanly.
Commit the net* piece.
2011-01-12 19:53:50 +00:00
Lawrence Stewart
39bc9de532 - Add some helper hook points to the TCP stack. The hooks allow Khelp modules to
access inbound/outbound events and associated data for established TCP
  connections. The hooks only run if at least one hook function is registered
  for the hook point, ensuring the impact on the stack is effectively nil when
  no TCP Khelp modules are loaded. struct tcp_hhook_data is passed as contextual
  data to any registered Khelp module hook functions.

- Add an OSD (Object Specific Data) pointer to struct tcpcb to allow Khelp
  modules to associate per-connection data with the TCP control block.

- Bump __FreeBSD_version and add a note to UPDATING regarding to ABI changes
  introduced by this commit and r216753.

In collaboration with:	David Hayes <dahayes at swin edu au> and
				Grenville Armitage <garmitage at swin edu au>
Sponsored by:	FreeBSD Foundation
Reviewed by:	bz, others along the way
MFC after:	3 months
2010-12-28 12:13:30 +00:00
Lawrence Stewart
22968a7d56 Fix a whitespace nit introduced in r215166.
Sponsored by:	FreeBSD Foundation
Spotted by:	bz
MFC after:	5 weeks
X-MFC with:	r215166
2010-12-28 01:38:52 +00:00
Dimitry Andric
3e288e6238 After some off-list discussion, revert a number of changes to the
DPCPU_DEFINE and VNET_DEFINE macros, as these cause problems for various
people working on the affected files.  A better long-term solution is
still being considered.  This reversal may give some modules empty
set_pcpu or set_vnet sections, but these are harmless.

Changes reverted:

------------------------------------------------------------------------
r215318 | dim | 2010-11-14 21:40:55 +0100 (Sun, 14 Nov 2010) | 4 lines

Instead of unconditionally emitting .globl's for the __start_set_xxx and
__stop_set_xxx symbols, only emit them when the set_vnet or set_pcpu
sections are actually defined.

------------------------------------------------------------------------
r215317 | dim | 2010-11-14 21:38:11 +0100 (Sun, 14 Nov 2010) | 3 lines

Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout
the tree.

------------------------------------------------------------------------
r215316 | dim | 2010-11-14 21:23:02 +0100 (Sun, 14 Nov 2010) | 2 lines

Add macros to define static instances of VNET_DEFINE and DPCPU_DEFINE.
2010-11-22 19:32:54 +00:00
Lawrence Stewart
99065ae6a8 Move protocol specific implementation detail out of the core CC framework.
Sponsored by:	FreeBSD Foundation
Tested by:	Mikolaj Golub <to.my.trociny at gmail com>
MFC after:	11 weeks
X-MFC with:	r215166
2010-11-16 08:30:39 +00:00
Lawrence Stewart
14f57a8b02 cc_init() should only be run once on system boot, but with VIMAGE kernels it
runs on boot and each time a vnet jail is created. Running cc_init() multiple
times results in a panic when attempting to initialise the cc_list lock again,
and so r215166 effectively broke the use of vnet jails.

Switch to using a SYSINIT to run cc_init() on boot. CC algorithm modules loaded
on boot register in the same SI_SUB_PROTO_IFATTACHDOMAIN category as is used in
this patch, so cc_init() is run at SI_ORDER_FIRST to ensure the framework is
initialised before module registration is attempted.

Sponsored by:	FreeBSD Foundation
Reported and tested by:	Mikolaj Golub <to.my.trociny at gmail com>
MFC after:	11 weeks
X-MFC with:	r215166
2010-11-16 07:09:05 +00:00
Dimitry Andric
31c6a0037e Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout
the tree.
2010-11-14 20:38:11 +00:00
Lawrence Stewart
dbc4240942 This commit marks the first formal contribution of the "Five New TCP Congestion
Control Algorithms for FreeBSD" FreeBSD Foundation funded project. More details
about the project are available at: http://caia.swin.edu.au/freebsd/5cc/

- Add a KPI and supporting infrastructure to allow modular congestion control
  algorithms to be used in the net stack. Algorithms can maintain per-connection
  state if required, and connections maintain their own algorithm pointer, which
  allows different connections to concurrently use different algorithms. The
  TCP_CONGESTION socket option can be used with getsockopt()/setsockopt() to
  programmatically query or change the congestion control algorithm respectively
  from within an application at runtime.

- Integrate the framework with the TCP stack in as least intrusive a manner as
  possible. Care was also taken to develop the framework in a way that should
  allow integration with other congestion aware transport protocols (e.g. SCTP)
  in the future. The hope is that we will one day be able to share a single set
  of congestion control algorithm modules between all congestion aware transport
  protocols.

- Introduce a new congestion recovery (TF_CONGRECOVERY) state into the TCP stack
  and use it to decouple the meaning of recovery from a congestion event and
  recovery from packet loss (TF_FASTRECOVERY) a la RFC2581. ECN and delay based
  congestion control protocols don't generally need to recover from packet loss
  and need a different way to note a congestion recovery episode within the
  stack.

- Remove the net.inet.tcp.newreno sysctl, which simplifies some portions of code
  and ensures the stack always uses the appropriate mechanisms for recovering
  from packet loss during a congestion recovery episode.

- Extract the NewReno congestion control algorithm from the TCP stack and
  massage it into module form. NewReno is always built into the kernel and will
  remain the default algorithm for the forseeable future. Implementations of
  additional different algorithms will become available in the near future.

- Bump __FreeBSD_version to 900025 and note in UPDATING that rebuilding code
  that relies on the size of "struct tcpcb" is required.

Many thanks go to the Cisco University Research Program Fund at Community
Foundation Silicon Valley and the FreeBSD Foundation. Their support of our work
at the Centre for Advanced Internet Architectures, Swinburne University of
Technology is greatly appreciated.

In collaboration with:	David Hayes <dahayes at swin edu au> and
			Grenville Armitage <garmitage at swin edu au>
Sponsored by:	Cisco URP, FreeBSD Foundation
Reviewed by:	rpaulo
Tested by:	David Hayes (and many others over the years)
MFC after:	3 months
2010-11-12 06:41:55 +00:00
Lawrence Stewart
0c236c4ebd Internalise reassembly queue related functionality and variables which should
not be used outside of the reassembly queue implementation. Provide a new
function to flush all segments from a reassembly queue and call it from the
appropriate places instead of manipulating the queue directly.

Sponsored by:	FreeBSD Foundation
Reviewed by:	andre, gnn, rpaulo
MFC after:	2 weeks
2010-09-25 04:58:46 +00:00
Andre Oppermann
1c18314d17 Remove the TCP inflight bandwidth limiter as announced in r211315
to give way for the pluggable congestion control framework.  It is
the task of the congestion control algorithm to set the congestion
window and amount of inflight data without external interference.

In 'struct tcpcb' the variables previously used by the inflight
limiter are renamed to spares to keep the ABI intact and to have
some more space for future extensions.

In 'struct tcp_info' the variable 'tcpi_snd_bwnd' is not removed to
preserve the ABI.  It is always set to 0.

In siftr.c in 'struct pkt_node' the variable 'snd_bwnd' is not removed
to preserve the ABI.  It is always set to 0.

These unused variable in the various structures may be reused in the
future or garbage collected before the next release or at some other
point when an ABI change happens anyway for other reasons.

No MFC is planned.  The inflight bandwidth limiter stays disabled by
default in the other branches but remains available.
2010-09-16 21:06:45 +00:00
John Baldwin
98b9eb0db2 Simplify the tcp pcblist estimate logic slightly.
MFC after:	3 days
2010-08-27 18:17:46 +00:00
Andre Oppermann
b7d747ecec Untangle the net.inet.tcp.log_in_vain and net.inet.tcp.log_debug
sysctl's and remove any side effects.

Both sysctl's share the same backend infrastructure and due to the
way it was implemented enabling net.inet.tcp.log_in_vain would also
cause log_debug output to be generated.  This was surprising and
eventually annoying to the user.

The log output backend is kept the same but a little shim is inserted
to properly separate log_in_vain and log_debug and to remove any side
effects.

PR:		kern/137317
MFC after:	1 week
2010-08-18 17:39:47 +00:00
Bjoern A. Zeeb
2278f9927d When calculating the expected memory size for userspace, also take the
number of syncache entries into account for the surplus we add to account
for a possible increase of records in the re-entry window.

Discussed with:		jhb, silby
MFC after:		1 week
2010-08-18 09:28:12 +00:00
John Baldwin
c007b96a78 Ensure a minimum "slop" of 10 extra pcb structures when providing a
memory size estimate to userland for pcb list sysctls.  The previous
behavior of a "slop" of n/8 does not work well for small values of n
(e.g. no slop at all if you have less than 8 open UDP connections).

Reviewed by:	bz
MFC after:	1 week
2010-08-17 16:41:16 +00:00
Andre Oppermann
e4e9266071 Fix the interaction between 'ICMP fragmentation needed' MTU updates,
path MTU discovery and the tcp_minmss limiter for very small MTU's.

When the MTU suggested by the gateway via ICMP, or if there isn't
any the next smaller step from ip_next_mtu(), is lower than the
floor enforced by net.inet.tcp.minmss (default 216) the value is
ignored and the default MSS (512) is used instead.  However the
DF flag in the IP header is still set in tcp_output() preventing
fragmentation by the gateway.

Fix this by using tcp_minmss as the MSS and clear the DF flag if
the suggested MTU is too low.  This turns off path MTU dissovery
for the remainder of the session and allows fragmentation to be
done by the gateway.

Only MTU's smaller than 256 are affected.  The smallest official
MTU specified is for AX.25 packet radio at 256 octets.

PR:		kern/146628
Tested by:	Matthew Luckie <mjl-at-luckie org nz>
MFC after:	1 week
2010-08-15 13:25:18 +00:00
Andre Oppermann
bee4e5afa9 Disable TCP inflight limiter by default.
It was experimental and interferes with the normal congestion control
algorithms by instating a separate, possibly lower, ceiling for the
amount of data that is in flight to the remote host.  With high speed
internet connections the inflight limit frequently has been estimated
too low due to the noisy nature of the RTT measurements.

This code gives way for the upcoming pluggable congestion control
framework.  It is the task of the congestion control algorithm to
set the congestion window and amount of inflight data without external
interference.

Reviewed by:	lstewart
MFC after:	1 week
Removal after:	1 month
2010-08-14 20:40:55 +00:00
Bjoern A. Zeeb
82cea7e6f3 MFP4: @176978-176982, 176984, 176990-176994, 177441
"Whitspace" churn after the VIMAGE/VNET whirls.

Remove the need for some "init" functions within the network
stack, like pim6_init(), icmp_init() or significantly shorten
others like ip6_init() and nd6_init(), using static initialization
again where possible and formerly missed.

Move (most) variables back to the place they used to be before the
container structs and VIMAGE_GLOABLS (before r185088) and try to
reduce the diff to stable/7 and earlier as good as possible,
to help out-of-tree consumers to update from 6.x or 7.x to 8 or 9.

This also removes some header file pollution for putatively
static global variables.

Revert VIMAGE specific changes in ipfilter::ip_auth.c, that are
no longer needed.

Reviewed by:	jhb
Discussed with:	rwatson
Sponsored by:	The FreeBSD Foundation
Sponsored by:	CK Software GmbH
MFC after:	6 days
2010-04-29 11:52:42 +00:00
Bjoern A. Zeeb
d0e157f6aa Add pcb reference counting to the pcblist sysctl handler functions
to ensure type stability while caching the pcb pointers for the
copyout.

Reviewed by:	rwatson
MFC after:	7 days
2010-03-17 18:28:27 +00:00
Robert Watson
9bcd427b89 Abstract out initialization of most aspects of struct inpcbinfo from
their calling contexts in {IP divert, raw IP sockets, TCP, UDP} and
create new helper functions: in_pcbinfo_init() and in_pcbinfo_destroy()
to do this work in a central spot.  As inpcbinfo becomes more complex
due to ongoing work to add connection groups, this will reduce code
duplication.

MFC after:      1 month
Reviewed by:    bz
Sponsored by:   Juniper Networks
2010-03-14 18:59:11 +00:00
Bjoern A. Zeeb
376aadf896 Destroy TCP UMA zones (empty or not) upon network stack teardown
to not leak them, otherwise making UMA/vmstat unhappy with every stoped vnet.
We will still leak pages (especially for zones marked NOFREE).

Reshuffle cleanup order in tcp_destroy() to get rid of what we can
easily free first.

Sponsored by:	ISPsystem
Reviewed by:	rwatson
MFC after:	5 days
2010-03-07 15:58:44 +00:00
Robert Watson
2bf3ce088d Add comment in tcp_discardcb() talking about how we don't, but should,
address TCP races relating to not calling tcp_drain() on stopped callouts.

Discussed with:	bz
2010-03-07 14:13:59 +00:00
Mike Silbersack
b8614722ff Add the ability to see TCP timers via netstat -x. This can be a useful
feature when you have a seemingly stuck socket and want to figure
out why it has not been closed yet.

No plans to MFC this, as it changes the netstat sysctl ABI.

Reviewed by:	andre, rwatson, Eric Van Gyzen
2009-09-16 05:33:15 +00:00
Andre Oppermann
11c99a6d7b -Put the optimized soreceive_stream() under a compile time option called
TCP_SORECEIVE_STREAM for the time being.

Requested by:	brooks

Once compiled in make it easily switchable for testers by using a tuneable
 net.inet.tcp.soreceive_stream
and a corresponding read-only sysctl to report the current state.

Suggested by:	rwatson

MFC after:	2 days
2009-09-15 22:23:45 +00:00
Robert Watson
530c006014 Merge the remainder of kern_vimage.c and vimage.h into vnet.c and
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks.  Minor cleanups are done in the process,
and comments updated to reflect these changes.

Reviewed by:	bz
Approved by:	re (vimage blanket)
2009-08-01 19:26:27 +00:00
Bjoern A. Zeeb
a08362ce46 sysctl_msec_to_ticks is used with both virtualized and
non-vrtiualized sysctls so we cannot used one common function.

Add a macro to convert the arg1 in the virtualized case to
vnet.h to not expose the maths to all over the code.

Add a wrapper for the single virtualized call, properly handling
arg1 and call the default implementation from there.

Convert the two over places to use the new macro.

Reviewed by:	rwatson
Approved by:	re (kib)
2009-07-21 21:58:55 +00:00
Robert Watson
5ee847d3ac Reimplement and/or implement vnet list locking by replacing a mostly
unused custom mutex/condvar-based sleep locks with two locks: an
rwlock (for non-sleeping use) and sxlock (for sleeping use).  Either
acquired for read is sufficient to stabilize the vnet list, but both
must be acquired for write to modify the list.

Replace previous no-op read locking macros, used in various places
in the stack, with actual locking to prevent race conditions.  Callers
must declare when they may perform unbounded sleeps or not when
selecting how to lock.

Refactor vnet sysinits so that the vnet list and locks are initialized
before kernel modules are linked, as the kernel linker will use them
for modules loaded by the boot loader.

Update various consumers of these KPIs based on whether they may sleep
or not.

Reviewed by:	bz
Approved by:	re (kib)
2009-07-19 14:20:53 +00:00
Robert Watson
1e77c1056a Remove unused VNET_SET() and related macros; only VNET_GET() is
ever actually used.  Rename VNET_GET() to VNET() to shorten
variable references.

Discussed with:	bz, julian
Reviewed by:	bz
Approved by:	re (kensmith, kib)
2009-07-16 21:13:04 +00:00
Robert Watson
eddfbb763d Build on Jeff Roberson's linker-set based dynamic per-CPU allocator
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator.  Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...).  This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.

Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack.  Virtualized global variables are
tagged at compile-time, placing the in a special linker set, which is
loaded into a contiguous region of kernel memory.  Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of a the kernel linker.

Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy.  Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address.  When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.

This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.

Bump __FreeBSD_version and update UPDATING.

Portions submitted by:  bz
Reviewed by:            bz, zec
Discussed with:         gnn, jamie, jeff, jhb, julian, sam
Suggested by:           peter
Approved by:            re (kensmith)
2009-07-14 22:48:30 +00:00
Bjoern A. Zeeb
ebd8672cc3 Add explicit includes for jail.h to the files that need them and
remove the "hidden" one from vimage.h.
2009-06-17 15:01:01 +00:00
Jamie Gritton
9ed47d01eb Get vnets from creds instead of threads where they're available, and from
passed threads instead of curthread.

Reviewed by:	zec, julian
Approved by:	bz (mentor)
2009-06-15 19:01:53 +00:00
Marko Zec
bc29160df3 Introduce an infrastructure for dismantling vnet instances.
Vnet modules and protocol domains may now register destructor
functions to clean up and release per-module state.  The destructor
mechanisms can be triggered by invoking "vimage -d", or a future
equivalent command which will be provided via the new jail framework.

While this patch introduces numerous placeholder destructor functions,
many of those are currently incomplete, thus leaking memory or (even
worse) failing to stop all running timers.  Many of such issues are
already known and will be incrementaly fixed over the next weeks in
smaller incremental commits.

Apart from introducing new fields in structs ifnet, domain, protosw
and vnet_net, which requires the kernel and modules to be rebuilt, this
change should have no impact on nooptions VIMAGE builds, since vnet
destructors can only be called in VIMAGE kernels.  Moreover,
destructor functions should be in general compiled in only in
options VIMAGE builds, except for kernel modules which can be safely
kldunloaded at run time.

Bump __FreeBSD_version to 800097.
Reviewed by:	bz, julian
Approved by:	rwatson, kib (re), julian (mentor)
2009-06-08 17:15:40 +00:00
Robert Watson
bcf11e8d00 Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with:	pjd
2009-06-05 14:55:22 +00:00
Bjoern A. Zeeb
6d453b1001 For UDP with introducing the UDP control block, the uma zone had to
be named "udp_inpcb" to avoid a naming conflict with tcp[1].
For consistency rename the uma zone for TCP from "inpcb" to "tcp_inpcb".

Found by:	rwatson [1]
Discussed with:	rwatson
2009-05-23 17:02:30 +00:00
Marko Zec
f6dfe47a14 Permit buiding kernels with options VIMAGE, restricted to only a single
active network stack instance.  Turning on options VIMAGE at compile
time yields the following changes relative to default kernel build:

1) V_ accessor macros for virtualized variables resolve to structure
fields via base pointers, instead of being resolved as fields in global
structs or plain global variables.  As an example, V_ifnet becomes:

    options VIMAGE:          ((struct vnet_net *) vnet_net)->_ifnet
    default build:           vnet_net_0._ifnet
    options VIMAGE_GLOBALS:  ifnet

2) INIT_VNET_* macros will declare and set up base pointers to be used
by V_ accessor macros, instead of resolving to whitespace:

    INIT_VNET_NET(ifp->if_vnet); becomes

    struct vnet_net *vnet_net = (ifp->if_vnet)->mod_data[VNET_MOD_NET];

3) Memory for vnet modules registered via vnet_mod_register() is now
allocated at run time in sys/kern/kern_vimage.c, instead of per vnet
module structs being declared as globals.  If required, vnet modules
can now request the framework to provide them with allocated bzeroed
memory by filling in the vmi_size field in their vmi_modinfo structures.

4) structs socket, ifnet, inpcbinfo, tcpcb and syncache_head are
extended to hold a pointer to the parent vnet.  options VIMAGE builds
will fill in those fields as required.

5) curvnet is introduced as a new global variable in options VIMAGE
builds, always pointing to the default and only struct vnet.

6) struct sysctl_oid has been extended with additional two fields to
store major and minor virtualization module identifiers, oid_v_subs and
oid_v_mod.  SYSCTL_V_* family of macros will fill in those fields
accordingly, and store the offset in the appropriate vnet container
struct in oid_arg1.
In sysctl handlers dealing with virtualized sysctls, the
SYSCTL_RESOLVE_V_ARG1() macro will compute the address of the target
variable and make it available in arg1 variable for further processing.

Unused fields in structs vnet_inet, vnet_inet6 and vnet_ipfw have
been deleted.

Reviewed by:	bz, rwatson
Approved by:	julian (mentor)
2009-04-30 13:36:26 +00:00
Marko Zec
093f25f8c8 In preparation for turning on options VIMAGE in next commits,
rearrange / replace / adjust several INIT_VNET_* initializer
macros, all of which currently resolve to whitespace.

Reviewed by:	bz (an older version of the patch)
Approved by:	julian (mentor)
2009-04-26 22:06:42 +00:00
Robert Watson
78b5071407 Update stats in struct tcpstat using two new macros, TCPSTAT_ADD() and
TCPSTAT_INC(), rather than directly manipulating the fields across the
kernel.  This will make it easier to change the implementation of
these statistics, such as using per-CPU versions of the data structures.

MFC after:	3 days
2009-04-11 22:07:19 +00:00
Marko Zec
1ed81b739e First pass at separating per-vnet initializer functions
from existing functions for initializing global state.

        At this stage, the new per-vnet initializer functions are
	directly called from the existing global initialization code,
	which should in most cases result in compiler inlining those
	new functions, hence yielding a near-zero functional change.

        Modify the existing initializer functions which are invoked via
        protosw, like ip_init() et. al., to allow them to be invoked
	multiple times, i.e. per each vnet.  Global state, if any,
	is initialized only if such functions are called within the
	context of vnet0, which will be determined via the
	IS_DEFAULT_VNET(curvnet) check (currently always true).

        While here, V_irtualize a few remaining global UMA zones
        used by net/netinet/netipsec networking code.  While it is
        not yet clear to me or anybody else whether this is the right
        thing to do, at this stage this makes the code more readable,
        and makes it easier to track uncollected UMA-zone-backed
        objects on vnet removal.  In the long run, it's quite possible
        that some form of shared use of UMA zone pools among multiple
        vnets should be considered.

	Bump __FreeBSD_version due to changes in layout of structs
	vnet_ipfw, vnet_inet and vnet_net.

Approved by:	julian (mentor)
2009-04-06 22:29:41 +00:00
Juli Mallett
34f27ade44 Remove local in6_addr variables for local and foreign addresses in sysctl_drop,
they were passed uninitialized to in6_pcblookup_hash.  Instead, do as is done
for IPv4 and use the addresses within the sockaddr structure, which are
correctly populated.

This fixes tcpdrop(8) for IPv6 address pairs.

Reviewed by:	bz
2009-03-22 00:45:47 +00:00
Robert Watson
ad71fe3c35 Correct a number of evolved problems with inp_vflag and inp_flags:
certain flags that should have been in inp_flags ended up in inp_vflag,
meaning that they were inconsistently locked, and in one case,
interpreted.  Move the following flags from inp_vflag to gaps in the
inp_flags space (and clean up the inp_flags constants to make gaps
more obvious to future takers):

  INP_TIMEWAIT
  INP_SOCKREF
  INP_ONESBCAST
  INP_DROPPED

Some aspects of this change have no effect on kernel ABI at all, as these
are UDP/TCP/IP-internal uses; however, netstat and sockstat detect
INP_TIMEWAIT when listing TCP sockets, so any MFC will need to take this
into account.

MFC after:      1 week (or after dependencies are MFC'd)
Reviewed by:    bz
2009-03-15 09:58:31 +00:00
Luigi Rizzo
d685b6ee05 Use uint32_t instead of n_long and n_time, and uint16_t instead of n_short.
Add a note next to fields in network format.

The n_* types are not enough for compiler checks on endianness, and their
use often requires an otherwise unnecessary #include <netinet/in_systm.h>

The typedef in in_systm.h are still there.
2009-02-13 15:14:43 +00:00
Bjoern A. Zeeb
97aa4a517a Try to remove/assimilate as much of formerly IPv4/6 specific
(duplicate) code in sys/netipsec/ipsec.c and fold it into
common, INET/6 independent functions.

The file local functions ipsec4_setspidx_inpcb() and
ipsec6_setspidx_inpcb() were 1:1 identical after the change
in r186528. Rename to ipsec_setspidx_inpcb() and remove the
duplicate.

Public functions ipsec[46]_get_policy() were 1:1 identical.
Remove one copy and merge in the factored out code from
ipsec_get_policy() into the other. The public function left
is now called ipsec_get_policy() and callers were adapted.

Public functions ipsec[46]_set_policy() were 1:1 identical.
Rename file local ipsec_set_policy() function to
ipsec_set_policy_internal().
Remove one copy of the public functions, rename the other
to ipsec_set_policy() and adapt callers.

Public functions ipsec[46]_hdrsiz() were logically identical
(ignoring one questionable assert in the v6 version).
Rename the file local ipsec_hdrsiz() to ipsec_hdrsiz_internal(),
the public function to ipsec_hdrsiz(), remove the duplicate
copy and adapt the callers.
The v6 version had been unused anyway. Cleanup comments.

Public functions ipsec[46]_in_reject() were logically identical
apart from statistics. Move the common code into a file local
ipsec46_in_reject() leaving vimage+statistics in small AF specific
wrapper functions. Note: unfortunately we already have a public
ipsec_in_reject().

Reviewed by:	sam
Discussed with:	rwatson (renaming to *_internal)
MFC after:	26 days
X-MFC:		keep wrapper functions for public symbols?
2009-02-08 09:27:07 +00:00
Lawrence Stewart
24cb0f2232 Add TCP Appropriate Byte Counting (RFC 3465) support to kernel.
The new behaviour is on by default, and can be disabled by setting the
net.inet.tcp.rfc3465 sysctl to 0 to obtain previous behaviour.

The patch changes struct tcpcb in sys/netinet/tcp_var.h which breaks
the ABI. Bump __FreeBSD_version to 800061 accordingly. User space tools
that rely on the size of struct tcpcb (e.g. sockstat) need to be recompiled.

Reviewed by:	rpaulo, gnn
Approved by:	gnn, kmacy (mentors)
Sponsored by:	FreeBSD Foundation
2009-01-15 06:44:22 +00:00
Bjoern A. Zeeb
dcdb4371ca Use inc_flags instead of the inc_isipv6 alias which so far
had been the only flag with random usage patterns.
Switch inc_flags to be used as a real bit field by using
INC_ISIPV6 with bitops to check for the 'isipv6' condition.

While here fix a place or two where in case of v4 inc_flags
were not properly initialized before.[1]

Found by:	rwatson during review [1]
Discussed with:	rwatson
Reviewed by:	rwatson
MFC after:	4 weeks
2008-12-17 12:52:34 +00:00
Bjoern A. Zeeb
fc384fa5d6 Another step assimilating IPv[46] PCB code - directly use
the inpcb names rather than the following IPv6 compat macros:
in6pcb,in6p_sp, in6p_ip6_nxt,in6p_flowinfo,in6p_vflag,
in6p_flags,in6p_socket,in6p_lport,in6p_fport,in6p_ppcb and
sotoin6pcb().

Apart from removing duplicate code in netipsec, this is a pure
whitespace, not a functional change.

Discussed with:	rwatson
Reviewed by:	rwatson (version before review requested changes)
MFC after:	4 weeks (set the timer and see then)
2008-12-15 21:50:54 +00:00
Qing Li
6e6b3f7cbc This main goals of this project are:
1. separating L2 tables (ARP, NDP) from the L3 routing tables
2. removing as much locking dependencies among these layers as
   possible to allow for some parallelism in the search operations
3. simplify the logic in the routing code,

The most notable end result is the obsolescent of the route
cloning (RTF_CLONING) concept, which translated into code reduction
in both IPv4 ARP and IPv6 NDP related modules, and size reduction in
struct rtentry{}. The change in design obsoletes the semantics of
RTF_CLONING, RTF_WASCLONE and RTF_LLINFO routing flags. The userland
applications such as "arp" and "ndp" have been modified to reflect
those changes. The output from "netstat -r" shows only the routing
entries.

Quite a few developers have contributed to this project in the
past: Glebius Smirnoff, Luigi Rizzo, Alessandro Cerri, and
Andre Oppermann. And most recently:

- Kip Macy revised the locking code completely, thus completing
  the last piece of the puzzle, Kip has also been conducting
  active functional testing
- Sam Leffler has helped me improving/refactoring the code, and
  provided valuable reviews
- Julian Elischer setup the perforce tree for me and has helped
  me maintaining that branch before the svn conversion
2008-12-15 06:10:57 +00:00
Bjoern A. Zeeb
bccd413962 De-virtualize the MD5 context for TCP initial seq number generation
and make it a function local variable like we do almost everywhere
inside the kernel.

Discussed with:	rwatson, silby
MFC after:	4 weeks
2008-12-13 21:59:18 +00:00
Bjoern A. Zeeb
0750c2ed96 Use the correct INIT_VNET_INET() as the virtualized variable here
are in vinet.h not in vinet6.h

Sponsored by:	The FreeBSD Foundation
2008-12-11 16:05:07 +00:00
Marko Zec
385195c062 Conditionally compile out V_ globals while instantiating the appropriate
container structures, depending on VIMAGE_GLOBALS compile time option.

Make VIMAGE_GLOBALS a new compile-time option, which by default will not
be defined, resulting in instatiations of global variables selected for
V_irtualization (enclosed in #ifdef VIMAGE_GLOBALS blocks) to be
effectively compiled out.  Instantiate new global container structures
to hold V_irtualized variables: vnet_net_0, vnet_inet_0, vnet_inet6_0,
vnet_ipsec_0, vnet_netgraph_0, and vnet_gif_0.

Update the VSYM() macro so that depending on VIMAGE_GLOBALS the V_
macros resolve either to the original globals, or to fields inside
container structures, i.e. effectively

#ifdef VIMAGE_GLOBALS
#define V_rt_tables rt_tables
#else
#define V_rt_tables vnet_net_0._rt_tables
#endif

Update SYSCTL_V_*() macros to operate either on globals or on fields
inside container structs.

Extend the internal kldsym() lookups with the ability to resolve
selected fields inside the virtualization container structs.  This
applies only to the fields which are explicitly registered for kldsym()
visibility via VNET_MOD_DECLARE() and vnet_mod_register(), currently
this is done only in sys/net/if.c.

Fix a few broken instances of MODULE_GLOBAL() macro use in SCTP code,
and modify the MODULE_GLOBAL() macro to resolve to V_ macros, which in
turn result in proper code being generated depending on VIMAGE_GLOBALS.

De-virtualize local static variables in sys/contrib/pf/net/pf_subr.c
which were prematurely V_irtualized by automated V_ prepending scripts
during earlier merging steps.  PF virtualization will be done
separately, most probably after next PF import.

Convert a few variable initializations at instantiation to
initialization in init functions, most notably in ipfw.  Also convert
TUNABLE_INT() initializers for V_ variables to TUNABLE_FETCH_INT() in
initializer functions.

Discussed at:	devsummit Strassburg
Reviewed by:	bz, julian
Approved by:	julian (mentor)
Obtained from:	//depot/projects/vimage-commit2/...
X-MFC after:	never
Sponsored by:	NLnet Foundation, The FreeBSD Foundation
2008-12-10 23:12:39 +00:00
Bjoern A. Zeeb
4b79449e2f Rather than using hidden includes (with cicular dependencies),
directly include only the header files needed. This reduces the
unneeded spamming of various headers into lots of files.

For now, this leaves us with very few modules including vnet.h
and thus needing to depend on opt_route.h.

Reviewed by:	brooks, gnn, des, zec, imp
Sponsored by:	The FreeBSD Foundation
2008-12-02 21:37:28 +00:00
Dag-Erling Smørgrav
3b6fe5fcd9 missing V_ 2008-11-28 13:13:44 +00:00
Marko Zec
97021c2464 Merge more of currently non-functional (i.e. resolving to
whitespace) macros from p4/vimage branch.

Do a better job at enclosing all instantiations of globals
scheduled for virtualization in #ifdef VIMAGE_GLOBALS blocks.

De-virtualize and mark as const saorder_state_alive and
saorder_state_any arrays from ipsec code, given that they are never
updated at runtime, so virtualizing them would be pointless.

Reviewed by:  bz, julian
Approved by:  julian (mentor)
Obtained from:        //depot/projects/vimage-commit2/...
X-MFC after:  never
Sponsored by: NLnet Foundation, The FreeBSD Foundation
2008-11-26 22:32:07 +00:00
Marko Zec
44e33a0758 Change the initialization methodology for global variables scheduled
for virtualization.

Instead of initializing the affected global variables at instatiation,
assign initial values to them in initializer functions.  As a rule,
initialization at instatiation for such variables should never be
introduced again from now on.  Furthermore, enclose all instantiations
of such global variables in #ifdef VIMAGE_GLOBALS blocks.

Essentialy, this change should have zero functional impact.  In the next
phase of merging network stack virtualization infrastructure from
p4/vimage branch, the new initialization methology will allow us to
switch between using global variables and their counterparts residing in
virtualization containers with minimum code churn, and in the long run
allow us to intialize multiple instances of such container structures.

Discussed at:	devsummit Strassburg
Reviewed by:	bz, julian
Approved by:	julian (mentor)
Obtained from:	//depot/projects/vimage-commit2/...
X-MFC after:	never
Sponsored by:	NLnet Foundation, The FreeBSD Foundation
2008-11-19 09:39:34 +00:00
Bjoern A. Zeeb
91d6cfa6b1 Fix a bug introduced with r182851 splitting tcp_mss() into
tcp_mss() and tcp_mss_update() so that tcp_mtudisc() could
re-use the same code.

Move the TSO logic back to tcp_mss() and out of tcp_mss_update().
We tried to avoid that initially but if were are called from
tcp_output() with EMSGSIZE, we cleared the TSO flag on the tcpcb
there, called into tcp_mtudisc() and tcp_mss_update() which
then would reenable TSO on the tcpcb based on TSO capabilities
of the interface as learnt in tcp_maxmtu/6().
So if TSO was enabled on the (possibly new) outgoing interface
it was turned back on, which lead to an endless loop between
tcp_output() and tcp_mtudisc() until we overflew the stack.

Reported by:	kmacy
MFC after:	2 months (along with r182851)
2008-11-06 13:25:59 +00:00
Bjoern A. Zeeb
4b3f4d3818 Adopt the comment for tcp_maxmtu(); we are returning a number
not a pointer. While here update the rest of the comment to
better match what we have these days.

MFC after:	2 months
2008-11-06 12:59:00 +00:00
Bjoern A. Zeeb
f08ef6c595 Add cr_canseeinpcb() doing checks using the cached socket
credentials from inp_cred which is also available after the
socket is gone.
Switch cr_canseesocket consumers to cr_canseeinpcb.
This removes an extra acquisition of the socket lock.

Reviewed by:	rwatson
MFC after:	3 months (set timer; decide then)
2008-10-17 16:26:16 +00:00
Bjoern A. Zeeb
86d02c5c63 Cache so_cred as inp_cred in the inpcb.
This means that inp_cred is always there, even after the socket
has gone away. It also means that it is constant for the lifetime
of the inp.
Both facts lead to simpler code and possibly less locking.

Suggested by:	rwatson
Reviewed by:	rwatson
MFC after:	6 weeks
X-MFC Note:	use a inp_pspare for inp_cred
2008-10-04 15:06:34 +00:00
Marko Zec
8b615593fc Step 1.5 of importing the network stack virtualization infrastructure
from the vimage project, as per plan established at devsummit 08/08:
http://wiki.freebsd.org/Image/Notes200808DevSummit

Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator
macros, and CURVNET_SET() context setting macros, all currently
resolving to NOPs.

Prepare for virtualization of selected SYSCTL objects by introducing a
family of SYSCTL_V_*() macros, currently resolving to their global
counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT().

Move selected #defines from sys/sys/vimage.h to newly introduced header
files specific to virtualized subsystems (sys/net/vnet.h,
sys/netinet/vinet.h etc.).

All the changes are verified to have zero functional impact at this
point in time by doing MD5 comparision between pre- and post-change
object files(*).

(*) netipsec/keysock.c did not validate depending on compile time options.

Implemented by:	julian, bz, brooks, zec
Reviewed by:	julian, bz, brooks, kris, rwatson, ...
Approved by:	julian (mentor)
Obtained from:	//depot/projects/vimage-commit2/...
X-MFC after:	never
Sponsored by:	NLnet Foundation, The FreeBSD Foundation
2008-10-02 15:37:58 +00:00
Bjoern A. Zeeb
3418daf2f1 Implement IPv6 support for TCP MD5 Signature Option (RFC 2385)
the same way it has been implemented for IPv4.

Reviewed by:	bms (skimmed)
Tested by:	Nick Hilliard (nick netability.ie) (with more changes)
MFC after:	2 months
2008-09-13 17:26:46 +00:00
Bjoern A. Zeeb
00db174bc2 To my reading there are no real consumers of ip6_plen (IPv6
Payload Length) as set in tcpip_fillheaders().
ip6_output() will calculate it based of the length from the
mbuf packet header itself.
So initialize the value in tcpip_fillheaders() in correct
(network) byte order.

With the above change, to my reading, all places calling tcp_trace()
pass in the ip6 header via ipgen as serialized in the mbuf and with
ip6_plen in network byte order.
Thus convert the IPv6 payload length to host byte order before printing.

MFC after:	2 months
2008-09-07 20:44:45 +00:00
Bjoern A. Zeeb
3cee92e074 Split tcp_mss() in tcp_mss() and tcp_mss_update() where the former
calls the latter.

Merge tcp_mss_update() with code from tcp_mtudisc() basically
doing the same thing.

This gives us one central place where we calcuate and check mss values
to update t_maxopd (maximum mss + options length) instead of two slightly
different but almost equal implementations to maintain.

PR:		kern/118455
Reviewed by:	silby (back in March)
MFC after:	2 months
2008-09-07 18:50:25 +00:00
Bjoern A. Zeeb
ebe5426934 V_irtualize SVN r182846 tcp_mssdflt/tcp_v6mssdflt procedure based
sysctl implementations for VIMAGE the same way we did elsewhere:
update the implementation but leave the globals and the SYSCTL
statement untouched.
2008-09-07 15:20:21 +00:00
Bjoern A. Zeeb
4cdf3bedf3 Convert SYSCTL_INTs for tcp_mssdflt and tcp_v6mssdflt to
SYSCTL_PROCs and check that the default mss for neither v4 nor
v6 goes below the minimum MSS constant (216).

This prevents people from shooting themselves in the foot.

PR:		kern/118455 (remotely related)
Reviewed by:	silby (as part of a larger patch in March)
MFC after:	2 months
2008-09-07 14:44:55 +00:00
Julian Elischer
5ed3800e41 Fix some of the formatting fixes.. It's amazing how some thing stand out
in a commit message.
2008-08-20 01:24:55 +00:00
Julian Elischer
ac957cd271 A bunch of formatting fixes brough to light by, or created by the Vimage commit
a few days ago.
2008-08-20 01:05:56 +00:00
Bjoern A. Zeeb
603724d3ab Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).

This is the first in a series of commits over the course
of the next few weeks.

Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.

We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.

Obtained from:	//depot/projects/vimage-commit2/...
Reviewed by:	brooks, des, ed, mav, julian,
		jamie, kris, rwatson, zec, ...
		(various people I forgot, different versions)
		md5 (with a bit of help)
Sponsored by:	NLnet Foundation, The FreeBSD Foundation
X-MFC after:	never
V_Commit_Message_Reviewed_By:	more people than the patch
2008-08-17 23:27:27 +00:00
Robert Watson
53640b0e3a When allocating temporary storage to hold a TCP/IP packet header
template, use an M_TEMP malloc(9) allocation rather than an mbuf
with mtod(9) and dtom(9).  This eliminates the last use of
dtom(9) in TCP.

MFC after:	3 weeks
2008-06-02 14:20:26 +00:00
Robert Watson
c28cb4d82f Read lock rather than write lock TCP inpcbs in monitoring sysctls. In
some cases, add explicit inpcb locking rather than relying on the global
lock, as we dereference inp_socket, but also allowing us to drop the
global lock more quickly.

MFC after:	1 week
2008-05-29 14:28:26 +00:00
Julian Elischer
8b07e49a00 Add code to allow the system to handle multiple routing tables.
This particular implementation is designed to be fully backwards compatible
and to be MFC-able to 7.x (and 6.x)

Currently the only protocol that can make use of the multiple tables is IPv4
Similar functionality exists in OpenBSD and Linux.

From my notes:

-----

  One thing where FreeBSD has been falling behind, and which by chance I
  have some time to work on is "policy based routing", which allows
  different
  packet streams to be routed by more than just the destination address.

  Constraints:
  ------------

  I want to make some form of this available in the 6.x tree
  (and by extension 7.x) , but FreeBSD in general needs it so I might as
  well do it in -current and back port the portions I need.

  One of the ways that this can be done is to have the ability to
  instantiate multiple kernel routing tables (which I will now
  refer to as "Forwarding Information Bases" or "FIBs" for political
  correctness reasons). Which FIB a particular packet uses to make
  the next hop decision can be decided by a number of mechanisms.
  The policies these mechanisms implement are the "Policies" referred
  to in "Policy based routing".

  One of the constraints I have if I try to back port this work to
  6.x is that it must be implemented as a EXTENSION to the existing
  ABIs in 6.x so that third party applications do not need to be
  recompiled in timespan of the branch.

  This first version will not have some of the bells and whistles that
  will come with later versions. It will, for example, be limited to 16
  tables in the first commit.
  Implementation method, Compatible version. (part 1)
  -------------------------------
  For this reason I have implemented a "sufficient subset" of a
  multiple routing table solution in Perforce, and back-ported it
  to 6.x. (also in Perforce though not  always caught up with what I
  have done in -current/P4). The subset allows a number of FIBs
  to be defined at compile time (8 is sufficient for my purposes in 6.x)
  and implements the changes needed to allow IPV4 to use them. I have not
  done the changes for ipv6 simply because I do not need it, and I do not
  have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it.

  Other protocol families are left untouched and should there be
  users with proprietary protocol families, they should continue to work
  and be oblivious to the existence of the extra FIBs.

  To understand how this is done, one must know that the current FIB
  code starts everything off with a single dimensional array of
  pointers to FIB head structures (One per protocol family), each of
  which in turn points to the trie of routes available to that family.

  The basic change in the ABI compatible version of the change is to
  extent that array to be a 2 dimensional array, so that
  instead of protocol family X looking at rt_tables[X] for the
  table it needs, it looks at rt_tables[Y][X] when for all
  protocol families except ipv4 Y is always 0.
  Code that is unaware of the change always just sees the first row
  of the table, which of course looks just like the one dimensional
  array that existed before.

  The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
  are all maintained, but refer only to the first row of the array,
  so that existing callers in proprietary protocols can continue to
  do the "right thing".
  Some new entry points are added, for the exclusive use of ipv4 code
  called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(),
  which have an extra argument which refers the code to the correct row.

  In addition, there are some new entry points (currently called
  rtalloc_fib() and friends) that check the Address family being
  looked up and call either rtalloc() (and friends) if the protocol
  is not IPv4 forcing the action to row 0 or to the appropriate row
  if it IS IPv4 (and that info is available). These are for calling
  from code that is not specific to any particular protocol. The way
  these are implemented would change in the non ABI preserving code
  to be added later.

  One feature of the first version of the code is that for ipv4,
  the interface routes show up automatically on all the FIBs, so
  that no matter what FIB you select you always have the basic
  direct attached hosts available to you. (rtinit() does this
  automatically).

  You CAN delete an interface route from one FIB should you want
  to but by default it's there. ARP information is also available
  in each FIB. It's assumed that the same machine would have the
  same MAC address, regardless of which FIB you are using to get
  to it.

  This brings us as to how the correct FIB is selected for an outgoing
  IPV4 packet.

  Firstly, all packets have a FIB associated with them. if nothing
  has been done to change it, it will be FIB 0. The FIB is changed
  in the following ways.

  Packets fall into one of a number of classes.

  1/ locally generated packets, coming from a socket/PCB.
     Such packets select a FIB from a number associated with the
     socket/PCB. This in turn is inherited from the process,
     but can be changed by a socket option. The process in turn
     inherits it on fork. I have written a utility call setfib
     that acts a bit like nice..

         setfib -3 ping target.example.com # will use fib 3 for ping.

     It is an obvious extension to make it a property of a jail
     but I have not done so. It can be achieved by combining the setfib and
     jail commands.

  2/ packets received on an interface for forwarding.
     By default these packets would use table 0,
     (or possibly a number settable in a sysctl(not yet)).
     but prior to routing the firewall can inspect them (see below).
     (possibly in the future you may be able to associate a FIB
     with packets received on an interface..  An ifconfig arg, but not yet.)

  3/ packets inspected by a packet classifier, which can arbitrarily
     associate a fib with it on a packet by packet basis.
     A fib assigned to a packet by a packet classifier
     (such as ipfw) would over-ride a fib associated by
     a more default source. (such as cases 1 or 2).

  4/ a tcp listen socket associated with a fib will generate
     accept sockets that are associated with that same fib.

  5/ Packets generated in response to some other packet (e.g. reset
     or icmp packets). These should use the FIB associated with the
     packet being reponded to.

  6/ Packets generated during encapsulation.
     gif, tun and other tunnel interfaces will encapsulate using the FIB
     that was in effect withthe proces that set up the tunnel.
     thus setfib 1 ifconfig gif0 [tunnel instructions]
     will set the fib for the tunnel to use to be fib 1.

  Routing messages would be associated with their
  process, and thus select one FIB or another.
  messages from the kernel would be associated with the fib they
  refer to and would only be received by a routing socket associated
  with that fib. (not yet implemented)

  In addition Netstat has been edited to be able to cope with the
  fact that the array is now 2 dimensional. (It looks in system
  memory using libkvm (!)). Old versions of netstat see only the first FIB.

  In addition two sysctls are added to give:
  a) the number of FIBs compiled in (active)
  b) the default FIB of the calling process.

  Early testing experience:
  -------------------------

  Basically our (IronPort's) appliance does this functionality already
  using ipfw fwd but that method has some drawbacks.

  For example,
  It can't fully simulate a routing table because it can't influence the
  socket's choice of local address when a connect() is done.

  Testing during the generating of these changes has been
  remarkably smooth so far. Multiple tables have co-existed
  with no notable side effects, and packets have been routes
  accordingly.

  ipfw has grown 2 new keywords:

  setfib N ip from anay to any
  count ip from any to any fib N

  In pf there seems to be a requirement to be able to give symbolic names to the
  fibs but I do not have that capacity. I am not sure if it is required.

  SCTP has interestingly enough built in support for this, called VRFs
  in Cisco parlance. it will be interesting to see how that handles it
  when it suddenly actually does something.

  Where to next:
  --------------------

  After committing the ABI compatible version and MFCing it, I'd
  like to proceed in a forward direction in -current. this will
  result in some roto-tilling in the routing code.

  Firstly: the current code's idea of having a separate tree per
  protocol family, all of the same format, and pointed to by the
  1 dimensional array is a bit silly. Especially when one considers that
  there is code that makes assumptions about every protocol having the
  same internal structures there. Some protocols don't WANT that
  sort of structure. (for example the whole idea of a netmask is foreign
  to appletalk). This needs to be made opaque to the external code.

  My suggested first change is to add routing method pointers to the
  'domain' structure, along with information pointing the data.
  instead of having an array of pointers to uniform structures,
  there would be an array pointing to the 'domain' structures
  for each protocol address domain (protocol family),
  and the methods this reached would be called. The methods would have
  an argument that gives FIB number, but the protocol would be free
  to ignore it.

  When the ABI can be changed it raises the possibilty of the
  addition of a fib entry into the "struct route". Currently,
  the structure contains the sockaddr of the desination, and the resulting
  fib entry. To make this work fully, one could add a fib number
  so that given an address and a fib, one can find the third element, the
  fib entry.

  Interaction with the ARP layer/ LL layer would need to be
  revisited as well. Qing Li has been working on this already.

  This work was sponsored by Ironport Systems/Cisco

Reviewed by:    several including rwatson, bz and mlair (parts each)
Obtained from:  Ironport systems/Cisco
2008-05-09 23:03:00 +00:00
Robert Watson
8501a69cc9 Convert pcbinfo and inpcb mutexes to rwlocks, and modify macros to
explicitly select write locking for all use of the inpcb mutex.
Update some pcbinfo lock assertions to assert locked rather than
write-locked, although in practice almost all uses of the pcbinfo
rwlock main exclusive, and all instances of inpcb lock acquisition
are exclusive.

This change should introduce (ideally) little functional change.
However, it lays the groundwork for significantly increased
parallelism in the TCP/IP code.

MFC after:	3 months
Tested by:	kris (superset of committered patch)
2008-04-17 21:38:18 +00:00
Kip Macy
bc65987ade Incorporate TCP offload hooks in to core TCP code.
- Rename output routines tcp_gen_* -> tcp_output_*.
  - Rename notification routines that turn in to no-ops in the absence of TOE
    from tcp_gen_* -> tcp_offload_*.
  - Fix some minor comment nits.
  - Add a /* FALLTHROUGH */

Reviewed by: Sam Leffler, Robert Watson, and Mike Silbersack
2007-12-18 22:59:07 +00:00
Bjoern A. Zeeb
abebe6db7a Correctly get the authentication key for TCP-MD5 from the SA.
Submitted by:	Nick Hilliard on net@
MFC after:	8 weeks
2007-11-28 13:23:50 +00:00
Robert Watson
2b19cb1b87 More carefully handle various cases in sysctl_drop(), such as unlocking
the inpcb when there's an inpcb without associated timewait state, and
not unlocking when the inpcb has been freed.  This avoids a kernel panic
when tcpdrop(8) is run on a socket in the TIMEWAIT state.

MFC after:	3 days
Reported by:	Rako <rako29 at gmail dot com>
2007-11-24 18:43:59 +00:00
Robert Watson
30d239bc4c Merge first in a series of TrustedBSD MAC Framework KPI changes
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:

  mac_<object>_<method/action>
  mac_<object>_check_<method/action>

The previous naming scheme was inconsistent and mostly
reversed from the new scheme.  Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier.  Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods.  Also simplify, slightly,
some entry point names.

All MAC policy modules will need to be recompiled, and modules
not updates as part of this commit will need to be modified to
conform to the new KPI.

Sponsored by:	SPARTA (original patches against Mac OS X)
Obtained from:	TrustedBSD Project, Apple Computer
2007-10-24 19:04:04 +00:00
Mike Silbersack
4b421e2daa Add FBSDID to all files in netinet so that people can more
easily include file version information in bug reports.

Approved by:	re (kensmith)
2007-10-07 20:44:24 +00:00
Robert Watson
0fb651b1c4 Disable TCP syncache debug logging by default. While useful in debugging
problems with the syncache, it produces a lot of console noise and has led
to quite a few false positive bug reports.  It can be selectively
re-enabled when debugging specific problems by frobbing the same sysctl.

Discussed with:	silby
Approved by:	re (gnn)
2007-10-05 22:39:44 +00:00
Mike Silbersack
e2f2059f68 Two changes:
- Reintegrate the ANSI C function declaration change
  from tcp_timer.c rev 1.92

- Reorganize the tcpcb structure so that it has a single
  pointer to the "tcp_timer" structure which contains all
  of the tcp timer callouts.  This change means that when
  the single tcp timer change is reintegrated, tcpcb will
  not change in size, and therefore the ABI between
  netstat and the kernel will not change.

Neither of these changes should have any functional
impact.

Reviewed by: bmah, rrs
Approved by: re (bmah)
2007-09-24 05:26:24 +00:00
Robert Watson
85d9437250 Back out tcp_timer.c:1.93 and associated changes that reimplemented the many
TCP timers as a single timer, but retain the API changes necessary to
reintroduce this change.  This will back out the source of at least two
reported problems: lock leaks in certain timer edge cases, and TCP timers
continuing to fire after a connection has closed (a bug previously fixed and
then reintroduced with the timer rewrite).

In a follow-up commit, some minor restylings and comment changes performed
after the TCP timer rewrite will be reapplied, and a further change to allow
the TCP timer rewrite to be added back without disturbing the ABI.  The new
design is believed to be a good thing, but the outstanding issues are
leading to significant stability/correctness problems that are holding
up 7.0.

This patch was generated by silby, but is being committed by proxy due to
poor network connectivity for silby this week.

Approved by:	re (kensmith)
Submitted by:	silby
Tested by:	rwatson, kris
Problems reported by:	peter, kris, others
2007-09-07 09:19:22 +00:00
Qing Li
8cb5ba02d8 Use the sequence number comparison macro to compare
projected_offset against isn_offset to account for
wrap around.

Reviewed by:	gnn, kmacy, silby
Submitted by:	yusheng.huang@bluecoat.com
Approved by:	re
MFC:		3 days
2007-08-16 01:35:55 +00:00
Peter Wemm
c4a184bdc4 Change TCPTV_MIN to be independent of HZ. While it was documented to
be in ticks "for algorithm stability" when originally committed, it turns
out that it has a significant impact in timing out connections.  When we
changed HZ from 100 to 1000, this had a big effect on reducing the time
before dropping connections.

To demonstrate, boot with kern.hz=100.  ssh to a box on local ethernet
and establish a reliable round-trip-time (ie: type a few commands).
Then unplug the ethernet and press a key.  Time how long it takes to
drop the connection.

The old behavior (with hz=100) caused the connection to typically drop
between 90 and 110 seconds of getting no response.

Now boot with kern.hz=1000 (default).  The same test causes the ssh session
to drop after just 9-10 seconds.  This is a big deal on a wifi connection.

With kern.hz=1000, change sysctl net.inet.tcp.rexmit_min from 3 to 30.
Note how it behaves the same as when HZ was 100.  Also, note that when
booting with hz=100, net.inet.tcp.rexmit_min *used* to be 30.

This commit changes TCPTV_MIN to be scaled with hz.  rexmit_min should
always be about 30.  If you set hz to Really Slow(TM), there is a safety
feature to prevent a value of 0 being used.

This may be revised in the future, but for the time being, it restores the
old, pre-hz=1000 behavior, which is significantly less annoying.

As a workaround, to avoid rebooting or rebuilding a kernel, you can run
"sysctl net.inet.tcp.rexmit_min=30" and add "net.inet.tcp.rexmit_min=30"
to /etc/sysctl.conf.  This is safe to run from 6.0 onwards.

Approved by:  re (rwatson)
Reviewed by:  andre, silby
2007-07-31 22:11:55 +00:00
Andre Oppermann
773673c133 Provide a sysctl to toggle reporting of TCP debug logging:
sys.net.inet.tcp.log_debug = 1

It defaults to enabled for the moment and is to be turned off for
the next release like other diagnostics from development branches.

It is important to note that sysctl sys.net.inet.tcp.log_in_vain
uses the same logging function as log_debug.  Enabling of the former
also causes the latter to engage, but not vice versa.

Use consistent terminology in tcp log messages:

 "ignored" means a segment contains invalid flags/information and
   is dropped without changing state or issuing a reply.

 "rejected" means a segments contains invalid flags/information but
   is causing a reply (usually RST) and may cause a state change.

Approved by:	re (rwatson)
2007-07-28 12:20:39 +00:00
Robert Watson
c6b2899785 Replace references to NET_CALLOUT_MPSAFE with CALLOUT_MPSAFE, and remove
definition of NET_CALLOUT_MPSAFE, which is no longer required now that
debug.mpsafenet has been removed.

The once over:	bz
Approved by:	re (kensmith)
2007-07-28 07:31:30 +00:00
Mike Silbersack
c325962b47 Export the contents of the syncache to netstat.
Approved by: re (kensmith)
MFC after: 2 weeks
2007-07-27 00:57:06 +00:00
Peter Wemm
477d44c467 Fix a second warning, introduced by my last "fix". I committed the wrong
diff from the wrong machine.

Pointy hat to: peter
Approved by:  re (rwatson - blanket, several days ago)
2007-07-05 06:04:46 +00:00
Peter Wemm
9fb5d4c064 Fix cast-qualifiers warning when INET6 is not present
Approved by:  re (rwatson)
2007-07-05 05:55:57 +00:00
George V. Neville-Neil
b2630c2934 Commit the change from FAST_IPSEC to IPSEC. The FAST_IPSEC
option is now deprecated, as well as the KAME IPsec code.
What was FAST_IPSEC is now IPSEC.

Approved by: re
Sponsored by: Secure Computing
2007-07-03 12:13:45 +00:00
George V. Neville-Neil
2cb64cb272 Commit IPv6 support for FAST_IPSEC to the tree.
This commit includes only the kernel files, the rest of the files
will follow in a second commit.

Reviewed by:    bz
Approved by:    re
Supported by:   Secure Computing
2007-07-01 11:41:27 +00:00
Robert Watson
32f9753cfb Eliminate now-unused SUSER_ALLOWJAIL arguments to priv_check_cred(); in
some cases, move to priv_check() if it was an operation on a thread and
no other flags were present.

Eliminate caller-side jail exception checking (also now-unused); jail
privilege exception code now goes solely in kern_jail.c.

We can't yet eliminate suser() due to some cases in the KAME code where
a privilege check is performed and then used in many different deferred
paths.  Do, however, move those prototypes to priv.h.

Reviewed by:	csjp
Obtained from:	TrustedBSD Project
2007-06-12 00:12:01 +00:00
Robert Watson
b312d4b0ba Don't assign sp to the value of s when we're about to assign it instead to
s + strlen(s).

Found with:	Coverity Prevent(tm)
CID:		2243
2007-05-27 17:02:54 +00:00
Andre Oppermann
ec05a17370 In tcp_log_addrs():
o add the hex output of the th_flags field to the example log
   line in comments
 o simplify the log line length calculation and make it less
   evil
 o correct the test for the length panic; the line isn't on
   the stack but malloc'ed
2007-05-23 19:07:53 +00:00
Andre Oppermann
df541e5fc1 Add tcp_log_addrs() function to generate and standardized TCP log line
for use thoughout the tcp subsystem.

It is IPv4 and IPv6 aware creates a line in the following format:

 "TCP: [1.2.3.4]:50332 to [1.2.3.4]:80 tcpflags <RST>"

A "\n" is not included at the end.  The caller is supposed to add
further information after the standard tcp log header.

The function returns a NUL terminated string which the caller has
to free(s, M_TCPLOG) after use.  All memory allocation is done
with M_NOWAIT and the return value may be NULL in memory shortage
situations.

Either struct in_conninfo || (struct tcphdr && (struct ip || struct
ip6_hdr) have to be supplied.

Due to ip[6].h header inclusion limitations and ordering issues the
struct ip and struct ip6_hdr parameters have to be casted and passed
as void * pointers.

tcp_log_addrs(struct in_conninfo *inc, struct tcphdr *th, void *ip4hdr,
    void *ip6hdr)

Usage example:

 struct ip *ip;
 char *tcplog;

 if (tcplog = tcp_log_addrs(NULL, th, (void *)ip, NULL)) {
	log(LOG_DEBUG, "%s; %s: Connection attempt to closed port\n",
	    tcplog, __func__);
	free(s, M_TCPLOG);
 }
2007-05-18 19:58:37 +00:00
Andre Oppermann
2104448fe7 Move TIME_WAIT related functions and timer handling from files
other than repo copied tcp_subr.c into tcp_timewait.c#1.284:

 tcp_input.c#1.350 tcp_timewait() -> tcp_twcheck()

 tcp_timer.c#1.92 tcp_timer_2msl_reset() -> tcp_tw_2msl_reset()
 tcp_timer.c#1.92 tcp_timer_2msl_stop() -> tcp_tw_2msl_stop()
 tcp_timer.c#1.92 tcp_timer_2msl_tw() -> tcp_tw_2msl_scan()

This is a mechanical move with appropriate renames and making
them static if used only locally.

The tcp_tw_2msl_scan() cleanup function is still run from the
tcp_slowtimo() in tcp_timer.c.
2007-05-16 17:14:25 +00:00
Andre Oppermann
ec9c755352 Complete the (mechanical) move of the TCP reassembly and timewait
functions from their origininal place to their own files.

TCP Reassembly from tcp_input.c -> tcp_reass.c
TCP Timewait   from tcp_subr.c  -> tcp_timewait.c
2007-05-13 22:16:13 +00:00
Andre Oppermann
0489b64c5e Make the TCP timer callout obtain Giant if the network stack is marked
as non-mpsafe.

This change is to be removed when all protocols are mp-safe.
2007-05-11 20:52:47 +00:00
Andre Oppermann
504abdc6e6 Add the timestamp offset to struct tcptw so we can generate proper
ACKs in TIME_WAIT state that don't get dropped by the PAWS check
on the receiver.
2007-05-11 18:29:39 +00:00
Robert Watson
f2565d68a4 Move universally to ANSI C function declarations, with relatively
consistent style(9)-ish layout.
2007-05-10 15:58:48 +00:00
Robert Watson
434a0d24dd When setting up timewait state for a TCP connection, don't hold the
socket lock over a crhold() of so_cred: so_cred is constant after
socket creation, so doesn't require locking to read.
2007-05-07 13:04:25 +00:00
Andre Oppermann
3529149e9a Use existing TF_SACK_PERMIT flag in struct tcpcb t_flags field instead of
a decdicated sack_enable int for this bool.  Change all users accordingly.
2007-05-06 15:56:31 +00:00
Robert Watson
712fc218a0 Rename some fields of struct inpcbinfo to have the ipi_ prefix,
consistent with the naming of other structure field members, and
reducing improper grep matches.  Clean up and comment structure
fields in structure definition.
2007-04-30 23:12:05 +00:00
Andre Oppermann
bbf4e1cb47 Make tcp_twrespond() use tcp_addoptions() instead of a home grown version. 2007-04-18 18:14:39 +00:00
Andre Oppermann
b8152ba793 Change the TCP timer system from using the callout system five times
directly to a merged model where only one callout, the next to fire,
is registered.

Instead of callout_reset(9) and callout_stop(9) the new function
tcp_timer_activate() is used which then internally manages the callout.

The single new callout is a mutex callout on inpcb simplifying the
locking a bit.

tcp_timer() is the called function which handles all race conditions
in one place and then dispatches the individual timer functions.

Reviewed by:	rwatson (earlier version)
2007-04-11 09:45:16 +00:00
Andre Oppermann
5dd9dfefd6 Retire unused TCP_SACK_DEBUG. 2007-04-04 14:44:15 +00:00
Andre Oppermann
ad3f9ab320 ANSIfy function declarations and remove register keywords for variables.
Consistently apply style to all function declarations.
2007-03-21 19:37:55 +00:00
Andre Oppermann
e406f5a1c9 Remove tcp_minmssoverload DoS detection logic. The problem it tried to
protect us from wasn't really there and it only bloats the code.  Should
the problem surface in the future we can simply resurrect it from cvs
history.
2007-03-21 18:05:54 +00:00
Andre Oppermann
6489fe6553 Match up SYSCTL declaration style. 2007-03-19 19:00:51 +00:00
Robert Watson
8d0d6d112f Remove unused and #if 0'd net.inet.tcp.tcp_rttdflt sysctl. 2007-03-16 13:42:26 +00:00
Mohan Srinivasan
7c72af8770 Reap FIN_WAIT_2 connections marked SOCANTRCVMORE faster. This mitigate
potential issues where the peer does not close, potentially leaving
thousands of connections in FIN_WAIT_2. This is controlled by a new sysctl
fast_finwait2_recycle, which is disabled by default.

Reviewed by: gnn, silby.
2007-02-26 22:25:21 +00:00
John Baldwin
54e3607de6 Whitespace fix and remove an extra cast. 2006-12-30 17:53:28 +00:00
Robert Watson
acd3428b7d Sweep kernel replacing suser(9) calls with priv(9) calls, assigning
specific privilege names to a broad range of privileges.  These may
require some future tweaking.

Sponsored by:           nCircle Network Security, Inc.
Obtained from:          TrustedBSD Project
Discussed on:           arch@
Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri,
                        Alex Lyashkov <umka at sevcity dot net>,
                        Skip Ford <skip dot ford at verizon dot net>,
                        Antoine Brodin <antoine dot brodin at laposte dot net>
2006-11-06 13:42:10 +00:00
Robert Watson
aed5570872 Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h
begun with a repo-copy of mac.h to mac_framework.h.  sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.

This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.

Obtained from:	TrustedBSD Project
Sponsored by:	SPARTA
2006-10-22 11:52:19 +00:00
Maxim Konovalov
acc03ac6bb o Convert w/spaces to tabs in the previous commit. 2006-09-29 06:46:31 +00:00
Mike Silbersack
d4bdcb16cc Rather than autoscaling the number of TIME_WAIT sockets to maxsockets / 5,
scale it to min(ephemeral port range / 2, maxsockets / 5) so that people
with large gobs of memory and/or large maxsockets settings will not
exhaust their entire ephemeral port range with sockets in the TIME_WAIT
state during periods of heavy load.

Those who wish to tweak the size of the TIME_WAIT zone can still do so with
net.inet.tcp.maxtcptw.

Reviewed by: glebius, ru
2006-09-29 06:24:26 +00:00
Gleb Smirnoff
3e630ef9a9 Add a sysctl net.inet.tcp.nolocaltimewait that allows to suppress
creating a compress TIME WAIT states, if both connection endpoints
are local. Default is off.
2006-09-08 13:09:15 +00:00
Ruslan Ermilov
751dea2935 Back when we had T/TCP support, we used to apply different
timeouts for TCP and T/TCP connections in the TIME_WAIT
state, and we had two separate timed wait queues for them.
Now that is has gone, the timeout is always 2*MSL again,
and there is no reason to keep two queues (the first was
unused anyway!).

Also, reimplement the remaining queue using a TAILQ (it
was technically impossible before, with two queues).
2006-09-07 13:06:00 +00:00
Andre Oppermann
233dcce118 First step of TSO (TCP segmentation offload) support in our network stack.
o add IFCAP_TSO[46] for drivers to announce this capability for IPv4 and IPv6
 o add CSUM_TSO flag to mbuf pkthdr csum_flags field
 o add tso_segsz field to mbuf pkthdr
 o enhance ip_output() packet length check to allow for large TSO packets
 o extend tcp_maxmtu[46]() with a flag pointer to pass interface capabilities
 o adjust all callers of tcp_maxmtu[46]() accordingly

Discussed on:	-current, -net
Sponsored by:	TCP/IP Optimization Fundraise 2005
2006-09-06 21:51:59 +00:00
Gleb Smirnoff
2c857a9be9 o Backout rev. 1.125 of in_pcb.c. It appeared to behave extremely
bad under high load. For example with 40k sockets and 25k tcptw
  entries, connect() syscall can run for seconds. Debugging showed
  that it iterates the cycle millions times and purges thousands of
  tcptw entries at a time.
  Besides practical unusability this change is architecturally
  wrong. First, in_pcblookup_local() is used in connect() and bind()
  syscalls. No stale entries purging shouldn't be done here. Second,
  it is a layering violation.
o Return back the tcptw purging cycle to tcp_timer_2msl_tw(),
  that was removed in rev. 1.78 by rwatson. The commit log of this
  revision tells nothing about the reason cycle was removed. Now
  we need this cycle, since major cleaner of stale tcptw structures
  is removed.
o Disable probably necessary, but now unused
  tcp_twrecycleable() function.

Reviewed by:	ru
2006-09-06 13:56:35 +00:00
Gleb Smirnoff
c3e07bf82a Finally fix rev. 1.256
Pointy hat to:	glebius
2006-09-05 14:00:59 +00:00
Gleb Smirnoff
23ebab416c Remove extra parenthesis in last commit.
Nitpicked by:	ru
2006-09-05 12:22:54 +00:00
Gleb Smirnoff
1f1f90c3a7 - Make net.inet.tcp.maxtcptw modifiable at run time.
- If net.inet.tcp.maxtcptw was ever set explicitly, do
  not change it if kern.ipc.maxsockets is changed.
2006-09-05 12:08:47 +00:00
Mohan Srinivasan
2374501ca4 Fix for a bug that causes the computation of "len" in tcp_output() to
get messed up, resulting in an inconsistency between the TCP state
and so_snd.
2006-08-26 17:53:19 +00:00
Mohan Srinivasan
464469c713 Fixes an edge case bug in timewait handling where ticks rolling over causing
the timewait expiry to be exactly 0 corrupts the timewait queues (and that entry).
Reviewed by:	silby
2006-08-11 21:15:23 +00:00
Robert Watson
e850475248 Move soisdisconnected() in tcp_discardcb() to one of its calling contexts,
tcp_twstart(), but not to the other, tcp_detach(), as the socket is
already being torn down and therefore there are no listeners.  This avoids
a panic if kqueue state is registered on the socket at close(), and
eliminates to XXX comments.  There is one case remaining in which
tcp_discardcb() reaches up to the socket layer as part of the TCP host
cache, which would be good to avoid.

Reported by:	Goran Gajic <ggajic at afrodita dot rcub dot bg dot ac dot yu>
2006-08-02 16:18:05 +00:00
Robert Watson
a152f8a361 Change semantics of socket close and detach. Add a new protocol switch
function, pru_close, to notify protocols that the file descriptor or
other consumer of a socket is closing the socket.  pru_abort is now a
notification of close also, and no longer detaches.  pru_detach is no
longer used to notify of close, and will be called during socket
tear-down by sofree() when all references to a socket evaporate after
an earlier call to abort or close the socket.  This means detach is now
an unconditional teardown of a socket, whereas previously sockets could
persist after detach of the protocol retained a reference.

This faciliates sharing mutexes between layers of the network stack as
the mutex is required during the checking and removal of references at
the head of sofree().  With this change, pru_detach can now assume that
the mutex will no longer be required by the socket layer after
completion, whereas before this was not necessarily true.

Reviewed by:	gnn
2006-07-21 17:11:15 +00:00
Stephan Uphoff
d915b28015 Fix race conditions on enumerating pcb lists by moving the initialization
( and where appropriate the destruction) of the pcb mutex to the init/finit
functions of the pcb zones.
This allows locking of the pcb entries and race condition free comparison
of the generation count.
Rearrange locking a bit to avoid extra locking operation to update the generation
count in in_pcballoc(). (in_pcballoc now returns the pcb locked)

I am planning to convert pcb list handling from a type safe to a reference count
model soon. ( As this allows really freeing the PCBs)

Reviewed by:	rwatson@, mohans@
MFC after:	1 week
2006-07-18 22:34:27 +00:00
Robert Watson
10702a2840 Abstract inpcb drop logic, previously just setting of INP_DROPPED in TCP,
into in_pcbdrop().  Expand logic to detach the inpcb from its bound
address/port so that dropping a TCP connection releases the inpcb resource
reservation, which since the introduction of socket/pcb reference count
updates, has been persisting until the socket closed rather than being
released implicitly due to prior freeing of the inpcb on TCP drop.

MFC after:	3 months
2006-04-25 11:17:35 +00:00
Robert Watson
9106a6d6b0 Replace isn_mtx direct use with ISN_*() lock macros so that locking
details/strategy can be changed without touching every use.

MFC after:	3 months
2006-04-23 12:27:42 +00:00
Robert Watson
4c0e8f41f6 Introduce a new TCP mutex, isn_mtx, which protects the initial sequence
number state, rather than re-using pcbinfo.  This introduces some
additional mutex operations during isn query, but avoids hitting the TCP
pcbinfo lock out of yet another frequently firing TCP timer.

MFC after:	3 months
2006-04-22 19:23:24 +00:00
Paul Saab
4f590175b7 Allow for nmbclusters and maxsockets to be increased via sysctl.
An eventhandler is used to update all the various zones that depend
on these values.
2006-04-21 09:25:40 +00:00
Gleb Smirnoff
a73b656763 Add a tunable net.inet.tcp.maxtcptw, that allows to set a limit
on tcptw zone independently from setting a limit on socket zone.
2006-04-04 14:31:37 +00:00
Robert Watson
ae0e714308 Before dereferencing intotw() when INP_TIMEWAIT, check for inp_ppcb being
NULL.  We currently do allow this to happen, but may want to remove that
possibility in the future.  This case can occur when a socket is left
open after TCP wraps up, and the timewait state is recycled.  This will
be cleaned up in the future.

Found by:	Kazuaki Oda <kaakun at highway dot ne dot jp>
MFC after:	3 months
2006-04-04 12:26:07 +00:00
Robert Watson
cb895fb9b0 In TCP notify routines, check inpcb for INP_TIMEWAIT and INP_DROPPED.
The INP_DROPPED check replaces the current NULL checks; the INP_TIMEWAIT
checks appear to have always been required, but not been there, which
is/was a bug.  This avoids unconditionally casting of in_ppcb to a tcpcb,
when it may be a twtcb, which may have resulted in obscure ICMP-related
panics in earlier releases.

MFC after:	3 months
2006-04-03 14:07:50 +00:00
Robert Watson
afa39e25c4 Change inp_ppcb from caddr_t to void *, fix/remove associated related
casts.

Consistently use intotw() to cast inp_ppcb pointers to struct tcptw *
pointers.

Consistently use intotcpcb() to cast inp_ppcb pointers to struct tcpcb *
pointers.

Don't assign tp to the results to intotcpcb() during variable declation
at the top of functions, as that is before the asserts relating to
locking have been performed.  Do this later in the function after
appropriate assertions have run to allow that operation to be conisdered
safe.

MFC after:	3 months
2006-04-03 13:33:55 +00:00
Robert Watson
43f56a32a0 Style tweaks: convert to ANSI from K&R function prototypes.
MFC after:	3 months
2006-04-03 12:59:27 +00:00
Robert Watson
2fc5ae87d0 Update comment on tcp_close() for new world order.
MFC after:	3 months
2006-04-03 12:52:13 +00:00
Robert Watson
fa38deac65 Fix up locking surrounding tcp_drop sysctl: in the new world order, we
don't free inpcbs until after the socket is closed, so we always need
to unlock an inpcb after calling tcp_drop() on it.

MFC after:	3 months
2006-04-03 11:57:12 +00:00
Robert Watson
34af7bae80 Properly handle an edge case previously not handled correctly: a
socket can have a tcp connection that has entered time wait
attached to it, in the event that shutdown() is called on the
socket and the FINs properly exchange before close().  In this
case we don't detach or free the inpcb, just leave the tcptw
detached and freed, but we must release the inpcb lock (which we
didn't previously).

MFC after:	3 months
2006-04-01 23:53:25 +00:00
Robert Watson
623dce13c6 Update TCP for infrastructural changes to the socket/pcb refcount model,
pru_abort(), pru_detach(), and in_pcbdetach():

- Universally support and enforce the invariant that so_pcb is
  never NULL, converting dozens of unnecessary NULL checks into
  assertions, and eliminating dozens of unnecessary error handling
  cases in protocol code.

- In some cases, eliminate unnecessary pcbinfo locking, as it is no
  longer required to ensure so_pcb != NULL.  For example, the receive
  code no longer requires the pcbinfo lock, and the send code only
  requires it if building a new connection on an otherwise unconnected
  socket triggered via sendto() with an address.  This should
  significnatly reduce tcbinfo lock contention in the receive and send
  cases.

- In order to support the invariant that so_pcb != NULL, it is now
  necessary for the TCP code to not discard the tcpcb any time a
  connection is dropped, but instead leave the tcpcb until the socket
  is shutdown.  This case is handled by setting INP_DROPPED, to
  substitute for using a NULL so_pcb to indicate that the connection
  has been dropped.  This requires the inpcb lock, but not the pcbinfo
  lock.

- Unlike all other protocols in the tree, TCP may need to retain access
  to the socket after the file descriptor has been closed.  Set
  SS_PROTOREF in tcp_detach() in order to prevent the socket from being
  freed, and add a flag, INP_SOCKREF, so that the TCP code knows whether
  or not it needs to free the socket when the connection finally does
  close.  The typical case where this occurs is if close() is called on
  a TCP socket before all sent data in the send socket buffer has been
  transmitted or acknowledged.  If INP_SOCKREF is found when the
  connection is dropped, we release the inpcb, tcpcb, and socket instead
  of flagging INP_DROPPED.

- Abort and detach protocol switch methods no longer return failures,
  nor attempt to free sockets, as the socket layer does this.

- Annotate the existence of a long-standing race in the TCP timer code,
  in which timers are stopped but not drained when the socket is freed,
  as waiting for drain may lead to deadlocks, or have to occur in a
  context where waiting is not permitted.  This race has been handled
  by testing to see if the tcpcb pointer in the inpcb is NULL (and vice
  versa), which is not normally permitted, but may be true of a inpcb
  and tcpcb have been freed.  Add a counter to test how often this race
  has actually occurred, and a large comment for each instance where
  we compare potentially freed memory with NULL.  This will have to be
  fixed in the near future, but requires is to further address how to
  handle the timer shutdown shutdown issue.

- Several TCP calls no longer potentially free the passed inpcb/tcpcb,
  so no longer need to return a pointer to indicate whether the argument
  passed in is still valid.

- Un-macroize debugging and locking setup for various protocol switch
  methods for TCP, as it lead to more obscurity, and as locking becomes
  more customized to the methods, offers less benefit.

- Assert copyright on tcp_usrreq.c due to significant modifications that
  have been made as part of this work.

These changes significantly modify the memory management and connection
logic of our TCP implementation, and are (as such) High Risk Changes,
and likely to contain serious bugs.  Please report problems to the
current@ mailing list ASAP, ideally with simple test cases, and
optionally, packet traces.

MFC after:	3 months
2006-04-01 16:36:36 +00:00
Andre Oppermann
eaf80179e2 Have TCP Inflight disable itself if the RTT is below a certain
threshold.  Inflight doesn't make sense on a LAN as it has
trouble figuring out the maximal bandwidth because of the coarse
tick granularity.

The sysctl net.inet.tcp.inflight.rttthresh specifies the threshold
in milliseconds below which inflight will disengage.  It defaults
to 10ms.

Tested by:	Joao Barros <joao.barros-at-gmail.com>,
		Rich Murphey <rich-at-whiteoaklabs.com>
Sponsored by:	TCP/IP Optimization Fundraise 2005
2006-02-16 19:38:07 +00:00
Andre Oppermann
34333b16cd Retire MT_HEADER mbuf type and change its users to use MT_DATA.
Having an additional MT_HEADER mbuf type is superfluous and redundant
as nothing depends on it.  It only adds a layer of confusion.  The
distinction between header mbuf's and data mbuf's is solely done
through the m->m_flags M_PKTHDR flag.

Non-native code is not changed in this commit.  For compatibility
MT_HEADER is mapped to MT_DATA.

Sponsored by:	TCP/IP Optimization Fundraise 2005
2005-11-02 13:46:32 +00:00
Philip Paeps
7691747aac Unbreak the net.inet6.tcp6.getcred sysctl.
This makes inetd/auth work again in IPv6 setups.

Pointy hat to:	ume/KAME
2005-10-12 09:24:18 +00:00
Maxim Konovalov
ac827533df o Teach sysctl_drop() how to deal with the sockets in TIME_WAIT state.
This is a special case because tcp_twstart() destroys a tcp control
block via tcp_discardcb() so we cannot call tcp_drop(struct *tcpcb) on
such connections.  Use tcp_twclose() instead.

MFC after:	5 days
2005-10-02 08:43:57 +00:00
Andre Oppermann
ffabe3dce8 In tcp_ctlinput() do not swap ip->ip_len a second time. It
has been done in icmp_input() already.

This fixes the ICMP_UNREACH_NEEDFRAG case where no MTU was
proposed in the ICMP reply.

PR:		kern/81813
Submitted by:	Vitezslav Novy <vita at fio.cz>
MFC after:	3 days
2005-09-10 07:43:29 +00:00
Andre Oppermann
e0aec68255 Use the correct mbuf type for MGET(). 2005-08-30 16:35:27 +00:00
Hajimu UMEMOTO
4dad226e45 recover the line which was wrongly disappeared during scope cleanup.
tcpdrop(8) should work for IPv6, again.
2005-08-01 12:08:49 +00:00
Hajimu UMEMOTO
a1f7e5f8ee scope cleanup. with this change
- most of the kernel code will not care about the actual encoding of
  scope zone IDs and won't touch "s6_addr16[1]" directly.
- similarly, most of the kernel code will not care about link-local
  scoped addresses as a special case.
- scope boundary check will be stricter.  For example, the current
  *BSD code allows a packet with src=::1 and dst=(some global IPv6
  address) to be sent outside of the node, if the application do:
    s = socket(AF_INET6);
    bind(s, "::1");
    sendto(s, some_global_IPv6_addr);
  This is clearly wrong, since ::1 is only meaningful within a single
  node, but the current implementation of the *BSD kernel cannot
  reject this attempt.

Submitted by:	JINMEI Tatuya <jinmei__at__isl.rdc.toshiba.co.jp>
Obtained from:	KAME
2005-07-25 12:31:43 +00:00
Robert Watson
f59a9ebf10 Remove no-op spl's and most comment references to spls, as TCP locking
is believed to be basically done (modulo any remaining bugs).

MFC after:	3 days
2005-07-19 12:21:26 +00:00
Paul Saab
482ac96888 Fix for a bug in the change that defers sack option processing until
after PAWS checks. The symptom of this is an inconsistency in the cached
sack state, caused by the fact that the sack scoreboard was not being
updated for an ACK handled in the header prediction path.

Found by:	Andrey Chernov.
Submitted by:	Noritoshi Demizu, Raja Mukerji.
Approved by:	re
2005-07-01 22:54:18 +00:00
Robert Watson
e3d5315d01 Assert tcbinfo lock in tcp_drop() due to its call of tcp_close()
Assert tcbinfo lock in tcp_close() due to its call to in{,6}_detach()
Assert tcbinfo lock in tcp_drop_syn_sent() due to its call to tcp_drop()

MFC after:	7 days
2005-06-01 12:06:07 +00:00
Colin Percival
fe2eee8231 Fix two issues which were missed in FreeBSD-SA-05:08.kmem.
Reported by:	Uwe Doering
2005-05-07 00:41:36 +00:00
Andre Oppermann
9e4ca6315d If we don't get a suggested MTU during path MTU discovery
look up the packet size of the packet that generated the
response, step down the MTU by one step through ip_next_mtu()
and try again.

Suggested by:	dwmalone
2005-05-04 13:48:44 +00:00
Paul Saab
a6235da61e - Make the sack scoreboard logic use the TAILQ macros. This improves
code readability and facilitates some anticipated optimizations in
  tcp_sack_option().
- Remove tcp_print_holes() and TCP_SACK_DEBUG.

Submitted by:	Raja Mukerji.
Reviewed by:	Mohan Srinivasan, Noritoshi Demizu.
2005-04-21 20:11:01 +00:00
Andre Oppermann
1aedbd9c80 Move Path MTU discovery ICMP processing from icmp_input() to
tcp_ctlinput() and subject it to active tcpcb and sequence
number checking.  Previously any ICMP unreachable/needfrag
message would cause an update to the TCP hostcache.  Now only
ICMP PMTU messages belonging to an active TCP session with
the correct src/dst/port and sequence number will update the
hostcache and complete the path MTU discovery process.

Note that we don't entirely implement the recommended counter
measures of Section 7.2 of the paper.  However we close down
the possible degradation vector from trivially easy to really
complex and resource intensive.  In addition we have limited
the smallest acceptable MTU with net.inet.tcp.minmss sysctl
for some time already, further reducing the effect of any
degradation due to an attack.

Security:	draft-gont-tcpm-icmp-attacks-03.txt Section 7.2
MFC after:	3 days
2005-04-21 14:29:34 +00:00
Andre Oppermann
1600372b6b Ignore ICMP Source Quench messages for TCP sessions. Source Quench is
ineffective, depreciated and can be abused to degrade the performance
of active TCP sessions if spoofed.

Replace a bogus call to tcp_quench() in tcp_output() with the direct
equivalent tcpcb variable assignment.

Security:	draft-gont-tcpm-icmp-attacks-03.txt Section 7.1
MFC after:	3 days
2005-04-21 12:37:12 +00:00
Paul Saab
e346eeff65 - If the reassembly queue limit was reached or if we couldn't allocate
a reassembly queue state structure, don't update (receiver) sack
  report.
- Similarly, if tcp_drain() is called, freeing up all items on the
  reassembly queue, clean the sack report.

Found, Submitted by:	Noritoshi Demizu <demizu at dd dot iij4u dot or dot jp>
Reviewed by:	Mohan Srinivasan (mohans at yahoo-inc dot com),
		Raja Mukerji (raja at moselle dot com).
2005-04-10 05:21:29 +00:00
Gleb Smirnoff
31199c8463 Use NET_CALLOUT_MPSAFE macro. 2005-03-01 12:01:17 +00:00
Maxim Konovalov
9945c0e21f o Add handling of an IPv4-mapped IPv6 address.
o Use SYSCTL_IN() macro instead of direct call of copyin(9).

Submitted by:	ume

o Move sysctl_drop() implementation to sys/netinet/tcp_subr.c where
most of tcp sysctls live.
o There are net.inet[6].tcp[6].getcred sysctls already, no needs in
a separate struct tcp_ident_mapping.

Suggested by:	ume
2005-02-14 07:37:51 +00:00
Hajimu UMEMOTO
6d0a982bdf teach scope of IPv6 address to net.inet6.tcp6.getcred.
MFC after:	1 week
2005-02-04 14:43:05 +00:00
Robert Watson
06456da2c6 Update an additional reference to the rate of ISN tick callouts that was
missed in tcp_subr.c:1.216: projected_offset must also reflect how often
the tcp_isn_tick() callout will fire.

MFC after:	2 weeks
Submitted by:	silby
2005-01-31 01:35:01 +00:00
Robert Watson
54082796aa Have tcp_isn_tick() fire 100 times a second, rather than HZ times a
second; since the default hz has changed to 1000 times a second,
this resulted in unecessary work being performed.

MFC after:		2 weeks
Discussed with:		phk, cperciva
General head nod:	silby
2005-01-30 23:30:28 +00:00
Warner Losh
c398230b64 /* -> /*- for license, minor formatting changes 2005-01-07 01:45:51 +00:00
Robert Watson
452d9f5b1c Attempt to consistently use () around return values in calls to
return() in newer code (sysctl, ISN, timewait).
2004-12-23 01:34:26 +00:00
Robert Watson
06da46b241 Remove an XXXRW comment relating to whether or not the TCP timers are
MPSAFE: they are now believed to be.

Correct a typo in a second comment.

MFC after:	2 weeks
2004-12-23 01:27:13 +00:00
Robert Watson
b9155d92b2 Assert inpcb lock in:
tcpip_fillheaders()
  tcp_discardcb()
  tcp_close()
  tcp_notify()
  tcp_new_isn()
  tcp_xmit_bandwidth_limit()

Fix a locking comment in tcp_twstart(): the pcbinfo will be locked (and
is asserted).

MFC after:	2 weeks
2004-12-05 22:27:53 +00:00
Robert Watson
cce83ffb5a tcp_timewait() performs multiple non-atomic reads on the tcptw
structure, so assert the inpcb lock associated with the tcptw.
Also assert the tcbinfo lock, as tcp_timewait() may call
tcp_twclose() or tcp_2msl_rest(), which require it.  Since
tcp_timewait() is already called with that lock from tcp_input(),
this doesn't change current locking, merely documents reasons for
it.

In tcp_twstart(), assert the tcbinfo lock, as tcp_timer_2msl_rest()
is called, which requires that lock.

In tcp_twclose(), assert the tcbinfo lock, as tcp_timer_2msl_stop()
is called, which requires that lock.

Document the locking strategy for the time wait queues in tcp_timer.c,
which consists of protecting the time wait queues in the same manner
as the tcbinfo structure (using the tcbinfo lock).

In tcp_timer_2msl_reset(), assert the tcbinfo lock, as the time wait
queues are modified.

In tcp_timer_2msl_stop(), assert the tcbinfo lock, as the time wait
queues may be modified.

In tcp_timer_2msl_tw(), assert the tcbinfo lock, as the time wait
queues may be modified.

MFC after:	2 weeks
2004-11-23 17:21:30 +00:00
Robert Watson
7258e91f0f Assert the inpcb lock in tcp_twstart(), which does both read-modify-write
on the tcpcb, but also calls into tcp_close() and tcp_twrespond().

Annotate that tcp_twrecycleable() requires the inpcb lock because it does
a series of non-atomic reads of the tcpcb, but is currently called
without the inpcb lock by the caller.  This is a bug.

Assert the inpcb lock in tcp_twclose() as it performs a read-modify-write
of the timewait structure/inpcb, and calls in_pcbdetach() which requires
the lock.

Assert the inpcb lock in tcp_twrespond(), as it performs multiple
non-atomic reads of the tcptw and inpcb structures, as well as calling
mac_create_mbuf_from_inpcb(), tcpip_fillheaders(), which require the
inpcb lock.

MFC after:	2 weeks
2004-11-23 16:23:13 +00:00
Robert Watson
8263bab34d Assert inpcb lock in tcp_quench(), tcp_drop_syn_sent(), tcp_mtudisc(),
and tcp_drop(), due to read-modify-write of TCP state variables.

MFC after:	2 weeks
2004-11-23 16:06:15 +00:00
Robert Watson
8438db0f59 Assert the tcbinfo write lock in tcp_new_isn(), as the tcbinfo lock
protects access to the ISN state variables.

Acquire the tcbinfo write lock in tcp_isn_tick() to synchronize
timer-driven isn bumping.

Staticize internal ISN variables since they're not used outside of
tcp_subr.c.

MFC after:	2 weeks
2004-11-23 15:59:43 +00:00
SUZUKI Shinsuke
3d54848fc2 support TCP-MD5(IPv4) in KAME-IPSEC, too.
MFC after: 3 week
2004-11-08 18:49:51 +00:00
Andre Oppermann
c94c54e4df Remove RFC1644 T/TCP support from the TCP side of the network stack.
A complete rationale and discussion is given in this message
and the resulting discussion:

 http://docs.freebsd.org/cgi/mid.cgi?4177C8AD.6060706

Note that this commit removes only the functional part of T/TCP
from the tcp_* related functions in the kernel.  Other features
introduced with RFC1644 are left intact (socket layer changes,
sendmsg(2) on connection oriented protocols)  and are meant to
be reused by a simpler and less intrusive reimplemention of the
previous T/TCP functionality.

Discussed on:	-arch
2004-11-02 22:22:22 +00:00
Robert Watson
81158452be Push acquisition of the accept mutex out of sofree() into the caller
(sorele()/sotryfree()):

- This permits the caller to acquire the accept mutex before the socket
  mutex, avoiding sofree() having to drop the socket mutex and re-order,
  which could lead to races permitting more than one thread to enter
  sofree() after a socket is ready to be free'd.

- This also covers clearing of the so_pcb weak socket reference from
  the protocol to the socket, preventing races in clearing and
  evaluation of the reference such that sofree() might be called more
  than once on the same socket.

This appears to close a race I was able to easily trigger by repeatedly
opening and resetting TCP connections to a host, in which the
tcp_close() code called as a result of the RST raced with the close()
of the accepted socket in the user process resulting in simultaneous
attempts to de-allocate the same socket.  The new locking increases
the overhead for operations that may potentially free the socket, so we
will want to revise the synchronization strategy here as we normalize
the reference counting model for sockets.  The use of the accept mutex
in freeing of sockets that are not listen sockets is primarily
motivated by the potential need to remove the socket from the
incomplete connection queue on its parent (listen) socket, so cleaning
up the reference model here may allow us to substantially weaken the
synchronization requirements.

RELENG_5_3 candidate.

MFC after:	3 days
Reviewed by:	dwhite
Discussed with:	gnn, dwhite, green
Reported by:	Marc UBM Bocklet <ubm at u-boot-man dot de>
Reported by:	Vlad <marchenko at gmail dot com>
2004-10-18 22:19:43 +00:00
Paul Saab
a55db2b6e6 - Estimate the amount of data in flight in sack recovery and use it
to control the packets injected while in sack recovery (for both
  retransmissions and new data).
- Cleanups to the sack codepaths in tcp_output.c and tcp_sack.c.
- Add a new sysctl (net.inet.tcp.sack.initburst) that controls the
  number of sack retransmissions done upon initiation of sack recovery.

Submitted by:	Mohan Srinivasan <mohans@yahoo-inc.com>
2004-10-05 18:36:24 +00:00
John-Mark Gurney
b5d47ff592 fix up socket/ip layer violation... don't assume/know that
SO_DONTROUTE == IP_ROUTETOIF and SO_BROADCAST == IP_ALLOWBROADCAST...
2004-09-05 02:34:12 +00:00
Andre Oppermann
6f2d4ea6f8 For IPv6 access pointer to tcpcb only after we have checked it is valid.
Found by:	Coverity's automated analysis (via Ted Unangst)
2004-08-19 20:16:17 +00:00