Commit Graph

410 Commits

Author SHA1 Message Date
Robert Watson
3e5cbaa4c7 Merge r197814 from head to stable/8:
Remove tcp_input lock statistics; these are intended for debugging only
  and are not intended to ship in 8.0 as they dirty additional cache
  lines in a performance-critical per-packet path.

Approved by:	re (kib, bz)
2009-10-09 09:18:22 +00:00
Robert Watson
f41dd6dca9 Merge r197795 from head to stable/8:
In tcp_input(), we acquire a global write lock at first only if a
  segment is likely to trigger a TCP state change (i.e., FIN/RST/SYN).
  If we later have to upgrade the lock, we acquire an inpcb reference
  and drop both global/inpcb locks before reacquiring in-order.  In
  that gap, the connection may transition into TIMEWAIT, so we need
  to loop back and reevaluate the inpcb after relocking.

  Reported by:        Kamigishi Rei <spambox at haruhiism.net>
  Reviewed by:        bz

Approved by:	re (kib)
2009-10-08 11:07:15 +00:00
Robert Watson
315e3e38fa Many network stack subsystems use a single global data structure to hold
all pertinent statatistics for the subsystem.  These structures are
sometimes "borrowed" by kernel modules that require a place to store
statistics for similar events.

Add KPI accessor functions for statistics structures referenced by kernel
modules so that they no longer encode certain specifics of how the data
structures are named and stored.  This change is intended to make it
easier to move to per-CPU network stats following 8.0-RELEASE.

The following modules are affected by this change:

      if_bridge
      if_cxgb
      if_gif
      ip_mroute
      ipdivert
      pf

In practice, most of these statistics consumers should, in fact, maintain
their own statistics data structures rather than borrowing structures
from the base network stack.  However, that change is too agressive for
this point in the release cycle.

Reviewed by:	bz
Approved by:	re (kib)
2009-08-02 19:43:32 +00:00
Robert Watson
530c006014 Merge the remainder of kern_vimage.c and vimage.h into vnet.c and
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks.  Minor cleanups are done in the process,
and comments updated to reflect these changes.

Reviewed by:	bz
Approved by:	re (vimage blanket)
2009-08-01 19:26:27 +00:00
Julian Elischer
7973fba3a4 Somewhere along the line accept sockets stopped honoring the
FIB selected for them. Fix this.

Reviewed by:	ambrisko
Approved by:	re (kib)
MFC after:	3 days
2009-07-28 19:43:27 +00:00
Robert Watson
eddfbb763d Build on Jeff Roberson's linker-set based dynamic per-CPU allocator
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator.  Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...).  This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.

Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack.  Virtualized global variables are
tagged at compile-time, placing the in a special linker set, which is
loaded into a contiguous region of kernel memory.  Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of a the kernel linker.

Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy.  Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address.  When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.

This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.

Bump __FreeBSD_version and update UPDATING.

Portions submitted by:  bz
Reviewed by:            bz, zec
Discussed with:         gnn, jamie, jeff, jhb, julian, sam
Suggested by:           peter
Approved by:            re (kensmith)
2009-07-14 22:48:30 +00:00
Robert Watson
8c0fec805f Modify most routines returning 'struct ifaddr *' to return references
rather than pointers, requiring callers to properly dispose of those
references.  The following routines now return references:

  ifaddr_byindex
  ifa_ifwithaddr
  ifa_ifwithbroadaddr
  ifa_ifwithdstaddr
  ifa_ifwithnet
  ifaof_ifpforaddr
  ifa_ifwithroute
  ifa_ifwithroute_fib
  rt_getifa
  rt_getifa_fib
  IFP_TO_IA
  ip_rtaddr
  in6_ifawithifp
  in6ifa_ifpforlinklocal
  in6ifa_ifpwithaddr
  in6_ifadd
  carp_iamatch6
  ip6_getdstifaddr

Remove unused macro which didn't have required referencing:

  IFP_TO_IA6

This closes many small races in which changes to interface
or address lists while an ifaddr was in use could lead to use of freed
memory (etc).  In a few cases, add missing if_addr_list locking
required to safely acquire references.

Because of a lack of deep copying support, we accept a race in which
an in6_ifaddr pointed to by mbuf tags and extracted with
ip6_getdstifaddr() doesn't hold a reference while in transmit.  Once
we have mbuf tag deep copy support, this can be fixed.

Reviewed by:	bz
Obtained from:	Apple, Inc. (portions)
MFC after:	6 weeks (portions)
2009-06-23 20:19:09 +00:00
John Baldwin
6dfb8b316c Fix edge cases with ticks wrapping from INT_MAX to INT_MIN in the handling
of the per-tcpcb t_badtrxtwin.

Submitted by:	bde
2009-06-16 19:00:12 +00:00
John Baldwin
1a0e7cfc42 Trim extra ()'s.
Submitted by:	bde
2009-06-11 14:36:13 +00:00
John Baldwin
0e8cc7e748 Change a few members of tcpcb that store cached copies of ticks to be ints
instead of unsigned longs.  This fixes a few overflow edge cases on 64-bit
platforms.  Specifically, if an idle connection receives a packet shortly
before 2^31 clock ticks of uptime (about 25 days with hz=1000) and the keep
alive timer fires after 2^31 clock ticks, the keep alive timer will think
that the connection has been idle for a very long time and will immediately
drop the connection instead of sending a keep alive probe.

Reviewed by:	silby, gnn, lstewart
MFC after:	1 week
2009-06-10 18:27:15 +00:00
Robert Watson
bcf11e8d00 Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with:	pjd
2009-06-05 14:55:22 +00:00
Robert Watson
f93bfb23dc Add internal 'mac_policy_count' counter to the MAC Framework, which is a
count of the number of registered policies.

Rather than unconditionally locking sockets before passing them into MAC,
lock them in the MAC entry points only if mac_policy_count is non-zero.

This avoids locking overhead for a number of socket system calls when no
policies are registered, eliminating measurable overhead for the MAC
Framework for the socket subsystem when there are no active policies.

Possibly socket locks should be acquired by policies if they are required
for socket labels, which would further avoid locking overhead when there
are policies but they don't require labeling of sockets, or possibly
don't even implement socket controls.

Obtained from:	TrustedBSD Project
2009-06-02 18:26:17 +00:00
Zachary Loafman
81ad7eb017 Correct handling of SYN packets that are to the left of the current window of an ESTABLISHED connection.
Reviewed by:        net@, gnn
Approved by:        dfr (mentor)
2009-05-27 17:02:10 +00:00
Robert Watson
78b5071407 Update stats in struct tcpstat using two new macros, TCPSTAT_ADD() and
TCPSTAT_INC(), rather than directly manipulating the fields across the
kernel.  This will make it easier to change the implementation of
these statistics, such as using per-CPU versions of the data structures.

MFC after:	3 days
2009-04-11 22:07:19 +00:00
Kip Macy
80cb9f211a Import "flowid" support for serializing flows across transmit queues
Reviewed by:	rwatson and jeli
2009-04-10 06:16:14 +00:00
Robert Watson
ad71fe3c35 Correct a number of evolved problems with inp_vflag and inp_flags:
certain flags that should have been in inp_flags ended up in inp_vflag,
meaning that they were inconsistently locked, and in one case,
interpreted.  Move the following flags from inp_vflag to gaps in the
inp_flags space (and clean up the inp_flags constants to make gaps
more obvious to future takers):

  INP_TIMEWAIT
  INP_SOCKREF
  INP_ONESBCAST
  INP_DROPPED

Some aspects of this change have no effect on kernel ABI at all, as these
are UDP/TCP/IP-internal uses; however, netstat and sockstat detect
INP_TIMEWAIT when listing TCP sockets, so any MFC will need to take this
into account.

MFC after:      1 week (or after dependencies are MFC'd)
Reviewed by:    bz
2009-03-15 09:58:31 +00:00
Lawrence Stewart
24cb0f2232 Add TCP Appropriate Byte Counting (RFC 3465) support to kernel.
The new behaviour is on by default, and can be disabled by setting the
net.inet.tcp.rfc3465 sysctl to 0 to obtain previous behaviour.

The patch changes struct tcpcb in sys/netinet/tcp_var.h which breaks
the ABI. Bump __FreeBSD_version to 800061 accordingly. User space tools
that rely on the size of struct tcpcb (e.g. sockstat) need to be recompiled.

Reviewed by:	rpaulo, gnn
Approved by:	gnn, kmacy (mentors)
Sponsored by:	FreeBSD Foundation
2009-01-15 06:44:22 +00:00
Bjoern A. Zeeb
dcdb4371ca Use inc_flags instead of the inc_isipv6 alias which so far
had been the only flag with random usage patterns.
Switch inc_flags to be used as a real bit field by using
INC_ISIPV6 with bitops to check for the 'isipv6' condition.

While here fix a place or two where in case of v4 inc_flags
were not properly initialized before.[1]

Found by:	rwatson during review [1]
Discussed with:	rwatson
Reviewed by:	rwatson
MFC after:	4 weeks
2008-12-17 12:52:34 +00:00
Robert Watson
d15fb96522 Enhance one comment relating to recent TCP locking changes, and fix a
typo in another.

MFC after:	6 weeks
2008-12-09 15:49:02 +00:00
Robert Watson
252ca42863 Move from solely write-locking the global tcbinfo in tcp_input()
to read-locking in the TCP input path, allowing greater TCP
input parallelism where multiple ithreads or ithread and netisr
are able to run in parallel.  Previously, most TCP input paths
held a write lock on the global tcbinfo lock, effectively
serializing TCP input.

Before looking up the connection, acquire a write lock if a
potentially state-changing flag is set on the TCP segment header
(FIN, RST, SYN), and otherwise a read lock.  We may later have
to upgrade to a write lock in certain cases (ACKs received by the
syncache or during TIMEWAIT) in order to support global state
transitions, but this is never required for steady-state packets.

Upgrading from a write lock to a read lock must be done as a
trylock operation to avoid deadlocks, and actually violates the
lock order as the tcbinfo lock preceeds the inpcb lock held at
the time of upgrade.  If the trylock fails, we bump the refcount
on the inpcb, drop both locks, and re-acquire in-order.  If
another thread has freed the connection while the locks are
dropped, we free the inpcb and repeat the lookup (this should
hardly ever or never happen in practice).

For now, maintain a number of new counters measuring how many
times various cases execute, and in particular whether various
optimistic assumptions about when read locks can be used, whether
upgrades are done using the fast path, and whether connections
close in practice in the above-described race, actually occur.

MFC after:	6 weeks
Discussed with:	kmacy
Reviewed by:	bz, gnn, kmacy
Tested by:	kmacy
2008-12-08 20:27:00 +00:00
Bjoern A. Zeeb
4b79449e2f Rather than using hidden includes (with cicular dependencies),
directly include only the header files needed. This reduces the
unneeded spamming of various headers into lots of files.

For now, this leaves us with very few modules including vnet.h
and thus needing to depend on opt_route.h.

Reviewed by:	brooks, gnn, des, zec, imp
Sponsored by:	The FreeBSD Foundation
2008-12-02 21:37:28 +00:00
Marko Zec
97021c2464 Merge more of currently non-functional (i.e. resolving to
whitespace) macros from p4/vimage branch.

Do a better job at enclosing all instantiations of globals
scheduled for virtualization in #ifdef VIMAGE_GLOBALS blocks.

De-virtualize and mark as const saorder_state_alive and
saorder_state_any arrays from ipsec code, given that they are never
updated at runtime, so virtualizing them would be pointless.

Reviewed by:  bz, julian
Approved by:  julian (mentor)
Obtained from:        //depot/projects/vimage-commit2/...
X-MFC after:  never
Sponsored by: NLnet Foundation, The FreeBSD Foundation
2008-11-26 22:32:07 +00:00
Marko Zec
44e33a0758 Change the initialization methodology for global variables scheduled
for virtualization.

Instead of initializing the affected global variables at instatiation,
assign initial values to them in initializer functions.  As a rule,
initialization at instatiation for such variables should never be
introduced again from now on.  Furthermore, enclose all instantiations
of such global variables in #ifdef VIMAGE_GLOBALS blocks.

Essentialy, this change should have zero functional impact.  In the next
phase of merging network stack virtualization infrastructure from
p4/vimage branch, the new initialization methology will allow us to
switch between using global variables and their counterparts residing in
virtualization containers with minimum code churn, and in the long run
allow us to intialize multiple instances of such container structures.

Discussed at:	devsummit Strassburg
Reviewed by:	bz, julian
Approved by:	julian (mentor)
Obtained from:	//depot/projects/vimage-commit2/...
X-MFC after:	never
Sponsored by:	NLnet Foundation, The FreeBSD Foundation
2008-11-19 09:39:34 +00:00
Bjoern A. Zeeb
8e5c87f4b6 Fix typo and while here another one.
Reviewed by:	keramida
Reported by:	keramida
MFC after:	2 months (with r184720)
2008-11-06 16:30:20 +00:00
Bjoern A. Zeeb
91d6cfa6b1 Fix a bug introduced with r182851 splitting tcp_mss() into
tcp_mss() and tcp_mss_update() so that tcp_mtudisc() could
re-use the same code.

Move the TSO logic back to tcp_mss() and out of tcp_mss_update().
We tried to avoid that initially but if were are called from
tcp_output() with EMSGSIZE, we cleared the TSO flag on the tcpcb
there, called into tcp_mtudisc() and tcp_mss_update() which
then would reenable TSO on the tcpcb based on TSO capabilities
of the interface as learnt in tcp_maxmtu/6().
So if TSO was enabled on the (possibly new) outgoing interface
it was turned back on, which lead to an endless loop between
tcp_output() and tcp_mtudisc() until we overflew the stack.

Reported by:	kmacy
MFC after:	2 months (along with r182851)
2008-11-06 13:25:59 +00:00
Bjoern A. Zeeb
6f01cac68a Fix a bug introduced with r182851 splitting tcp_mss() into
tcp_mss() and tcp_mss_update() so that tcp_mtudisc() could
re-use the same code.

In case we return early and got a metricptr to pass the hostcache
info back to the caller we need to initialize the data to a defined
state (zero it) as tcp_hc_get() would do if there was no hit.
Without that the caller would check on random stack garbage which
could lead to undefined results.

This only affected tcp_mss() if there was no routing entry for the peer,
tcp_mtudisc() was not affected.

MFC after:	2 months (along with r182851)
2008-11-06 12:33:33 +00:00
Robert Watson
dd8ac7f990 In both dropwithreset paths in tcp_input.c, drop the tcbinfo lock
sooner to decomplicate locking and eliminate the need for a rather
chatty comment about why we have to handle the global lock in a
special way for the benefit of ipfw and pf cred rules.

MFC after:	3 days
2008-10-26 22:03:52 +00:00
Robert Watson
4c95fd23d6 Remove endearing but syntactically unnecessary "return;" statements
directly before the final closeing brackets of some TCP functions.

MFC after:	3 days
2008-10-26 19:33:22 +00:00
Robert Watson
6c8286e42d Don't pass curthread to sbreserve_locked() in tcp_do_segment(), as the
netisr or ithread's socket buffer size limit is not the right limit to
use.  Instead, pass NULL as the other two calls to sbreserve_locked()
in the TCP input path (tcp_mss()) do.

In practice, this is a no-op, as ithreads and the netisr run without a
process limit on socket buffer use, and a NULL thread pointer leads to
not using the process's limit, if any.  However, if tcp_input() is
called in other contexts that do have limits, this may prevent the
incorrect limit from being used.

MFC after:	3 days
2008-10-07 09:41:07 +00:00
Marko Zec
8b615593fc Step 1.5 of importing the network stack virtualization infrastructure
from the vimage project, as per plan established at devsummit 08/08:
http://wiki.freebsd.org/Image/Notes200808DevSummit

Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator
macros, and CURVNET_SET() context setting macros, all currently
resolving to NOPs.

Prepare for virtualization of selected SYSCTL objects by introducing a
family of SYSCTL_V_*() macros, currently resolving to their global
counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT().

Move selected #defines from sys/sys/vimage.h to newly introduced header
files specific to virtualized subsystems (sys/net/vnet.h,
sys/netinet/vinet.h etc.).

All the changes are verified to have zero functional impact at this
point in time by doing MD5 comparision between pre- and post-change
object files(*).

(*) netipsec/keysock.c did not validate depending on compile time options.

Implemented by:	julian, bz, brooks, zec
Reviewed by:	julian, bz, brooks, kris, rwatson, ...
Approved by:	julian (mentor)
Obtained from:	//depot/projects/vimage-commit2/...
X-MFC after:	never
Sponsored by:	NLnet Foundation, The FreeBSD Foundation
2008-10-02 15:37:58 +00:00
Robert Watson
014ea782b1 As a follow-on to r183323, correct another case where ip_output() was
called without an inpcb pointer despite holding the tcbinfo global
lock, which lead to a deadlock or panic when ipfw tried to further
acquire it recursively.

Reported by:    Stefan Ehmann <shoesoft at gmx dot net>
MFC after:      3 days
2008-09-25 17:26:54 +00:00
Robert Watson
a0ca087183 When dropping a packet and issuing a reset during TCP segment handling,
unconditionally drop the tcbinfo lock (after all, we assert it lines
before), but call tcp_dropwithreset() under both inpcb and inpcbinfo
locks only if we pass in an tcpcb.  Otherwise, if the pointer is NULL,
firewall code may later recurse the global tcbinfo lock trying to look
up an inpcb.

This is an instance where a layering violation leads not only
potentially to code reentrace and recursion, but also to lock
recursion, and was revealed by the conversion to rwlocks because
acquiring a read lock on an rwlock already held with a write lock is
forbidden.  When these locks were mutexes, they simply recursed.

Reported by:	Stefan Ehmann <shoesoft at gmx dot net>
MFC after:	3 days
2008-09-24 11:07:03 +00:00
Bjoern A. Zeeb
c10eb6d10a Work around an integer division resulting in 0 and thus the
congestion window not being incremented, if cwnd > maxseg^2.
As suggested in RFC2581 increment the cwnd by 1 in this case.

See http://caia.swin.edu.au/reports/080829A/CAIA-TR-080829A.pdf
for more details.

Submitted by:	Alana Huebner, Lawrence Stewart,
		Grenville Armitage (caia.swin.edu.au)
Reviewed by:	dwmalone, gnn, rpaulo
MFC After:	3 days
2008-09-09 07:35:21 +00:00
Bjoern A. Zeeb
3cee92e074 Split tcp_mss() in tcp_mss() and tcp_mss_update() where the former
calls the latter.

Merge tcp_mss_update() with code from tcp_mtudisc() basically
doing the same thing.

This gives us one central place where we calcuate and check mss values
to update t_maxopd (maximum mss + options length) instead of two slightly
different but almost equal implementations to maintain.

PR:		kern/118455
Reviewed by:	silby (back in March)
MFC after:	2 months
2008-09-07 18:50:25 +00:00
Julian Elischer
ac957cd271 A bunch of formatting fixes brough to light by, or created by the Vimage commit
a few days ago.
2008-08-20 01:05:56 +00:00
Bjoern A. Zeeb
603724d3ab Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).

This is the first in a series of commits over the course
of the next few weeks.

Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.

We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.

Obtained from:	//depot/projects/vimage-commit2/...
Reviewed by:	brooks, des, ed, mav, julian,
		jamie, kris, rwatson, zec, ...
		(various people I forgot, different versions)
		md5 (with a bit of help)
Sponsored by:	NLnet Foundation, The FreeBSD Foundation
X-MFC after:	never
V_Commit_Message_Reviewed_By:	more people than the patch
2008-08-17 23:27:27 +00:00
Rui Paulo
f2512ba12a MFp4 (//depot/projects/tcpecn/):
TCP ECN support. Merge of my GSoC 2006 work for NetBSD.
  TCP ECN is defined in RFC 3168.

Partly reviewed by:	dwmalone, silby
Obtained from:		NetBSD
2008-07-31 15:10:09 +00:00
Julian Elischer
8b07e49a00 Add code to allow the system to handle multiple routing tables.
This particular implementation is designed to be fully backwards compatible
and to be MFC-able to 7.x (and 6.x)

Currently the only protocol that can make use of the multiple tables is IPv4
Similar functionality exists in OpenBSD and Linux.

From my notes:

-----

  One thing where FreeBSD has been falling behind, and which by chance I
  have some time to work on is "policy based routing", which allows
  different
  packet streams to be routed by more than just the destination address.

  Constraints:
  ------------

  I want to make some form of this available in the 6.x tree
  (and by extension 7.x) , but FreeBSD in general needs it so I might as
  well do it in -current and back port the portions I need.

  One of the ways that this can be done is to have the ability to
  instantiate multiple kernel routing tables (which I will now
  refer to as "Forwarding Information Bases" or "FIBs" for political
  correctness reasons). Which FIB a particular packet uses to make
  the next hop decision can be decided by a number of mechanisms.
  The policies these mechanisms implement are the "Policies" referred
  to in "Policy based routing".

  One of the constraints I have if I try to back port this work to
  6.x is that it must be implemented as a EXTENSION to the existing
  ABIs in 6.x so that third party applications do not need to be
  recompiled in timespan of the branch.

  This first version will not have some of the bells and whistles that
  will come with later versions. It will, for example, be limited to 16
  tables in the first commit.
  Implementation method, Compatible version. (part 1)
  -------------------------------
  For this reason I have implemented a "sufficient subset" of a
  multiple routing table solution in Perforce, and back-ported it
  to 6.x. (also in Perforce though not  always caught up with what I
  have done in -current/P4). The subset allows a number of FIBs
  to be defined at compile time (8 is sufficient for my purposes in 6.x)
  and implements the changes needed to allow IPV4 to use them. I have not
  done the changes for ipv6 simply because I do not need it, and I do not
  have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it.

  Other protocol families are left untouched and should there be
  users with proprietary protocol families, they should continue to work
  and be oblivious to the existence of the extra FIBs.

  To understand how this is done, one must know that the current FIB
  code starts everything off with a single dimensional array of
  pointers to FIB head structures (One per protocol family), each of
  which in turn points to the trie of routes available to that family.

  The basic change in the ABI compatible version of the change is to
  extent that array to be a 2 dimensional array, so that
  instead of protocol family X looking at rt_tables[X] for the
  table it needs, it looks at rt_tables[Y][X] when for all
  protocol families except ipv4 Y is always 0.
  Code that is unaware of the change always just sees the first row
  of the table, which of course looks just like the one dimensional
  array that existed before.

  The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
  are all maintained, but refer only to the first row of the array,
  so that existing callers in proprietary protocols can continue to
  do the "right thing".
  Some new entry points are added, for the exclusive use of ipv4 code
  called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(),
  which have an extra argument which refers the code to the correct row.

  In addition, there are some new entry points (currently called
  rtalloc_fib() and friends) that check the Address family being
  looked up and call either rtalloc() (and friends) if the protocol
  is not IPv4 forcing the action to row 0 or to the appropriate row
  if it IS IPv4 (and that info is available). These are for calling
  from code that is not specific to any particular protocol. The way
  these are implemented would change in the non ABI preserving code
  to be added later.

  One feature of the first version of the code is that for ipv4,
  the interface routes show up automatically on all the FIBs, so
  that no matter what FIB you select you always have the basic
  direct attached hosts available to you. (rtinit() does this
  automatically).

  You CAN delete an interface route from one FIB should you want
  to but by default it's there. ARP information is also available
  in each FIB. It's assumed that the same machine would have the
  same MAC address, regardless of which FIB you are using to get
  to it.

  This brings us as to how the correct FIB is selected for an outgoing
  IPV4 packet.

  Firstly, all packets have a FIB associated with them. if nothing
  has been done to change it, it will be FIB 0. The FIB is changed
  in the following ways.

  Packets fall into one of a number of classes.

  1/ locally generated packets, coming from a socket/PCB.
     Such packets select a FIB from a number associated with the
     socket/PCB. This in turn is inherited from the process,
     but can be changed by a socket option. The process in turn
     inherits it on fork. I have written a utility call setfib
     that acts a bit like nice..

         setfib -3 ping target.example.com # will use fib 3 for ping.

     It is an obvious extension to make it a property of a jail
     but I have not done so. It can be achieved by combining the setfib and
     jail commands.

  2/ packets received on an interface for forwarding.
     By default these packets would use table 0,
     (or possibly a number settable in a sysctl(not yet)).
     but prior to routing the firewall can inspect them (see below).
     (possibly in the future you may be able to associate a FIB
     with packets received on an interface..  An ifconfig arg, but not yet.)

  3/ packets inspected by a packet classifier, which can arbitrarily
     associate a fib with it on a packet by packet basis.
     A fib assigned to a packet by a packet classifier
     (such as ipfw) would over-ride a fib associated by
     a more default source. (such as cases 1 or 2).

  4/ a tcp listen socket associated with a fib will generate
     accept sockets that are associated with that same fib.

  5/ Packets generated in response to some other packet (e.g. reset
     or icmp packets). These should use the FIB associated with the
     packet being reponded to.

  6/ Packets generated during encapsulation.
     gif, tun and other tunnel interfaces will encapsulate using the FIB
     that was in effect withthe proces that set up the tunnel.
     thus setfib 1 ifconfig gif0 [tunnel instructions]
     will set the fib for the tunnel to use to be fib 1.

  Routing messages would be associated with their
  process, and thus select one FIB or another.
  messages from the kernel would be associated with the fib they
  refer to and would only be received by a routing socket associated
  with that fib. (not yet implemented)

  In addition Netstat has been edited to be able to cope with the
  fact that the array is now 2 dimensional. (It looks in system
  memory using libkvm (!)). Old versions of netstat see only the first FIB.

  In addition two sysctls are added to give:
  a) the number of FIBs compiled in (active)
  b) the default FIB of the calling process.

  Early testing experience:
  -------------------------

  Basically our (IronPort's) appliance does this functionality already
  using ipfw fwd but that method has some drawbacks.

  For example,
  It can't fully simulate a routing table because it can't influence the
  socket's choice of local address when a connect() is done.

  Testing during the generating of these changes has been
  remarkably smooth so far. Multiple tables have co-existed
  with no notable side effects, and packets have been routes
  accordingly.

  ipfw has grown 2 new keywords:

  setfib N ip from anay to any
  count ip from any to any fib N

  In pf there seems to be a requirement to be able to give symbolic names to the
  fibs but I do not have that capacity. I am not sure if it is required.

  SCTP has interestingly enough built in support for this, called VRFs
  in Cisco parlance. it will be interesting to see how that handles it
  when it suddenly actually does something.

  Where to next:
  --------------------

  After committing the ABI compatible version and MFCing it, I'd
  like to proceed in a forward direction in -current. this will
  result in some roto-tilling in the routing code.

  Firstly: the current code's idea of having a separate tree per
  protocol family, all of the same format, and pointed to by the
  1 dimensional array is a bit silly. Especially when one considers that
  there is code that makes assumptions about every protocol having the
  same internal structures there. Some protocols don't WANT that
  sort of structure. (for example the whole idea of a netmask is foreign
  to appletalk). This needs to be made opaque to the external code.

  My suggested first change is to add routing method pointers to the
  'domain' structure, along with information pointing the data.
  instead of having an array of pointers to uniform structures,
  there would be an array pointing to the 'domain' structures
  for each protocol address domain (protocol family),
  and the methods this reached would be called. The methods would have
  an argument that gives FIB number, but the protocol would be free
  to ignore it.

  When the ABI can be changed it raises the possibilty of the
  addition of a fib entry into the "struct route". Currently,
  the structure contains the sockaddr of the desination, and the resulting
  fib entry. To make this work fully, one could add a fib number
  so that given an address and a fib, one can find the third element, the
  fib entry.

  Interaction with the ARP layer/ LL layer would need to be
  revisited as well. Qing Li has been working on this already.

  This work was sponsored by Ironport Systems/Cisco

Reviewed by:    several including rwatson, bz and mlair (parts each)
Obtained from:  Ironport systems/Cisco
2008-05-09 23:03:00 +00:00
Robert Watson
8501a69cc9 Convert pcbinfo and inpcb mutexes to rwlocks, and modify macros to
explicitly select write locking for all use of the inpcb mutex.
Update some pcbinfo lock assertions to assert locked rather than
write-locked, although in practice almost all uses of the pcbinfo
rwlock main exclusive, and all instances of inpcb lock acquisition
are exclusive.

This change should introduce (ideally) little functional change.
However, it lays the groundwork for significantly increased
parallelism in the TCP/IP code.

MFC after:	3 months
Tested by:	kris (superset of committered patch)
2008-04-17 21:38:18 +00:00
Robert Watson
7a3244ccb7 Add further TCP inpcb locking assertions to some TCP input code paths.
MFC after:	1 month
2008-04-07 12:41:45 +00:00
Bjoern A. Zeeb
c3b02504bc Some "cleanup" of tcp_mss():
- Move the assigment of the socket down before we first need it.
  No need to do it at the beginning and then drop out the function
  by one of the returns before using it 100 lines further down.
- Use t_maxopd which was assigned the "tcp_mssdflt" for the corrrect
  AF already instead of another #ifdef ? : #endif block doing the same.
- Remove an unneeded (duplicate) assignment of mss to t_maxseg just before
  we possibly change mss and re-do the assignment without using t_maxseg
  in between.

Reviewed by:	silby
No objections:	net@ (silence)
MFC after:	5 days
2008-03-02 08:40:47 +00:00
Bjoern A. Zeeb
af92e6cf95 Fix indentation (whitespace changes only).
MFC after:	6 days
2008-03-01 22:27:15 +00:00
Robert Watson
30d239bc4c Merge first in a series of TrustedBSD MAC Framework KPI changes
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:

  mac_<object>_<method/action>
  mac_<object>_check_<method/action>

The previous naming scheme was inconsistent and mostly
reversed from the new scheme.  Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier.  Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods.  Also simplify, slightly,
some entry point names.

All MAC policy modules will need to be recompiled, and modules
not updates as part of this commit will need to be modified to
conform to the new KPI.

Sponsored by:	SPARTA (original patches against Mac OS X)
Obtained from:	TrustedBSD Project, Apple Computer
2007-10-24 19:04:04 +00:00
Mike Silbersack
4b421e2daa Add FBSDID to all files in netinet so that people can more
easily include file version information in bug reports.

Approved by:	re (kensmith)
2007-10-07 20:44:24 +00:00
Mike Silbersack
e31d8aa3da Improve the debugging message:
TCP: [X.X.X.X]:X to [X.X.X.X]:X tcpflags 0x18<PUSH,ACK>; tcp_do_segment: FIN_WAIT_2: Received data after socket was closed, sending RST and removing tcpcb

So that it also includes how many bytes of data were received.  It now looks
like this:

TCP: [X.X.X.X]:X to [X.X.X.X]:X tcpflags 0x18<PUSH,ACK>; tcp_do_segment: FIN_WAIT_2: Received X bytes of data after socket was closed, sending RST and removing tcpcb

Approved by:	re (gnn)
2007-10-07 00:07:27 +00:00
Ken Smith
a258946554 Make sure that either inp is NULL or we have obtained a lock on it before
jumping to dropunlock to avoid a panic.  While here move the calls to
ipsec4_in_reject() and ipsec6_in_reject() so they are after we obtain
the lock on inp.

Original patch to avoid panic:	pjd
Review of locking adjustments:	gnn, sam
Approved by:			re (rwatson)
2007-09-10 14:49:32 +00:00
Dag-Erling Smørgrav
218cbbea9a Make tcpstates[] static, and make sure TCPSTATES is defined before
<netinet/tcp_fsm.h> is included into any compilation unit that needs
tcpstates[].  Also remove incorrect extern declarations and TCPDEBUG
conditionals.  This allows kernels both with and without TCPDEBUG to
build, and unbreaks the tinderbox.

Approved by:	re (rwatson)
2007-07-30 11:06:42 +00:00
Matt Jacob
24face5416 Fix compilation problems- tcpstates is only available if TCPDEBUG
is set.

Approved by:	re (in spirit)
2007-07-29 01:31:33 +00:00
Andre Oppermann
773673c133 Provide a sysctl to toggle reporting of TCP debug logging:
sys.net.inet.tcp.log_debug = 1

It defaults to enabled for the moment and is to be turned off for
the next release like other diagnostics from development branches.

It is important to note that sysctl sys.net.inet.tcp.log_in_vain
uses the same logging function as log_debug.  Enabling of the former
also causes the latter to engage, but not vice versa.

Use consistent terminology in tcp log messages:

 "ignored" means a segment contains invalid flags/information and
   is dropped without changing state or issuing a reply.

 "rejected" means a segments contains invalid flags/information but
   is causing a reply (usually RST) and may cause a state change.

Approved by:	re (rwatson)
2007-07-28 12:20:39 +00:00
Andre Oppermann
19bc77c549 o Move all detailed checks for RST in LISTEN state from tcp_input() to
syncache_rst().
o Fix tests for flag combinations of RST and SYN, ACK, FIN.  Before
  a RST for a connection in syncache did not properly free the entry.
o Add more detailed logging.

Approved by:	re (rwatson)
2007-07-28 11:51:44 +00:00