inp_socket->so_options dereference when we may not acquire the lock on
the inpcb.
This fixes the crash due to NULL pointer dereference in
in_pcbbind_setup() when inp_socket->so_options in a pcb returned by
in_pcblookup_local() was checked.
Reported by: dave jones <s.dave.jones@gmail.com>, Arnaud Lacombe <lacombar@gmail.com>
Suggested by: rwatson
Glanced by: rwatson
Tested by: dave jones <s.dave.jones@gmail.com>
- TCP keep* timers
- TCP UTO (adjust from what was there already)
- netmap
- route caching
- user cookie (temporary to allow for the real fix)
Slightly re-shuffle struct ifnet moving fields out of the middle
of spares and to better align.
Discussed with: rwatson (slightly earlier version)
Rather than including lock.h in in_pcbgroup.c in right order, fix it
for all consumers of in_pcb.h by further header file pollution under
#ifdef KERNEL.
Reported by: Pan Tsu (inyaoo gmail.com)
struct inpcbgroup. pcbgroups, or "connection groups", supplement the
existing inpcbinfo connection hash table, which when pcbgroups are
enabled, might now be thought of more usefully as a per-protocol
4-tuple reservation table.
Connections are assigned to connection groups base on a hash of their
4-tuple; wildcard sockets require special handling, and are members
of all connection groups. During a connection lookup, a
per-connection group lock is employed rather than the global pcbinfo
lock. By aligning connection groups with input path processing,
connection groups take on an effective CPU affinity, especially when
aligned with RSS work placement (see a forthcoming commit for
details). This eliminates cache line migration associated with
global, protocol-layer data structures in steady state TCP and UDP
processing (with the exception of protocol-layer statistics; further
commit to follow).
Elements of this approach were inspired by Willman, Rixner, and Cox's
2006 USENIX paper, "An Evaluation of Network Stack Parallelization
Strategies in Modern Operating Systems". However, there are also
significant differences: we maintain the inpcb lock, rather than using
the connection group lock for per-connection state.
Likewise, the focus of this implementation is alignment with NIC
packet distribution strategies such as RSS, rather than pure software
strategies. Despite that focus, software distribution is supported
through the parallel netisr implementation, and works well in
configurations where the number of hardware threads is greater than
the number of NIC input queues, such as in the RMI XLR threaded MIPS
architecture.
Another important difference is the continued maintenance of existing
hash tables as "reservation tables" -- these are useful both to
distinguish the resource allocation aspect of protocol name management
and the more common-case lookup aspect. In configurations where
connection tables are aligned with hardware hashes, it is desirable to
use the traditional lookup tables for loopback or encapsulated traffic
rather than take the expense of hardware hashes that are hard to
implement efficiently in software (such as RSS Toeplitz).
Connection group support is enabled by compiling "options PCBGROUP"
into your kernel configuration; for the time being, this is an
experimental feature, and hence is not enabled by default.
Subject to the limited MFCability of change dependencies in inpcb,
and its change to the inpcbinfo init function signature, this change
in principle could be merged to FreeBSD 8.x.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
hash install, etc. For now, these are arguments are unused, but as we add
RSS support, we will want to use hashes extracted from mbufs, rather than
manually calculated hashes of header fields, due to the expensive of the
software version of Toeplitz (and similar hashes).
Add notes that it would be nice to be able to pass mbufs into lookup
routines in pf(4), optimising firewall lookup in the same way, but the
code structure there doesn't facilitate that currently.
(In principle there is no reason this couldn't be MFCed -- the change
extends rather than modifies the KBI. However, it won't be useful without
other previous possibly less MFCable changes.)
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).
- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.
Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.
A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags are,
supplementing the existing INPLOOKUP_WILDCARD flag:
INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb
Callers must pass exactly one of these flags (for the time being).
Some notes:
- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
removed.
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).
This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.
Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
reworking of inpcbinfo locking:
(1) Convert inpcb reference counting from manually manipulated integers to
the refcount(9) KPI. This allows the refcount to be managed atomically
with an inpcb read lock rather than write lock, or even with no inpcb
lock at all. As a result, in_pcbref() also no longer requires an inpcb
lock, so can be performed solely using the lock used to look up an
inpcb.
(2) Shift more inpcb freeing activity from the in_pcbrele() context (via
in_pcbfree_internal) to the explicit in_pcbfree() context. This means
that the inpcb refcount is increasingly used only to maintain memory
stability, not actually defer the clean up of inpcb protocol parts.
This is desirable as many of those protocol parts required the pcbinfo
lock, which we'd like not to acquire in in_pcbrele() contexts. Document
this in comments better.
(3) Introduce new read-locked and write-locked in_pcbrele() variations,
in_pcbrele_rlocked() and in_pcbrele_wlocked(), which allow the inpcb to
be properly unlocked as needed. in_pcbrele() is a wrapper around the
latter, and should probably go away at some point. This makes it
easier to use this weak reference model when holding only a read lock,
as will happen in the future.
This may well be safe to MFC, but some more KBI analysis is required.
Reviewed by: bz
MFC after: 3 weeks
Sponsored by: Juniper Networks, Inc.
(1) Add a locking guide for inpcbinfo.
(2) Annotate inpcbinfo fields with synchronisation information; not all
annotations are 100% satisfactory.
(3) Reorder inpcbinfo fields so that the lock is at the head of the
structure, and close to fields it protects.
(4) Sort fields that will eventually be hashlock/pcbgroup-related together
even though they remain locked by ipi_lock for now.
Reviewed by: bz
Sponsored by: Juniper Networks
X-MFC after: KBI analysis required
Move the ipport_tick_callout and related functions from ip_input.c
to in_pcb.c. The random source port allocation code has been merged
and is now local to in_pcb.c only.
Use a SYSINIT to get the callout started and no longer depend on
initialization from the inet code, which would not work in an IPv6
only setup.
Reviewed by: gnn
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
MFC after: 4 days
their calling contexts in {IP divert, raw IP sockets, TCP, UDP} and
create new helper functions: in_pcbinfo_init() and in_pcbinfo_destroy()
to do this work in a central spot. As inpcbinfo becomes more complex
due to ongoing work to add connection groups, this will reduce code
duplication.
MFC after: 1 month
Reviewed by: bz
Sponsored by: Juniper Networks
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator. Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...). This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.
Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack. Virtualized global variables are
tagged at compile-time, placing the in a special linker set, which is
loaded into a contiguous region of kernel memory. Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of a the kernel linker.
Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy. Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address. When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.
This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.
Bump __FreeBSD_version and update UPDATING.
Portions submitted by: bz
Reviewed by: bz, zec
Discussed with: gnn, jamie, jeff, jhb, julian, sam
Suggested by: peter
Approved by: re (kensmith)
with OpenBSD (and BSD/OS originally). We can't easly do it SOL_SOCKET option
as there is no more space for more SOL_SOCKET options, but this option also
fits better as an IP socket option, it seems.
- Implement this functionality also for IPv6 and RAW IP sockets.
- Always compile it in (don't use additional kernel options).
- Remove sysctl to turn this functionality on and off.
- Introduce new privilege - PRIV_NETINET_BINDANY, which allows to use this
functionality (currently only unjail root can use it).
Discussed with: julian, adrian, jhb, rwatson, kmacy
active network stack instance. Turning on options VIMAGE at compile
time yields the following changes relative to default kernel build:
1) V_ accessor macros for virtualized variables resolve to structure
fields via base pointers, instead of being resolved as fields in global
structs or plain global variables. As an example, V_ifnet becomes:
options VIMAGE: ((struct vnet_net *) vnet_net)->_ifnet
default build: vnet_net_0._ifnet
options VIMAGE_GLOBALS: ifnet
2) INIT_VNET_* macros will declare and set up base pointers to be used
by V_ accessor macros, instead of resolving to whitespace:
INIT_VNET_NET(ifp->if_vnet); becomes
struct vnet_net *vnet_net = (ifp->if_vnet)->mod_data[VNET_MOD_NET];
3) Memory for vnet modules registered via vnet_mod_register() is now
allocated at run time in sys/kern/kern_vimage.c, instead of per vnet
module structs being declared as globals. If required, vnet modules
can now request the framework to provide them with allocated bzeroed
memory by filling in the vmi_size field in their vmi_modinfo structures.
4) structs socket, ifnet, inpcbinfo, tcpcb and syncache_head are
extended to hold a pointer to the parent vnet. options VIMAGE builds
will fill in those fields as required.
5) curvnet is introduced as a new global variable in options VIMAGE
builds, always pointing to the default and only struct vnet.
6) struct sysctl_oid has been extended with additional two fields to
store major and minor virtualization module identifiers, oid_v_subs and
oid_v_mod. SYSCTL_V_* family of macros will fill in those fields
accordingly, and store the offset in the appropriate vnet container
struct in oid_arg1.
In sysctl handlers dealing with virtualized sysctls, the
SYSCTL_RESOLVE_V_ARG1() macro will compute the address of the target
variable and make it available in arg1 variable for further processing.
Unused fields in structs vnet_inet, vnet_inet6 and vnet_ipfw have
been deleted.
Reviewed by: bz, rwatson
Approved by: julian (mentor)
certain flags that should have been in inp_flags ended up in inp_vflag,
meaning that they were inconsistently locked, and in one case,
interpreted. Move the following flags from inp_vflag to gaps in the
inp_flags space (and clean up the inp_flags constants to make gaps
more obvious to future takers):
INP_TIMEWAIT
INP_SOCKREF
INP_ONESBCAST
INP_DROPPED
Some aspects of this change have no effect on kernel ABI at all, as these
are UDP/TCP/IP-internal uses; however, netstat and sockstat detect
INP_TIMEWAIT when listing TCP sockets, so any MFC will need to take this
into account.
MFC after: 1 week (or after dependencies are MFC'd)
Reviewed by: bz
or not the inpcb is currenty on various hash lookup lists, rather
than using (lport != 0) to detect this. This means that the full
4-tuple of a connection can be retained after close, which should
lead to more sensible netstat output in the window between TCP
close and socket close.
MFC after: 2 weeks
in6p_ip6_nxt
in6p_vflag
in6p_flags
in6p_socket
in6p_lport
in6p_fport
in6p_ppcb
Remove unused v6 macro aliases for inpcb flags:
IN6P_HIGHPORT
IN6P_LOWPORT
IN6P_ANONPORT
IN6P_RECVIF
IN6P_MTUDISC
IN6P_FAITH
IN6P_CONTROLOPTS
References to in6p_lport and in6_fport in sockstat are also replaced with
normal inp_lport and inp_fport references.
MFC after: 3 days
Reviewed by: bz
applications to specify a non-local IP address when bind()'ing a socket
to a local endpoint.
This allows applications to spoof the client IP address of connections
if (obviously!) they somehow are able to receive the traffic normally
destined to said clients.
This patch doesn't include any changes to ipfw or the bridging code to
redirect the client traffic through the PCB checks so TCP gets a shot
at it. The normal behaviour is that packets with a non-local destination
IP address are not handled locally. This can be dealth with some IPFW hackery;
modifications to IPFW to make this less hacky will occur in subsequent
commmits.
Thanks to Julian Elischer and others at Ironport. This work was approved
and donated before Cisco acquired them.
Obtained from: Julian Elischer and others
MFC after: 2 weeks
had been the only flag with random usage patterns.
Switch inc_flags to be used as a real bit field by using
INC_ISIPV6 with bitops to check for the 'isipv6' condition.
While here fix a place or two where in case of v4 inc_flags
were not properly initialized before.[1]
Found by: rwatson during review [1]
Discussed with: rwatson
Reviewed by: rwatson
MFC after: 4 weeks
missed under VIMAGE_GLOBAL.
Start putting the extern declarations of the virtualized globals
under VIMAGE_GLOBAL as the globals themsevles are already.
This will help by the time when we are going to remove the globals
entirely.
While there garbage collect a few dead externs from ip6_var.h.
Sponsored by: The FreeBSD Foundation
incremented using in_pcbref(), and decremented using in_pcbfree()
or inpcbrele(). Protocols using only current in_pcballoc() and
in_pcbfree() calls will see the same semantics, but it is now
possible for TCP to call in_pcbref() and in_pcbrele() to prevent
an inpcb from being freed when both tcbinfo and per-inpcb locks
are released. This makes it possible to safely transition from
holding only the inpcb lock to both tcbinfo and inpcb lock
without re-looking up a connection in the input path, timer
path, etc.
Notice that in_pcbrele() does not unlock the connection after
decrementing the refcount, if the connection remains, so that
the caller can continue to use it; in_pcbrele() returns a flag
indicating whether or not the inpcb pointer is still valid, and
in_pcbfee() is now a simple wrapper around in_pcbrele().
MFC after: 1 month
Discussed with: bz, kmacy
Reviewed by: bz, gnn, kmacy
Tested by: kmacy
for virtualization.
Instead of initializing the affected global variables at instatiation,
assign initial values to them in initializer functions. As a rule,
initialization at instatiation for such variables should never be
introduced again from now on. Furthermore, enclose all instantiations
of such global variables in #ifdef VIMAGE_GLOBALS blocks.
Essentialy, this change should have zero functional impact. In the next
phase of merging network stack virtualization infrastructure from
p4/vimage branch, the new initialization methology will allow us to
switch between using global variables and their counterparts residing in
virtualization containers with minimum code churn, and in the long run
allow us to intialize multiple instances of such container structures.
Discussed at: devsummit Strassburg
Reviewed by: bz, julian
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation
ephemeral port allocation as implemented in netinet/in_pcb.c rev. 1.143
(initially from OpenBSD) and follow-up commits during the last four and
a half years including rev. 1.157, 1.162 and 1.199.
This now is relying on the same infrastructure as has been implemented
in in_pcb.c since rev. 1.199.
Reviewed by: silby, rpaulo, mlaier
MFC after: 2 months
This means that inp_cred is always there, even after the socket
has gone away. It also means that it is constant for the lifetime
of the inp.
Both facts lead to simpler code and possibly less locking.
Suggested by: rwatson
Reviewed by: rwatson
MFC after: 6 weeks
X-MFC Note: use a inp_pspare for inp_cred
- removing 'const' qualifier from an input parameter to conform to the type
required by rw_assert();
- using in_addr->s_addr to retrive 32 bits address value.
Observed by: tinderbox
completes the move to a fully parallel UDP transmit path by using
global read, rather than write, locking of inpcbinfo in further
semi-connected cases:
- Add macros to allow try-locking of inpcb and inpcbinfo.
- Always acquire an incpcb read lock in udp_output(), which stablizes the
local inpcb address and port bindings in order to determine what further
locking is required:
- If the inpcb is currently not bound (at all) and are implicitly
connecting, we require inpcbinfo and inpcb write locks, so drop the
read lock and re-acquire.
- If the inpcb is bound for at least one of the port or address, but an
explicit source or destination is requested, trylock the inpcbinfo
lock, and if that fails, drop the inpcb lock, lock the global lock,
and relock the inpcb lock.
- Otherwise, no further locking is required (common case).
- Update comments.
In practice, this means that the vast majority of consumers of UDP sockets
will not acquire any exclusive locks at the socket or UDP levels of the
network stack. This leads to a marked performance improvement in several
important workloads, including BIND, nsd, and memcached over UDP, as well
as significant improvements in pps microbenchmarks.
The plan is to MFC all of the rwlock changes to RELENG_7 once they have
settled for a weeks in the tree.
Tested by: ps, kris (older revision), bde
MFC after: 3 weeks
This particular implementation is designed to be fully backwards compatible
and to be MFC-able to 7.x (and 6.x)
Currently the only protocol that can make use of the multiple tables is IPv4
Similar functionality exists in OpenBSD and Linux.
From my notes:
-----
One thing where FreeBSD has been falling behind, and which by chance I
have some time to work on is "policy based routing", which allows
different
packet streams to be routed by more than just the destination address.
Constraints:
------------
I want to make some form of this available in the 6.x tree
(and by extension 7.x) , but FreeBSD in general needs it so I might as
well do it in -current and back port the portions I need.
One of the ways that this can be done is to have the ability to
instantiate multiple kernel routing tables (which I will now
refer to as "Forwarding Information Bases" or "FIBs" for political
correctness reasons). Which FIB a particular packet uses to make
the next hop decision can be decided by a number of mechanisms.
The policies these mechanisms implement are the "Policies" referred
to in "Policy based routing".
One of the constraints I have if I try to back port this work to
6.x is that it must be implemented as a EXTENSION to the existing
ABIs in 6.x so that third party applications do not need to be
recompiled in timespan of the branch.
This first version will not have some of the bells and whistles that
will come with later versions. It will, for example, be limited to 16
tables in the first commit.
Implementation method, Compatible version. (part 1)
-------------------------------
For this reason I have implemented a "sufficient subset" of a
multiple routing table solution in Perforce, and back-ported it
to 6.x. (also in Perforce though not always caught up with what I
have done in -current/P4). The subset allows a number of FIBs
to be defined at compile time (8 is sufficient for my purposes in 6.x)
and implements the changes needed to allow IPV4 to use them. I have not
done the changes for ipv6 simply because I do not need it, and I do not
have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it.
Other protocol families are left untouched and should there be
users with proprietary protocol families, they should continue to work
and be oblivious to the existence of the extra FIBs.
To understand how this is done, one must know that the current FIB
code starts everything off with a single dimensional array of
pointers to FIB head structures (One per protocol family), each of
which in turn points to the trie of routes available to that family.
The basic change in the ABI compatible version of the change is to
extent that array to be a 2 dimensional array, so that
instead of protocol family X looking at rt_tables[X] for the
table it needs, it looks at rt_tables[Y][X] when for all
protocol families except ipv4 Y is always 0.
Code that is unaware of the change always just sees the first row
of the table, which of course looks just like the one dimensional
array that existed before.
The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
are all maintained, but refer only to the first row of the array,
so that existing callers in proprietary protocols can continue to
do the "right thing".
Some new entry points are added, for the exclusive use of ipv4 code
called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(),
which have an extra argument which refers the code to the correct row.
In addition, there are some new entry points (currently called
rtalloc_fib() and friends) that check the Address family being
looked up and call either rtalloc() (and friends) if the protocol
is not IPv4 forcing the action to row 0 or to the appropriate row
if it IS IPv4 (and that info is available). These are for calling
from code that is not specific to any particular protocol. The way
these are implemented would change in the non ABI preserving code
to be added later.
One feature of the first version of the code is that for ipv4,
the interface routes show up automatically on all the FIBs, so
that no matter what FIB you select you always have the basic
direct attached hosts available to you. (rtinit() does this
automatically).
You CAN delete an interface route from one FIB should you want
to but by default it's there. ARP information is also available
in each FIB. It's assumed that the same machine would have the
same MAC address, regardless of which FIB you are using to get
to it.
This brings us as to how the correct FIB is selected for an outgoing
IPV4 packet.
Firstly, all packets have a FIB associated with them. if nothing
has been done to change it, it will be FIB 0. The FIB is changed
in the following ways.
Packets fall into one of a number of classes.
1/ locally generated packets, coming from a socket/PCB.
Such packets select a FIB from a number associated with the
socket/PCB. This in turn is inherited from the process,
but can be changed by a socket option. The process in turn
inherits it on fork. I have written a utility call setfib
that acts a bit like nice..
setfib -3 ping target.example.com # will use fib 3 for ping.
It is an obvious extension to make it a property of a jail
but I have not done so. It can be achieved by combining the setfib and
jail commands.
2/ packets received on an interface for forwarding.
By default these packets would use table 0,
(or possibly a number settable in a sysctl(not yet)).
but prior to routing the firewall can inspect them (see below).
(possibly in the future you may be able to associate a FIB
with packets received on an interface.. An ifconfig arg, but not yet.)
3/ packets inspected by a packet classifier, which can arbitrarily
associate a fib with it on a packet by packet basis.
A fib assigned to a packet by a packet classifier
(such as ipfw) would over-ride a fib associated by
a more default source. (such as cases 1 or 2).
4/ a tcp listen socket associated with a fib will generate
accept sockets that are associated with that same fib.
5/ Packets generated in response to some other packet (e.g. reset
or icmp packets). These should use the FIB associated with the
packet being reponded to.
6/ Packets generated during encapsulation.
gif, tun and other tunnel interfaces will encapsulate using the FIB
that was in effect withthe proces that set up the tunnel.
thus setfib 1 ifconfig gif0 [tunnel instructions]
will set the fib for the tunnel to use to be fib 1.
Routing messages would be associated with their
process, and thus select one FIB or another.
messages from the kernel would be associated with the fib they
refer to and would only be received by a routing socket associated
with that fib. (not yet implemented)
In addition Netstat has been edited to be able to cope with the
fact that the array is now 2 dimensional. (It looks in system
memory using libkvm (!)). Old versions of netstat see only the first FIB.
In addition two sysctls are added to give:
a) the number of FIBs compiled in (active)
b) the default FIB of the calling process.
Early testing experience:
-------------------------
Basically our (IronPort's) appliance does this functionality already
using ipfw fwd but that method has some drawbacks.
For example,
It can't fully simulate a routing table because it can't influence the
socket's choice of local address when a connect() is done.
Testing during the generating of these changes has been
remarkably smooth so far. Multiple tables have co-existed
with no notable side effects, and packets have been routes
accordingly.
ipfw has grown 2 new keywords:
setfib N ip from anay to any
count ip from any to any fib N
In pf there seems to be a requirement to be able to give symbolic names to the
fibs but I do not have that capacity. I am not sure if it is required.
SCTP has interestingly enough built in support for this, called VRFs
in Cisco parlance. it will be interesting to see how that handles it
when it suddenly actually does something.
Where to next:
--------------------
After committing the ABI compatible version and MFCing it, I'd
like to proceed in a forward direction in -current. this will
result in some roto-tilling in the routing code.
Firstly: the current code's idea of having a separate tree per
protocol family, all of the same format, and pointed to by the
1 dimensional array is a bit silly. Especially when one considers that
there is code that makes assumptions about every protocol having the
same internal structures there. Some protocols don't WANT that
sort of structure. (for example the whole idea of a netmask is foreign
to appletalk). This needs to be made opaque to the external code.
My suggested first change is to add routing method pointers to the
'domain' structure, along with information pointing the data.
instead of having an array of pointers to uniform structures,
there would be an array pointing to the 'domain' structures
for each protocol address domain (protocol family),
and the methods this reached would be called. The methods would have
an argument that gives FIB number, but the protocol would be free
to ignore it.
When the ABI can be changed it raises the possibilty of the
addition of a fib entry into the "struct route". Currently,
the structure contains the sockaddr of the desination, and the resulting
fib entry. To make this work fully, one could add a fib number
so that given an address and a fib, one can find the third element, the
fib entry.
Interaction with the ARP layer/ LL layer would need to be
revisited as well. Qing Li has been working on this already.
This work was sponsored by Ironport Systems/Cisco
Reviewed by: several including rwatson, bz and mlair (parts each)
Obtained from: Ironport systems/Cisco