955 Commits

Author SHA1 Message Date
rwatson
70b6a8119c Remove IFF_NEEDSGIANT, a compatibility infrastructure introduced
in FreeBSD 5.x to allow network device drivers to run with Giant
despite the network stack being Giant-free.  This significantly
simplifies calls into ioctl() on network interfaces, especially
in the multicast code, as well as eliminates deferred invocation
of interface if_start routines.

Disable the build on device drivers still depending on
IFF_NEEDSGIANT as they no longer compile.  They will be removed
in a few weeks if they haven't been made MPSAFE in that time.
Disabled drivers:

        if_ar
        if_axe
        if_aue
        if_cdce
        if_cue
        if_kue
        if_ray
        if_rue
        if_rum
        if_sr
        if_udav
        if_ural
        if_zyd

Drivers that were already disabled because of tty changes:

        if_ppp
        if_sl

Discussed on:	arch@
2009-03-15 14:21:05 +00:00
rwatson
038bfe209e Correct a number of evolved problems with inp_vflag and inp_flags:
certain flags that should have been in inp_flags ended up in inp_vflag,
meaning that they were inconsistently locked, and in one case,
interpreted.  Move the following flags from inp_vflag to gaps in the
inp_flags space (and clean up the inp_flags constants to make gaps
more obvious to future takers):

  INP_TIMEWAIT
  INP_SOCKREF
  INP_ONESBCAST
  INP_DROPPED

Some aspects of this change have no effect on kernel ABI at all, as these
are UDP/TCP/IP-internal uses; however, netstat and sockstat detect
INP_TIMEWAIT when listing TCP sockets, so any MFC will need to take this
into account.

MFC after:      1 week (or after dependencies are MFC'd)
Reviewed by:    bz
2009-03-15 09:58:31 +00:00
marius
74f63d4ce1 On architectures with strict alignment requirements compensate
the misalignment of the IP header that prepending the EtherIP
header might have caused.

PR:		131921
MFC after:	1 week
2009-03-07 19:08:58 +00:00
bz
59d53a5bdb Start removing IPv6 Type 0 Routing header code.
RH0 was deprecated by RFC 5095.

While most of the code had been disabled by #if 0 already, leave a
bit of infrastructure for possible RH2 code and a log message under
BURN_BRIDGES in case a user still tries to send RH0 packets.

Reviewed by:	gnn (a bit back, earlier version)
2009-03-03 13:12:12 +00:00
bz
4321e2a8f4 Add size-guards evaluated at compile-time to the main struct vnet_*
which are not in a module of their own like gif.

Single kernel compiles and universe will fail if the size of the struct
changes. The expected values are given in sys/vimage.h.
See the comments there on how to handle this.

Requested by:	peter
2009-03-01 11:01:00 +00:00
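
A compile-time size guard of the kind described in 4321e2a8f4 could look roughly like the sketch below. The names `vnet_example` and `SIZEOF_vnet_example` are hypothetical stand-ins, not the ones used in sys/vimage.h, and the sketch assumes C11 `static_assert` rather than whatever assertion macro the kernel itself uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container struct; the real ones are the struct vnet_* types. */
struct vnet_example {
	uint32_t	_flags;
	uint32_t	_count;
	uint64_t	_generation;
};

/* Expected size, recorded by hand so that any change is deliberate. */
#define	SIZEOF_vnet_example	16

/* Fails the build (not the boot) if the struct size drifts. */
static_assert(sizeof(struct vnet_example) == SIZEOF_vnet_example,
    "struct vnet_example changed size; update SIZEOF_vnet_example");

/* Helper so the guarded size can also be inspected at run time. */
size_t
vnet_example_size(void)
{
	return (sizeof(struct vnet_example));
}
```

Adding a member to `struct vnet_example` without updating the constant makes the translation unit fail to compile, which is the behaviour the commit message describes for single kernel and universe builds.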
bz
df2be82cec For all files including net/vnet.h directly include opt_route.h and
net/route.h.

Remove the hidden include of opt_route.h and net/route.h from net/vnet.h.

We need to make sure that both opt_route.h and net/route.h are included
before net/vnet.h because of the way MRT figures out the number of FIBs
from the kernel option. If we do not, we end up with the default number
of 1 when including net/vnet.h and array sizes are wrong.

This does not change the list of files which depend on opt_route.h
but we can identify them now more easily.
2009-02-27 14:12:05 +00:00
bz
710220924b Shuffle the vimage.h includes or add where missing. 2009-02-27 13:22:26 +00:00
rwatson
52c114d8ab Assert the radix head lock in in6_rtqkill().
MFC after:	3 days
2009-02-23 22:58:59 +00:00
bz
8d30abae87 Try to remove/assimilate as much of formerly IPv4/6 specific
(duplicate) code in sys/netipsec/ipsec.c and fold it into
common, INET/6 independent functions.

The file local functions ipsec4_setspidx_inpcb() and
ipsec6_setspidx_inpcb() were 1:1 identical after the change
in r186528. Rename to ipsec_setspidx_inpcb() and remove the
duplicate.

Public functions ipsec[46]_get_policy() were 1:1 identical.
Remove one copy and merge in the factored out code from
ipsec_get_policy() into the other. The public function left
is now called ipsec_get_policy() and callers were adapted.

Public functions ipsec[46]_set_policy() were 1:1 identical.
Rename file local ipsec_set_policy() function to
ipsec_set_policy_internal().
Remove one copy of the public functions, rename the other
to ipsec_set_policy() and adapt callers.

Public functions ipsec[46]_hdrsiz() were logically identical
(ignoring one questionable assert in the v6 version).
Rename the file local ipsec_hdrsiz() to ipsec_hdrsiz_internal(),
the public function to ipsec_hdrsiz(), remove the duplicate
copy and adapt the callers.
The v6 version had been unused anyway. Cleanup comments.

Public functions ipsec[46]_in_reject() were logically identical
apart from statistics. Move the common code into a file local
ipsec46_in_reject() leaving vimage+statistics in small AF specific
wrapper functions. Note: unfortunately we already have a public
ipsec_in_reject().

Reviewed by:	sam
Discussed with:	rwatson (renaming to *_internal)
MFC after:	26 days
X-MFC:		keep wrapper functions for public symbols?
2009-02-08 09:27:07 +00:00
jamie
aac9010144 Don't bother null-checking the thread pointer before the prison checks
in udp6_connect (td is already dereferenced elsewhere without such a
check).  This makes the conversion from a sockaddr to a sockaddr_in6
always happen, so convert once at the beginning of the function rather
than twice in the middle.

Approved by:	bz (mentor)
2009-02-05 15:04:23 +00:00
jamie
bbcda547da Remove redundant calls of prison_local_ip4 in in_pcbbind_setup, and of
prison_local_ip6 in in6_pcbbind.

Approved by:	bz (mentor)
2009-02-05 14:25:53 +00:00
jamie
12bbe1869f Standardize the various prison_foo_ip[46] functions and prison_if to
return zero on success and an error code otherwise.  The possible errors
are EADDRNOTAVAIL if an address being checked for doesn't match the
prison, and EAFNOSUPPORT if the prison doesn't have any addresses in
that address family.  For most callers of these functions, use the
returned error code instead of e.g. a hard-coded EADDRNOTAVAIL or
EINVAL.

Always include a jailed() check in these functions, where a non-jailed
cred always returns success (and makes no changes).  Remove the explicit
jailed() checks that preceded many of the function calls.

Approved by:	bz (mentor)
2009-02-05 14:06:09 +00:00
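
The return convention described in 12bbe1869f can be illustrated with a small userland sketch. Everything prefixed `toy_` is hypothetical and heavily simplified; the real functions operate on kernel credentials and prison structures, and only the error codes and the "non-jailed always succeeds" rule are taken from the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a jail's IPv4 address set. */
struct toy_prison {
	const uint32_t	*ip4;	/* addresses assigned to the prison */
	size_t		 ip4s;	/* number of addresses; 0 means no IPv4 */
};

/* Hypothetical credential: pr == NULL means "not jailed". */
struct toy_cred {
	const struct toy_prison	*pr;
};

static bool
toy_jailed(const struct toy_cred *cred)
{
	return (cred->pr != NULL);
}

/*
 * Returns 0 on success, EAFNOSUPPORT if the prison has no IPv4 addresses,
 * and EADDRNOTAVAIL if addr is not one of the prison's addresses.
 */
int
toy_prison_check_ip4(const struct toy_cred *cred, uint32_t addr)
{
	size_t i;

	assert(cred != NULL);
	if (!toy_jailed(cred))
		return (0);		/* non-jailed creds always succeed */
	if (cred->pr->ip4s == 0)
		return (EAFNOSUPPORT);
	for (i = 0; i < cred->pr->ip4s; i++)
		if (cred->pr->ip4[i] == addr)
			return (0);
	return (EADDRNOTAVAIL);
}
```

With the jailed() test folded into the function, a caller can simply propagate the returned value instead of checking jailed() first and hard-coding an error.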
bz
5af7ae8eac When iterating through the list trying to find a router in
defrouter_select(), NULL the cached llentry after unlocking
as we are no longer interested in it and with the second
iteration would try to unlock it again resulting in
panic: Lock (rw) lle not locked @ ...

Reported by:	Mark Atkinson <m.atkinson@f5.com>
Tested by:	Mark Atkinson <m.atkinson@f5.com>
PR:		kern/128247 (in follow-up, unrelated to original report)
2009-02-04 10:35:27 +00:00
rrs
520c389cb4 - Cleanup checksum code.
- Prepare for CRC offloading, add MIB counters (RS/MT).
- Bugfix: Disable CRC computation for IPv6 addresses with local scope (MT).
- Bugfix: Handle close() with SO_LINGER correctly when notifications
          are generated during the close() call (MT).
- Bugfix: Generate DRY event when sender is dry during subscription.
          Only for 1-to-1 style sockets (RS/MT)
- Bugfix: Put vtags for the correct amount of time into time-wait (MT).
- Bugfix: Clear vtag entries correctly on expiration (MT).
- Bugfix: shutdown() indicates ENOTCONN when called for unconnected
          1-to-1 style sockets (MT).
- Bugfix: In sctp Auth code (PL).
- Add support for devices that support SCTP csum offload (igb).
- Add missing sctp_associd to mib sysctl xsctp_tcb structure (RS)
Obtained from:	With help from Peter Lei and Michael Tuexen
2009-02-03 11:04:03 +00:00
bz
5d8f0a53a7 Remove the single global unlocked route cache ip6_forward_rt
from the inet6 stack along with statistics and make sure we
properly free the rt in all cases.

While the current situation is not better performance wise it
prevents panics seen more often these days.
After more inet6 and ipsec cleanup we should be able to improve
the situation again passing the rt to ip6_forward directly.

Leave the ip6_forward_rt entry in struct vinet6 but mark it
for removal.

PR:		kern/128247, kern/131038
MFC after:	25 days
Committed from:	Bugathon #6
Tested by:	Denis Ahrens <denis@h3q.com> (different initial version)
2009-02-01 21:11:08 +00:00
bz
033060866c Remove unused local MACROs.
Submitted by:	Christoph Mallon <christoph.mallon@gmx.de>
MFC after:	2 weeks
2009-01-31 17:35:44 +00:00
bz
b3bbe5cac1 Coalesce two consecutive #ifdef IPSEC blocks.
Move the skip_ipsec: label below the goto as we can never have
ipsecrt set if we get to that label so there is no need to check.

MFC after:	2 weeks
2009-01-31 12:24:53 +00:00
bz
f922834f0f Remove dead code from #if 0:
we do not have an ipsrcchk_rt anywhere else.

MFC after:	2 weeks
2009-01-31 11:19:20 +00:00
bz
226b2a700e Like with r185713 make sure to not leak a lock as rtalloc1(9) returns
a locked route. Thus we have to use RTFREE_LOCKED(9) to get it unlocked
and rtfree(9)d rather than just rtfree(9)d.

Since the PR was filed, new places with the same problem were added
with new code.  Also check that the rt is valid before freeing it
either way there.

PR:		kern/129793
Submitted by:	Dheeraj Reddy <dheeraj@ece.gatech.edu>
MFC after:	2 weeks
Committed from:	Bugathon #6
2009-01-31 10:48:02 +00:00
bz
ec7e619d54 Remove 4 entirely unused ip6 variables.
Leave them in struct vinet6 to not break the ABI with kernel modules
but mark them for removal so we can do it in one batch when the time
is right.

MFC after:	1 month
2009-01-30 23:40:24 +00:00
bz
6dddd78341 For consistency with prison_{local,remote,check}_ipN rename
prison_getipN to prison_get_ipN.

Submitted by:	jamie (as part of a larger patch)
MFC after:	1 week
2009-01-25 10:11:58 +00:00
sam
b278e68100 remove too noisy DIAGNOSTIC code
Reviewed by:	qingli
2009-01-18 07:20:02 +00:00
qingli
751dff3610 Revive the RTF_LLINFO flag in route.h. The kernel code is guarded
by the new kernel option COMPAT_ROUTE_FLAGS for binary backward
compatibility. The RTF_LLDATA flag maps to the same value as RTF_LLINFO.
RTF_LLDATA is used by the arp and ndp utilities. The RTF_LLDATA flag is
always returned to the userland regardless whether the COMPAT_ROUTE_FLAGS
is defined.
2009-01-12 11:24:32 +00:00
bz
ffd2421407 Restrict arp, ndp and theoretically the FIB listing (if not
read with libkvm) to the addresses of a prison, when inside a
jail. [1]
As the patch from the PR was pre-'new-arp', add checks to the
llt_dump handlers as well.

While touching RTM_GET in route_output(), consistently use
curthread credentials rather than the creds from the socket
there. [2]

PR:		kern/68189
Submitted by:	Mark Delany <sxcg2-fuwxj@qmda.emu.st> [1]
Discussed with:	rwatson [2]
Reviewed by:	rwatson
MFC after:	4 weeks
2009-01-09 21:57:49 +00:00
bz
60c950d4ff Make SIOCGIFADDR and related, as well as SIOCGIFADDR_IN6 and related
jail-aware. Up to now we returned the first address of the interface
for SIOCGIFADDR w/o an ifr_addr in the query. This caused problems for
programs querying for an address but running inside a jail, as the
address returned usually did not belong to the jail.
Like for v6, if there was an ifr_addr given on v4, you could probe
for more addresses on the interfaces that you were not allowed to see
from inside a jail. Return an error (EADDRNOTAVAIL) in that case
now unless the address is on the given interface and valid for the
jail.

PR:		kern/114325
Reviewed by:	rwatson
MFC after:	4 weeks
2009-01-09 13:06:56 +00:00
rrs
fcaf24fb54 Addresses Robert's comments on comments. Also adds
the KASSERT and checks suggested.

Reviewed by:	The udp tunneling was discussed on net@ under the
                thread entitled "Heads up -- Thinking about UDP and tunneling"
2009-01-06 13:27:56 +00:00
rrs
8bff422255 Add the ability of an alternate transport protocol
to easily tunnel over udp by providing a hook
function that will be called instead of appending
to the socket buffer.
2009-01-06 12:13:40 +00:00
bz
086c4b5b79 Switch the last protosw* structs to C99 initializers.
Reviewed by:	ed, julian, Christoph Mallon <christoph.mallon@gmx.de>
MFC after:	2 weeks
2009-01-05 20:29:01 +00:00
rwatson
e259848db5 Unlike with struct protosw, several instances of struct ip6protosw
did not use C99-style sparse structure initialization, so remove
NULL assignments for now-removed pr_usrreq function pointers.

Reported by:	Chris Ruiz <yr.retarded at gmail.com>
2009-01-04 21:53:42 +00:00
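
The difference between positional and C99 designated initializers that e259848db5 and 086c4b5b79 turn on can be sketched as follows. `struct toy_protosw` is a hypothetical cut-down structure, not the real struct protosw or struct ip6protosw, and the member set is chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cut-down protocol switch entry. */
struct toy_protosw {
	int	 pr_type;
	int	 pr_protocol;
	void	(*pr_input)(void);
	void	(*pr_ctlinput)(void);
	void	(*pr_init)(void);
};

static void toy_input(void) { }
static void toy_init(void) { }

/*
 * Positional form: every slot up to the last one used must be written out,
 * so removing a member means editing every initializer that spells out NULL.
 */
const struct toy_protosw toy_positional = {
	1, 17, toy_input, NULL, toy_init
};

/*
 * C99 designated form: members are named, omitted ones are zero-initialized,
 * and the initializer survives the removal of a member it never mentions.
 */
const struct toy_protosw toy_designated = {
	.pr_type =	1,
	.pr_protocol =	17,
	.pr_input =	toy_input,
	.pr_init =	toy_init,
};

/* Member-by-member comparison of two entries. */
int
toy_protosw_equal(const struct toy_protosw *a, const struct toy_protosw *b)
{
	assert(a != NULL && b != NULL);
	return (a->pr_type == b->pr_type &&
	    a->pr_protocol == b->pr_protocol &&
	    a->pr_input == b->pr_input &&
	    a->pr_ctlinput == b->pr_ctlinput &&
	    a->pr_init == b->pr_init);
}
```

Both tables describe the same entry; only the positional one would need editing if `pr_ctlinput` were removed from the structure.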
rwatson
6db41b8313 struct ip6protosw is a copy of struct protosw, so remove pr_usrreq there
to reflect removal from struct protosw.

Spotted by:	ed
2009-01-04 21:13:51 +00:00
qingli
efe3f87721 Some modules such as SCTP supply a valid route entry as an input argument
to ip_output(). The destination is represented in a sockaddr{} object
that may contain other pieces of information, e.g., port number. This
same destination sockaddr{} object may be passed into L2 code, which
could be used to create a L2 entry. Since there exists a L2 table per
address family, the L2 lookup function can make address family specific
comparison instead of the generic bcmp() operation over the entire
sockaddr{} structure.

Note in the IPv6 case the sin6_scope_id is not compared because the
address is currently stored in the embedded form inside the kernel.
The in6_lltable_lookup() has to account for the scope-id if this
storage format were to change in the future.
2009-01-03 00:27:28 +00:00
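
The point made in efe3f87721 about comparing only the address-family-specific part of a sockaddr can be sketched in userland C. The `toy_` function names are hypothetical, and `memcmp()` stands in for the kernel's `bcmp()`; the actual lookup code is not reproduced here.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/*
 * Whole-structure comparison: two sockaddrs for the same host compare as
 * different if any other field, such as the port, differs.
 */
int
toy_sockaddr_equal_generic(const struct sockaddr_in *a,
    const struct sockaddr_in *b)
{
	return (memcmp(a, b, sizeof(*a)) == 0);
}

/* Address-family-specific comparison: only the IPv4 address matters. */
int
toy_sockaddr_equal_inet(const struct sockaddr_in *a,
    const struct sockaddr_in *b)
{
	assert(a->sin_family == AF_INET && b->sin_family == AF_INET);
	return (a->sin_addr.s_addr == b->sin_addr.s_addr);
}
```

A destination that carries a port number, as one supplied by a transport protocol might, still matches an L2 entry for the same address under the second function but not under the first.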
qingli
1d851edfc0 This checkin addresses a couple of issues:
1. The "route" command allows route insertion through the interface-direct
   option "-iface". During if_attach(), a sockaddr_dl{} entry is created
   for the interface and is part of the interface address list. This
   sockaddr_dl{} entry describes the interface in detail. The "route"
   command selects this entry as the "gateway" object when the "-iface"
   option is present. The "arp" and "ndp" commands also interact with the
   kernel through the routing socket when adding and removing static L2
   entries. The static L2 information is also provided through the
   "gateway" object with an AF_LINK family type, similar to what is
   provided by the "route" command. In order to differentiate between
   these two types of operations, a RTF_LLDATA flag is introduced. This
   flag is set by the "arp" and "ndp" commands when issuing the add and
   delete commands. This flag is also set in each L2 entry returned by the
   kernel. The "arp" and "ndp" commands follow a convention where a RTM_GET
   is issued first followed by a RTM_ADD/DELETE. This RTM_GET request fills
   in the fields for a "rtm" object, which is reinjected into the kernel by
   a subsequent RTM_ADD/DELETE command. The entry returned from RTM_GET
   is a prefix route, so the RTF_LLDATA flag must be specified when issuing
   the RTM_ADD/DELETE messages.

2. Enforce the convention that NET_RT_FLAGS with a 0 w_arg is the
   specification for retrieving L2 information. Also optimized the
   code logic.

Reviewed by:   julian
2008-12-26 19:45:24 +00:00
kmacy
7c3c2fbe0c avoid lock recursion by deferring the link check until after LLE lock is dropped 2008-12-24 01:08:18 +00:00
bz
f9f31751ac Correct variable name in comment.
MFC after:	4 weeks
2008-12-22 12:54:52 +00:00
qingli
8f88fc89cb Similar to the INET case, do not destroy the nd6 entries for
interface addresses until those addresses are removed. I already
made the patch in INET but forgot to bring the code over for
INET6.
2008-12-22 07:11:15 +00:00
bz
fcc42d6a25 Only unlock the llentry if it is actually valid.
Reported by:	ed
2008-12-18 19:09:14 +00:00
bz
b1db56aa98 Another step assimilating IPv[46] PCB code:
normalize IN6P_* compat flags usage to their equivalent
INP_* counterpart.

Discussed with:	rwatson
Reviewed by:	rwatson
MFC after:	4 weeks
2008-12-17 13:00:18 +00:00
bz
ea0d9d2e9a Use inc_flags instead of the inc_isipv6 alias which so far
had been the only flag with random usage patterns.
Switch inc_flags to be used as a real bit field by using
INC_ISIPV6 with bitops to check for the 'isipv6' condition.

While here fix a place or two where in case of v4 inc_flags
were not properly initialized before.[1]

Found by:	rwatson during review [1]
Discussed with:	rwatson
Reviewed by:	rwatson
MFC after:	4 weeks
2008-12-17 12:52:34 +00:00
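
Using a flags field as a real bit field, as ea0d9d2e9a describes for inc_flags and INC_ISIPV6, amounts to testing and setting individual bits instead of treating the field as a single boolean. The sketch below uses hypothetical `TOY_` and `toy_` names and invented flag values; the real definitions live in the kernel headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag values, invented for illustration. */
#define	TOY_INC_ISIPV6	0x01
#define	TOY_INC_OTHER	0x02

struct toy_conninfo {
	uint8_t	inc_flags;
};

/* Test the bit rather than comparing the whole field against a value. */
int
toy_is_ipv6(const struct toy_conninfo *inc)
{
	assert(inc != NULL);
	return ((inc->inc_flags & TOY_INC_ISIPV6) != 0);
}

/* Set only the IPv6 bit, leaving other flags untouched. */
void
toy_set_ipv6(struct toy_conninfo *inc)
{
	assert(inc != NULL);
	inc->inc_flags |= TOY_INC_ISIPV6;
}

/* Clear only the IPv6 bit, leaving other flags untouched. */
void
toy_clear_ipv6(struct toy_conninfo *inc)
{
	assert(inc != NULL);
	inc->inc_flags &= (uint8_t)~TOY_INC_ISIPV6;
}
```

Because each operation touches a single bit, further flags can share the field, which is not possible when the field is used as an alias for one boolean. It also shows why the field has to be initialized before use: a stale bit would be read as a valid flag.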
qingli
c6b6112234 Remove the rt argument from nd6_storelladdr() because
rt is no longer accessed.
2008-12-17 10:27:34 +00:00
qingli
3bfc2293f2 A couple of files were not meant to be committed. 2008-12-17 10:19:53 +00:00
qingli
c6a0a000ca in6_clsroute() was applied to prefix routes causing some
of them to expire. in6_clsroute() was only applied to
cloned routes that are no longer applicable after the
arp-v2 commit.
2008-12-17 10:03:49 +00:00
kmacy
43e7c1af8b * Compare pointer with NULL
* Remove trailing whitespace (added in r186162)
* Reduce indentation by rephrasing test

Submitted by:	Christoph Mallon (christoph dot mallon at gmx dot de)
2008-12-16 23:56:24 +00:00
kmacy
447f3863c6 - Simplify handling of the deferring of mbuf transmit until after lle lock drop
- add a couple of comments to clarify intent
2008-12-16 23:06:36 +00:00
kmacy
222f4e20a8 check pointers against NULL 2008-12-16 06:01:08 +00:00
kmacy
c501489004 convert more pointer validation checks to checking against NULL 2008-12-16 03:12:44 +00:00
kmacy
e73e761720 simplify locking in find_pfxlist_reachable_router 2008-12-16 03:05:18 +00:00
kmacy
bf113303e6 explicitly check return of lla_lookup against NULL 2008-12-16 02:47:22 +00:00
kmacy
ed9ff236d5 advance tail pointer in nd6_output_lle and check lla_output return against NULL 2008-12-16 02:33:53 +00:00
kmacy
54c2e2ce52 check return from lla_lookup against NULL not zero 2008-12-16 02:30:42 +00:00
kmacy
aca7e14bdb make sure redirect doesn't return without dropping the lock 2008-12-16 02:06:26 +00:00
kmacy
9682e6d337 need to check that lle is not null before unlock if the break condition is not met
also fix the break condition to explicitly check against NULL
2008-12-16 02:05:11 +00:00
kmacy
0b5a9dada1 unlock the llentry after use in find_pfxlist_reachable_router 2008-12-16 01:58:30 +00:00
qingli
e1f9a89b0d Initialize the variable "router", and apply "static_route" flag
across the entire nd6_cache_lladdr() function.
2008-12-16 01:21:19 +00:00
kmacy
c9eebde165 unlock and destroy an llentry's lock before freeing
Found by: sam
2008-12-16 00:20:49 +00:00
kmacy
505bc29767 unlock looked up llentrys in defrouter_select 2008-12-16 00:18:04 +00:00
kmacy
8cc0e3cda9 fix two use after frees in nd6_cache_lladdr caused by last minute unlock shuffling 2008-12-16 00:16:51 +00:00
bz
03f6bb9dc9 Another step assimilating IPv[46] PCB code - directly use
the inpcb names rather than the following IPv6 compat macros:
in6pcb,in6p_sp, in6p_ip6_nxt,in6p_flowinfo,in6p_vflag,
in6p_flags,in6p_socket,in6p_lport,in6p_fport,in6p_ppcb and
sotoin6pcb().

Apart from removing duplicate code in netipsec, this is a pure
whitespace, not a functional change.

Discussed with:	rwatson
Reviewed by:	rwatson (version before review requested changes)
MFC after:	4 weeks (set the timer and see then)
2008-12-15 21:50:54 +00:00
qingli
ec826ad5c7 The main goals of this project are:
1. separating L2 tables (ARP, NDP) from the L3 routing tables
2. removing as many locking dependencies among these layers as
   possible to allow for some parallelism in the search operations
3. simplifying the logic in the routing code

The most notable end result is the obsolescence of the route
cloning (RTF_CLONING) concept, which translated into code reduction
in both IPv4 ARP and IPv6 NDP related modules, and size reduction in
struct rtentry{}. The change in design obsoletes the semantics of
RTF_CLONING, RTF_WASCLONE and RTF_LLINFO routing flags. The userland
applications such as "arp" and "ndp" have been modified to reflect
those changes. The output from "netstat -r" shows only the routing
entries.

Quite a few developers have contributed to this project in the
past: Glebius Smirnoff, Luigi Rizzo, Alessandro Cerri, and
Andre Oppermann. And most recently:

- Kip Macy revised the locking code completely, thus completing
  the last piece of the puzzle, Kip has also been conducting
  active functional testing
- Sam Leffler has helped me improve/refactor the code, and
  provided valuable reviews
- Julian Elischer set up the perforce tree for me and has helped
  me maintain that branch before the svn conversion
2008-12-15 06:10:57 +00:00
kmacy
9c68b9dedd in6_addroute is called through rnh_addaddr which is always called with the radix node head lock held
exclusively. Pass RTF_RNH_LOCKED to rtalloc so that rtalloc1_fib will not try to re-acquire the lock.
2008-12-13 20:15:42 +00:00
bz
98e7fe0e6a Second round of putting global variables which were virtualized
but formerly missed under VIMAGE_GLOBALS.

Put the extern declarations of the virtualized globals
under VIMAGE_GLOBALS, as the globals themselves already are.
This will help when we eventually remove the globals
entirely.

Sponsored by:	The FreeBSD Foundation
2008-12-13 19:13:03 +00:00
kmacy
408657ab73 RTF_RNH_LOCKED needs to be passed in the flags arg not report,
apologies to thompsa
2008-12-12 02:07:45 +00:00
thompsa
7a14f2c921 Pass RTF_RNH_LOCKED to rtalloc1 since the node head is locked; this avoids a
recursive lock panic on inet6 detach.

Reviewed by:	kmacy
2008-12-12 01:46:59 +00:00
bz
83a32f8750 Put global variables which were virtualized but formerly
missed under VIMAGE_GLOBALS.

Start putting the extern declarations of the virtualized globals
under VIMAGE_GLOBALS, as the globals themselves already are.
This will help when we eventually remove the globals
entirely.

While there garbage collect a few dead externs from ip6_var.h.

Sponsored by:	The FreeBSD Foundation
2008-12-11 16:26:38 +00:00
zec
7b573d1496 Conditionally compile out V_ globals while instantiating the appropriate
container structures, depending on VIMAGE_GLOBALS compile time option.

Make VIMAGE_GLOBALS a new compile-time option, which by default will not
be defined, resulting in instantiations of global variables selected for
V_irtualization (enclosed in #ifdef VIMAGE_GLOBALS blocks) to be
effectively compiled out.  Instantiate new global container structures
to hold V_irtualized variables: vnet_net_0, vnet_inet_0, vnet_inet6_0,
vnet_ipsec_0, vnet_netgraph_0, and vnet_gif_0.

Update the VSYM() macro so that depending on VIMAGE_GLOBALS the V_
macros resolve either to the original globals, or to fields inside
container structures, i.e. effectively

#ifdef VIMAGE_GLOBALS
#define V_rt_tables rt_tables
#else
#define V_rt_tables vnet_net_0._rt_tables
#endif

Update SYSCTL_V_*() macros to operate either on globals or on fields
inside container structs.

Extend the internal kldsym() lookups with the ability to resolve
selected fields inside the virtualization container structs.  This
applies only to the fields which are explicitly registered for kldsym()
visibility via VNET_MOD_DECLARE() and vnet_mod_register(), currently
this is done only in sys/net/if.c.

Fix a few broken instances of MODULE_GLOBAL() macro use in SCTP code,
and modify the MODULE_GLOBAL() macro to resolve to V_ macros, which in
turn result in proper code being generated depending on VIMAGE_GLOBALS.

De-virtualize local static variables in sys/contrib/pf/net/pf_subr.c
which were prematurely V_irtualized by automated V_ prepending scripts
during earlier merging steps.  PF virtualization will be done
separately, most probably after next PF import.

Convert a few variable initializations at instantiation to
initialization in init functions, most notably in ipfw.  Also convert
TUNABLE_INT() initializers for V_ variables to TUNABLE_FETCH_INT() in
initializer functions.

Discussed at:	devsummit Strassburg
Reviewed by:	bz, julian
Approved by:	julian (mentor)
Obtained from:	//depot/projects/vimage-commit2/...
X-MFC after:	never
Sponsored by:	NLnet Foundation, The FreeBSD Foundation
2008-12-10 23:12:39 +00:00
imp
689d225f30 Add missing include to sys/lock.h before sys/rwlock.h 2008-12-08 00:28:21 +00:00
kmacy
598b522b42 - convert radix node head lock from mutex to rwlock
- make radix node head lock not recursive
 - fix LOR in rtexpunge
 - fix LOR in rtredirect

Reviewed by:	sam
2008-12-07 21:15:43 +00:00
rrs
0f2b9dafa3 Code from the hack-session known as the IETF (and a
bit of debugging afterwards):
- Fix protection code for notification generation.
- Decouple associd from vtag
- Allow vtags to have less stringent requirements in non-uniqueness.
   o don't pre-hash them when you issue one in a cookie.
   o Allow duplicates and use addresses and ports to
     discriminate amongst the duplicates during lookup.
- Add support for the NAT draft draft-ietf-behave-sctpnat-00, this
  is still experimental and needs more extensive testing with the
  Jason Butt ipfw changes.
- Support for the SENDER_DRY event to get DTLS in OpenSSL working
  with a set of patches from Michael Tuexen (hopefully heading to OpenSSL soon).
- Update the support of SCTP-AUTH by Peter Lei.
- Use macros for refcounting.
- Fix MTU for UDP encapsulation.
- Fix reporting back of unsent data.
- Update assoc send counter handling to be consistent with endpoint sent counter.
- Fix a bug in PR-SCTP.
- Fix so we only send another FWD-TSN when a SACK arrives IF and only
  if the adv-peer-ack point progressed. However we still make sure
  a timer is running if we do have an adv_peer_ack point.
- Fix PR-SCTP bug where chunks were retransmitted if they are sent
  unreliable but not abandoned yet.

With the help of:	Michael Tuexen and Peter Lei :-)
MFC after:	 4 weeks
2008-12-06 13:19:54 +00:00
bz
604d89458a Rather than using hidden includes (with circular dependencies),
directly include only the header files needed. This reduces the
unneeded spamming of various headers into lots of files.

For now, this leaves us with very few modules including vnet.h
and thus needing to depend on opt_route.h.

Reviewed by:	brooks, gnn, des, zec, imp
Sponsored by:	The FreeBSD Foundation
2008-12-02 21:37:28 +00:00
bz
d2730d5b27 MFp4:
Bring in updated jail support from bz_jail branch.

This enhances the current jail implementation to permit multiple
addresses per jail. In addition to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..

SCTP support was updated and supports IPv6 in jails as well.

Cpuset support permits jails to be bound to specific processor
sets after creation.

Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.

DDB 'show jails' command was added to aid debugging.

Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.

Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.

Bump __FreeBSD_version for the aforementioned and in-kernel changes.

Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
  and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
  help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
  suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
  on cluster machines as well as all the testers and people
  who provided feedback the last months on freebsd-jail and
  other channels.
- My employer, CK Software GmbH, for the support so I could work on this.

Reviewed by:	(see above)
MFC after:	3 months (this is just so that I get the mail)
X-MFC Before:   7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
zec
7ecd715d48 Unhide declarations of network stack virtualization structs from
underneath #ifdef VIMAGE blocks.

This change introduces some churn in #include ordering and nesting
throughout the network stack and drivers but is not expected to cause
any additional issues.

In the next step this will allow us to instantiate the virtualization
container structures and switch from using global variables to their
"containerized" counterparts.

Reviewed by:	bz, julian
Approved by:	julian (mentor)
Obtained from:	//depot/projects/vimage-commit2/...
X-MFC after:	never
Sponsored by:	NLnet Foundation, The FreeBSD Foundation
2008-11-28 23:30:51 +00:00
bz
864141180e Merge in6_pcbfree() into in_pcbfree() which after the previous
IPsec change in r185366 only differed in two additional IPv6 lines.
Rather than splattering conditional code everywhere add the v6
check centrally at this single place.

Reviewed by:	rwatson (as part of a larger changeset)
MFC after:	6 weeks (*)
(*) possibly need to leave a stub wrapper in 7 to keep the symbol.
2008-11-27 12:04:35 +00:00
bz
9ef49d8b6f Unify ipsec[46]_delete_pcbpolicy in ipsec_delete_pcbpolicy.
Ignoring different names because of macros (in6pcb, in6p_sp) and
inp vs. in6p variable name both functions were entirely identical.

Reviewed by:	rwatson (as part of a larger changeset)
MFC after:	6 weeks (*)
(*) possibly need to leave stub wrappers in 7 to keep the symbols.
2008-11-27 10:43:08 +00:00
zec
95a15f5c84 Merge more of currently non-functional (i.e. resolving to
whitespace) macros from p4/vimage branch.

Do a better job at enclosing all instantiations of globals
scheduled for virtualization in #ifdef VIMAGE_GLOBALS blocks.

De-virtualize and mark as const saorder_state_alive and
saorder_state_any arrays from ipsec code, given that they are never
updated at runtime, so virtualizing them would be pointless.

Reviewed by:  bz, julian
Approved by:  julian (mentor)
Obtained from:        //depot/projects/vimage-commit2/...
X-MFC after:  never
Sponsored by: NLnet Foundation, The FreeBSD Foundation
2008-11-26 22:32:07 +00:00
bz
b089198d31 Remove in6_pcbdetach() as it is exactly the same function
as in_pcbdetach() and we don't need the code twice.

Reviewed by:	rwatson
MFC after:	6 weeks (*)
(*) possibly need to leave a stub wrapper in 7 to keep the symbol.
2008-11-26 20:52:26 +00:00
bz
2df0dfae25 Unify the v4 and v6 versions of pcbdetach and pcbfree as good
as possible so that they are easily diffable.

No functional changes.

Reviewed by:	rwatson
MFC after:	6 weeks
2008-11-26 12:54:31 +00:00
bz
3c7e39c293 Plug a credential leak in case the inpcb is freed by
in6_pcbfree() instead of in_pcbfree(); missed in r183606.

Reviewed by:	rwatson
MFC after:	3 days (instantly for 7.1-RC?)
2008-11-26 12:24:18 +00:00
zec
815d52c5df Change the initialization methodology for global variables scheduled
for virtualization.

Instead of initializing the affected global variables at instantiation,
assign initial values to them in initializer functions.  As a rule,
initialization at instantiation for such variables should never be
introduced again from now on.  Furthermore, enclose all instantiations
of such global variables in #ifdef VIMAGE_GLOBALS blocks.

Essentially, this change should have zero functional impact.  In the next
phase of merging network stack virtualization infrastructure from
p4/vimage branch, the new initialization methodology will allow us to
switch between using global variables and their counterparts residing in
virtualization containers with minimum code churn, and in the long run
allow us to initialize multiple instances of such container structures.

Discussed at:	devsummit Strassburg
Reviewed by:	bz, julian
Approved by:	julian (mentor)
Obtained from:	//depot/projects/vimage-commit2/...
X-MFC after:	never
Sponsored by:	NLnet Foundation, The FreeBSD Foundation
2008-11-19 09:39:34 +00:00
rwatson
0db6d4519c Add a MAC label, MAC Framework, and MAC policy entry points for IPv6
fragment reassembly queues.

This allows policies to label reassembly queues, perform access
control checks when matching fragments to a queue, update a queue
label when fragments are matched, and label the resulting
reassembled datagram.

Obtained from:	TrustedBSD Project
2008-10-26 22:45:18 +00:00
des
a1e1ad22e0 Fix a number of style issues in the MALLOC / FREE commit. I've tried to
be careful not to fix anything that was already broken; the NFSv4 code is
particularly bad in this respect.
2008-10-23 20:26:15 +00:00
des
66f807ed8b Retire the MALLOC and FREE macros. They are an abomination unto style(9).
MFC after:	3 months
2008-10-23 15:53:51 +00:00
bz
0991899a98 Bring over the change switching from using sequential to random
ephemeral port allocation as implemented in netinet/in_pcb.c rev. 1.143
(initially from OpenBSD) and follow-up commits during the last four and
a half years including rev. 1.157, 1.162 and 1.199.
This now is relying on the same infrastructure as has been implemented
in in_pcb.c since rev. 1.199.

Reviewed by:	silby, rpaulo, mlaier
MFC after:	2 months
2008-10-20 18:43:59 +00:00
bz
88b6e9b1ce Check that the mbuf len is positive (like we do in the v4 case).
Read the other way round this means that even with the checks
the m_len turned negative in some cases which led to panics.
The reason to my understanding seems to be that the checks are wrong
(also for v4) ignoring possible padding when checking cmsg_len or
padding after data when adjusting the mbuf.
Doing proper checks seems to break applications like named so
further investigation and regression tests are needed.

PR:		kern/119123
Tested by:	Ashish Shukla  wahjava gmail.com
MFC after:	3 days
2008-10-15 19:24:18 +00:00
rwatson
fabec516f4 When disconnecting a UDPv6 socket, acquire the socket lock around the
changing of the so_state field, as is done in UDPv4.  Remove XXX
locking comment.

MFC after:	3 days
2008-10-12 20:01:32 +00:00
bz
d3e532c1da Style changes: compare pointer to NULL and move a }.
MFC after:	6 weeks
2008-10-04 17:07:58 +00:00
bz
77f80e0672 Cache so_cred as inp_cred in the inpcb.
This means that inp_cred is always there, even after the socket
has gone away. It also means that it is constant for the lifetime
of the inp.
Both facts lead to simpler code and possibly less locking.

Suggested by:	rwatson
Reviewed by:	rwatson
MFC after:	6 weeks
X-MFC Note:	use a inp_pspare for inp_cred
2008-10-04 15:06:34 +00:00
zec
8797d4caec Step 1.5 of importing the network stack virtualization infrastructure
from the vimage project, as per plan established at devsummit 08/08:
http://wiki.freebsd.org/Image/Notes200808DevSummit

Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator
macros, and CURVNET_SET() context setting macros, all currently
resolving to NOPs.

Prepare for virtualization of selected SYSCTL objects by introducing a
family of SYSCTL_V_*() macros, currently resolving to their global
counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT().

Move selected #defines from sys/sys/vimage.h to newly introduced header
files specific to virtualized subsystems (sys/net/vnet.h,
sys/netinet/vinet.h etc.).

All the changes are verified to have zero functional impact at this
point in time by doing MD5 comparison between pre- and post-change
object files(*).

(*) netipsec/keysock.c did not validate depending on compile time options.

Implemented by:	julian, bz, brooks, zec
Reviewed by:	julian, bz, brooks, kris, rwatson, ...
Approved by:	julian (mentor)
Obtained from:	//depot/projects/vimage-commit2/...
X-MFC after:	never
Sponsored by:	NLnet Foundation, The FreeBSD Foundation
2008-10-02 15:37:58 +00:00
cperciva
678568e481 Default to ignoring potentially evil IPv6 Neighbor Solicitation
messages.

Approved by:    so (cperciva)
Approved by:	re (kensmith)
Security:       FreeBSD-SA-08:10.nd6
Thanks to:      jinmei, bz
2008-10-02 00:32:59 +00:00
rwatson
12ddb86062 When invoking the udp_send() from udp6_send() due to use of a v6-mapped
IPv4 address, first drop the udbinfo and inpcb locks, which will otherwise
be recursed.  This leads to a potential minor race, but is preferable to a
deadlock when acquiring a read lock after a write lock on the inpcb.

MFC after:	3 days
Reported by:	Norbert Papke <fbsd-ml@scrapper.ca>, lioux
2008-09-22 06:44:03 +00:00
bz
83b4eaaa16 mld_timerresid() returns ms so instead of doing the maths in usec
and then dividing down to ms, do the maths in ms.

Obtained from:	NetBSD mld6.c rev. 1.47
MFC after:	2 months
2008-09-10 19:42:13 +00:00
simon
6bb93e188c - Fix amd64 local privilege escalation. [08:07]
- Fix nmount(2) local privilege escalation. [08:08]
- Fix IPv6 remote kernel panics. [08:09]

Fix for [08:07] is merge of r181823.

Submitted by:	kib [08:07], csjp [08:08], bz [08:09]
Reviewed by:	peter [08:07], jhb [08:07]
Reviewed by:	jinmei [08:09], rwatson [08:09]
Approved by:	re (SA blanket)
Approved by:	so (simon)
Security:	FreeBSD-SA-08:07.amd64
Security:	FreeBSD-SA-08:08.nmount
Security:	FreeBSD-SA-08:09.icmp6
2008-09-03 19:09:47 +00:00
bz
78dd3921ca Fix a bug, when a specially crafted ICMPV6 MLD packet could lead
to an integer divide by zero panic in the kernel, if the kernel was
run with hz<1000.
Neither i386, pc98, amd64 or sparc64 are affected in the currently
supported branches and default configuration.

Submitted by:	Miikka Saukko, Ossi Herrala and Jukka Taimisto from
		the CROSS project at Codenomicon Ltd. via CERT-FI.
Reviewed by:	bz, rwatson
Security:	CVE-2008-2464
MFC after:	8 hours
2008-09-03 08:13:58 +00:00
rwatson
2d23f13f7f In UDPv6, reduce scope of global udbinfo lock during append to last
matching socket by dropping it before udp6_append(), and remove
duplicate unlocks of udbinfo and inpcb in sysctl return path.

MFC after:	3 days
2008-08-31 13:16:45 +00:00
julian
e951e2f915 another missed V_
2008-08-25 06:09:32 +00:00
julian
03a5241ea0 Fix some of the formatting fixes. It's amazing how some things stand out
in a commit message.
2008-08-20 01:24:55 +00:00
julian
0592958505 A bunch of formatting fixes brought to light by, or created by the Vimage commit
a few days ago.
2008-08-20 01:05:56 +00:00
bz
8a795a9f76 As part of step 1.5 of the vimage framework resolve conflicts with
file local static globals which would be folded onto the same name
with the V_ macros.

Reviewed by:	kris, brooks, simon
2008-08-18 13:16:19 +00:00
bz
1021d43b56 Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).

This is the first in a series of commits over the course
of the next few weeks.

Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.

We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.

Obtained from:	//depot/projects/vimage-commit2/...
Reviewed by:	brooks, des, ed, mav, julian,
		jamie, kris, rwatson, zec, ...
		(various people I forgot, different versions)
		md5 (with a bit of help)
Sponsored by:	NLnet Foundation, The FreeBSD Foundation
X-MFC after:	never
V_Commit_Message_Reviewed_By:	more people than the patch
2008-08-17 23:27:27 +00:00
bz
eafee510a9 Fix a regression introduced in r179289 splitting up ip6_savecontrol()
into v4-only vs. v6-only inp_flags processing.
When ip6_savecontrol_v4() is called from ip6_savecontrol() we
were not passing back the **mp thus the information will be missing
in userland.
Instead of going with a *** as suggested in the PR we are returning
**mp now and passing in the v4only flag as a pointer argument.

PR:		kern/126349
Reviewed by:	rwatson, dwmalone
2008-08-16 06:39:18 +00:00
rwatson
13b6ce6962 Adopt the slightly weaker consistency locking approach used in IPv4 raw
sockets for IPv6 raw sockets: separately lock the inpcb for determining
the destination address for a connect()'d raw socket at the rip6_send()
layer, and then re-acquire the inpcb lock in the rip6_output() layer to
query other options on the socket.  Previously, the global raw IP socket
lock was used, which while correct and marginally more consistent, could
add significantly to global raw IP socket lock contention.

MFC after:	1 week
2008-07-30 09:26:27 +00:00
rwatson
4082f24815 When copying in and out current ICMPv6 filters on a raw IPv6 socket,
lock the inpcb and use a local stack variable to copy to/from userspace
so that sooptcopyin()/sooptcopyout() aren't called while holding an
rwlock.

While here, fix a bug in which a failed sooptcopyin() might lead to
partially consistent ICMPv6 filters on the socket by not ignoring the
error returned by sooptcopyin().

MFC after:	2 weeks
2008-07-29 19:37:16 +00:00
rwatson
cd641465ef Since we fail IPv6 raw socket allocation if inp->in6p_icmp6filt can't
be allocated, there's no need to conditionize use and freeing of it
later.

MFC after:	1 week
2008-07-29 18:09:46 +00:00
rwatson
f16d022cdb Marginally decomplicate set/getsockopt code in ip6_output.c by simply
using the passed arguments explicitly and unconditionally rather than
testing them and calling panic().  The result is the same but easier
to read.

MFC after:	3 days
2008-07-29 09:31:03 +00:00
mav
c8ae327077 Move the inpcb lock higher to protect reads of some non-binding fields.
This fixes nothing at this time, but is more correct.
2008-07-28 19:32:18 +00:00
mav
8a791bfa67 According to in_pcb.h, protocol binding information has double locking.
This allows accessing it while traversing the list holding only the global pcbinfo lock.
2008-07-27 20:30:34 +00:00
bz
362cb79214 Pass the ucred along into in{,6}_pcblookup_local for upcoming
prison checks.

Reviewed by:	rwatson
2008-07-10 13:31:11 +00:00
bz
4b9bb0069f For consistency take lport as u_short in in{,6}_pcblookup_local.
All callers either pass in an u_short or u_int16_t.

Reviewed by:	rwatson
2008-07-10 13:23:22 +00:00
rrs
a51aa927fa 1) Adds the rest of the VIMAGE change macros
2) Adds some __UserSpace__ on some of the common defines that
   the user space code needs
3) Fixes a bug when we send up data to a user that failed. We
   need to a) trim off the data chunk headers, if present, and
   b) make sure the frag bit is communicated properly for the
   msgs coming off the stream queues... i.e. we see if some
   of the msg has been taken.

Obtained from:	jeli contributed the VIMAGE changes on this pass. Thanks Julian!
2008-07-09 16:45:30 +00:00
bz
c0ef832fd2 Document required locking in in6_selectsrc() in case an inp is
passed in by adding an assert.

Requested by:	rwatson
Reviewed by:	rwatson
2008-07-09 16:33:21 +00:00
bz
13896f2e51 Change the parameters to in6_selectsrc():
- pass in the inp instead of both in6p_moptions and laddr.
 - pass in cred for upcoming prison checks.

Reviewed by:	rwatson
2008-07-08 18:41:36 +00:00
rwatson
7e3b07cdf5 Use soreceive_dgram() and sosend_dgram() with UDPv6, as we do with UDPv4.
Tested by:	ps
MFC after:	3 months
2008-07-08 10:15:23 +00:00
rwatson
00f0ab40d4 Drop read lock on udbinfo earlier during delivery to the last matching
UDP socket for a datagram; the inpcb read lock is sufficient to provide
inpcb stability during udp6_append().

MFC after:      1 month
2008-07-07 10:11:17 +00:00
rwatson
6ee57a292b Improve approximation of style(9) in raw socket code.
2008-07-05 18:03:39 +00:00
rwatson
051819b847 Introduce a new lock, hostname_mtx, and use it to synchronize access
to global hostname and domainname variables.  Where necessary, copy
to or from a stack-local buffer before performing copyin() or
copyout().  A few uses, such as in cd9660 and daemon_saver, remain
under-synchronized and will require further updates.

Correct a bug in which a failed copyin() of domainname would leave
domainname potentially corrupted.

MFC after:	3 weeks
2008-07-05 13:10:10 +00:00
rwatson
482bfeab47 Remove NETISR_MPSAFE, which allows specific netisr handlers to be directly
dispatched without Giant, and add NETISR_FORCEQUEUE, which allows specific
netisr handlers to always be dispatched via a queue (deferred).  Mark the
usb and if_ppp netisr handlers as NETISR_FORCEQUEUE, and explicitly
acquire Giant in those handlers.

Previously, any netisr handler not marked NETISR_MPSAFE would necessarily
run deferred and with Giant acquired.  This change removes Giant
scaffolding from the netisr infrastructure, but NETISR_FORCEQUEUE allows
non-MPSAFE handlers to continue to force deferred dispatch so as to avoid
lock order reversals between their acquisition of Giant and any calling
context.

It is likely we will be able to remove NETISR_FORCEQUEUE once
IFF_NEEDSGIANT is removed, as non-MPSAFE usb and if_ppp drivers will no
longer be supported.

Reviewed by:	bz
MFC after:	1 month
X-MFC note:	We can't remove NETISR_MPSAFE from stable/7 for KPI reasons,
		but the rest can go back.
2008-07-04 00:21:38 +00:00
rwatson
20a754dc21 Remove GIANT_REQUIRED from IPv6 input, forward, and frag6 code. The frag6
code is believed to be MPSAFE, and leaving aside the IPv6 route cache in
forwarding, Giant appears not to adequately synchronize the data structures
in the input or forwarding paths.
2008-07-03 10:55:13 +00:00
rwatson
a2caa98b95 Set the IPv6 netisr handler as NETISR_MPSAFE on the basis that, despite
there still being some well-known races in mld6 and nd6, running with
Giant over the netisr handler provides little or no additional
synchronization that might cause mld6 and nd6 to behave better.
2008-07-02 23:12:40 +00:00
bz
43f7cc1db4 Try to fix errors introduced in svn180085/cvs rev. 1.10:
* Include ip6_var.h for ip6stat.
* Use the correct name under ip6stat: `ip6s_cantforward' instead
  of its IPv4 counterpart.

MFC after:	10 days
2008-06-29 07:34:21 +00:00
kan
7d4f905059 Repair botched variable rename.
Pointy hat to:	julian
2008-06-29 04:33:45 +00:00
julian
6a69bd4db2 Oops, we've been incrementing the wrong cantforward variable.
Obtained from:	vimage tree
2008-06-29 00:25:16 +00:00
julian
ffd508c001 Rename two vars so that they are different from the same vars in ipv4.
They are static so it was not a problem 'per se' but it was confusing to
the reader.

Obtained from:	vimage tree
2008-06-29 00:17:45 +00:00
rrs
7782c49376 - Macro-izes the packed declaration in all headers.
- Vimage prep - these are major restructures to move
  all global variables to be accessed via a macro or two.
  The variables all go into a single structure.
- Asconf address addition tweaks (add_or_del Interfaces)
- Fix rwnd calculation to be more conservative.
- Support SACK_IMMEDIATE flag to skip delayed sack
  by demand of peer.
- Comment updates in the sack mapping calculations
- Invariants panic added.
- Pre-support for UDP tunneling (we can do this on
  MAC but will need added support from UDP to
  get a "pipe" of UDP packets in).
- clear trace buffer sysctl added when local tracing on.

Note the majority of this huge patch is all the vimage prep stuff :-)
2008-06-14 07:58:05 +00:00
rwatson
68f17e9b68 Employ read locks on UDP inpcbs, rather than write locks, when
monitoring UDP connections using sysctls.  In some cases, add
previously missing locking of inpcbs, as inp_socket is followed,
which also allows us to drop global locks more quickly.

MFC after:	1 week
2008-05-29 08:27:14 +00:00
bz
f3ab94f7b4 Factor out the v4-only vs. the v6-only inp_flags processing in
ip6_savecontrol in preparation for udp_append() to no longer
need an WLOCK as we will no longer be modifying socket options.

Requested by:		rwatson
Reviewed by:		gnn
MFC after:		10 days
2008-05-24 15:20:48 +00:00
rrs
8a66346564 - Adds support for the multi-asconf (From Kozuka-san)
- Adds some prepwork (Not all yet) for vimage in particular
  support the delete the sctppcbinfo.xx structs. There is
  still a leak in here if it were to be called plus we still
  need the regrouping (From Me and Michael Tuexen)
- Adds support for UDP tunneling. For BSD there is no
  socket yet setup so its disabled, but major argument
  changes are in here to encompass the passing of the port
  number (zero when you don't have a udp tunnel, the default
  for BSD). Will add some hooks in UDP here shortly (discussed
  with Robert) that will allow easy tunneling. (Mainly from
  Peter Lei and Michael Tuexen with some BSD work from me :-D)
- Some ease for windows, evidently leave is reserved by their
  compiler; move label leave: -> out:

MFC after:	1 week
2008-05-20 13:47:46 +00:00
julian
1dfc5c98a4 Add code to allow the system to handle multiple routing tables.
This particular implementation is designed to be fully backwards compatible
and to be MFC-able to 7.x (and 6.x)

Currently the only protocol that can make use of the multiple tables is IPv4
Similar functionality exists in OpenBSD and Linux.

From my notes:

-----

  One thing where FreeBSD has been falling behind, and which by chance I
  have some time to work on is "policy based routing", which allows
  different packet streams to be routed by more than just the destination
  address.

  Constraints:
  ------------

  I want to make some form of this available in the 6.x tree
  (and by extension 7.x) , but FreeBSD in general needs it so I might as
  well do it in -current and back port the portions I need.

  One of the ways that this can be done is to have the ability to
  instantiate multiple kernel routing tables (which I will now
  refer to as "Forwarding Information Bases" or "FIBs" for political
  correctness reasons). Which FIB a particular packet uses to make
  the next hop decision can be decided by a number of mechanisms.
  The policies these mechanisms implement are the "Policies" referred
  to in "Policy based routing".

  One of the constraints I have if I try to back port this work to
  6.x is that it must be implemented as an EXTENSION to the existing
  ABIs in 6.x so that third party applications do not need to be
  recompiled in the timespan of the branch.

  This first version will not have some of the bells and whistles that
  will come with later versions. It will, for example, be limited to 16
  tables in the first commit.

  Implementation method, Compatible version. (part 1)
  -------------------------------
  For this reason I have implemented a "sufficient subset" of a
  multiple routing table solution in Perforce, and back-ported it
  to 6.x. (also in Perforce though not  always caught up with what I
  have done in -current/P4). The subset allows a number of FIBs
  to be defined at compile time (8 is sufficient for my purposes in 6.x)
  and implements the changes needed to allow IPV4 to use them. I have not
  done the changes for ipv6 simply because I do not need it, and I do not
  have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it.

  Other protocol families are left untouched and should there be
  users with proprietary protocol families, they should continue to work
  and be oblivious to the existence of the extra FIBs.

  To understand how this is done, one must know that the current FIB
  code starts everything off with a single dimensional array of
  pointers to FIB head structures (One per protocol family), each of
  which in turn points to the trie of routes available to that family.

  The basic change in the ABI compatible version of the change is to
  extend that array to be a 2 dimensional array, so that
  instead of protocol family X looking at rt_tables[X] for the
  table it needs, it looks at rt_tables[Y][X], where for all
  protocol families except ipv4 Y is always 0.
  Code that is unaware of the change always just sees the first row
  of the table, which of course looks just like the one dimensional
  array that existed before.
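
  A rough sketch of that layout, for illustration only (this is not the
  committed code). NFIBS, NFAMILIES, FAM_INET, struct fib_head and both
  helper names are placeholders invented for this example, and falling
  back to row 0 for an out-of-range FIB number is an assumption:

```c
#include <assert.h>
#include <stddef.h>

#define NFIBS      16  /* hypothetical; the notes mention a 16-table limit */
#define NFAMILIES  8   /* stand-in for the number of protocol families */
#define FAM_INET   2   /* stand-in for the IPv4 family index */

/* Stand-in for the per-family FIB head structure. */
struct fib_head {
    int fibnum;
    int family;
};

/* Conceptually rt_tables[family] before; rt_tables[fib][family] now. */
static struct fib_head *rt_tables[NFIBS][NFAMILIES];

/* Legacy-style lookup: unaware of FIBs, so it only ever sees row 0. */
static struct fib_head *
table_for_family(int family)
{
    assert(family >= 0 && family < NFAMILIES);
    return rt_tables[0][family];
}

/*
 * FIB-aware lookup: only IPv4 honours the FIB number; every other
 * family (and any out-of-range FIB, by assumption) uses row 0.
 */
static struct fib_head *
table_for_family_fib(int family, int fibnum)
{
    assert(family >= 0 && family < NFAMILIES);
    if (family != FAM_INET || fibnum < 0 || fibnum >= NFIBS)
        fibnum = 0;
    return rt_tables[fibnum][family];
}
```

  With this shape, a caller that never passes a FIB number behaves exactly
  as it did with the one dimensional array, which is the compatibility
  property the notes rely on.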

  The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
  are all maintained, but refer only to the first row of the array,
  so that existing callers in proprietary protocols can continue to
  do the "right thing".
  Some new entry points are added, for the exclusive use of ipv4 code
  called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(),
  which have an extra argument which refers the code to the correct row.

  In addition, there are some new entry points (currently called
  rtalloc_fib() and friends) that check the Address family being
  looked up and call either rtalloc() (and friends) if the protocol
  is not IPv4 forcing the action to row 0 or to the appropriate row
  if it IS IPv4 (and that info is available). These are for calling
  from code that is not specific to any particular protocol. The way
  these are implemented would change in the non ABI preserving code
  to be added later.

  One feature of the first version of the code is that for ipv4,
  the interface routes show up automatically on all the FIBs, so
  that no matter what FIB you select you always have the basic
  direct attached hosts available to you. (rtinit() does this
  automatically).

  You CAN delete an interface route from one FIB should you want
  to but by default it's there. ARP information is also available
  in each FIB. It's assumed that the same machine would have the
  same MAC address, regardless of which FIB you are using to get
  to it.

  This brings us as to how the correct FIB is selected for an outgoing
  IPV4 packet.

  Firstly, all packets have a FIB associated with them. if nothing
  has been done to change it, it will be FIB 0. The FIB is changed
  in the following ways.

  Packets fall into one of a number of classes.

  1/ locally generated packets, coming from a socket/PCB.
     Such packets select a FIB from a number associated with the
     socket/PCB. This in turn is inherited from the process,
     but can be changed by a socket option. The process in turn
     inherits it on fork. I have written a utility called setfib
     that acts a bit like nice..

         setfib -3 ping target.example.com # will use fib 3 for ping.

     It is an obvious extension to make it a property of a jail
     but I have not done so. It can be achieved by combining the setfib and
     jail commands.

  2/ packets received on an interface for forwarding.
     By default these packets would use table 0,
     (or possibly a number settable in a sysctl(not yet)).
     but prior to routing the firewall can inspect them (see below).
     (possibly in the future you may be able to associate a FIB
     with packets received on an interface..  An ifconfig arg, but not yet.)

  3/ packets inspected by a packet classifier, which can arbitrarily
     associate a fib with it on a packet by packet basis.
     A fib assigned to a packet by a packet classifier
     (such as ipfw) would over-ride a fib associated by
     a more default source. (such as cases 1 or 2).

  4/ a tcp listen socket associated with a fib will generate
     accept sockets that are associated with that same fib.

  5/ Packets generated in response to some other packet (e.g. reset
     or icmp packets). These should use the FIB associated with the
     packet being responded to.

  6/ Packets generated during encapsulation.
     gif, tun and other tunnel interfaces will encapsulate using the FIB
     that was in effect with the process that set up the tunnel.
     Thus setfib 1 ifconfig gif0 [tunnel instructions]
     will set the fib for the tunnel to use to be fib 1.
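
  The precedence among cases 1 to 3 above can be sketched as follows. This
  is only an illustration under assumptions: struct pkt_ctx, FIB_UNSET and
  select_fib() are hypothetical names, and the real code carries the FIB
  with the packet rather than in a separate structure like this one.

```c
#include <assert.h>
#include <stddef.h>

#define FIB_UNSET (-1)  /* hypothetical marker: no FIB chosen by this source */

/* Hypothetical per-packet context holding the candidate FIB sources. */
struct pkt_ctx {
    int socket_fib;      /* case 1: from the socket/PCB, else FIB_UNSET */
    int classifier_fib;  /* case 3: set by a classifier such as ipfw */
};

/*
 * Pick the FIB for an outgoing packet: a classifier-assigned FIB
 * overrides the socket's FIB, and anything left unset uses FIB 0.
 */
static int
select_fib(const struct pkt_ctx *p)
{
    assert(p != NULL);
    if (p->classifier_fib != FIB_UNSET)
        return p->classifier_fib;
    if (p->socket_fib != FIB_UNSET)
        return p->socket_fib;
    return 0;  /* default, e.g. a forwarded packet nothing has touched */
}
```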

  Routing messages would be associated with their
  process, and thus select one FIB or another.
  messages from the kernel would be associated with the fib they
  refer to and would only be received by a routing socket associated
  with that fib. (not yet implemented)

  In addition Netstat has been edited to be able to cope with the
  fact that the array is now 2 dimensional. (It looks in system
  memory using libkvm (!)). Old versions of netstat see only the first FIB.

  In addition two sysctls are added to give:
  a) the number of FIBs compiled in (active)
  b) the default FIB of the calling process.

  Early testing experience:
  -------------------------

  Basically our (IronPort's) appliance does this functionality already
  using ipfw fwd but that method has some drawbacks.

  For example,
  It can't fully simulate a routing table because it can't influence the
  socket's choice of local address when a connect() is done.

  Testing during the generating of these changes has been
  remarkably smooth so far. Multiple tables have co-existed
  with no notable side effects, and packets have been routed
  accordingly.

  ipfw has grown 2 new keywords:

  setfib N ip from any to any
  count ip from any to any fib N

  In pf there seems to be a requirement to be able to give symbolic names to the
  fibs but I do not have that capacity. I am not sure if it is required.

  SCTP has interestingly enough built in support for this, called VRFs
  in Cisco parlance. it will be interesting to see how that handles it
  when it suddenly actually does something.

  Where to next:
  --------------------

  After committing the ABI compatible version and MFCing it, I'd
  like to proceed in a forward direction in -current. this will
  result in some roto-tilling in the routing code.

  Firstly: the current code's idea of having a separate tree per
  protocol family, all of the same format, and pointed to by the
  1 dimensional array is a bit silly. Especially when one considers that
  there is code that makes assumptions about every protocol having the
  same internal structures there. Some protocols don't WANT that
  sort of structure. (for example the whole idea of a netmask is foreign
  to appletalk). This needs to be made opaque to the external code.

  My suggested first change is to add routing method pointers to the
  'domain' structure, along with information pointing to the data.
  Instead of having an array of pointers to uniform structures,
  there would be an array pointing to the 'domain' structures
  for each protocol address domain (protocol family),
  and the methods this reached would be called. The methods would have
  an argument that gives FIB number, but the protocol would be free
  to ignore it.

  When the ABI can be changed it raises the possibility of the
  addition of a fib entry into the "struct route". Currently,
  the structure contains the sockaddr of the destination, and the resulting
  fib entry. To make this work fully, one could add a fib number
  so that given an address and a fib, one can find the third element, the
  fib entry.

  Interaction with the ARP layer/ LL layer would need to be
  revisited as well. Qing Li has been working on this already.

  This work was sponsored by Ironport Systems/Cisco

Reviewed by:    several including rwatson, bz and mlair (parts each)
Obtained from:  Ironport systems/Cisco
2008-05-09 23:03:00 +00:00
rwatson
e51618e321 Acquire a read lock, rather than a write lock, on a UDPv6 inpcb when
delivering to the socket or extracting socket details for monitoring
purposes.

MFC after:	3 months
2008-04-22 12:20:33 +00:00
rwatson
a1fcc01258 In ICMPv6, read lock rather than write lock the inpcb on receive.
MFC after:	3 months
2008-04-21 12:08:40 +00:00
rwatson
9ee84cddef With IPv4 raw sockets, read lock rather than write lock the inpcb when
receiving or transmitting.

With IPv6 raw sockets, read lock rather than write lock the inpcb when
receiving.  Unfortunately, IPv6 source address selection appears to
require a write lock on the inpcb for the time being.

MFC after:	3 months
2008-04-21 12:06:41 +00:00
rwatson
e93ab31cf5 When querying a local or remote address on an IPv6 socket, use only a
read lock on the inpcb.

MFC after:	3 months
2008-04-19 14:36:19 +00:00
rwatson
ca47fccd6b Convert pcbinfo and inpcb mutexes to rwlocks, and modify macros to
explicitly select write locking for all use of the inpcb mutex.
Update some pcbinfo lock assertions to assert locked rather than
write-locked, although in practice almost all uses of the pcbinfo
rwlock remain exclusive, and all instances of inpcb lock acquisition
are exclusive.

This change should introduce (ideally) little functional change.
However, it lays the groundwork for significantly increased
parallelism in the TCP/IP code.

MFC after:	3 months
Tested by:	kris (superset of committed patch)
2008-04-17 21:38:18 +00:00
rrs
0eceb328ee - Have SCTP use the new pru_flush functionality
PR:		122710
MFC after:	1 week
2008-04-14 18:12:37 +00:00
qingli
4e8901ea7a This patch provides the back end support for equal-cost multi-path
(ECMP) for both IPv4 and IPv6. Previously, multipath route insertion
was disallowed. For example,

	route add -net 192.103.54.0/24 10.9.44.1
	route add -net 192.103.54.0/24 10.9.44.2

The second route insertion will trigger an error message of
"add net 192.103.54.0/24: gateway 10.2.5.2: route already in table"

Multiple default routes can also be inserted. Here is the netstat
output:

default		10.2.5.1	UGS	0	3074	bge0 =>
default		10.2.5.2	UGS	0	0	bge0

When multipath routes exist, the "route delete" command requires
a specific gateway to be specified or else an error message would
be displayed. For example,

	route delete default

would fail and trigger the following error message:

"route: writing to routing socket: No such process"
"delete net default: not in table"

On the other hand,

	route delete default 10.2.5.2

would be successful: "delete net default: gateway 10.2.5.2"

One does not have to specify a gateway if there is only a single
route for a particular destination.

I need to perform more testing on address aliases and multiple
interfaces that have the same IP prefixes. This patch as it
stands today is not yet ready for prime time. Therefore, the ECMP
code fragments are fully guarded by the RADIX_MPATH macro.
Include the "options  RADIX_MPATH" in the kernel configuration
to enable this feature.

Reviewed by:	robert, sam, gnn, julian, kmacy
2008-04-13 05:45:14 +00:00
rwatson
00684a83a1 In in_pcbnotifyall() and in6_pcbnotify(), use LIST_FOREACH_SAFE() and
eliminate unnecessary local variable caching of the list head pointer,
making the code a bit easier to read.

MFC after:	3 weeks
2008-04-06 21:20:56 +00:00
ru
3b1bf8c2e9 Replaced the misleading uses of a historical artefact M_TRYWAIT with M_WAIT.
Removed dead code that assumed that M_TRYWAIT can return NULL; it's not true
since the advent of MBUMA.

Reviewed by:	arch

There are ongoing disputes as to whether we want to switch to directly using
UMA flags M_WAITOK/M_NOWAIT for mbuf(9) allocation.
2008-03-25 09:39:02 +00:00
bz
33dfb1706b Correct IPsec behaviour with a 'use' level in SP but no SA available.
In that case return and continue processing the packet without IPsec.

PR:		121384
MFC after:	5 days
Reported by:	Cyrus Rahman (crahman gmail.com)
Tested by:	Cyrus Rahman (crahman gmail.com) [slightly older version]
2008-03-14 16:38:11 +00:00
bz
51315b3d89 Correct reference counting on the SP for outgoing IPv6 IPsec connections.
PR:		121374
Reported by:	Cyrus Rahman (crahman gmail.com)
Tested by:	Cyrus Rahman (crahman gmail.com)
MFC after:	5 days
2008-03-14 11:55:04 +00:00
bz
f507f0e4fa #if 0 out a currently unused (and incomplete) function: ip6_ipsec_mtu().
No need to compile 'dead' code.
I am leaving it in because we will have to review the concept and
should use the common function in various places.

MFC after:	5 days
2008-03-14 11:44:30 +00:00
bz
693055a8ae Replace the function name in two identical printfs
by __func__, __LINE__ so we can distinguish them
when people report a problem.

PR:		121373
MFC after:	5 days
2008-03-14 11:09:11 +00:00
bz
cfb85f0c07 Rather than passing around a cached 'priv', pass in an ucred to
ipsec*_set_policy and do the privilege check only if needed.

Try to assimilate both ip*_ctloutput code blocks calling ipsec*_set_policy.

Reviewed by:	rwatson
2008-02-02 14:11:31 +00:00
bz
1c376286e0 Replace the last susers calls in netinet6/ with privilege checks.
Introduce a new privilege allowing to set certain IP header options
(hop-by-hop, routing headers).

Leave a few comments to be addressed later.

Reviewed by:	rwatson (older version, before addressing his comments)
2008-01-24 08:25:59 +00:00
bz
866f483083 Correct the commented out debugging printf()s in REPLACE and NEXT macros.
ip6_sprintf() needs a buffer as first argument these days.

MFC after:	2 weeks
2008-01-20 10:08:15 +00:00
obrien
7eb385c2d8 un-__P()
2008-01-08 19:08:58 +00:00
rwatson
4a0d85f1d4 Fix leaking MAC labels for IPv6 inpcbs by adding missing MAC label
destroy call; this transpired because the inpcb alloc path for IPv4/IPv6
is the same code, but IPv6 has a separate free path.  The result was
that as new IPv6 TCP connections were created, kernel memory would
gradually leak.

MFC after:	3 days
Reported by:	tanyong <tanyong at ercist dot iscas dot ac dot cn>,
		zhouzhouyi
2007-12-17 17:20:57 +00:00
obrien
0d684d927b Clean up VCS Ids.
2007-12-10 16:03:40 +00:00
julian
e38fed7fb7 Remove more dup'd code
MFC After: 1 week
2007-12-06 22:48:24 +00:00
julian
87a49d3e6e remove duped code
Reviewed By: gnn
MFC after: 1 week
2007-12-06 22:44:24 +00:00
mtm
46c3db4ab1 Instead of manually freeing the packet options structure (and not even doing
a good job of it) in the copypktopts() function, just call ip6_clearpktopts()
directly. Otherwise, the callers of this function would end up freeing the
memory twice.

Reviewed by: jinmei
PR:	     kern/116360
2007-11-21 16:01:42 +00:00
rwatson
2bca3d4001 Move towards more explicit support for various network protocol stacks
in the TrustedBSD MAC Framework:

- Add mac_atalk.c and add explicit entry point mac_netatalk_aarp_send()
  for AARP packet labeling, rather than using a generic link layer
  entry point.

- Add mac_inet6.c and add explicit entry point mac_netinet6_nd6_send()
  for ND6 packet labeling, rather than using a generic link layer entry
  point.

- Add explicit entry point mac_netinet_arp_send() for ARP packet
  labeling, and mac_netinet_igmp_send() for IGMP packet labeling,
  rather than using a generic link layer entry point.

- Remove previous generic link layer entry point,
  mac_mbuf_create_linklayer() as it is no longer used.

- Add implementations of new entry points to various policies, largely
  by replicating the existing link layer entry point for them; remove
  old link layer entry point implementation.

- Make MAC_IFNET_LOCK(), MAC_IFNET_UNLOCK(), and mac_ifnet_mtx global
  to the MAC Framework rather than static to mac_net.c as it is now
  needed outside of mac_net.c.

Obtained from:	TrustedBSD Project
2007-10-28 15:55:23 +00:00
rwatson
a3b8fc4866 Rename 'mac_mbuf_create_from_firewall' to 'mac_netinet_firewall_send' as
we move towards netinet as a pseudo-object for the MAC Framework.

Rename 'mac_create_mbuf_linklayer' to 'mac_mbuf_create_linklayer' to
reflect general object-first ordering preference.

Sponsored by:	SPARTA (original patches against Mac OS X)
Obtained from:	TrustedBSD Project, Apple Computer
2007-10-26 13:18:38 +00:00
rwatson
60570a92bf Merge first in a series of TrustedBSD MAC Framework KPI changes
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:

  mac_<object>_<method/action>
  mac_<object>_check_<method/action>

The previous naming scheme was inconsistent and mostly
reversed from the new scheme.  Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier.  Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods.  Also simplify, slightly,
some entry point names.

All MAC policy modules will need to be recompiled, and modules
not updated as part of this commit will need to be modified to
conform to the new KPI.

Sponsored by:	SPARTA (original patches against Mac OS X)
Obtained from:	TrustedBSD Project, Apple Computer
2007-10-24 19:04:04 +00:00