Commit Graph

509 Commits

Author SHA1 Message Date
pluknet
322f330b92 g/c last bit of old ipv6 prefix management.
Reviewed by:	bz
Obtained from:	NetBSD, net/if.h, rev 1.80
2012-02-08 22:05:26 +00:00
glebius
bedbfc31c3 Fix typo in r231010.
Submitted by:	linimon
2012-02-05 12:52:28 +00:00
glebius
0bf7c1fae8 Better comment for ifa_init(), ifa_ref(), ifa_free(). 2012-02-05 08:53:05 +00:00
glebius
8be39a40a2 In ifa_init() initialize if_data.ifi_datalen. This would be
required after upcoming changes from bz@.

Discussed with:	bz
2012-02-05 08:31:15 +00:00
jhb
4ef366671a Convert all users of IF_ADDR_LOCK to use new locking macros that specify
either a read lock or write lock.

Reviewed by:	bz
MFC after:	2 weeks
2012-01-05 19:00:36 +00:00
glebius
8c74bad9f3 Restore a feature that was present in 5.x and 6.x, and was cleared in
7.x, 8.x and 9.x with pf(4) imports: pfsync(4) should suppress CARP
preemption, while it is running its bulk update.

However, reimplement the feature in more elegant manner, that is
partially inspired by newer OpenBSD:

- Rename term "suppression" to "demotion", to match with OpenBSD.
- Keep a global demotion factor, that can be raised by several
  conditions, for now these are:
  - interface goes down
  - carp(4) has problems with ip_output() or ip6_output()
  - pfsync performs bulk update
- Unlike in OpenBSD the demotion factor isn't a counter, but
  is actual value added to advskew. The adjustment values for
  particular error conditions are also configurable, and their
  defaults are maximum advskew value, so a single failure bumps
  demotion to maximum. This is for POLA compatibility, and should
  satisfy most users.
- Demotion factor is a writable sysctl, so user can do
  foot shooting, if he desires to.
2011-12-20 13:53:31 +00:00
glebius
27a36f6ac8 A major overhaul of the CARP implementation. The ip_carp.c was started
from scratch, copying needed functionality from the old implemenation
on demand, with a thorough review of all code. The main change is that
interface layer has been removed from the CARP. Now redundant addresses
are configured exactly on the interfaces, they run on.

The CARP configuration itself is, as before, configured and read via
SIOCSVH/SIOCGVH ioctls. A new prefix created with SIOCAIFADDR or
SIOCAIFADDR_IN6 may now be configured to a particular virtual host id,
which makes the prefix redundant.

ifconfig(8) semantics has been changed too: now one doesn't need
to clone carpXX interface, he/she should directly configure a vhid
on a Ethernet interface.

To supply vhid data from the kernel to an application the getifaddrs(8)
function had been changed to pass ifam_data with each address. [1]

The new implementation definitely closes all PRs related to carp(4)
being an interface, and may close several others. It also allows
to run a single redundant IP per interface.

Big thanks to Bjoern Zeeb for his help with inet6 part of patch, for
idea on using ifam_data and for several rounds of reviewing!

PR:		kern/117000, kern/126945, kern/126714, kern/120130, kern/117448
Reviewed by:	bz
Submitted by:	bz [1]
2011-12-16 12:16:56 +00:00
brooks
9029bb4f3b Remove the unused if_free_type() function.
X-MFC after:	never
2011-12-09 23:26:28 +00:00
glebius
ec7618f8d2 Improve logging:
- don't hardcode function name
- use LOG_DEBUG for such a debug message
- print error value
2011-11-22 19:42:17 +00:00
ed
e97eae1577 Mark MALLOC_DEFINEs static that have no corresponding MALLOC_DECLAREs.
This means that their use is restricted to a single C file.
2011-11-07 06:44:47 +00:00
bz
9cad5bfef3 Add infrastructure to allow all frames/packets received on an interface
to be assigned to a non-default FIB instance.

You may need to recompile world or ports due to the change of struct ifnet.

Submitted by:	cjsp
Submitted by:	Alexander V. Chernikov (melifaro ipfw.ru)
		(original versions)
Reviewed by:	julian
Reviewed by:	Alexander V. Chernikov (melifaro ipfw.ru)
MFC after:	2 weeks
X-MFC:		use spare in struct ifnet
2011-07-03 12:22:02 +00:00
pluknet
29ff2b4aff Update ifc_len field of struct ifconf passed for the ioctl SIOCGIFCONF32
(i.e. under COMPAT_FREEBSD32) in case ifconf() returned success to match
the native SIOCGIFCONF behavior.

PR:		kern/158369
Reported by:	Paul Procacci <pprocacci att gmail com>
MFC after:	1 week
2011-06-28 08:41:44 +00:00
glebius
8c3b1b83f9 When removing ifnets, we should first remove the reference to ifnet
from the interface index, then decrease refcount, not vice versa.

Otherwise there is a race (reproducible) when if_free_internal()
contests on IFNET_WLOCK(), and we got a zero-refed ifnet in the
index for a long time. It may be picked by some other thread,
that runs ifnet_byindex_ref(), who takes the ifnet from index,
and bumps refcount. When reader drops the lock, if_free_internal()
proceeds with free. Then reader tries to free it a second time.
2011-04-04 07:45:08 +00:00
jeff
2d7d8c05e7 - Merge changes to the base system to support OFED. These include
a wider arg2 for sysctl, updates to vlan code, IFT_INFINIBAND,
   and other miscellaneous small features.
2011-03-21 09:40:01 +00:00
bz
b9b7d3e93a Mfp4 CH=177274,177280,177284-177285,177297,177324-177325
VNET socket push back:
  try to minimize the number of places where we have to switch vnets
  and narrow down the time we stay switched.  Add assertions to the
  socket code to catch possibly unset vnets as seen in r204147.

  While this reduces the number of vnet recursion in some places like
  NFS, POSIX local sockets and some netgraph, .. recursions are
  impossible to fix.

  The current expectations are documented at the beginning of
  uipc_socket.c along with the other information there.

  Sponsored by: The FreeBSD Foundation
  Sponsored by: CK Software GmbH
  Reviewed by:  jhb
  Tested by:    zec

Tested by:	Mikolaj Golub (to.my.trociny gmail.com)
MFC after:	2 weeks
2011-02-16 21:29:13 +00:00
bz
aad842be85 Mfp4 CH=177255:
Make VNET_ASSERT() available with either VNET_DEBUG or INVARIANTS.

  Change the syntax to match KASSERT() to allow more flexible panic
  messages rather than having a printf with hardcoded arguments
  before panic.

  Adjust the few assertions we have to the new format (and enhance
  the output).

  Sponsored by: The FreeBSD Foundation
  Sponsored by: CK Software GmbH
  Reviewed by:	jhb

MFC after:	2 weeks
2011-02-11 13:27:00 +00:00
jhb
aff100c87f Fix a LOR by dropping the global ifnet locks while allocating a new ifnet
table in if_grow().  The order of the SYSINIT's for ifnet state were swapped
so that the various locks were initialized before being used.

Reviewed by:	pluknet, bz
MFC after:	2 weeks
2011-01-24 22:21:58 +00:00
mdf
5e41205b16 sysctl(9) cleanup checkpoint: amd64 GENERIC builds cleanly.
Commit the net* piece.
2011-01-12 19:53:50 +00:00
dim
fb307d7d1d After some off-list discussion, revert a number of changes to the
DPCPU_DEFINE and VNET_DEFINE macros, as these cause problems for various
people working on the affected files.  A better long-term solution is
still being considered.  This reversal may give some modules empty
set_pcpu or set_vnet sections, but these are harmless.

Changes reverted:

------------------------------------------------------------------------
r215318 | dim | 2010-11-14 21:40:55 +0100 (Sun, 14 Nov 2010) | 4 lines

Instead of unconditionally emitting .globl's for the __start_set_xxx and
__stop_set_xxx symbols, only emit them when the set_vnet or set_pcpu
sections are actually defined.

------------------------------------------------------------------------
r215317 | dim | 2010-11-14 21:38:11 +0100 (Sun, 14 Nov 2010) | 3 lines

Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout
the tree.

------------------------------------------------------------------------
r215316 | dim | 2010-11-14 21:23:02 +0100 (Sun, 14 Nov 2010) | 2 lines

Add macros to define static instances of VNET_DEFINE and DPCPU_DEFINE.
2010-11-22 19:32:54 +00:00
dim
fda4020a88 Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout
the tree.
2010-11-14 20:38:11 +00:00
bz
491af1942e Factor out DDB commands from r204145, r204279 into if_debug.c for further
enhancements (1).  Switch to a standard 2-clause BSD license for this (2).

Unfortunately we have to un-static the ifindex_table for this but do not
publicly export it.

Suggested by:	rwatson (1) a while back.
Approved by:	thompsa (2) for the change from r204279.
MFC after:	6 days
2010-10-25 08:30:19 +00:00
pluknet
8950ed8036 Reshuffle SIOCGIFCONF32 handler from r155224.
- move all the chunks into one file, which allows to hide SIOCGIFCONF32
  global definition as well.
- replace __amd64__ with proper COMPAT_FREEBSD32 around.
- handle 32bit capacity before going into the handler itself instead of
  doing internal 32bit specific changes within it (e.g. as it's done for
  SIOCGDEFIFACE32_IN6).
- use explicitely sized types for ABI compat.

Approved by:	kib (mentor)
MFC after:	2 weeks
2010-10-21 16:20:48 +00:00
mdf
ab3a8b533a Replace sbuf_overflowed() with sbuf_error(), which returns any error
code associated with overflow or with the drain function.  While this
function is not expected to be used often, it produces more information
in the form of an errno that sbuf_overflowed() did.
2010-09-10 16:42:16 +00:00
zec
e1e5264fc5 When moving an ethernet ifnet from one vnet to another, destroy the
associated ng_ether netgraph node in the current vnet, and create a
new one in the target vnet.

Reviewed by:	julian
MFC after:	3 days
2010-08-13 18:17:32 +00:00
will
d548943ae9 Unbreak LINT by moving all carp hooks to net/if.c / netinet/ip_carp.h, with
the appropriate ifdefs.

Reviewed by:	bz
Approved by:	ken (mentor)
2010-08-11 20:18:19 +00:00
will
aa4e762c4a Allow carp(4) to be loaded as a kernel module. Follow precedent set by
bridge(4), lagg(4) etc. and make use of function pointers and
pf_proto_register() to hook carp into the network stack.

Currently, because of the uncertainty about whether the unload path is free
of race condition panics, unloads are disallowed by default.  Compiling with
CARPMOD_CAN_UNLOAD in CFLAGS removes this anti foot shooting measure.

This commit requires IP6PROTOSPACER, introduced in r211115.

Reviewed by:	bz, simon
Approved by:	ken (mentor)
MFC after:	2 weeks
2010-08-11 00:51:50 +00:00
bz
0078d05705 Return NULL rather than 0 for a pointer.
MFC after:	3 days
2010-07-27 11:54:01 +00:00
qingli
f6ab4a6810 This patch fixes the problem where proxy ARP entries cannot be added
over the if_ng interface.

MFC after:	3 days
2010-05-25 20:42:35 +00:00
sobomax
213eac1f2c Add new tunable 'net.link.ifqmaxlen' to set default send interface
queue length. The default value for this parameter is 50, which is
quite low for many of today's uses and the only way to modify this
parameter right now is to edit if_var.h file. Also add read-only
sysctl with the same name, so that it's possible to retrieve the
current value.

MFC after:	1 month
2010-05-03 07:32:50 +00:00
bz
0a90ef1728 MFP4: @176978-176982, 176984, 176990-176994, 177441
"Whitspace" churn after the VIMAGE/VNET whirls.

Remove the need for some "init" functions within the network
stack, like pim6_init(), icmp_init() or significantly shorten
others like ip6_init() and nd6_init(), using static initialization
again where possible and formerly missed.

Move (most) variables back to the place they used to be before the
container structs and VIMAGE_GLOABLS (before r185088) and try to
reduce the diff to stable/7 and earlier as good as possible,
to help out-of-tree consumers to update from 6.x or 7.x to 8 or 9.

This also removes some header file pollution for putatively
static global variables.

Revert VIMAGE specific changes in ipfilter::ip_auth.c, that are
no longer needed.

Reviewed by:	jhb
Discussed with:	rwatson
Sponsored by:	The FreeBSD Foundation
Sponsored by:	CK Software GmbH
MFC after:	6 days
2010-04-29 11:52:42 +00:00
delphij
58be1647f8 When an underlying ioctl(2) handler returns an error, our ioctl(2)
interface considers that it hits a fatal error, and will not copyout
the request structure back for _IOW and _IOWR ioctls, keeping them
untouched.

The previous implementation of the SIOCGIFDESCR ioctl intends to
feed the buffer length back to userland.  However, if we return
an error, the feedback would be defeated and ifconfig(8) would
trap into an infinite loop.

This commit changes SIOCGIFDESCR to set buffer field to NULL to
indicate the previous ENAMETOOLONG case.

Reported by:	bschmidt
MFC after:	2 weeks
2010-04-14 22:02:19 +00:00
bz
d5b3b3f9d9 In if_detach_internal() we cannot hold the af_data lock over the
dom_ifdetach() calls as they might sleep for callout_drain().
Do as we do in if_attachdomain1() [r121470] and handle
if_afdata_initialized earlier and call dom_ifdetach() unlocked.

Discussed with:	rwatson
MFC after:	10 days
2010-04-11 11:51:44 +00:00
bz
674a87c918 In if_detach_internal() only try to do the detach run if if_attachdomain1()
has actually succeeded to initialize and attach.  There is a theoretical
possibility to drop out early in if_attachdomain1() leaving the array
uninitialized if we cannot get the lock.

Discussed with:	rwatson
MFC after:	10 days
2010-04-11 11:49:24 +00:00
bz
44a0d7a588 Use the DB_SHOW_ALL_COMMAND() macro to register the formerly 'show ifnets'
in the db_show_all_table as 'show all ifnets' and with that follow the
convention for showing complete lists.

Submitted by:	thompsa
MFC after:	3 days
2010-02-24 15:54:24 +00:00
bz
27d5a92985 Start to implement ifnet DDB support:
- 'show ifnets' prints a list of ifnet *s per virtual network stack,
- 'show ifnet <struct ifnet *>' prints fields matching the given ifp.

We do not yet print the complete set of fields and might want to
factor this out to an extra if_debug.c file in case this grows
a lot[1]. We may also want to grow 'show ifnet <if_xname>' support[1].

Sponsored by:	ISPsystem
Suggested by:	rwatson [1]
Reviewed by:	rwatson
MFC after:	5 days
2010-02-20 22:09:48 +00:00
bz
77e8f746fc Enhance a panic string to contain more useful debugging information.
Sponsored by:	ISPsystem
Reviewed by:	rwatson
MFC after:	5 days
2010-02-20 21:43:36 +00:00
delphij
d9a0cd0982 Revised revision 199201 (add interface description capability as inspired
by OpenBSD), based on comments from many, including rwatson, jhb, brooks
and others.

Sponsored by:	iXsystems, Inc.
MFC after:	1 month
2010-01-27 00:30:07 +00:00
syrinx
40d92428fb While flushing the multicast filter of an interface, do not zero the relevant
ifmultiaddr structures' reference to the parent interface, unless the parent
interface is really detaching. While here, program only link layer multicast
filters to a wlan's hardware parent interface.

PR:		kern/142391, kern/142392
Reviewed by:	sam, rpaolo, bms
MFC after:	1 week
2010-01-24 16:17:58 +00:00
thompsa
5056e27c2d Declare a new EVENTHANDLER called iflladdr_event which signals that the L2
address on an interface has changed. This lets stacked interfaces such as
vlan(4) detect that their lower interface has changed and adjust things in
order to keep working. Previously this situation broke at least vlan(4) and
lagg(4) configurations.

The EVENTHANDLER_INVOKE call was not placed within if_setlladdr() due to the
risk of a loop.

PR:		kern/142927
Submitted by:	Nikolay Denev
2010-01-18 20:34:00 +00:00
brooks
a5cc24440b The devices that supported EVFILT_NETDEV kqueue filters were removed in
r195175.  Remove all definitions, documentation, and usage.

fifo_misc.c:
	Remove all kqueue tests as fifo_io.c performs all those that
	would have remained.

Reviewed by:	rwatson
MFC after:	3 weeks
X-MFC note:	don't change vlan_link_state() function signature
2009-12-31 20:29:58 +00:00
jhb
3ce93dcb7c Change vlan interfaces to cope more usefully with the parent interface being
renamed.  Previously the vlan interfaces would lose their configuration as if
the parent interface had been physically removed.  Now vlan interfaces ignore
rename events.
- Add a new ifnet flag (IFF_RENAMING) that is set while an ifnet is being
  renamed.  This flag can be checked in ifnet departure/arrival event
  handlers to treat rename events differently.
- Change the ifnet departure event handler in the if_vlan(4) driver to
  ignore departure events due to a trunk interface being renamed.

Reviewed by:	brooks, rwatson
MFC after:	1 week
2009-12-29 13:35:18 +00:00
jhb
1f48c677b5 Remove if_timer/if_watchdog now that they are no longer used. The space
used by if_timer is reserved for expanding if_index to an int in the
future.

Reviewed by:	rwatson, brooks
2009-11-30 21:25:57 +00:00
delphij
8fed657163 Revert revision 199201 for now as it has introduced a kernel vulnerability
and requires more polishing.
2009-11-12 19:02:10 +00:00
delphij
13a19ef806 Add interface description capability as inspired by OpenBSD.
MFC after:	3 months
2009-11-11 21:30:58 +00:00
qingli
85d603eeff A wrong variable is used when setting up the interface
address route, which broke source address selection in
some code paths.

Submitted by:	noted by bz
Reviewed by:	hrs
MFC after:	immediately
2009-09-20 17:22:19 +00:00
qingli
3a82e44273 Self pointing routes are installed for configured interface addresses
and address aliases. After an interface is brought down and brought
back up again, those self pointing routes disappeared. This patch
ensures after an interface is brought back up, the loopback routes
are reinstalled properly.

Reviewed by:	bz
MFC after:	immediately
2009-09-15 19:18:34 +00:00
rwatson
cffae4081c Add IFNET_HOLD reserved pointer value for the ifindex ifnet array,
which allows an index to be reserved for an ifnet without making
the ifnet available for management operations.  Use this in if_alloc()
while the ifnet lock is released between initial index allocation and
completion of ifnet initialization.

Add ifindex_free() to centralize the implementation of releasing an
ifindex value.  Use in if_free() and if_vmove(), as well as when
releasing a held index in if_alloc().

Reviewed by:	bz
MFC after:	3 days
2009-08-26 11:13:10 +00:00
rwatson
8184e58b3a Break out allocation of new ifindex values from if_alloc() and if_vmove(),
and centralize in a single function ifindex_alloc().  Assert the
IFNET_WLOCK, and add missing IFNET_WLOCK in if_alloc().  This does not
close all known races in this code.

Reviewed by:	bz
MFC after:	3 days
2009-08-25 20:21:16 +00:00
rwatson
260dfcf9e9 Make if_grow static -- it's not used outside of if.c, and with the
internals destined to change, it's better if it remains that way.

MFC after:	3 days
2009-08-24 12:52:05 +00:00
zec
47445e571b When moving ifnets from one vnet to another, and the ifnet
has ifaddresses of AF_LINK type which thus have an embedded
if_index "backpointer", we must update that if_index backpointer
to reflect the new if_index that our ifnet just got assigned.

This change affects only options VIMAGE builds.

Submitted by:	bz
Reviewed by:	bz
Approved by:	re (rwatson), julian (mentor)
2009-08-24 10:14:09 +00:00
rwatson
ef8d755d4d Rework global locks for interface list and index management, correcting
several critical bugs, including race conditions and lock order issues:

Replace the single rwlock, ifnet_lock, with two locks, an rwlock and an
sxlock.  Either can be held to stablize the lists and indexes, but both
are required to write.  This allows the list to be held stable in both
network interrupt contexts and sleepable user threads across sleeping
memory allocations or device driver interactions.  As before, writes to
the interface list must occur from sleepable contexts.

Reviewed by:	bz, julian
MFC after:	3 days
2009-08-23 20:40:19 +00:00
zec
51ca260850 Appease VNET_DEBUG - in if_vmove we temporarily switch i.e.
recurse from one vnet to another which is OK, so no need
to flood the console with warnings here.

Approved by:	re (rwatson), julian (mentor)
2009-08-14 22:46:45 +00:00
rwatson
fb9ffed650 Merge the remainder of kern_vimage.c and vimage.h into vnet.c and
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks.  Minor cleanups are done in the process,
and comments updated to reflect these changes.

Reviewed by:	bz
Approved by:	re (vimage blanket)
2009-08-01 19:26:27 +00:00
bz
3aec900b26 Make the in-kernel logic for the SIOCSIFVNET, SIOCSIFRVNET ioctls
(ifconfig ifN (-)vnet <jname|jid>) work correctly.

Move vi_if_move to if.c and split it up into two functions(*),
one for each ioctl.

In the reclaim case, correctly set the vnet before calling if_vmove.

Instead of silently allowing a move of an interface from the current
vnet to the current vnet, return an error. (*)

There is some duplicate interface name checking before actually moving
the interface between network stacks without locking and thus race
prone. Ideally if_vmove will correctly and automagically handle these
in the future.

Suggested by:	rwatson (*)
Approved by:	re (kib)
2009-07-26 11:29:26 +00:00
rwatson
b3be1c6e3b Introduce and use a sysinit-based initialization scheme for virtual
network stacks, VNET_SYSINIT:

- Add VNET_SYSINIT and VNET_SYSUNINIT macros to declare events that will
  occur each time a network stack is instantiated and destroyed.  In the
  !VIMAGE case, these are simply mapped into regular SYSINIT/SYSUNINIT.
  For the VIMAGE case, we instead use SYSINIT's to track their order and
  properties on registration, using them for each vnet when created/
  destroyed, or immediately on module load for already-started vnets.
- Remove vnet_modinfo mechanism that existed to serve this purpose
  previously, as well as its dependency scheme: we now just use the
  SYSINIT ordering scheme.
- Implement VNET_DOMAIN_SET() to allow protocol domains to declare that
  they want init functions to be called for each virtual network stack
  rather than just once at boot, compiling down to DOMAIN_SET() in the
  non-VIMAGE case.
- Walk all virtualized kernel subsystems and make use of these instead
  of modinfo or DOMAIN_SET() for init/uninit events.  In some cases,
  convert modular components from using modevent to using sysinit (where
  appropriate).  In some cases, do minor rejuggling of SYSINIT ordering
  to make room for or better manage events.

Portions submitted by:	jhb (VNET_SYSINIT), bz (cleanup)
Discussed with:		jhb, bz, julian, zec
Reviewed by:		bz
Approved by:		re (VIMAGE blanket)
2009-07-23 20:46:49 +00:00
rwatson
80ed051e0c Normalize field naming for struct vnet, fix two debugging printfs that
print them.

Reviewed by:	bz
Approved by:	re (kensmith, kib)
2009-07-19 17:40:45 +00:00
rwatson
6955067932 Reimplement and/or implement vnet list locking by replacing a mostly
unused custom mutex/condvar-based sleep locks with two locks: an
rwlock (for non-sleeping use) and sxlock (for sleeping use).  Either
acquired for read is sufficient to stabilize the vnet list, but both
must be acquired for write to modify the list.

Replace previous no-op read locking macros, used in various places
in the stack, with actual locking to prevent race conditions.  Callers
must declare when they may perform unbounded sleeps or not when
selecting how to lock.

Refactor vnet sysinits so that the vnet list and locks are initialized
before kernel modules are linked, as the kernel linker will use them
for modules loaded by the boot loader.

Update various consumers of these KPIs based on whether they may sleep
or not.

Reviewed by:	bz
Approved by:	re (kib)
2009-07-19 14:20:53 +00:00
jamie
9f81cbd9ec Remove the interim vimage containers, struct vimage and struct procg,
and the ioctl-based interface that supported them.

Approved by:	re (kib), bz (mentor)
2009-07-17 14:48:21 +00:00
rwatson
88f8de4d40 Remove unused VNET_SET() and related macros; only VNET_GET() is
ever actually used.  Rename VNET_GET() to VNET() to shorten
variable references.

Discussed with:	bz, julian
Reviewed by:	bz
Approved by:	re (kensmith, kib)
2009-07-16 21:13:04 +00:00
rwatson
57ca4583e7 Build on Jeff Roberson's linker-set based dynamic per-CPU allocator
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator.  Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...).  This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.

Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack.  Virtualized global variables are
tagged at compile-time, placing the in a special linker set, which is
loaded into a contiguous region of kernel memory.  Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of a the kernel linker.

Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy.  Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address.  When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.

This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.

Bump __FreeBSD_version and update UPDATING.

Portions submitted by:  bz
Reviewed by:            bz, zec
Discussed with:         gnn, jamie, jeff, jhb, julian, sam
Suggested by:           peter
Approved by:            re (kensmith)
2009-07-14 22:48:30 +00:00
brooks
0cabaf8791 Remove support for the /dev/net/* per-interface devices. They serve
little purpose and are unused in the base system.

The IOCTL functionality is entirely duplicated and routing sockets
provide a richer interface than the kqueue functionality.

Further, it is not practical for these devices to be made sensible in
the face of VIMAGE.

Bump __FreeBSD_version on the off chance that there is any code out
there that actually uses this stuff.

Reviewed by:	rwatson
Discussed with:	bz, zec
Approved by:	re@ (kensmith)
2009-06-29 19:46:29 +00:00
rwatson
5cfc09d074 Remove unnecessary include of kdb.h that snuck in during ifaddr refcount
work.

Reported by:	pluknet <pluknet at gmail.com>
Approved by:	re (kib)
2009-06-27 10:30:28 +00:00
rwatson
c4ac6ab020 Define four wrapper functions for interface address locking,
if_addr_rlock() and if_addr_runlock() for regular address lists, and
if_maddr_rlock() and if_maddr_runlock() for multicast address lists.

We will use these in various kernel modules to avoid encoding specific
type and locking strategy information into modules that currently use
IF_ADDR_LOCK() and IF_ADDR_UNLOCK() directly.

MFC after:	6 weeks
2009-06-26 00:36:47 +00:00
rwatson
d5bc5239c3 In if_setlladdr(), use IF_ADDR_LOCK() and ifaddr references to improve
the safety of link layer address manipulation.

MFC after:	6 weeks
2009-06-24 10:36:48 +00:00
rwatson
c9ef486fe1 Modify most routines returning 'struct ifaddr *' to return references
rather than pointers, requiring callers to properly dispose of those
references.  The following routines now return references:

  ifaddr_byindex
  ifa_ifwithaddr
  ifa_ifwithbroadaddr
  ifa_ifwithdstaddr
  ifa_ifwithnet
  ifaof_ifpforaddr
  ifa_ifwithroute
  ifa_ifwithroute_fib
  rt_getifa
  rt_getifa_fib
  IFP_TO_IA
  ip_rtaddr
  in6_ifawithifp
  in6ifa_ifpforlinklocal
  in6ifa_ifpwithaddr
  in6_ifadd
  carp_iamatch6
  ip6_getdstifaddr

Remove unused macro which didn't have required referencing:

  IFP_TO_IA6

This closes many small races in which changes to interface
or address lists while an ifaddr was in use could lead to use of freed
memory (etc).  In a few cases, add missing if_addr_list locking
required to safely acquire references.

Because of a lack of deep copying support, we accept a race in which
an in6_ifaddr pointed to by mbuf tags and extracted with
ip6_getdstifaddr() doesn't hold a reference while in transmit.  Once
we have mbuf tag deep copy support, this can be fixed.

Reviewed by:	bz
Obtained from:	Apple, Inc. (portions)
MFC after:	6 weeks (portions)
2009-06-23 20:19:09 +00:00
bz
37fcd1fbea Remove duplicate #include <net/route.h> from the middle of the file. 2009-06-23 13:16:16 +00:00
bz
309ab541f2 Move virtualization of routing related variables into their own
Vimage module, which had been there already but now is stateful.

All variables are now file local; so this further limits the global
spreading of routing related things throughout the kernel.

Add a missing function local variable in case of MPATHing.

Reviewed by:	zec
2009-06-22 17:48:16 +00:00
rwatson
3c02410d55 Add a new function, ifa_ifwithaddr_check(), which rather than returning
a pointer to an ifaddr matching the passed socket address, returns a
boolean indicating whether one was present.  In the (near) future,
ifa_ifwithaddr() will return a referenced ifaddr rather than a raw
ifaddr pointer, and the new wrapper will allow callers that care only
about the boolean condition to avoid having to free that reference.

MFC after:	3 weeks
2009-06-22 10:59:34 +00:00
bz
c5ae8d95f2 After the update to fxp(4) in r194573 we should no longer need
this DELAY(100) hack introduced in r56938.

Thanks to:	yongari
MFC after:	6 weeks
X-MFC note:	not before the fxp(4) changes
2009-06-22 10:27:20 +00:00
rwatson
1f7e54e8c5 Clean up common ifaddr management:
- Unify reference count and lock initialization in a single function,
  ifa_init().
- Move tear-down from a macro (IFAFREE) to a function ifa_free().
- Move reference count bump from a macro (IFAREF) to a function ifa_ref().
- Instead of using a u_int protected by a mutex to refcount(9) for
  reference count management.

The ifa_mtx is now used for exactly one ioctl, and possibly should be
removed.

MFC after:	3 weeks
2009-06-21 19:30:33 +00:00
rdivacky
9992cf9aeb Switch cmd argument to u_long. This matches what if_ethersubr.c does and
allows the code to compile cleanly on amd64 with clang.

Reviewed by:	rwatson
Approved by:	ed (mentor)
2009-06-21 10:29:31 +00:00
sam
df30f1c9f2 r193336 moved ifq_detach to if_free which broke if_alloc followed
by if_free (w/o doing if_attach); move ifq_attach to if_alloc and
rename ifq_attach/detach to ifq_init/ifq_delete to better identify
their purpose

Reviewed by:	jhb, kmacy
2009-06-15 19:50:03 +00:00
jamie
5675a54fb1 Manage vnets via the jail system. If a jail is given the boolean
parameter "vnet" when it is created, a new vnet instance will be created
along with the jail.  Networks interfaces can be moved between prisons
with an ioctl similar to the one that moves them between vimages.
For now vnets will co-exist under both jails and vimages, but soon
struct vimage will be going away.

Reviewed by:	zec, julian
Approved by:	bz (mentor)
2009-06-15 18:59:29 +00:00
bz
fe42112add carp(4) allows people to share a set of IP addresses and can only
use IPv4/v6 for inter-node communication (according to my reading).

Properly wrap the carp callouts in INET || INET6 and refelect this
in sys/conf/files as well.  While in theory this should be ok,
it might be a bit optimistic to think that carp could build with
inet6 only[1].

Discussed with:		mlaier [1]
2009-06-11 10:26:38 +00:00
kib
e1cb2941d4 Adapt vfs kqfilter to the shared vnode lock used by zfs write vop. Use
vnode interlock to protect the knote fields [1]. The locking assumes
that shared vnode lock is held, thus we get exclusive access to knote
either by exclusive vnode lock protection, or by shared vnode lock +
vnode interlock.

Do not use kl_locked() method to assert either lock ownership or the
fact that curthread does not own the lock. For shared locks, ownership
is not recorded, e.g. VOP_ISLOCKED can return LK_SHARED for the shared
lock not owned by curthread, causing false positives in kqueue subsystem
assertions about knlist lock.

Remove kl_locked method from knlist lock vector, and add two separate
assertion methods kl_assert_locked and kl_assert_unlocked, that are
supposed to use proper asserts. Change knlist_init accordingly.

Add convenience function knlist_init_mtx to reduce number of arguments
for typical knlist initialization.

Submitted by:	jhb [1]
Noted by:	jhb [2]
Reviewed by:	jhb
Tested by:	rnoland
2009-06-10 20:59:32 +00:00
bz
b7ff2bdc20 After r193232 rt_tables in vnet.h are no longer indirectly dependent on
the ROUTETABLES kernel option thus there is no need to include opt_route.h
anymore in all consumers of vnet.h and no longer depend on it for module
builds.

Remove the hidden include in flowtable.h as well and leave the two
explicit #includes in ip_input.c and ip_output.c.
2009-06-08 19:57:35 +00:00
zec
8b1f38241a Introduce an infrastructure for dismantling vnet instances.
Vnet modules and protocol domains may now register destructor
functions to clean up and release per-module state.  The destructor
mechanisms can be triggered by invoking "vimage -d", or a future
equivalent command which will be provided via the new jail framework.

While this patch introduces numerous placeholder destructor functions,
many of those are currently incomplete, thus leaking memory or (even
worse) failing to stop all running timers.  Many of such issues are
already known and will be incrementaly fixed over the next weeks in
smaller incremental commits.

Apart from introducing new fields in structs ifnet, domain, protosw
and vnet_net, which requires the kernel and modules to be rebuilt, this
change should have no impact on nooptions VIMAGE builds, since vnet
destructors can only be called in VIMAGE kernels.  Moreover,
destructor functions should be in general compiled in only in
options VIMAGE builds, except for kernel modules which can be safely
kldunloaded at run time.

Bump __FreeBSD_version to 800097.
Reviewed by:	bz, julian
Approved by:	rwatson, kib (re), julian (mentor)
2009-06-08 17:15:40 +00:00
rwatson
f4934662e5 Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with:	pjd
2009-06-05 14:55:22 +00:00
sam
ba34647600 move ifq_detach from if_detach to if_free; this permits callers to
reference if_snd in the period between detach+free which helps simplify
detach code

Reviewed by:	jhb, rwatson
2009-06-02 18:53:21 +00:00
bz
c62e99f85d Convert the two dimensional array to be malloced and introduce
an accessor function to get the correct rnh pointer back.

Update netstat to get the correct pointer using kvm_read()
as well.

This not only fixes the ABI problem depending on the kernel
option but also permits the tunable to overwrite the kernel
option at boot time up to MAXFIBS, enlarging the number of
FIBs without having to recompile. So people could just use
GENERIC now.

Reviewed by:	julian, rwatson, zec
X-MFC:		not possible
2009-06-01 15:49:42 +00:00
zec
861b77b017 Introduce an interm userland-kernel API for creating vnets and
assigning ifnets from one vnet to another.  Deletion of vnets is not
yet supported.

The interface is implemented as an ioctl extension so that no syscalls
had to be introduced.  This should be acceptable given that the new
interface will be used for a short / interim period only, until the
new jail management framwork gains the capability of managing vnets.
This method for managing vimages / vnets has been in use for the past
7 years without any observable issues.

The userland tool to be used in conjunction with the interim API can be
found in p4: //depot/projects/vimage-commit2/src/usr.sbin/vimage/... and
will most probably never get commited to svn.

While here, bump copyright notices in kern_vimage.c and vimage.h to
cover work done in year 2009.

Approved by:	julian (mentor)
Discussed with:	bz, rwatson
2009-05-31 12:10:04 +00:00
zec
ce69980b94 Set ifp->if_afdata_initialized to 0 while holding IF_AFDATA_LOCK on ifp,
not after the lock has been released.

Reviewed by:	bz
Discussed with:	rwatson
2009-05-22 22:22:21 +00:00
zec
363a644ce6 Introduce the if_vmove() function, which will be used in the future
for reassigning ifnets from one vnet to another.

if_vmove() works by calling a restricted subset of actions normally
executed by if_detach() on an ifnet in the current vnet, and then
switches to the target vnet and executes an appropriate subset of
if_attach() actions there.

if_attach() and if_detach() have become wrapper functions around
if_attach_internal() and if_detach_internal(), where the later
variants have an additional argument, a flag indicating whether a
full attach or detach sequence is to be executed, or only a
restricted subset suitable for moving an ifnet from one vnet to
another.  Hence, if_vmove() will not call if_detach() and if_attach()
directly, but will call the if_detach_internal() and
if_attach_internal() variants instead, with the vmove flag set.

While here, staticize ifnet_setbyindex() since it is not referenced
from outside of sys/net/if.c.

Also rename ifccnt field in struct vimage to ifcnt, and do some minor
whitespace garbage collection where appropriate.

This change should have no functional impact on nooptions VIMAGE kernel
builds.

Reviewed by:	bz, rwatson, brooks?
Approved by:	julian (mentor)
2009-05-22 22:09:00 +00:00
zec
d78a1b1a82 Change the curvnet variable from a global const struct vnet *,
previously always pointing to the default vnet context, to a
dynamically changing thread-local one.  The currvnet context
should be set on entry to networking code via CURVNET_SET() macros,
and reverted to previous state via CURVNET_RESTORE().  Recursions
on curvnet are permitted, though strongly discuouraged.

This change should have no functional impact on nooptions VIMAGE
kernel builds, where CURVNET_* macros expand to whitespace.

The curthread->td_vnet (aka curvnet) variable's purpose is to be an
indicator of the vnet context in which the current network-related
operation takes place, in case we cannot deduce the current vnet
context from any other source, such as by looking at mbuf's
m->m_pkthdr.rcvif->if_vnet, sockets's so->so_vnet etc.  Moreover, so
far curvnet has turned out to be an invaluable consistency checking
aid: it helps to catch cases when sockets, ifnets or any other
vnet-aware structures may have leaked from one vnet to another.

The exact placement of the CURVNET_SET() / CURVNET_RESTORE() macros
was a result of an empirical iterative process, whith an aim to
reduce recursions on CURVNET_SET() to a minimum, while still reducing
the scope of CURVNET_SET() to networking only operations - the
alternative would be calling CURVNET_SET() on each system call entry.
In general, curvnet has to be set in three typicall cases: when
processing socket-related requests from userspace or from within the
kernel; when processing inbound traffic flowing from device drivers
to upper layers of the networking stack, and when executing
timer-driven networking functions.

This change also introduces a DDB subcommand to show the list of all
vnet instances.

Approved by:	julian (mentor)
2009-05-05 10:56:12 +00:00
zec
39b6dc8ba2 Permit buiding kernels with options VIMAGE, restricted to only a single
active network stack instance.  Turning on options VIMAGE at compile
time yields the following changes relative to default kernel build:

1) V_ accessor macros for virtualized variables resolve to structure
fields via base pointers, instead of being resolved as fields in global
structs or plain global variables.  As an example, V_ifnet becomes:

    options VIMAGE:          ((struct vnet_net *) vnet_net)->_ifnet
    default build:           vnet_net_0._ifnet
    options VIMAGE_GLOBALS:  ifnet

2) INIT_VNET_* macros will declare and set up base pointers to be used
by V_ accessor macros, instead of resolving to whitespace:

    INIT_VNET_NET(ifp->if_vnet); becomes

    struct vnet_net *vnet_net = (ifp->if_vnet)->mod_data[VNET_MOD_NET];

3) Memory for vnet modules registered via vnet_mod_register() is now
allocated at run time in sys/kern/kern_vimage.c, instead of per vnet
module structs being declared as globals.  If required, vnet modules
can now request the framework to provide them with allocated bzeroed
memory by filling in the vmi_size field in their vmi_modinfo structures.

4) structs socket, ifnet, inpcbinfo, tcpcb and syncache_head are
extended to hold a pointer to the parent vnet.  options VIMAGE builds
will fill in those fields as required.

5) curvnet is introduced as a new global variable in options VIMAGE
builds, always pointing to the default and only struct vnet.

6) struct sysctl_oid has been extended with additional two fields to
store major and minor virtualization module identifiers, oid_v_subs and
oid_v_mod.  SYSCTL_V_* family of macros will fill in those fields
accordingly, and store the offset in the appropriate vnet container
struct in oid_arg1.
In sysctl handlers dealing with virtualized sysctls, the
SYSCTL_RESOLVE_V_ARG1() macro will compute the address of the target
variable and make it available in arg1 variable for further processing.

Unused fields in structs vnet_inet, vnet_inet6 and vnet_ipfw have
been deleted.

Reviewed by:	bz, rwatson
Approved by:	julian (mentor)
2009-04-30 13:36:26 +00:00
rwatson
00b3a1fd4f As with ifnet_byindex_ref(), don't return IFF_DYING interfaces from
ifunit_ref().  ifunit() continues to return them.

MFC after:	3 weeks
2009-04-23 15:56:01 +00:00
rwatson
471539dc8f Add ifunit_ref(), a version of ifunit(), that returns not just an
interface pointer, but also a reference to it.

Modify ifioctl() to use ifunit_ref(), holding the reference until
all ioctls, etc, have completed.

This closes a class of reader-writer races in which interfaces
could be removed during long-running ioctls, leading to crashes.
Many other consumers of ifunit() should now use ifunit_ref() to
avoid similar races.

MFC after:	3 weeks
2009-04-23 13:08:47 +00:00
rwatson
ccc05d4c7f During if_detach(), invoke if_dead() to set the ifnet's function
pointers to "dead" implementations that no-op rather than invoking
the device driver.  This would generally be unexpected and
possibly quite badly handled by most device drivers after
if_detach() has completed.

Reviewed by:	bms
MFC after:	3 weeks
2009-04-23 11:51:53 +00:00
rwatson
bf5b2167ac Move portions of data structure initialization from if_attach() to
if_alloc(), and portions of data structure destruction from if_detach()
to if_free().  These changes leave more of the struct ifnet in a
safe-to-access condition between alloc and attach, and between detach
and free, and focus on attach/detach as stack usage events rather than
data structure initialization.

Affected fields include the linkstate task queue, if_afdata lock,
address lists, kqueue state, and MAC labels.  ifq_attach() ifq_detach()
are not moved as ifq_attach() may use a queue length set by the device
driver between if_alloc() and if_attach().

MFC after:	3 weeks
2009-04-23 10:59:40 +00:00
rwatson
c797841f0d Add a new interface flag, IFF_DYING, which is set when a device driver
calls if_free(), and remains set if the refcount is elevated.  IF_DYING
skips the bit in the if_flags bitmask previously used by IFF_NEEDSGIANT,
so that an MFC can be done without changing which bit is used, as
IFF_NEEDSGIANT is still present in 7.x.

ifnet_byindex_ref() checks for IFF_DYING and returns NULL if it is set,
preventing new references from by acquired by index, preventing
monitoring sysctls from seeing it.  Other lookup mechanisms currently
do not check IFF_DYING, but may need to in the future.

MFC after:	3 weeks
2009-04-23 09:32:30 +00:00
rwatson
6b19bec016 Start to address a number of races relating to use of ifnet pointers
after the corresponding interface has been destroyed:

(1) Add an ifnet refcount, ifp->if_refcount.  Initialize it to 1 in
    if_alloc(), and modify if_free_type() to decrement and check the
    refcount.

(2) Add new if_ref() and if_rele() interfaces to allow kernel code
    walking global interface lists to release IFNET_[RW]LOCK() yet
    keep the ifnet stable.  Currently, if_rele() is a no-op wrapper
    around if_free(), but this may change in the future.

(3) Add new ifnet field, if_alloctype, which caches the type passed
    to if_alloc(), but unlike if_type, won't be changed by drivers.
    This allows asynchronous free's of the interface after the
    driver has released it to still use the right type.  Use that
    instead of the type passed to if_free_type(), but assert that
    they are the same (might have to rethink this if that doesn't
    work out).

(4) Add a new ifnet_byindex_ref(), which looks up an interface by
    index and returns a reference rather than a pointer to it.

(5) Fix if_alloc() to fully initialize the if_addr_mtx before hooking
    up the ifnet to global lists.

(6) Modify sysctls in if_mib.c to use ifnet_byindex_ref() and release
    the ifnet when done.

When this change is MFC'd, it will need to replace if_ispare fields
rather than adding new fields in order to avoid breaking the binary
interface.  Once this change is MFC'd, if_free_type() should be
removed, as its 'type' argument is now optional.

This refcount is not appropriate for counting mbuf pkthdr references,
and also not for counting entry into the device driver via ifnet
function pointers.  An rmlock may be appropriate for the latter.
Rather, this is about ensuring data structure stability when reaching
an ifnet via global ifnet lists and tables followed by copy in or out
of userspace.

MFC after:      3 weeks
Reported by:    mdtancsa
Reviewed by:    brooks
2009-04-21 22:43:32 +00:00
rwatson
72b01124b7 Acquire the interface address list lock over some iterations over
if_addrhead.  This closes some reader-writer races associated with
the address list.

MFC after:	2 weeks
2009-04-21 19:06:47 +00:00
kmacy
7370d77f78 export if_qflush for use by driver if_qflush routines
only set ifp->if_{transmit, qflush} if not already set
KASSERT that neither or both are set
2009-04-16 23:05:10 +00:00
zec
8b0141432f In the !VIMAGE_GLOBALS case, make sure not to call vnet_net_iattach()
both via the vnet_mod_register() framework and then directly, but only
once.

Reviewed by:	bz
Approved by:	julian (mentor)
2009-04-15 18:15:29 +00:00
kmacy
a74ac29eb6 call default if_qflush on ifq if default method isn't used by the driver 2009-04-14 03:17:44 +00:00
zec
b39b54e6de Introduce vnet module registration / initialization framework with
dependency tracking and ordering enforcement.

With this change, per-vnet initialization functions introduced with
r190787 are no longer directly called from traditional initialization
functions (which cc in most cases inlined to pre-r190787 code), but are
instead registered via the vnet framework first, and are invoked only
after all prerequisite modules have been initialized.  In the long run,
this framework should allow us to both initialize and dismantle
multiple vnet instances in a correct order.

The problem this change aims to solve is how to replay the
initialization sequence of various network stack components, which
have been traditionally triggered via different mechanisms (SYSINIT,
protosw).  Note that this initialization sequence was and still can be
subtly different depending on whether certain pieces of code have been
statically compiled into the kernel, loaded as modules by boot
loader, or kldloaded at run time.

The approach is simple - we record the initialization sequence
established by the traditional mechanisms whenever vnet_mod_register()
is called for a particular vnet module.  The vnet_mod_register_multi()
variant allows a single initializer function to be registered multiple
times but with different arguments - currently this is only used in
kern/uipc_domain.c by net_add_domain() with different struct domain *
as arguments, which allows for protosw-registered initialization
routines to be invoked in a correct order by the new vnet
initialization framework.

For the purpose of identifying vnet modules, each vnet module has to
have a unique ID, which is statically assigned in sys/vimage.h.
Dynamic assignment of vnet module IDs is not supported yet.

A vnet module may specify a single prerequisite module at registration
time by filling in the vmi_dependson field of its vnet_modinfo struct
with the ID of the module it depends on.  Unless specified otherwise,
all vnet modules depend on VNET_MOD_NET (container for ifnet list head,
rt_tables etc.), which thus has to and will always be initialized
first.  The framework will panic if it detects any unresolved
dependencies before completing system initialization.  Detection of
unresolved dependencies for vnet modules registered after boot
(kldloaded modules) is not provided.

Note that the fact that each module can specify only a single
prerequisite may become problematic in the long run.  In particular,
INET6 depends on INET being already instantiated, due to TCP / UDP
structures residing in INET container.  IPSEC also depends on INET,
which will in turn additionally complicate making INET6-only kernel
configs a reality.

The entire registration framework can be compiled out by turning on the
VIMAGE_GLOBALS kernel config option.

Reviewed by:	bz
Approved by:	julian (mentor)
2009-04-11 05:58:58 +00:00
mlaier
f2cc9938bf Follow up for r190895 It's not only the "all" group that is affected, but
all groups on the given interface.

PR:		kern/130977, kern/131310
MFC after:	3 days (%vnet)
2009-04-10 19:16:14 +00:00
mlaier
6472cfe37a Remove interfaces from IFG_ALL on detach. This cures a couple of pf panics
when using the "self" keyword in tables or as ()-style host address and
fixes "ifconfig -g all" output.

PR:		kern/130977, kern/131310
Submitted by:	Mikolaj Golub
MFC after:	3 days
2009-04-10 14:41:51 +00:00
zec
c85551e0bc First pass at separating per-vnet initializer functions
from existing functions for initializing global state.

        At this stage, the new per-vnet initializer functions are
	directly called from the existing global initialization code,
	which should in most cases result in compiler inlining those
	new functions, hence yielding a near-zero functional change.

        Modify the existing initializer functions which are invoked via
        protosw, like ip_init() et. al., to allow them to be invoked
	multiple times, i.e. per each vnet.  Global state, if any,
	is initialized only if such functions are called within the
	context of vnet0, which will be determined via the
	IS_DEFAULT_VNET(curvnet) check (currently always true).

        While here, V_irtualize a few remaining global UMA zones
        used by net/netinet/netipsec networking code.  While it is
        not yet clear to me or anybody else whether this is the right
        thing to do, at this stage this makes the code more readable,
        and makes it easier to track uncollected UMA-zone-backed
        objects on vnet removal.  In the long run, it's quite possible
        that some form of shared use of UMA zone pools among multiple
        vnets should be considered.

	Bump __FreeBSD_version due to changes in layout of structs
	vnet_ipfw, vnet_inet and vnet_net.

Approved by:	julian (mentor)
2009-04-06 22:29:41 +00:00
sam
fdfdc43f6d enable setting the mac address of 802.11 devices 2009-03-28 17:36:56 +00:00
jamie
422c98f3d6 Call the interface's if_ioctl from ifioctl(), if the protocol didn't
handle the ioctl.  There are other paths that already call it, but this
allows for a non-interface socket (like AF_LOCAL which ifconfig now
uses) to use a broader class of interface ioctls.

Approved by:	bz (mentor), rwatson
2009-03-20 13:41:23 +00:00
rwatson
70b6a8119c Remove IFF_NEEDSGIANT, a compatibility infrastructure introduced
in FreeBSD 5.x to allow network device drivers to run with Giant
despite the network stack being Giant-free.  This significantly
simplifies calls into ioctl() on network interfaces, especially
in the multicast code, as well as eliminates deferred invocation
of interface if_start routines.

Disable the build on device drivers still depending on
IFF_NEEDSGIANT as they no longer compile.  They will be removed
in a few weeks if they haven't been made MPSAFE in that time.
Disabled drivers:

        if_ar
        if_axe
        if_aue
        if_cdce
        if_cue
        if_kue
        if_ray
        if_rue
        if_rum
        if_sr
        if_udav
        if_ural
        if_zyd

Drivers that were already disabled because of tty changes:

        if_ppp
        if_sl

Discussed on:	arch@
2009-03-15 14:21:05 +00:00
sam
9dac23d7ae remove stray ; 2009-03-14 17:54:58 +00:00
bz
df2be82cec For all files including net/vnet.h directly include opt_route.h and
net/route.h.

Remove the hidden include of opt_route.h and net/route.h from net/vnet.h.

We need to make sure that both opt_route.h and net/route.h are included
before net/vnet.h because of the way MRT figures out the number of FIBs
from the kernel option. If we do not, we end up with the default number
of 1 when including net/vnet.h and array sizes are wrong.

This does not change the list of files which depend on opt_route.h
but we can identify them now more easily.
2009-02-27 14:12:05 +00:00
jamie
12bbe1869f Standardize the various prison_foo_ip[46] functions and prison_if to
return zero on success and an error code otherwise.  The possible errors
are EADDRNOTAVAIL if an address being checked for doesn't match the
prison, and EAFNOSUPPORT if the prison doesn't have any addresses in
that address family.  For most callers of these functions, use the
returned error code instead of e.g. a hard-coded EADDRNOTAVAIL or
EINVAL.

Always include a jailed() check in these functions, where a non-jailed
cred always returns success (and makes no changes).  Remove the explicit
jailed() checks that preceded many of the function calls.

Approved by:	bz (mentor)
2009-02-05 14:06:09 +00:00
jhb
a622abe85f Only start the if_slowtimo timer (which drives the if_watchdog methods of
network interfaces) if we have at least one interface with an if_watchdog
routine.

MFC after:	2 weeks
2009-01-23 20:53:01 +00:00
kmacy
b08c9bbff4 if_rtdel is always called with the RADIX_NODE_HEAD lock held 2008-12-18 09:59:24 +00:00
kmacy
0e212eb0e1 add ifnet_byindex_locked to allow for use of IFNET_RLOCK 2008-12-18 04:50:44 +00:00
kmacy
832d8c0e29 avoid trying to acquire a shared lock while holding an exclusive lock
by making the ifnet lock acquisition exclusive
2008-12-17 04:33:52 +00:00
kmacy
d0147f27c7 convert ifnet and afdata locks from mutexes to rwlocks 2008-12-17 00:11:56 +00:00
qingli
ec826ad5c7 This main goals of this project are:
1. separating L2 tables (ARP, NDP) from the L3 routing tables
2. removing as much locking dependencies among these layers as
   possible to allow for some parallelism in the search operations
3. simplify the logic in the routing code,

The most notable end result is the obsolescent of the route
cloning (RTF_CLONING) concept, which translated into code reduction
in both IPv4 ARP and IPv6 NDP related modules, and size reduction in
struct rtentry{}. The change in design obsoletes the semantics of
RTF_CLONING, RTF_WASCLONE and RTF_LLINFO routing flags. The userland
applications such as "arp" and "ndp" have been modified to reflect
those changes. The output from "netstat -r" shows only the routing
entries.

Quite a few developers have contributed to this project in the
past: Glebius Smirnoff, Luigi Rizzo, Alessandro Cerri, and
Andre Oppermann. And most recently:

- Kip Macy revised the locking code completely, thus completing
  the last piece of the puzzle, Kip has also been conducting
  active functional testing
- Sam Leffler has helped me improving/refactoring the code, and
  provided valuable reviews
- Julian Elischer setup the perforce tree for me and has helped
  me maintaining that branch before the svn conversion
2008-12-15 06:10:57 +00:00
bz
7bc8c0cbd6 Whitespace changes only - tabs must have been converted to spaces
somehow, when moving the code from p4 to svn.

Sponsored by:	The FreeBSD Foundation
2008-12-11 15:42:59 +00:00
zec
7b573d1496 Conditionally compile out V_ globals while instantiating the appropriate
container structures, depending on VIMAGE_GLOBALS compile time option.

Make VIMAGE_GLOBALS a new compile-time option, which by default will not
be defined, resulting in instatiations of global variables selected for
V_irtualization (enclosed in #ifdef VIMAGE_GLOBALS blocks) to be
effectively compiled out.  Instantiate new global container structures
to hold V_irtualized variables: vnet_net_0, vnet_inet_0, vnet_inet6_0,
vnet_ipsec_0, vnet_netgraph_0, and vnet_gif_0.

Update the VSYM() macro so that depending on VIMAGE_GLOBALS the V_
macros resolve either to the original globals, or to fields inside
container structures, i.e. effectively

#ifdef VIMAGE_GLOBALS
#define V_rt_tables rt_tables
#else
#define V_rt_tables vnet_net_0._rt_tables
#endif

Update SYSCTL_V_*() macros to operate either on globals or on fields
inside container structs.

Extend the internal kldsym() lookups with the ability to resolve
selected fields inside the virtualization container structs.  This
applies only to the fields which are explicitly registered for kldsym()
visibility via VNET_MOD_DECLARE() and vnet_mod_register(), currently
this is done only in sys/net/if.c.

Fix a few broken instances of MODULE_GLOBAL() macro use in SCTP code,
and modify the MODULE_GLOBAL() macro to resolve to V_ macros, which in
turn result in proper code being generated depending on VIMAGE_GLOBALS.

De-virtualize local static variables in sys/contrib/pf/net/pf_subr.c
which were prematurely V_irtualized by automated V_ prepending scripts
during earlier merging steps.  PF virtualization will be done
separately, most probably after next PF import.

Convert a few variable initializations at instantiation to
initialization in init functions, most notably in ipfw.  Also convert
TUNABLE_INT() initializers for V_ variables to TUNABLE_FETCH_INT() in
initializer functions.

Discussed at:	devsummit Strassburg
Reviewed by:	bz, julian
Approved by:	julian (mentor)
Obtained from:	//depot/projects/vimage-commit2/...
X-MFC after:	never
Sponsored by:	NLnet Foundation, The FreeBSD Foundation
2008-12-10 23:12:39 +00:00
bz
30b801e183 It does not make much sense to include net/route.h twice.
Remove one #include.
2008-12-09 21:09:05 +00:00
bz
1b7a712c9a Add rwlock.h (and lock.h for that) to keep no-INET kernels compiling
after RADIX_NODE_HEAD_{,UN}LOCK() were added.  Must have been "learned"
by pollution before (most likely: route.h -> radix.h -> rwlock.h)
2008-12-09 20:05:58 +00:00
bz
604d89458a Rather than using hidden includes (with cicular dependencies),
directly include only the header files needed. This reduces the
unneeded spamming of various headers into lots of files.

For now, this leaves us with very few modules including vnet.h
and thus needing to depend on opt_route.h.

Reviewed by:	brooks, gnn, des, zec, imp
Sponsored by:	The FreeBSD Foundation
2008-12-02 21:37:28 +00:00
bz
d2730d5b27 MFp4:
Bring in updated jail support from bz_jail branch.

This enhances the current jail implementation to permit multiple
addresses per jail. In addtion to IPv4, IPv6 is supported as well.
Due to updated checks it is even possible to have jails without
an IP address at all, which basically gives one a chroot with
restricted process view, no networking,..

SCTP support was updated and supports IPv6 in jails as well.

Cpuset support permits jails to be bound to specific processor
sets after creation.

Jails can have an unrestricted (no duplicate protection, etc.) name
in addition to the hostname. The jail name cannot be changed from
within a jail and is considered to be used for management purposes
or as audit-token in the future.

DDB 'show jails' command was added to aid debugging.

Proper compat support permits 32bit jail binaries to be used on 64bit
systems to manage jails. Also backward compatibility was preserved where
possible: for jail v1 syscalls, as well as with user space management
utilities.

Both jail as well as prison version were updated for the new features.
A gap was intentionally left as the intermediate versions had been
used by various patches floating around the last years.

Bump __FreeBSD_version for the afore mentioned and in kernel changes.

Special thanks to:
- Pawel Jakub Dawidek (pjd) for his multi-IPv4 patches
  and Olivier Houchard (cognet) for initial single-IPv6 patches.
- Jeff Roberson (jeff) and Randall Stewart (rrs) for their
  help, ideas and review on cpuset and SCTP support.
- Robert Watson (rwatson) for lots and lots of help, discussions,
  suggestions and review of most of the patch at various stages.
- John Baldwin (jhb) for his help.
- Simon L. Nielsen (simon) as early adopter testing changes
  on cluster machines as well as all the testers and people
  who provided feedback the last months on freebsd-jail and
  other channels.
- My employer, CK Software GmbH, for the support so I could work on this.

Reviewed by:	(see above)
MFC after:	3 months (this is just so that I get the mail)
X-MFC Before:   7.2-RELEASE if possible
2008-11-29 14:32:14 +00:00
zec
95a15f5c84 Merge more of currently non-functional (i.e. resolving to
whitespace) macros from p4/vimage branch.

Do a better job at enclosing all instantiations of globals
scheduled for virtualization in #ifdef VIMAGE_GLOBALS blocks.

De-virtualize and mark as const saorder_state_alive and
saorder_state_any arrays from ipsec code, given that they are never
updated at runtime, so virtualizing them would be pointless.

Reviewed by:  bz, julian
Approved by:  julian (mentor)
Obtained from:        //depot/projects/vimage-commit2/...
X-MFC after:  never
Sponsored by: NLnet Foundation, The FreeBSD Foundation
2008-11-26 22:32:07 +00:00
sam
a5ea7c5903 use consistent style 2008-11-24 17:34:00 +00:00
kmacy
9d3bb599b1 - bump __FreeBSD version to reflect added buf_ring, memory barriers,
and ifnet functions

- add memory barriers to <machine/atomic.h>
- update drivers to only conditionally define their own

- add lockless producer / consumer ring buffer
- remove ring buffer implementation from cxgb and update its callers

- add if_transmit(struct ifnet *ifp, struct mbuf *m) to ifnet to
  allow drivers to efficiently manage multiple hardware queues
  (i.e. not serialize all packets through one ifq)
- expose if_qflush to allow drivers to flush any driver managed queues

This work was supported by Bitgravity Inc. and Chelsio Inc.
2008-11-22 05:55:56 +00:00
zec
815d52c5df Change the initialization methodology for global variables scheduled
for virtualization.

Instead of initializing the affected global variables at instatiation,
assign initial values to them in initializer functions.  As a rule,
initialization at instatiation for such variables should never be
introduced again from now on.  Furthermore, enclose all instantiations
of such global variables in #ifdef VIMAGE_GLOBALS blocks.

Essentialy, this change should have zero functional impact.  In the next
phase of merging network stack virtualization infrastructure from
p4/vimage branch, the new initialization methology will allow us to
switch between using global variables and their counterparts residing in
virtualization containers with minimum code churn, and in the long run
allow us to intialize multiple instances of such container structures.

Discussed at:	devsummit Strassburg
Reviewed by:	bz, julian
Approved by:	julian (mentor)
Obtained from:	//depot/projects/vimage-commit2/...
X-MFC after:	never
Sponsored by:	NLnet Foundation, The FreeBSD Foundation
2008-11-19 09:39:34 +00:00
bz
c6a123e24a Include if_arp.h for IFP2AC so that the netgraph parts in if.c
are happy even if compiled without INET or INET6.

MFC after:	2 months
2008-11-06 15:26:09 +00:00
des
66f807ed8b Retire the MALLOC and FREE macros. They are an abomination unto style(9).
MFC after:	3 months
2008-10-23 15:53:51 +00:00
zec
8797d4caec Step 1.5 of importing the network stack virtualization infrastructure
from the vimage project, as per plan established at devsummit 08/08:
http://wiki.freebsd.org/Image/Notes200808DevSummit

Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator
macros, and CURVNET_SET() context setting macros, all currently
resolving to NOPs.

Prepare for virtualization of selected SYSCTL objects by introducing a
family of SYSCTL_V_*() macros, currently resolving to their global
counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT().

Move selected #defines from sys/sys/vimage.h to newly introduced header
files specific to virtualized subsystems (sys/net/vnet.h,
sys/netinet/vinet.h etc.).

All the changes are verified to have zero functional impact at this
point in time by doing MD5 comparision between pre- and post-change
object files(*).

(*) netipsec/keysock.c did not validate depending on compile time options.

Implemented by:	julian, bz, brooks, zec
Reviewed by:	julian, bz, brooks, kris, rwatson, ...
Approved by:	julian (mentor)
Obtained from:	//depot/projects/vimage-commit2/...
X-MFC after:	never
Sponsored by:	NLnet Foundation, The FreeBSD Foundation
2008-10-02 15:37:58 +00:00
ed
4efdef565f Replace all calls to minor() with dev2unit().
After I removed all the unit2minor()/minor2unit() calls from the kernel
yesterday, I realised calling minor() everywhere is quite confusing.
Character devices now only have the ability to store a unit number, not
a minor number. Remove the confusion by using dev2unit() everywhere.

This commit could also be considered as a bug fix. A lot of drivers call
minor(), while they should actually be calling dev2unit(). In -CURRENT
this isn't a problem, but it turns out we never had any problem reports
related to that issue in the past. I suspect not many people connect
more than 256 pieces of the same hardware.

Reviewed by:	kib
2008-09-27 08:51:18 +00:00
ed
4212d51a7d Remove unit2minor() use from kernel code.
When I changed kern_conf.c three months ago I made device unit numbers
equal to (unneeded) device minor numbers. We used to require
bitshifting, because there were eight bits in the middle that were
reserved for a device major number. Not very long after I turned
dev2unit(), minor(), unit2minor() and minor2unit() into macro's.
The unit2minor() and minor2unit() macro's were no-ops.

We'd better not remove these four macro's from the kernel, because there
is a lot of (external) code that may still depend on them. For now it's
harmless to remove all invocations of unit2minor() and minor2unit().

Reviewed by:	kib
2008-09-26 14:19:52 +00:00
bz
4e18e7c8f4 Make the checks for ptp interfaces in ifa_ifwithdstaddr() and
ifa_ifwithnet() look more similar by comparing the pointer to NULL
in both cases.

MFC after:	3 months
2008-08-24 11:03:43 +00:00
thompsa
fb39793d41 ifnet_setbyindex() is only used locally, go back to being static. 2008-08-20 05:00:18 +00:00
julian
0592958505 A bunch of formatting fixes brough to light by, or created by the Vimage commit
a few days ago.
2008-08-20 01:05:56 +00:00
bz
1021d43b56 Commit step 1 of the vimage project, (network stack)
virtualization work done by Marko Zec (zec@).

This is the first in a series of commits over the course
of the next few weeks.

Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.

We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.

Obtained from:	//depot/projects/vimage-commit2/...
Reviewed by:	brooks, des, ed, mav, julian,
		jamie, kris, rwatson, zec, ...
		(various people I forgot, different versions)
		md5 (with a bit of help)
Sponsored by:	NLnet Foundation, The FreeBSD Foundation
X-MFC after:	never
V_Commit_Message_Reviewed_By:	more people than the patch
2008-08-17 23:27:27 +00:00
rwatson
46dd6e44fc Introduce locking around use of ifindex_table, whose use was previously
unsynchronized.  While races were extremely rare, we've now had a
couple of reports of panics in environments involving large numbers of
IPSEC tunnels being added very quickly on an active system.

- Add accessor functions ifnet_byindex(), ifaddr_byindex(),
  ifdev_byindex() to replace existing accessor macros.  These functions
  now acquire the ifnet lock before derefencing the table.
- Add IFNET_WLOCK_ASSERT().
- Add static accessor functions ifnet_setbyindex(), ifdev_setbyindex(),
  which set values in the table either asserting of acquiring the ifnet
  lock.
- Use accessor functions throughout if.c to modify and read
  ifindex_table.
- Rework ifnet attach/detach to lock around ifindex_table modification.

Note that these changes simply close races around use of ifindex_table,
and make no attempt to solve the probem of disappearing ifnets.  Further
refinement of this work, including with respect to ifindex_table
resizing, is still required.

In a future change, the ifnet lock should be converted from a mutex to an
rwlock in order to reduce contention.

Reviewed and tested by:	brooks
2008-06-26 23:05:28 +00:00
brooks
926e3bf55c The if_check() function performed three actions:
- verified that the ifp->if_snd.ifq_mtx was initalized for
   all attached interfaces.  This was pointless because it was
   initalized for all interfaces in if_attach() so I've removed it.
 - Checked that ifp->if_snd.ifq_maxlen is initalized and set it to
   ifqmaxlen if unset.  This makes more sense in if_attach() so
   I moved it there.
 - The first call of if_slowtimo().  Delete if_check() and call
   if_slowtimo() directly from the SYSINIT().
2008-05-17 03:38:13 +00:00
julian
1dfc5c98a4 Add code to allow the system to handle multiple routing tables.
This particular implementation is designed to be fully backwards compatible
and to be MFC-able to 7.x (and 6.x)

Currently the only protocol that can make use of the multiple tables is IPv4
Similar functionality exists in OpenBSD and Linux.

From my notes:

-----

  One thing where FreeBSD has been falling behind, and which by chance I
  have some time to work on is "policy based routing", which allows
  different
  packet streams to be routed by more than just the destination address.

  Constraints:
  ------------

  I want to make some form of this available in the 6.x tree
  (and by extension 7.x) , but FreeBSD in general needs it so I might as
  well do it in -current and back port the portions I need.

  One of the ways that this can be done is to have the ability to
  instantiate multiple kernel routing tables (which I will now
  refer to as "Forwarding Information Bases" or "FIBs" for political
  correctness reasons). Which FIB a particular packet uses to make
  the next hop decision can be decided by a number of mechanisms.
  The policies these mechanisms implement are the "Policies" referred
  to in "Policy based routing".

  One of the constraints I have if I try to back port this work to
  6.x is that it must be implemented as a EXTENSION to the existing
  ABIs in 6.x so that third party applications do not need to be
  recompiled in timespan of the branch.

  This first version will not have some of the bells and whistles that
  will come with later versions. It will, for example, be limited to 16
  tables in the first commit.
  Implementation method, Compatible version. (part 1)
  -------------------------------
  For this reason I have implemented a "sufficient subset" of a
  multiple routing table solution in Perforce, and back-ported it
  to 6.x. (also in Perforce though not  always caught up with what I
  have done in -current/P4). The subset allows a number of FIBs
  to be defined at compile time (8 is sufficient for my purposes in 6.x)
  and implements the changes needed to allow IPV4 to use them. I have not
  done the changes for ipv6 simply because I do not need it, and I do not
  have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it.

  Other protocol families are left untouched and should there be
  users with proprietary protocol families, they should continue to work
  and be oblivious to the existence of the extra FIBs.

  To understand how this is done, one must know that the current FIB
  code starts everything off with a single dimensional array of
  pointers to FIB head structures (One per protocol family), each of
  which in turn points to the trie of routes available to that family.

  The basic change in the ABI compatible version of the change is to
  extent that array to be a 2 dimensional array, so that
  instead of protocol family X looking at rt_tables[X] for the
  table it needs, it looks at rt_tables[Y][X] when for all
  protocol families except ipv4 Y is always 0.
  Code that is unaware of the change always just sees the first row
  of the table, which of course looks just like the one dimensional
  array that existed before.

  The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
  are all maintained, but refer only to the first row of the array,
  so that existing callers in proprietary protocols can continue to
  do the "right thing".
  Some new entry points are added, for the exclusive use of ipv4 code
  called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(),
  which have an extra argument which refers the code to the correct row.

  In addition, there are some new entry points (currently called
  rtalloc_fib() and friends) that check the Address family being
  looked up and call either rtalloc() (and friends) if the protocol
  is not IPv4 forcing the action to row 0 or to the appropriate row
  if it IS IPv4 (and that info is available). These are for calling
  from code that is not specific to any particular protocol. The way
  these are implemented would change in the non ABI preserving code
  to be added later.

  One feature of the first version of the code is that for ipv4,
  the interface routes show up automatically on all the FIBs, so
  that no matter what FIB you select you always have the basic
  direct attached hosts available to you. (rtinit() does this
  automatically).

  You CAN delete an interface route from one FIB should you want
  to but by default it's there. ARP information is also available
  in each FIB. It's assumed that the same machine would have the
  same MAC address, regardless of which FIB you are using to get
  to it.

  This brings us as to how the correct FIB is selected for an outgoing
  IPV4 packet.

  Firstly, all packets have a FIB associated with them. if nothing
  has been done to change it, it will be FIB 0. The FIB is changed
  in the following ways.

  Packets fall into one of a number of classes.

  1/ locally generated packets, coming from a socket/PCB.
     Such packets select a FIB from a number associated with the
     socket/PCB. This in turn is inherited from the process,
     but can be changed by a socket option. The process in turn
     inherits it on fork. I have written a utility call setfib
     that acts a bit like nice..

         setfib -3 ping target.example.com # will use fib 3 for ping.

     It is an obvious extension to make it a property of a jail
     but I have not done so. It can be achieved by combining the setfib and
     jail commands.

  2/ packets received on an interface for forwarding.
     By default these packets would use table 0,
     (or possibly a number settable in a sysctl(not yet)).
     but prior to routing the firewall can inspect them (see below).
     (possibly in the future you may be able to associate a FIB
     with packets received on an interface..  An ifconfig arg, but not yet.)

  3/ packets inspected by a packet classifier, which can arbitrarily
     associate a fib with it on a packet by packet basis.
     A fib assigned to a packet by a packet classifier
     (such as ipfw) would over-ride a fib associated by
     a more default source. (such as cases 1 or 2).

  4/ a tcp listen socket associated with a fib will generate
     accept sockets that are associated with that same fib.

  5/ Packets generated in response to some other packet (e.g. reset
     or icmp packets). These should use the FIB associated with the
     packet being reponded to.

  6/ Packets generated during encapsulation.
     gif, tun and other tunnel interfaces will encapsulate using the FIB
     that was in effect withthe proces that set up the tunnel.
     thus setfib 1 ifconfig gif0 [tunnel instructions]
     will set the fib for the tunnel to use to be fib 1.

  Routing messages would be associated with their
  process, and thus select one FIB or another.
  messages from the kernel would be associated with the fib they
  refer to and would only be received by a routing socket associated
  with that fib. (not yet implemented)

  In addition Netstat has been edited to be able to cope with the
  fact that the array is now 2 dimensional. (It looks in system
  memory using libkvm (!)). Old versions of netstat see only the first FIB.

  In addition two sysctls are added to give:
  a) the number of FIBs compiled in (active)
  b) the default FIB of the calling process.

  Early testing experience:
  -------------------------

  Basically our (IronPort's) appliance does this functionality already
  using ipfw fwd but that method has some drawbacks.

  For example,
  It can't fully simulate a routing table because it can't influence the
  socket's choice of local address when a connect() is done.

  Testing during the generating of these changes has been
  remarkably smooth so far. Multiple tables have co-existed
  with no notable side effects, and packets have been routes
  accordingly.

  ipfw has grown 2 new keywords:

  setfib N ip from anay to any
  count ip from any to any fib N

  In pf there seems to be a requirement to be able to give symbolic names to the
  fibs but I do not have that capacity. I am not sure if it is required.

  SCTP has interestingly enough built in support for this, called VRFs
  in Cisco parlance. it will be interesting to see how that handles it
  when it suddenly actually does something.

  Where to next:
  --------------------

  After committing the ABI compatible version and MFCing it, I'd
  like to proceed in a forward direction in -current. this will
  result in some roto-tilling in the routing code.

  Firstly: the current code's idea of having a separate tree per
  protocol family, all of the same format, and pointed to by the
  1 dimensional array is a bit silly. Especially when one considers that
  there is code that makes assumptions about every protocol having the
  same internal structures there. Some protocols don't WANT that
  sort of structure. (for example the whole idea of a netmask is foreign
  to appletalk). This needs to be made opaque to the external code.

  My suggested first change is to add routing method pointers to the
  'domain' structure, along with information pointing the data.
  instead of having an array of pointers to uniform structures,
  there would be an array pointing to the 'domain' structures
  for each protocol address domain (protocol family),
  and the methods this reached would be called. The methods would have
  an argument that gives FIB number, but the protocol would be free
  to ignore it.

  When the ABI can be changed it raises the possibilty of the
  addition of a fib entry into the "struct route". Currently,
  the structure contains the sockaddr of the desination, and the resulting
  fib entry. To make this work fully, one could add a fib number
  so that given an address and a fib, one can find the third element, the
  fib entry.

  Interaction with the ARP layer/ LL layer would need to be
  revisited as well. Qing Li has been working on this already.

  This work was sponsored by Ironport Systems/Cisco

Reviewed by:    several including rwatson, bz and mlair (parts each)
Obtained from:  Ironport systems/Cisco
2008-05-09 23:03:00 +00:00
brooks
bf4b91bf6a Delay the global registration of the struct ifnet in if_alloc() until after
we're certain the allocation will entierly succeed.  This fixes a leak in a
fairly unlikely case.

Reported by:	vijay singh <vijjus at rocketmail dot com>
MFC after:	1 week
2008-04-19 22:04:51 +00:00
sam
d5c642ca44 expose if_purgemaddrs, it will be used by the vap code unless someone
redesigns the mcast support code in the next few weeks

MFC after:	3 weeks
2008-03-25 21:23:32 +00:00
rwatson
877d7c65ba In keeping with style(9)'s recommendations on macros, use a ';'
after each SYSINIT() macro invocation.  This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.

MFC after:	1 month
Discussed with:	imp, rink
2008-03-16 10:58:09 +00:00
rwatson
763a53ea3e Move IFF_NEEDSGIANT warning from if_ethersubr.c to if.c so it is displayed
for all network interfaces, not just ethernet-like ones.

Upgrade it to a louder WARNING and be explicit that the flag is obsolete.
Support for IFF_NEEDSGIANT will be removed in a few months (see arch@ for
details) and will not appear in 8.0.

Upgrade if_watchdog to a WARNING.
2008-03-07 16:00:44 +00:00
rwatson
60570a92bf Merge first in a series of TrustedBSD MAC Framework KPI changes
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:

  mac_<object>_<method/action>
  mac_<object>_check_<method/action>

The previous naming scheme was inconsistent and mostly
reversed from the new scheme.  Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier.  Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods.  Also simplify, slightly,
some entry point names.

All MAC policy modules will need to be recompiled, and modules
not updates as part of this commit will need to be modified to
conform to the new KPI.

Sponsored by:	SPARTA (original patches against Mac OS X)
Obtained from:	TrustedBSD Project, Apple Computer
2007-10-24 19:04:04 +00:00
rwatson
c29e74320b First in a series of changes to remove the now-unused Giant compatibility
framework for non-MPSAFE network protocols:

- Remove debug_mpsafenet variable, sysctl, and tunable.
- Remove NET_NEEDS_GIANT() and associate SYSINITSs used by it to force
  debug.mpsafenet=0 if non-MPSAFE protocols are compiled into the kernel.
- Remove logic to automatically flag interrupt handlers as non-MPSAFE if
  debug.mpsafenet is set for an INTR_TYPE_NET handler.
- Remove logic to automatically flag netisr handlers as non-MPSAFE if
  debug.mpsafenet is set.
- Remove references in a few subsystems, including NFS and Cronyx drivers,
  which keyed off debug_mpsafenet to determine various aspects of their own
  locking behavior.
- Convert NET_LOCK_GIANT(), NET_UNLOCK_GIANT(), and NET_ASSERT_GIANT into
  no-op's, as their entire behavior was determined by the value in
  debug_mpsafenet.
- Alias NET_CALLOUT_MPSAFE to CALLOUT_MPSAFE.

Many remaining references to NET_.*_GIANT() and NET_CALLOUT_MPSAFE are still
present in subsystems, and will be removed in followup commits.

Reviewed by:	bz, jhb
Approved by:	re (kensmith)
2007-07-27 11:59:57 +00:00
brooks
1f80599046 Update the comments on if_alloc(), if_free(), if_free_type(), and
if_attach.

Remove a comment about pre-3.0 network drivers from if_attach().

Be a bit more consistant about whitespace near comments.
2007-05-16 19:59:01 +00:00
thompsa
5fc175b7b4 Rename the trunk(4) driver to lagg(4) as it is too similar to vlan trunking.
The name trunk is misused as the networking term trunk means carrying multiple
VLANs over a single connection. The IEEE standard for link aggregation (802.3
section 3) does not talk about 'trunk' at all while it is used throughout IEEE
802.1Q in describing vlans.

The lagg(4) driver provides link aggregation, failover and fault tolerance.

Discussed on:	current@
2007-04-17 00:35:11 +00:00
thompsa
0f00c64853 Add the trunk(4) driver for providing link aggregation, failover and fault
tolerance.  This driver allows aggregation of multiple network interfaces as
one virtual interface using a number of different protocols/algorithms.

failover    - Sends traffic through the secondary port if the master becomes
              inactive.
fec         - Supports Cisco Fast EtherChannel.
lacp        - Supports the IEEE 802.3ad Link Aggregation Control Protocol
              (LACP) and the Marker Protocol.
loadbalance - Static loadbalancing using an outgoing hash.
roundrobin  - Distributes outgoing traffic using a round-robin scheduler
              through all active ports.

This code was obtained from OpenBSD and this also includes 802.3ad LACP support
from agr(4) in NetBSD.
2007-04-10 00:27:25 +00:00
bms
441b1fffcf Fix a case where hardware removal of an interface caused an attempt to
announce an ll_ifma which has gone away. Add a KASSERT to catch regressions.

Bug found by:	Tom Uffner
2007-03-27 16:11:28 +00:00
bms
a5925f917c Fix tinderbox; ng_ether needs to see if_findmulti(). 2007-03-20 03:15:43 +00:00
bms
4ffc004901 Implement reference counting for ifmultiaddr, in_multi, and in6_multi
structures. Detect when ifnet instances are detached from the network
stack and perform appropriate cleanup to prevent memory leaks.

This has been implemented in such a way as to be backwards ABI compatible.
Kernel consumers are changed to use if_delmulti_ifma(); in_delmulti()
is unable to detect interface removal by design, as it performs searches
on structures which are removed with the interface.

With this architectural change, the panics FreeBSD users have experienced
with carp and pfsync should be resolved.

Obtained from:	p4 branch bms_netdev
Reviewed by:	andre
Sponsored by:	Garance A Drosehn
Idea from:	NetBSD
MFC after:	1 month
2007-03-20 00:36:10 +00:00
bms
aaa1e7fb11 Fix a bug in if_findmulti(), whereby it would not find (and thus delete)
a link-layer multicast group membership.
Such memberships are needed in order to support protocols such as
IS-IS without putting the interface into PROMISC or ALLMULTI modes.

sa_equal() is not OK for comparing sockaddr_dl as it has deeper structure
than a simple byte array, so add sa_dl_equal() and use that instead.

Reviewed by:	rwatson
Verified with:	/usr/sbin/mtest
Bug found by:	Jouke Witteveen
MFC after:	2 weeks
2007-02-22 00:14:02 +00:00
glebius
068ffeee72 The recent issues with em(4) interface has shown that the old 4.4BSD
if_watchdog/if_timer interface doesn't fit modern SMP network
stack design.

Device drivers that need watchdog to monitor their hardware should
implement it theirselves.

Eventually the if_watchdog/if_timer API will be removed. For now,
warn that driver uses it.

Reviewed by:	scottl
2006-11-30 15:02:01 +00:00
rwatson
10d0d9cf47 Sweep kernel replacing suser(9) calls with priv(9) calls, assigning
specific privilege names to a broad range of privileges.  These may
require some future tweaking.

Sponsored by:           nCircle Network Security, Inc.
Obtained from:          TrustedBSD Project
Discussed on:           arch@
Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri,
                        Alex Lyashkov <umka at sevcity dot net>,
                        Skip Ford <skip dot ford at verizon dot net>,
                        Antoine Brodin <antoine dot brodin at laposte dot net>
2006-11-06 13:42:10 +00:00
rwatson
7beaaf5cd2 Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h
begun with a repo-copy of mac.h to mac_framework.h.  sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.

This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.

Obtained from:	TrustedBSD Project
Sponsored by:	SPARTA
2006-10-22 11:52:19 +00:00
andre
f044a1949b Fix the socket option IP_ONESBCAST by giving it its own case in ip_output()
and skip over the normal IP processing.

Add a supporting function ifa_ifwithbroadaddr() to verify and validate the
supplied subnet broadcast address.

PR:		kern/99558
Tested by:	Andrey V. Elsukov <bu7cher-at-yandex.ru>
Sponsored by:	TCP/IP Optimization Fundraise 2005
MFC after:	3 days
2006-09-06 17:12:10 +00:00