Commit Graph

2648 Commits

Author SHA1 Message Date
Julian Elischer
0b4b0b0fee Virtualize the pfil hooks so that different jails may chose different
packet filters. ALso allows ipfw to be enabled on on ejail and disabled
on another. In 8.0 it's a global setting.

Sitting aroung in tree waiting to commit for: 2 months
MFC after:	2 months
2009-10-11 05:59:43 +00:00
Bjoern A. Zeeb
db44ff4047 Put #ifdef INET around parts of the FLOWTABLE code, to unbreak
nooptions INET kernel builds.

MFC after:	3 days
X-MFC:		with r197687
2009-10-03 10:56:03 +00:00
Qing Li
e5c610d659 The flow-table associates TCP/UDP flows and IP destinations with
specific routes. When the routing table changes, for example,
when a new route with a more specific prefix is inserted into the
routing table, the flow-table is not updated to reflect that change.
As such existing connections cannot take advantage of the new path.
In some cases the path is broken. This patch will update the affected
flow-table entries when a more specific route is added. The route
entry is properly marked when a route is deleted from the table.
In this case, when the flow-table performs a search, the stale
entry is updated automatically. Therefore this patch is not
necessary for route deletion.

Submitted by:	simon, phk
Reviewed by:	bz, kmacy
MFC after:	3 days
2009-10-01 20:32:29 +00:00
Qing Li
46e7f9838b A wrong variable is used when setting up the interface
address route, which broke source address selection in
some code paths.

Submitted by:	noted by bz
Reviewed by:	hrs
MFC after:	immediately
2009-09-20 17:22:19 +00:00
Marko Zec
38d61195b8 Style fix - break too long a line in two.
Spotted by:	bz
MFC after:	3 days
2009-09-18 09:03:23 +00:00
Marko Zec
989e04112b V_irtualize the lltables list, making ARP and ND reasonably
usable again with options VIMAGE kernels.

Submitted by:	bz (the original version, probably identical to this one)
Reviewed by:	many @ DevSummit Cambridge
MFC after:	3 days
2009-09-17 14:52:15 +00:00
Qing Li
9bb7d0f47a Self pointing routes are installed for configured interface addresses
and address aliases. After an interface is brought down and brought
back up again, those self pointing routes disappeared. This patch
ensures after an interface is brought back up, the loopback routes
are reinstalled properly.

Reviewed by:	bz
MFC after:	immediately
2009-09-15 19:18:34 +00:00
Robert Watson
e76d823b81 Use C99 initialization for struct filterops.
Obtained from:	Mac OS X
Sponsored by:	Apple Inc.
MFC after:	3 weeks
2009-09-12 20:03:45 +00:00
Ed Maste
1bdc73d337 Compare pointer with NULL, not 0. 2009-09-09 03:36:43 +00:00
Navdeep Parhar
9a31144537 Add arp_update_event. This replaces route_arp_update_event, which
has not worked since the arp-v2 rewrite.

The event handler will be called with the llentry write-locked and
can examine la_flags to determine whether the entry is being added
or removed.

Reviewed by:	gnn, kmacy
Approved by:	gnn (mentor)
MFC after:	1 month
2009-09-08 21:17:17 +00:00
Qing Li
d134008aa0 The addresses that are assigned to the loopback interface
should be part of the kernel routing table.

Reviewed by:	bz
MFC after:	immediately
2009-09-05 20:24:37 +00:00
Qing Li
9452b0d2de This patch fixes the following issues:
- Interface link-local address is not reachable within the
  node that owns the interface, this is due to the mismatch
  in address scope as the result of the installed interface
  address loopback route. Therefore for each interface
  address loopback route, the rt_gateway field (of AF_LINK
  type) will be used to track which interface a given
  address belongs to. This will aid the address source to
  use the proper interface for address scope/zone validation.
- The loopback address is not reachable. The root cause is
  the same as the above.
- Empty nd6 entries are created for the IPv6 loopback addresses
  only for validation reason. Doing so will eliminate as much
  of the special case (loopback addresses) handling code
  as possible, however, these empty nd6 entries should not
  be returned to the userland applications such as the
  "ndp" command.
Since both of the above issues contain common files, these
files are committed together.

Reviewed by:	bz
MFC after:	immediately
2009-09-05 16:43:16 +00:00
George V. Neville-Neil
54fc657d59 Add ARP statistics to the kernel and netstat.
New counters now exist for:
requests sent
replies sent
requests received
replies received
packets received
total packets dropped due to no ARP entry
entrys timed out
Duplicate IPs seen

The new statistics are seen in the netstat command
when it is given the -s command line switch.

MFC after:	2 weeks
In collaboration with: bz
2009-09-03 21:10:57 +00:00
Qing Li
5311e988ea As part of r196609, a call to "rtalloc" did not take the fib into
account. So call the appropriate "rtalloc_ign_fib()" instead of
calling "rtalloc_ign()".

Reviewed by:i	pointed out by bz
MFC after:	immediately
2009-08-31 00:14:37 +00:00
Marko Zec
a99fcfd4ca Introduce a separate sx lock for protecting lists of vnet sysinit
and sysuninit handlers.

Previously, sx_vnet, which is a lock designated for protecting
the vnet list, was (ab)used for protecting vnet sysinit / sysuninit
handler lists as well.  Holding exclusively the sx_vnet lock while
invoking sysinit and / or sysuninit handlers turned out to be
problematic, since some of the handlers may attempt to wake up
another thread and wait for it to walk over the vnet list, hence
acquire a shared lock on sx_vnet, which in turn leads to a deadlock.
Protecting vnet sysinit / sysuninit lists with a separate lock
mitigates this issue, which was first observed with
flowtable_flush() / flowtable_cleaner() in sys/net/flowtable.c.

Reviewed by:	rwatson, jhb
MFC after:	3 days
2009-08-28 22:30:55 +00:00
Qing Li
9231d35f4d In ip_output(), the flow-table module must not try to cache L2/L3
information for interface of IFF_POINTOPOINT or IFF_LOOPBACK type.
Since the L2 information (rt_lle) is invalid for these interface
types, accidental caching attempt will trigger panic when the invalid
rt_lle reference is accessed.

When installing a new route, or when updating an existing route, the
user supplied gateway address may be an interface address (this is
particularly true for point-to-point interface related modules such
as ppp, if_tun, if_gif). Currently the routing command handler always
set the RTF_GATEWAY flag if the gateway address is given as part of the
command paramters. Therefore the gateway address must be verified against
interface addresses or else the route would be treated as an indirect
route, thus making that route unusable.

Reviewed by:	kmacy, julia, rwatson
Verified by:	marcus
MFC after:	3 days
2009-08-28 07:01:09 +00:00
Robert Watson
ed2dabfc68 Add IFNET_HOLD reserved pointer value for the ifindex ifnet array,
which allows an index to be reserved for an ifnet without making
the ifnet available for management operations.  Use this in if_alloc()
while the ifnet lock is released between initial index allocation and
completion of ifnet initialization.

Add ifindex_free() to centralize the implementation of releasing an
ifindex value.  Use in if_free() and if_vmove(), as well as when
releasing a held index in if_alloc().

Reviewed by:	bz
MFC after:	3 days
2009-08-26 11:13:10 +00:00
Robert Watson
61f6986b07 Break out allocation of new ifindex values from if_alloc() and if_vmove(),
and centralize in a single function ifindex_alloc().  Assert the
IFNET_WLOCK, and add missing IFNET_WLOCK in if_alloc().  This does not
close all known races in this code.

Reviewed by:	bz
MFC after:	3 days
2009-08-25 20:21:16 +00:00
Robert Watson
dc56e98f0d Use locks specific to the lltable code, rather than borrow the ifnet
list/index locks, to protect link layer address tables.  This avoids
lock order issues during interface teardown, but maintains the bug that
sysctl copy routines may be called while a non-sleepable lock is held.

Reviewed by:	bz, kmacy
MFC after:	3 days
2009-08-25 09:52:38 +00:00
Jack F Vogel
3de029efaf When bridging LRO is causing a problem, the believe
that it would work as long as all interfaces have TSO
seems to be false, until the matter gets sorted out
just disable LRO completely.
2009-08-24 21:04:51 +00:00
Robert Watson
8e937462f4 Make if_grow static -- it's not used outside of if.c, and with the
internals destined to change, it's better if it remains that way.

MFC after:	3 days
2009-08-24 12:52:05 +00:00
Marko Zec
52db6805ea When moving ifnets from one vnet to another, and the ifnet
has ifaddresses of AF_LINK type which thus have an embedded
if_index "backpointer", we must update that if_index backpointer
to reflect the new if_index that our ifnet just got assigned.

This change affects only options VIMAGE builds.

Submitted by:	bz
Reviewed by:	bz
Approved by:	re (rwatson), julian (mentor)
2009-08-24 10:14:09 +00:00
Robert Watson
6852110b64 Rather than using IFNET_RLOCK() when iterating over (and modifying) the
ifnet list during if_ef load, directly acquire the ifnet_sxlock
exclusively.  That way when if_alloc() recurses the lock, it's a write
recursion rather than a read->write recursion.

This code structure is arguably a bug, so add a comment indicating that
this is the case.  Post-8.0, we should fix this, but this commit
resolves panic-on-load for if_ef.

Discussed with:	bz, julian
Reported by:	phk
MFC after:	3 days
2009-08-23 21:00:21 +00:00
Robert Watson
77dfcdc445 Rework global locks for interface list and index management, correcting
several critical bugs, including race conditions and lock order issues:

Replace the single rwlock, ifnet_lock, with two locks, an rwlock and an
sxlock.  Either can be held to stablize the lists and indexes, but both
are required to write.  This allows the list to be held stable in both
network interrupt contexts and sleepable user threads across sleeping
memory allocations or device driver interactions.  As before, writes to
the interface list must occur from sleepable contexts.

Reviewed by:	bz, julian
MFC after:	3 days
2009-08-23 20:40:19 +00:00
Julian Elischer
cd81cd3fd1 Don't allow access to the internals until it has all been set up.
Specifically, not until the per-vnet parts have been set up.

Submitted by:	kmacy@
Reviewed by:	julian@, zec@
Approved by:	re(rwatson)
MFC after:	immediately
2009-08-21 09:22:32 +00:00
Kip Macy
6d37c3ecd9 This change fixes a comment and addresses a complaint by kib@ by
moving a frequently executed flowtable syslog statement from being
conditional on bootverbose to conditional on a per-vnet flowtable
sysctl.

Approved by:	re@
2009-08-19 20:13:09 +00:00
Kip Macy
3ee42584f9 - change the interface to flowtable_lookup so that we don't rely on
the mbuf for obtaining the fib index
 - check that a cached flow corresponds to the same fib index as the
   packet for which we are doing the lookup
 - at interface detach time flush any flows referencing stale rtentrys
   associated with the interface that is going away (fixes reported
   panics)
 - reduce the time between cleans in case the cleaner is running at
   the time the eventhandler is called and the wakeup is missed less
   time will elapse before the eventhandler returns
 - separate per-vnet initialization from global initialization
   (pointed out by jeli@)

Reviewed by:	sam@
Approved by:	re@
2009-08-18 20:28:58 +00:00
Kip Macy
d53e359b9a fix netboot issue by disabling flowtable lookups until initialization has been run
Reviewed by:	rwatson@
Approved by:	re@
2009-08-17 19:09:28 +00:00
Robert Watson
d931ea0961 Remove unused if_rawoutput() macro; it has been unused since at least
FreeBSD 2.

Approved by:	re (kib)
2009-08-15 22:26:26 +00:00
Marko Zec
9abb486279 Appease VNET_DEBUG - in if_vmove we temporarily switch i.e.
recurse from one vnet to another which is OK, so no need
to flood the console with warnings here.

Approved by:	re (rwatson), julian (mentor)
2009-08-14 22:46:45 +00:00
Marko Zec
67addcde86 Make VNET_DEBUG a standalone compile-time option, i.e. decouple it from
INVARIANTS.

Reviewed by:	bz
Approved by:	re (rwatson), julian (mentor)
2009-08-14 22:41:39 +00:00
Bjoern A. Zeeb
eb79e1c76e Make it possible to change the vnet sysctl variables on jails
with their own virtual network stack. Jails only inheriting a
network stack cannot change anything that cannot be changed from
within a prison.

Reviewed by:	rwatson, zec
Approved by:	re (kib)
2009-08-13 10:26:34 +00:00
Bjoern A. Zeeb
20b0cdb749 Put multiple instructions into a block when iterating; unbreaks
NET_RT_DUMP, which otherwise only returned information of AF_MAX.
This was broken in r193232 (save your time - my bug, my fix).

PR:		kern/137700
Reported by:	Larry Baird (lab gta.com)
Tested by:	Larry Baird (lab gta.com)
Reviewed by:	zec, lstewart, qing
Approved by:	re (kib)
2009-08-13 09:29:52 +00:00
Jung-uk Kim
a36599cce7 Always embed pointer to BPF JIT function in BPF descriptor
to avoid inconsistency when opt_bpf.h is not included.

Reviewed by:	rwatson
Approved by:	re (rwatson)
2009-08-12 17:28:53 +00:00
Bjoern A. Zeeb
281c86a4ef Update DDB show vnet command to print all used and available information.
Reviewed by:	rwatson, zec
Approved by:	re
2009-08-12 12:00:21 +00:00
Bjoern A. Zeeb
1b501e53f3 Put minimum alignment on the dpcpu and vnet section so that ld
when adding the __start_ symbol knows the expected section alignment
and can place the __start_ symbol correctly.

These sections will not support symbols with super-cache line alignment
requirements.

For full details, see posting to freebsd-current, 2009-08-10,
Message-ID: <20090810133111.C93661@maildrop.int.zabbadoz.net>.

Debugging and testing patches by:
		Kamigishi Rei (spambox haruhiism.net),
		np, lstewart, jhb, kib, rwatson
Tested by:	Kamigishi Rei, lstewart
Reviewed by:	kib
Approved by:	re
2009-08-12 10:26:03 +00:00
Robert Watson
315e3e38fa Many network stack subsystems use a single global data structure to hold
all pertinent statatistics for the subsystem.  These structures are
sometimes "borrowed" by kernel modules that require a place to store
statistics for similar events.

Add KPI accessor functions for statistics structures referenced by kernel
modules so that they no longer encode certain specifics of how the data
structures are named and stored.  This change is intended to make it
easier to move to per-CPU network stats following 8.0-RELEASE.

The following modules are affected by this change:

      if_bridge
      if_cxgb
      if_gif
      ip_mroute
      ipdivert
      pf

In practice, most of these statistics consumers should, in fact, maintain
their own statistics data structures rather than borrowing structures
from the base network stack.  However, that change is too agressive for
this point in the release cycle.

Reviewed by:	bz
Approved by:	re (kib)
2009-08-02 19:43:32 +00:00
Robert Watson
6aad5c1c93 The colour was red as shall be the letters of this warning to people upon
boot if the experimental VIMAGE feature was compiled into the kernel.

Submitted by:	bz
Reviewed by:	zec
Approved by:	re (vimage blanket)
2009-08-01 22:22:45 +00:00
Robert Watson
c8f6a13820 Minor style tweaks.
Approved by:	re (vimage blanket)
2009-08-01 21:58:32 +00:00
Robert Watson
6bc2c7b70c Make the vnet alloc/destroy paths a bit easier to followg by merging
vnet_data_init/vnet_data_destroy into vnet_alloc/vnet_destroy.

Reviewed by:	bz, zec
Approved by:	re (vimage blanket)
2009-08-01 21:54:15 +00:00
Robert Watson
7429a3f3d8 Remove vnet_foreach() utility function, which previously allowed
vnet.c to iterate virtual network stacks without being aware of
the implementation details previously hidden in kern_vimage.c.
Now they are in the same file, so remove this added complexity.

Reviewed by:	bz
Approved by:	re (vimage blanket)
2009-08-01 20:24:45 +00:00
Robert Watson
530c006014 Merge the remainder of kern_vimage.c and vimage.h into vnet.c and
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks.  Minor cleanups are done in the process,
and comments updated to reflect these changes.

Reviewed by:	bz
Approved by:	re (vimage blanket)
2009-08-01 19:26:27 +00:00
Robert Watson
ed3db012fc Reorder and recomment vnet.c and vnet.h on the basis that they are no longer
solely about the virtual network stack memory allocator.

Approved by:	re (vimage blanket)
2009-07-30 12:41:19 +00:00
Robert Watson
a9bcca799e Revise header comments for vnet.h as we now implement VNET_SYSINIT, not
just VNET_DEFINE in vnet.h.

Approved by:	re (vimage blanket)
2009-07-28 22:17:34 +00:00
Qing Li
9fca4f79c7 The new flow table caches both the routing table entry as well as the
L2 information. For an indirect route the cached L2 entry contains the
MAC address of the gateway. Typically the default route is used to
transmit multicast packets when explicit multicast routes are not
available. The ether_output() function bypasses L2 resolution function
if it verifies the L2 cache is valid, because the cached L2 address
(a unicast MAC address) is copied into the packets as the destination
MAC address. This validation, however, does not apply to broadcast and
multicast packets because the destination MAC address is mapped
according to a standard method instead.

Submitted by:	Xin Li
Reviewed by:	bz
Approved by:	re
2009-07-28 17:16:54 +00:00
Qing Li
df813b7ea2 This patch does the following:
- Allow loopback route to be installed for address assigned to
      interface of IFF_POINTOPOINT type.
    - Install loopback route for an IPv4 interface addreess when the
      "useloopback" sysctl variable is enabled. Similarly, install
      loopback route for an IPv6 interface address when the sysctl variable
      "nd6_useloopback" is enabled. Deleting loopback routes for interface
      addresses is unconditional in case these sysctl variables were
      disabled after an interface address has been assigned.

Reviewed by:	bz
Approved by:	re
2009-07-27 17:08:06 +00:00
Bjoern A. Zeeb
d0ea47437a Update epair(4) to the new netisr implementation and polish
things a bit:
- use dpcpu data to track the ifps with packets queued up,
- per-cpu locking and driver flags
- along with .nh_drainedcpu and NETISR_POLICY_CPU.
- Put the mbufs in flight reference count, preventing interfaces
  from going away, under INVARIANTS as this is a general problem
  of the stack and should be solved in if.c/netisr but still good
  to verify the internal queuing logic.
- Permit changing the MTU to virtually everythinkg like we do for loopback.

Hook epair(4) up to the build.

Approved by:	re (kib)
2009-07-26 12:20:07 +00:00
Bjoern A. Zeeb
be31e5e7b5 Make the in-kernel logic for the SIOCSIFVNET, SIOCSIFRVNET ioctls
(ifconfig ifN (-)vnet <jname|jid>) work correctly.

Move vi_if_move to if.c and split it up into two functions(*),
one for each ioctl.

In the reclaim case, correctly set the vnet before calling if_vmove.

Instead of silently allowing a move of an interface from the current
vnet to the current vnet, return an error. (*)

There is some duplicate interface name checking before actually moving
the interface between network stacks without locking and thus race
prone. Ideally if_vmove will correctly and automagically handle these
in the future.

Suggested by:	rwatson (*)
Approved by:	re (kib)
2009-07-26 11:29:26 +00:00
Robert Watson
d0728d7174 Introduce and use a sysinit-based initialization scheme for virtual
network stacks, VNET_SYSINIT:

- Add VNET_SYSINIT and VNET_SYSUNINIT macros to declare events that will
  occur each time a network stack is instantiated and destroyed.  In the
  !VIMAGE case, these are simply mapped into regular SYSINIT/SYSUNINIT.
  For the VIMAGE case, we instead use SYSINIT's to track their order and
  properties on registration, using them for each vnet when created/
  destroyed, or immediately on module load for already-started vnets.
- Remove vnet_modinfo mechanism that existed to serve this purpose
  previously, as well as its dependency scheme: we now just use the
  SYSINIT ordering scheme.
- Implement VNET_DOMAIN_SET() to allow protocol domains to declare that
  they want init functions to be called for each virtual network stack
  rather than just once at boot, compiling down to DOMAIN_SET() in the
  non-VIMAGE case.
- Walk all virtualized kernel subsystems and make use of these instead
  of modinfo or DOMAIN_SET() for init/uninit events.  In some cases,
  convert modular components from using modevent to using sysinit (where
  appropriate).  In some cases, do minor rejuggling of SYSINIT ordering
  to make room for or better manage events.

Portions submitted by:	jhb (VNET_SYSINIT), bz (cleanup)
Discussed with:		jhb, bz, julian, zec
Reviewed by:		bz
Approved by:		re (VIMAGE blanket)
2009-07-23 20:46:49 +00:00
Bjoern A. Zeeb
a08362ce46 sysctl_msec_to_ticks is used with both virtualized and
non-vrtiualized sysctls so we cannot used one common function.

Add a macro to convert the arg1 in the virtualized case to
vnet.h to not expose the maths to all over the code.

Add a wrapper for the single virtualized call, properly handling
arg1 and call the default implementation from there.

Convert the two over places to use the new macro.

Reviewed by:	rwatson
Approved by:	re (kib)
2009-07-21 21:58:55 +00:00
Robert Watson
0a4747d4d0 Garbage collect vnet module registrations that have neither constructors
nor destructors, as there's no actual work to do.

In most cases, the constructors weren't needed because of the existing
protocol initialization functions run by net_init_domain() as part of
VNET_MOD_NET, or they were eliminated when support for static
initialization of virtualized globals was added.

Garbage collect dependency references to modules without constructors or
destructors, notably VNET_MOD_INET and VNET_MOD_INET6.

Reviewed by:	bz
Approved by:	re (vimage blanket)
2009-07-20 13:55:33 +00:00
Robert Watson
17ef1feb8a Add macros VNET_SETNAME and VNET_SYMPREFIX, and expose to userspace if
_WANT_VNET is defined.  This way we don't need separate definitions in
libkvm.

Reviewed by:	bz
Approved by:	re (vimage blanket)
2009-07-20 07:50:50 +00:00
Robert Watson
006e9db452 Normalize field naming for struct vnet, fix two debugging printfs that
print them.

Reviewed by:	bz
Approved by:	re (kensmith, kib)
2009-07-19 17:40:45 +00:00
Robert Watson
5ee847d3ac Reimplement and/or implement vnet list locking by replacing a mostly
unused custom mutex/condvar-based sleep locks with two locks: an
rwlock (for non-sleeping use) and sxlock (for sleeping use).  Either
acquired for read is sufficient to stabilize the vnet list, but both
must be acquired for write to modify the list.

Replace previous no-op read locking macros, used in various places
in the stack, with actual locking to prevent race conditions.  Callers
must declare when they may perform unbounded sleeps or not when
selecting how to lock.

Refactor vnet sysinits so that the vnet list and locks are initialized
before kernel modules are linked, as the kernel linker will use them
for modules loaded by the boot loader.

Update various consumers of these KPIs based on whether they may sleep
or not.

Reviewed by:	bz
Approved by:	re (kib)
2009-07-19 14:20:53 +00:00
Jamie Gritton
7afcbc18b3 Remove the interim vimage containers, struct vimage and struct procg,
and the ioctl-based interface that supported them.

Approved by:	re (kib), bz (mentor)
2009-07-17 14:48:21 +00:00
Robert Watson
1e77c1056a Remove unused VNET_SET() and related macros; only VNET_GET() is
ever actually used.  Rename VNET_GET() to VNET() to shorten
variable references.

Discussed with:	bz, julian
Reviewed by:	bz
Approved by:	re (kensmith, kib)
2009-07-16 21:13:04 +00:00
Robert Watson
c1e200ffcc Add missing license line for vnet.h, correct white space nit.
Approved by:	re (kensmith) (implicit)
2009-07-15 00:56:15 +00:00
Robert Watson
eddfbb763d Build on Jeff Roberson's linker-set based dynamic per-CPU allocator
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator.  Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...).  This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.

Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack.  Virtualized global variables are
tagged at compile-time, placing the in a special linker set, which is
loaded into a contiguous region of kernel memory.  Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of a the kernel linker.

Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy.  Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address.  When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.

This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.

Bump __FreeBSD_version and update UPDATING.

Portions submitted by:  bz
Reviewed by:            bz, zec
Discussed with:         gnn, jamie, jeff, jhb, julian, sam
Suggested by:           peter
Approved by:            re (kensmith)
2009-07-14 22:48:30 +00:00
Kip Macy
6a7bff2c31 Re-factoring for adding weighted routes introduced a
fairly irritating bug where the system will panic
when RADIX_MPATH is enabled. This change fixes this.

Approved by:	re@
2009-07-11 21:56:23 +00:00
Rui Paulo
59aa14a91d Implementation of the upcoming Wireless Mesh standard, 802.11s, on the
net80211 wireless stack. This work is based on the March 2009 D3.0 draft
standard. This standard is expected to become final next year.
This includes two main net80211 modules, ieee80211_mesh.c
which deals with peer link management, link metric calculation,
routing table control and mesh configuration and ieee80211_hwmp.c
which deals with the actually routing process on the mesh network.
HWMP is the mandatory routing protocol on by the mesh standard, but
others, such as RA-OLSR, can be implemented.

Authentication and encryption are not implemented.

There are several scripts under tools/tools/net80211/scripts that can be
used to test different mesh network topologies and they also teach you
how to setup a mesh vap (for the impatient: ifconfig wlan0 create
wlandev ... wlanmode mesh).

A new build option is available: IEEE80211_SUPPORT_MESH and it's enabled
by default on GENERIC kernels for i386, amd64, sparc64 and pc98.

Drivers that support mesh networks right now are: ath, ral and mwl.

More information at: http://wiki.freebsd.org/WifiMesh

Please note that this work is experimental. Also, please note that
bridging a mesh vap with another network interface is not yet supported.

Many thanks to the FreeBSD Foundation for sponsoring this project and to
Sam Leffler for his support.
Also, I would like to thank Gateworks Corporation for sending me a
Cambria board which was used during the development of this project.

Reviewed by:	sam
Approved by:	re (kensmith)
Obtained from:	projects/mesh11s
2009-07-11 15:02:45 +00:00
Bjoern A. Zeeb
ba3b25b35a In case we cannot queue a packet reaching the queue limit, retain the
semantics netisr_queue() always had and free the mbuf along with
returning the error.

Reviewed by:	rwatson
Approved by:	re (kensmith)
2009-06-30 05:21:00 +00:00
Brooks Davis
6cb7f168db Remove support for the /dev/net/* per-interface devices. They serve
little purpose and are unused in the base system.

The IOCTL functionality is entirely duplicated and routing sockets
provide a richer interface than the kqueue functionality.

Further, it is not practical for these devices to be made sensible in
the face of VIMAGE.

Bump __FreeBSD_version on the off chance that there is any code out
there that actually uses this stuff.

Reviewed by:	rwatson
Discussed with:	bz, zec
Approved by:	re@ (kensmith)
2009-06-29 19:46:29 +00:00
Robert Watson
395cbe82d2 Remove unnecessary include of kdb.h that snuck in during ifaddr refcount
work.

Reported by:	pluknet <pluknet at gmail.com>
Approved by:	re (kib)
2009-06-27 10:30:28 +00:00
Robert Watson
9e6e01ebf6 In light of DPCPU use by netisr, revise various for loops from using
MAXCPU to mp_maxid, and handling and reporting of requests to use more
threads than we have CPUs to run them on.

Reviewed by:	bz
Approved by:	re (kib)
MFC after:	6 weeks
2009-06-26 20:39:36 +00:00
Robert Watson
ba16a0fab1 Use if_addr_rlock/if_addr_runlock for if_spp when iterating if_addrhead,
as it is loadable as a module.

Approved by:	re (kib)
MFC after:	6 weeks
2009-06-26 18:50:49 +00:00
Robert Watson
3893212ddc Update if_stf and if_tun to use if_addr_rlock()/if_addr_runlock() rather
than IF_ADDR_LOCK()/IF_ADDR_UNLOCK() when iterating ifp->if_addrhead.

MFC after:	6 weeks
2009-06-26 00:45:20 +00:00
Robert Watson
f9ef96ca71 Define four wrapper functions for interface address locking,
if_addr_rlock() and if_addr_runlock() for regular address lists, and
if_maddr_rlock() and if_maddr_runlock() for multicast address lists.

We will use these in various kernel modules to avoid encoding specific
type and locking strategy information into modules that currently use
IF_ADDR_LOCK() and IF_ADDR_UNLOCK() directly.

MFC after:	6 weeks
2009-06-26 00:36:47 +00:00
Robert Watson
534027673b Convert netisr to use dynamic per-CPU storage (DPCPU) instead of sizing
arrays to [MAXCPU], offering moderate memory savings.  In some places,
this requires using CPU_ABSENT() to handle less common platforms with
sparse CPU IDs.  In several places, assert that the selected CPUID for
work placement or statistics is not CPU_ABSENT() to be on the safe side.

Discussed with:	bz, jeff
2009-06-26 00:19:25 +00:00
Konstantin Belousov
9f80ce043d Change the type of uio_resid member of struct uio from int to ssize_t.
Note that this does not actually enable full-range i/o requests for
64 architectures, and is done now to update KBI only.

Tested by:	pho
Reviewed by:	jhb, bde (as part of the review of the bigger patch)
2009-06-25 18:46:30 +00:00
Robert Watson
2d9cfabad4 Add a new global rwlock, in_ifaddr_lock, which will synchronize use of the
in_ifaddrhead and INADDR_HASH address lists.

Previously, these lists were used unsynchronized as they were effectively
never changed in steady state, but we've seen increasing reports of
writer-writer races on very busy VPN servers as core count has gone up
(and similar configurations where address lists change frequently and
concurrently).

For the time being, use rwlocks rather than rmlocks in order to take
advantage of their better lock debugging support.  As a result, we don't
enable ip_input()'s read-locking of INADDR_HASH until an rmlock conversion
is complete and a performance analysis has been done.  This means that one
class of reader-writer races still exists.

MFC after:      6 weeks
Reviewed by:    bz
2009-06-25 11:52:33 +00:00
Bjoern A. Zeeb
98c230c87e Merge from p4: CH154790,154793,154874
Import if_epair(4), a virtual cross-over Ethernet-like interface pair.

Note these files are 1:1 from p4 and not yet connected to the build
not knowing about the new netisr interface.

Sponsored by:	The FreeBSD Foundation
2009-06-24 22:21:30 +00:00
Navdeep Parhar
456ae55008 Add 10Gbase-T to known ethernet media types.
Approved by:	gnn (mentor)
MFC after:	1 week.
2009-06-24 21:53:25 +00:00
Navdeep Parhar
52d9cb1252 About to add 10Gbase-T to known media types, this is just a whitespace
cleanup before that commit.  No functional impact.

Approved by:	gnn (mentor)
2009-06-24 21:51:42 +00:00
Robert Watson
3baaf2974d In if_setlladdr(), use IF_ADDR_LOCK() and ifaddr references to improve
the safety of link layer address manipulation.

MFC after:	6 weeks
2009-06-24 10:36:48 +00:00
Robert Watson
6c7ffe9340 Break at_ifawithnet() into two variants:
- at_ifawithnet(), which acquires an locks it needs and returns an
  at_ifaddr reference.
- at_ifawithnet_locked(), which relies on the caller locking
  at_ifaddr_list, and returns a pointer rather than a reference.

Update various consumers to prefer one or the other, including ether
and fddi output, to properly release at_ifaddr references.

Rework at_control() to manage locking and references in a manner
identical to in_control().

MFC after:	6 weeks
2009-06-24 10:32:44 +00:00
Robert Watson
5c66449004 Lock if_addrhead when iterating, and where necessary acquire and release
ifadr references in if_sppp.

MFC after:	6 weeks
2009-06-24 08:53:23 +00:00
Robert Watson
fe0ecfd64d Make stf_getsrcifa6() return a reference to an in6_ifaddr rather than
a pointer, and dispose of the references when no longer needed.

MFC after:	6 weeks
2009-06-24 08:52:09 +00:00
Robert Watson
8c0fec805f Modify most routines returning 'struct ifaddr *' to return references
rather than pointers, requiring callers to properly dispose of those
references.  The following routines now return references:

  ifaddr_byindex
  ifa_ifwithaddr
  ifa_ifwithbroadaddr
  ifa_ifwithdstaddr
  ifa_ifwithnet
  ifaof_ifpforaddr
  ifa_ifwithroute
  ifa_ifwithroute_fib
  rt_getifa
  rt_getifa_fib
  IFP_TO_IA
  ip_rtaddr
  in6_ifawithifp
  in6ifa_ifpforlinklocal
  in6ifa_ifpwithaddr
  in6_ifadd
  carp_iamatch6
  ip6_getdstifaddr

Remove unused macro which didn't have required referencing:

  IFP_TO_IA6

This closes many small races in which changes to interface
or address lists while an ifaddr was in use could lead to use of freed
memory (etc).  In a few cases, add missing if_addr_list locking
required to safely acquire references.

Because of a lack of deep copying support, we accept a race in which
an in6_ifaddr pointed to by mbuf tags and extracted with
ip6_getdstifaddr() doesn't hold a reference while in transmit.  Once
we have mbuf tag deep copy support, this can be fixed.

Reviewed by:	bz
Obtained from:	Apple, Inc. (portions)
MFC after:	6 weeks (portions)
2009-06-23 20:19:09 +00:00
Bjoern A. Zeeb
5736e6fb9d After cleaning up rt_tables from vnet.h and cleaning up opt_route.h
a lot of files no longer need route.h either. Garbage collect them.
While here remove now unneeded vnet.h #includes as well.
2009-06-23 17:03:45 +00:00
Bjoern A. Zeeb
a877d0cffa Remove duplicate #include <net/route.h> from the middle of the file. 2009-06-23 13:16:16 +00:00
Marko Zec
fa057b15bd V_irtualize flowtable state.
This change should make options VIMAGE kernel builds usable again,
to some extent at least.

Note that the size of struct vnet_inet has changed, though in
accordance with one-bump-per-day policy we didn't update the
__FreeBSD_version number, given that it has already been touched
by r194640 a few hours ago.
Reviewed by:	bz
Approved by:	julian (mentor)
2009-06-22 21:19:24 +00:00
Bjoern A. Zeeb
3952a5abc9 Updates after r194640:
- shrink size guards for vnet_net.
  vnet_rtable does not need size guards as it is self-contained.
- remove a bunch of defines from vnet.h no longer valid.
2009-06-22 17:56:07 +00:00
Bjoern A. Zeeb
b58ea5f310 Move virtualization of routing related variables into their own
Vimage module, which had been there already but now is stateful.

All variables are now file local; so this further limits the global
spreading of routing related things throughout the kernel.

Add a missing function local variable in case of MPATHing.

Reviewed by:	zec
2009-06-22 17:48:16 +00:00
Bjoern A. Zeeb
f987f19301 Collect all VIMAGE_GLOBALS variables in one place.
No longer export rt_tables as all lookups go through
rt_tables_get_rnh().

We cannot make rt_tables (and rtstat, rttrash[1]) static as
netstat -r (-rs[1]) would stop working on a stripped
VIMAGE_GLOBALS kernel.

Reviewed by:		zec
Presumably broken by:	phk 13.5y ago in r12820 [1]
2009-06-22 15:07:12 +00:00
Robert Watson
8896f83a58 Add a new function, ifa_ifwithaddr_check(), which rather than returning
a pointer to an ifaddr matching the passed socket address, returns a
boolean indicating whether one was present.  In the (near) future,
ifa_ifwithaddr() will return a referenced ifaddr rather than a raw
ifaddr pointer, and the new wrapper will allow callers that care only
about the boolean condition to avoid having to free that reference.

MFC after:	3 weeks
2009-06-22 10:59:34 +00:00
Bjoern A. Zeeb
bed56bb51b After the update to fxp(4) in r194573 we should no longer need
this DELAY(100) hack introduced in r56938.

Thanks to:	yongari
MFC after:	6 weeks
X-MFC note:	not before the fxp(4) changes
2009-06-22 10:27:20 +00:00
Robert Watson
1099f828b3 Clean up common ifaddr management:
- Unify reference count and lock initialization in a single function,
  ifa_init().
- Move tear-down from a macro (IFAFREE) to a function ifa_free().
- Move reference count bump from a macro (IFAREF) to a function ifa_ref().
- Instead of using a u_int protected by a mutex to refcount(9) for
  reference count management.

The ifa_mtx is now used for exactly one ioctl, and possibly should be
removed.

MFC after:	3 weeks
2009-06-21 19:30:33 +00:00
Roman Divacky
e40bae9a45 Switch cmd argument to u_long. This matches what if_ethersubr.c does and
allows the code to compile cleanly on amd64 with clang.

Reviewed by:	rwatson
Approved by:	ed (mentor)
2009-06-21 10:29:31 +00:00
Roman Divacky
2b7d10c225 In non-debugging mode make this define (void)0 instead of nothing. This
helps to catch bugs like the below with clang.

	if (cond);		<--- note the trailing ;
	   something();

Approved by:	ed (mentor)
Discussed on:	current@
2009-06-21 08:49:06 +00:00
Kip Macy
d49cd9a18e add helper function for flushing software queues 2009-06-19 23:11:20 +00:00
Christian S.J. Peron
0e37f3e196 Implement the -z (zero counters) option for the various bpf counters.
Add necessary changes to the kernel for this (basically introduce a
bpf_zero_counters() function).  As well, update the man page.

MFC after:	1 month
Discussed with:	rwatson
2009-06-19 20:31:44 +00:00
Bjoern A. Zeeb
ebd8672cc3 Add explicit includes for jail.h to the files that need them and
remove the "hidden" one from vimage.h.
2009-06-17 15:01:01 +00:00
Bjoern A. Zeeb
7654a365db Add the explicit include of vimage.h to another five .c files still
missing it.

Remove the "hidden" kernel only include of vimage.h from ip_var.h added
with the very first Vimage commit r181803 to avoid further kernel poisoning.
2009-06-17 12:44:11 +00:00
Sam Leffler
d659538f72 r193336 moved ifq_detach to if_free which broke if_alloc followed
by if_free (w/o doing if_attach); move ifq_attach to if_alloc and
rename ifq_attach/detach to ifq_init/ifq_delete to better identify
their purpose

Reviewed by:	jhb, kmacy
2009-06-15 19:50:03 +00:00
Jamie Gritton
9ed47d01eb Get vnets from creds instead of threads where they're available, and from
passed threads instead of curthread.

Reviewed by:	zec, julian
Approved by:	bz (mentor)
2009-06-15 19:01:53 +00:00
Jamie Gritton
679e13901c Manage vnets via the jail system. If a jail is given the boolean
parameter "vnet" when it is created, a new vnet instance will be created
along with the jail.  Networks interfaces can be moved between prisons
with an ioctl similar to the one that moves them between vimages.
For now vnets will co-exist under both jails and vimages, but soon
struct vimage will be going away.

Reviewed by:	zec, julian
Approved by:	bz (mentor)
2009-06-15 18:59:29 +00:00
Bjoern A. Zeeb
ed655c8c07 Add an optional callback function that will be invoked when a per-CPU
queue was drained.  It will never fire for a directly dispatched packet.

You will most likely never want to use this for any ordinary netisr usage
and you will never blame netisr in case you try to use it and it does
not work as expected.

Reviewed by:	rwatson
2009-06-14 17:15:18 +00:00
Bjoern A. Zeeb
736801ace4 Garbage collect an extern for a non-existent variable.
While here let the comment end in a '.' and mark the #endif of _KERNEL.

Reviewed by:	rwatson (as part of a larger patch)
2009-06-12 20:50:28 +00:00
Bjoern A. Zeeb
53be8fca00 Move the kernel option FLOWTABLE chacking from the header file to the
actual implementation.
Remove the accessor functions for the compiled out case, just returning
"unavail" values. Remove the kernel conditional from the header file as
it is no longer needed, only leaving the externs.
Hide the improperly virtualized SYSCTL/TUNABLE for the flowtable size
under the kernel option as well.

Reviewed by:	rwatson
2009-06-12 20:46:36 +00:00
VANHULLEBUS Yvan
7b495c4494 Added support for NAT-Traversal (RFC 3948) in IPsec stack.
Thanks to (no special order) Emmanuel Dreyfus (manu@netbsd.org), Larry
Baird (lab@gta.com), gnn, bz, and other FreeBSD devs, Julien Vanherzeele
(julien.vanherzeele@netasq.com, for years of bug reporting), the PFSense
team, and all people who used / tried the NAT-T patch for years and
reported bugs, patches, etc...

X-MFC: never

Reviewed by:	bz
Approved by:	gnn(mentor)
Obtained from:	NETASQ
2009-06-12 15:44:35 +00:00
Bjoern A. Zeeb
259d2d5431 carp(4) allows people to share a set of IP addresses and can only
use IPv4/v6 for inter-node communication (according to my reading).

Properly wrap the carp callouts in INET || INET6 and refelect this
in sys/conf/files as well.  While in theory this should be ok,
it might be a bit optimistic to think that carp could build with
inet6 only[1].

Discussed with:		mlaier [1]
2009-06-11 10:26:38 +00:00
Konstantin Belousov
d8b0556c6d Adapt vfs kqfilter to the shared vnode lock used by zfs write vop. Use
vnode interlock to protect the knote fields [1]. The locking assumes
that shared vnode lock is held, thus we get exclusive access to knote
either by exclusive vnode lock protection, or by shared vnode lock +
vnode interlock.

Do not use kl_locked() method to assert either lock ownership or the
fact that curthread does not own the lock. For shared locks, ownership
is not recorded, e.g. VOP_ISLOCKED can return LK_SHARED for the shared
lock not owned by curthread, causing false positives in kqueue subsystem
assertions about knlist lock.

Remove kl_locked method from knlist lock vector, and add two separate
assertion methods kl_assert_locked and kl_assert_unlocked, that are
supposed to use proper asserts. Change knlist_init accordingly.

Add convenience function knlist_init_mtx to reduce number of arguments
for typical knlist initialization.

Submitted by:	jhb [1]
Noted by:	jhb [2]
Reviewed by:	jhb
Tested by:	rnoland
2009-06-10 20:59:32 +00:00
Bjoern A. Zeeb
c03528b663 SCTP needs either IPv4 or IPv6 as lower layer[1].
So properly hide the already #ifdef SCTP code with
#if defined(INET) || defined(INET6) as well to get us
closer to a non-INET/INET6 kernel.

Discussed with:	tuexen [1]
2009-06-10 14:36:59 +00:00
Bjoern A. Zeeb
974524bf59 ip_gif_ttl/GIF_TTL are only used by the inet part in in_gif.c,
so put the initialization under #ifdef INET.
2009-06-10 13:39:51 +00:00
Bjoern A. Zeeb
74900cdf7f The llentry *lle is only used in cases of INET or INET6.
Put the variable declaration under proper #ifdefs.

In case variables are only needed for one of the two AFs
more them into proper scope.
2009-06-10 09:07:05 +00:00
Kip Macy
3576e2f4a2 revert to opt-in flowtable 2009-06-09 21:55:28 +00:00
Oleg Bulyzhin
dda10d624c Close long existed race with net.inet.ip.fw.one_pass = 0:
If packet leaves ipfw to other kernel subsystem (dummynet, netgraph, etc)
it carries pointer to matching ipfw rule. If this packet then reinjected back
to ipfw, ruleset processing starts from that rule. If rule was deleted
meanwhile, due to existed race condition panic was possible (as well as
other odd effects like parsing rules in 'reap list').

P.S. this commit changes ABI so userland ipfw related binaries should be
recompiled.

MFC after:	1 month
Tested by:	Mikolaj Golub
2009-06-09 21:27:11 +00:00
Kip Macy
15d13a59a3 make flowtable opt-out 2009-06-09 20:27:30 +00:00
Kip Macy
ee117d034c move jenkins hash to its own header in libkern 2009-06-09 20:21:40 +00:00
Kip Macy
a913be0917 - add drbr routines for accessing #qentries and conditionally dequeueing
- track bytes enqueued in buf_ring
2009-06-09 19:19:16 +00:00
Bjoern A. Zeeb
c2f16e371e Remove one INET dependency by calling the general
AF agnostic version for doing the routing lookup.

Reviewed by:	kmacy
2009-06-09 09:50:43 +00:00
Hiroki Sato
4cd5f57d6b Style fix.
Submitted by:	bz
2009-06-09 08:09:30 +00:00
Hiroki Sato
fb70e72b0c - Fix sanity check of GIFSOPTS ioctl.
- Rename option mask s/GIF_FULLOPTS/GIF_OPTMASK/

Spotted by:	Eygene Ryabinkin, delphij
2009-06-09 02:27:59 +00:00
Bjoern A. Zeeb
5c40c5e989 Remove two unneeded, hidden includes. 2009-06-08 20:04:46 +00:00
Bjoern A. Zeeb
8d8bc0182e After r193232 rt_tables in vnet.h are no longer indirectly dependent on
the ROUTETABLES kernel option thus there is no need to include opt_route.h
anymore in all consumers of vnet.h and no longer depend on it for module
builds.

Remove the hidden include in flowtable.h as well and leave the two
explicit #includes in ip_input.c and ip_output.c.
2009-06-08 19:57:35 +00:00
Marko Zec
bc29160df3 Introduce an infrastructure for dismantling vnet instances.
Vnet modules and protocol domains may now register destructor
functions to clean up and release per-module state.  The destructor
mechanisms can be triggered by invoking "vimage -d", or a future
equivalent command which will be provided via the new jail framework.

While this patch introduces numerous placeholder destructor functions,
many of those are currently incomplete, thus leaking memory or (even
worse) failing to stop all running timers.  Many of such issues are
already known and will be incrementaly fixed over the next weeks in
smaller incremental commits.

Apart from introducing new fields in structs ifnet, domain, protosw
and vnet_net, which requires the kernel and modules to be rebuilt, this
change should have no impact on nooptions VIMAGE builds, since vnet
destructors can only be called in VIMAGE kernels.  Moreover,
destructor functions should be in general compiled in only in
options VIMAGE builds, except for kernel modules which can be safely
kldunloaded at run time.

Bump __FreeBSD_version to 800097.
Reviewed by:	bz, julian
Approved by:	rwatson, kib (re), julian (mentor)
2009-06-08 17:15:40 +00:00
Hiroki Sato
dbe5926046 Fix and add a workaround on an issue of EtherIP packet with reversed
version field sent via gif(4)+if_bridge(4).  The EtherIP
implementation found on FreeBSD 6.1, 6.2, 6.3, 7.0, 7.1, and 7.2 had
an interoperability issue because it sent the incorrect EtherIP
packets and discarded the correct ones.

This change introduces the following two flags to gif(4):

 accept_rev_ethip_ver: accepts both correct EtherIP packets and ones
    with reversed version field, if enabled.  If disabled, the gif
    accepts the correct packets only.  This flag is enabled by
    default.

 send_rev_ethip_ver: sends EtherIP packets with reversed version field
    intentionally, if enabled.  If disabled, the gif sends the correct
    packets only.  This flag is disabled by default.

These flags are stored in struct gif_softc and can be set by
ifconfig(8) on per-interface basis.

Note that this is an incompatible change of EtherIP with the older
FreeBSD releases.  If you need to interoperate older FreeBSD boxes and
new versions after this commit, setting "send_rev_ethip_ver" is
needed.

Reviewed by:	thompsa and rwatson
Spotted by:	Shunsuke SHINOMIYA
PR:		kern/125003
MFC after:	2 weeks
2009-06-07 23:00:40 +00:00
Robert Watson
bcf11e8d00 Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with:	pjd
2009-06-05 14:55:22 +00:00
Luigi Rizzo
115a40c7bf More cleanup in preparation of ipfw relocation (no actual code change):
+ move ipfw and dummynet hooks declarations to raw_ip.c (definitions
  in ip_var.h) same as for most other global variables.
  This removes some dependencies from ip_input.c;

+ remove the IPFW_LOADED macro, just test ip_fw_chk_ptr directly;

+ remove the DUMMYNET_LOADED macro, just test ip_dn_io_ptr directly;

+ move ip_dn_ruledel_ptr to ip_fw2.c which is the only file using it;

To be merged together with rev 193497

MFC after:	5 days
2009-06-05 13:44:30 +00:00
Sam Leffler
c9dd371765 move ifq_detach from if_detach to if_free; this permits callers to
reference if_snd in the period between detach+free which helps simplify
detach code

Reviewed by:	jhb, rwatson
2009-06-02 18:53:21 +00:00
Robert Watson
d363c61766 Revert a recent netisr2 change: when billing packets to the current
CPU, don't lock the workstream, as its mutexes may not have been
initialized if there are fewer workstreams than CPUs.

Run into by:	hps, ps
2009-06-01 18:38:36 +00:00
Bjoern A. Zeeb
c2c2a7c11e Convert the two dimensional array to be malloced and introduce
an accessor function to get the correct rnh pointer back.

Update netstat to get the correct pointer using kvm_read()
as well.

This not only fixes the ABI problem depending on the kernel
option but also permits the tunable to overwrite the kernel
option at boot time up to MAXFIBS, enlarging the number of
FIBs without having to recompile. So people could just use
GENERIC now.

Reviewed by:	julian, rwatson, zec
X-MFC:		not possible
2009-06-01 15:49:42 +00:00
Robert Watson
ed54411c19 Garbage collect NETISR_POLL and NETISR_POLLMORE, which are no longer
required for options DEVICE_POLLING.

De-fragment the NETISR_ constant space and lower NETISR_MAXPROT from
32 to 16 -- when sizing queue arrays using this compile-time constant,
significant amounts of memory are saved.

Warn on the console when tunable values for netisr are automatically
adjusted during boot due to exceeding limits, invalid values, or as a
result of DEVICE_POLLING.
2009-06-01 15:03:58 +00:00
Robert Watson
d4b5cae49b Reimplement the netisr framework in order to support parallel netisr
threads:

- Support up to one netisr thread per CPU, each processings its own
  workstream, or set of per-protocol queues.  Threads may be bound
  to specific CPUs, or allowed to migrate, based on a global policy.

  In the future it would be desirable to support topology-centric
  policies, such as "one netisr per package".

- Allow each protocol to advertise an ordering policy, which can
  currently be one of:

  NETISR_POLICY_SOURCE: packets must maintain ordering with respect to
    an implicit or explicit source (such as an interface or socket).

  NETISR_POLICY_FLOW: make use of mbuf flow identifiers to place work,
    as well as allowing protocols to provide a flow generation function
    for mbufs without flow identifers (m2flow).  Falls back on
    NETISR_POLICY_SOURCE if now flow ID is available.

  NETISR_POLICY_CPU: allow protocols to inspect and assign a CPU for
    each packet handled by netisr (m2cpuid).

- Provide utility functions for querying the number of workstreams
  being used, as well as a mapping function from workstream to CPU ID,
  which protocols may use in work placement decisions.

- Add explicit interfaces to get and set per-protocol queue limits, and
  get and clear drop counters, which query data or apply changes across
  all workstreams.

- Add a more extensible netisr registration interface, in which
  protocols declare 'struct netisr_handler' structures for each
  registered NETISR_ type.  These include name, handler function,
  optional mbuf to flow ID function, optional mbuf to CPU ID function,
  queue limit, and ordering policy.  Padding is present to allow these
  to be expanded in the future.  If no queue limit is declared, then
  a default is used.

- Queue limits are now per-workstream, and raised from the previous
  IFQ_MAXLEN default of 50 to 256.

- All protocols are updated to use the new registration interface, and
  with the exception of netnatm, default queue limits.  Most protocols
  register as NETISR_POLICY_SOURCE, except IPv4 and IPv6, which use
  NETISR_POLICY_FLOW, and will therefore take advantage of driver-
  generated flow IDs if present.

- Formalize a non-packet based interface between interface polling and
  the netisr, rather than having polling pretend to be two protocols.
  Provide two explicit hooks in the netisr worker for start and end
  events for runs: netisr_poll() and netisr_pollmore(), as well as a
  function, netisr_sched_poll(), to allow the polling code to schedule
  netisr execution.  DEVICE_POLLING still embeds single-netisr
  assumptions in its implementation, so for now if it is compiled into
  the kernel, a single and un-bound netisr thread is enforced
  regardless of tunable configuration.

In the default configuration, the new netisr implementation maintains
the same basic assumptions as the previous implementation: a single,
un-bound worker thread processes all deferred work, and direct dispatch
is enabled by default wherever possible.

Performance measurement shows a marginal performance improvement over
the old implementation due to the use of batched dequeue.

An rmlock is used to synchronize use and registration/unregistration
using the framework; currently, synchronized use is disabled
(replicating current netisr policy) due to a measurable 3%-6% hit in
ping-pong micro-benchmarking.  It will be enabled once further rmlock
optimization has taken place.  However, in practice, netisrs are
rarely registered or unregistered at runtime.

A new man page for netisr will follow, but since one doesn't currently
exist, it hasn't been updated.

This change is not appropriate for MFC, although the polling shutdown
handler should be merged to 7-STABLE.

Bump __FreeBSD_version.

Reviewed by:	bz
2009-06-01 10:41:38 +00:00
Marko Zec
feb08d06b9 Introduce an interm userland-kernel API for creating vnets and
assigning ifnets from one vnet to another.  Deletion of vnets is not
yet supported.

The interface is implemented as an ioctl extension so that no syscalls
had to be introduced.  This should be acceptable given that the new
interface will be used for a short / interim period only, until the
new jail management framwork gains the capability of managing vnets.
This method for managing vimages / vnets has been in use for the past
7 years without any observable issues.

The userland tool to be used in conjunction with the interim API can be
found in p4: //depot/projects/vimage-commit2/src/usr.sbin/vimage/... and
will most probably never get commited to svn.

While here, bump copyright notices in kern_vimage.c and vimage.h to
cover work done in year 2009.

Approved by:	julian (mentor)
Discussed with:	bz, rwatson
2009-05-31 12:10:04 +00:00
Attilio Rao
1abcdbd127 When user_frac in the polling subsystem is low it is going to busy the
CPU for too long period than necessary.  Additively, interfaces are kept
polled (in the tick) even if no more packets are available.
In order to avoid such situations a new generic mechanism can be
implemented in proactive way, keeping track of the time spent on any
packet and fragmenting the time for any tick, stopping the processing
as soon as possible.

In order to implement such mechanism, the polling handler needs to
change, returning the number of packets processed.
While the intended logic is not part of this patch, the polling KPI is
broken by this commit, adding an int return value and the new flag
IFCAP_POLLING_NOCOUNT (which will signal that the return value is
meaningless for the installed handler and checking should be skipped).

Bump __FreeBSD_version in order to signal such situation.

Reviewed by:	emaste
Sponsored by:	Sandvine Incorporated
2009-05-30 15:14:44 +00:00
Robert Watson
1a109c1cb0 Make the rmlock(9) interface a bit more like the rwlock(9) interface:
- Add rm_init_flags() and accept extended options only for that variation.
- Add a flags space specifically for rm_init_flags(), rather than borrowing
  the lock_init() flag space.
- Define flag RM_RECURSE to use instead of LO_RECURSABLE.
- Define flag RM_NOWITNESS to allow an rmlock to be exempt from WITNESS
  checking; this wasn't possible previously as rm_init() always passed
  LO_WITNESS when initializing an rmlock's struct lock.
- Add RM_SYSINIT_FLAGS().
- Rename embedded mutex in rmlocks to make it more obvious what it is.
- Update consumers.
- Update man page.
2009-05-29 10:52:37 +00:00
Jamie Gritton
0304c73163 Add hierarchical jails. A jail may further virtualize its environment
by creating a child jail, which is visible to that jail and to any
parent jails.  Child jails may be restricted more than their parents,
but never less.  Jail names reflect this hierarchy, being MIB-style
dot-separated strings.

Every thread now points to a jail, the default being prison0, which
contains information about the physical system.  Prison0's root
directory is the same as rootvnode; its hostname is the same as the
global hostname, and its securelevel replaces the global securelevel.
Note that the variable "securelevel" has actually gone away, which
should not cause any problems for code that properly uses
securelevel_gt() and securelevel_ge().

Some jail-related permissions that were kept in global variables and
set via sysctls are now per-jail settings.  The sysctls still exist for
backward compatibility, used only by the now-deprecated jail(2) system
call.

Approved by:	bz (mentor)
2009-05-27 14:11:23 +00:00
Sam Leffler
5ce8d9708c rev bpf attach/detach event api to include the dlt 2009-05-25 16:34:35 +00:00
Marko Zec
37f17770e0 V_irtualize the if_clone framework, thus allowing for clonable ifnets
to optionally have overlapping unit numbers if attached in different
vnets.

At this stage if_loop is the only clonable ifnet class that has been
extended to allow for such overlapping allocation of unit numbers, i.e.
in each vnet it is possible to have a lo0 interface.  Other clonable ifnet
classes remain to operate with traditional semantics, i.e. each instance
of a clonable ifnet will be assigned a globally unique unit number,
regardless in which vnet such an ifnet becomes instantiated.

While here, garbage collect unused _lo_list field in struct vnet_net,
as well as improve indentation for #defines in sys/net/vnet.h.

The layout of struct vnet_net has changed, therefore bump
__FreeBSD_version.

This change has no functional impact on nooptions VIMAGE kernel builds.

Reviewed by:	bz, brooks
Approved by:	julian (mentor)
2009-05-23 21:43:44 +00:00
Marko Zec
67da1f3d8d Set ifp->if_afdata_initialized to 0 while holding IF_AFDATA_LOCK on ifp,
not after the lock has been released.

Reviewed by:	bz
Discussed with:	rwatson
2009-05-22 22:22:21 +00:00
Marko Zec
e0c14af9b3 Introduce the if_vmove() function, which will be used in the future
for reassigning ifnets from one vnet to another.

if_vmove() works by calling a restricted subset of actions normally
executed by if_detach() on an ifnet in the current vnet, and then
switches to the target vnet and executes an appropriate subset of
if_attach() actions there.

if_attach() and if_detach() have become wrapper functions around
if_attach_internal() and if_detach_internal(), where the later
variants have an additional argument, a flag indicating whether a
full attach or detach sequence is to be executed, or only a
restricted subset suitable for moving an ifnet from one vnet to
another.  Hence, if_vmove() will not call if_detach() and if_attach()
directly, but will call the if_detach_internal() and
if_attach_internal() variants instead, with the vmove flag set.

While here, staticize ifnet_setbyindex() since it is not referenced
from outside of sys/net/if.c.

Also rename ifccnt field in struct vimage to ifcnt, and do some minor
whitespace garbage collection where appropriate.

This change should have no functional impact on nooptions VIMAGE kernel
builds.

Reviewed by:	bz, rwatson, brooks?
Approved by:	julian (mentor)
2009-05-22 22:09:00 +00:00
Qing Li
c9d763bf41 When an interface address is removed and the last prefix
route is also being deleted, the link-layer address table
(arp or nd6) will flush those L2 llinfo entries that match
the removed prefix.

Reviewed by:	kmacy
2009-05-20 21:07:15 +00:00
Sam Leffler
b743c31009 add bpf_track eventhandler for monitoring bpf taps attached/detached
Reviewed by:	csjp
2009-05-18 17:18:40 +00:00
Robert Watson
a765f96051 Garbage collect unused NETISR_{ATM,NETGRAPH,PPP} netisr constants. 2009-05-18 10:33:23 +00:00
Robert Watson
2f120c90a7 Garbage collect now-unused NETISR_FORCEQUEUE, which overrode the global
direct dispatch policy for specific protocols (NETISR_USB).  We leave
the additional 'flags' argument to netisr_register() for the time being,
even though it is no longer required.
2009-05-13 17:22:33 +00:00
Robert Watson
270b609935 Remove now-unused NETISR_USB. 2009-05-13 17:17:05 +00:00
Marko Zec
21ca7b57bd Change the curvnet variable from a global const struct vnet *,
previously always pointing to the default vnet context, to a
dynamically changing thread-local one.  The currvnet context
should be set on entry to networking code via CURVNET_SET() macros,
and reverted to previous state via CURVNET_RESTORE().  Recursions
on curvnet are permitted, though strongly discuouraged.

This change should have no functional impact on nooptions VIMAGE
kernel builds, where CURVNET_* macros expand to whitespace.

The curthread->td_vnet (aka curvnet) variable's purpose is to be an
indicator of the vnet context in which the current network-related
operation takes place, in case we cannot deduce the current vnet
context from any other source, such as by looking at mbuf's
m->m_pkthdr.rcvif->if_vnet, sockets's so->so_vnet etc.  Moreover, so
far curvnet has turned out to be an invaluable consistency checking
aid: it helps to catch cases when sockets, ifnets or any other
vnet-aware structures may have leaked from one vnet to another.

The exact placement of the CURVNET_SET() / CURVNET_RESTORE() macros
was a result of an empirical iterative process, whith an aim to
reduce recursions on CURVNET_SET() to a minimum, while still reducing
the scope of CURVNET_SET() to networking only operations - the
alternative would be calling CURVNET_SET() on each system call entry.
In general, curvnet has to be set in three typicall cases: when
processing socket-related requests from userspace or from within the
kernel; when processing inbound traffic flowing from device drivers
to upper layers of the networking stack, and when executing
timer-driven networking functions.

This change also introduces a DDB subcommand to show the list of all
vnet instances.

Approved by:	julian (mentor)
2009-05-05 10:56:12 +00:00
Marko Zec
5f416f8e84 Make indentation more uniform accross vnet container structs.
This is a purely cosmetic / NOP change.

Reviewed by:	bz
Approved by:	julian (mentor)
Verified by:	svn diff -x -w producing no output
2009-05-02 08:16:26 +00:00
Marko Zec
d7fcc52895 Unbreak options VIMAGE + nooptions INVARIANTS kernel builds.
Submitted by:	julian
Approved by:	julian (mentor)
2009-05-02 05:02:28 +00:00
Andrew Thompson
3f11aba75f Reorder the bridge add and delete routines to avoid calling ifpromisc() with
the bridge lock held.
2009-05-01 19:46:42 +00:00
Andrew Thompson
5c6026e91f Use the flowid if its available for selecting the tx port. 2009-04-30 14:25:44 +00:00
Marko Zec
f6dfe47a14 Permit buiding kernels with options VIMAGE, restricted to only a single
active network stack instance.  Turning on options VIMAGE at compile
time yields the following changes relative to default kernel build:

1) V_ accessor macros for virtualized variables resolve to structure
fields via base pointers, instead of being resolved as fields in global
structs or plain global variables.  As an example, V_ifnet becomes:

    options VIMAGE:          ((struct vnet_net *) vnet_net)->_ifnet
    default build:           vnet_net_0._ifnet
    options VIMAGE_GLOBALS:  ifnet

2) INIT_VNET_* macros will declare and set up base pointers to be used
by V_ accessor macros, instead of resolving to whitespace:

    INIT_VNET_NET(ifp->if_vnet); becomes

    struct vnet_net *vnet_net = (ifp->if_vnet)->mod_data[VNET_MOD_NET];

3) Memory for vnet modules registered via vnet_mod_register() is now
allocated at run time in sys/kern/kern_vimage.c, instead of per vnet
module structs being declared as globals.  If required, vnet modules
can now request the framework to provide them with allocated bzeroed
memory by filling in the vmi_size field in their vmi_modinfo structures.

4) structs socket, ifnet, inpcbinfo, tcpcb and syncache_head are
extended to hold a pointer to the parent vnet.  options VIMAGE builds
will fill in those fields as required.

5) curvnet is introduced as a new global variable in options VIMAGE
builds, always pointing to the default and only struct vnet.

6) struct sysctl_oid has been extended with additional two fields to
store major and minor virtualization module identifiers, oid_v_subs and
oid_v_mod.  SYSCTL_V_* family of macros will fill in those fields
accordingly, and store the offset in the appropriate vnet container
struct in oid_arg1.
In sysctl handlers dealing with virtualized sysctls, the
SYSCTL_RESOLVE_V_ARG1() macro will compute the address of the target
variable and make it available in arg1 variable for further processing.

Unused fields in structs vnet_inet, vnet_inet6 and vnet_ipfw have
been deleted.

Reviewed by:	bz, rwatson
Approved by:	julian (mentor)
2009-04-30 13:36:26 +00:00
Kip Macy
6e3513f523 replace IFQ_ENQUEUE + if_start with if_transmit 2009-04-27 22:46:26 +00:00
Kip Macy
2635206d3a replace IFQ_HANDOFF with if_transmit 2009-04-27 22:45:56 +00:00
Kip Macy
68bd638360 remove gratuitous memory barrier, a remnant of unified L2 / L3 2009-04-27 22:45:19 +00:00
Kip Macy
1bc27a1ce3 remove call to IFQ_HANDOFF is it called by if_transmit in the default case
and doing so allows the ifnet driver to define its own queueing mechanism
2009-04-27 22:44:26 +00:00
Sam Leffler
5d3220403f use if_transmit intead of direct frobbing of the if_snd q; this is no
longer allowed

Identified by:	rwatson
Reviewed by:	kmacy
2009-04-27 22:06:49 +00:00
Marko Zec
093f25f8c8 In preparation for turning on options VIMAGE in next commits,
rearrange / replace / adjust several INIT_VNET_* initializer
macros, all of which currently resolve to whitespace.

Reviewed by:	bz (an older version of the patch)
Approved by:	julian (mentor)
2009-04-26 22:06:42 +00:00
Robert Watson
8bd015a1ca As with ifnet_byindex_ref(), don't return IFF_DYING interfaces from
ifunit_ref().  ifunit() continues to return them.

MFC after:	3 weeks
2009-04-23 15:56:01 +00:00