reside on their own cache line to prevent false sharing with other
nearby structures, especially for those in the .bss segment.
NB: Those mutexes and rwlocks with variables next to them that get
changed on every invocation do not benefit from their own cache line.
Actually it may be net negative because two cache misses would be
incurred in those cases.
Use RTM_PINNED flag to mark route as immutable.
Forbid deleting immutable routes without special rtrequest1_fib() flag.
Adding interface address with prefix already in route table is handled
by atomically deleting old prefix and adding interface one.
Discussed with: andre, eri
MFC after: 3 weeks
of helper functions:
- carp_master() - boolean function which is true if an address
is in the MASTER state.
- ifa_preferred() - boolean function that compares two addresses,
and is aware of CARP.
Utilize ifa_preferred() in ifa_ifwithnet().
The previous version of patch also changed source address selection
logic in jails using carp_master(), but we failed to negotiate this part
with Bjoern. May be we will approach this problem again later.
Reported & tested by: Anton Yuzhaninov <citrin citrin.ru>
Sponsored by: Nginx, Inc
7.x, 8.x and 9.x with pf(4) imports: pfsync(4) should suppress CARP
preemption, while it is running its bulk update.
However, reimplement the feature in more elegant manner, that is
partially inspired by newer OpenBSD:
- Rename term "suppression" to "demotion", to match with OpenBSD.
- Keep a global demotion factor, that can be raised by several
conditions, for now these are:
- interface goes down
- carp(4) has problems with ip_output() or ip6_output()
- pfsync performs bulk update
- Unlike in OpenBSD the demotion factor isn't a counter, but
is actual value added to advskew. The adjustment values for
particular error conditions are also configurable, and their
defaults are maximum advskew value, so a single failure bumps
demotion to maximum. This is for POLA compatibility, and should
satisfy most users.
- Demotion factor is a writable sysctl, so user can do
foot shooting, if he desires to.
from scratch, copying needed functionality from the old implemenation
on demand, with a thorough review of all code. The main change is that
interface layer has been removed from the CARP. Now redundant addresses
are configured exactly on the interfaces, they run on.
The CARP configuration itself is, as before, configured and read via
SIOCSVH/SIOCGVH ioctls. A new prefix created with SIOCAIFADDR or
SIOCAIFADDR_IN6 may now be configured to a particular virtual host id,
which makes the prefix redundant.
ifconfig(8) semantics has been changed too: now one doesn't need
to clone carpXX interface, he/she should directly configure a vhid
on a Ethernet interface.
To supply vhid data from the kernel to an application the getifaddrs(8)
function had been changed to pass ifam_data with each address. [1]
The new implementation definitely closes all PRs related to carp(4)
being an interface, and may close several others. It also allows
to run a single redundant IP per interface.
Big thanks to Bjoern Zeeb for his help with inet6 part of patch, for
idea on using ifam_data and for several rounds of reviewing!
PR: kern/117000, kern/126945, kern/126714, kern/120130, kern/117448
Reviewed by: bz
Submitted by: bz [1]
to be assigned to a non-default FIB instance.
You may need to recompile world or ports due to the change of struct ifnet.
Submitted by: cjsp
Submitted by: Alexander V. Chernikov (melifaro ipfw.ru)
(original versions)
Reviewed by: julian
Reviewed by: Alexander V. Chernikov (melifaro ipfw.ru)
MFC after: 2 weeks
X-MFC: use spare in struct ifnet
(i.e. under COMPAT_FREEBSD32) in case ifconf() returned success to match
the native SIOCGIFCONF behavior.
PR: kern/158369
Reported by: Paul Procacci <pprocacci att gmail com>
MFC after: 1 week
from the interface index, then decrease refcount, not vice versa.
Otherwise there is a race (reproducible) when if_free_internal()
contests on IFNET_WLOCK(), and we got a zero-refed ifnet in the
index for a long time. It may be picked by some other thread,
that runs ifnet_byindex_ref(), who takes the ifnet from index,
and bumps refcount. When reader drops the lock, if_free_internal()
proceeds with free. Then reader tries to free it a second time.
VNET socket push back:
try to minimize the number of places where we have to switch vnets
and narrow down the time we stay switched. Add assertions to the
socket code to catch possibly unset vnets as seen in r204147.
While this reduces the number of vnet recursion in some places like
NFS, POSIX local sockets and some netgraph, .. recursions are
impossible to fix.
The current expectations are documented at the beginning of
uipc_socket.c along with the other information there.
Sponsored by: The FreeBSD Foundation
Sponsored by: CK Software GmbH
Reviewed by: jhb
Tested by: zec
Tested by: Mikolaj Golub (to.my.trociny gmail.com)
MFC after: 2 weeks
Make VNET_ASSERT() available with either VNET_DEBUG or INVARIANTS.
Change the syntax to match KASSERT() to allow more flexible panic
messages rather than having a printf with hardcoded arguments
before panic.
Adjust the few assertions we have to the new format (and enhance
the output).
Sponsored by: The FreeBSD Foundation
Sponsored by: CK Software GmbH
Reviewed by: jhb
MFC after: 2 weeks
table in if_grow(). The order of the SYSINIT's for ifnet state were swapped
so that the various locks were initialized before being used.
Reviewed by: pluknet, bz
MFC after: 2 weeks
DPCPU_DEFINE and VNET_DEFINE macros, as these cause problems for various
people working on the affected files. A better long-term solution is
still being considered. This reversal may give some modules empty
set_pcpu or set_vnet sections, but these are harmless.
Changes reverted:
------------------------------------------------------------------------
r215318 | dim | 2010-11-14 21:40:55 +0100 (Sun, 14 Nov 2010) | 4 lines
Instead of unconditionally emitting .globl's for the __start_set_xxx and
__stop_set_xxx symbols, only emit them when the set_vnet or set_pcpu
sections are actually defined.
------------------------------------------------------------------------
r215317 | dim | 2010-11-14 21:38:11 +0100 (Sun, 14 Nov 2010) | 3 lines
Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout
the tree.
------------------------------------------------------------------------
r215316 | dim | 2010-11-14 21:23:02 +0100 (Sun, 14 Nov 2010) | 2 lines
Add macros to define static instances of VNET_DEFINE and DPCPU_DEFINE.
enhancements (1). Switch to a standard 2-clause BSD license for this (2).
Unfortunately we have to un-static the ifindex_table for this but do not
publicly export it.
Suggested by: rwatson (1) a while back.
Approved by: thompsa (2) for the change from r204279.
MFC after: 6 days
- move all the chunks into one file, which allows to hide SIOCGIFCONF32
global definition as well.
- replace __amd64__ with proper COMPAT_FREEBSD32 around.
- handle 32bit capacity before going into the handler itself instead of
doing internal 32bit specific changes within it (e.g. as it's done for
SIOCGDEFIFACE32_IN6).
- use explicitely sized types for ABI compat.
Approved by: kib (mentor)
MFC after: 2 weeks
code associated with overflow or with the drain function. While this
function is not expected to be used often, it produces more information
in the form of an errno that sbuf_overflowed() did.
bridge(4), lagg(4) etc. and make use of function pointers and
pf_proto_register() to hook carp into the network stack.
Currently, because of the uncertainty about whether the unload path is free
of race condition panics, unloads are disallowed by default. Compiling with
CARPMOD_CAN_UNLOAD in CFLAGS removes this anti foot shooting measure.
This commit requires IP6PROTOSPACER, introduced in r211115.
Reviewed by: bz, simon
Approved by: ken (mentor)
MFC after: 2 weeks
queue length. The default value for this parameter is 50, which is
quite low for many of today's uses and the only way to modify this
parameter right now is to edit if_var.h file. Also add read-only
sysctl with the same name, so that it's possible to retrieve the
current value.
MFC after: 1 month
"Whitspace" churn after the VIMAGE/VNET whirls.
Remove the need for some "init" functions within the network
stack, like pim6_init(), icmp_init() or significantly shorten
others like ip6_init() and nd6_init(), using static initialization
again where possible and formerly missed.
Move (most) variables back to the place they used to be before the
container structs and VIMAGE_GLOABLS (before r185088) and try to
reduce the diff to stable/7 and earlier as good as possible,
to help out-of-tree consumers to update from 6.x or 7.x to 8 or 9.
This also removes some header file pollution for putatively
static global variables.
Revert VIMAGE specific changes in ipfilter::ip_auth.c, that are
no longer needed.
Reviewed by: jhb
Discussed with: rwatson
Sponsored by: The FreeBSD Foundation
Sponsored by: CK Software GmbH
MFC after: 6 days
interface considers that it hits a fatal error, and will not copyout
the request structure back for _IOW and _IOWR ioctls, keeping them
untouched.
The previous implementation of the SIOCGIFDESCR ioctl intends to
feed the buffer length back to userland. However, if we return
an error, the feedback would be defeated and ifconfig(8) would
trap into an infinite loop.
This commit changes SIOCGIFDESCR to set buffer field to NULL to
indicate the previous ENAMETOOLONG case.
Reported by: bschmidt
MFC after: 2 weeks
dom_ifdetach() calls as they might sleep for callout_drain().
Do as we do in if_attachdomain1() [r121470] and handle
if_afdata_initialized earlier and call dom_ifdetach() unlocked.
Discussed with: rwatson
MFC after: 10 days
has actually succeeded to initialize and attach. There is a theoretical
possibility to drop out early in if_attachdomain1() leaving the array
uninitialized if we cannot get the lock.
Discussed with: rwatson
MFC after: 10 days
- 'show ifnets' prints a list of ifnet *s per virtual network stack,
- 'show ifnet <struct ifnet *>' prints fields matching the given ifp.
We do not yet print the complete set of fields and might want to
factor this out to an extra if_debug.c file in case this grows
a lot[1]. We may also want to grow 'show ifnet <if_xname>' support[1].
Sponsored by: ISPsystem
Suggested by: rwatson [1]
Reviewed by: rwatson
MFC after: 5 days
ifmultiaddr structures' reference to the parent interface, unless the parent
interface is really detaching. While here, program only link layer multicast
filters to a wlan's hardware parent interface.
PR: kern/142391, kern/142392
Reviewed by: sam, rpaolo, bms
MFC after: 1 week
address on an interface has changed. This lets stacked interfaces such as
vlan(4) detect that their lower interface has changed and adjust things in
order to keep working. Previously this situation broke at least vlan(4) and
lagg(4) configurations.
The EVENTHANDLER_INVOKE call was not placed within if_setlladdr() due to the
risk of a loop.
PR: kern/142927
Submitted by: Nikolay Denev
r195175. Remove all definitions, documentation, and usage.
fifo_misc.c:
Remove all kqueue tests as fifo_io.c performs all those that
would have remained.
Reviewed by: rwatson
MFC after: 3 weeks
X-MFC note: don't change vlan_link_state() function signature
renamed. Previously the vlan interfaces would lose their configuration as if
the parent interface had been physically removed. Now vlan interfaces ignore
rename events.
- Add a new ifnet flag (IFF_RENAMING) that is set while an ifnet is being
renamed. This flag can be checked in ifnet departure/arrival event
handlers to treat rename events differently.
- Change the ifnet departure event handler in the if_vlan(4) driver to
ignore departure events due to a trunk interface being renamed.
Reviewed by: brooks, rwatson
MFC after: 1 week
and address aliases. After an interface is brought down and brought
back up again, those self pointing routes disappeared. This patch
ensures after an interface is brought back up, the loopback routes
are reinstalled properly.
Reviewed by: bz
MFC after: immediately
which allows an index to be reserved for an ifnet without making
the ifnet available for management operations. Use this in if_alloc()
while the ifnet lock is released between initial index allocation and
completion of ifnet initialization.
Add ifindex_free() to centralize the implementation of releasing an
ifindex value. Use in if_free() and if_vmove(), as well as when
releasing a held index in if_alloc().
Reviewed by: bz
MFC after: 3 days
and centralize in a single function ifindex_alloc(). Assert the
IFNET_WLOCK, and add missing IFNET_WLOCK in if_alloc(). This does not
close all known races in this code.
Reviewed by: bz
MFC after: 3 days
has ifaddresses of AF_LINK type which thus have an embedded
if_index "backpointer", we must update that if_index backpointer
to reflect the new if_index that our ifnet just got assigned.
This change affects only options VIMAGE builds.
Submitted by: bz
Reviewed by: bz
Approved by: re (rwatson), julian (mentor)
several critical bugs, including race conditions and lock order issues:
Replace the single rwlock, ifnet_lock, with two locks, an rwlock and an
sxlock. Either can be held to stablize the lists and indexes, but both
are required to write. This allows the list to be held stable in both
network interrupt contexts and sleepable user threads across sleeping
memory allocations or device driver interactions. As before, writes to
the interface list must occur from sleepable contexts.
Reviewed by: bz, julian
MFC after: 3 days
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks. Minor cleanups are done in the process,
and comments updated to reflect these changes.
Reviewed by: bz
Approved by: re (vimage blanket)
(ifconfig ifN (-)vnet <jname|jid>) work correctly.
Move vi_if_move to if.c and split it up into two functions(*),
one for each ioctl.
In the reclaim case, correctly set the vnet before calling if_vmove.
Instead of silently allowing a move of an interface from the current
vnet to the current vnet, return an error. (*)
There is some duplicate interface name checking before actually moving
the interface between network stacks without locking and thus race
prone. Ideally if_vmove will correctly and automagically handle these
in the future.
Suggested by: rwatson (*)
Approved by: re (kib)
network stacks, VNET_SYSINIT:
- Add VNET_SYSINIT and VNET_SYSUNINIT macros to declare events that will
occur each time a network stack is instantiated and destroyed. In the
!VIMAGE case, these are simply mapped into regular SYSINIT/SYSUNINIT.
For the VIMAGE case, we instead use SYSINIT's to track their order and
properties on registration, using them for each vnet when created/
destroyed, or immediately on module load for already-started vnets.
- Remove vnet_modinfo mechanism that existed to serve this purpose
previously, as well as its dependency scheme: we now just use the
SYSINIT ordering scheme.
- Implement VNET_DOMAIN_SET() to allow protocol domains to declare that
they want init functions to be called for each virtual network stack
rather than just once at boot, compiling down to DOMAIN_SET() in the
non-VIMAGE case.
- Walk all virtualized kernel subsystems and make use of these instead
of modinfo or DOMAIN_SET() for init/uninit events. In some cases,
convert modular components from using modevent to using sysinit (where
appropriate). In some cases, do minor rejuggling of SYSINIT ordering
to make room for or better manage events.
Portions submitted by: jhb (VNET_SYSINIT), bz (cleanup)
Discussed with: jhb, bz, julian, zec
Reviewed by: bz
Approved by: re (VIMAGE blanket)
unused custom mutex/condvar-based sleep locks with two locks: an
rwlock (for non-sleeping use) and sxlock (for sleeping use). Either
acquired for read is sufficient to stabilize the vnet list, but both
must be acquired for write to modify the list.
Replace previous no-op read locking macros, used in various places
in the stack, with actual locking to prevent race conditions. Callers
must declare when they may perform unbounded sleeps or not when
selecting how to lock.
Refactor vnet sysinits so that the vnet list and locks are initialized
before kernel modules are linked, as the kernel linker will use them
for modules loaded by the boot loader.
Update various consumers of these KPIs based on whether they may sleep
or not.
Reviewed by: bz
Approved by: re (kib)
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator. Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...). This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.
Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack. Virtualized global variables are
tagged at compile-time, placing the in a special linker set, which is
loaded into a contiguous region of kernel memory. Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of a the kernel linker.
Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy. Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address. When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.
This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.
Bump __FreeBSD_version and update UPDATING.
Portions submitted by: bz
Reviewed by: bz, zec
Discussed with: gnn, jamie, jeff, jhb, julian, sam
Suggested by: peter
Approved by: re (kensmith)
little purpose and are unused in the base system.
The IOCTL functionality is entirely duplicated and routing sockets
provide a richer interface than the kqueue functionality.
Further, it is not practical for these devices to be made sensible in
the face of VIMAGE.
Bump __FreeBSD_version on the off chance that there is any code out
there that actually uses this stuff.
Reviewed by: rwatson
Discussed with: bz, zec
Approved by: re@ (kensmith)
if_addr_rlock() and if_addr_runlock() for regular address lists, and
if_maddr_rlock() and if_maddr_runlock() for multicast address lists.
We will use these in various kernel modules to avoid encoding specific
type and locking strategy information into modules that currently use
IF_ADDR_LOCK() and IF_ADDR_UNLOCK() directly.
MFC after: 6 weeks
rather than pointers, requiring callers to properly dispose of those
references. The following routines now return references:
ifaddr_byindex
ifa_ifwithaddr
ifa_ifwithbroadaddr
ifa_ifwithdstaddr
ifa_ifwithnet
ifaof_ifpforaddr
ifa_ifwithroute
ifa_ifwithroute_fib
rt_getifa
rt_getifa_fib
IFP_TO_IA
ip_rtaddr
in6_ifawithifp
in6ifa_ifpforlinklocal
in6ifa_ifpwithaddr
in6_ifadd
carp_iamatch6
ip6_getdstifaddr
Remove unused macro which didn't have required referencing:
IFP_TO_IA6
This closes many small races in which changes to interface
or address lists while an ifaddr was in use could lead to use of freed
memory (etc). In a few cases, add missing if_addr_list locking
required to safely acquire references.
Because of a lack of deep copying support, we accept a race in which
an in6_ifaddr pointed to by mbuf tags and extracted with
ip6_getdstifaddr() doesn't hold a reference while in transmit. Once
we have mbuf tag deep copy support, this can be fixed.
Reviewed by: bz
Obtained from: Apple, Inc. (portions)
MFC after: 6 weeks (portions)
Vimage module, which had been there already but now is stateful.
All variables are now file local; so this further limits the global
spreading of routing related things throughout the kernel.
Add a missing function local variable in case of MPATHing.
Reviewed by: zec
a pointer to an ifaddr matching the passed socket address, returns a
boolean indicating whether one was present. In the (near) future,
ifa_ifwithaddr() will return a referenced ifaddr rather than a raw
ifaddr pointer, and the new wrapper will allow callers that care only
about the boolean condition to avoid having to free that reference.
MFC after: 3 weeks
- Unify reference count and lock initialization in a single function,
ifa_init().
- Move tear-down from a macro (IFAFREE) to a function ifa_free().
- Move reference count bump from a macro (IFAREF) to a function ifa_ref().
- Instead of using a u_int protected by a mutex to refcount(9) for
reference count management.
The ifa_mtx is now used for exactly one ioctl, and possibly should be
removed.
MFC after: 3 weeks
by if_free (w/o doing if_attach); move ifq_attach to if_alloc and
rename ifq_attach/detach to ifq_init/ifq_delete to better identify
their purpose
Reviewed by: jhb, kmacy
parameter "vnet" when it is created, a new vnet instance will be created
along with the jail. Networks interfaces can be moved between prisons
with an ioctl similar to the one that moves them between vimages.
For now vnets will co-exist under both jails and vimages, but soon
struct vimage will be going away.
Reviewed by: zec, julian
Approved by: bz (mentor)
use IPv4/v6 for inter-node communication (according to my reading).
Properly wrap the carp callouts in INET || INET6 and refelect this
in sys/conf/files as well. While in theory this should be ok,
it might be a bit optimistic to think that carp could build with
inet6 only[1].
Discussed with: mlaier [1]
vnode interlock to protect the knote fields [1]. The locking assumes
that shared vnode lock is held, thus we get exclusive access to knote
either by exclusive vnode lock protection, or by shared vnode lock +
vnode interlock.
Do not use kl_locked() method to assert either lock ownership or the
fact that curthread does not own the lock. For shared locks, ownership
is not recorded, e.g. VOP_ISLOCKED can return LK_SHARED for the shared
lock not owned by curthread, causing false positives in kqueue subsystem
assertions about knlist lock.
Remove kl_locked method from knlist lock vector, and add two separate
assertion methods kl_assert_locked and kl_assert_unlocked, that are
supposed to use proper asserts. Change knlist_init accordingly.
Add convenience function knlist_init_mtx to reduce number of arguments
for typical knlist initialization.
Submitted by: jhb [1]
Noted by: jhb [2]
Reviewed by: jhb
Tested by: rnoland
the ROUTETABLES kernel option thus there is no need to include opt_route.h
anymore in all consumers of vnet.h and no longer depend on it for module
builds.
Remove the hidden include in flowtable.h as well and leave the two
explicit #includes in ip_input.c and ip_output.c.
Vnet modules and protocol domains may now register destructor
functions to clean up and release per-module state. The destructor
mechanisms can be triggered by invoking "vimage -d", or a future
equivalent command which will be provided via the new jail framework.
While this patch introduces numerous placeholder destructor functions,
many of those are currently incomplete, thus leaking memory or (even
worse) failing to stop all running timers. Many of such issues are
already known and will be incrementaly fixed over the next weeks in
smaller incremental commits.
Apart from introducing new fields in structs ifnet, domain, protosw
and vnet_net, which requires the kernel and modules to be rebuilt, this
change should have no impact on nooptions VIMAGE builds, since vnet
destructors can only be called in VIMAGE kernels. Moreover,
destructor functions should be in general compiled in only in
options VIMAGE builds, except for kernel modules which can be safely
kldunloaded at run time.
Bump __FreeBSD_version to 800097.
Reviewed by: bz, julian
Approved by: rwatson, kib (re), julian (mentor)
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.
Discussed with: pjd
an accessor function to get the correct rnh pointer back.
Update netstat to get the correct pointer using kvm_read()
as well.
This not only fixes the ABI problem depending on the kernel
option but also permits the tunable to overwrite the kernel
option at boot time up to MAXFIBS, enlarging the number of
FIBs without having to recompile. So people could just use
GENERIC now.
Reviewed by: julian, rwatson, zec
X-MFC: not possible
assigning ifnets from one vnet to another. Deletion of vnets is not
yet supported.
The interface is implemented as an ioctl extension so that no syscalls
had to be introduced. This should be acceptable given that the new
interface will be used for a short / interim period only, until the
new jail management framwork gains the capability of managing vnets.
This method for managing vimages / vnets has been in use for the past
7 years without any observable issues.
The userland tool to be used in conjunction with the interim API can be
found in p4: //depot/projects/vimage-commit2/src/usr.sbin/vimage/... and
will most probably never get commited to svn.
While here, bump copyright notices in kern_vimage.c and vimage.h to
cover work done in year 2009.
Approved by: julian (mentor)
Discussed with: bz, rwatson
for reassigning ifnets from one vnet to another.
if_vmove() works by calling a restricted subset of actions normally
executed by if_detach() on an ifnet in the current vnet, and then
switches to the target vnet and executes an appropriate subset of
if_attach() actions there.
if_attach() and if_detach() have become wrapper functions around
if_attach_internal() and if_detach_internal(), where the later
variants have an additional argument, a flag indicating whether a
full attach or detach sequence is to be executed, or only a
restricted subset suitable for moving an ifnet from one vnet to
another. Hence, if_vmove() will not call if_detach() and if_attach()
directly, but will call the if_detach_internal() and
if_attach_internal() variants instead, with the vmove flag set.
While here, staticize ifnet_setbyindex() since it is not referenced
from outside of sys/net/if.c.
Also rename ifccnt field in struct vimage to ifcnt, and do some minor
whitespace garbage collection where appropriate.
This change should have no functional impact on nooptions VIMAGE kernel
builds.
Reviewed by: bz, rwatson, brooks?
Approved by: julian (mentor)
previously always pointing to the default vnet context, to a
dynamically changing thread-local one. The currvnet context
should be set on entry to networking code via CURVNET_SET() macros,
and reverted to previous state via CURVNET_RESTORE(). Recursions
on curvnet are permitted, though strongly discuouraged.
This change should have no functional impact on nooptions VIMAGE
kernel builds, where CURVNET_* macros expand to whitespace.
The curthread->td_vnet (aka curvnet) variable's purpose is to be an
indicator of the vnet context in which the current network-related
operation takes place, in case we cannot deduce the current vnet
context from any other source, such as by looking at mbuf's
m->m_pkthdr.rcvif->if_vnet, sockets's so->so_vnet etc. Moreover, so
far curvnet has turned out to be an invaluable consistency checking
aid: it helps to catch cases when sockets, ifnets or any other
vnet-aware structures may have leaked from one vnet to another.
The exact placement of the CURVNET_SET() / CURVNET_RESTORE() macros
was a result of an empirical iterative process, whith an aim to
reduce recursions on CURVNET_SET() to a minimum, while still reducing
the scope of CURVNET_SET() to networking only operations - the
alternative would be calling CURVNET_SET() on each system call entry.
In general, curvnet has to be set in three typicall cases: when
processing socket-related requests from userspace or from within the
kernel; when processing inbound traffic flowing from device drivers
to upper layers of the networking stack, and when executing
timer-driven networking functions.
This change also introduces a DDB subcommand to show the list of all
vnet instances.
Approved by: julian (mentor)
active network stack instance. Turning on options VIMAGE at compile
time yields the following changes relative to default kernel build:
1) V_ accessor macros for virtualized variables resolve to structure
fields via base pointers, instead of being resolved as fields in global
structs or plain global variables. As an example, V_ifnet becomes:
options VIMAGE: ((struct vnet_net *) vnet_net)->_ifnet
default build: vnet_net_0._ifnet
options VIMAGE_GLOBALS: ifnet
2) INIT_VNET_* macros will declare and set up base pointers to be used
by V_ accessor macros, instead of resolving to whitespace:
INIT_VNET_NET(ifp->if_vnet); becomes
struct vnet_net *vnet_net = (ifp->if_vnet)->mod_data[VNET_MOD_NET];
3) Memory for vnet modules registered via vnet_mod_register() is now
allocated at run time in sys/kern/kern_vimage.c, instead of per vnet
module structs being declared as globals. If required, vnet modules
can now request the framework to provide them with allocated bzeroed
memory by filling in the vmi_size field in their vmi_modinfo structures.
4) structs socket, ifnet, inpcbinfo, tcpcb and syncache_head are
extended to hold a pointer to the parent vnet. options VIMAGE builds
will fill in those fields as required.
5) curvnet is introduced as a new global variable in options VIMAGE
builds, always pointing to the default and only struct vnet.
6) struct sysctl_oid has been extended with additional two fields to
store major and minor virtualization module identifiers, oid_v_subs and
oid_v_mod. SYSCTL_V_* family of macros will fill in those fields
accordingly, and store the offset in the appropriate vnet container
struct in oid_arg1.
In sysctl handlers dealing with virtualized sysctls, the
SYSCTL_RESOLVE_V_ARG1() macro will compute the address of the target
variable and make it available in arg1 variable for further processing.
Unused fields in structs vnet_inet, vnet_inet6 and vnet_ipfw have
been deleted.
Reviewed by: bz, rwatson
Approved by: julian (mentor)
interface pointer, but also a reference to it.
Modify ifioctl() to use ifunit_ref(), holding the reference until
all ioctls, etc, have completed.
This closes a class of reader-writer races in which interfaces
could be removed during long-running ioctls, leading to crashes.
Many other consumers of ifunit() should now use ifunit_ref() to
avoid similar races.
MFC after: 3 weeks
pointers to "dead" implementations that no-op rather than invoking
the device driver. This would generally be unexpected and
possibly quite badly handled by most device drivers after
if_detach() has completed.
Reviewed by: bms
MFC after: 3 weeks
if_alloc(), and portions of data structure destruction from if_detach()
to if_free(). These changes leave more of the struct ifnet in a
safe-to-access condition between alloc and attach, and between detach
and free, and focus on attach/detach as stack usage events rather than
data structure initialization.
Affected fields include the linkstate task queue, if_afdata lock,
address lists, kqueue state, and MAC labels. ifq_attach() ifq_detach()
are not moved as ifq_attach() may use a queue length set by the device
driver between if_alloc() and if_attach().
MFC after: 3 weeks
calls if_free(), and remains set if the refcount is elevated. IF_DYING
skips the bit in the if_flags bitmask previously used by IFF_NEEDSGIANT,
so that an MFC can be done without changing which bit is used, as
IFF_NEEDSGIANT is still present in 7.x.
ifnet_byindex_ref() checks for IFF_DYING and returns NULL if it is set,
preventing new references from by acquired by index, preventing
monitoring sysctls from seeing it. Other lookup mechanisms currently
do not check IFF_DYING, but may need to in the future.
MFC after: 3 weeks
after the corresponding interface has been destroyed:
(1) Add an ifnet refcount, ifp->if_refcount. Initialize it to 1 in
if_alloc(), and modify if_free_type() to decrement and check the
refcount.
(2) Add new if_ref() and if_rele() interfaces to allow kernel code
walking global interface lists to release IFNET_[RW]LOCK() yet
keep the ifnet stable. Currently, if_rele() is a no-op wrapper
around if_free(), but this may change in the future.
(3) Add new ifnet field, if_alloctype, which caches the type passed
to if_alloc(), but unlike if_type, won't be changed by drivers.
This allows asynchronous free's of the interface after the
driver has released it to still use the right type. Use that
instead of the type passed to if_free_type(), but assert that
they are the same (might have to rethink this if that doesn't
work out).
(4) Add a new ifnet_byindex_ref(), which looks up an interface by
index and returns a reference rather than a pointer to it.
(5) Fix if_alloc() to fully initialize the if_addr_mtx before hooking
up the ifnet to global lists.
(6) Modify sysctls in if_mib.c to use ifnet_byindex_ref() and release
the ifnet when done.
When this change is MFC'd, it will need to replace if_ispare fields
rather than adding new fields in order to avoid breaking the binary
interface. Once this change is MFC'd, if_free_type() should be
removed, as its 'type' argument is now optional.
This refcount is not appropriate for counting mbuf pkthdr references,
and also not for counting entry into the device driver via ifnet
function pointers. An rmlock may be appropriate for the latter.
Rather, this is about ensuring data structure stability when reaching
an ifnet via global ifnet lists and tables followed by copy in or out
of userspace.
MFC after: 3 weeks
Reported by: mdtancsa
Reviewed by: brooks