vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks. Minor cleanups are done in the process,
and comments updated to reflect these changes.
Reviewed by: bz
Approved by: re (vimage blanket)
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator. Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...). This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.
Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack. Virtualized global variables are
tagged at compile-time, placing the in a special linker set, which is
loaded into a contiguous region of kernel memory. Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of a the kernel linker.
Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy. Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address. When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.
This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.
Bump __FreeBSD_version and update UPDATING.
Portions submitted by: bz
Reviewed by: bz, zec
Discussed with: gnn, jamie, jeff, jhb, julian, sam
Suggested by: peter
Approved by: re (kensmith)
which is currently not protected by any type of lock. When triggered, the bug
would sometimes cause a panic when the TCP activity to an affected machine
eventually slowed during a lull. The panic only occurs if INVARIANTS is compiled
into the kernel, and has laid dormant for some time as a result of INVARIANTS
being off by default except in FreeBSD-CURRENT.
Switch to atomic operations in the locations where the variable is changed.
Reads have not been updated to be protected by atomics, so there is a
possibility of accounting errors in any given calculation where the variable is
read. This is considered unlikely to occur in the wild, and will not cause
serious harm on rare occasions where it does.
Thanks to Robert Watson for debugging help.
Reported by: Kamigishi Rei <spambox at haruhiism dot net>
Tested by: Kamigishi Rei <spambox at haruhiism dot net>
Reviewed by: silby
Approved by: re (rwatson), kensmith (mentor temporarily unavailable)
TCPSTAT_INC(), rather than directly manipulating the fields across the
kernel. This will make it easier to change the implementation of
these statistics, such as using per-CPU versions of the data structures.
MFC after: 3 days
from existing functions for initializing global state.
At this stage, the new per-vnet initializer functions are
directly called from the existing global initialization code,
which should in most cases result in compiler inlining those
new functions, hence yielding a near-zero functional change.
Modify the existing initializer functions which are invoked via
protosw, like ip_init() et. al., to allow them to be invoked
multiple times, i.e. per each vnet. Global state, if any,
is initialized only if such functions are called within the
context of vnet0, which will be determined via the
IS_DEFAULT_VNET(curvnet) check (currently always true).
While here, V_irtualize a few remaining global UMA zones
used by net/netinet/netipsec networking code. While it is
not yet clear to me or anybody else whether this is the right
thing to do, at this stage this makes the code more readable,
and makes it easier to track uncollected UMA-zone-backed
objects on vnet removal. In the long run, it's quite possible
that some form of shared use of UMA zone pools among multiple
vnets should be considered.
Bump __FreeBSD_version due to changes in layout of structs
vnet_ipfw, vnet_inet and vnet_net.
Approved by: julian (mentor)
container structures, depending on VIMAGE_GLOBALS compile time option.
Make VIMAGE_GLOBALS a new compile-time option, which by default will not
be defined, resulting in instatiations of global variables selected for
V_irtualization (enclosed in #ifdef VIMAGE_GLOBALS blocks) to be
effectively compiled out. Instantiate new global container structures
to hold V_irtualized variables: vnet_net_0, vnet_inet_0, vnet_inet6_0,
vnet_ipsec_0, vnet_netgraph_0, and vnet_gif_0.
Update the VSYM() macro so that depending on VIMAGE_GLOBALS the V_
macros resolve either to the original globals, or to fields inside
container structures, i.e. effectively
#ifdef VIMAGE_GLOBALS
#define V_rt_tables rt_tables
#else
#define V_rt_tables vnet_net_0._rt_tables
#endif
Update SYSCTL_V_*() macros to operate either on globals or on fields
inside container structs.
Extend the internal kldsym() lookups with the ability to resolve
selected fields inside the virtualization container structs. This
applies only to the fields which are explicitly registered for kldsym()
visibility via VNET_MOD_DECLARE() and vnet_mod_register(), currently
this is done only in sys/net/if.c.
Fix a few broken instances of MODULE_GLOBAL() macro use in SCTP code,
and modify the MODULE_GLOBAL() macro to resolve to V_ macros, which in
turn result in proper code being generated depending on VIMAGE_GLOBALS.
De-virtualize local static variables in sys/contrib/pf/net/pf_subr.c
which were prematurely V_irtualized by automated V_ prepending scripts
during earlier merging steps. PF virtualization will be done
separately, most probably after next PF import.
Convert a few variable initializations at instantiation to
initialization in init functions, most notably in ipfw. Also convert
TUNABLE_INT() initializers for V_ variables to TUNABLE_FETCH_INT() in
initializer functions.
Discussed at: devsummit Strassburg
Reviewed by: bz, julian
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation
directly include only the header files needed. This reduces the
unneeded spamming of various headers into lots of files.
For now, this leaves us with very few modules including vnet.h
and thus needing to depend on opt_route.h.
Reviewed by: brooks, gnn, des, zec, imp
Sponsored by: The FreeBSD Foundation
for virtualization.
Instead of initializing the affected global variables at instatiation,
assign initial values to them in initializer functions. As a rule,
initialization at instatiation for such variables should never be
introduced again from now on. Furthermore, enclose all instantiations
of such global variables in #ifdef VIMAGE_GLOBALS blocks.
Essentialy, this change should have zero functional impact. In the next
phase of merging network stack virtualization infrastructure from
p4/vimage branch, the new initialization methology will allow us to
switch between using global variables and their counterparts residing in
virtualization containers with minimum code churn, and in the long run
allow us to intialize multiple instances of such container structures.
Discussed at: devsummit Strassburg
Reviewed by: bz, julian
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation
from the vimage project, as per plan established at devsummit 08/08:
http://wiki.freebsd.org/Image/Notes200808DevSummit
Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator
macros, and CURVNET_SET() context setting macros, all currently
resolving to NOPs.
Prepare for virtualization of selected SYSCTL objects by introducing a
family of SYSCTL_V_*() macros, currently resolving to their global
counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT().
Move selected #defines from sys/sys/vimage.h to newly introduced header
files specific to virtualized subsystems (sys/net/vnet.h,
sys/netinet/vinet.h etc.).
All the changes are verified to have zero functional impact at this
point in time by doing MD5 comparision between pre- and post-change
object files(*).
(*) netipsec/keysock.c did not validate depending on compile time options.
Implemented by: julian, bz, brooks, zec
Reviewed by: julian, bz, brooks, kris, rwatson, ...
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation
virtualization work done by Marko Zec (zec@).
This is the first in a series of commits over the course
of the next few weeks.
Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.
We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.
Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch
explicitly select write locking for all use of the inpcb mutex.
Update some pcbinfo lock assertions to assert locked rather than
write-locked, although in practice almost all uses of the pcbinfo
rwlock main exclusive, and all instances of inpcb lock acquisition
are exclusive.
This change should introduce (ideally) little functional change.
However, it lays the groundwork for significantly increased
parallelism in the TCP/IP code.
MFC after: 3 months
Tested by: kris (superset of committered patch)
directly to a merged model where only one callout, the next to fire,
is registered.
Instead of callout_reset(9) and callout_stop(9) the new function
tcp_timer_activate() is used which then internally manages the callout.
The single new callout is a mutex callout on inpcb simplifying the
locking a bit.
tcp_timer() is the called function which handles all race conditions
in one place and then dispatches the individual timer functions.
Reviewed by: rwatson (earlier version)
function may be called without any TCP SACK option blocks present. Protect
iteration over SACK option blocks by checking for SACK options present flag
first.
Bug reported by: wkoszek, keramida, Nicolas Blais
(tcp_sack_output_debug checks cached hints aginst computed values by walking the
scoreboard and reports discrepancies). The sack hinting code has been stable for
many months now so it is time for the debug code to go. Leaving tcp_sack_output_debug
ifdef'ed out in case we need to resurrect it at a later point.
processing is now done in the ACK processing case.
- Merge tcp_sack_option() and tcp_del_sackholes() into a new function
called tcp_sack_doack().
- Test (SEG.ACK < SND.MAX) before processing the ACK.
Submitted by: Noritoshi Demizu
Reveiewed by: Mohan Srinivasan, Raja Mukerji
Approved by: re
the tail (in tcp_sack_option()). The bug was caused by incorrect
accounting of the retransmitted bytes in the sackhint.
Reported by: Kris Kennaway.
Submitted by: Noritoshi Demizu.
- Walks the scoreboard backwards from the tail to reduce the number of
comparisons for each sack option received.
- Introduce functions to add/remove sack scoreboard elements, making
the code more readable.
Submitted by: Noritoshi Demizu
Reviewed by: Raja Mukerji, Mohan Srinivasan
that if we sort the incoming SACK blocks, we can update the scoreboard
in one pass of the scoreboard. The added overhead of sorting upto 4
sack blocks is much lower than traversing (potentially) large
scoreboards multiple times. The code was updating the scoreboard with
multiple passes over it (once for each sack option). The rewrite fixes
that, reducing the complexity of the main loop from O(n^2) to O(n).
Submitted by: Mohan Srinivasan, Noritoshi Demizu.
Reviewed by: Raja Mukerji.
real next hole to retransmit from the scoreboard, caused by a bug
which did not update the "nexthole" hint in one case in
tcp_sack_option().
Reported by: Daniel Eriksson
Submitted by: Mohan Srinivasan
or to compute the total retransmitted bytes in this sack recovery
episode, the scoreboard is traversed. While in sack recovery, this
traversal occurs on every call to tcp_output(), every dupack and
every partial ack. The scoreboard could potentially get quite large,
making this traversal expensive.
This change optimizes this by storing hints (for the next hole to
retransmit and the total retransmitted bytes in this sack recovery
episode) reducing the complexity to find these values from O(n) to
constant time.
The debug code that sanity checks the hints against the computed
value will be removed eventually.
Submitted by: Mohan Srinivasan, Noritoshi Demizu, Raja Mukerji.
code readability and facilitates some anticipated optimizations in
tcp_sack_option().
- Remove tcp_print_holes() and TCP_SACK_DEBUG.
Submitted by: Raja Mukerji.
Reviewed by: Mohan Srinivasan, Noritoshi Demizu.
in flight in SACK recovery.
Found by: Noritoshi Demizu
Submitted by: Mohan Srinivasan <mohans at yahoo-inc dot com>
Noritoshi Demizu <demizu at dd dot ij4u dot or dot jp>
Raja Mukerji <raja at moselle dot com>
(Fix for kern/78226).
Submitted by : Noritoshi Demizu <demizu at dd dot iij4u dot or dot jp>
Reviewed by : Mohan Srinivasan (mohans at yahoo-inc dot com),
Raja Mukerji (raja at moselle dot com).
per-connection and globally. This eliminates potential DoS attacks
where SACK scoreboard elements tie up too much memory.
Submitted by: Raja Mukerji (raja at moselle dot com).
Reviewed by: Mohan Srinivasan (mohans at yahoo-inc dot com).
ignore the sack options in that segment. Else we'd end up
corrupting the scoreboard.
Found by: Raja Mukerji (raja at moselle dot com)
Submitted by: Mohan Srinivasan
Remove the SACK "initburst" sysctl.
- Fix bugs in SACK dupack and partialack handling that can cause
large bursts while in SACK recovery.
Submitted by: Mohan Srinivasan
to control the packets injected while in sack recovery (for both
retransmissions and new data).
- Cleanups to the sack codepaths in tcp_output.c and tcp_sack.c.
- Add a new sysctl (net.inet.tcp.sack.initburst) that controls the
number of sack retransmissions done upon initiation of sack recovery.
Submitted by: Mohan Srinivasan <mohans@yahoo-inc.com>