avoidance:
- Enable setting the RXCSUM and TXCSUM flags for loopback interfaces;
set both by default.
- When RXCSUM is set, flag packets sent over the loopback interface as
having checked and valid IP, UDP, TCP checksums so that higher
protocol layers won't check them.
- Always clear CSUM_{IP,UDP_TCP} checksum required flags on transmit,
as they will have gotten there as a result of TXCSUM being set.
This is done only for packets explicitly sent over the loopback, not
simulated loopback via if_simloop() due to !SIMPLEX interfaces, etc.
Note that enabling TXCSUM but not RXCSUM will lead to unhappiness, as
checksums won't be generated but will be validated.
Kris reports that this leads to significant performance improvements
in loopback benchmarking with TCP and UDP for throughput:
RXCSUM RXCSUM+TXCSUM
TCP 15% 37%
UDP 10% 74%
Update man page.
Reviewed by: sam
Tested by: kris
MFC after: 1 week
net/route.h.
Remove the hidden include of opt_route.h and net/route.h from net/vnet.h.
We need to make sure that both opt_route.h and net/route.h are included
before net/vnet.h because of the way MRT figures out the number of FIBs
from the kernel option. If we do not, we end up with the default number
of 1 when including net/vnet.h and array sizes are wrong.
This does not change the list of files which depend on opt_route.h
but we can identify them now more easily.
in the loopback and synthetic loopback code so that packets are
access control checked and relabeled. Previously, the MAC
Framework enforced that packets sent over the loopback weren't
relabeled, but this will allow policies to make explicit choices
about how and whether to relabel packets on the loopback. Also,
for SIMPLEX devices, this produces more consistent behavior for
looped back packets to the local MAC address by labeling those
packets as coming from the interface.
Discussed with: csjp
Obtained from: TrustedBSD Project
directly include only the header files needed. This reduces the
unneeded spamming of various headers into lots of files.
For now, this leaves us with very few modules including vnet.h
and thus needing to depend on opt_route.h.
Reviewed by: brooks, gnn, des, zec, imp
Sponsored by: The FreeBSD Foundation
whitespace) macros from p4/vimage branch.
Do a better job at enclosing all instantiations of globals
scheduled for virtualization in #ifdef VIMAGE_GLOBALS blocks.
De-virtualize and mark as const saorder_state_alive and
saorder_state_any arrays from ipsec code, given that they are never
updated at runtime, so virtualizing them would be pointless.
Reviewed by: bz, julian
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation
for virtualization.
Instead of initializing the affected global variables at instatiation,
assign initial values to them in initializer functions. As a rule,
initialization at instatiation for such variables should never be
introduced again from now on. Furthermore, enclose all instantiations
of such global variables in #ifdef VIMAGE_GLOBALS blocks.
Essentialy, this change should have zero functional impact. In the next
phase of merging network stack virtualization infrastructure from
p4/vimage branch, the new initialization methology will allow us to
switch between using global variables and their counterparts residing in
virtualization containers with minimum code churn, and in the long run
allow us to intialize multiple instances of such container structures.
Discussed at: devsummit Strassburg
Reviewed by: bz, julian
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation
from the vimage project, as per plan established at devsummit 08/08:
http://wiki.freebsd.org/Image/Notes200808DevSummit
Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator
macros, and CURVNET_SET() context setting macros, all currently
resolving to NOPs.
Prepare for virtualization of selected SYSCTL objects by introducing a
family of SYSCTL_V_*() macros, currently resolving to their global
counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT().
Move selected #defines from sys/sys/vimage.h to newly introduced header
files specific to virtualized subsystems (sys/net/vnet.h,
sys/netinet/vinet.h etc.).
All the changes are verified to have zero functional impact at this
point in time by doing MD5 comparision between pre- and post-change
object files(*).
(*) netipsec/keysock.c did not validate depending on compile time options.
Implemented by: julian, bz, brooks, zec
Reviewed by: julian, bz, brooks, kris, rwatson, ...
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation
virtualization work done by Marko Zec (zec@).
This is the first in a series of commits over the course
of the next few weeks.
Mark all uses of global variables to be virtualized
with a V_ prefix.
Use macros to map them back to their global names for
now, so this is a NOP change only.
We hope to have caught at least 85-90% of what is needed
so we do not invalidate a lot of outstanding patches again.
Obtained from: //depot/projects/vimage-commit2/...
Reviewed by: brooks, des, ed, mav, julian,
jamie, kris, rwatson, zec, ...
(various people I forgot, different versions)
md5 (with a bit of help)
Sponsored by: NLnet Foundation, The FreeBSD Foundation
X-MFC after: never
V_Commit_Message_Reviewed_By: more people than the patch
Now that the pseudo-interface cloner has an internal list of instances,
there is no need to create a softc. The softc only contains a pointer to
the ifp, which means there is no valid reason to keep it. While there,
remove the corresponding malloc-pool.
Approved by: philip (mentor)
a private softc list is needed neither for tracking clones in general
nor for destroying all clones before the module unload -- if_clone
takes care of all that. (Note that some other interface drivers do
need a softc list to be able to scan it for their private purposes.)
parameter that can specify configuration parameters:
o rev cloner api's to add optional parameter block
o add SIOCCREATE2 that accepts parameter data
o rev vlan support to use new api (maintain old code)
Reviewed by: arch@
(1) bpf peer attaches to interface netif0
(2) Packet is received by netif0
(3) ifp->if_bpf pointer is checked and handed off to bpf
(4) bpf peer detaches from netif0 resulting in ifp->if_bpf being
initialized to NULL.
(5) ifp->if_bpf is dereferenced by bpf machinery
(6) Kaboom
This race condition likely explains the various different kernel panics
reported around sending SIGINT to tcpdump or dhclient processes. But really
this race can result in kernel panics anywhere you have frequent bpf attach
and detach operations with high packet per second load.
Summary of changes:
- Remove the bpf interface's "driverp" member
- When we attach bpf interfaces, we now set the ifp->if_bpf member to the
bpf interface structure. Once this is done, ifp->if_bpf should never be
NULL. [1]
- Introduce bpf_peers_present function, an inline operation which will do
a lockless read bpf peer list associated with the interface. It should
be noted that the bpf code will pickup the bpf_interface lock before adding
or removing bpf peers. This should serialize the access to the bpf descriptor
list, removing the race.
- Expose the bpf_if structure in bpf.h so that the bpf_peers_present function
can use it. This also removes the struct bpf_if; hack that was there.
- Adjust all consumers of the raw if_bpf structure to use bpf_peers_present
Now what happens is:
(1) Packet is received by netif0
(2) Check to see if bpf descriptor list is empty
(3) Pickup the bpf interface lock
(4) Hand packet off to process
From the attach/detach side:
(1) Pickup the bpf interface lock
(2) Add/remove from bpf descriptor list
Now that we are storing the bpf interface structure with the ifnet, there is
is no need to walk the bpf interface list to locate the correct bpf interface.
We now simply look up the interface, and initialize the pointer. This has a
nice side effect of changing a bpf interface attach operation from O(N) (where
N is the number of bpf interfaces), to O(1).
[1] From now on, we can no longer check ifp->if_bpf to tell us whether or
not we have any bpf peers that might be interested in receiving packets.
In collaboration with: sam@
MFC after: 1 month
IFF_DRV_RUNNING, as well as the move from ifnet.if_flags to
ifnet.if_drv_flags. Device drivers are now responsible for
synchronizing access to these flags, as they are in if_drv_flags. This
helps prevent races between the network stack and device driver in
maintaining the interface flags field.
Many __FreeBSD__ and __FreeBSD_version checks maintained and continued;
some less so.
Reviewed by: pjd, bz
MFC after: 7 days
a DLT_NULL interface. In particular:
1) Consistently use type u_int32_t for the header of a
DLT_NULL device - it continues to represent the address
family as always.
2) In the DLT_NULL case get bpf_movein to store the u_int32_t
in a sockaddr rather than in the mbuf, to be consistent
with all the DLT types.
3) Consequently fix a bug in bpf_movein/bpfwrite which
only permitted packets up to 4 bytes less than the MTU
to be written.
4) Fix all DLT_NULL devices to have the code required to
allow writing to their bpf devices.
5) Move the code to allow writing to if_lo from if_simloop
to looutput, because it only applies to DLT_NULL devices
but was being applied to other devices that use if_simloop
possibly incorrectly.
PR: 82157
Submitted by: Matthew Luckie <mjl@luckie.org.nz>
Approved by: re (scottl)
struct ifnet or the layer 2 common structure it was embedded in have
been replaced with a struct ifnet pointer to be filled by a call to the
new function, if_alloc(). The layer 2 common structure is also allocated
via if_alloc() based on the interface type. It is hung off the new
struct ifnet member, if_l2com.
This change removes the size of these structures from the kernel ABI and
will allow us to better manage them as interfaces come and go.
Other changes of note:
- Struct arpcom is no longer referenced in normal interface code.
Instead the Ethernet address is accessed via the IFP2ENADDR() macro.
To enforce this ac_enaddr has been renamed to _ac_enaddr.
- The second argument to ether_ifattach is now always the mac address
from driver private storage rather than sometimes being ac_enaddr.
Reviewed by: sobomax, sam
its users.
netisr_queue() now returns (0) on success and ERRNO on failure. At the
moment ENXIO (netisr queue not functional) and ENOBUFS (netisr queue full)
are supported.
Previously it would return (1) on success but the return value of IF_HANDOFF()
was interpreted wrongly and (0) was actually returned on success. Due to this
schednetisr() was never called to kick the scheduling of the isr. However this
was masked by other normal packets coming through netisr_dispatch() causing the
dequeueing of waiting packets.
PR: kern/70988
Found by: MOROHOSHI Akihiko <moro@remus.dti.ne.jp>
MFC after: 3 days
for unknown events.
A number of modules return EINVAL in this instance, and I have left
those alone for now and instead taught MOD_QUIESCE to accept this
as "didn't do anything".
- Split the code out into if_clone.[ch].
- Locked struct if_clone. [1]
- Add a per-cloner match function rather then simply matching names of
the form <name><unit> and <name>.
- Use the match function to allow creation of <interface>.<tag>
vlan interfaces. The old way is preserved unchanged!
- Also the match function to allow creation of stf(4) interfaces named
stf0, stf, or 6to4. This is the only major user visible change in
that "ifconfig stf" creates the interface stf rather then stf0 and
does not print "stf0" to stdout.
- Allow destroy functions to fail so they can refuse to delete
interfaces. Currently, we forbid the deletion of interfaces which
were created in the init function, particularly lo0, pflog0, and
pfsync0. In the case of lo0 this was a panic implementation so it
does not count as a user visiable change. :-)
- Since most interfaces do not need the new functionality, an family of
wrapper functions, ifc_simple_*(), were created to wrap old style
cloner functions.
- The IF_CLONE_INITIALIZER macro is replaced with a new incompatible
IFC_CLONE_INITIALIZER and ifc_simple consumers use IFC_SIMPLE_DECLARE
instead.
Submitted by: Maurycy Pawlowski-Wieronski <maurycy at fouk.org> [1]
Reviewed by: andre, mlaier
Discussed on: net
really sure why we have a softc list for if_loop, given that it
can't be unloaded, but that's an issue to revisit in the future as
corrupting the softc list would still cause panics.
Reviewed by: benno
a new bpf_mtap2 routine that does the right thing for an mbuf
and a variable-length chunk of data that should be prepended.
o while we're sweeping the drivers, use u_int32_t uniformly when
when prepending the address family (several places were assuming
sizeof(int) was 4)
o return M_ASSERTVALID to BPF_MTAP* now that all stack-allocated
mbufs have been eliminated; this may better be moved to the bpf
routines
Reviewed by: arch@ and several others
the routing table. Move all usage and references in the tcp stack
from the routing table metrics to the tcp hostcache.
It caches measured parameters of past tcp sessions to provide better
initial start values for following connections from or to the same
source or destination. Depending on the network parameters to/from
the remote host this can lead to significant speedups for new tcp
connections after the first one because they inherit and shortcut
the learning curve.
tcp_hostcache is designed for multiple concurrent access in SMP
environments with high contention and is hash indexed by remote
ip address.
It removes significant locking requirements from the tcp stack with
regard to the routing table.
Reviewed by: sam (mentor), bms
Reviewed by: -net, -current, core@kame.net (IPv6 parts)
Approved by: re (scottl)
if_xname, if_dname, and if_dunit. if_xname is the name of the interface
and if_dname/unit are the driver name and instance.
This change paves the way for interface renaming and enhanced pseudo
device creation and configuration symantics.
Approved By: re (in principle)
Reviewed By: njl, imp
Tested On: i386, amd64, sparc64
Obtained From: NetBSD (if_xname)
direct dispatch) to avoid extensive kernel stack usage and to
avoid directly re-entering the network stack. The latter causes
locking problems when, for example, a complete TCP handshake`
happens w/o a context switch.
with an mbuf until it is reclaimed. This is in contrast to tags that
vanish when an mbuf chain passes through an interface. Persistent tags
are used, for example, by MAC labels.
Add an m_tag_delete_nonpersistent function to strip non-persistent tags
from mbufs and use it to strip such tags from packets as they pass through
the loopback interface and when turned around by icmp. This fixes problems
with "tag leakage".
Pointed out by: Jonathan Stone
Reviewed by: Robert Watson
that covers updates to the contents. Note this is separate from holding
a reference and/or locking the routing table itself.
Other/related changes:
o rtredirect loses the final parameter by which an rtentry reference
may be returned; this was never used and added unwarranted complexity
for locking.
o minor style cleanups to routing code (e.g. ansi-fy function decls)
o remove the logic to bump the refcnt on the parent of cloned routes,
we assume the parent will remain as long as the clone; doing this avoids
a circularity in locking during delete
o convert some timeouts to MPSAFE callouts
Notes:
1. rt_mtx in struct rtentry is guarded by #ifdef _KERNEL as user-level
applications cannot/do-no know about mutex's. Doing this requires
that the mutex be the last element in the structure. A better solution
is to introduce an externalized version of struct rtentry but this is
a major task because of the intertwining of rtentry and other data
structures that are visible to user applications.
2. There are known LOR's that are expected to go away with forthcoming
work to eliminate many held references. If not these will be resolved
prior to release.
3. ATM changes are untested.
Sponsored by: FreeBSD Foundation
Obtained from: BSD/OS (partly)
function couldn't handle chains of > MCLBYTES, and it had a bug which
caused corruption and panics in certain low mbuf situations.
Additionally, change the failure case so that looutput returns ENOBUFS
rather than attempting to pass on non-defragmented mbuf chains.
Finally, remove the printf which would happen every time the low memory
situation occured. It served no useful purpose other than to clue me
in as to what was causing the panic in question. :)
MFC after: 4 days
returning some additional room in the first mbuf in a chain, and
avoiding feature-specific contents in the mbuf header. To do this:
- Modify mbuf_to_label() to extract the tag, returning NULL if not
found.
- Introduce mac_init_mbuf_tag() which does most of the work
mac_init_mbuf() used to do, except on an m_tag rather than an
mbuf.
- Scale back mac_init_mbuf() to perform m_tag allocation and invoke
mac_init_mbuf_tag().
- Replace mac_destroy_mbuf() with mac_destroy_mbuf_tag(), since
m_tag's are now GC'd deep in the m_tag/mbuf code rather than
at a higher level when mbufs are directly free()'d.
- Add mac_copy_mbuf_tag() to support m_copy_pkthdr() and related
notions.
- Generally change all references to mbuf labels so that they use
mbuf_to_label() rather than &mbuf->m_pkthdr.label. This
required no changes in the MAC policies (yay!).
- Tweak mbuf release routines to not call mac_destroy_mbuf(),
tag destruction takes care of it for us now.
- Remove MAC magic from m_copy_pkthdr() and m_move_pkthdr() --
the existing m_tag support does all this for us. Note that
we can no longer just zero the m_tag list on the target mbuf,
rather, we have to delete the chain because m_tag's will
already be hung off freshly allocated mbuf's.
- Tweak m_tag copying routines so that if we're copying a MAC
m_tag, we don't do a binary copy, rather, we initialize the
new storage and do a deep copy of the label.
- Remove use of MAC_FLAG_INITIALIZED in a few bizarre places
having to do with mbuf header copies previously.
- When an mbuf is copied in ip_input(), we no longer need to
explicitly copy the label because it will get handled by the
m_tag code now.
- No longer any weird handling of MAC labels in if_loop.c during
header copies.
- Add MPC_LOADTIME_FLAG_LABELMBUFS flag to Biba, MLS, mac_test.
In mac_test, handle the label==NULL case, since it can be
dynamically loaded.
In order to improve performance with this change, introduce the notion
of "lazy MAC label allocation" -- only allocate m_tag storage for MAC
labels if we're running with a policy that uses MAC labels on mbufs.
Policies declare this intent by setting the MPC_LOADTIME_FLAG_LABELMBUFS
flag in their load-time flags field during declaration. Note: this
opens up the possibility of post-boot policy modules getting back NULL
slot entries even though they have policy invariants of non-NULL slot
entries, as the policy might have been loaded after the mbuf was
allocated, leaving the mbuf without label storage. Policies that cannot
handle this case must be declared as NOTLATE, or must be modified.
- mac_labelmbufs holds the current cumulative status as to whether
any policies require mbuf labeling or not. This is updated whenever
the active policy set changes by the function mac_policy_updateflags().
The function iterates the list and checks whether any have the
flag set. Write access to this variable is protected by the policy
list; read access is currently not protected for performance reasons.
This might change if it causes problems.
- Add MAC_POLICY_LIST_ASSERT_EXCLUSIVE() to permit the flags update
function to assert appropriate locks.
- This makes allocation in mac_init_mbuf() conditional on the flag.
Reviewed by: sam
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
of asserting that an mbuf has a packet header. Use it instead of hand-
rolled versions wherever applicable.
Submitted by: Hiten Pandya <hiten@unixdaemons.com>
drain routines are done by swi_net, which allows for better queue control
at some future point. Packets may also be directly dispatched to a netisr
instead of queued, this may be of interest at some installations, but
currently defaults to off.
Reviewed by: hsu, silby, jayanth, sam
Sponsored by: DARPA, NAI Labs