- ip_output() and ip_output6() simply call flowtable_lookup(),
passing mbuf and address family. That's the only code under
#ifdef FLOWTABLE in the protocols code now.
o Revamp statistics gathering and export.
- Remove hand made pcpu stats, and utilize counter(9).
- Snapshot of statistics is available via 'netstat -rs'.
- All sysctls are moved into net.flowtable namespace, since
spreading them over net.inet isn't correct.
o Properly separate at compile time INET and INET6 parts.
o General cleanup.
- Remove chain of multiple flowtables. We simply have one for
IPv4 and one for IPv6.
- Flowtables are allocated in flowtable.c, symbols are static.
- With proper argument to SYSINIT() we no longer need flowtable_ready.
- Hash salt doesn't need to be per-VNET.
- Removed rudimentary debugging, which use quite useless in dtrace era.
The runtime behavior of flowtable shouldn't be changed by this commit.
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
option, unbreak the lock tracing release semantic by embedding
calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined
version of the releasing functions for mutex, rwlock and sxlock.
Failing to do so skips the lockstat_probe_func invokation for
unlocking.
- As part of the LOCKSTAT support is inlined in mutex operation, for
kernel compiled without lock debugging options, potentially every
consumer must be compiled including opt_kdtrace.h.
Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the
dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
is linked there and it is only used as a compile-time stub [0].
[0] immediately shows some new bug as DTRACE-derived support for debug
in sfxge is broken and it was never really tested. As it was not
including correctly opt_kdtrace.h before it was never enabled so it
was kept broken for a while. Fix this by using a protection stub,
leaving sfxge driver authors the responsibility for fixing it
appropriately [1].
Sponsored by: EMC / Isilon storage division
Discussed with: rstone
[0] Reported by: rstone
[1] Discussed with: philip
four counters to struct ifaddr. This kills '+=' on a variables shared
between processors for every packet.
- Nuke struct if_data from struct ifaddr.
- In ip_input() do not put a reference on ifaddr, instead update statistics
right now in place and do IN_IFADDR_RUNLOCK(). These removes atomic(9)
for every packet. [1]
- To properly support NET_RT_IFLISTL sysctl used by getifaddrs(3), in
rtsock.c fill if_data fields using counter_u64_fetch().
- Accidentially fix bug in COMPAT_32 version of NET_RT_IFLISTL, which
took if_data not from the ifaddr, but from ifaddr's ifnet. [2]
Submitted by: melifaro [1], pluknet[2]
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
dynamic translation so that their arguments match the definitions for
these providers in Solaris and illumos. Thus, existing scripts for these
providers should work unmodified on FreeBSD.
Tested by: gnn, hiren
MFC after: 1 month
features. The changes in particular are:
o Remove rarely used "header" pointer and replace it with a 64bit protocol/
layer specific union PH_loc for local use. Protocols can flexibly overlay
their own 8 to 64 bit fields to store information while the packet is
worked on.
o Mechanically convert IP reassembly, IGMP/MLD and ATM to use pkthdr.PH_loc
instead of pkthdr.header.
o Extend csum_flags to 64bits to allow for additional future offload
information to be carried (e.g. iSCSI, IPsec offload, and others).
o Move the RSS hash type enumerator from abusing m_flags to its own 8bit
rsstype field. Adjust accessor macros.
o Add cosqos field to store Class of Service / Quality of Service information
with the packet. It is not yet supported in any drivers but allows us to
get on par with Cisco/Juniper in routing applications (plus MPLS QoS) with
a modernized ALTQ.
o Add four 8 bit fields l[2-5]hlen to store the relative header offsets
from the start of the packet. This is important for various offload
capabilities and to relieve the drivers from having to parse the packet
and protocol headers to find out location of checksums and other
information. Header parsing in drivers is a lot of copy-paste and
unhandled corner cases which we want to avoid.
o Add another flexible 64bit union to map various additional persistent
packet information, like ether_vtag, tso_segsz and csum fields.
Depending on the csum_flags settings some fields may have different usage
making it very flexible and adaptable to future capabilities.
o Restructure the CSUM flags to better signify their outbound (down the
stack) and inbound (up the stack) use. The CSUM flags used to be a bit
chaotic and rather poorly documented leading to incorrect use in many
places. Bring clarity into their use through better naming.
Compatibility mappings are provided to preserve the API. The drivers
can be corrected one by one and MFC'd without issue.
o The size of pkthdr stays the same at 48/56bytes (32/64bit architectures).
Sponsored by: The FreeBSD Foundation
increased the pointer, not the memory it points to.
In collaboration with: kib
Reported & tested by: Ian FREISLICH <ianf clue.co.za>
Sponsored by: Nginx, Inc.
reside on their own cache line to prevent false sharing with other
nearby structures, especially for those in the .bss segment.
NB: Those mutexes and rwlocks with variables next to them that get
changed on every invocation do not benefit from their own cache line.
Actually it may be net negative because two cache misses would be
incurred in those cases.
Convert 'struct ipstat' and 'struct tcpstat' to counter(9).
This speeds up IP forwarding at extreme packet rates, and
makes accounting more precise.
Sponsored by: Nginx, Inc.
- fix indentation
- put the operator at the end of the line for long statements
- remove spaces between the type and the variable in a cast
- remove excessive parentheses
Tested by: md5
Instead, add protocol specific mbuf flags M_IP_NEXTHOP and
M_IP6_NEXTHOP. Use them to indicate that the mbuf's chain
contains the PACKET_TAG_IPFORWARD tag. And do a tag lookup
only when this flag is set.
Suggested by: andre
on the related functionality in the runtime via the sysctl variable
net.pfil.forward. It is turned off by default.
Sponsored by: Yandex LLC
Discussed with: net@
MFC after: 2 weeks
before passing a packet to protocol input routines.
For several protocols this mean that now protocol needs to
do subtraction itself, and for another half this means that
we do not need to add header length back to the packet.
Make ip_stripoptions() to adjust ip_len, since now we enter
this function with a packet header whose ip_len does represent
length of entire packet, not payload only.
in network byte order. Any host byte order processing is
done in local variables and host byte order values are
never[1] written to a packet.
After this change a packet processed by the stack isn't
modified at all[2] except for TTL.
After this change a network stack hacker doesn't need to
scratch his head trying to figure out what is the byte order
at the given place in the stack.
[1] One exception still remains. The raw sockets convert host
byte order before pass a packet to an application. Probably
this would remain for ages for compatibility.
[2] The ip_input() still subtructs header len from ip->ip_len,
but this is planned to be fixed soon.
Reviewed by: luigi, Maxim Dounin <mdounin mdounin.ru>
Tested by: ray, Olivier Cochard-Labbe <olivier cochard.me>
- All packets in NETISR_IP queue are in net byte order.
- ip_input() is entered in net byte order and converts packet
to host byte order right _after_ processing pfil(9) hooks.
- ip_output() is entered in host byte order and converts packet
to net byte order right _before_ processing pfil(9) hooks.
- ip_fragment() accepts and emits packet in net byte order.
- ip_forward(), ip_mloopback() use host byte order (untouched actually).
- ip_fastforward() no longer modifies packet at all (except ip_ttl).
- Swapping of byte order there and back removed from the following modules:
pf(4), ipfw(4), enc(4), if_bridge(4).
- Swapping of byte order added to ipfilter(4), based on __FreeBSD_version
- __FreeBSD_version bumped.
- pfil(9) manual page updated.
Reviewed by: ray, luigi, eri, melifaro
Tested by: glebius (LE), ray (BE)
it skips FLOWTABLE lookup. However, the non-NULL ro has dual meaning
here: it may be supplied to provide route, and it may be supplied to
store and return to caller the route that ip_output()/ip6_output()
finds. In the latter case skipping FLOWTABLE lookup is pessimisation.
The difference between struct route filled by FLOWTABLE and filled
by rtalloc() family is that the former doesn't hold a reference on
its rtentry. Reference is hold by flow entry, and it is about to
be released in future. Thus, route filled by FLOWTABLE shouldn't
be passed to RTFREE() macro.
- Introduce new flag for struct route/route_in6, that marks route
not holding a reference on rtentry.
- Introduce new macro RO_RTFREE() that cleans up a struct route
depending on its kind.
- All callers to ip_output()/ip6_output() that do supply non-NULL
but empty route should use RO_RTFREE() to free results of
lookup.
- ip_output()/ip6_output() now do FLOWTABLE lookup always when
ro->ro_rt == NULL.
Tested by: tuexen (SCTP part)
packets a cmsg of type IP_RECVTOS which contains the TOS byte.
Much like IP_RECVTTL does for TTL. This allows to implement a
protocol on top of UDP and implementing ECN.
MFC after: 3 days
- Remove ia_net, ia_netmask, ia_netbroadcast from struct in_ifaddr.
- Remove net.inet.ip.subnetsarelocal, I bet no one need it in 2011.
- fix bug when we were not forwarding to a host which matches classful
net address. For example router having 192.168.x.y/16 network attached,
would not forward traffic to 192.168.*.0, which are legal IPs in
CIDR world.
- For compatibility, leave autoguessing of mask based on class.
Reviewed by: andre, bz, rwatson
Move ip_defttl to raw_ip.c where it is actually used. In an IPv6
only world we do not want to compile ip_input.c in for that and
it is a shared default with INET6.
Reviewed by: gnn
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
MFC after: 4 days
Move the ipport_tick_callout and related functions from ip_input.c
to in_pcb.c. The random source port allocation code has been merged
and is now local to in_pcb.c only.
Use a SYSINIT to get the callout started and no longer depend on
initialization from the inet code, which would not work in an IPv6
only setup.
Reviewed by: gnn
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
MFC after: 4 days
Move fw_one_pass to where it belongs: it is a property of ipfw,
not of ip_input.
Reviewed by: gnn
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
MFC after: 3 days
DPCPU_DEFINE and VNET_DEFINE macros, as these cause problems for various
people working on the affected files. A better long-term solution is
still being considered. This reversal may give some modules empty
set_pcpu or set_vnet sections, but these are harmless.
Changes reverted:
------------------------------------------------------------------------
r215318 | dim | 2010-11-14 21:40:55 +0100 (Sun, 14 Nov 2010) | 4 lines
Instead of unconditionally emitting .globl's for the __start_set_xxx and
__stop_set_xxx symbols, only emit them when the set_vnet or set_pcpu
sections are actually defined.
------------------------------------------------------------------------
r215317 | dim | 2010-11-14 21:38:11 +0100 (Sun, 14 Nov 2010) | 3 lines
Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout
the tree.
------------------------------------------------------------------------
r215316 | dim | 2010-11-14 21:23:02 +0100 (Sun, 14 Nov 2010) | 2 lines
Add macros to define static instances of VNET_DEFINE and DPCPU_DEFINE.
In protosw we define pr_protocol as short, while on the wire
it is an uint8_t. That way we can have "internal" protocols
like DIVERT, SEND or gaps for modules (PROTO_SPACER).
Switch ipproto_{un,}register to accept a short protocol number(*)
and do an upfront check for valid boundries. With this we
also consistently report EPROTONOSUPPORT for out of bounds
protocols, as we did for proto == 0. This allows a caller
to not error for this case, which is especially important
if we want to automatically call these from domain handling.
(*) the functions have been without any in-tree consumer
since the initial introducation, so this is considered save.
Implement ip6proto_{un,}register() similarly to their legacy IP
counter parts to allow modules to hook up dynamically.
Reviewed by: philip, will
MFC after: 1 week
bridge(4), lagg(4) etc. and make use of function pointers and
pf_proto_register() to hook carp into the network stack.
Currently, because of the uncertainty about whether the unload path is free
of race condition panics, unloads are disallowed by default. Compiling with
CARPMOD_CAN_UNLOAD in CFLAGS removes this anti foot shooting measure.
This commit requires IP6PROTOSPACER, introduced in r211115.
Reviewed by: bz, simon
Approved by: ken (mentor)
MFC after: 2 weeks
"Whitspace" churn after the VIMAGE/VNET whirls.
Remove the need for some "init" functions within the network
stack, like pim6_init(), icmp_init() or significantly shorten
others like ip6_init() and nd6_init(), using static initialization
again where possible and formerly missed.
Move (most) variables back to the place they used to be before the
container structs and VIMAGE_GLOABLS (before r185088) and try to
reduce the diff to stable/7 and earlier as good as possible,
to help out-of-tree consumers to update from 6.x or 7.x to 8 or 9.
This also removes some header file pollution for putatively
static global variables.
Revert VIMAGE specific changes in ipfilter::ip_auth.c, that are
no longer needed.
Reviewed by: jhb
Discussed with: rwatson
Sponsored by: The FreeBSD Foundation
Sponsored by: CK Software GmbH
MFC after: 6 days
- increase flow cleaning frequency and decrease flow caching time
when near the flow limit
- stop allocating new flows when within 3% of maxflows don't start
allocating again until below 12.5%
MFC after: 7 days
- add a name argument to flowtable_alloc for printing with ddb commands
- extend ddb commands to print destination address or 4-tuples
- don't parse ports in ulp header if FL_HASH_ALL is not passed
- add kern_flowtable_insert to enable more generic use of flowtable
(e.g. system calls for adding entries)
- don't hash loopback addresses
- cleanup whitespace
- keep statistics per-cpu for per-cpu flowtables to avoid cache line contention
- add sysctls to accumulate stats and report aggregate
MFC after: 7 days
a "locked" version that will only handle a single network stack
instance. The latter is called directly from ip_destroy().
Hook up an ip_destroy() function to release resources from the
legacy IP network layer upon virtual network stack teardown.
Sponsored by: ISPsystem
Reviewed by: rwatson
MFC After: 5 days
packet filters. ALso allows ipfw to be enabled on on ejail and disabled
on another. In 8.0 it's a global setting.
Sitting aroung in tree waiting to commit for: 2 months
MFC after: 2 months
all pertinent statatistics for the subsystem. These structures are
sometimes "borrowed" by kernel modules that require a place to store
statistics for similar events.
Add KPI accessor functions for statistics structures referenced by kernel
modules so that they no longer encode certain specifics of how the data
structures are named and stored. This change is intended to make it
easier to move to per-CPU network stats following 8.0-RELEASE.
The following modules are affected by this change:
if_bridge
if_cxgb
if_gif
ip_mroute
ipdivert
pf
In practice, most of these statistics consumers should, in fact, maintain
their own statistics data structures rather than borrowing structures
from the base network stack. However, that change is too agressive for
this point in the release cycle.
Reviewed by: bz
Approved by: re (kib)
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks. Minor cleanups are done in the process,
and comments updated to reflect these changes.
Reviewed by: bz
Approved by: re (vimage blanket)
back to the bottom of ip_init() as found in 7.x. I missed the fact that
the bottom half of the init routine only runs in the !VNET case.
Submitted by: zec
Approved by: re (vimage blanket)
nor destructors, as there's no actual work to do.
In most cases, the constructors weren't needed because of the existing
protocol initialization functions run by net_init_domain() as part of
VNET_MOD_NET, or they were eliminated when support for static
initialization of virtualized globals was added.
Garbage collect dependency references to modules without constructors or
destructors, notably VNET_MOD_INET and VNET_MOD_INET6.
Reviewed by: bz
Approved by: re (vimage blanket)
unused custom mutex/condvar-based sleep locks with two locks: an
rwlock (for non-sleeping use) and sxlock (for sleeping use). Either
acquired for read is sufficient to stabilize the vnet list, but both
must be acquired for write to modify the list.
Replace previous no-op read locking macros, used in various places
in the stack, with actual locking to prevent race conditions. Callers
must declare when they may perform unbounded sleeps or not when
selecting how to lock.
Refactor vnet sysinits so that the vnet list and locks are initialized
before kernel modules are linked, as the kernel linker will use them
for modules loaded by the boot loader.
Update various consumers of these KPIs based on whether they may sleep
or not.
Reviewed by: bz
Approved by: re (kib)