reside on their own cache line to prevent false sharing with other
nearby structures, especially for those in the .bss segment.
NB: Those mutexes and rwlocks with variables next to them that get
changed on every invocation do not benefit from their own cache line.
Actually it may be net negative because two cache misses would be
incurred in those cases.
Convert 'struct ipstat' and 'struct tcpstat' to counter(9).
This speeds up IP forwarding at extreme packet rates, and
makes accounting more precise.
Sponsored by: Nginx, Inc.
- fix indentation
- put the operator at the end of the line for long statements
- remove spaces between the type and the variable in a cast
- remove excessive parentheses
Tested by: md5
Instead, add protocol specific mbuf flags M_IP_NEXTHOP and
M_IP6_NEXTHOP. Use them to indicate that the mbuf's chain
contains the PACKET_TAG_IPFORWARD tag. And do a tag lookup
only when this flag is set.
Suggested by: andre
on the related functionality in the runtime via the sysctl variable
net.pfil.forward. It is turned off by default.
Sponsored by: Yandex LLC
Discussed with: net@
MFC after: 2 weeks
before passing a packet to protocol input routines.
For several protocols this mean that now protocol needs to
do subtraction itself, and for another half this means that
we do not need to add header length back to the packet.
Make ip_stripoptions() to adjust ip_len, since now we enter
this function with a packet header whose ip_len does represent
length of entire packet, not payload only.
in network byte order. Any host byte order processing is
done in local variables and host byte order values are
never[1] written to a packet.
After this change a packet processed by the stack isn't
modified at all[2] except for TTL.
After this change a network stack hacker doesn't need to
scratch his head trying to figure out what is the byte order
at the given place in the stack.
[1] One exception still remains. The raw sockets convert host
byte order before pass a packet to an application. Probably
this would remain for ages for compatibility.
[2] The ip_input() still subtructs header len from ip->ip_len,
but this is planned to be fixed soon.
Reviewed by: luigi, Maxim Dounin <mdounin mdounin.ru>
Tested by: ray, Olivier Cochard-Labbe <olivier cochard.me>
- All packets in NETISR_IP queue are in net byte order.
- ip_input() is entered in net byte order and converts packet
to host byte order right _after_ processing pfil(9) hooks.
- ip_output() is entered in host byte order and converts packet
to net byte order right _before_ processing pfil(9) hooks.
- ip_fragment() accepts and emits packet in net byte order.
- ip_forward(), ip_mloopback() use host byte order (untouched actually).
- ip_fastforward() no longer modifies packet at all (except ip_ttl).
- Swapping of byte order there and back removed from the following modules:
pf(4), ipfw(4), enc(4), if_bridge(4).
- Swapping of byte order added to ipfilter(4), based on __FreeBSD_version
- __FreeBSD_version bumped.
- pfil(9) manual page updated.
Reviewed by: ray, luigi, eri, melifaro
Tested by: glebius (LE), ray (BE)
it skips FLOWTABLE lookup. However, the non-NULL ro has dual meaning
here: it may be supplied to provide route, and it may be supplied to
store and return to caller the route that ip_output()/ip6_output()
finds. In the latter case skipping FLOWTABLE lookup is pessimisation.
The difference between struct route filled by FLOWTABLE and filled
by rtalloc() family is that the former doesn't hold a reference on
its rtentry. Reference is hold by flow entry, and it is about to
be released in future. Thus, route filled by FLOWTABLE shouldn't
be passed to RTFREE() macro.
- Introduce new flag for struct route/route_in6, that marks route
not holding a reference on rtentry.
- Introduce new macro RO_RTFREE() that cleans up a struct route
depending on its kind.
- All callers to ip_output()/ip6_output() that do supply non-NULL
but empty route should use RO_RTFREE() to free results of
lookup.
- ip_output()/ip6_output() now do FLOWTABLE lookup always when
ro->ro_rt == NULL.
Tested by: tuexen (SCTP part)
packets a cmsg of type IP_RECVTOS which contains the TOS byte.
Much like IP_RECVTTL does for TTL. This allows to implement a
protocol on top of UDP and implementing ECN.
MFC after: 3 days
- Remove ia_net, ia_netmask, ia_netbroadcast from struct in_ifaddr.
- Remove net.inet.ip.subnetsarelocal, I bet no one need it in 2011.
- fix bug when we were not forwarding to a host which matches classful
net address. For example router having 192.168.x.y/16 network attached,
would not forward traffic to 192.168.*.0, which are legal IPs in
CIDR world.
- For compatibility, leave autoguessing of mask based on class.
Reviewed by: andre, bz, rwatson
Move ip_defttl to raw_ip.c where it is actually used. In an IPv6
only world we do not want to compile ip_input.c in for that and
it is a shared default with INET6.
Reviewed by: gnn
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
MFC after: 4 days
Move the ipport_tick_callout and related functions from ip_input.c
to in_pcb.c. The random source port allocation code has been merged
and is now local to in_pcb.c only.
Use a SYSINIT to get the callout started and no longer depend on
initialization from the inet code, which would not work in an IPv6
only setup.
Reviewed by: gnn
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
MFC after: 4 days
Move fw_one_pass to where it belongs: it is a property of ipfw,
not of ip_input.
Reviewed by: gnn
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
MFC after: 3 days
DPCPU_DEFINE and VNET_DEFINE macros, as these cause problems for various
people working on the affected files. A better long-term solution is
still being considered. This reversal may give some modules empty
set_pcpu or set_vnet sections, but these are harmless.
Changes reverted:
------------------------------------------------------------------------
r215318 | dim | 2010-11-14 21:40:55 +0100 (Sun, 14 Nov 2010) | 4 lines
Instead of unconditionally emitting .globl's for the __start_set_xxx and
__stop_set_xxx symbols, only emit them when the set_vnet or set_pcpu
sections are actually defined.
------------------------------------------------------------------------
r215317 | dim | 2010-11-14 21:38:11 +0100 (Sun, 14 Nov 2010) | 3 lines
Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout
the tree.
------------------------------------------------------------------------
r215316 | dim | 2010-11-14 21:23:02 +0100 (Sun, 14 Nov 2010) | 2 lines
Add macros to define static instances of VNET_DEFINE and DPCPU_DEFINE.
In protosw we define pr_protocol as short, while on the wire
it is an uint8_t. That way we can have "internal" protocols
like DIVERT, SEND or gaps for modules (PROTO_SPACER).
Switch ipproto_{un,}register to accept a short protocol number(*)
and do an upfront check for valid boundries. With this we
also consistently report EPROTONOSUPPORT for out of bounds
protocols, as we did for proto == 0. This allows a caller
to not error for this case, which is especially important
if we want to automatically call these from domain handling.
(*) the functions have been without any in-tree consumer
since the initial introducation, so this is considered save.
Implement ip6proto_{un,}register() similarly to their legacy IP
counter parts to allow modules to hook up dynamically.
Reviewed by: philip, will
MFC after: 1 week
bridge(4), lagg(4) etc. and make use of function pointers and
pf_proto_register() to hook carp into the network stack.
Currently, because of the uncertainty about whether the unload path is free
of race condition panics, unloads are disallowed by default. Compiling with
CARPMOD_CAN_UNLOAD in CFLAGS removes this anti foot shooting measure.
This commit requires IP6PROTOSPACER, introduced in r211115.
Reviewed by: bz, simon
Approved by: ken (mentor)
MFC after: 2 weeks
"Whitspace" churn after the VIMAGE/VNET whirls.
Remove the need for some "init" functions within the network
stack, like pim6_init(), icmp_init() or significantly shorten
others like ip6_init() and nd6_init(), using static initialization
again where possible and formerly missed.
Move (most) variables back to the place they used to be before the
container structs and VIMAGE_GLOABLS (before r185088) and try to
reduce the diff to stable/7 and earlier as good as possible,
to help out-of-tree consumers to update from 6.x or 7.x to 8 or 9.
This also removes some header file pollution for putatively
static global variables.
Revert VIMAGE specific changes in ipfilter::ip_auth.c, that are
no longer needed.
Reviewed by: jhb
Discussed with: rwatson
Sponsored by: The FreeBSD Foundation
Sponsored by: CK Software GmbH
MFC after: 6 days
- increase flow cleaning frequency and decrease flow caching time
when near the flow limit
- stop allocating new flows when within 3% of maxflows don't start
allocating again until below 12.5%
MFC after: 7 days
- add a name argument to flowtable_alloc for printing with ddb commands
- extend ddb commands to print destination address or 4-tuples
- don't parse ports in ulp header if FL_HASH_ALL is not passed
- add kern_flowtable_insert to enable more generic use of flowtable
(e.g. system calls for adding entries)
- don't hash loopback addresses
- cleanup whitespace
- keep statistics per-cpu for per-cpu flowtables to avoid cache line contention
- add sysctls to accumulate stats and report aggregate
MFC after: 7 days
a "locked" version that will only handle a single network stack
instance. The latter is called directly from ip_destroy().
Hook up an ip_destroy() function to release resources from the
legacy IP network layer upon virtual network stack teardown.
Sponsored by: ISPsystem
Reviewed by: rwatson
MFC After: 5 days
packet filters. ALso allows ipfw to be enabled on on ejail and disabled
on another. In 8.0 it's a global setting.
Sitting aroung in tree waiting to commit for: 2 months
MFC after: 2 months
all pertinent statatistics for the subsystem. These structures are
sometimes "borrowed" by kernel modules that require a place to store
statistics for similar events.
Add KPI accessor functions for statistics structures referenced by kernel
modules so that they no longer encode certain specifics of how the data
structures are named and stored. This change is intended to make it
easier to move to per-CPU network stats following 8.0-RELEASE.
The following modules are affected by this change:
if_bridge
if_cxgb
if_gif
ip_mroute
ipdivert
pf
In practice, most of these statistics consumers should, in fact, maintain
their own statistics data structures rather than borrowing structures
from the base network stack. However, that change is too agressive for
this point in the release cycle.
Reviewed by: bz
Approved by: re (kib)
vnet.h, we now use jails (rather than vimages) as the abstraction
for virtualization management, and what remained was specific to
virtual network stacks. Minor cleanups are done in the process,
and comments updated to reflect these changes.
Reviewed by: bz
Approved by: re (vimage blanket)
back to the bottom of ip_init() as found in 7.x. I missed the fact that
the bottom half of the init routine only runs in the !VNET case.
Submitted by: zec
Approved by: re (vimage blanket)
nor destructors, as there's no actual work to do.
In most cases, the constructors weren't needed because of the existing
protocol initialization functions run by net_init_domain() as part of
VNET_MOD_NET, or they were eliminated when support for static
initialization of virtualized globals was added.
Garbage collect dependency references to modules without constructors or
destructors, notably VNET_MOD_INET and VNET_MOD_INET6.
Reviewed by: bz
Approved by: re (vimage blanket)
unused custom mutex/condvar-based sleep locks with two locks: an
rwlock (for non-sleeping use) and sxlock (for sleeping use). Either
acquired for read is sufficient to stabilize the vnet list, but both
must be acquired for write to modify the list.
Replace previous no-op read locking macros, used in various places
in the stack, with actual locking to prevent race conditions. Callers
must declare when they may perform unbounded sleeps or not when
selecting how to lock.
Refactor vnet sysinits so that the vnet list and locks are initialized
before kernel modules are linked, as the kernel linker will use them
for modules loaded by the boot loader.
Update various consumers of these KPIs based on whether they may sleep
or not.
Reviewed by: bz
Approved by: re (kib)
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator. Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...). This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.
Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack. Virtualized global variables are
tagged at compile-time, placing the in a special linker set, which is
loaded into a contiguous region of kernel memory. Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of a the kernel linker.
Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy. Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address. When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.
This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.
Bump __FreeBSD_version and update UPDATING.
Portions submitted by: bz
Reviewed by: bz, zec
Discussed with: gnn, jamie, jeff, jhb, julian, sam
Suggested by: peter
Approved by: re (kensmith)
in_ifaddrhead and INADDR_HASH address lists.
Previously, these lists were used unsynchronized as they were effectively
never changed in steady state, but we've seen increasing reports of
writer-writer races on very busy VPN servers as core count has gone up
(and similar configurations where address lists change frequently and
concurrently).
For the time being, use rwlocks rather than rmlocks in order to take
advantage of their better lock debugging support. As a result, we don't
enable ip_input()'s read-locking of INADDR_HASH until an rmlock conversion
is complete and a performance analysis has been done. This means that one
class of reader-writer races still exists.
MFC after: 6 weeks
Reviewed by: bz
'ifa' was used as the TAILQ_FOREACH() iterator argument, and 'ia' was just
derived form it, it could be left non-NULL which confused later
conditional freeing code. This could cause kernel panics if multicast IP
packets were received. [1]
Call 'struct in_ifaddr *' in ip_rtaddr() 'ia', not 'ifa' in keeping with
normal conventions.
When 'ipstealth' is enabled returns from ip_input early, properly release
the 'ia' reference.
Reported by: lstewart, sam [1]
MFC after: 6 weeks
rather than pointers, requiring callers to properly dispose of those
references. The following routines now return references:
ifaddr_byindex
ifa_ifwithaddr
ifa_ifwithbroadaddr
ifa_ifwithdstaddr
ifa_ifwithnet
ifaof_ifpforaddr
ifa_ifwithroute
ifa_ifwithroute_fib
rt_getifa
rt_getifa_fib
IFP_TO_IA
ip_rtaddr
in6_ifawithifp
in6ifa_ifpforlinklocal
in6ifa_ifpwithaddr
in6_ifadd
carp_iamatch6
ip6_getdstifaddr
Remove unused macro which didn't have required referencing:
IFP_TO_IA6
This closes many small races in which changes to interface
or address lists while an ifaddr was in use could lead to use of freed
memory (etc). In a few cases, add missing if_addr_list locking
required to safely acquire references.
Because of a lack of deep copying support, we accept a race in which
an in6_ifaddr pointed to by mbuf tags and extracted with
ip6_getdstifaddr() doesn't hold a reference while in transmit. Once
we have mbuf tag deep copy support, this can be fixed.
Reviewed by: bz
Obtained from: Apple, Inc. (portions)
MFC after: 6 weeks (portions)
This change should make options VIMAGE kernel builds usable again,
to some extent at least.
Note that the size of struct vnet_inet has changed, though in
accordance with one-bump-per-day policy we didn't update the
__FreeBSD_version number, given that it has already been touched
by r194640 a few hours ago.
Reviewed by: bz
Approved by: julian (mentor)
actual implementation.
Remove the accessor functions for the compiled out case, just returning
"unavail" values. Remove the kernel conditional from the header file as
it is no longer needed, only leaving the externs.
Hide the improperly virtualized SYSCTL/TUNABLE for the flowtable size
under the kernel option as well.
Reviewed by: rwatson
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.
Discussed with: pjd
+ move ipfw and dummynet hooks declarations to raw_ip.c (definitions
in ip_var.h) same as for most other global variables.
This removes some dependencies from ip_input.c;
+ remove the IPFW_LOADED macro, just test ip_fw_chk_ptr directly;
+ remove the DUMMYNET_LOADED macro, just test ip_dn_io_ptr directly;
+ move ip_dn_ruledel_ptr to ip_fw2.c which is the only file using it;
To be merged together with rev 193497
MFC after: 5 days