mp_maxid or CPU_FOREACH() as appropriate. This fixes a number of places in
the kernel that assumed CPU IDs are dense in [0, mp_ncpus) and would try,
for example, to run tasks on CPUs that did not exist or to allocate too
few buffers on systems with sparse CPU IDs in which there are holes in the
range and mp_maxid > mp_ncpus. Such circumstances generally occur on
systems with SMT, but on which SMT is disabled. This patch restores system
operation at least on POWER8 systems configured in this way.
There are a number of other places in the kernel with potential problems
in these situations, but where sparse CPU IDs are not currently known
to occur, mostly in the ARM machine-dependent code. These will be fixed
in a follow-up commit after the stable/11 branch.
PR: kern/210106
Reviewed by: jhb
Approved by: re (glebius)
but removed due to other changes in the system. Restore the llentry pointer
to the "struct route", and use it to cache the L2 lookup (ARP or ND6) as
appropriate.
Submitted by: Mike Karels
Differential Revision: https://reviews.freebsd.org/D6262
Two new functions are provided, bit_ffs_at() and bit_ffc_at(), which allow
for efficient searching of set or cleared bits starting from any bit offset
within the bit string.
Performance is improved by operating on longs instead of bytes and using
ffsl() for searches within a long. ffsl() is a compiler builtin in both
clang and gcc for most architectures, converting what was a brute force
while loop search into a couple of instructions.
All of the bitstring(3) API continues to be contained in the header file.
Some of the functions are large enough that perhaps they should be uninlined
and moved to a library, but that is beyond the scope of this commit.
sys/sys/bitstring.h:
Convert the majority of the existing bit string implementation from
macros to inline functions.
Properly protect the implementation from inadvertant macro expansion
when included in a user's program by prefixing all private
macros/functions and local variables with '_'.
Add bit_ffs_at() and bit_ffc_at(). Implement bit_ffs() and
bit_ffc() in terms of their "at" counterparts.
Provide a kernel implementation of bit_alloc(), making the full API
usable in the kernel.
Improve code documenation.
share/man/man3/bitstring.3:
Add pre-exisiting API bit_ffc() to the synopsis.
Document new APIs.
Document the initialization state of the bit strings
allocated/declared by bit_alloc() and bit_decl().
Correct documentation for bitstr_size(). The original code comments
indicate the size is in bytes, not "elements of bitstr_t". The new
implementation follows this lead. Only hastd assumed "elements"
rather than bytes and it has been corrected.
etc/mtree/BSD.tests.dist:
tests/sys/Makefile:
tests/sys/sys/Makefile:
tests/sys/sys/bitstring.c:
Add tests for all existing and new functionality.
include/bitstring.h
Include all headers needed by sys/bitstring.h
lib/libbluetooth/bluetooth.h:
usr.sbin/bluetooth/hccontrol/le.c:
Include bitstring.h instead of sys/bitstring.h.
sbin/hastd/activemap.c:
Correct usage of bitstr_size().
sys/dev/xen/blkback/blkback.c
Use new bit_alloc.
sys/kern/subr_unit.c:
Remove hard-coded assumption that sizeof(bitstr_t) is 1. Get rid of
unrb.busy, which caches the number of bits set in unrb.map. When
INVARIANTS are disabled, nothing needs to know that information.
callapse_unr can be adapted to use bit_ffs and bit_ffc instead.
Eliminating unrb.busy saves memory, simplifies the code, and
provides a slight speedup when INVARIANTS are disabled.
sys/net/flowtable.c:
Use the new kernel implementation of bit-alloc, instead of hacking
the old libc-dependent macro.
sys/sys/param.h
Update __FreeBSD_version to indicate availability of new API
Submitted by: gibbs, asomers
Reviewed by: gibbs, ngie
MFC after: 4 weeks
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D6004
Add if_requestencap() interface method which is capable of calculating
various link headers for given interface. Right now there is support
for INET/INET6/ARP llheader calculation (IFENCAP_LL type request).
Other types are planned to support more complex calculation
(L2 multipath lagg nexthops, tunnel encap nexthops, etc..).
Reshape 'struct route' to be able to pass additional data (with is length)
to prepend to mbuf.
These two changes permits routing code to pass pre-calculated nexthop data
(like L2 header for route w/gateway) down to the stack eliminating the
need for other lookups. It also brings us closer to more complex scenarios
like transparently handling MPLS nexthops and tunnel interfaces.
Last, but not least, it removes layering violation introduced by flowtable
code (ro_lle) and simplifies handling of existing if_output consumers.
ARP/ND changes:
Make arp/ndp stack pre-calculate link header upon installing/updating lle
record. Interface link address change are handled by re-calculating
headers for all lles based on if_lladdr event. After these changes,
arpresolve()/nd6_resolve() returns full pre-calculated header for
supported interfaces thus simplifying if_output().
Move these lookups to separate ether_resolve_addr() function which ether
returs error or fully-prepared link header. Add <arp|nd6_>resolve_addr()
compat versions to return link addresses instead of pre-calculated data.
BPF changes:
Raw bpf writes occupied _two_ cases: AF_UNSPEC and pseudo_AF_HDRCMPLT.
Despite the naming, both of there have ther header "complete". The only
difference is that interface source mac has to be filled by OS for
AF_UNSPEC (controlled via BIOCGHDRCMPLT). This logic has to stay inside
BPF and not pollute if_output() routines. Convert BPF to pass prepend data
via new 'struct route' mechanism. Note that it does not change
non-optimized if_output(): ro_prepend handling is purely optional.
Side note: hackish pseudo_AF_HDRCMPLT is supported for ethernet and FDDI.
It is not needed for ethernet anymore. The only remaining FDDI user is
dev/pdq mostly untouched since 2007. FDDI support was eliminated from
OpenBSD in 2013 (sys/net/if_fddisubr.c rev 1.65).
Flowtable changes:
Flowtable violates layering by saving (and not correctly managing)
rtes/lles. Instead of passing lle pointer, pass pointer to pre-calculated
header data from that lle.
Differential Revision: https://reviews.freebsd.org/D4102
from the FreeBSD network code. The flag is still kept around in the
"sys/mbuf.h" header file, but does no longer have any users. Instead
the "m_pkthdr.rsstype" field in the mbuf structure is now used to
decide the meaning of the "m_pkthdr.flowid" field. To modify the
"m_pkthdr.rsstype" field please use the existing "M_HASHTYPE_XXX"
macros as defined in the "sys/mbuf.h" header file.
This patch introduces new behaviour in the transmit direction.
Previously network drivers checked if "M_FLOWID" was set in "m_flags"
before using the "m_pkthdr.flowid" field. This check has now now been
replaced by checking if "M_HASHTYPE_GET(m)" is different from
"M_HASHTYPE_NONE". In the future more hashtypes will be added, for
example hashtypes for hardware dedicated flows.
"M_HASHTYPE_OPAQUE" indicates that the "m_pkthdr.flowid" value is
valid and has no particular type. This change removes the need for an
"if" statement in TCP transmit code checking for the presence of a
valid flowid value. The "if" statement mentioned above is now a direct
variable assignment which is then later checked by the respective
network drivers like before.
Additional notes:
- The SCTP code changes will be committed as a separate patch.
- Removal of the "M_FLOWID" flag will also be done separately.
- The FreeBSD version has been bumped.
MFC after: 1 month
Sponsored by: Mellanox Technologies
tested and is unfinished. However, I've tested my version,
it works okay. As before it is unfinished: timeout aren't
driven by TCP session state. To enable the HASH_ALL mode,
one needs in kernel config:
options FLOWTABLE_HASH_ALL
o Reduce the alignment on flentry to 64 bytes. Without
the FLOWTABLE_HASH_ALL option, twice less memory would
be consumed by flows.
o API to ip_output()/ip6_output() got even more thin: 1 liner.
o Remove unused unions. Simply use fle->f_key[].
o Merge all IPv4 code into flowtable_lookup_ipv4(), and do same
flowtable_lookup_ipv6(). Stop copying data to on stack
sockaddr structures, simply use key[] on stack.
o Move code from flowtable_lookup_common() that actually works
on insertion into flowtable_insert().
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
insert flow entry. During the route lookup the critical section is
exited. It may happen, that after route lookup we will be executed
on an other CPU that already has such flowentry. Before this change
we simply freed the flowentry and returned to ip_output() with
failure.
Actually there is nothing wrong with using previously allocated
flow entry, updating it properly. Thus, make flowentry_insert()
return the new either old fle, and make use of it.
Count reuses as "collisions" and real inserts as "inserts".
Reviewed by: adrian
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
Some of the collisions that are occuring are due to flowtable lookups
that succeed but have an invalid lle - typically because the L2 adjacency
lookup hasn't completed. This would lead to a follow-up insert which
would then fail (ie, collision) and the code would fall through to doing
a slow-path L2/L3 lookup in the netinet/netinet6 code.
This patch simply aborts storing a new flowtable entry if the lle isn't
yet valid.
Whilst I'm here, add a new pcpu counter for the item so the number of
failures can be tracked separately from generic "collisions."
Reviewed by: glebius
MFC after: 10 days
Sponsored by: Netflix, Inc.
and probably is a leftover from first prototyping by Kip. The
non-pcpu implementation used mutexes, so it doubtfully worked
better than simple routing lookup.
o Use UMA_ZONE_PCPU zone for pointers instead of [MAXCPU] arrays,
use zpcpu_get() to access data in there.
o Substitute own single list implementation with SLIST(). This
has two functional side effects:
- new flows go into head of a list, before they went to tail.
- a bug when incorrect flow was deleted in flow cleaner is
fixed.
o Due to cache line alignment, there is no reason to keep
different zones for IPv4 and IPv6 flows. Both consume one
cache line, real size of allocation is equal.
o Rely on that f_hash, f_rt, f_lle are stable during fle
lifetime, remove useless volatile quilifiers.
o More INET/INET6 splitting.
Reviewed by: adrian
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
- ip_output() and ip_output6() simply call flowtable_lookup(),
passing mbuf and address family. That's the only code under
#ifdef FLOWTABLE in the protocols code now.
o Revamp statistics gathering and export.
- Remove hand made pcpu stats, and utilize counter(9).
- Snapshot of statistics is available via 'netstat -rs'.
- All sysctls are moved into net.flowtable namespace, since
spreading them over net.inet isn't correct.
o Properly separate at compile time INET and INET6 parts.
o General cleanup.
- Remove chain of multiple flowtables. We simply have one for
IPv4 and one for IPv6.
- Flowtables are allocated in flowtable.c, symbols are static.
- With proper argument to SYSINIT() we no longer need flowtable_ready.
- Hash salt doesn't need to be per-VNET.
- Removed rudimentary debugging, which use quite useless in dtrace era.
The runtime behavior of flowtable shouldn't be changed by this commit.
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
- Provide missing function that can do hashing of arbitrary sized buffer.
- Refetch lookup3.c and do only minimal edits to it, so that diff between
our jenkins_hash.c and lookup3.c is minimal.
- Add declarations for jenkins_hash(), jenkins_hash32() to sys/hash.h.
- Document these functions in hash(9)
Obtained from: http://burtleburtle.net/bob/c/lookup3.c
it skips FLOWTABLE lookup. However, the non-NULL ro has dual meaning
here: it may be supplied to provide route, and it may be supplied to
store and return to caller the route that ip_output()/ip6_output()
finds. In the latter case skipping FLOWTABLE lookup is pessimisation.
The difference between struct route filled by FLOWTABLE and filled
by rtalloc() family is that the former doesn't hold a reference on
its rtentry. Reference is hold by flow entry, and it is about to
be released in future. Thus, route filled by FLOWTABLE shouldn't
be passed to RTFREE() macro.
- Introduce new flag for struct route/route_in6, that marks route
not holding a reference on rtentry.
- Introduce new macro RO_RTFREE() that cleans up a struct route
depending on its kind.
- All callers to ip_output()/ip6_output() that do supply non-NULL
but empty route should use RO_RTFREE() to free results of
lookup.
- ip_output()/ip6_output() now do FLOWTABLE lookup always when
ro->ro_rt == NULL.
Tested by: tuexen (SCTP part)
While doing so, for consistency with the rtalloc_ign_fib(9) interface
called, remove the "in_" prefix from rtalloc_ign_wrapper() no longer
indicating that it would only handle the INET case.
Sponsored by: Cisco Systems, Inc.
referenced within its timeout window. This change clears the LLE_VALID flag when an llentry
is removed from an interface's hash table and adds an extra check to the flowtable code
for the LLE_VALID flag in llentry to avoid retaining and using a stale reference.
Reviewed by: qingli@
MFC after: 2 weeks
The SYSCTL_NODE macro defines a list that stores all child-elements of
that node. If there's no SYSCTL_DECL macro anywhere else, there's no
reason why it shouldn't be static.
This was lost when it was converted to using a condition variable instead
of lbolt.
- Drop the priority of flowtable down to PPAUSE when it is idle as well
since it is a similar background task.
MFC after: 2 weeks
variable into two so that we can see on which one we are waiting.
This might also more properly propagate the update of the
flowclean_cycles flag and avoid "hangs" people were seeing.
Suggested by: rwatson [1]
Sponsored by: ISPsystem [1]
Reviewed by: julian [1]
Updated by: Mikolaj Golub (to.my.trociny gmail.com)
Tested by: Mikolaj Golub (to.my.trociny gmail.com)
MFC After: 1 week
[1] Early 2010, initial version.
DPCPU_DEFINE and VNET_DEFINE macros, as these cause problems for various
people working on the affected files. A better long-term solution is
still being considered. This reversal may give some modules empty
set_pcpu or set_vnet sections, but these are harmless.
Changes reverted:
------------------------------------------------------------------------
r215318 | dim | 2010-11-14 21:40:55 +0100 (Sun, 14 Nov 2010) | 4 lines
Instead of unconditionally emitting .globl's for the __start_set_xxx and
__stop_set_xxx symbols, only emit them when the set_vnet or set_pcpu
sections are actually defined.
------------------------------------------------------------------------
r215317 | dim | 2010-11-14 21:38:11 +0100 (Sun, 14 Nov 2010) | 3 lines
Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout
the tree.
------------------------------------------------------------------------
r215316 | dim | 2010-11-14 21:23:02 +0100 (Sun, 14 Nov 2010) | 2 lines
Add macros to define static instances of VNET_DEFINE and DPCPU_DEFINE.
- increase flow cleaning frequency and decrease flow caching time
when near the flow limit
- stop allocating new flows when within 3% of maxflows don't start
allocating again until below 12.5%
MFC after: 7 days
address as well as the transport protocol port information
from the outbound packets. The routing code is generic and
compares every byte in the given sockaddr object. Therefore
the temporary sockaddr objects must be cleared due to padding
bytes. In addition, the port information must be stripped
or the route search will either fail or return the incorrect
route entry.
Unit testing is done using OpenVPN over the if_tun interface.
MFC after: 7 days
- add a name argument to flowtable_alloc for printing with ddb commands
- extend ddb commands to print destination address or 4-tuples
- don't parse ports in ulp header if FL_HASH_ALL is not passed
- add kern_flowtable_insert to enable more generic use of flowtable
(e.g. system calls for adding entries)
- don't hash loopback addresses
- cleanup whitespace
- keep statistics per-cpu for per-cpu flowtables to avoid cache line contention
- add sysctls to accumulate stats and report aggregate
MFC after: 7 days
allow for connection load balancing across interfaces. Currently
the address alias handling method is colliding with the ECMP code.
For example, when two interfaces are configured on the same prefix,
only one prefix route is installed. So connection load balancing
among the available interfaces is not possible.
The other advantage of ECMP is for failover. The issue with the
current code, is that the interface link-state is not reflected
in the route entry. For example, if there are two interfaces on
the same prefix, the cable on one interface is unplugged, new and
existing connections should switch over to the other interface.
This is not done today and packets go into a black hole.
Also, there is a small bug in the kernel where deleting ECMP routes
in the userland will always return an error even though the command
is successfully executed.
MFC after: 5 days