Some applications forward most or all of the traffic received or
sent on a physical interface from/to the host rings. In these
cases it is desirable to have more than one pair of RX/TX host
rings, and to use multiple threads to speed up forwarding.
This change adds support for multiple host rings. On registering
a netmap port, the user can specify the number of desired transmit
and receive host rings in the nr_host_tx_rings and nr_host_rx_rings
fields of the nmreq_register structure.
MFC after: 2 weeks
At this point, all routes should've already been dropped by removing all
members from the bridge. This condition is in-fact KASSERT'd in the line
immediately above where this nop flush was added.
After r345180 we need to have the appropriate vnet context set to delete an
rtnode in bridge_rtnode_destroy().
That's usually the case, but not when it's called by the STP code (through
bstp_notify_rtage()).
We have to set the vnet context in bridge_rtable_expire() just as we do in the
other STP callback bridge_state_change().
Reviewed by: kevans
bridge_rtnode_zone still has outstanding allocations at the time of
destruction in the current model because all of the interface teardown
happens in a VNET_SYSUNINIT, -after- the MOD_UNLOAD has already been
processed. The SYSUNINIT triggers destruction of the interfaces, which then
attempts to free the memory from the zone that's already been destroyed, and
we hit a panic.
Solve this by virtualizing the uma_zone we allocate the rtnodes from to fix
the ordering. bridge_rtable_fini should also take care to flush any
remaining routes that weren't taken care of when dynamic routes were flushed
in bridge_stop.
Reviewed by: kp
Differential Revision: https://reviews.freebsd.org/D19578
If the spanning tree root interface is removed from the bridge we panic
on the next 'ifconfig'.
While the STP code is notified whenever a bridge member interface is
removed from the bridge it does not clear the bs_root_port. This means
bs_root_port can still point at a bridge_iflist which has been free()d.
The next access to it will panic.
Explicitly check if the interface we're removing in bstp_destroy() is
the root, and if so re-assign the roles, which clears bs_root_port.
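
The fix can be pictured with a minimal userland sketch. The structures
and function names below are simplified, hypothetical stand-ins (they are
not the kernel's bstp structures); the only point shown is that the root
pointer is cleared before the port it refers to is freed.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the STP state; not the kernel structs. */
struct port {
	int		 id;
	struct port	*next;
};

struct stp_state {
	struct port	*ports;		/* singly linked list of member ports */
	struct port	*root_port;	/* points into the list, or is NULL */
};

/* Re-run role selection; in this sketch that only means forgetting the root. */
static void
stp_assign_roles(struct stp_state *bs)
{
	bs->root_port = NULL;
}

/* Unlink and free a port, clearing root_port first if it refers to it. */
static void
stp_destroy_port(struct stp_state *bs, struct port *bp)
{
	struct port **pp;

	if (bs->root_port == bp)
		stp_assign_roles(bs);
	for (pp = &bs->ports; *pp != NULL; pp = &(*pp)->next) {
		if (*pp == bp) {
			*pp = bp->next;
			break;
		}
	}
	free(bp);
}

/* Allocate a port and push it onto the head of the list. */
static struct port *
stp_add_port(struct stp_state *bs, int id)
{
	struct port *bp = malloc(sizeof(*bp));

	if (bp == NULL)
		return (NULL);
	bp->id = id;
	bp->next = bs->ports;
	bs->ports = bp;
	return (bp);
}
```

Without the check at the top of stp_destroy_port(), root_port would keep
pointing at freed memory after the root is removed.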
Reviewed by: philip
MFC after: 2 weeks
The counters of pf tables are updated outside the rule lock. That means state
updates might overwrite each other. Furthermore, allocation and
freeing of counters happen outside the lock as well.
Use counter(9) for the counters, and always allocate the counter table
element, so that the race condition cannot happen any more.
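
As a rough illustration of why per-CPU counters avoid lost updates, here
is a userland model of the idea behind counter(9). It is not the
counter(9) API: the type, the function names and the fixed NCPU are
assumptions of this sketch.

```c
#include <stdint.h>

#define	NCPU	4	/* assumption: fixed CPU count for the sketch */

/* One slot per CPU, so concurrent updaters never share a memory location. */
struct pcpu_counter {
	uint64_t slot[NCPU];
};

/* Add to the slot owned by the calling CPU. */
static void
pcpu_counter_add(struct pcpu_counter *c, unsigned cpu, uint64_t inc)
{
	c->slot[cpu % NCPU] += inc;
}

/* Sum all slots to obtain the current value. */
static uint64_t
pcpu_counter_fetch(const struct pcpu_counter *c)
{
	uint64_t sum = 0;

	for (unsigned i = 0; i < NCPU; i++)
		sum += c->slot[i];
	return (sum);
}
```

Because each updater touches only its own slot, no lock is needed on the
update path; the cost moves to the reader, which has to sum the slots.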
PR: 230619
Submitted by: Kajetan Staszkiewicz <vegeta@tuxpowered.net>
Reviewed by: glebius
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D19558
This has the advantage of making the designated prefix obvious to sniff out
by eye, and it has all the right bits set. Comment stolen from ffec.
I've removed bryanv@'s pending question of using the FreeBSD OUI range --
no one has followed up on this with a definitive action, and there's no
particular reason to shoot for it and the administrative overhead that comes
with deciding exactly how to use it.
We currently have two places with identical fake hwaddr generation --
if_vxlan and if_bridge. Lift it into if_ethersubr for reuse in other
interfaces that may also need a fake addr.
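
A minimal sketch of what "all the right bits set" means for a fake
address: the first octet must have the group (multicast) bit clear and
the locally administered bit set. The function names and the ASCII "bsd"
prefix used here are assumptions of this sketch; the commit message does
not spell out the prefix. Random bytes are passed in by the caller so the
example stays deterministic.

```c
#include <stdint.h>
#include <string.h>

#define	ETHER_ADDR_LEN	6

/*
 * Fill addr with a fake unicast, locally administered address: a fixed
 * 3-byte prefix followed by 3 caller-supplied random bytes.
 * The "bsd" prefix is an assumption of this sketch.
 */
static void
fake_hwaddr(uint8_t addr[ETHER_ADDR_LEN], const uint8_t rnd[3])
{
	static const uint8_t prefix[3] = { 'b', 's', 'd' };

	memcpy(addr, prefix, 3);
	memcpy(addr + 3, rnd, 3);
}

/* Bit 0 of the first octet is the group (multicast) bit. */
static int
hwaddr_is_unicast(const uint8_t addr[ETHER_ADDR_LEN])
{
	return ((addr[0] & 0x01) == 0);
}

/* Bit 1 of the first octet marks a locally administered address. */
static int
hwaddr_is_local(const uint8_t addr[ETHER_ADDR_LEN])
{
	return ((addr[0] & 0x02) != 0);
}
```

With this prefix the first octet is 0x62, which satisfies both checks.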
Reviewed by: bryanv, kp, philip
Differential Revision: https://reviews.freebsd.org/D19573
Callers that set the PFIL_MEMPTR flag intentionally provide a memory
address that isn't aligned to pointer alignment. This is done to align
the IPv4 or IPv6 header that is expected to follow the Ethernet header.
When we return PFIL_REALLOCED we store a pointer to the allocated mbuf
at this address. With this change the KPI changes to store the pointer
at an aligned address, which usually means +2 bytes.
Provide two inlines:
pfil_packet_align() to get aligned pfil_packet_t for a misaligned one
pfil_mem2mbuf() to read out mbuf pointer from misaligned pfil_packet_t
Provide function pfil_realloc(), not used yet, that would convert a
memory pfil_packet_t to an mbuf one.
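
The alignment step can be sketched in userland as follows. The helpers
below are hypothetical stand-ins that only illustrate the rounding; they
are not pfil_packet_align() or pfil_mem2mbuf() themselves, and the
pointer stored is an arbitrary one rather than an mbuf.

```c
#include <stdint.h>

/* Round addr up to the next multiple of align (align must be a power of 2). */
static uintptr_t
align_up(uintptr_t addr, uintptr_t align)
{
	return ((addr + align - 1) & ~(align - 1));
}

/* Store a pointer at the first pointer-aligned address at or after mem. */
static void
store_ptr_aligned(void *mem, void *ptr)
{
	void **slot = (void **)align_up((uintptr_t)mem, sizeof(void *));

	*slot = ptr;
}

/* Read back the pointer stored by store_ptr_aligned(). */
static void *
load_ptr_aligned(void *mem)
{
	void **slot = (void **)align_up((uintptr_t)mem, sizeof(void *));

	return (*slot);
}
```

For an address that is 2 bytes past a 4-byte boundary, align_up() with an
alignment of 4 adds 2 bytes; with 8-byte pointers the adjustment depends
on where the buffer starts.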
Reported by: hps
Reviewed by: hps, gallatin
r344504 added an extra ARP_LOG() call in case of an if_output() failure.
It turns out IPv4 can be noisy. In order to not spam the console by default:
(a) add a counter for these events so people can keep better track of how
often it happens, and
(b) add a sysctl to select the default ARP_LOG log level and set it to
INFO, so that the one new DEBUG-level message is not logged by default.
Claim a spare (1st one after 10 years since the stats were added) in order
to not break netstat from FreeBSD 12->13 updates in the future.
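
The combination of (a) and (b) can be modelled with a small sketch: the
failure is always counted, but the message is only emitted when the
configured level admits DEBUG. All names are hypothetical stand-ins for
the new counter and sysctl; the numeric priorities follow the syslog(3)
convention where a larger number is less severe.

```c
#include <stdint.h>

/* syslog(3)-style priorities: a larger number is less severe. */
#define	SK_LOG_INFO	6
#define	SK_LOG_DEBUG	7

static int sk_log_level = SK_LOG_INFO;	/* stand-in for the new sysctl */
static uint64_t sk_txerrors;		/* stand-in for the new counter */
static uint64_t sk_logged;		/* messages actually emitted */

/* Emit a message only if its priority is within the configured level. */
static void
sk_log(int pri)
{
	if (pri <= sk_log_level)
		sk_logged++;
}

/* Always count an output failure; report it at DEBUG priority. */
static void
sk_output_failed(void)
{
	sk_txerrors++;
	sk_log(SK_LOG_DEBUG);
}
```

With the default level the console stays quiet while the counter still
records every failure; raising the level to DEBUG brings the messages back.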
Reviewed by: karels
Differential Revision: https://reviews.freebsd.org/D19490
All changes are hidden behind the EXPERIMENTAL option and are not compiled
in by default.
Add ND6_IFF_IPV6_ONLY_MANUAL to be able to set the interface into no-IPv4-mode
manually without router advertisement options. This will allow developers to
test software for the appropriate behaviour even on dual-stack networks or
IPv6-Only networks without the option being set in RA messages.
Update ifconfig to allow setting and displaying the flag.
Update the checks for the filters to check for either the automatic or the manual
flag to be set. Add REVARP to the list of filtered IPv4-related protocols and add
an input filter similar to the output filter.
When receiving the IPv6-Only RA flag, check whether the receiving
interface has any IPv4 configured. If it does, ignore the IPv6-Only flag.
Add a per-VNET global sysctl, which is on by default, to not process the automatic
RA IPv6-Only flag. This way an administrator (if this is compiled in) has control
over the behaviour in case the node still relies on IPv4.
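
The decision logic described above can be sketched roughly as follows.
This is a simplified userland model under stated assumptions: the
structure, the variable standing in for the per-VNET sysctl and the
function names are all hypothetical, and only the order of the checks is
taken from the description.

```c
#include <stdbool.h>

/* Hypothetical per-interface state; not the kernel's structures. */
struct if_state {
	bool	ipv6_only_auto;		/* set from the RA flag */
	bool	ipv6_only_manual;	/* set by the administrator */
	bool	has_ipv4;		/* any IPv4 address configured */
};

/* Stand-in for the per-VNET sysctl; on by default, so RA flags are ignored. */
static bool ignore_ra_ipv6_only = true;

/* Process the IPv6-Only flag carried by a received RA. */
static void
ra_ipv6_only_input(struct if_state *ifp, bool ra_flag)
{
	if (ignore_ra_ipv6_only)
		return;
	if (ra_flag && ifp->has_ipv4)
		return;			/* node still relies on IPv4 */
	ifp->ipv6_only_auto = ra_flag;
}

/* IPv4-related traffic is filtered if either flag is set. */
static bool
ipv4_filtered(const struct if_state *ifp)
{
	return (ifp->ipv6_only_auto || ifp->ipv6_only_manual);
}
```

The manual flag bypasses both guards, which is what lets developers test
the no-IPv4 behaviour on networks that never send the RA option.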
From Jake:
"The iflib_fl_setup() function tries to pick various buffer sizes based
on the max_frame_size value defined by the parent driver. However, this
code was wrapped under CONTIGMALLOC_WORKS, which was never actually
defined anywhere.
This same code pattern was used in if_em.c, likely trying to match
what iflib uses.
Since CONTIGMALLOC_WORKS is not defined, remove this dead code from
iflib_fl_setup and if_em.c
Given that various iflib drivers appear to be using a similar
calculation, it might be worth making this buffer size a value that the
driver can peek at in the future."
Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: shurd@
MFC after: 1 week
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D19199
The if_tun cloner is not virtualised, but if_clone_attach() does use a
virtualised list of cloners.
The result is that we can't find the if_tun cloner when we try to remove
a renamed tun interface. Virtualise the cloner, and move the final
cleanup into a sysuninit so that we're sure this happens after all of
the vnet_sysuninits.
Note that we need unit numbers to be system-unique (rather than unique
per vnet, as is done by if_clone_simple()). The unit number is used to
create the corresponding /dev/tunX device node, and this node must match
with the interface.
Switch to if_clone_advanced() so that we have control over the unit
numbers.
Reproduction scenario:
jail -c -n test persist vnet
jexec test ifconfig tun create
jexec test ifconfig tun0 name wg0
jexec test ifconfig wg0 destroy
PR: 235704
Reviewed by: bz, hrs, hselasky
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D19248
Mask off the bits we don't care about when checking that capabilities
of the member interfaces have been disabled as intended.
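
A minimal sketch of the masked comparison, with made-up capability bits
(the names and values below are illustrative only, not the IFCAP_*
constants): bits outside the managed mask must not influence the result.

```c
#include <stdint.h>

/* Hypothetical capability bits; the values are illustrative only. */
#define	CAP_RXCSUM	0x01u
#define	CAP_TXCSUM	0x02u
#define	CAP_TSO		0x04u
#define	CAP_VLAN_MTU	0x08u	/* a bit the caller does not manage */

#define	MANAGED_CAP_MASK	(CAP_RXCSUM | CAP_TXCSUM | CAP_TSO)

/* Return nonzero if the managed bits of 'have' equal those of 'want'. */
static int
caps_match(uint32_t have, uint32_t want)
{
	return ((have & MANAGED_CAP_MASK) == (want & MANAGED_CAP_MASK));
}
```

Comparing the raw values instead would report a mismatch whenever an
unrelated bit such as CAP_VLAN_MTU happens to be set.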
Submitted by: Ryan Moeller <ryan@ixsystems.com>
Reviewed by: kristof, mav
MFC after: 1 week
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D18924
ratelimiting code. The two modules (lagg and vlan) did have
allocation routines, and even though they are indirect (and
vector down to the underlying interfaces) they both need to
have a free routine (that also vectors down to the actual interface).
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D19032
So far, intr_{g,s}etaffinity(9) take a single int for identifying
a device interrupt. This approach doesn't work on all architectures
supported, as a single int isn't sufficient to globally specify a
device interrupt. In particular, with multiple interrupt controllers
in one system as found on e. g. arm and arm64 machines, an interrupt
number as returned by rman_get_start(9) may be only unique relative
to the bus and, thus, interrupt controller, a certain device hangs
off from.
In turn, this makes taskqgroup_attach{,_cpu}(9) and - internal to
the gtaskqueue implementation - taskqgroup_attach_deferred{,_cpu}()
not work across architectures. Yet in turn, iflib(4) as gtaskqueue
consumer so far doesn't fit architectures where interrupt numbers
aren't globally unique.
However, at least for intr_setaffinity(..., CPU_WHICH_IRQ, ...) as
employed by the gtaskqueue implementation to bind an interrupt to a
particular CPU, using bus_bind_intr(9) instead is equivalent from
a functional point of view, with bus_bind_intr(9) taking the device
and interrupt resource arguments required for uniquely specifying a
device interrupt.
Thus, change the gtaskqueue implementation to employ bus_bind_intr(9)
instead and intr_{g,s}etaffinity(9) to take the device and interrupt
resource arguments required respectively. This change also moves
struct grouptask from <sys/_task.h> to <sys/gtaskqueue.h> and wraps
struct gtask along with the gtask_fn_t typedef into #ifdef _KERNEL
as userland likes to include <sys/_task.h> or indirectly drags it
in - for better or worse also with _KERNEL defined -, which with
device_t and struct resource dependencies otherwise is no longer
as easily possible now.
The userland inclusion problem probably can be improved a bit by
introducing a _WANT_TASK (as well as a _WANT_MOUNT) akin to the
existing _WANT_PRISON etc., which is orthogonal to this change,
though, and likely needs an exp-run.
While at it:
- Change the gt_cpu member in the grouptask structure to be of type
int as used elsewhere for specifying CPUs (an int16_t may be too
narrow sooner or later),
- move the gtaskqueue_enqueue_fn typedef from <sys/gtaskqueue.h> to
the gtaskqueue implementation as it's only used and needed there,
- change the GTASK_INIT macro to use "gtask" rather than "task" as
argument given that it actually operates on a struct gtask rather
than a struct task, and
- let subr_gtaskqueue.c consistently use __func__ to print function
names.
Reported by: mmel
Reviewed by: mmel
Differential Revision: https://reviews.freebsd.org/D19139
o Correct the obvious bugs in the netmap(4) parts:
- No longer check for the existence of DMA maps as bus_dma(9)
is used unconditionally in iflib(4) since r341095.
- Supply the correct DMA tag and map pairs to bus_dma(9)
functions (see also the commit message of r343753).
- In iflib_netmap_timer_adjust(), add synchronization of the
TX descriptors before calling the ift_txd_credits_update
method as the latter evaluates the TX descriptors possibly
updated by the MAC.
- In _task_fn_tx(), wrap the netmap(4)-specific bits in
#ifdef DEV_NETMAP just as done in _task_fn_admin() and
_task_fn_rx() respectively.
o In iflib_fast_intr_rxtx(), synchronize the TX rather than
the RX descriptors before calling the ift_txd_credits_update
method (see also above).
o There's no need to synchronize an RX buffer that is going to
be recycled in iflib_rxd_pkt_get(), yet; it's sufficient to
do that as late as passing RX buffers to the MAC via the
ift_rxd_refill method. Hence, combine that synchronization
with the synchronization of new buffers into a common spot
in _iflib_fl_refill().
o There's no need to synchronize the RX descriptors of a free
list in preparation of the MAC updating their statuses with
every invocation of rxd_frag_to_sd(); it's enough to do this
once before handing control over to the MAC, i. e. before
calling ift_rxd_flush method in _iflib_fl_refill(), which
already performs the necessary synchronization.
o Given that the ift_rxd_available method evaluates the RX
descriptors which possibly have been altered by the MAC,
synchronize as appropriate beforehand. Most notably this
is now done in iflib_rxd_avail(), which in turn means that
we don't need to issue the same synchronization yet again
before calling the ift_rxd_pkt_get method in iflib_rxeof().
o In iflib_txd_db_check(), synchronize the TX descriptors
before handing them over to the MAC for transmission via
the ift_txd_flush method.
o In iflib_encap(), move the TX buffer synchronization after
the invocation of the ift_txd_encap() method. If the MAC
driver fails to encapsulate the packet and we retry with
a defragmented mbuf chain or finally fail, the cycles for
TX buffer synchronization have been wasted. Synchronizing
afterwards matches what non-iflib(4) drivers typically do
and is sufficient as the MAC will not actually start with
the transmission before - in this case - the ift_txd_flush
method is called.
Moreover, for the latter reason the synchronization of the
TX descriptors in iflib_encap() can go as it's enough to
synchronize them before passing control over to the MAC by
issuing the ift_txd_flush() method (see above).
o In iflib_txq_can_drain(), only synchronize TX descriptors
if the ift_txd_credits_update method accessing these is
actually called.
Differential Revision: https://reviews.freebsd.org/D19081
In general, the time savings come from separating the active and
inactive queues lists into separate interface and non-interface queue
lists, and changing the rule and queue tag management from list-based
to hash-based.
In HFSC, a linear scan of the class table during each queue destroy
was also eliminated.
There are now two new tunables to control the hash size used for each
tag set (default for each is 128):
net.pf.queue_tag_hashsize
net.pf.rule_tag_hashsize
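
To illustrate the list-to-hash change, here is a small userland sketch of
a name-to-tag lookup with chained buckets. It is not pf's implementation:
the structure layout, the function names and the FNV-1a hash are
assumptions of this sketch; only the default bucket count of 128 comes
from the text above.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define	TAG_HASHSIZE	128	/* default stated above; must be a power of 2 */
#define	TAG_NAME_MAX	64

struct tag {
	struct tag	*next;
	uint16_t	 id;
	char		 name[TAG_NAME_MAX];
};

struct tagset {
	struct tag	*bucket[TAG_HASHSIZE];
	uint16_t	 next_id;
};

/* FNV-1a string hash; an assumption of this sketch, not pf's hash function. */
static uint32_t
tag_hash(const char *s)
{
	uint32_t h = 2166136261u;

	for (; *s != '\0'; s++) {
		h ^= (uint8_t)*s;
		h *= 16777619u;
	}
	return (h & (TAG_HASHSIZE - 1));
}

/* Return the id for name, allocating a new tag if needed; 0 on failure. */
static uint16_t
tagname2tag(struct tagset *ts, const char *name)
{
	struct tag *t, **head = &ts->bucket[tag_hash(name)];

	/* Only the one bucket is scanned, not the whole tag set. */
	for (t = *head; t != NULL; t = t->next)
		if (strcmp(t->name, name) == 0)
			return (t->id);
	if (strlen(name) >= TAG_NAME_MAX || (t = malloc(sizeof(*t))) == NULL)
		return (0);
	strcpy(t->name, name);
	t->id = ++ts->next_id;
	t->next = *head;
	*head = t;
	return (t->id);
}
```

A lookup now walks one short chain instead of every tag in the set, which
is where the time savings for large rulesets would come from. The sketch
ignores id reuse and wrap-around.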
Reviewed by: kp
MFC after: 1 week
Sponsored by: RG Nets
Differential Revision: https://reviews.freebsd.org/D19131
controller datasheet revision 3.3, in the context of Ethernet
MACs the control data describing the packet buffers typically
are named "descriptors". Each of these descriptors references
one buffer, multiple of which a packet can be composed of.
By contrast, in comments, messages and the names of structure
members, iflib(4) refers to DMA resources employed for RX and
TX buffers (rather than control data) as "desc(riptors)".
This odd naming convention of iflib(4) made reviewing r343085
and identifying wrong and missing bus_dmamap_sync(9) calls in
particular way harder than it already is. This convention may
also explain why the netmap(4) part of iflib(4) pairs the DMA
tags for control data with DMA maps of buffers and vice versa
in calls to bus_dma(9) functions.
Therefore, change iflib(4) to refer to buf(fers) when buffers
and not the usual understanding of descriptors is meant. This
change does not include corrections to the DMA resources used
in the netmap(4) parts. However, it revises error messages to
state which kind of allocation/creation failed. Specifically,
the "Unable to allocate tx_buffer (map) memory" copy & pasted
inappropriately on several occasions was replaced with proper
messages.
o Enhance some other error messages to indicate which half - RX
or TX - they apply to instead of using identical text in both
cases and generally canonicalize them.
o Correct the descriptions of iflib_{r,t}xsd_alloc() to reflect
reality; current code doesn't use {r,t}x_buffer structures.
o In iflib_queues_alloc():
- Remove redundant BUS_DMA_NOWAIT of iflib_dma_alloc() calls,
- change the M_WAITOK from malloc(9) calls into M_NOWAIT. The
return values are already checked, deferred DMA allocations
not being an option at this point, BUS_DMA_NOWAIT has to be
used anyway and prior malloc(9) calls in this function also
specify M_NOWAIT.
Reviewed by: shurd
Differential Revision: https://reviews.freebsd.org/D19067
Add SYNC_KLOOP_MODE option, and add support for direct mode, where application
executes the TXSYNC and RXSYNC in the context of the ioeventfd wake up callback.
MFC after: 5 days
The KPI has been reviewed and cleansed of features that were planned
20 years ago and never implemented. The pfil(9) internals have
been made opaque to protocols, with only returned types and function
declarations exposed. The KPI is made more strict, but at the same time
more extensible, as the kernel uses the same command structures that the
userland ioctl uses.
In a nutshell, the [KA]PI is about declaring filtering points, declaring
filters, and linking and unlinking them together.
The new [KA]PI makes it possible to reconfigure pfil(9):
change the order of hooks, rehook a filter from one filtering point to a
different one, disconnect a hook on output leaving it on input only, or
prepend/append a filter to an existing list of filters.
Now it is possible for a single packet filter to provide multiple rulesets
that may be linked to different points. Think of per-interface ACLs in
Cisco or Juniper. None of the existing packet filters support that yet;
however, limited usage is already possible, e.g. the default ruleset can
be moved to a single interface, as soon as interfaces provide their
filtering points.
Another future feature is the possibility to create pfil heads that provide
not an mbuf pointer but just a memory pointer with length. That would
allow filtering at very early stages of a packet lifecycle, e.g. when a
packet has just been received by a NIC and no mbuf has been allocated yet.
Differential Revision: https://reviews.freebsd.org/D18951
lagg_capabilities() will set the capability once interfaces supporting
the feature are added to the lagg. Setting it on a lagg without any
interfaces is pointless as the if_snd_tag_alloc call will always fail
in that case.
Reviewed by: hselasky, gallatin
MFC after: 2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19040
The pfil(9) system is about to be converted to epoch(9) synchronization, so
we need to [temporarily] go back to ipfw internal locking.
Discussed with: ae
bus_teardown_intr(9) before pci_release_msi(9).
- Ensure that iflib(4) and associated drivers pass correct RIDs to
bus_release_resource(9) by obtaining the RIDs via rman_get_rid(9)
on the corresponding resources instead of using the RIDs initially
passed to bus_alloc_resource_any(9) as the latter function may
change those RIDs. Solely em(4) for the ioport resource (but not
others) and bnxt(4) were using the correct RIDs by caching the ones
returned by bus_alloc_resource_any(9).
- Change the logic of iflib_msix_init() around to only map the MSI-X
BAR if MSI-X is actually supported, i. e. pci_msix_count(9) returns
> 0. Otherwise the "Unable to map MSIX table " message triggers for
devices that simply don't support MSI-X and the user may think that
something is wrong while in fact everything works as expected.
- Put some (mostly redundant) debug messages emitted by iflib(4)
and em(4) during attachment under bootverbose. The non-verbose
output of em(4) seen during attachment now is close to the one
prior to the conversion to iflib(4).
- Replace various variants of spelling "MSI-X" (several in messages)
with "MSI-X" as used in the PCI specifications.
- Remove some trailing whitespace from messages emitted by iflib(4)
and change them to consistently start with uppercase.
- Remove some obsolete comments about releasing interrupts from
drivers and correct a few others.
Reviewed by: erj, Jacob Keller, shurd
Differential Revision: https://reviews.freebsd.org/D18980
corresponding bitmap before adding an mbuf has actually succeeded.
Previously, m_gethdr(M_NOWAIT, ...) failing caused a "hole" in the
RX ring but not in its bitmap. One implication of such a hole was
that in a subsequent call to _iflib_fl_refill() with the RX buffer
accounting still indicating another reclaimable buffer, bit_ffc(3)
nevertheless returned -1 in frag_idx which in turn caused havoc
when used as an index. Thus, additionally assert that frag_idx is
0 or greater.
Another possible consequence of a hole in the RX ring was a NULL-
dereference when trying to use the unallocated mbuf, for example
in iflib_rxd_pkt_get().
While at it, make the variable declarations in _iflib_fl_refill()
conform to style(9) and remove redundant checks already performed
by bit_ffc{,_at}(3).
- In iflib_queues_alloc(), don't pass redundant M_ZERO to bit_alloc(3).
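
The ordering fix can be shown with a small userland model: the bitmap bit
is set only after the buffer allocation has succeeded, so a failed
allocation cannot leave the ring and its bitmap out of step. The ring
layout, the 8-slot bitmap and the injected allocator are assumptions of
this sketch, not iflib's data structures.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define	RING_SIZE	8

struct rx_ring {
	void	*buf[RING_SIZE];	/* buffer per slot, NULL if empty */
	uint8_t	 used;			/* one bit per slot, set when filled */
};

/* Return the index of the first clear bit, or -1 if all are set. */
static int
first_clear(uint8_t map)
{
	for (int i = 0; i < RING_SIZE; i++)
		if ((map & (1u << i)) == 0)
			return (i);
	return (-1);
}

/*
 * Fill one free slot.  The bit is set only after the allocation has
 * succeeded, so a failed allocation leaves ring and bitmap consistent.
 * Returns the slot index, or -1 if nothing was filled.
 */
static int
ring_refill_one(struct rx_ring *r, void *(*alloc)(void))
{
	int idx = first_clear(r->used);
	void *m;

	if (idx < 0)
		return (-1);
	if ((m = alloc()) == NULL)
		return (-1);
	r->buf[idx] = m;
	r->used |= (uint8_t)(1u << idx);
	return (idx);
}

/* Allocators used to exercise both the success and the failure path. */
static void *alloc_ok(void) { return (malloc(16)); }
static void *alloc_fail(void) { return (NULL); }
```

Setting the bit before calling the allocator would reproduce the "hole":
a slot marked as used with no buffer behind it.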
Reported and tested by: pho
The new loop to sync and unload descriptors was indexed
by "i", rather than "j". The panic was caused by "i"
being advanced rather than "j", and eventually becoming
out of bounds.
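
For illustration, a corrected nested loop of this general shape is
sketched below. The array dimensions, the meaning of the outer index and
the function name are assumptions; the commit only states that the inner
loop has to advance "j".

```c
#include <stddef.h>

#define	NQUEUES	2
#define	NBUFS	3

/* Unload every loaded buffer of every queue; return how many were unloaded. */
static int
unload_all(int loaded[NQUEUES][NBUFS])
{
	int count = 0;

	for (size_t i = 0; i < NQUEUES; i++) {
		/* The inner loop tests and advances j, never i. */
		for (size_t j = 0; j < NBUFS; j++) {
			if (loaded[i][j]) {
				loaded[i][j] = 0;
				count++;
			}
		}
	}
	return (count);
}
```

Writing i++ in the inner loop header would leave j stuck at 0 and push i
past NQUEUES, which is the out-of-bounds access described above.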
Reviewed by: kib
MFC after: 3 days
Sponsored by: Netflix
Changelist:
- Add the proper memory barriers in the kloop ring processing
functions.
- Fix memory barriers usage in the user helpers (nm_sync_kloop_appl_write,
nm_sync_kloop_appl_read).
- Fix nm_kr_txempty() helper to look at rhead rather than rcur. This
is important since the kloop can read a value of rcur which is ahead
of the value of rhead (see explanation in nm_sync_kloop_appl_write).
- Remove obsolete ptnetmap_guest_write_kring_csb() and
ptnet_guest_read_kring_csb(), and update if_ptnet(4) to use the
nm_sync_kloop_appl_write() and nm_sync_kloop_appl_read() helpers instead.
- Prepare in advance the arguments for netmap_sync_kloop_[tr]x_ring(),
to make the kloop faster.
- Provide kernel and user implementation for nm_ldld_barrier() and
nm_ldst_barrier().
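
The rcur/rhead relationship mentioned above can be sketched with C11
atomics. This is a userland model under assumptions, not the netmap
helpers: the structure and function names are hypothetical, and
release/acquire ordering stands in for the explicit barriers. The writer
publishes cur before head and the reader loads head before cur, so the
reader may see a cur that is newer than its head, but not a head that is
newer than its cur.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical shared block holding the application's ring pointers. */
struct csb {
	_Atomic uint32_t cur;
	_Atomic uint32_t head;
};

/* Application side: store cur first, then publish head with release. */
static void
csb_write(struct csb *c, uint32_t cur, uint32_t head)
{
	atomic_store_explicit(&c->cur, cur, memory_order_relaxed);
	atomic_store_explicit(&c->head, head, memory_order_release);
}

/* Kernel loop side: load head with acquire first, then cur. */
static void
csb_read(struct csb *c, uint32_t *cur, uint32_t *head)
{
	*head = atomic_load_explicit(&c->head, memory_order_acquire);
	*cur = atomic_load_explicit(&c->cur, memory_order_relaxed);
}
```

Since the reader's cur may run ahead of its head, a check for an empty TX
ring that is based on head is the conservative choice.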
MFC after: 2 weeks
This is more compatible with formatting tools and looks more normal.
Reported by: jhb (on a different review)
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D18442
Also, expose IFLIB_MAX_RX_SEGS to iflib drivers and add
iflib_dma_alloc_align() to the iflib API.
Performance is generally better with the tunable/sysctl
dev.vmx.<index>.iflib.tx_abdicate=1.
Reviewed by: shurd
MFC after: 1 week
Relnotes: yes
Sponsored by: RG Nets
Differential Revision: https://reviews.freebsd.org/D18761
- Specify BUS_DMA_NOWAIT for bus_dmamap_load() on rx refill, since
callbacks are not supposed to be used.
- Match tso/non-tso tags to corresponding tx map operations. Create
separate tso maps for tx descriptors. In particular, do not use
non-tso tag to load, unload, or destroy a map created with tso tag.
- Add missed bus_dmamap_sync() calls.
Submitted by: marius.
Reported and tested by: pho
Reviewed by: marius
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
There are several reasons:
- The structure being exported via IFDATA_LINKSPECIFIC doesn't appear
to be a standard MIB.
- The structure being exported is private to the kernel and always
has been.
- No other drivers in common use set the if_linkmib field.
- Because IFDATA_LINKSPECIFIC can be used to overwrite the linkmib
structure, a privileged user could use it to corrupt internal
vlan(4) state. [1]
PR: 219472
Reported by: CTurt <ecturt@gmail.com> [1]
Reviewed by: kp (previous version)
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18779
- Remove macros that covertly create epoch_tracker on thread stack. Such
macros are quite unsafe, e.g. they will produce buggy code if the same macro is
used in nested scopes. Explicitly declare epoch_tracker always.
- Unmask interface list IFNET_RLOCK_NOSLEEP(), interface address list
IF_ADDR_RLOCK() and interface AF specific data IF_AFDATA_RLOCK() read
locking macros to what they actually are - the net_epoch.
Keeping them as is is very misleading. They all are named FOO_RLOCK(),
while they no longer have lock semantics. Now they allow recursion and
what's more important they now no longer guarantee protection against
their companion WLOCK macros.
Note: INP_HASH_RLOCK() has the same problems, but is not touched by this commit.
This is a non-functional mechanical change. The only functionally changed
functions are ni6_addrs() and ni6_store_addrs(), where we no longer enter
epoch recursively.
Discussed with: jtl, gallatin
iflib_init_locked() assumes that iflib_stop() has been called; however,
it is not called for suspend. iflib_if_init_locked() calls stop then init,
so using it on resume fixes the problem.
This was causing errors after a resume from suspend.
PR: 224059
Reported by: zeising
MFC after: 1 week
Sponsored by: Limelight Networks
r336560 was supposed to restore pre-r323954 behaviour when tx_abdicate is
not set (the default case). However, it appears that rather than the drainage
check being made conditional on tx_abdicate being set, it was duplicated
so it occurred twice if tx_abdicate was set and once if it was not.
Now when !tx_abdicate, drainage is only checked if the doorbell isn't
pending.
Reported by: lev
MFC after: 1 week
Sponsored by: Limelight Networks
Changelist:
- Replace netmap passthrough host support with a more general
mechanism to call TXSYNC/RXSYNC from an in-kernel event-loop.
No kernel threads are needed to use this feature: the application
is required to spawn a thread (or a process) and issue a
SYNC_KLOOP_START (NIOCCTRL) command in the thread body. The
kernel loop is executed by the ioctl implementation, which returns
to userspace only when a different thread calls SYNC_KLOOP_STOP
or the netmap file descriptor is closed.
- Update the if_ptnet driver to cope with the new data structures,
and prune all the obsolete ptnetmap code.
- Add support for "null" netmap ports, useful to allocate netmap_if,
netmap_ring and netmap buffers to be used by specialized applications
(e.g. hypervisors). TXSYNC/RXSYNC on these ports have no effect.
- Various fixes and code refactoring.
Sponsored by: Sunny Valley Networks
Differential Revision: https://reviews.freebsd.org/D18015