stf(4) interfaces are not multicast-capable so they can't perform DAD.
They also did not set IFF_DRV_RUNNING when an address was assigned, so
the logic in nd6_timer() would periodically flag such an address as
tentative, resulting in interface flapping.
Fix the problem by setting IFF_DRV_RUNNING when an address is assigned,
and do some related cleanup:
- In in6if_do_dad(), remove a redundant check for !UP || !RUNNING.
There is only one caller in the tree, and it only looks at whether
the return value is non-zero.
- Have in6if_do_dad() return false if the interface is not
multicast-capable.
- Set ND6_IFF_NO_DAD when an address is assigned to an stf(4) interface
and the interface goes UP as a result. Note that this is not
sufficient to fix the problem because the new address is marked as
tentative and DAD is started before in6_ifattach() is called.
However, setting no_dad is formally correct.
- Change nd6_timer() to not flag addresses as tentative if no_dad is
set.
This is based on a patch from Viktor Dukhovni.
Reported by: Viktor Dukhovni <ietf-dane@dukhovni.org>
Reviewed by: ae
MFC after: 3 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19751
From Jake:
iflib_if_transmit returns ENOBUFS when the device is down, or when the
link isn't active.
This was changed in r308792 from return (0), so that the function
correctly reports an error that it was unable to transmit.
However, using ENOBUFS can cause some network applications to produce
the following or similar errors:
"ping: sendto: No buffer space available"
This is a bit confusing as the real cause of the issue is that the
network device is down.
Replace the ENOBUFS return with ENETDOWN to indicate more clearly that
the reason for the failure to send is due to the network device is
offline.
This will cause the error message to be reported as
"ping: sendto: Network is down"
Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: shurd@, sbruno@, bz@
MFC after: 1 week
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D19652
From Jake:
The iflib_device_register function takes the CTX lock before calling
IFDI_ATTACH_PRE, and releases it upon finishing the registration.
Mirror this process in iflib_pseudo_register, so that we always hold the
CTX lock during the attach process when registering a pseudo interface
or a regular interface.
This was caught by code inspection while attempting to analyze where the
CTX lock was held.
Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: shurd@, erj@
MFC after: 1 week
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D19604
lagg_bcast_start appeared to have a bug in that was using the last
lagg port structure after exiting the epoch that was keeping that
structure alive. However, upon further inspection, the epoch was
already entered by the caller (lagg_transmit), so the epoch enter/exit
in lagg_bcast_start was actually unnecessary.
This commit generally removes uses of the net epoch via LAGG_RLOCK to
protect the list of ports when the list of ports was already protected
by an existing LAGG_RLOCK in a caller, or the LAGG_XLOCK.
It also adds a missing epoch enter/exit in lagg_snd_tag_alloc while
accessing the lagg port structures. An ifp is still accessed via an
unsafe reference after the epoch is exited, but that is true in the
current code and will be fixed in a future change.
Reviewed by: gallatin
MFC after: 1 month
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19718
Consider a bridge0 with em0 and em1 members. Traffic rx'd by em0 and
transmitted by bridge0 through em1 gets accounted for in IPACKETS/IBYTES
and bridge0 bpf -- assuming it's not unicast traffic destined for em1.
Unicast traffic destined for em1 traffic is not accounted for by any
mechanism, and isn't pushed through bridge0's bpf machinery as any other
packets that pass over the bridge do.
Fix this and simplify GRAB_OUR_PACKETS by bailing out early if it was rx'd
by the interface that it was addressed for. Everything else there is
relevant for any traffic that came in from one member that's being directed
at another member of the bridge.
Reviewed by: kp
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D19614
From Jake:
The iflib core never modifies the isc_driver_version string. Allow
drivers to safely assign pointers to constant buffers by marking this
parameter const.
Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: erj@, gallatin@, jhb@
MFC after: 1 week
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D19577
From Jake:
iflib_fl_setup calculates a suitable buffer size for the Rx mbufs based
on the isc_max_frame_size value that drivers setup. This calculation is
repeated by drivers when programming their hardware with the size of
each Rx buffer.
This can lead to a mismatch where the iflib mbuf size is different from
the expected size of the buffer as programmed by the hardware. This can
lead to unexpected results.
If iflib ever wants to support mbuf sizes larger than one page, every
driver must be updated to account for the new possible buffer sizes.
Fix this by calculating the mbuf size prior to calling IFDI_INIT, and
adding the iflib_get_rx_mbuf_sz function which will expose this value to
drivers, so that they do not repeat the same calculation.
Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: shurd@, erj@
MFC after: 1 week
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D19489
From Jake:
iflib_encap calls bus_dmamap_load_mbuf_sg. Upon it returning EFBIG, an
m_collapse and an m_defrag are attempted to shrink the mbuf cluster to
fit within the DMA segment limitations.
However, if we call m_defrag, and then bus_dmamap_load_mbuf_sg returns
EFBIG on the now defragmented mbuf, we will continuously re-call
bus_dmamap_load_mbuf_sg over and over.
This happens because m_head isn't NULL, and remap is >1, so we don't try
to m_collapse or m_defrag again. The only way we exit the loop is if
m_head is NULL. However, m_head can't be modified by the call to
bus_dmamap_load_mbuf_sg, because we don't pass it as a double pointer.
I believe this will be an incredibly rare occurrence, because it is
unlikely that bus_dmamap_load_mbuf_sg will actually fail on the second
defragment with an EFBIG error. However, it still seems like
a possibility that we should account for.
Fix the exit check to ensure that if remap is >1, we will also exit,
even if m_head is not NULL.
Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: shurd@, gallatin@
MFC after: 1 week
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D19468
and remove possible panic condition.
It is already allowed to sleep in bpfattach[2], since BPF_LOCK was
converted to SX lock in r332388. Also move KASSERT() to the top of
function and make full initialization before bpf_if will be linked
to BPF's list of interfaces.
MFC after: 2 weeks
Some applications forward from/to host rings most or all the
traffic received or sent on a physical interface. In this
cases it is desirable to have more than a pair of RX/TX host
rings, and use multiple threads to speed up forwarding.
This change adds support for multiple host rings. On registering
a netmap port, the user can specify the number of desired receive
and transmit host rings in the nr_host_tx_rings and nr_host_rx_rings
fields of the nmreq_register structure.
MFC after: 2 weeks
At this point, all routes should've already been dropped by removing all
members from the bridge. This condition is in-fact KASSERT'd in the line
immediately above where this nop flush was added.
At this point, all routes should've already been dropped by removing all
members from the bridge. This condition is in-fact KASSERT'd in the line
immediately above where this nop flush was added.
After r345180 we need to have the appropriate vnet context set to delete an
rtnode in bridge_rtnode_destroy().
That's usually the case, but not when it's called by the STP code (through
bstp_notify_rtage()).
We have to set the vnet context in bridge_rtable_expire() just as we do in the
other STP callback bridge_state_change().
Reviewed by: kevans
bridge_rtnode_zone still has outstanding allocations at the time of
destruction in the current model because all of the interface teardown
happens in a VNET_SYSUNINIT, -after- the MOD_UNLOAD has already been
processed. The SYSUNINIT triggers destruction of the interfaces, which then
attempts to free the memory from the zone that's already been destroyed, and
we hit a panic.
Solve this by virtualizing the uma_zone we allocate the rtnodes from to fix
the ordering. bridge_rtable_fini should also take care to flush any
remaining routes that weren't taken care of when dynamic routes were flushed
in bridge_stop.
Reviewed by: kp
Differential Revision: https://reviews.freebsd.org/D19578
If the spanning tree root interface is removed from the bridge we panic
on the next 'ifconfig'.
While the STP code is notified whenever a bridge member interface is
removed from the bridge it does not clear the bs_root_port. This means
bs_root_port can still point at an bridge_iflist which has been free()d.
The next access to it will panic.
Explicitly check if the interface we're removing in bstp_destroy() is
the root, and if so re-assign the roles, which clears bs_root_port.
Reviewed by: philip
MFC after: 2 weeks
The counters of pf tables are updated outside the rule lock. That means state
updates might overwrite each other. Furthermore allocation and
freeing of counters happens outside the lock as well.
Use counter(9) for the counters, and always allocate the counter table
element, so that the race condition cannot happen any more.
PR: 230619
Submitted by: Kajetan Staszkiewicz <vegeta@tuxpowered.net>
Reviewed by: glebius
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D19558
This has the advantage of being obvious to sniff out the designated prefix
by eye and it has all the right bits set. Comment stolen from ffec.
I've removed bryanv@'s pending question of using the FreeBSD OUI range --
no one has followed up on this with a definitive action, and there's no
particular reason to shoot for it and the administrative overhead that comes
with deciding exactly how to use it.
We currently have two places with identical fake hwaddr generation --
if_vxlan and if_bridge. Lift it into if_ethersubr for reuse in other
interfaces that may also need a fake addr.
Reviewed by: bryanv, kp, philip
Differential Revision: https://reviews.freebsd.org/D19573
PFIL_MEMPTR flag are intentionally providing a memory address that
isn't aligned to pointer alignment. This is done to align an IPv4
or IPv6 header that is expected to follow Ethernet header.
When we return PFIL_REALLOCED we store a pointer to allocated mbuf
at this address. With this change the KPI changes to store the pointer
at aligned address, which usually yields in +2 bytes.
Provide two inlines:
pfil_packet_align() to get aligned pfil_packet_t for a misaligned one
pfil_mem2mbuf() to read out mbuf pointer from misaligned pfil_packet_t
Provide function pfil_realloc(), not used yet, that would convert a
memory pfil_packet_t to an mbuf one.
Reported by: hps
Reviewed by: hps, gallatin
r344504 added an extra ARP_LOG() call in case of an if_output() failure.
It turns out IPv4 can be noisy. In order to not spam the console by default:
(a) add a counter for these events so people can keep better track of how
often it happens, and
(b) add a sysctl to select the default ARP_LOG log level and set it to
INFO avoiding the one (the new) DEBUG level by default.
Claim a spare (1st one after 10 years since the stats were added) in order
to not break netstat from FreeBSD 12->13 updates in the future.
Reviewed by: karels
Differential Revision: https://reviews.freebsd.org/D19490
All changes are hidden behind the EXPERIMENTAL option and are not compiled
in by default.
Add ND6_IFF_IPV6_ONLY_MANUAL to be able to set the interface into no-IPv4-mode
manually without router advertisement options. This will allow developers to
test software for the appropriate behaviour even on dual-stack networks or
IPv6-Only networks without the option being set in RA messages.
Update ifconfig to allow setting and displaying the flag.
Update the checks for the filters to check for either the automatic or the manual
flag to be set. Add REVARP to the list of filtered IPv4-related protocols and add
an input filter similar to the output filter.
Add a check, when receiving the IPv6-Only RA flag to see if the receiving
interface has any IPv4 configured. If it does, ignore the IPv6-Only flag.
Add a per-VNET global sysctl, which is on by default, to not process the automatic
RA IPv6-Only flag. This way an administrator (if this is compiled in) has control
over the behaviour in case the node still relies on IPv4.
From Jake:
"The iflib_fl_setup() function tries to pick various buffer sizes based
on the max_frame_size value defined by the parent driver. However, this
code was wrapped under CONTIGMALLOC_WORKS, which was never actually
defined anywhere.
This same code pattern was used in if_em.c, likely trying to match
what iflib uses.
Since CONTIGMALLOC_WORKS is not defined, remove this dead code from
iflib_fl_setup and if_em.c
Given that various iflib drivers appear to be using a similar
calculation, it might be worth making this buffer size a value that the
driver can peek at in the future."
Submitted by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed by: shurd@
MFC after: 1 week
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D19199
The if_tun cloner is not virtualised, but if_clone_attach() does use a
virtualised list of cloners.
The result is that we can't find the if_tun cloner when we try to remove
a renamed tun interface. Virtualise the cloner, and move the final
cleanup into a sysuninit so that we're sure this happens after all of
the vnet_sysuninits
Note that we need unit numbers to be system-unique (rather than unique
per vnet, as is done by if_clone_simple()). The unit number is used to
create the corresponding /dev/tunX device node, and this node must match
with the interface.
Switch to if_clone_advanced() so that we have control over the unit
numbers.
Reproduction scenario:
jail -c -n foo persist vnet
jexec test ifconfig tun create
jexec test ifconfig tun0 name wg0
jexec test ifconfig wg0 destroy
PR: 235704
Reviewed by: bz, hrs, hselasky
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D19248
Mask off the bits we don't care about when checking that capabilities
of the member interfaces have been disabled as intended.
Submitted by: Ryan Moeller <ryan@ixsystems.com>
Reviewed by: kristof, mav
MFC after: 1 week
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D18924
ratelimiting code. The two modules (lagg and vlan) did have
allocation routines, and even though they are indirect (and
vector down to the underlying interfaces) they both need to
have a free routine (that also vectors down to the actual interface).
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D19032
So far, intr_{g,s}etaffinity(9) take a single int for identifying
a device interrupt. This approach doesn't work on all architectures
supported, as a single int isn't sufficient to globally specify a
device interrupt. In particular, with multiple interrupt controllers
in one system as found on e. g. arm and arm64 machines, an interrupt
number as returned by rman_get_start(9) may be only unique relative
to the bus and, thus, interrupt controller, a certain device hangs
off from.
In turn, this makes taskqgroup_attach{,_cpu}(9) and - internal to
the gtaskqueue implementation - taskqgroup_attach_deferred{,_cpu}()
not work across architectures. Yet in turn, iflib(4) as gtaskqueue
consumer so far doesn't fit architectures where interrupt numbers
aren't globally unique.
However, at least for intr_setaffinity(..., CPU_WHICH_IRQ, ...) as
employed by the gtaskqueue implementation to bind an interrupt to a
particular CPU, using bus_bind_intr(9) instead is equivalent from
a functional point of view, with bus_bind_intr(9) taking the device
and interrupt resource arguments required for uniquely specifying a
device interrupt.
Thus, change the gtaskqueue implementation to employ bus_bind_intr(9)
instead and intr_{g,s}etaffinity(9) to take the device and interrupt
resource arguments required respectively. This change also moves
struct grouptask from <sys/_task.h> to <sys/gtaskqueue.h> and wraps
struct gtask along with the gtask_fn_t typedef into #ifdef _KERNEL
as userland likes to include <sys/_task.h> or indirectly drags it
in - for better or worse also with _KERNEL defined -, which with
device_t and struct resource dependencies otherwise is no longer
as easily possible now.
The userland inclusion problem probably can be improved a bit by
introducing a _WANT_TASK (as well as a _WANT_MOUNT) akin to the
existing _WANT_PRISON etc., which is orthogonal to this change,
though, and likely needs an exp-run.
While at it:
- Change the gt_cpu member in the grouptask structure to be of type
int as used elswhere for specifying CPUs (an int16_t may be too
narrow sooner or later),
- move the gtaskqueue_enqueue_fn typedef from <sys/gtaskqueue.h> to
the gtaskqueue implementation as it's only used and needed there,
- change the GTASK_INIT macro to use "gtask" rather than "task" as
argument given that it actually operates on a struct gtask rather
than a struct task, and
- let subr_gtaskqueue.c consistently use __func__ to print functions
names.
Reported by: mmel
Reviewed by: mmel
Differential Revision: https://reviews.freebsd.org/D19139
o Correct the obvious bugs in the netmap(4) parts:
- No longer check for the existence of DMA maps as bus_dma(9)
is used unconditionally in iflib(4) since r341095.
- Supply the correct DMA tag and map pairs to bus_dma(9)
functions (see also the commit message of r343753).
- In iflib_netmap_timer_adjust(), add synchronization of the
TX descriptors before calling the ift_txd_credits_update
method as the latter evaluates the TX descriptors possibly
updated by the MAC.
- In _task_fn_tx(), wrap the netmap(4)-specific bits in
#ifdef DEV_NETMAP just as done in _task_fn_admin() and
_task_fn_rx() respectively.
o In iflib_fast_intr_rxtx(), synchronize the TX rather than
the RX descriptors before calling the ift_txd_credits_update
method (see also above).
o There's no need to synchronize an RX buffer that is going to
be recycled in iflib_rxd_pkt_get(), yet; it's sufficient to
do that as late as passing RX buffers to the MAC via the
ift_rxd_refill method. Hence, combine that synchronization
with the synchronization of new buffers into a common spot
in _iflib_fl_refill().
o There's no need to synchronize the RX descriptors of a free
list in preparation of the MAC updating their statuses with
every invocation of rxd_frag_to_sd(); it's enough to do this
once before handing control over to the MAC, i. e. before
calling ift_rxd_flush method in _iflib_fl_refill(), which
already performs the necessary synchronization.
o Given that the ift_rxd_available method evaluates the RX
descriptors which possibly have been altered by the MAC,
synchronize as appropriate beforehand. Most notably this
is now done in iflib_rxd_avail(), which in turn means that
we don't need to issue the same synchronization yet again
before calling the ift_rxd_pkt_get method in iflib_rxeof().
o In iflib_txd_db_check(), synchronize the TX descriptors
before handing them over to the MAC for transmission via
the ift_txd_flush method.
o In iflib_encap(), move the TX buffer synchronization after
the invocation of the ift_txd_encap() method. If the MAC
driver fails to encapsulate the packet and we retry with
a defragmented mbuf chain or finally fail, the cycles for
TX buffer synchronization have been wasted. Synchronizing
afterwards matches what non-iflib(4) drivers typically do
and is sufficient as the MAC will not actually start with
the transmission before - in this case - the ift_txd_flush
method is called.
Moreover, for the latter reason the synchronization of the
TX descriptors in iflib_encap() can go as it's enough to
synchronize them before passing control over to the MAC by
issuing the ift_txd_flush() method (see above).
o In iflib_txq_can_drain(), only synchronize TX descriptors
if the ift_txd_credits_update method accessing these is
actually called.
Differential Revision: https://reviews.freebsd.org/D19081
In general, the time savings come from separating the active and
inactive queues lists into separate interface and non-interface queue
lists, and changing the rule and queue tag management from list-based
to hash-bashed.
In HFSC, a linear scan of the class table during each queue destroy
was also eliminated.
There are now two new tunables to control the hash size used for each
tag set (default for each is 128):
net.pf.queue_tag_hashsize
net.pf.rule_tag_hashsize
Reviewed by: kp
MFC after: 1 week
Sponsored by: RG Nets
Differential Revision: https://reviews.freebsd.org/D19131
controller datasheet revision 3.3, in the context of Ethernet
MACs the control data describing the packet buffers typically
are named "descriptors". Each of these descriptors references
one buffer, multiple of which a packet can be composed of.
By contrast, in comments, messages and the names of structure
members, iflib(4) refers to DMA resources employed for RX and
TX buffers (rather than control data) as "desc(riptors)".
This odd naming convention of iflib(4) made reviewing r343085
and identifying wrong and missing bus_dmamap_sync(9) calls in
particular way harder than it already is. This convention may
also explain why the netmap(4) part of iflib(4) pairs the DMA
tags for control data with DMA maps of buffers and vice versa
in calls to bus_dma(9) functions.
Therefore, change iflib(4) to refer to buf(fers) when buffers
and not the usual understanding of descriptors is meant. This
change does not include corrections to the DMA resources used
in the netmap(4) parts. However, it revises error messages to
state which kind of allocation/creation failed. Specifically,
the "Unable to allocate tx_buffer (map) memory" copy & pasted
inappropriately on several occasions was replaced with proper
messages.
o Enhance some other error messages to indicate which half - RX
or TX - they apply to instead of using identical text in both
cases and generally canonicalize them.
o Correct the descriptions of iflib_{r,t}xsd_alloc() to reflect
reality; current code doesn't use {r,t}x_buffer structures.
o In iflib_queues_alloc():
- Remove redundant BUS_DMA_NOWAIT of iflib_dma_alloc() calls,
- change the M_WAITOK from malloc(9) calls into M_NOWAIT. The
return values are already checked, deferred DMA allocations
not being an option at this point, BUS_DMA_NOWAIT has to be
used anyway and prior malloc(9) calls in this function also
specify M_NOWAIT.
Reviewed by: shurd
Differential Revision: https://reviews.freebsd.org/D19067
Add SYNC_KLOOP_MODE option, and add support for direct mode, where application
executes the TXSYNC and RXSYNC in the context of the ioeventfd wake up callback.
MFC after: 5 days
The KPI have been reviewed and cleansed of features that were planned
back 20 years ago and never implemented. The pfil(9) internals have
been made opaque to protocols with only returned types and function
declarations exposed. The KPI is made more strict, but at the same time
more extensible, as kernel uses same command structures that userland
ioctl uses.
In nutshell [KA]PI is about declaring filtering points, declaring
filters and linking and unlinking them together.
New [KA]PI makes it possible to reconfigure pfil(9) configuration:
change order of hooks, rehook filter from one filtering point to a
different one, disconnect a hook on output leaving it on input only,
prepend/append a filter to existing list of filters.
Now it possible for a single packet filter to provide multiple rulesets
that may be linked to different points. Think of per-interface ACLs in
Cisco or Juniper. None of existing packet filters yet support that,
however limited usage is already possible, e.g. default ruleset can
be moved to single interface, as soon as interface would pride their
filtering points.
Another future feature is possiblity to create pfil heads, that provide
not an mbuf pointer but just a memory pointer with length. That would
allow filtering at very early stages of a packet lifecycle, e.g. when
packet has just been received by a NIC and no mbuf was yet allocated.
Differential Revision: https://reviews.freebsd.org/D18951
lagg_capabilities() will set the capability once interfaces supporting
the feature are added to the lagg. Setting it on a lagg without any
interfaces is pointless as the if_snd_tag_alloc call will always fail
in that case.
Reviewed by: hselasky, gallatin
MFC after: 2 weeks
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19040
The pfil(9) system is about to be converted to epoch(9) synchronization, so
we need [temporarily] go back with ipfw internal locking.
Discussed with: ae
bus_teardown_intr(9) before pci_release_msi(9).
- Ensure that iflib(4) and associated drivers pass correct RIDs to
bus_release_resource(9) by obtaining the RIDs via rman_get_rid(9)
on the corresponding resources instead of using the RIDs initially
passed to bus_alloc_resource_any(9) as the latter function may
change those RIDs. Solely em(4) for the ioport resource (but not
others) and bnxt(4) were using the correct RIDs by caching the ones
returned by bus_alloc_resource_any(9).
- Change the logic of iflib_msix_init() around to only map the MSI-X
BAR if MSI-X is actually supported, i. e. pci_msix_count(9) returns
> 0. Otherwise the "Unable to map MSIX table " message triggers for
devices that simply don't support MSI-X and the user may think that
something is wrong while in fact everything works as expected.
- Put some (mostly redundant) debug messages emitted by iflib(4)
and em(4) during attachment under bootverbose. The non-verbose
output of em(4) seen during attachment now is close to the one
prior to the conversion to iflib(4).
- Replace various variants of spelling "MSI-X" (several in messages)
with "MSI-X" as used in the PCI specifications.
- Remove some trailing whitespace from messages emitted by iflib(4)
and change them to consistently start with uppercase.
- Remove some obsolete comments about releasing interrupts from
drivers and correct a few others.
Reviewed by: erj, Jacob Keller, shurd
Differential Revision: https://reviews.freebsd.org/D18980
corresponding bitmap before adding an mbuf has actually succeeded.
Previously, m_gethdr(M_NOWAIT, ...) failing caused a "hole" in the
RX ring but not in its bitmap. One implication of such a hole was
that in a subsequent call to _iflib_fl_refill() with the RX buffer
accounting still indicating another reclaimable buffer, bit_ffc(3)
nevertheless returned -1 in frag_idx which in turn caused havoc
when used as an index. Thus, additionally assert that frag_idx is
0 or greater.
Another possible consequence of a hole in the RX ring was a NULL-
dereference when trying to use the unallocated mbuf, for example
in iflib_rxd_pkt_get().
While at it, make the variable declarations in _iflib_fl_refill()
conform to style(9) and remove redundant checks already performed
by bit_ffc{,_at}(3).
- In iflib_queues_alloc(), don't pass redundant M_ZERO to bit_alloc(3).
Reported and tested by: pho
The new loop to sync and unload descriptors was indexed
by "i", rather than "j". The panic was caused by "i"
being advanced rather than "j", and eventually becoming
out of bounds.
Reviewed by: kib
MFC after: 3 days
Sponsored by: Netflix
Changelist:
- Add the proper memory barriers in the kloop ring processing
functions.
- Fix memory barriers usage in the user helpers (nm_sync_kloop_appl_write,
nm_sync_kloop_appl_read).
- Fix nm_kr_txempty() helper to look at rhead rather than rcur. This
is important since the kloop can read a value of rcur which is ahead
of the value of rhead (see explanation in nm_sync_kloop_appl_write)
- Remove obsolete ptnetmap_guest_write_kring_csb() and
ptnet_guest_read_kring_csb(), and update if_ptnet(4) to use those.
- Prepare in advance the arguments for netmap_sync_kloop_[tr]x_ring(),
to make the kloop faster.
- Provide kernel and user implementation for nm_ldld_barrier() and
nm_ldst_barrier()
MFC after: 2 weeks
This is more compatible with formatting tools and looks more normal.
Reported by: jhb (on a different review)
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D18442
Also, expose IFLIB_MAX_RX_SEGS to iflib drivers and add
iflib_dma_alloc_align() to the iflib API.
Performance is generally better with the tunable/sysctl
dev.vmx.<index>.iflib.tx_abdicate=1.
Reviewed by: shurd
MFC after: 1 week
Relnotes: yes
Sponsored by: RG Nets
Differential Revision: https://reviews.freebsd.org/D18761
- Specify BUS_DMA_NOWAIT for bus_dmamap_load() on rx refill, since
callbacks are not supposed to be used.
- Match tso/non-tso tags to corresponding tx map operations. Create
separate tso maps for tx descriptors. In particular, do not use
non-tso tag to load, unload, or destroy a map created with tso tag.
- Add missed bus_dmamap_sync() calls.
Submitted by: marius.
Reported and tested by: pho
Reviewed by: marius
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
There are several reasons:
- The structure being exported via IFDATA_LINKSPECIFIC doesn't appear
to be a standard MIB.
- The structure being exported is private to the kernel and always
has been.
- No other drivers in common use set the if_linkmib field.
- Because IFDATA_LINKSPECIFIC can be used to overwrite the linkmib
structure, a privileged user could use it to corrupt internal
vlan(4) state. [1]
PR: 219472
Reported by: CTurt <ecturt@gmail.com> [1]
Reviewed by: kp (previous version)
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18779
- Remove macros that covertly create epoch_tracker on thread stack. Such
macros a quite unsafe, e.g. will produce a buggy code if same macro is
used in embedded scopes. Explicitly declare epoch_tracker always.
- Unmask interface list IFNET_RLOCK_NOSLEEP(), interface address list
IF_ADDR_RLOCK() and interface AF specific data IF_AFDATA_RLOCK() read
locking macros to what they actually are - the net_epoch.
Keeping them as is is very misleading. They all are named FOO_RLOCK(),
while they no longer have lock semantics. Now they allow recursion and
what's more important they now no longer guarantee protection against
their companion WLOCK macros.
Note: INP_HASH_RLOCK() has same problems, but not touched by this commit.
This is non functional mechanical change. The only functionally changed
functions are ni6_addrs() and ni6_store_addrs(), where we no longer enter
epoch recursively.
Discussed with: jtl, gallatin
iflib_init_locked() assumes that iflib_stop() has been called, however,
it is not called for suspend. iflib_if_init_locked() calls stop then init,
so fixes the problem.
This was causing errors after a resume from suspend.
PR: 224059
Reported by: zeising
MFC after: 1 week
Sponsored by: Limelight Networks
r336560 was supposed to restore pre-r323954 behaviour when tx_abdicate is
not set (the default case). However, it appears that rather than the drainage
check being made conditional on tx_abdicate being set, it was duplicated
so it occured twice if tx_abdicate was set and once if it was not.
Now when !tx_abdicate, drainage is only checked if the doorbell isn't
pending.
Reported by: lev
MFC after: 1 week
Sponsored by: Limelight Networks
Changelist:
- Replace netmap passthrough host support with a more general
mechanism to call TXSYNC/RXSYNC from an in-kernel event-loop.
No kernel threads are used to use this feature: the application
is required to spawn a thread (or a process) and issue a
SYNC_KLOOP_START (NIOCCTRL) command in the thread body. The
kernel loop is executed by the ioctl implementation, which returns
to userspace only when a different thread calls SYNC_KLOOP_STOP
or the netmap file descriptor is closed.
- Update the if_ptnet driver to cope with the new data structures,
and prune all the obsolete ptnetmap code.
- Add support for "null" netmap ports, useful to allocate netmap_if,
netmap_ring and netmap buffers to be used by specialized applications
(e.g. hypervisors). TXSYNC/RXSYNC on these ports have no effect.
- Various fixes and code refactoring.
Sponsored by: Sunny Valley Networks
Differential Revision: https://reviews.freebsd.org/D18015
This code has apparently never compiled on FreeBSD since its
introduction in 2004 (r130365). It has certainly not compiled
since 2006, when r164033 added #elsif [sic] preprocessor directives.
The code was left in the tree to reduce the diff from upstream (KAME).
Since that upstream is no longer relevant, remove the long-dead code.
This commit is the direct result of:
unifdef -m -UALTQ3_COMPAT sys/net/altq/*
A later commit will do some manual cleanup.
I do not plan to MFC this. If that would help you, go for it.
IFNET_RLOCK_NOSLEEP() is epoch_enter_preempt() in FreeBSD 12+. Holding
it in sysctl_rtsock() doesn't protect us from ifnet unlinking, because
unlinking occurs with IFNET_WLOCK(), that is rw_wlock+sx_xlock, and it
doesn check that concurrent code is running in epoch section. But while
we are in epoch section, we should be able to do access to ifnet's
fields, even it was unlinked. Thus do not change if_addr and if_hw_addr
fields in ifnet_detach_internal() to NULL, since rtsock code can do
access to these fields and this is allowed while it is running in epoch
section.
This should fix the race, when ifnet_detach_internal() unlinks ifnet
after we checked it for IFF_DYING in sysctl_dumpentry.
Move free(ifp->if_hw_addr) into ifnet_free_internal(). Also remove the
NULL check for ifp->if_description, since free(9) can correctly handle
NULL pointer.
MFC after: 1 week
- Remove the complex mechanism to choose between using busdma
and raw pmap_kextract at runtime. The reduced complexity makes
the code easier to read and maintain.
- Fix a bug in the small packet receive path where clusters were
repeatedly mapped but never unmapped. We now store the cluster's
bus address and avoid re-mapping the cluster each time a small
packet is received.
This patch fixes bugs I've seen where ixl(4) will not even
respond to ping without seeing DMAR faults.
I see a small improvement (14%) on packet forwarding tests using
a Haswell based Xeon E5-2697 v3. Olivier sees a small
regression (-3% to -6%) with lower end hardware.
Reviewed by: mmacy
Not objected to by: sbruno
MFC after: 8 weeks
Sponsored by: Netflix, Inc
Differential Revision: https://reviews.freebsd.org/D17901
The panic can happen, when some application does dump of routing table
using sysctl interface. To prevent this, set IFF_DYING flag in
if_detach_internal() function, when ifnet under lock is removed from
the chain. In sysctl_rtsock() take IFNET_RLOCK_NOSLEEP() to prevent
ifnet detach during routes enumeration. In case, if some interface was
detached in the time before we take the lock, add the check, that ifnet
is not DYING. This prevents access to memory that could be freed after
ifnet is unlinked.
PR: 227720, 230498, 233306
Reviewed by: bz, eugen
MFC after: 1 week
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D18338
Various structures exported by sysctl_rtsock() contain padding fields
which were not being zeroed.
Reported by: Thomas Barabosch, Fraunhofer FKIE
Reviewed by: ae
MFC after: 3 days
Security: kernel memory disclosure
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18333
endpoints.
This can be used to configure several IPsec tunnels between two hosts
with different security associations.
Obtained from: Yandex LLC
MFC after: 2 weeks
Sponsored by: Yandex LLC
The CBQ BORROW flag conflicts with the RMCF_CODEL flag; the
two sets of definitions actually define the same things. The symptom
is that a kernel with CBQ support and not CODEL fails to load a QoS
policy with the obscure error "pfctl: DIOCADDALTQ: Cannot allocate memory."
If ALTQ_DEBUG is enabled, the error becomes a little clearer:
"rmc_newclass: CODEL not configured for CBQ!" is printed by the kernel.
There really shouldn't be two sets of macros that have to be defined
consistently, but the include structure isn't right for exporting
CBQ flags to altq_rmclass.h. Re-align the definitions, and add
CTASSERTs in the kernel to ensure that the definitions are consistent.
PR: 215716
Reviewed by: pkelsey
MFC after: 2 weeks
Sponsored by: Forcepoint LLC
Differential Revision: https://reviews.freebsd.org/D17758
iflib_stop() was not resetting the rxq completion queue state variables.
This meant that for any driver that has receive completion queues, after a
reinit, iflib would start asking what's available on the rx side starting at
whatever the completion queue index was prior to the stop, instead of at 0.
Submitted by: pkelsey
Reported by: pkelsey
MFC after: 3 days
Sponsored by: Limelight Networks
Ensure that any time CSUM_IP_TSO or CSUM_IP6_TSO is set that the corresponding
CSUM_IP6?_TCP / CSUM_IP flags are also set.
Rather than requireing drivers to bake-in an understanding that TSO implies
checksum offloads, make it explicit.
This change requires us to move the IFLIB_NEED_ZERO_CSUM implementation to
ensure it's zeroed for TSO.
Reported by: Jacob Keller <jacob.e.keller@intel.com>
MFC after: 1 week
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D17801
r333502 removed initialization of ifc_nhwtxqs, and it's not clear
there's a need to copy it into the struct iflib_ctx at all. Use
ctx->ifc_sctx->isc_ntxqs instead.
Further, iflib_stop() did not clear the last ring in the case where
isc_nfl != isc_nrxqs (such as when IFLIB_HAS_RXCQ is set). Use
ctx->ifc_sctx->isc_nrxqs here instead of isc_nfl.
Reported by: pkelsey
Reviewed by: pkelsey
MFC after: 3 days
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D17979
mutexes but now are converted to epoch(9) use thread-private epoch_tracker.
Embedding tracker into ifnet(9) or ifnet derived structures creates a non
reentrable function, that will fail miserably if called simultaneously from
two different contexts.
A thread private tracker will provide a single tracker that would allow to
call these functions safely. It doesn't allow nested call, but this is not
expected from compatibility KPIs.
Reviewed by: markj
r338838 attempted to fix issues with rxcsum and rxcsum6.
However, the rxcsum bits were set as though if_setcapenablebit() was
being called, not if_togglecapenable() which is in use. As a result,
it was not possible to disable rxcsum when rxcsum6 was supported.
PR: 233004
Reported by: lev
Reviewed by: lev
MFC after: 3 days
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D17881
If the syncdev is removed we no longer need to clean up the multicast
entry we've got set up for that device.
Pass the ifnet detach event through pf to pfsync, and remove our
multicast handle, and mark us as no longer having a syncdev.
Note that this callback is always installed, even if the pfsync
interface is disabled (and thus it's not a per-vnet callback pointer).
MFC after: 2 weeks
Sponsored by: Orange Business Services
Differential Revision: https://reviews.freebsd.org/D17502
pf subscribes to ifnet_departure_event events, so it can clean up the
ifg_pf_kif and if_pf_kif pointers in the ifnet.
During vnet shutdown interfaces could go away without sending the event,
so pf ends up cleaning these up as part of its shutdown sequence, which
happens after the ifnet has already been freed.
Send the ifnet_departure_event during vnet shutdown, allowing pf to
clean up correctly.
MFC after: 2 weeks
Sponsored by: Orange Business Services
Differential Revision: https://reviews.freebsd.org/D17500
The callbacks are installed and removed depending on the state of the
pfsync device, which is per-vnet. The callbacks must also be per-vnet.
MFC after: 2 weeks
Sponsored by: Orange Business Services
Differential Revision: https://reviews.freebsd.org/D17499
So we have a global limit of 1024 fragments, but it is fine grained to
the region of the packet. Smaller packets may have less fragments.
This costs another 16 bytes of memory per reassembly and devides the
worst case for searching by 8.
Obtained from: OpenBSD
Differential Revision: https://reviews.freebsd.org/D17734
Remember 16 entry points based on the fragment offset. Instead of
a worst case of 8196 list traversals we now check a maximum of 512
list entries or 16 array elements.
Obtained from: OpenBSD
Differential Revision: https://reviews.freebsd.org/D17733
VIMAGE, and feature richness and global state increasing the 8k of
vnet module space are no longer sufficient for people and loading
multiple modules, e.g., pf(4) and ipl(4) or ipsec(4) will fail on
the second module.
Increase the module space to 8 * PAGE_SIZE which should be enough
to hold multiple firewalls, ipsec, multicast (as in the old days was
a problem), epair, carp, and any kind of other vnet enabled modules.
Sadly this is a global byte array part of the vnet_set, so we cannot
dynamically change its size; otherwise a TUNABLE would have been
a better solution.
PR: 228854
Reported by: Ernie Luzar, Marek Zarychta
Discussed with: rgrimes on current
MFC after: 3 days
This change defines the RA "6" (IPv6-Only) flag which routers
may advertise, kernel logic to check if all routers on a link
have the flag set and accordingly update a per-interface flag.
If all routers agree that it is an IPv6-only link, ether_output_frame(),
based on the interface flag, will filter out all ETHERTYPE_IP/ARP
frames, drop them, and return EAFNOSUPPORT to upper layers.
The change also updates ndp to show the "6" flag, ifconfig to
display the IPV6_ONLY nd6 flag if set, and rtadvd to allow
announcing the flag.
Further changes to tcpdump (contrib code) are availble and will
be upstreamed.
Tested the code (slightly earlier version) with 2 FreeBSD
IPv6 routers, a FreeBSD laptop on ethernet as well as wifi,
and with Win10 and OSX clients (which did not fall over with
the "6" flag set but not understood).
We may also want to (a) implement and RX filter, and (b) over
time enahnce user space to, say, stop dhclient from running
when the interface flag is set. Also we might want to start
IPv6 before IPv4 in the future.
All the code is hidden under the EXPERIMENTAL option and not
compiled by default as the draft is a work-in-progress and
we cannot rely on the fact that IANA will assign the bits
as requested by the draft and hence they may change.
Dear 6man, you have running code.
Discussed with: Bob Hinden, Brian E Carpenter
Previously, changing the MTU would require destroying the lagg and
creating a new one. Now it is allowed to change the MTU of
the lagg interface and the MTU of the ports will be set to match.
If any port cannot set the new MTU, all ports are reverted to the original
MTU of the lagg. Additionally, when adding ports, the MTU of a port will be
automatically set to the MTU of the lagg. As always, the MTU of the lagg is
initially determined by the MTU of the first port added. If adding an
interface as a port for some reason fails, that interface is reverted to its
original MTU.
Submitted by: Ryan Moeller <ryan@freqlabs.com>
Reviewed by: mav
Relnotes: Yes
Sponsored by: iXsystems Inc.
Differential Revision: https://reviews.freebsd.org/D17576
That commit is causing kernel panics in em(4), so this will be reverted
until those are fixed.
Reported by: ae@, pho@, et al
Sponsored by: Intel Corporation
This change is similar to r339646. The callback that checks for appearing
and disappearing of tunnel ingress address can be called during VNET
teardown. To prevent access to already freed memory, add check to the
callback and epoch_wait() call to be sure that callback has finished its
work.
MFC after: 20 days
allowed.
ipsec_srcaddr() callback can be called during VNET teardown, since
ingress address checking subsystem isn't VNET specific. And thus
callback can make access to already freed memory. To prevent this,
use V_ipsec_idhtbl pointer as indicator of VNET readiness. And make
epoch_wait() after resetting it to NULL in vnet_ipsec_uninit() to
be sure that ipsec_srcaddr() is finished its work.
Reported by: kp
MFC after: 20 days
Changelist:
- Move large parts of VALE code to a new file and header netmap_bdg.[ch].
This is useful to reuse the code within upcoming projects.
- Improvements and bug fixes to pipes and monitors.
- Introduce nm_os_onattach(), nm_os_onenter() and nm_os_onexit() to
handle differences between FreeBSD and Linux.
- Introduce some new helper functions to handle more host rings and fake
rings (netmap_all_rings(), netmap_real_rings(), ...)
- Added new sysctl to enable/disable hw checksum in emulated netmap mode.
- nm_inject: add support for NS_MOREFRAG
Approved by: gnn (mentor)
Differential Revision: https://reviews.freebsd.org/D17364
The taskqgroup_detach function does not check if task is already enqueued when
detaching it. This may lead to kernel panic if enqueued task starts after
context state lock is destroyed. Ensure that the already enqueued admin tasks
are executed before detaching them.
The issue was discovered during validation of D16429. Unloading of if_ixlv
followed by immediate removal of VFs with iovctl -D may lead to panic on
NODEBUG kernel.
As well, check if iflib is in detach before enqueueing new admin or iov
tasks, to prevent new tasks from executing while the taskqgroup tasks
are being drained.
Submitted by: Krzysztof Galazka <krzysztof.galazka@intel.com>
Reviewed by: shurd@, erj@
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D17404
SX-locks, during if_purgeaddrs(), by not allowing to hold the epoch
read lock over typical network IOCTL code paths. This is a regression
issue after r334305.
Reviewed by: ae (network)
Differential revision: https://reviews.freebsd.org/D17647
MFC after: 1 week
Sponsored by: Mellanox Technologies
* use CK_LIST and FNV hash to keep chains of softc;
* read access to softc is protected by epoch();
* write access is protected by ipsec_ioctl_sx. Changing of softc fields
is allowed only when softc is unlinked from CK_LIST chains.
* linking/unlinking of softc is allowed only when ipsec_ioctl_sx is
exclusive locked.
* the plain LIST of all softc is replaced by hash table that uses ingress
address of tunnels as a key.
* added support for appearing/disappearing of ingress address handling.
Now it is allowed configure non-local ingress IP address, and thus the
problem with if_ipsec(4) configuration that happens on boot, when
ingress address is not yet configured, is solved.
MFC after: 1 month
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D17190
* register handler for ingress address appearing/disappearing;
* add new srcaddr hash table for fast softc lookup by srcaddr;
* when srcaddr disappears, clear IFF_DRV_RUNNING flag from interface,
and set it otherwise;
MFC after: 1 month
Sponsored by: Yandex LLC
* register handler for ingress address appearing/disappearing;
* add new srcaddr hash table for fast softc lookup by srcaddr;
* when srcaddr disappears, clear IFF_DRV_RUNNING flag from interface,
and set it otherwise;
MFC after: 1 month
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D17214
* register handler for ingress address appearing/disappearing;
* add new srcaddr hash table for fast softc lookup by srcaddr;
* when srcaddr disappears, clear IFF_DRV_RUNNING flag from interface,
and set it otherwise;
* remove the note about ingress address from BUGS section.
MFC after: 1 month
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D17134
appearing and disappearing on the host system.
Such handling is need, because tunneling interfaces must use addresses,
that are configured on the host as ingress addresses for tunnels.
Otherwise the system can send spoofed packets with source address, that
belongs to foreign host.
The KPI uses ifaddr_event_ext event to implement addresses tracking.
Tunneling interfaces register event handlers and then they are
notified by the kernel, when an address disappears or appears.
ifaddr_event_compat() handler from if.c replaced by srcaddr_change_event()
in the ip_encap.c
MFC after: 1 month
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D17134
vlan_lladdr_fn() is called from taskqueue, which means there's no vnet context
set. We can end up trying to send ARP messages (through the iflladdr_event
event), which requires a vnet context.
PR: 227654
MFC after: 3 days
handler receives the type of event IFADDR_EVENT_ADD/IFADDR_EVENT_DEL,
and the pointer to ifaddr. Also ifaddr_event now is implemented using
ifaddr_event_ext handler.
MFC after: 3 weeks
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D17100
Finishes the conversion of the 40Gb Intel Ethernet drivers to iflib(9) for
FreeBSD 12.0, and fixes numerous bugs in both ixl(4) and iavf(4).
This commit also re-adds the VF driver to GENERIC since it now compiles and
functions.
The VF driver name was changed from ixlv(4) to iavf(4) because the VF driver is
now intended to be used with future products, not just with Fortville/Fort Park
VFs.
A man page update that documents these drivers is forthcoming in a separate
commit.
Reviewed by: sbruno@, kbowling@
Tested by: jeffrey.e.pieper@intel.com
Approved by: re (gjb@)
Relnotes: yes
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D16429
epoch section without exiting that epoch section. This is bad for two
reasons: the epoch section won't exit, and we will leave the epoch tracker
from the stack on the epoch list.
Fix the epoch leak by making sure we exit epoch sections before returning.
Reviewed by: ae, gallatin, mmacy
Approved by: re (gjb, kib)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D17450
is done via using ifconfig, which uses a SIOCSIFMTU ioctl() command, or
doing it using a TUNSIFINFO/TAPSIFINFO ioctl() command.
Without this patch, for IPv6 the new MTU is not used when creating routes.
Especially, when initiating TCP connections after increasing the MTU,
the old MTU is still used to compute the MSS.
Thanks to ae@ and bz@ for helping to improve the patch.
Reviewed by: ae@, bz@
Approved by: re (kib@)
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D17180
Remove unused and easy to misuse PNP macro parameter
Inspired by r338025, just remove the element size parameter to the
MODULE_PNP_INFO macro entirely. The 'table' parameter is now required to
have correct pointer (or array) type. Since all invocations of the macro
already had this property and the emitted PNP data continues to include the
element size, there is no functional change.
Mostly done with the coccinelle 'spatch' tool:
$ cat modpnpsize0.cocci
@normaltables@
identifier b,c;
expression a,d,e;
declarer MODULE_PNP_INFO;
@@
MODULE_PNP_INFO(a,b,c,d,
-sizeof(d[0]),
e);
@singletons@
identifier b,c,d;
expression a;
declarer MODULE_PNP_INFO;
@@
MODULE_PNP_INFO(a,b,c,&d,
-sizeof(d),
1);
$ rg -l MODULE_PNP_INFO -- sys | \
xargs spatch --in-place --sp-file modpnpsize0.cocci
(Note that coccinelle invokes diff(1) via a PATH search and expects diff to
tolerate the -B flag, which BSD diff does not. So I had to link gdiff into
PATH as diff to use spatch.)
Tinderbox'd (-DMAKE_JUST_KERNELS).
Approved by: re (glen)
Various capabilities were not being handled correctly in the
SIOCSIFCAP handler. Specifically:
IFCAP_RXCSUM and IFCAP_RXCSUM_IPV6 could be set even if not supported
It was impossible to disable IFCAP_RXCSUM and/or IFCAP_RXCSUM_IPV6 via
ifconfig since it does ioctl() per command-line flag rather than combine
them into a single call.
IFCAP_VLAN_HWCSUM could not be modified via the ioctl()
Setting any combination of the three IFCAP_WOL flags would set only
IFCAP_WOL_MCAST | IFCAP_WOL_MAGIC. For example, setting only
IFCAP_WOL_UCAST would result in both IFCAP_WOL_MCAST and IFCAP_WOL_MAGIC
being enabled, but IFCAP_WOL_UCAST would not be enabled.
Because if_vlancap() was called before if_togglecapenable(), vlan flags
were sometimes not applied correctly.
Interfaces were being unnecessarily stopped and restarted for WoL
PR: 231151
Submitted by: Kaho Toshikazu <kaho@elam.kais.kyoto-u.ac.jp>
Reported by: Shirkdog <mshirk@daemon-security.com>
Reviewed by: galladin
Approved by: re (gjb)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D17158
Similar to arm64, riscv compiler uses PC-relative loads/stores,
and with static data compiler does not emit relocations.
In result, kernel module linker has nothing to fix and data accessed
from the wrong location.
Approved by: re (gjb)
Sponsored by: DARPA, AFRL
Remove sysctls:
txq_drain_encapfail - now a duplicate of encap_txd_encap_fail
intr_link - was never incremented
intr_msix - was never incremented
rx_zero_len - was never incremented
The following were not incremented in all code-paths that apply:
m_pullups, mbuf_defrag, rxd_flush, tx_encap, rx_intr_enables, tx_frees,
encap_txd_encap_fail.
Fixes:
Replace the broken collapse_pkthdr() implementation with an MPASS().
fl_refills and fl_refills_large were not incremented when using netmap.
Reviewed by: gallatin
Approved by: re (marius)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16733
use the already existing one. No functional changes.
Reviewed by: karels, ae
Approved by: re (rgrimes)
Differential Revision: https://reviews.freebsd.org/D17004
The MP ring may have txq pointers enqueued. Previously, these were
passed to m_free() when IFC_QFLUSH was set. This patch checks for
the value and doesn't call m_free().
Reviewed by: gallatin
Approved by: re (gjb)
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16882
given in random(4).
This includes updating of the relevant man pages, and no-longer-used
harvesting parameters.
Ensure that the pseudo-unit-test still does something useful, now also
with the "other" algorithm instead of Yarrow.
PR: 230870
Reviewed by: cem
Approved by: so(delphij,gtetlow)
Approved by: re(marius)
Differential Revision: https://reviews.freebsd.org/D16898
ether_set_pcp should not be called from ether_output_frame for VLAN
interfaces -- the vid + pcp will be inserted during vlan_transmit in
that case. r337943 sets the VLAN's ifnet's if_pcp to a proper PCP value
and this led to double encapsulation (once with vid 0 and second time
with vid+pcp).
PR: 230794
Reviewed by: kib@
Approved by: re@ (gjb@)
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D16887
2^32 bps or greater to be used. Prior to this, bandwidth parameters
would simply wrap at the 2^32 boundary. The computations in the HFSC
scheduler and token bucket regulator have been modified to operate
correctly up to at least 100 Gbps. No other algorithms have been
examined or modified for correct operation above 2^32 bps (some may
have existing computation resolution or overflow issues at rates below
that threshold). pfctl(8) will now limit non-HFSC bandwidth
parameters to 2^32 - 1 before passing them to the kernel.
The extensions to the pf(4) ioctl interface have been made in a
backwards-compatible way by versioning affected data structures,
supporting all versions in the kernel, and implementing macros that
will cause existing code that consumes that interface to use version 0
without source modifications. If version 0 consumers of the interface
are used against a new kernel that has had bandwidth parameters of
2^32 or greater configured by updated tools, such bandwidth parameters
will be reported as 2^32 - 1 bps by those old consumers.
All in-tree consumers of the pf(4) interface have been updated. To
update out-of-tree consumers to the latest version of the interface,
define PFIOC_USE_LATEST ahead of any includes and use the code of
pfctl(8) as a guide for the ioctls of interest.
PR: 211730
Reviewed by: jmallett, kp, loos
MFC after: 2 weeks
Relnotes: yes
Sponsored by: RG Nets
Differential Revision: https://reviews.freebsd.org/D16782
Upcoming Ethernet hardware will support new media types that aren't in the kernel
yet, so they are added here. These mostly include new 25G/50G/100G media types;
and this commit introduces new 200G/400G speeds and media.
Reviewed by: hselasky@, jhb@
MFC after: 1 week
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D16731
I was not aware Warner was making or planning to make forward progress in
this area and have since been informed of that.
It's easy to apply/reapply when churn dies down.
Inspired by r338025, just remove the element size parameter to the
MODULE_PNP_INFO macro entirely. The 'table' parameter is now required to
have correct pointer (or array) type. Since all invocations of the macro
already had this property and the emitted PNP data continues to include the
element size, there is no functional change.
Mostly done with the coccinelle 'spatch' tool:
$ cat modpnpsize0.cocci
@normaltables@
identifier b,c;
expression a,d,e;
declarer MODULE_PNP_INFO;
@@
MODULE_PNP_INFO(a,b,c,d,
-sizeof(d[0]),
e);
@singletons@
identifier b,c,d;
expression a;
declarer MODULE_PNP_INFO;
@@
MODULE_PNP_INFO(a,b,c,&d,
-sizeof(d),
1);
$ rg -l MODULE_PNP_INFO -- sys | \
xargs spatch --in-place --sp-file modpnpsize0.cocci
(Note that coccinelle invokes diff(1) via a PATH search and expects diff to
tolerate the -B flag, which BSD diff does not. So I had to link gdiff into
PATH as diff to use spatch.)
Tinderbox'd (-DMAKE_JUST_KERNELS).
toe_l2_resolve to fill up the complete vtag and not just the vid.
Reviewed by: kib@
MFC after: 1 week
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D16752
This is actually several different bugs:
- The code is not designed to handle inpcb deletion after interface deletion
- add reference for inpcb membership
- The multicast address has to be removed from interface lists when the refcount
goes to zero OR when the interface goes away
- decouple list disconnect from refcount (v6 only for now)
- ifmultiaddr can exist past being on interface lists
- add flag for tracking whether or not it's enqueued
- deferring freeing moptions makes the incpb cleanup code simpler but opens the
door wider still to races
- call inp_gcmoptions synchronously after dropping the the inpcb lock
Fundamentally multicast needs a rewrite - but keep applying band-aids for now.
Tested by: kp
Reported by: novel, kp, lwhsu
Lacp needs to manage the link state itself. Unlike other
lagg protocols, the ability of lacp to pass traffic
depends not only on the lagg members having link, but also
on the lacp protocol converging to a distributing state with the
link partner.
If we prematurely mark the link as up, then we will send a
gratuitous arp (via arp_handle_ifllchange()) before the lacp
interface is capable of passing traffic. When this happens,
the gratuitous arp is lost, and our link partner may cache
a stale mac address (eg, when the base mac address for the
lagg bundle changes, due to a BIOS change re-ordering NIC
unit numbers)
Reviewed by: jtl, hselasky
Sponsored by: Netflix
modules. It also fails in the same way, we are unable to relocate static
variables as the compiler uses PC-relative loads with nothing for the
kernel linker to relocate.
Sponsored by: DARPA, AFRL
declaired static. This will allow us to change the definition on arm64
as it has the same issues described in r336349.
Reviewed by: bz
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D16147
Since r336313, TSO support for LEM-class devices is removed again as it
was before the conversion of {l,}em(4) to iflib(4) in r311849 and as a
result, isc_tx_tso_segments_max is 0 for LEM-class devices now. Thus,
inappropriate watermarks were used for this class.
This is really only a band-aid, though, because so far iflib(9) doesn't
fully take into account that DMA engines can support different maxima
of segments for transfers of TSO and non-TSO packets. For example, the
DESC_RECLAIMABLE macro is based on isc_tx_nsegments while MAX_TX_DESC
used isc_tx_tso_segments_max only. For most in-tree consumers that
doesn't make a difference as the maxima are the same for both kinds of
transfers (that is, apart from the fact that TSO may require up to 2
sentinel descriptors but also not with every MAC supported). However,
isc_tx_nsegments is 8 but isc_tx_tso_segments_max is 85 by default
with ixl(4).
tests for avail > 0, avail can never be 0 within that loop. Thus, move
decrementing avail and budget_left into the loop and before the code which
checks for additional descriptors having become available in case all the
previous ones have been processed but there still is budget left so the
latter code works as expected. [1]
- In iflib_{busdma_load_mbuf_sg,parse_header}(), remove dead stores to m
and n respectively. [2, 3]
- In collapse_pkthdr(), ensure that m_next isn't NULL before dereferencing
it. [4]
- Remove a duplicate assignment of segs in iflib_encap().
Reported by: Coverity
CID: 1356027 [1], 1356047 [2], 1368205 [3], 1356028 [4]
r323954 changed the mp ring behaviour when 64-bit atomics were
available to abdicate the TX ring rather than having one become a
consumer thereby running to completion on TX. The consumer of the mp
ring was then triggered in the tx task rather than blocking the TX call.
While this significantly lowered the number of RX drops in small-packet
forwarding, it also negatively impacts TX performance.
With this change, the default behaviour is reverted, causing one TX ring
to become a consumer during the enqueue call. A new sysctl,
dev.X.Y.iflib.tx_abdicate is added to control this behaviour.
Reviewed by: gallatin
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16302
Use the timer to poll for TX completions when there are
outstanding TX slots. Track when the last driver timer was called
to prevent overcalling it. Also clean up some kring vs NIC ring
usage.
Reviewed by: marius, Johannes Lundberg <johalun0@gmail.com>
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16300
This deduplicates the code a bit, and also implicitly adds missing
callout_stop() to in[6]_lltable_delete_entry() functions.
PR: 209682, 225927
Submitted by: hselasky (previous version)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D4605
- Ever since the workaround for the silicon bug of TSO4 causing MAC hangs
was committed in r295133, CSUM_TSO always got disabled unconditionally
by em(4) on the first invocation of em_init_locked(). However, even with
that problem fixed, it turned out that for at least e. g. 82579 not all
necessary TSO workarounds are in place, still causing MAC hangs even at
Gigabit speed. Thus, for stable/11, TSO usage was deliberately disabled
in r323292 (r323293 for stable/10) for the EM-class by default, allowing
users to turn it on if it happens to work with their particular EM MAC
in a Gigabit-only environment.
In head, the TSO workaround for speeds other than Gigabit was lost with
the conversion to iflib(9) in r311849 (possibly along with another one
or two TSO workarounds). Yet at the same time, for EM-class MACs TSO4
got enabled by default again, causing device hangs. Therefore, change the
default for this hardware class back to have TSO4 off, allowing users
to turn it on manually if it happens to work in their environment as
we do in stable/{10,11}. An alternative would be to add a whitelist of
EM-class devices where TSO4 actually is reliable with the workarounds in
place, but given that the advantage of TSO at Gigabit speed is rather
limited - especially with the overhead of these workarounds -, that's
really not worth it. [1]
This change includes the addition of an isc_capabilities to struct
if_softc_ctx so iflib(9) can also handle interface capabilities that
shouldn't be enabled by default which is used to handle the default-off
capabilities of e1000 as suggested by shurd@ and moving their handling
from em_setup_interface() to em_if_attach_pre() accordingly.
- Although 82543 support TSO4 in theory, the former lem(4) didn't have
support for TSO4, presumably because TSO4 is even more broken in the
LEM-class of MACs than the later EM ones. Still, TSO4 for LEM-class
devices was enabled as part of the conversion to iflib(9) in r311849,
causing device hangs. So revert back to the pre-r311849 behavior of
not supporting TSO4 for LEM-class at all, which includes not creating
a TSO DMA tag in iflib(9) for devices not having IFCAP_TSO4 set. [2]
- In fact, the FreeBSD TCP stack can handle a TSO size of IP_MAXPACKET
(65535) rather than FREEBSD_TSO_SIZE_MAX (65518). However, the TSO
DMA must have a maxsize of the maximum TSO size plus the size of a
VLAN header for software VLAN tagging. The iflib(9) converted em(4),
thus, first correctly sets scctx->isc_tx_tso_size_max to EM_TSO_SIZE
in em_if_attach_pre(), but later on overrides it with IP_MAXPACKET
in em_setup_interface() (apparently, left-over from pre-iflib(9)
times). So remove the later and correct iflib(9) to correctly cap
the maximum TSO size reported to the stack at IP_MAXPACKET. While at
it, let iflib(9) use if_sethwtsomax*().
This change includes the addition of isc_tso_max{seg,}size DMA engine
constraints for the TSO DMA tag to struct if_shared_ctx and letting
iflib_txsd_alloc() automatically adjust the maxsize of that tag in case
IFCAP_VLAN_MTU is supported as requested by shurd@.
- Move the if_setifheaderlen(9) call for adjusting the maximum Ethernet
header length from {ixgbe,ixl,ixlv,ixv,em}_setup_interface() to iflib(9)
so adjustment is automatically done in case IFCAP_VLAN_MTU is supported.
As a consequence, this adjustment now is also done in case of bnxt(4)
which missed it previously.
- Move the reduction of the maximum TSO segment count reported to the
stack by the number of m_pullup(9) calls (which in the worst case,
can add another mbuf and, thus, the requirement for another DMA
segment each) in the transmit path for performance reasons from
em_setup_interface() to iflib_txsd_alloc() as these pull-ups are now
done in iflib_parse_header() rather than in the no longer existing
em_xmit(). Moreover, this optimization applies to all drivers using
iflib(9) and not just em(4); all in-tree iflib(9) consumers still
have enough room to handle full size TSO packets. Also, reduce the
adjustment to the maximum number of m_pullup(9)'s now performed in
iflib_parse_header().
- Prior to the conversion of em(4)/igb(4)/lem(4) and ixl(4) to iflib(9)
in r311849 and r335338 respectively, these drivers didn't enable
IFCAP_VLAN_HWFILTER by default due to VLAN events not being passed
through by lagg(4). With iflib(9), IFCAP_VLAN_HWFILTER was turned on
by default but also lagg(4) was fixed in that regard in r203548. So
just remove the now redundant and defunct IFCAP_VLAN_HWFILTER handling
in {em,ixl,ixlv}_setup_interface().
- Nuke other redundant IFCAP_* setting in {em,ixl,ixlv}_setup_interface()
which is (more completely) already done in {em,ixl,ixlv}_if_attach_pre()
now.
- Remove some redundant/dead setting of scctx->isc_tx_csum_flags in
em_if_attach_pre().
- Remove some IFCAP_* duplicated either directly or indirectly (e. g.
via IFCAP_HWCSUM) in {EM,IGB,IXL}_CAPS.
- Don't bother to fiddle with IFCAP_HWSTATS in ixgbe(4)/ixgbev(4) as
iflib(9) adds that capability unconditionally.
- Remove some unused macros from em(4).
- Bump __FreeBSD_version as some of the above changes require the modules
of drivers using iflib(9) to be recompiled.
Okayed by: sbruno@ at 201806 DevSummit Transport Working Group [1]
Reviewed by: sbruno (earlier version), erj
PR: 219428 (part of; comment #10) [1], 220997 (part of; comment #3) [2]
Differential Revision: https://reviews.freebsd.org/D15720
The typical system now has a lot more memory than when pf was new, and is also
expected to handle more connections. Increase the default size of the state
table.
Note that users can overrule this using 'set limit states' in pf.conf.
From OpenBSD:
The year is 2018.
Mercury, Bowie, Cash, Motorola and DEC all left us.
Just pf still has a default state table limit of 10000.
Had! Now it's a tiny little bit more, 100k.
lead guitar: me
ok chorus: phessler theo claudio benno
background school girl laughing: bob
Obtained from: OpenBSD
Add generic function if_tunnel_check_nesting() that does check for
allowed nesting level for tunneling interfaces and also does loop
detection. Use it in gif(4), gre(4) and me(4) interfaces.
Differential Revision: https://reviews.freebsd.org/D16162
ifioctl(). Move it inside the proper #ifdef. This was throwing a valid
"Assigned but unused" warning with gcc.
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16063
Several third-parties use at least some of these ioctls. While it would be
better for regression testing if they were used in base (or at least in the
test suite), it's currently not worth the trouble to push through removal.
Submitted by: antoine, markj
- Add tracker argument to preemptible epochs
- Inline epoch read path in kernel and tied modules
- Change in_epoch to take an epoch as argument
- Simplify tfb_tcp_do_segment to not take a ti_locked argument,
there's no longer any benefit to dropping the pcbinfo lock
and trying to do so just adds an error prone branchfest to
these functions
- Remove cases of same function recursion on the epoch as
recursing is no longer free.
- Remove the the TAILQ_ENTRY and epoch_section from struct
thread as the tracker field is now stack or heap allocated
as appropriate.
Tested by: pho and Limelight Networks
Reviewed by: kbowling at llnw dot com
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16066
Several ioctls are unused in pf, in the sense that no base utility
references them. Additionally, a cursory review of pf-based ports
indicates they're not used elsewhere either. Some of them have been
unused since the original import. As far as I can tell, they're also
unused in OpenBSD. Finally, removing this code removes the need for
future pf work to take them into account.
Reviewed by: kp
Differential Revision: https://reviews.freebsd.org/D16076
encap_lookup_t method can be invoked by IP encap subsytem even if none
of gif/gre/me interfaces are exist. Hash tables are allocated on demand,
when first interface is created. So, make NULL pointer check before
doing access to hash table.
PR: 229378
- In iflib_msix_init(), VMMs with broken MSI-X activation are trying
to be worked around by manually enabling PCIM_MSIXCTRL_MSIX_ENABLE
before calling pci_alloc_msix(9). Apart from constituting a layering
violation, this has the problem of leaving PCIM_MSIXCTRL_MSIX_ENABLE
enabled when falling back to MSI or INTx when e. g. MSI-X is black-
listed and initially also when disabled via hw.pci.enable_msix. The
later in turn was incorrectly worked around in r325166.
Since r310806, pci(4) itself has code to deal with broken MSI-X
handling of VMMs, so all of these workarounds in iflib(9) can go,
fixing non-working interrupts when falling back to MSI/INTx. In
any case, possibly further adjustments to broken MSI-X activation
of VMMs like enabling r310806 by default in VM environments need to
be placed into pci(4), not iflib(9). [1]
- Also remove the pci_enable_busmaster(9) call from iflib_msix_init(),
which is already more properly invoked from iflib_device_attach().
- When falling back to MSI/INTx, release the MSI-X BAR resource again.
- When falling back to INTx, ensure scctx->isc_vectors is set to 1 and
not to something higher from a device with more than one MSI message
supported.
- Make the nearby ring_state(s) stuff (static) const.
Discussed with: jhb at BSDCan 2018 [1]
Reviewed by: imp, jhb
Differential Revision: https://reviews.freebsd.org/D15729