Receive Segment Coalescing (RSC) in the vSwitch is a feature available in
Windows Server 2019 hosts and later. It reduces the per packet processing
overhead by coalescing multiple TCP segments when possible. This happens
mostly when TCP traffics are among different guests on same host.
This patch adds netvsc driver support for this feature.
The patch also updates NVS version to 6.1 as needed for RSC
enablement.
MFC after: 2 weeks
Sponsored by: Microsoft
Differential Revision: https://reviews.freebsd.org/D29075
Summary:
This fixes rtentry leak for the cloned interfaces created inside the
VNET.
PR: 253998
Reported by: rashey at superbox.pl
MFC after: 3 days
Loopback teardown order is `SI_SUB_INIT_IF`, which happens after `SI_SUB_PROTO_DOMAIN` (route table teardown).
Thus, any route table operations are too late to schedule.
As the intent of the vnet teardown procedures to minimise the amount of effort by doing global cleanups instead of per-interface ones, address this by adding a relatively light-weight routing table cleanup function, `rib_flush_routes()`.
It removes all remaining routes from the routing table and schedules the deletion, which will happen later, when `rtables_destroy()` waits for the current epoch to finish.
Test Plan:
```
set_skip:set_skip_group_lo -> passed [0.053s]
tail -n 200 /var/log/messages | grep rtentry
```
Reviewers: #network, kp, bz
Reviewed By: kp
Subscribers: imp, ae
Differential Revision: https://reviews.freebsd.org/D29116
If we hit an error during init, then we'll unwind our state and attempt
to detach the device -- don't block it.
This was discovered by creating a wg0 with missing parameters; said
failure ended up leaving this orphaned device in place and ended up
panicking the system upon enumeration of the dev.* sysctl space.
Reviewed by: gallatin, markj
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D29145
This structure is shared among multiple instances of a driver, so we
should ensure that it doesn't somehow get treated as if there's a
separate instance per interface. This is especially important for
software-only drivers like wg.
DEVICE_REGISTER() still returns a void * and so the per-driver sctx
structures are not yet defined with the const qualifier.
Reviewed by: gallatin, erj
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29102
Drain the callbacks upon if_deregister_com_alloc() such that the
if_com_free[type] won't be nullified before if_destroy().
Taking fwip(4) as an example, before this fix, kldunload if_fwip will
go through the following:
1. fwip_detach()
2. if_free() -> schedule if_destroy() through NET_EPOCH_CALL
3. fwip_detach() returns
4. firewire_modevent(MOD_UNLOAD) -> if_deregister_com_alloc()
5. kernel complains about:
Warning: memory type fw_com leaked memory on destroy (1 allocations, 64 bytes leaked).
6. EPOCH runs if_destroy() -> if_free_internal()i
By this time, if_com_free[if_alloctype] is NULL since it's already
nullified by if_deregister_com_alloc(); hence, firewire_free() won't
have a chance to release the allocated fw_com.
Reviewed by: hselasky, glebius
MFC after: 2 weeks
This structure is only used by the kernel module internally. It's not
shared with user space, so hide it behind #ifdef _KERNEL.
Sponsored by: Rubicon Communications, LLC ("Netgate")
In some configurations we need more classes than ALTQ supports by
default. Increase the maximum number of classes we allow.
This will only cost us a comparatively trivial amount of memory, so
there's little reason not to do so.
If ever we find we want even more we may want to consider turning these
defines into a tunable, but for now do the easy thing.
Reviewed by: donner@
MFC after: 2 weeks
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D29034
Introduce convenience macros to retrieve the DSCP, ECN or traffic class
bits from an IPv6 header.
Use them where appropriate.
Reviewed by: ae (previous version), rscheff, tuexen, rgrimes
MFC after: 2 weeks
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D29056
For interfaces with admin completion queues, introduce a new devmethod
IFDI_ADMIN_COMPLETION_HANDLE and a corresponding flag IFLIB_HAS_ADMINCQ.
This provides an option for handling any admin cq logic, which cannot be
run from an interrupt context.
Said method is called from within iflib's admin task, making it safe to
sleep.
Reviewed by: mmacy
Submitted by: Artur Rojek <ar@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
Differential Revision: https://reviews.freebsd.org/D28708
Commit 6dd69f0064 ("iflib: introduce isc_dma_width")
failed to build on powerpc due to implicit type conversion
error. Fix that.
Submitted by: Artur Rojek <ar@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
Some DMA controllers are unable to address the full host memory space
and are instead limited to a subset of address range (e.g. 48-bit).
Allow the driver to specify the maximum allowed DMA addressing width
(in bits) for the NIC hardware, by introducing a new field in
if_softc_ctx.
If said field is omitted (set to 0), the lowaddr of DMA window bounds
defaults to BUS_SPACE_MAXADDR.
Submitted by: Artur Rojek <ar@semihalf.com>
Obtained from: Semihalf
Sponsored by: Amazon, Inc.
Differential Revision: https://reviews.freebsd.org/D28706
iflib_rxeof() was counting everything twice. This was introduced when
pfil hooks were added to the iflib receive path. We want to count rx
packets/bytes before the pfil hooks are executed, so remove the counter
adjustments that are executed after.
PR: 253583
Reviewed by: gallatin, erj
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28900
When the bridge is moved to a different vnet we must remove all of its
member interfaces (and span interfaces), because we don't know if those
will be moved along with it. We don't want to hold references to
interfaces not in our vnet.
Reviewed by: donner@
MFC after: 1 week
Sponsored by: Orange Business Services
Differential Revision: https://reviews.freebsd.org/D28859
VLAN devices have type IFT_L2VLAN, so the STP code mistakenly believed
they couldn't be used for STP. That's not the case, so add the
ITF_L2VLAN to the check.
Reviewed by: donner@
MFC after: 1 week
Sponsored by: Orange Business Services
Differential Revision: https://reviews.freebsd.org/D28857
The routing stack control depends on quite a tree of functions to
determine the proper attributes of a route such as a source address (ifa)
or transmit ifp of a route.
When actually inserting a route, the stack needs to ensure that ifa and ifp
points to the entities that are still valid.
Validity means slightly more than just pointer validity - stack need guarantee
that the provided objects are not scheduled for deletion.
Currently, callers either ignore it (most ifp parts, historically) or try to
use refcounting (ifa parts). Even in case of ifa refcounting it's not always
implemented in fully-safe manner. For example, some codepaths inside
rt_getifa_fib() are referencing ifa while not holding any locks, resulting in
possibility of referencing scheduled-for-deletion ifa.
Instead of trying to fix all of the callers by enforcing proper refcounting,
switch to a different model.
As the rib_action() already requires epoch, do not require any stability guarantees
other than the epoch-provided one.
Use newly-added conditional versions of the refcounting functions
(ifa_try_ref(), if_try_ref()) and fail if any of these fails.
Reviewed by: donner
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D28837
When we have an ifp pointer and the code is running inside epoch,
epoch guarantees the pointer will not be freed.
However, the following case can still happen:
* in thread 1 we drop to refcount=0 for ifp and schedule its deletion.
* in thread 2 we use this ifp and reference it
* destroy callout kicks in
* unhappy user reports a bug
This can happen with the current implementation of ifnet_byindex_ref(),
as we're not holding any locks preventing ifnet deletion by a parallel thread.
To address it, add if_try_ref(), allowing to return failure when
referencing ifp with refcount=0.
Additionally, enforce existing if_ref() is with KASSERT to provide a
cleaner error in such scenarios.
Finally, fix ifnet_byindex_ref() by using if_try_ref() and returning NULL
if the latter fails.
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D28836
rtsock message validation changes committed in 2fe5a79425
did not take llinfo messages into account.
Add a special validation case for RTA_GATEWAY llinfo messages.
MFC after: 2 days
In commit 38bfc6dee3 we added an IFDI_DETACH() call to
iflib_pseudo_deregister() since it looked like it was missing. One is
present in the error-handling path of iflib_pseudo_register(). However,
the detach actually comes from the DEVICE_DETACH() method for the
above-mentioned device_t, so now we're calling IFDI_DETACH() twice when
destroying a pseudo interface.
Fix the problem by not calling IFDI_DETACH() from the device detach
routine. This way we can ensure that iflib de-initialization always
happens in a consistent order. It also ensures that you can't do silly
things like "devctl detach <pseudo ifnet>", which would previously
detach the driver without tearing down the corresponding ifnet.
PR: 253541
Reviewed by: erj
MFC after: 1 week
Fixes: 38bfc6dee3
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D28774
Changes in the 2fe5a79425 moved dst sockaddr masking from the
routing control plane to the rtsock code.
It broke arp/ndp deletion.
It turns out, arp/ndp perform RTM_GET request first to get an
interface index necessary for the deletion.
Then they simply stamp the reply with RTF_LLDATA and set the
command to RTM_DELETE.
As a result, kernel receives request with non-empty RTA_NETMASK
and clears RTA_DST host bits before passing the message to the
lla code.
De facto, the only needed bits are RTA_DST, RTA_GATEWAY and the
subset of rtm_flags.
With that in mind, fix the interace by clearing RTA_NETMASK
for every messages with RTF_LLDATA.
While here, cleanup arp/ndp code a bit.
MFC after: 1 day
Reviewed by: gnn
Differential Revision: https://reviews.freebsd.org/D28804
This fixes a warning from GCC for kernels without netmap since the
return value is never used.
Reviewed by: vmaffione, erj
Differential Revision: https://reviews.freebsd.org/D28598
Traditionally routing socket code did almost zero checks on
the input message except for the most basic size checks.
This resulted in the unclear KPI boundary for the routing system code
(`rtrequest*` and now `rib_action()`) w.r.t message validness.
Multiple potential problems and nuances exists:
* Host bits in RTAX_DST sockaddr. Existing applications do send prefixes
with hostbits uncleared. Even `route(8)` does this, as they hope the kernel
would do the job of fixing it. Code inside `rib_action()` needs to handle
it on its own (see `rt_maskedcopy()` ugly hack).
* There are multiple way of adding the host route: it can be DST without
netmask or DST with /32(/128) netmask. Also, RTF_HOST has to be set correspondingly.
Currently, these 2 options create 2 DIFFERENT routes in the kernel.
* no sockaddr length/content checking for the "secondary" fields exists: nothing
stops rtsock application to send sockaddr_in with length of 25 (instead of 16).
Kernel will accept it, install to RIB as is and propagate to all rtsock consumers,
potentially triggering bugs in their code. Same goes for sin_port, sin_zero, etc.
The goal of this change is to make rtsock verify all sockaddr and prefix consistency.
Said differently, `rib_action()` or internals should NOT require to change any of the
sockaddrs supplied by `rt_addrinfo` structure due to incorrectness.
To be more specific, this change implements the following:
* sockaddr cleanup/validation check is added immediately after getting sockaddrs from rtm.
* Per-family dst/netmask checks clears host bits in dst and zeros all dst/netmask "secondary" fields.
* The same netmask checking code converts /32(/128) netmasks to "host" route case
(NULL netmask, RTF_HOST), removing the dualism.
* Instead of allowing ANY "known" sockaddr families (0<..<AF_MAX), allow only actually
supported ones (inet, inet6, link).
* Automatically convert `sockaddr_sdl` (AF_LINK) gateways to
`sockaddr_sdl_short`.
Reported by: Guy Yur <guyyur at gmail.com>
Reviewed By: donner
Differential Revision: https://reviews.freebsd.org/D28668
MFC after: 3 days
More and more code migrates from lock-based protection to the NET_EPOCH
umbrella. It requires some logic changes, including, notably, refcount
handling.
When we have an `ifa` pointer and we're running inside epoch we're
guaranteed that this pointer will not be freed.
However, the following case can still happen:
* in thread 1 we drop to 0 refcount for ifa and schedule its deletion.
* in thread 2 we use this ifa and reference it
* destroy callout kicks in
* unhappy user reports bug
To address it, new `ifa_try_ref()` function is added, allowing to return
failure when we try to reference `ifa` with 0 refcount.
Additionally, existing `ifa_ref()` is enforced with `KASSERT` to provide
cleaner error in such scenarious.
Reviewed By: rstone, donner
Differential Revision: https://reviews.freebsd.org/D28639
MFC after: 1 week
iflib_init_locked() assumes that iflib_stop() has been called, however,
it is not called for media changes.
iflib_if_init_locked() calls stop then init, so fixes the problem.
PR: 253473
MFC after: 3 days
Reviewed by: markj
Sponsored by: Juniper Networks, Inc., Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D28667
The case of adding interface route by specifying interface
address as the gateway was missed during code refactoring.
Re-add it back by copying non-AF_LINK gateway data when RTF_GATEWAY
is not set.
Reviewed by: donner
MFC after: 3 days
Traditionally *BSD routing stack required to supply some
interface data for blackhole/reject routes. This lead to
varieties of hacks in routing daemons when inserting such routes.
With the recent routeing stack changes, gateway sockaddr without
RTF_GATEWAY started to be treated differently, purely as link
identifier.
This change broke net/bird, which installs blackhole routes with
127.0.0.1 gateway without RTF_GATEWAY flags.
Fix this by automatically constructing necessary gateway data at
rtsock level if RTF_REJECT/RTF_BLACKHOLE is set.
Reported by: Marek Zarychta <zarychtam at plan-b.pwste.edu.pl>
Reviewed by: donner
MFC after: 1 week
Widen the ifnet_detach_sxlock to cover the entire vnet sysuninit code.
This ensures that we can't end up having the vnet_sysuninit free the UDP
pcb while the detach code is running and trying to purge the UDP pcb.
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D28530
Historically receive buffer overflows have been ignored and programs
could not tell if they missed messages or messages had been truncated
because of overflows. Since programs historically do not expect to get
receive overflow errors, this behavior is not the default.
This is really really important for programs that use route(4) to keep in sync
with the system. If we loose a message then we need to reload the full system
state, otherwise the behaviour from that point is undefined and can lead
to chasing bogus bug reports.
ROUTE_MPATH was added to the GENERIC kernel in r368648.
According to the plan in D27428, it was enabled with `net.route.multipath` sysctl set to 0.
Given enough time has passed, this change enables route multipath by default.
The goal is to ship FreeBSD 13 with multipath turned on.
Reviewed By: donner, olivier
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D28423
Memory and PCI resources are freed with no particular order. This could
cause use-after-frees when detaching following a failed attach. For
instance, iflib_tx_structures_free() frees ctx->ifc_txqs[] but
iflib_tqg_detach() attempts to access this array. Similarly, adapter
queues gets freed by IFDI_QUEUES_FREE() but IFDI_DETACH() attempts to
access adapter queues to free PCI resources.
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
Differential Revision: https://reviews.freebsd.org/D27634
Currently, if the immutable algorithm like bsearch or radix_lockless
receives rtable update notification, it schedules algorithm rebuild.
This rebuild is executed by the callout after ~50 milliseconds.
It is possible that a script adding an interface address and than route
with the gateway bound to that address will fail. It can happen due
to the fact that fib is not updated by the time the route addition
request arrives.
Fix this by allowing synchronous algorithm rebuilds based on certain
conditions. By default, these conditions assume:
1) less than net.route.algo.fib_sync_limit=100 routes
2) routes without gateway.
* Move algo instance build entirely under rib WLOCK.
Rib lock is only used for control plane (except radix algo, but there
are no rebuilds).
* Add rib_walk_ext_locked() function to allow RIB iteration with
rib lock already held.
* Fix rare potential callout use-after-free for fds by binding fd
callout to the relevant rib rmlock. In that case, callout_stop()
under rib WLOCK guarantees no callout will be executed afterwards.
MFC after: 3 days
subscriptions during RIB modifications.
Add new subscriptions to the beginning of the lists instead of
the end. This fixes the situation when new subscription is created
int the callback for the existing subscription, leading to the
subscription notification handler pick it.
MFC after: 3 days
* Move per-prefix debug lines under LOG_DEBUG2
* Create fib instance counter to distingush log messages between
instances
* Add more messages on rebuild reason.
MFC after: 3 days
The initial plan was to remove rib_lookup_info() before
FreeBSD 13. As several customers are still remaining,
fix rib_lookup_info() for the multipath use case.
D26436 introduced support for stacked vlans that changed the way vlans
are configured. In particular, this change broke setups that have
same-number vlans as subinterfaces.
Vlan support was initially created assuming "vlanX" semantics. In this paradigm,
automatic number assignment supported by cloning (ifconfig vlan create) was a
natural fit.
When "ifaceX.Y" support was added, allowing to have the same vlan number on
multiple devices, cloning code became more complex, as the is no
unified "vlan" namespace anymore. Such interfaces got the first spare
index from "vlan" cloner. This, in turn, led to the following problem:
ifconfig ix0.333 create -> index 1
ifconfig ix0.444 create -> index 2
ifconfig vlan2 create -> allocation failure
This change fixes such allocations by using cloning indexes only for
"vlanX" interfaces.
Reviewed by: hselasky
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D27505
Originally IFCAP_NOMAP meant that the mbuf has external storage pointer
that points to unmapped address. Then, this was extended to array of
such pointers. Then, such mbufs were augmented with header/trailer.
Basically, extended mbufs are extended, and set of features is subject
to change. The new name should be generic enough to avoid further
renaming.
tree that fix the ratelimit code. There were several bugs
in tcp_ratelimit itself and we needed further work to support
the multiple tag format coming for the joint TLS and Ratelimit dances.
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D28357