Finish 02e05b8fae:
* add gateway matcher function that can be used in rib_del_route_px()
or any rib_walk-family functions. It will be used in the upcoming
migration to the new KPI
* rename gw_filter_func() to match_gw_one() to better signal the
function's purpose and semantics.
MFC after: 1 month
This makes the routing socket implementation self-contained and removes one
of the last dependencies on the raw socket code and pr_output method.
There are very subtle API-visible changes:
- the routing socket now returns EOPNOTSUPP instead of EINVAL on
syscalls that are not supposed to be called on a routing socket.
- routing socket buffer sizes are now controlled by net.rtsock
sysctls instead of net.raw. The latter were not documented
anywhere, and even Internet search doesn't find any references
or discussions related to these sysctls.
Reviewed by: melifaro
Differential revision: https://reviews.freebsd.org/D36122
The old mechanism of getting them via domains/protocols control input
is a relic from the previous century, when nothing like EVENTHANDLER(9)
existed yet. Retire PRC_IFDOWN/PRC_IFUP as netinet was the only one
to use them.
Reviewed by: melifaro
Differential revision: https://reviews.freebsd.org/D36116
We must ensure that the fd provided by userspace is really for a UDP
socket. If it's not we'll panic in udp_set_kernel_tunneling().
Reported by: Gert Doering <gert@greenie.muc.de>
Sponsored by: Rubicon Communications, LLC ("Netgate")
route_ctl.c size has grown considerably since initial introduction.
Factor out non-relevant parts:
* all rtentry logic, such as creation/destruction and accessors
goes to net/route/route_rtentry.c
* all rtable subscription logic goes to net/route/route_subscription.c
Differential Revision: https://reviews.freebsd.org/D36074
MFC after: 1 month
This change adds public KPI to work with routes using pre-created
nexthops, instead of using data from addrinfo structures. These
functions will be later used for adding/deleting kernel-originated
routes and upcoming netlink protocol.
As a part of providing this KPI, low-level route addition code has been
reworked to provide more control over route creation or change.
Specifically, a number of operation flags
(RTM_F_<CREATE|EXCL|REPLACE|APPEND>) have been added, defining the
desired behaviour if the route already exists (or does not exist). This
change required some changes in the multipath addition code, resulting
in moving this code to route_ctl.c, rendering mpath_ctl.c empty.
Differential Revision: https://reviews.freebsd.org/D36073
MFC after: 1 month
This change is required for the upcoming introduction of the next
nexthop-based operations KPI, as it will create rtentry and nexthops
at different stages of route table modification.
Differential Revision: https://reviews.freebsd.org/D36072
MFC after: 2 weeks
* Use the same filter func (rib_filter_f_t) for nexthop groups to
simplify callbacks.
* simplify conditional route deletion & remove the need to pass
rt_addrinfo to the low-level deletion functions
* speedup rib_walk_del() by removing an additional per-prefix lookup
Differential Revision: https://reviews.freebsd.org/D36071
MFC after: 1 month
This and the follow-up routing-related changes aim to remove or
reduce `struct rt_addrinfo` usage and use recently-landed nhop(9)
KPI instead.
Traditionally `rt_addrinfo` structure has been used to propagate all necessary
information between the protocol/rtsock and a routing layer. Many
functions inside the routing subsystem use it internally. However, using
this structure became somewhat complicated, as there are too many ways
of specifying a single state and verifying data consistency is hard.
For example, are routing flags consistent with mask/gateway sockaddr pointers?
Is mask really a host mask? Are sockaddrs "valid" (e.g. properly zeroed, masked,
have proper length)? Are they mutable? Is the suggested interface specified
by the interface index embedded into the sockaddr_dl gateway, or passed
as the RTAX_IFP parameter, or directly provided by rti_ifp, or does it need to
be derived from the ifa?
These (and other similar) questions have to be considered every time when
a function has `rt_addrinfo` pointer as an argument.
The new approach is to bring more control back to the protocols and
construct the desired routing objects themselves - in the end, it's the
protocol/subsystem who knows the desired outcome.
This specific diff changes the following:
* add explicit basic low-level radix operations:
add_route() (renamed from add_route_nhop())
delete_route() (factored from change_route_nhop())
change_route() (renamed from change_route_nhop)
* remove "info" parameter from change_route_conditional() as a part
of reducing rt_addrinfo usage in the internal KPIs
* add lookup_prefix_rt() wrapper for doing re-lookups after
RIB lock/unlock
Differential Revision: https://reviews.freebsd.org/D36070
MFC after: 2 weeks
BPF might put an interface in promiscuous mode when handling the
BIOCSDLT ioctl. When this happens, a flag is set in the BPF descriptor
so that the old interface can be restored when the BPF descriptor is
destroyed.
The BIOCPROMISC ioctl can also be used to put a BPF descriptor's
interface into promiscuous mode, but there was nothing synchronizing the
flag. Fix this by modifying the ioctl handler to acquire the global BPF
mutex, which is used to synchronize ifpromisc() calls elsewhere in BPF.
Reviewed by: kp, melifaro
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D36045
ovpn_find_peer_by_ip() is not used if INET is not defined. Do not
define the function in that case. Same for ovpn_find_peer_by_ip6().
Fix these warnings:
/usr/src/sys/net/if_ovpn.c:1580:1: warning: unused function 'ovpn_find_peer_by_ip' [-Wunused-function]
ovpn_find_peer_by_ip(struct ovpn_softc *sc, const struct in_addr addr)
^
/usr/src/sys/net/if_ovpn.c:1599:1: warning: unused function 'ovpn_find_peer_by_ip6' [-Wunused-function]
ovpn_find_peer_by_ip6(struct ovpn_softc *sc, const struct in6_addr *addr)
^
Reported by: mjg
Sponsored by: Rubicon Communications, LLC ("Netgate")
* Make nhgrp_get_nhops() return const struct weightened_nhop to
indicate that the list is immutable
* Make nhgrp_get_group() return the actual group, instead of
group+weight.
MFC after: 2 weeks
Convert the last remaining pieces of old-style debug messages
to the new debugging framework.
Differential Revision: https://reviews.freebsd.org/D35994
MFC after: 2 weeks
Currently, rt_addrinfo(info) serves as a main "transport" moving
state between various functions inside the routing subsystem.
As all of the fields are filled in directly by the customers, it
is problematic to maintain consistency, resulting in repeated checks
inside many functions. Additionally, there are multiple ways of
specifying the same value (RTAX_IFP vs rti_ifp / rti_ifa) and so on.
With the upcoming nhop(9) KPI it is possible to store all of the
required state in the nexthops in a consistent fashion, reducing the
need to use "info" in the KPI calls.
Finally, rt_addrinfo structure format was derived from the rtsock wire
format, which is different from other kernel routing users or netlink.
This cleanup simplifies upcoming nhop(9) kpi and netlink introduction.
Reviewed by: zlei.huang@gmail.com
Differential Revision: https://reviews.freebsd.org/D35972
MFC after: 2 weeks
Mark dst/mask public API functions fields as const to clearly
indicate that these parameters are not modified or stored in
the datastructure.
Differential Revision: https://reviews.freebsd.org/D35971
MFC after: 2 weeks
Expiration time is actually a path property, not a route property.
Move its storage to nexthop to simplify upcoming nhop(9) KPI changes
and netlink introduction.
Differential Revision: https://reviews.freebsd.org/D35970
MFC after: 2 weeks
In the current implementation of altq_hfsc.c, when new queues are being
added (by pfctl), each queue is added to the tail of the siblings linked
list under the parent queue.
On a system with many queues (50,000+) this leads to very long load
times, as the insertion process must scan the entire list for every new
queue.
Since this list is unordered, this change merely adds the new queue to
the head of the list rather than the tail.
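The cost difference can be illustrated with a minimal userspace sketch. The struct and function names below are hypothetical and are not the altq_hfsc.c ones; the step counter exists only to make the scan visible.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a queue class with a sibling link. */
struct class_model {
	int id;
	struct class_model *siblings;
};

/* Tail insertion: walks the whole list first, so N inserts cost O(N^2). */
static unsigned
model_insert_tail(struct class_model **head, struct class_model *cl)
{
	struct class_model **pp = head;
	unsigned steps = 0;

	while (*pp != NULL) {
		pp = &(*pp)->siblings;
		steps++;
	}
	cl->siblings = NULL;
	*pp = cl;
	return (steps);
}

/* Head insertion: constant time, valid because the list is unordered. */
static unsigned
model_insert_head(struct class_model **head, struct class_model *cl)
{
	cl->siblings = *head;
	*head = cl;
	return (0);
}
```

With 50,000 siblings the tail variant performs on the order of a billion pointer hops in total, while the head variant performs none.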
Reviewed by: kp
MFC after: 3 weeks
Sponsored by: RG Nets
Differential Revision: https://reviews.freebsd.org/D35964
Lagg was broken by SIOCSIFCAPNV when all underlying devices
support SIOCSIFCAPNV. This change updates lagg to work with
SIOCSIFCAPNV and if_capabilities2.
Reviewed by: kib, hselasky
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D35865
With clang 15, the following -Werror warnings are produced:
sys/net/route/route_ctl.c:130:17: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
vnet_rtzone_init()
^
void
sys/net/route/route_ctl.c:139:20: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
vnet_rtzone_destroy()
^
void
This is because vnet_rtzone_init() and vnet_rtzone_destroy() are
declared with (void) argument lists, but defined with empty argument
lists. Make the definitions match the declarations.
MFC after: 3 days
With clang 15, the following -Werror warning is produced:
sys/net/route/nhop_ctl.c:508:21: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
alloc_nhop_structure()
^
void
This is because alloc_nhop_structure() is declared with a (void) argument list,
but defined with an empty argument list. Make the definition match the
declaration.
MFC after: 3 days
vlan_remhash() uses incorrect value for b.
When using the default value for VLAN_DEF_HWIDTH (4), the VLAN hash-list table
expands from 16 chains to 32 chains as the 129th entry is added. trunk->hwidth
becomes 5. Say a few more entries are added and there are now 135 entries.
trunk->hwidth will still be 5. If an entry is removed, vlan_remhash() will
calculate a value of 32 for b. refcnt will be decremented to 134. The if
comparison at line 473 will return true and vlan_growhash() will be called. The
VLAN hash-list table will be compressed from 32 chains wide to 16 chains wide.
hwidth will become 4. This is an error, and it can be seen when a new VLAN is
added. The table will again be expanded. If an entry is then removed, again
the table is contracted.
If the number of VLANs stays in the range of 128-512, each time an insert
follows a remove, the table will expand. Each time a remove follows an
insert, the table will be contracted.
The fix is simple. Line 473 should test whether the number of entries has
decreased enough that the table should be contracted, using what would be the new
value of hwidth. Line 467 should be:
b = 1 << (trunk->hwidth - 1);
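The thrashing described above can be reproduced with a small userspace model. This is only a sketch: the struct, the function names and the exact placement of the threshold checks are assumptions chosen to match the numbers in this message (growth at the 129th entry, hwidth 4 as the minimum), not a copy of if_vlan.c.

```c
#include <assert.h>

#define MODEL_MIN_HWIDTH 4	/* mirrors the default VLAN_DEF_HWIDTH */

struct trunk_model {
	unsigned hwidth;	/* log2 of the number of hash chains */
	unsigned refcnt;	/* number of entries in the table */
	unsigned resizes;	/* how often the table was resized */
};

static void
model_insert(struct trunk_model *t)
{
	unsigned b = 1u << t->hwidth;

	t->refcnt++;
	if (t->refcnt > (b * b) / 2) {	/* 129th entry: 16 -> 32 chains */
		t->hwidth++;
		t->resizes++;
	}
}

static void
model_remove(struct trunk_model *t, int fixed)
{
	/* Old code used the current width; the fix uses the would-be width. */
	unsigned b = fixed ? 1u << (t->hwidth - 1) : 1u << t->hwidth;

	assert(t->refcnt > 0);
	t->refcnt--;
	if (t->hwidth > MODEL_MIN_HWIDTH && t->refcnt < (b * b) / 2) {
		t->hwidth--;
		t->resizes++;
	}
}

/* Fill to 135 entries, then alternate remove/insert 'cycles' times. */
static unsigned
model_count_resizes(int fixed, unsigned cycles)
{
	struct trunk_model t = { MODEL_MIN_HWIDTH, 0, 0 };

	for (unsigned i = 0; i < 135; i++)
		model_insert(&t);
	for (unsigned i = 0; i < cycles; i++) {
		model_remove(&t, fixed);
		model_insert(&t);
	}
	return (t.resizes);
}
```

In this model, ten remove/insert cycles at 135 entries resize the table 21 times with the old value of b and only once (the initial growth) with the corrected one.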
PR: 265382
Reviewed by: kp
MFC after: 2 weeks
Sponsored by: NetApp, Inc.
With clang 15, the following -Werror warning is produced:
sys/net/iflib.c:993:8: error: variable 'n' set but not used [-Werror,-Wunused-but-set-variable]
u_int n;
^
The 'n' variable appears to have been a debugging aid that has never
been used for anything, so remove it.
MFC after: 3 days
With clang 15, the following -Werror warning is produced:
sys/net/if_lagg.c:2413:6: error: variable 'active_ports' set but not used [-Werror,-Wunused-but-set-variable]
int active_ports = 0;
^
The 'active_ports' variable appears to have been a debugging aid that
has never been used for anything (ref https://reviews.freebsd.org/D549),
so remove it.
MFC after: 3 days
It's currently not possible to change the vlan ID or vlan protocol (i.e.
802.1q vs. 802.1ad) without de-configuring the interface (i.e. ifconfig
vlanX -vlandev).
Add a specific flow for this, allowing both the protocol and id (but not
parent interface) to be changed without going through the '-vlandev'
step.
Reviewed by: glebius
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D35846
This is not completely exhaustive, but covers a large majority of
commands in the tree.
Reviewed by: markj
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D35583
Combined changes to allow experimentation with net 0/8 (network 0),
240/4 (Experimental/"Class E"), and part of the loopback net 127/8
(all but 127.0/16). All changes are disabled by default, and can be
enabled by the following sysctls:
net.inet.ip.allow_net0=1
net.inet.ip.allow_net240=1
net.inet.ip.loopback_prefixlen=16
When enabled, the corresponding addresses can be used as normal
unicast IP addresses, both as endpoints and when forwarding.
Add descriptions of the new sysctls to inet.4.
Add <machine/param.h> to vnet.h, as CACHE_LINE_SIZE is undefined in
various C files when in.h includes vnet.h.
The proposals motivating this experimentation can be found in
https://datatracker.ietf.org/doc/draft-schoen-intarea-unicast-0
https://datatracker.ietf.org/doc/draft-schoen-intarea-unicast-240
https://datatracker.ietf.org/doc/draft-schoen-intarea-unicast-127
Reviewed by: rgrimes, pauamma_gundo.com; previous versions melifaro, glebius
Differential Revision: https://reviews.freebsd.org/D35741
If the link is down or we can't find a peer we do not transmit the
packet, but also don't free it.
Remember to m_freem() mbufs we can't transmit.
Sponsored by: Rubicon Communications, LLC ("Netgate")
VNET_FOREACH() is a LIST_FOREACH if VIMAGE is set, but empty if it's
not. This means that users of the macro couldn't use 'continue' or
'break' as one would expect of a loop.
Change VNET_FOREACH() to be a loop in all cases (although one that is
fixed to one iteration if VIMAGE is not set).
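A minimal sketch of the idea, using a hypothetical macro name rather than the real VNET_FOREACH() definition: a for loop that runs exactly once gives 'continue' and 'break' the meaning callers expect.

```c
#include <assert.h>

/*
 * Hypothetical one-iteration loop macro. The argument is accepted only to
 * keep the call shape of a real iterator and is otherwise unused.
 */
#define MODEL_FOREACH_ONCE(arg)	for (int _once = 0; _once == 0; _once++)

/* Returns how many times the loop body ran to completion. */
static int
model_count_visits(int skip)
{
	int visits = 0;

	MODEL_FOREACH_ONCE(vnet_iter) {
		if (skip)
			continue;	/* legal because this is a real loop */
		visits++;
	}
	return (visits);
}
```

If the macro expanded to nothing, the 'continue' above would not compile unless an enclosing loop existed, and then it would silently bind to that outer loop instead.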
Reviewed by: karels, melifaro, glebius
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D35739
If we receive a UDP packet (directed towards an active OpenVPN socket)
which is too short to contain an OpenVPN header ('struct
ovpn_wire_header') we wound up making m_copydata() read outside the
mbuf, and panicking the machine.
Explicitly check that the packet is long enough to copy the data we're
interested in. If it's not we will pass the packet to userspace, just
like we'd do for an unknown peer.
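The shape of such a check can be sketched in userspace C. The header layout and function name here are hypothetical stand-ins; the real 'struct ovpn_wire_header' and the mbuf API differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the wire header; not the real layout. */
struct model_wire_header {
	uint32_t opcode;
	uint32_t seq;
};

/*
 * Copy the header out of the packet only if the packet is long enough.
 * Returning false models "pass the packet on to userspace".
 */
static bool
model_read_header(const uint8_t *pkt, size_t pktlen,
    struct model_wire_header *hdr)
{
	if (pktlen < sizeof(*hdr))
		return (false);
	memcpy(hdr, pkt, sizeof(*hdr));
	return (true);
}
```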
Extend a test case to provoke this situation.
Sponsored by: Rubicon Communications, LLC ("Netgate")
Some command definitions were forced to use DB_FUNC in order to specify
their required flags, CS_OWN or CS_MORE. Use the new macros to simplify
these.
Reviewed by: markj, jhb
MFC after: 3 days
Sponsored by: Juniper Networks, Inc.
Sponsored by: Klara, Inc.
Differential Revision: https://reviews.freebsd.org/D35582
OpenVPN defaults to binding to IPv6 sockets (with
setsockopt(IPV6_V6ONLY=0)), which we didn't deal with.
That resulted in us trying to in6_selectsrc_addr() on a v4 mapped v6
address, which does not work.
Instead we translate the mapped address to v4 and treat it as an IPv4
address.
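For illustration, the translation can be sketched in userspace with the standard IN6_IS_ADDR_V4MAPPED() macro. The helper name is hypothetical and this is not the if_ovpn code path.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/*
 * If 'in6' is a v4-mapped address (::ffff:a.b.c.d), extract the embedded
 * IPv4 address from its last four bytes and return true.
 */
static bool
model_unmap_v4(const struct in6_addr *in6, struct in_addr *in4)
{
	if (!IN6_IS_ADDR_V4MAPPED(in6))
		return (false);
	memcpy(in4, &in6->s6_addr[12], sizeof(*in4));
	return (true);
}
```

A caller would take the IPv4 path when the helper returns true and the regular IPv6 path otherwise.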
Sponsored by: Rubicon Communications, LLC ("Netgate")
OpenVPN Data Channel Offload (DCO) moves OpenVPN data plane processing
(i.e. tunneling and cryptography) into the kernel, rather than using tap
devices.
This avoids significant copying and context switching overhead between
kernel and user space and improves OpenVPN throughput.
In my test setup throughput improved from around 660Mbit/s to around
2Gbit/s.
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D34340
The function's goal is to compare old/new nhop/nexthop group for the route
and decompose it into the series of RTM_ADD/RTM_DELETE single-nhop
events, calling specified callback for each event.
Simplify it by properly leveraging the fact that both old/new groups
are sorted nhop-# ascending.
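Because both inputs are sorted, the decomposition reduces to a single two-pointer merge walk. The sketch below uses plain arrays of nexthop indices and hypothetical names; the real code operates on nexthop group structures and also has to consider weights.

```c
#include <assert.h>
#include <stddef.h>

enum model_event { MODEL_ADD, MODEL_DELETE };
typedef void (*model_cb_t)(enum model_event ev, unsigned idx, void *arg);

/* Walk two ascending index arrays once, reporting the differences. */
static void
model_decompose(const unsigned *oldg, size_t nold,
    const unsigned *newg, size_t nnew, model_cb_t cb, void *arg)
{
	size_t i = 0, j = 0;

	while (i < nold && j < nnew) {
		if (oldg[i] < newg[j])
			cb(MODEL_DELETE, oldg[i++], arg);	/* only in old */
		else if (oldg[i] > newg[j])
			cb(MODEL_ADD, newg[j++], arg);		/* only in new */
		else {
			i++;					/* in both */
			j++;
		}
	}
	while (i < nold)
		cb(MODEL_DELETE, oldg[i++], arg);
	while (j < nnew)
		cb(MODEL_ADD, newg[j++], arg);
}

struct model_counts {
	unsigned adds;
	unsigned deletes;
};

/* Example callback that just counts the events. */
static void
model_count_cb(enum model_event ev, unsigned idx, void *arg)
{
	struct model_counts *c = arg;

	(void)idx;
	if (ev == MODEL_ADD)
		c->adds++;
	else
		c->deletes++;
}
```

For old = {1, 3, 5} and new = {3, 4, 5, 7} the walk reports one delete (1) and two adds (4 and 7).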
Tested by: Claudio Jeker <claudio.jeker@klarasystems.com>
Differential Revision: https://reviews.freebsd.org/D35598
MFC after: 2 weeks
Nexthops in the nexthop groups need to be deterministically sorted
by some property of theirs to simplify reporting cost when changing
large nexthop groups.
Fix reporting by actually sorting next hops by their indices (`wn_cmp_idx()`).
As calc_min_mpath_slots_fast() has an assumption that next hops are sorted
using their relative weight in the nexthop groups, it needs to be
addressed as well. The latter sorting is required to quickly determine the
layout of the next hops in the actual forwarding group. For example,
what's the best way to split the traffic between nhops with weights
19,31 and 47 if the maximum nexthop group width is 64?
It is worth mentioning that such sorting is only required during nexthop
group creation and is not used elsewhere. Lastly, normally all nexthops
are of the same weight. With that in mind, (a) use spare 32 bytes inside
`struct weightened_nexthop` to avoid another memory allocation and
(b) use insertion sort to sort the nexthop weights.
Reported by: thj
Tested by: Claudio Jeker <claudio.jeker@klarasystems.com>
Differential Revision: https://reviews.freebsd.org/D35599
MFC after: 2 weeks
Rather than reject new bridge members because they have the wrong MTU
change it to match the bridge. If that fails, reject the new interface.
PR: 264883
Differential Revision: https://reviews.freebsd.org/D35597
Use unified guidelines for the severity across the routing subsystem.
Update severity for some of the already-used messages to adhere to the
guidelines.
Convert rtsock logging to the new FIB_ reporting format.
MFC after: 2 weeks
route.
Reporting logic assumed there is always some nhop change for every
successful modification operation. Explicitly check that the changed
nexthop indeed exists when reporting back to userland.
MFC after: 2 weeks
Reported by: Claudio Jeker <claudio.jeker@klarasystems.com>
Tested by: Claudio Jeker <claudio.jeker@klarasystems.com>
RTM_CHANGE operates on a single component of the multipath route (e.g. on a single nexthop).
The search for this nexthop is performed by iterating over each component of the multipath (nexthop)
group, using check_info_match_nhop(). The problem with the current code is that it incorrectly
assumes that `check_info_match_nhop()` returns a true value on match, while in reality it
returns 0 on match and an error code on failure. Fix this by properly comparing the result with 0.
Additionally, the follow-up code modified the original nexthop group instead of the new one.
Fix this by targeting the new nexthop group instead.
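The return-value pitfall can be shown with a minimal sketch. The matcher below is a hypothetical stand-in that follows the same "0 on match, errno on mismatch" convention; it is not check_info_match_nhop() itself.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical matcher: 0 on match, an errno value otherwise. */
static int
model_match_nhop(unsigned want_gw, unsigned nhop_gw)
{
	return (want_gw == nhop_gw ? 0 : ESRCH);
}

/* Return the index of the matching component, or -1 if there is none. */
static int
model_find_nhop(const unsigned *gws, size_t n, unsigned want)
{
	for (size_t i = 0; i < n; i++) {
		/* Writing "if (model_match_nhop(...))" would invert the test. */
		if (model_match_nhop(want, gws[i]) == 0)
			return ((int)i);
	}
	return (-1);
}
```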
Reported by: thj
Tested by: Claudio Jeker <claudio.jeker@klarasystems.com>
Differential Revision: https://reviews.freebsd.org/D35526
MFC after: 2 weeks
BPF headers are word-aligned when copied into the store buffer. Ensure
that pad bytes following the preceding packet are cleared.
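A userspace sketch of the idea follows. The alignment macro, the function name and the flat byte buffer are assumptions for illustration and do not reflect the bpf(4) store buffer code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Round up to the next multiple of sizeof(long). */
#define MODEL_WORDALIGN(x) \
	(((x) + sizeof(long) - 1) & ~(sizeof(long) - 1))

/*
 * Append a record at 'off' and zero the pad bytes up to the next word
 * boundary, so stale buffer contents never leak to the reader.
 * Returns the aligned offset for the next record, or 'off' unchanged if
 * the record does not fit.
 */
static size_t
model_append(uint8_t *buf, size_t bufsize, size_t off, const void *rec,
    size_t len)
{
	size_t end = off + len;
	size_t next = MODEL_WORDALIGN(end);

	if (next > bufsize)
		return (off);
	memcpy(buf + off, rec, len);
	memset(buf + end, 0, next - end);	/* clear the padding */
	return (next);
}
```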
Reported by: KMSAN
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Some drivers will collect multiple mbuf chains, linked by m_nextpkt,
before passing them to upper layers. debugnet_pkt_in() didn't handle
this and would process only the first packet, typically leading to
retransmits.
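Handling such a batch amounts to walking the packet chain and unlinking each packet before it is processed. The sketch below uses a hypothetical packet struct instead of struct mbuf.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an mbuf chain head linked via m_nextpkt. */
struct pkt_model {
	struct pkt_model *nextpkt;
	int handled;
};

/* Process every packet in the batch; returns the number handled. */
static unsigned
model_pkt_in(struct pkt_model *m)
{
	struct pkt_model *next;
	unsigned n = 0;

	for (; m != NULL; m = next) {
		next = m->nextpkt;	/* save the link before processing */
		m->nextpkt = NULL;	/* detach so handlers see one packet */
		m->handled = 1;		/* stands in for real processing */
		n++;
	}
	return (n);
}
```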
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
In lacp_select_tx_port_by_hash(), we assert that the selected port is
DISTRIBUTING. However, the port state is protected by the LACP_LOCK(),
which is not held around lacp_select_tx_port_by_hash(). So this
assertion is racy, and can result in a spurious panic when links
are flapping.
It is certainly possible to fix it by acquiring LACP_LOCK(),
but this seems like an early development assert, and it seems best
to just remove it, rather than add complexity inside an ifdef
INVARIANTS.
Sponsored by: Netflix
Reviewed by: hselasky
Differential Revision: https://reviews.freebsd.org/D35396
The TLS receive tags are allocated directly from the receiving interface,
because mbufs are flowing in the opposite direction and then route change
checks are not useful, because they only work for outgoing traffic.
Differential revision: https://reviews.freebsd.org/D32356
Sponsored by: NVIDIA Networking
Mbufs leak when manually removing incomplete NDP records with pending packets via ndp -d.
This happens because lltable_drop_entry_queue() relies on the `la_numheld`
counter when dropping NDP entries (lles). It turned out that the NDP code never
increased `la_numheld`, so the actual free never happened.
Fix the issue by introducing unified lltable_append_entry_queue(),
common for both ARP and NDP code, properly addressing packet queue
maintenance.
Reviewed By: melifaro
Differential Revision: https://reviews.freebsd.org/D35365
MFC after: 2 weeks
We could insert proxy NDP entries with the ndp command, but a host
with proxy NDP entries did not respond to Neighbor Solicitations.
Change the following points for proxy NDP to work as expected:
* join solicited-node multicast addresses for proxy NDP entries
in order to receive Neighbor Solicitations.
* look up proxy NDP entries not on the routing table but on the
link-level address table when receiving Neighbor Solicitations.
Reviewed By: melifaro
Differential Revision: https://reviews.freebsd.org/D35307
MFC after: 2 weeks
In order to decrease ifdef INET/INET6s in the lltable implementation,
introduce the llt_post_resolved callback and implement protocol-dependent
code in the protocol-dependent part.
Reviewed By: melifaro
Differential Revision: https://reviews.freebsd.org/D35322
MFC after: 2 weeks
Provide sticky ARP flag for network interface which marks it as the
"sticky" one similarly to what we have for bridges. Once interface is
marked sticky, any address resolved using the ARP will be saved as a
static one in the ARP table. Such functionality may be used to prevent
ARP spoofing or to decrease latencies in Ethernet networks.
The drawbacks include potential limitations in usage of ARP-based
load-balancers and high-availability solutions such as carp(4).
The implemented option is disabled by default, therefore should not
impact the default behaviour of the networking stack.
Sponsored by: Conclusive Engineering sp. z o.o.
Reviewed By: melifaro, pauamma_gundo.com
Differential Revision: https://reviews.freebsd.org/D35314
MFC after: 2 weeks
We may call debugnet_free() before g_debugnet_pcb_inuse is true,
specifically in the cases where the interface is down or does not
support debugnet. pcb->dp_drv_input is used to hold the real driver
if_input callback while debugnet is in use, so we can check the status
of this field in the assertion.
This can be triggered trivially by trying to configure netdump on an
unsupported interface at the ddb prompt.
Initializing the dp_drv_input field to NULL explicitly is not necessary
but helps display the intent.
PR: 263929
Reported by: Martin Filla <freebsd@sysctl.cz>
Reviewed by: cem, markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D35179
struct sockaddr is not sufficient for buffer that can hold any
sockaddr_* structure. struct sockaddr_storage should be used.
Test:
ifconfig epair create
ifconfig epair0a inet6 add 2001:db8::1 up
ndp -s 2001:db8::2 02:86:98:2e:96:0b proxy # this triggers kernel stack overflow
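The size mismatch behind this class of bug is easy to demonstrate in userspace. The helper below is a hypothetical sketch, not the kernel fix: it copies into a struct sockaddr_storage and refuses anything larger.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

/*
 * Copy an arbitrary sockaddr_* into storage that is large enough for
 * every address family. A plain struct sockaddr destination would be
 * too small for, e.g., struct sockaddr_in6.
 */
static bool
model_copy_sockaddr(struct sockaddr_storage *dst, const void *src,
    size_t srclen)
{
	if (srclen > sizeof(*dst))
		return (false);
	memset(dst, 0, sizeof(*dst));
	memcpy(dst, src, srclen);
	return (true);
}
```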
Reviewed by: markj, kp
Differential Revision: https://reviews.freebsd.org/D35188
If 'options RSS' is set we bind the epair tasks to different CPUs. We
must take care to not keep the current thread bound to the last CPU when
we return to userspace.
MFC after: 1 week
Sponsored by: Orange Business Services
When we destroy an interface while the jail containing it is being
destroyed we risk seeing a race between if_vmove() and the destruction
code, which results in us trying to move a destroyed interface.
Protect against this by using the ifnet_detach_sxlock to also cover
if_vmove() (and not just detach).
PR: 262829
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D34704
Supplement ifindex table with generation count and use it to
serialize & restore an ifnet pointer.
Reviewed by: kp
Differential revision: https://reviews.freebsd.org/D33266
Fun note: git show e6abef0918
(cherry picked from commit e1882428dc)
Now that ifindex is static to if.c we can unvirtualize it. For lifetime
of an ifnet its index never changes. To avoid leaking foreign interfaces
the net.link.generic.system.ifcount sysctl and the ifnet_byindex() KPI
filter their returned value on curvnet. Since if_vmove() no longer
changes the if_index, inline ifindex_alloc() and ifindex_free() into
if_alloc() and if_free() respectively.
API wise the only change is that now minimum interface index can be
greater than 1. The holes in interface indexes were always allowed.
Reviewed by: kp
Differential revision: https://reviews.freebsd.org/D33672
(cherry picked from commit 91f44749c6)
This reverts commit 91f44749c6.
Devirtualization of V_if_index and V_ifindex_table was rushed into
the tree lacking proper context, discussion, and declaration of intent,
so I'm backing it out as harmful to VNET on the following grounds:
1) The change repurposed the decades-old and stable if_index KBI for
new, unclear goals which were omitted from the commit note.
2) The change opened up a new resource exhaustion vector where any vnet
could starve the system of ifnet indices, including vnet0.
3) To circumvent the newly introduced problem of separating ifnets
belonging to different vnets from the globalized ifindex_table, the
author introduced sysctl_ifcount() which does a linear traversal over
the (potentially huge) global ifnet list just to return a simple upper
bound on existing ifnet indices.
4) The change effectively led to nonuniform ifnet index allocation
among vnets.
5) The commit note clearly stated that the patch changed the implicit
if_index ABI contract where ifnet indices were assumed to be starting
from one. The commit note also included a correct observation that
holes in interface indices were always allowed, but failed to declare
that the userland-observable ifindex tables could now include huge
empty spans even under modest operating conditions.
6) The author had an earlier proposal in the works which did not
affect per-vnet ifnet lists (D33265) but which he abandoned without
providing the rationale behind his decision to do so, at the expense
of sacrificing the vnet isolation contract and if_index ABI / KBI.
Furthermore, the author agreed to back out his changes himself and
to follow up with a proposal for a less intrusive alternative, but
later silently declined to act. Therefore, I decided to resolve the
status-quo by backing this out myself. This in no way precludes a
future proposal aiming to mitigate ifnet-removal related system
crashes or panics to be accepted, provided it would not unnecessarily
compromise the goal of as strict as possible isolation between vnets.
Obtained from: github.com/glebius/FreeBSD/commits/backout-ifindex
This reverts commit 703e533da5.
Revert "ifnet/mbuf: provide KPI to serialize/restore m->m_pkthdr.rcvif"
This reverts commit e1882428dc.
Obtained from: github.com/glebius/FreeBSD/commits/backout-ifindex
Panasas was seeing a higher-than-expected number of link-flap events.
After joint debugging with the switch vendor, we determined there were
problems on both sides; either of which might cause the occasional
event, but together caused lots of them.
On the switch side, an internal queuing issue was causing LACP PDUs --
which should be sent every second, in short-timeout mode -- to sometimes
be sent slightly later than they should have been. In some cases, two
successive PDUs were late, but we never saw three late PDUs in a row.
On the FreeBSD side, we saw a link-flap event every time there were two
late PDUs, while the spec says that it takes *three* seconds of downtime
to trigger that event. It turns out that if a PDU was received shortly
before the timer code was run, it would decrement less than a full
second after the PDU arrived. Then two delayed PDUs would cause two
additional decrements, causing it to reach zero less than three seconds
after the most-recent on-time PDU.
The solution is to note the time a PDU arrives, and only decrement if at
least a full second has elapsed since then.
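The timing can be checked with a small model. Everything here is an assumption for illustration (a 3-tick timer, a tick every 1000 ms, millisecond timestamps, hypothetical names); it is not the lacp.c implementation.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_TIMEOUT_TICKS	3	/* three seconds in short-timeout mode */

struct port_model {
	int timer;		/* remaining ticks before the link flaps */
	long last_pdu_ms;	/* arrival time of the most recent PDU */
};

static void
model_pdu_rx(struct port_model *p, long now_ms)
{
	p->timer = MODEL_TIMEOUT_TICKS;
	p->last_pdu_ms = now_ms;
}

/* Run one timer tick; returns true when the timer expires. */
static bool
model_tick(struct port_model *p, long now_ms, bool fixed)
{
	if (p->timer == 0)
		return (false);
	/* Fixed behaviour: skip unless a full second passed since the PDU. */
	if (fixed && now_ms - p->last_pdu_ms < 1000)
		return (false);
	return (--p->timer == 0);
}

/* A PDU arrives at t=900 ms, ticks fire every second; when do we expire? */
static long
model_expiry_ms(bool fixed)
{
	struct port_model p;

	model_pdu_rx(&p, 900);
	for (long now = 1000; now <= 10000; now += 1000)
		if (model_tick(&p, now, fixed))
			return (now);
	return (-1);
}
```

In this model the old behaviour expires at 3000 ms, only 2.1 seconds after the PDU, while the fixed behaviour expires at 4000 ms, leaving room for a PDU that arrives up to two seconds late.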
Reported by: Greg Foster <gfoster@panasas.com>
Reviewed by: gallatin
Tested by: Greg Foster <gfoster@panasas.com>
MFC after: 3 days
Sponsored by: Panasas
Differential Revision: https://reviews.freebsd.org/D35070
Similar to ipfw rule timestamps, these timestamps internally are
uint32_t snaps of the system time in seconds. The timestamp is CPU local
and updated each time a rule, or a state associated with a rule,
is matched.
Reviewed by: kp
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D34970
Allow tables to be used for the l3 source/destination matching.
This requires taking the PF_RULES read lock.
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D34917
Allow udp tunnel functions to indicate they have not taken ownership of
the packet, and that normal UDP processing should continue.
This is especially useful for scenarios where the kernel has taken
ownership of a socket that was originally created by userspace. It
allows the tunnel function to pass through certain packets for userspace
processing.
The primary user of this is if_ovpn, when it receives messages from
unknown peers (which might be a new client).
Reviewed by: tuexen
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D34883
Also convert raw epoch_call() calls to lltable_free_entry() calls, no
functional change intended. There's no need to asynchronously free the
LLEs in that case to begin with, but we might as well use the lltable
interfaces consistently.
Noticed by code inspection; I believe lltable_calc_llheader() failures
do not generally happen in practice.
Reviewed by: bz
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34832
Possibly one could assert that ret should always be 0 here (that is,
that there was always an index found in the bitmask). That should be
true since a bitmask index is allocated before the nhgrp is inserted
in the ctl->gr_head list in link_nhgrp.
with md5 sum used as key.
This gets rid of the quadratic rule traversal when "keep_counters" is
set.
Reviewed by: kp
Sponsored by: Rubicon Communications, LLC ("Netgate")
Makes it cheaper to compare rules when "keep_counters" is set.
This also sets up keeping them in a RB tree.
Reviewed by: kp
Sponsored by: Rubicon Communications, LLC ("Netgate")
For now only protects rule creation/destruction, but will allow
gradually reducing the scope of rules lock when changing the
rules.
Reviewed by: kp
Sponsored by: Rubicon Communications, LLC ("Netgate")
66acf7685b failed to build on riscv (and mips). This is because the
atomic_testandset_int() (and friends) functions do not exist there.
Happily those platforms do have the long variant, so switch to that.
PR: 262571
MFC after: 3 days
As an unwanted side effect of the performance improvements in
24f0bfbad5, epair interfaces stop forwarding traffic on higher
load levels when running on multi-core systems.
This happens due to a race condition in the logic that decides when to
place work in the task queue(s) responsible for processing the content
of ring buffers.
In order to fix this, a field named state is added to the epair_queue
structure. This field is used by the affected functions to signal each
other that something happened in the underlying ring buffers that might
require work to be scheduled in task queue(s), replacing the existing
logic, which relied on checking if ring buffers are empty or not.
epair_menq() does:
  - set BIT_MBUF_QUEUED
  - queue mbuf
  - if testandset BIT_QUEUE_TASK:
      enqueue task
epair_tx_start_deferred() does:
  - swap ring buffers
  - process mbufs
  - clear BIT_QUEUE_TASK
  - if testandclear BIT_MBUF_QUEUED:
      enqueue task
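The handshake above can be sketched in userspace C11 roughly as follows. This is an illustration under stated assumptions, not the if_epair code: struct queue_sketch and both *_sketch functions are hypothetical, counters stand in for the ring buffers and the task queue, and the producer is assumed to enqueue the task only when BIT_QUEUE_TASK was previously clear.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Bit positions in the state word (names taken from the text above). */
enum { BIT_QUEUE_TASK = 0, BIT_MBUF_QUEUED = 1 };

/* Hypothetical stand-in for struct epair_queue. */
struct queue_sketch {
	atomic_ulong state;	/* the new "state" field */
	int mbufs_pending;	/* stand-in for ring buffer contents */
	int tasks_enqueued;	/* counts simulated task enqueues */
};

/* Atomically set a bit; report whether it was already set. */
static bool
test_and_set(atomic_ulong *w, unsigned bit)
{
	return (atomic_fetch_or(w, 1UL << bit) & (1UL << bit)) != 0;
}

/* Atomically clear a bit; report whether it was set before. */
static bool
test_and_clear(atomic_ulong *w, unsigned bit)
{
	return (atomic_fetch_and(w, ~(1UL << bit)) & (1UL << bit)) != 0;
}

/* Producer side, following the epair_menq() steps. */
static void
menq_sketch(struct queue_sketch *q)
{
	assert(q != NULL);
	atomic_fetch_or(&q->state, 1UL << BIT_MBUF_QUEUED);
	q->mbufs_pending++;		/* "queue mbuf" */
	/* Assumption: enqueue only if no task was already scheduled. */
	if (!test_and_set(&q->state, BIT_QUEUE_TASK))
		q->tasks_enqueued++;	/* "enqueue task" */
}

/* Consumer side, following the epair_tx_start_deferred() steps. */
static void
tx_start_deferred_sketch(struct queue_sketch *q)
{
	assert(q != NULL);
	q->mbufs_pending = 0;		/* "swap ring buffers", "process mbufs" */
	atomic_fetch_and(&q->state, ~(1UL << BIT_QUEUE_TASK));
	/* Work may have been queued since the swap, so run once more. */
	if (test_and_clear(&q->state, BIT_MBUF_QUEUED))
		q->tasks_enqueued++;	/* "enqueue task" */
}
```

The point of the two bits is that the decision to schedule work no longer depends on observing whether a ring buffer is empty, which is where the race was.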
PR: 262571
Reported by: Johan Hendriks <joh.hendriks@gmail.com>
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D34569
Allow filtering based on the source or destination IP/IPv6 address in
the Ethernet layer rules.
Reviewed by: pauamma_gundo.com (man), debdrup (man)
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D34482
Symptom: when a single extmem memory region is provided to netmap
multiple times, for multiple interfaces, the memory region is
never released by netmap once all the existing file descriptors
are closed.
Fix the relevant condition in netmap_mem_drop(): release the memory
when the last user of netmap_adapter is gone, rather than when
the last user of netmap_mem_d is gone.
MFC after: 2 weeks
When filtering Ethernet packets, allow rules to specify a MAC address
with a mask. This indicates which bits of the specified address are
significant. This allows users to do things like filter based on device
manufacturer.
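A minimal sketch of masked matching, assuming the usual bitwise interpretation (a set mask bit means the corresponding address bit must match). The helper name mac_match_masked is hypothetical and is not the function pf uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAC_LEN 6	/* length of an Ethernet address in bytes */

/*
 * Return true if addr equals rule_addr in every bit position that is
 * set in mask.  Bits cleared in mask are ignored.
 */
static bool
mac_match_masked(const uint8_t *addr, const uint8_t *rule_addr,
    const uint8_t *mask)
{
	assert(addr != NULL && rule_addr != NULL && mask != NULL);
	for (size_t i = 0; i < MAC_LEN; i++) {
		if (((addr[i] ^ rule_addr[i]) & mask[i]) != 0)
			return false;
	}
	return true;
}
```

A mask of ff:ff:ff:00:00:00 compares only the first three bytes, the part of the address that identifies the manufacturer.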
Sponsored by: Rubicon Communications, LLC ("Netgate")
Allow packets to be tagged with dummynet information. Note that we do
not apply dummynet shaping on the L2 traffic, but instead mark it for
dummynet processing in the L3 code. This is the same approach as we take
for ALTQ.
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D32222
Avoid the overhead of acquiring a (read) RULES lock when processing the
Ethernet rules.
We can get away with that because when rules are modified they're staged
in V_pf_keth_inactive. We take care to ensure the swap to V_pf_keth is
atomic, so that pf_test_eth_rule() always sees either the old rules, or
the new ruleset.
We need to take care not to delete the old ruleset until we're sure no
pf_test_eth_rule() is still running with those. We accomplish that by
using NET_EPOCH_CALL() to actually free the old rules.
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D31739
This is the kernel side of stateless Ethernet-level filtering for pf.
The primary use case for this is to enable captive portal functionality
to allow/deny access by MAC address, rather than per IP address.
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D31737
if_bridge duplicates broadcast packets with m_copypacket(), which
creates shared packets. In certain circumstances these packets can be
processed by udp_usrreq.c:udp_input() first, which modifies the mbuf as
part of the checksum verification. That may lead to incorrect packets
being transmitted.
Use m_dup() to create independent mbufs instead.
Reported by: Richard Russo <toast@ruka.org>
Reviewed by: donner, afedorov
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D34319
Allow multiple cores to be used to process if_epair traffic. We do this
(if RSS is enabled) based on the RSS hash of the incoming packet. This
allows us to distribute the load over multiple cores, rather than
sending everything to the same one.
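As a simplified illustration only: hash-based distribution amounts to mapping the packet's hash onto one of the available queues, so packets of the same flow stay on the same queue. The function pick_queue and its parameters are hypothetical, and the fallback to queue 0 when no hash is available is an assumption, not something stated here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Map a packet to a queue.  Packets with the same hash always land on
 * the same queue, which keeps a flow in order.
 */
static unsigned
pick_queue(bool has_hash, uint32_t hash, unsigned nqueues)
{
	assert(nqueues > 0);
	if (!has_hash)
		return 0;	/* assumed fallback: a single queue */
	return hash % nqueues;
}
```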
We also switch from swi_sched() to taskqueues, which also contributes to
better throughput.
Benchmark results:
With net.isr.maxthreads=-1
Setup A: (cc0 - bridge0 - epair0a) (epair0b - bridge1 - cc1)
Before 627 Kpps
After (no RSS) 1.198 Mpps
After (RSS) 3.148 Mpps
Setup B: (cc0 - bridge0 - epair0a) (epair0b - vnet jail - epair1a) (epair1b - bridge1 - cc1)
Before 7.705 Kpps
After (no RSS) 1.017 Mpps
After (RSS) 2.083 Mpps
MFC after: 3 weeks
Sponsored by: Orange Business Services
Differential Revision: https://reviews.freebsd.org/D33731
6d4baa0d01 incorrectly rounded the length of the pflog header up to 8
bytes, rather than 4.
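To see why the alignment matters, here is a small sketch of rounding up to a power-of-two boundary. The helper round_up_pow2 is hypothetical, and the 66-byte length used in the note below is an arbitrary example, not the real pflog header size.

```c
#include <assert.h>
#include <stddef.h>

/* Round len up to the next multiple of align (a power of two). */
static size_t
round_up_pow2(size_t len, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (len + align - 1) & ~(align - 1);
}
```

For an example length of 66 bytes, rounding to 4 gives 68 while rounding to 8 gives 72, so a reader that expects 4-byte padding would look for the payload at the wrong offset.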
PR: 261566
Reported by: Guy Harris <gharris@sonic.net>
MFC after: 1 week
Sponsored by: Rubicon Communications, LLC ("Netgate")
There are some error paths in ioctl handlers that will call
pf_krule_free() before the rule's rpool.mtx field is initialized,
causing a panic with INVARIANTS enabled.
Fix the problem by introducing pf_krule_alloc() and initializing the
mutex there. This does mean that the rule->krule and pool->kpool
conversion functions need to stop zeroing the input structure, but I
don't see a nicer way to handle this except perhaps by guarding the
mtx_destroy() with a mtx_initialized() check.
Constify some related functions while here and add a regression test
based on a syzkaller reproducer.
Reported by: syzbot+77cd12872691d219c158@syzkaller.appspotmail.com
Reviewed by: kp
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D34115
netisr uses global workstreams and after dequeueing an mbuf it
uses rcvif to get the VNET of the mbuf. Of course, this is not
needed when the kernel is compiled without VIMAGE. It turned out that
the routing socket does not set rcvif if compiled without VIMAGE.
Make this assignment independent of the VIMAGE option.
Fixes: 6871de9363
Supplement ifindex table with generation count and use it to
serialize & restore an ifnet pointer.
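A userspace sketch of the general technique, assuming nothing about the actual if.c layout: each table slot carries a generation count that is bumped whenever the slot is reassigned, and a saved (index, generation) pair only resolves back to a pointer while the generation still matches. All names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 8

struct ifnet_sketch { int dummy; };	/* stand-in for struct ifnet */

struct index_entry {
	struct ifnet_sketch *ifp;	/* NULL when the slot is free */
	uint16_t gen;			/* bumped on every assignment */
};

/* Serialized form of a pointer: slot index plus generation. */
struct ifnet_ref {
	uint16_t idx;
	uint16_t gen;
};

static struct index_entry table[TABLE_SIZE];

static void
slot_assign(uint16_t idx, struct ifnet_sketch *ifp)
{
	assert(idx < TABLE_SIZE && ifp != NULL);
	table[idx].gen++;
	table[idx].ifp = ifp;
}

static void
slot_release(uint16_t idx)
{
	assert(idx < TABLE_SIZE);
	table[idx].ifp = NULL;
}

/* Serialize: remember which slot and which generation we saw. */
static struct ifnet_ref
ref_save(uint16_t idx)
{
	assert(idx < TABLE_SIZE);
	return (struct ifnet_ref){ .idx = idx, .gen = table[idx].gen };
}

/* Restore: succeed only if the slot was not reassigned in between. */
static struct ifnet_sketch *
ref_restore(struct ifnet_ref ref)
{
	if (ref.idx >= TABLE_SIZE || table[ref.idx].gen != ref.gen)
		return NULL;
	return table[ref.idx].ifp;
}
```

A stale reference therefore resolves to NULL instead of to whatever interface happens to reuse the same index later.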
Reviewed by: kp
Differential revision: https://reviews.freebsd.org/D33266
Fun note: git show e6abef0918
Now that ifindex is static to if.c we can unvirtualize it. For the lifetime
of an ifnet its index never changes. To avoid leaking foreign interfaces
the net.link.generic.system.ifcount sysctl and the ifnet_byindex() KPI
filter their returned value on curvnet. Since if_vmove() no longer
changes the if_index, inline ifindex_alloc() and ifindex_free() into
if_alloc() and if_free() respectively.
API-wise, the only change is that the minimum interface index can now be
greater than 1. Holes in interface indexes were always allowed.
Reviewed by: kp
Differential revision: https://reviews.freebsd.org/D33672
Although send tags are strictly used for transmit, the name might be changed
in the future to be more generic.
The TLS receive tags support regular IPv4 and IPv6 traffic, and also over any
VLAN. If prio-tagging is enabled (VLAN ID zero), the network driver itself
must check for this when creating the TLS RX decryption offload filter.
TLS receive tags have a modify callback to tell the network driver about
the progress of decryption. Currently decryption is done IP packet by IP
packet, even if the IP packet contains a partial TLS record. The modify
callback allows the network driver to keep track of TCP sequence numbers
pointing to the beginning of TLS records after TCP packet reassembly.
These callbacks only happen when encrypted or partially decrypted data is
received and are used to verify the decryption's starting point for the
hardware. Typically the hardware will guess where TLS headers start and
needs help from the software to know if the guess was correct. This is
the purpose of the modify callback.
Differential Revision: https://reviews.freebsd.org/D32356
Discussed with: jhb@
MFC after: 1 week
Sponsored by: NVIDIA Networking
Try to live with a cruel fact of reality: if_vmove doesn't move an
interface from the previous vnet's cloning infrastructure to the new
one. Let's accept this as a design feature and make it work better.
* Delete two blocks of code that would fall back to vnet0 if a
cloner isn't found. They didn't do any good, and the whole
idea of treating vnet0 as special is wrong.
* When deleting a cloned interface, look up its cloner using its
home vnet.
With this change the following simple sequence works correctly:
ifconfig foo0 create
jail -c name=jj persist vnet vnet.interface=foo0
jexec jj ifconfig foo0 destroy
Differential revision: https://reviews.freebsd.org/D33942
* Do a single call into if_clone.c instead of two. The cloner
can't disappear since the interface sits on its list.
* Make restoration smarter - check that a cloner with the same name
exists in the new vnet.
Differential revision: https://reviews.freebsd.org/D33941
Adds a new function pointer to struct if_txrx in order to allow
drivers to set their own function that will determine which queue
a packet should be sent on.
Since this includes a kernel ABI change, bump the __FreeBSD_version
as well.
(The motivation behind this is to allow the driver to examine the
UP in the VLAN tag and determine which queue to TX on based on
that, in support of HW TX traffic shaping.)
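A hedged sketch of the pattern, with hypothetical names throughout (pkt_sketch, txrx_sketch, select_queue and the helpers are not the iflib definitions): the framework calls a driver-supplied function pointer when one is set and otherwise falls back to a default. The driver example reads the user priority from the top three bits of the 802.1Q tag control information.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical packet summary: VLAN TCI plus a flow identifier. */
struct pkt_sketch {
	uint16_t vlan_tci;
	uint32_t flowid;
};

typedef unsigned (*select_queue_fn)(const struct pkt_sketch *, unsigned);

/* Hypothetical ops table; NULL means "use the framework default". */
struct txrx_sketch {
	select_queue_fn select_queue;
};

/* Assumed default policy: spread by flow identifier. */
static unsigned
default_select_queue(const struct pkt_sketch *p, unsigned nqueues)
{
	return p->flowid % nqueues;
}

/* Driver policy: choose the queue from the 3-bit user priority. */
static unsigned
up_select_queue(const struct pkt_sketch *p, unsigned nqueues)
{
	unsigned up = (p->vlan_tci >> 13) & 0x7;

	return up % nqueues;
}

/* Framework side: prefer the driver's function when it is provided. */
static unsigned
tx_queue_for(const struct txrx_sketch *ops, const struct pkt_sketch *p,
    unsigned nqueues)
{
	assert(ops != NULL && p != NULL && nqueues > 0);
	if (ops->select_queue != NULL)
		return ops->select_queue(p, nqueues);
	return default_select_queue(p, nqueues);
}
```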
Signed-off-by: Eric Joyner <erj@FreeBSD.org>
Reviewed by: kbowling@, stallamr@netapp.com
Tested by: jeffrey.e.pieper@intel.com
Sponsored by: Intel Corporation
Differential Revision: https://reviews.freebsd.org/D31485
In iflib_device_register(), the CTX_LOCK is acquired first and then
IFNET_WLOCK is acquired by ether_ifattach(). However, in netmap_hw_reg()
we do the opposite: IFNET_RLOCK is acquired first, and then CTX_LOCK
is acquired by iflib_netmap_register(). Fix this LOR issue by wrapping
the CTX_LOCK/UNLOCK calls in iflib_device_register with an additional
IFNET_WLOCK. This is safe since the IFNET_WLOCK is recursive.
MFC after: 1 month